Jonathan Hurley created AMBARI-18052:
----------------------------------------

             Summary: Starting a Component After Pausing An Upgrade Can Take 9 Minutes
                 Key: AMBARI-18052
                 URL: https://issues.apache.org/jira/browse/AMBARI-18052
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server
    Affects Versions: 2.2.0
            Reporter: Jonathan Hurley
            Assignee: Jonathan Hurley
            Priority: Blocker
             Fix For: 2.4.0


STR:

- Begin an EU on a large cluster (say 900 hosts) with a full stack.
- At finalize, pause the upgrade and then attempt to start a component

Upon clicking Start, the request was accepted but was not scheduled for another 
9 minutes, after which it was scheduled and ran correctly. Subsequent requests 
were responsive.

Request:
{code}
  "href" : "http://perf-a-1:8080/api/v1/clusters/perf/requests/77";,
  "Requests" : {
    "aborted_task_count" : 0,
    "cluster_name" : "perf",
    "completed_task_count" : 1,
    "create_time" : 1468613056719,
    "end_time" : 1468613636151,
    "exclusive" : false,
    "failed_task_count" : 0,
    "id" : 77,
    "inputs" : null,
    "operation_level" : null,
    "progress_percent" : 100.0,
    "queued_task_count" : 0,
    "request_context" : "Start NodeManager",
    "request_schedule" : null,
    "request_status" : "COMPLETED",
    "resource_filters" : [ ],
    "start_time" : 1468613625415,
    "task_count" : 1,
    "timed_out_task_count" : 0,
    "type" : "INTERNAL_REQUEST"
  },
{code}

{noformat}
created: 7/15/2016, 1:04:16 PM
started: 7/15/2016, 1:13:45 PM
finished: 7/15/2016, 1:13:56 PM
{noformat}

The root cause appears to be how an upgrade is paused/resumed. The 
{{UpgradeResourceProvider}} loads the entire request into memory in order to 
iterate over it. In this case, that request contains about 11,000 
{{HostRoleCommandEntity}} instances, each between 2 and 3 MB. At the upper end, 
that means we're trying to load about 33 GB of data into memory.
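
A quick back-of-the-envelope check of that figure, using the numbers above (~11,000 commands at ~3 MB each):

```java
// Sanity check of the memory estimate stated above.
public class RequestMemoryEstimate {
    public static void main(String[] args) {
        long commands = 11_000L;              // HostRoleCommandEntity rows in the request
        long bytesPerCommand = 3_000_000L;    // ~3 MB each, mostly stdout/stderr
        long totalBytes = commands * bytesPerCommand;
        System.out.println(totalBytes / 1_000_000_000L + " GB"); // prints "33 GB"
    }
}
```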

This memory pressure slowly kills threads, including scheduler threads, until 
the JVM can recover and start scheduling things again.
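
A fix along these lines would iterate the request in fixed-size pages so that only one page of commands is resident at a time. A minimal sketch of the pattern, with a pluggable page-fetch function standing in for the real JPA query ({{setFirstResult}}/{{setMaxResults}}); the names here are illustrative, not Ambari's actual API:

```java
import java.util.List;
import java.util.function.BiFunction;

// Page-wise iteration instead of materializing the whole request.
// fetchPage(offset, pageSize) stands in for a paginated JPA query.
public class PagedCommandReader {

    static int processAll(BiFunction<Integer, Integer, List<String>> fetchPage,
                          int pageSize) {
        int processed = 0;
        int offset = 0;
        while (true) {
            List<String> page = fetchPage.apply(offset, pageSize);
            if (page.isEmpty()) {
                break;                 // no more commands in the request
            }
            for (String command : page) {
                processed++;           // inspect/update the command here
            }
            offset += pageSize;        // only one page is resident at a time
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> all = List.of("c1", "c2", "c3", "c4", "c5");
        int n = processAll((off, size) ->
                all.subList(Math.min(off, all.size()),
                            Math.min(off + size, all.size())), 2);
        System.out.println(n); // prints "5"
    }
}
```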

The real question is _why_ each {{HostRoleCommandEntity}} is so large. In many 
cases, the output includes information from HDFS, such as the state of 
SafeMode; these messages capture the entire state of the system in stdout. I 
see two workarounds here:

- Change how {{UpgradeResourceProvider}} handles the pause/resume feature to 
avoid loading the entire request. Maybe pagination, maybe a custom query ... 
but not the entire request all at once.

- Configure the {{HostRoleCommandEntity}} to lazy-load the {{stdout}} and 
{{stderr}} data.
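
For the second option, the large LOB columns could be annotated as lazy. A sketch only: the field and column names below are guesses at the schema, and lazy {{@Basic}} on fields generally requires the JPA provider's weaving/bytecode enhancement to take effect (EclipseLink, which Ambari uses, supports this when weaving is enabled):

```java
import javax.persistence.Basic;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.Lob;

// Sketch: defer loading of the large output columns until they are accessed,
// so iterating commands does not pull multi-MB stdout/stderr blobs.
// Column names are illustrative, not necessarily Ambari's actual schema.
@Entity
public class HostRoleCommandEntity {

    @Id
    private Long taskId;

    @Lob
    @Basic(fetch = FetchType.LAZY)   // not fetched when iterating commands
    @Column(name = "std_out")
    private byte[] stdOut;

    @Lob
    @Basic(fetch = FetchType.LAZY)
    @Column(name = "std_error")
    private byte[] stdError;

    // other columns omitted
}
```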

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
