[ 
https://issues.apache.org/jira/browse/AMBARI-18052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hurley updated AMBARI-18052:
-------------------------------------
    Status: Patch Available  (was: Open)

> Starting a Component After Pausing An Upgrade Can Take 9 Minutes
> ----------------------------------------------------------------
>
>                 Key: AMBARI-18052
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18052
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.2.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Blocker
>             Fix For: 2.4.0
>
>         Attachments: AMBARI-18052.patch
>
>
> STR:
> - Begin an EU on a large cluster (say 900 hosts) with a full stack.
> - At finalize, pause the upgrade and then attempt to start a component.
> After clicking Start, the request was accepted but was not scheduled for 
> about 9 minutes, after which it was scheduled and started correctly. 
> Further requests were responsive.
> Request:
> {code}
>   "href" : "http://perf-a-1:8080/api/v1/clusters/perf/requests/77",
>   "Requests" : {
>     "aborted_task_count" : 0,
>     "cluster_name" : "perf",
>     "completed_task_count" : 1,
>     "create_time" : 1468613056719,
>     "end_time" : 1468613636151,
>     "exclusive" : false,
>     "failed_task_count" : 0,
>     "id" : 77,
>     "inputs" : null,
>     "operation_level" : null,
>     "progress_percent" : 100.0,
>     "queued_task_count" : 0,
>     "request_context" : "Start NodeManager",
>     "request_schedule" : null,
>     "request_status" : "COMPLETED",
>     "resource_filters" : [ ],
>     "start_time" : 1468613625415,
>     "task_count" : 1,
>     "timed_out_task_count" : 0,
>     "type" : "INTERNAL_REQUEST"
>   },
> {code}
> {noformat}
> created: 7/15/2016, 1:04:16 PM
> started: 7/15/2016, 1:13:45 PM
> finished: 7/15/2016, 1:13:56 PM
> {noformat}
> The root cause of this seems to be how an upgrade is paused/resumed. The 
> {{UpgradeResourceProvider}} loads the entire request into memory to iterate 
> over it. In this case, the request contains about 11,000 
> {{HostRoleCommandEntity}} instances, each between 2 and 3MB. That means 
> we're trying to load up to about 33GB (11,000 x 3MB) of data into memory.
> This causes threads to die slowly, including scheduler threads, until the JVM 
> can recover and start scheduling things again. 
> The real question is _why_ each {{HostRoleCommandEntity}} is so large. In 
> many cases, the output includes information from HDFS, such as the state of 
> SafeMode. These messages include the entire state of the system, all of 
> which is captured in stdout. I see two workarounds here:
> - Change how {{UpgradeResourceProvider}} handles the pause/resume feature to 
> avoid loading the entire request. Maybe pagination, maybe a custom query ... 
> but not the entire request all at once.
> - Configure the {{HostRoleCommandEntity}} to lazy-load the {{stdout}} and 
> {{stderr}} data.
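
The first workaround above (iterating without loading the entire request) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch and not Ambari's actual API: `PagedCommandScan`, `Command`, `forEachPaged` and `inMemorySource` are hypothetical names, and the in-memory list stands in for a paged database query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

class PagedCommandScan {

    /** Hypothetical stand-in for HostRoleCommandEntity, without stdout/stderr. */
    static final class Command {
        final long taskId;
        final String status;

        Command(long taskId, String status) {
            this.taskId = taskId;
            this.status = status;
        }
    }

    /**
     * Visits every command by fetching fixed-size pages, so this method holds
     * at most pageSize commands at a time instead of the whole request.
     * fetchPage.apply(offset, limit) must return at most limit rows starting
     * at offset, and an empty list once offset is past the end.
     * Returns the number of commands visited.
     */
    static int forEachPaged(BiFunction<Integer, Integer, List<Command>> fetchPage,
                            int pageSize,
                            Consumer<Command> visitor) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int visited = 0;
        int offset = 0;
        while (true) {
            List<Command> page = fetchPage.apply(offset, pageSize);
            if (page.isEmpty()) {
                break; // nothing left to read
            }
            for (Command command : page) {
                visitor.accept(command);
                visited++;
            }
            if (page.size() < pageSize) {
                break; // a short page means this was the last one
            }
            offset += page.size();
        }
        return visited;
    }

    /** In-memory fake of a paged query (assumption: a real one would hit the database). */
    static BiFunction<Integer, Integer, List<Command>> inMemorySource(List<Command> all) {
        return (offset, limit) -> {
            if (offset >= all.size()) {
                return new ArrayList<>();
            }
            int end = Math.min(all.size(), offset + limit);
            return new ArrayList<>(all.subList(offset, end));
        };
    }
}
```

Offset-based paging like this assumes the set of commands does not change while it is being scanned, which seems plausible for a paused upgrade but would need to be confirmed. It also only helps if each page is released after use and the large stdout/stderr columns are not pulled in with every row.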



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
