[
https://issues.apache.org/jira/browse/AMBARI-18052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Hurley updated AMBARI-18052:
-------------------------------------
Attachment: AMBARI-18052.patch
> Starting a Component After Pausing An Upgrade Can Take 9 Minutes
> ----------------------------------------------------------------
>
> Key: AMBARI-18052
> URL: https://issues.apache.org/jira/browse/AMBARI-18052
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: 2.2.0
> Reporter: Jonathan Hurley
> Assignee: Jonathan Hurley
> Priority: Blocker
> Fix For: 2.4.0
>
> Attachments: AMBARI-18052.patch
>
>
> STR:
> - Begin an EU on a large cluster (say 900 hosts) with a full stack.
> - At finalize, pause the upgrade and then attempt to start a component.
> Upon clicking Start, the request was accepted but was not scheduled for the
> next 9 minutes, after which it was scheduled and started correctly. Further
> requests were responsive.
> Request:
> {code}
> "href" : "http://perf-a-1:8080/api/v1/clusters/perf/requests/77",
> "Requests" : {
> "aborted_task_count" : 0,
> "cluster_name" : "perf",
> "completed_task_count" : 1,
> "create_time" : 1468613056719,
> "end_time" : 1468613636151,
> "exclusive" : false,
> "failed_task_count" : 0,
> "id" : 77,
> "inputs" : null,
> "operation_level" : null,
> "progress_percent" : 100.0,
> "queued_task_count" : 0,
> "request_context" : "Start NodeManager",
> "request_schedule" : null,
> "request_status" : "COMPLETED",
> "resource_filters" : [ ],
> "start_time" : 1468613625415,
> "task_count" : 1,
> "timed_out_task_count" : 0,
> "type" : "INTERNAL_REQUEST"
> },
> {code}
> {noformat}
> created: 7/15/2016, 1:04:16 PM
> started: 7/15/2016, 1:13:45 PM
> finished: 7/15/2016, 1:13:56 PM
> {noformat}
> The root cause of this seems to be how an upgrade is paused/resumed. The
> {{UpgradeResourceProvider}} loads the entire request into memory in order to
> iterate over it. In this case, that request contains about 11,000
> {{HostRoleCommandEntity}} instances, each between 2 and 3 MB. That means we're
> trying to load roughly 22 to 33 GB of data into memory.
> This causes threads to die off slowly, including scheduler threads, until the
> JVM can recover and start scheduling things again.
> The real question is _why_ each {{HostRoleCommandEntity}} is so large. In many
> cases, the output includes information from HDFS, such as the state of Safe
> Mode. These messages capture the entire state of the system in the command's
> stdout. I see two workarounds here:
> - Change how {{UpgradeResourceProvider}} handles the pause/resume feature to
> avoid loading the entire request: maybe pagination, maybe a custom query ...
> but not the entire request all at once.
> - Configure {{HostRoleCommandEntity}} to lazy-load the {{stdout}} and
> {{stderr}} data.
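> The two workarounds could be sketched in JPA terms roughly as follows. This is
> an illustrative assumption, not Ambari's actual schema or code: the entity
> fields, column names, and page size are all hypothetical, and lazy {{@Basic}}
> fetching of LOBs requires provider support (e.g. EclipseLink with weaving
> enabled).

```java
// Hypothetical sketch of both workarounds, assuming a JPA mapping broadly
// similar to Ambari's. Names here are illustrative, not the real schema.
import javax.persistence.Basic;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.Lob;
import javax.persistence.TypedQuery;
import java.util.List;

@Entity
public class HostRoleCommandEntity {

    @Id
    @Column(name = "task_id")
    private Long taskId;

    @Column(name = "request_id")
    private Long requestId;

    // Workaround 2: mark the multi-megabyte LOB columns as lazily fetched so
    // that loading the entity does not pull stdout/stderr from the database
    // until the getters are actually called.
    @Lob
    @Basic(fetch = FetchType.LAZY)
    @Column(name = "std_out")
    private byte[] stdOut;

    @Lob
    @Basic(fetch = FetchType.LAZY)
    @Column(name = "std_error")
    private byte[] stdError;
}

class PausedUpgradeIterator {
    // Workaround 1: page through the request's commands rather than holding
    // all ~11,000 entities in memory at once. The page size of 500 is an
    // arbitrary choice for illustration.
    static void forEachCommand(EntityManager em, long requestId) {
        TypedQuery<HostRoleCommandEntity> query = em.createQuery(
            "SELECT hrc FROM HostRoleCommandEntity hrc"
                + " WHERE hrc.requestId = :id ORDER BY hrc.taskId",
            HostRoleCommandEntity.class);
        query.setParameter("id", requestId);

        final int pageSize = 500;
        for (int offset = 0; ; offset += pageSize) {
            List<HostRoleCommandEntity> page = query
                .setFirstResult(offset)
                .setMaxResults(pageSize)
                .getResultList();
            if (page.isEmpty()) {
                break;
            }
            // Process the page, then detach the entities so they can be
            // garbage collected before the next page is loaded.
            em.clear();
        }
    }
}
```

> Either change alone keeps the resident set proportional to one page (or to the
> non-LOB columns) instead of the full 22-33 GB of captured output.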
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)