Jonathan Hurley created AMBARI-18052:
----------------------------------------
Summary: Starting a Component After Pausing An Upgrade Can Take 9 Minutes
Key: AMBARI-18052
URL: https://issues.apache.org/jira/browse/AMBARI-18052
Project: Ambari
Issue Type: Bug
Components: ambari-server
Affects Versions: 2.2.0
Reporter: Jonathan Hurley
Assignee: Jonathan Hurley
Priority: Blocker
Fix For: 2.4.0
STR:
- Begin an EU on a large cluster (say 900 hosts) with a full stack.
- At finalize, pause the upgrade and then attempt to start a component.
After clicking Start, the request was accepted but was not scheduled for
roughly 9 minutes, after which it was scheduled and started correctly. Further
requests were responsive.
Request:
{code}
  "href" : "http://perf-a-1:8080/api/v1/clusters/perf/requests/77",
  "Requests" : {
    "aborted_task_count" : 0,
    "cluster_name" : "perf",
    "completed_task_count" : 1,
    "create_time" : 1468613056719,
    "end_time" : 1468613636151,
    "exclusive" : false,
    "failed_task_count" : 0,
    "id" : 77,
    "inputs" : null,
    "operation_level" : null,
    "progress_percent" : 100.0,
    "queued_task_count" : 0,
    "request_context" : "Start NodeManager",
    "request_schedule" : null,
    "request_status" : "COMPLETED",
    "resource_filters" : [ ],
    "start_time" : 1468613625415,
    "task_count" : 1,
    "timed_out_task_count" : 0,
    "type" : "INTERNAL_REQUEST"
  },
{code}
{noformat}
created: 7/15/2016, 1:04:16 PM
started: 7/15/2016, 1:13:45 PM
finished: 7/15/2016, 1:13:56 PM
{noformat}
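As a cross-check, the delay can be recomputed from the {{create_time}},
{{start_time}} and {{end_time}} epoch-millisecond fields of the request. A
minimal sketch; the class and method names are illustrative only and are not
part of Ambari:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative helper, not Ambari code.
final class SchedulingDelay {
    // Elapsed time between two epoch-millisecond timestamps.
    static Duration between(long fromMillis, long toMillis) {
        return Duration.between(Instant.ofEpochMilli(fromMillis),
                                Instant.ofEpochMilli(toMillis));
    }

    public static void main(String[] args) {
        long createTime = 1468613056719L; // "create_time" from the request
        long startTime  = 1468613625415L; // "start_time"
        long endTime    = 1468613636151L; // "end_time"

        Duration waiting = between(createTime, startTime);
        Duration running = between(startTime, endTime);

        System.out.println("waiting to be scheduled: " + waiting.toMillis() + " ms");
        System.out.println("running: " + running.toMillis() + " ms");
    }
}
```

The wait works out to 568,696 ms (about 9 minutes 29 seconds) and the run
itself to 10,736 ms, which matches the created/started/finished times above.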
The root cause appears to be how an upgrade is paused/resumed. The
{{UpgradeResourceProvider}} loads the entire request into memory to iterate
over it. In this case, the request contains about 11,000
{{HostRoleCommandEntity}} instances, each between 2 and 3MB. That means we're
trying to load roughly 22 to 33GB of data into memory.
This causes threads to die slowly, including scheduler threads, until the JVM
can recover and start scheduling things again.
The real question is _why_ each {{HostRoleCommandEntity}} is so large. In many
cases, the output includes information from HDFS, such as the state of
SafeMode. These messages include the entire state of the system, all of which
is captured to stdout. I see two workarounds here:
- Change how {{UpgradeResourceProvider}} handles the pause/resume feature to
avoid loading the entire request. Maybe pagination, maybe a custom query ...
but not the entire request all at once.
- Configure the {{HostRoleCommandEntity}} to lazy-load the {{stdout}} and
{{stderr}} data.
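The first option could look roughly like the sketch below: iterate over the
request in fixed-size pages and fetch only a lightweight projection of each
command (no {{stdout}}/{{stderr}}), so that at most one page is held in memory
at a time. Everything here is hypothetical and self-contained: {{PageFetcher}},
{{CommandSummary}} and {{forEachCommand}} are made-up names, and the in-memory
list merely stands in for a real paged query. It is not how
{{UpgradeResourceProvider}} is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not Ambari code.
final class PagedCommandScan {

    // Lightweight projection: only what pause/resume would need, no stdout/stderr.
    static final class CommandSummary {
        final long taskId;
        final String status;

        CommandSummary(long taskId, String status) {
            this.taskId = taskId;
            this.status = status;
        }
    }

    // Assumed data-access hook: returns up to 'limit' rows starting at 'offset',
    // in a stable order (for example by task id).
    interface PageFetcher {
        List<CommandSummary> fetch(int offset, int limit);
    }

    // Visits every command one page at a time and returns the number visited.
    static int forEachCommand(PageFetcher fetcher, int pageSize,
                              Consumer<CommandSummary> visitor) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int offset = 0;
        while (true) {
            List<CommandSummary> page = fetcher.fetch(offset, pageSize);
            page.forEach(visitor);
            offset += page.size();
            if (page.size() < pageSize) {
                return offset; // a short (or empty) page means nothing is left
            }
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for the database table; the status is a placeholder.
        List<CommandSummary> table = new ArrayList<>();
        for (long id = 1; id <= 11_000; id++) {
            table.add(new CommandSummary(id, "COMPLETED"));
        }
        PageFetcher fetcher = (offset, limit) -> table.subList(
                Math.min(offset, table.size()),
                Math.min(offset + limit, table.size()));

        int visited = forEachCommand(fetcher, 500, c -> { /* e.g. inspect c.status */ });
        System.out.println("visited " + visited + " commands");
    }
}
```

The second option (lazy-loading {{stdout}} and {{stderr}}) would be a mapping
change on the entity rather than a change to the iteration, so it is not shown
here.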