[
https://issues.apache.org/jira/browse/AMBARI-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lysnichenko resolved AMBARI-4992.
----------------------------------------
Resolution: Fixed
committed to trunk
> Sometimes cluster installation pauses for few minutes between tasks
> -------------------------------------------------------------------
>
> Key: AMBARI-4992
> URL: https://issues.apache.org/jira/browse/AMBARI-4992
> Project: Ambari
> Issue Type: Improvement
> Components: agent
> Affects Versions: 1.5.0
> Reporter: Vitaliy Semenyk
> Assignee: Dmitry Lysnichenko
>
> h2. The problem
> Primarily affects pluggable (python-based) services.
> During cluster installation, there may be a few significant pauses between
> task executions. During a pause, the previous task shows up as completed in
> the UI, and the next task shows up as not started yet. This effect may be
> noticed 1-3 times when installing an entire cluster, and a single pause
> takes around 3 minutes in some cases.
> Initial analysis shows that this time is consumed by executing service checks
> that have been queued during cluster installation.
> h2. Some background:
> The server issues a big set of EXECUTION_COMMANDs at once, a few times during
> cluster installation. Typically, all commands for one set are sent to the
> agent at once. On the agent, status and execution commands are stored in the
> same queue. While the cluster is being installed, status commands are
> appended to the end of the queue. So when the last command for INSTALL is
> completed, we have a large number of status commands in the queue
> (hundreds?). Executing them may take around 3 minutes. START commands that
> have been issued by the server will not be scheduled for execution until all
> STATUS_COMMANDs in the queue have been performed. In the UI, it looks as if
> the installation has hung.
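A minimal sketch of the queuing behaviour described above, assuming a single FIFO queue shared by both command types. The command dictionaries and the `drain_order` helper are hypothetical illustrations and are not taken from the agent code.

```python
from collections import deque


def drain_order(commands):
    """Return the order in which one shared FIFO queue would run the commands."""
    fifo = deque(commands)
    order = []
    while fifo:
        # Commands are taken strictly in arrival order, whatever their type.
        order.append(fifo.popleft()["commandType"])
    return order


# Hypothetical snapshot: status commands piled up during INSTALL,
# then the server sends a START command.
pending = [{"commandType": "STATUS_COMMAND"} for _ in range(3)]
pending.append({"commandType": "EXECUTION_COMMAND", "roleCommand": "START"})

order = drain_order(pending)
```

With hundreds of queued status commands instead of three, the START command waits behind all of them, which matches the pause seen in the UI.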
> h2. Why it became noticeable with pluggable services:
> It's due to a few factors:
> - python services install faster
> - status commands run a bit slower because we invoke a separate subprocess to
> determine every status, and also perform more IO
> I've attached a relevant log. The interesting part is after the text:
> {code}
> INFO 2013-12-18 13:43:44,163 Heartbeat.py:76 - Sending heartbeat with
> response id: 419 and timestamp: 1387374224161. Command(s) in progress: True.
> Components mapped: True
> {code}
> The Zookeeper start has finished and, after that, only status commands were
> executed for a few minutes (the START task for the next component just
> showed up in the UI as scheduled, but not started yet).
> h2. Selected solution
> I prefer the approach of checking whether the command queue is empty and
> then picking status commands from last_status. It is better because it can
> be done every 2 seconds, whereas status commands are sent by the server only
> every minute. I assume we still do not store duplicate commands in
> last_status.
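The selected approach could be sketched roughly as follows. This is only an illustration under stated assumptions, not the committed change: the `CommandScheduler` class, its method names, and the `componentName` and `roleCommand` keys are hypothetical. Only `last_status` and the idea of running statuses when the command queue is empty come from the description above.

```python
import queue  # Python 3 name; the module is called Queue on Python 2


class CommandScheduler(object):
    """Hypothetical scheduler: execution commands first, statuses when idle."""

    def __init__(self):
        # Holds execution commands only, in arrival order.
        self.command_queue = queue.Queue()
        # Assumed layout: component name -> latest status command.
        self.last_status = {}

    def put_execution(self, command):
        self.command_queue.put(command)

    def put_status(self, command):
        # A newer status command for the same component replaces the
        # older one, so duplicates never accumulate.
        self.last_status[command["componentName"]] = command

    def next_batch(self):
        """Return the commands to run on this tick (e.g. every 2 seconds)."""
        try:
            return [self.command_queue.get_nowait()]
        except queue.Empty:
            # Queue is empty: re-run the stored status commands. They stay
            # in last_status so they can be repeated on later idle ticks.
            return list(self.last_status.values())
```

In this sketch a START command is never delayed by status commands, because statuses are only picked up on ticks where the execution queue is empty.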