[ 
https://issues.apache.org/jira/browse/AMBARI-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13933273#comment-13933273
 ] 

Dmytro Sen commented on AMBARI-4992:
------------------------------------

+1

> Sometimes cluster installation pauses for few minutes between tasks
> -------------------------------------------------------------------
>
>                 Key: AMBARI-4992
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4992
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent
>    Affects Versions: 1.5.0
>            Reporter: Vitaliy Semenyk
>            Assignee: Dmitry Lysnichenko
>
> h2. The problem
> Primarily affects pluggable (python-based) services.
> During cluster installation, there may be a few significant pauses between 
> task executions. During a pause, the previous task shows up as completed in 
> the UI, and the next task shows up as not started yet. This effect may be 
> noticed 1-3 times while installing an entire cluster, and in some cases a 
> single pause takes around 3 minutes. 
> Initial analysis shows that this time is consumed by executing service checks 
> that have been queued during cluster installation. 
> h2. Some background:
> The server issues a big set of EXECUTION_COMMANDs at once, a few times 
> during cluster installation. Typically, all commands of one set are sent to 
> the agent at once. On the agent, status and execution commands are stored in 
> the same queue. While the cluster is being installed, status commands are 
> appended to the end of the queue. So when the last command for INSTALL is 
> completed, we have a large number of status commands in the queue 
> (hundreds?). Executing them may take around 3 minutes. START commands that 
> have been issued by the server will not be scheduled for execution until all 
> STATUS_COMMANDs in the queue have been performed. In the UI, it looks as if 
> the installation has hung.
> h2. Why it became noticeable with pluggable services:
>  It's due to a few factors:
> - python services install faster
> - status commands run a bit slower because we invoke a separate subprocess to 
> determine every status, and also perform more IO
> I've attached a relevant log. The interesting part is after the text 
> {code}
> INFO 2013-12-18 13:43:44,163 Heartbeat.py:76 - Sending heartbeat with 
> response id: 419 and timestamp: 1387374224161. Command(s) in progress: True. 
> Components mapped: True
> {code}
> After the Zookeeper start finished, only status commands were executed for a 
> few minutes (the START task for the next component just showed up as 
> scheduled, but not started yet, in the UI).
> h2. Selected solution
>  I prefer the approach of checking whether the command queue is empty and 
> then picking status commands from last_status. It is better because it can 
> be done every 2 seconds, whereas status commands are sent by the server only 
> every minute. I assume we still do not store duplicate commands in last_status.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
