[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488802#comment-13488802
 ] 

Robert Joseph Evans commented on MAPREDUCE-4749:
------------------------------------------------

If there are lots and lots of events for a job that is localizing, then there 
could be a pause for each of those events, and yes, it would slow the queue 
down, possibly even back to where it was prior to MAPREDUCE-4088, when all 
events would wait for the job to finish localizing. But the common case is 
much faster than the worst case, not that that is much comfort when you hit 
the worst case :). We could mitigate this by dropping the wait time to 
something smaller, like 100ms, so it would take 50 times as many events to 
slow the queue down by the same amount.

I also agree that the tight loop will only happen when *ALL* the events 
currently in the queue are tainted, but I don't agree that this should be 
rare.  I think it is quite common to have a single event in the queue, or for 
all of the events in the queue to belong to a single job that is localizing, 
especially if all of the other jobs on this node are done localizing, so 
their events get processed quickly and removed from the queue. The only time 
the thread would not be running is when the queue is empty.  I have not 
collected any real-world numbers, so I don't know how often that actually 
happens in practice, or what percentage of the running time is spent just 
checking.  If you feel that the extra CPU utilization is worth it, then go 
ahead and remove the wait; I am not opposed to it. I just wanted to point out 
the consequences of doing so.  Also, if we remove the wait, we should look at 
whether we can remove the notify calls from the job as well.  If no one is 
ever going to wait, the notify calls become dead code. 
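To make the trade-off concrete, here is a minimal sketch of the kind of loop 
being discussed, under my own assumptions: events whose job is still 
localizing are "tainted", and when every queued event is tainted the 
dispatcher thread does a timed wait instead of spinning. The names (Event, 
take, jobLocalized) are illustrative, not the actual TaskTracker code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the dispatcher loop under discussion; not the
// real MR1 code. Events for a localizing job are skipped, and when no
// dispatchable event exists the thread backs off for WAIT_MS.
public class TaintedQueueSketch {
    static final long WAIT_MS = 100; // the proposed shorter back-off

    static class Event {
        final String jobId;
        Event(String jobId) { this.jobId = jobId; }
    }

    private final Deque<Event> queue = new ArrayDeque<>();
    private final Set<String> localizing = new HashSet<>();

    synchronized void markLocalizing(String jobId) {
        localizing.add(jobId);
    }

    synchronized void offer(Event e) {
        queue.add(e);
        notifyAll(); // wake the dispatcher: a new event may be dispatchable
    }

    // Called by the job when localization finishes. If the timed wait in
    // take() were removed in favor of a pure spin, this notifyAll() would
    // become dead code, as noted above.
    synchronized void jobLocalized(String jobId) {
        localizing.remove(jobId);
        notifyAll();
    }

    synchronized Event take() throws InterruptedException {
        while (true) {
            // Dispatch the first event whose job is done localizing.
            for (Event e : queue) {
                if (!localizing.contains(e.jobId)) {
                    queue.remove(e);
                    return e;
                }
            }
            // Every event currently present is tainted: wait a bounded
            // time (or until notified) before re-checking, rather than
            // burning CPU in a tight loop.
            wait(WAIT_MS);
        }
    }
}
```

With a 100ms bound the worst case is one short pause per pass over an 
all-tainted queue, instead of blocking until localization completes.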

That being said, I agree with you, Vinod, that having separate queues is a 
better solution overall, but it is also a much larger change, and I am not 
sure it would provide enough additional benefit to justify the risk of such a 
change.
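For comparison, the separate-queues idea could look roughly like the sketch 
below: events for a still-localizing job are parked in that job's own queue, 
so they can never starve other jobs' events and no timed polling is needed. 
This is a hypothetical shape I am assuming, not the actual proposed change.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hedged sketch of a per-job parking-queue design; illustrative only.
public class PerJobQueues {
    private final Map<String, Queue<String>> parked = new HashMap<>();
    private final Queue<String> ready = new ArrayDeque<>();

    // Route the event: parked if its job is still localizing,
    // otherwise straight to the ready queue.
    void offer(String jobId, String event, boolean localizing) {
        if (localizing) {
            parked.computeIfAbsent(jobId, k -> new ArrayDeque<>())
                  .add(event);
        } else {
            ready.add(event);
        }
    }

    // When a job finishes localizing, drain its parked events into the
    // ready queue in order; no wait/notify hand-off is involved.
    void jobLocalized(String jobId) {
        Queue<String> q = parked.remove(jobId);
        if (q != null) {
            ready.addAll(q);
        }
    }

    String poll() {
        return ready.poll();
    }
}
```

The upside is that tainted events never touch the shared queue at all; the 
downside, as said above, is that it is a much larger change to the dispatch 
path.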
                
> Killing multiple attempts of a task takes longer as more attempts are killed
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4749
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4749
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Arpit Gupta
>            Assignee: Arpit Gupta
>         Attachments: MAPREDUCE-4749.branch-1.patch
>
>
> The following was noticed with an MR job running on Hadoop 1.1.0:
> 1. Start an MR job with 1 mapper.
> 2. Wait for a minute.
> 3. Kill the first attempt of the mapper, then kill the other 3 attempts in 
> turn in order to fail the job.
> The time taken to kill each attempt grew exponentially:
> 1st attempt was killed immediately.
> 2nd attempt took a little over a minute.
> 3rd attempt took approx. 20 mins.
> 4th attempt took around 3 hrs.
> The command used to kill the attempts was "hadoop job -fail-task".
> Note that the command returned as soon as the fail request was accepted, 
> but the time until the attempt was actually killed was as stated above.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
