[jira] Commented: (MAPREDUCE-2177) The wait for spill completion should call Condition.awaitNanos(long nanosTimeout)

Chris Douglas (JIRA) Sun, 07 Nov 2010 17:18:47 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929423#action_12929423
 ]


Chris Douglas commented on MAPREDUCE-2177:
------------------------------------------

The collection thread is forced to block because the buffer is full. Returning 
from collect without serializing the emitted record would be an error, as 
would serializing the record over data allocated to the spill. Changing the 
call as you suggest would affect correctness, unless you're arguing that the 
task should fail if the spill takes more than some set amount of time. If the 
task timeout is killing the task, then it's working as designed, and the 
outcome is equivalent to the proposed mechanism.
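
To illustrate the trade-off, here is a minimal Java sketch of a timed wait. It 
is not the actual MapTask code: the class SpillWaitSketch, the spillInProgress 
flag (standing in for the kvstart != kvend check), and the method names are 
all hypothetical. It shows that Condition.awaitNanos still has to sit inside a 
loop on the predicate, so an expired timeout can only be reported as a 
failure; it cannot let the collector proceed while the spill is running.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class SpillWaitSketch {
    private final ReentrantLock spillLock = new ReentrantLock();
    private final Condition spillDone = spillLock.newCondition();

    // Hypothetical stand-in for the "kvstart != kvend" predicate.
    private boolean spillInProgress = true;

    /**
     * Waits until the spill finishes or the time budget runs out.
     * Returns true if the spill finished, false on timeout or interrupt.
     * The buffer is still owned by the spill when this returns false,
     * so the caller must not write into it.
     */
    public boolean awaitSpill(long timeoutNanos) {
        spillLock.lock();
        try {
            long remaining = timeoutNanos;
            // The predicate is re-checked after every wakeup, timed or not.
            while (spillInProgress) {
                if (remaining <= 0L) {
                    return false; // timed out; the spill is still running
                }
                try {
                    // awaitNanos returns an estimate of the time left.
                    remaining = spillDone.awaitNanos(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        } finally {
            spillLock.unlock();
        }
    }

    /** Called by the (hypothetical) spill thread when it is done. */
    public void finishSpill() {
        spillLock.lock();
        try {
            spillInProgress = false;
            spillDone.signalAll();
        } finally {
            spillLock.unlock();
        }
    }
}
```

When awaitSpill returns false, the caller's only safe choices are to keep 
waiting or to fail the task, which is what the task timeout already does.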

There are many reasons the spill could take a long time: running with a 
combiner, using a non-{{RawComparator}}, spilling to a failing/slow disk, etc. 
It's possible you're seeing a race condition that causes the collection thread 
to miss the signal, but in that case the fix would be to repair the locking, 
not to add a timeout to the wait. Can you get a stack trace from a map task 
stuck in this state? If the job is rerun over the same data, do the same tasks 
hang? Do the timeouts occur on particular machines? Does the task succeed on 
later attempts on different machines?
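
For the stack trace, running jstack against the task JVM (or sending it 
SIGQUIT) is the usual route. As an illustration of what such a dump contains, 
here is a small Java sketch that collects the same information from inside a 
JVM using the standard ThreadMXBean API. The class name ThreadDumpSketch and 
the output layout are hypothetical and are not part of Hadoop.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    /** Returns a text dump of every live thread in this JVM. */
    public static String dumpAllThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only request lock details that this JVM can actually provide.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            // For a blocked or waiting thread, name the lock or condition.
            if (info.getLockName() != null) {
                out.append("  waiting on ").append(info.getLockName()).append('\n');
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }
}
```

In a dump from a stuck task, the interesting entries would be the collecting 
thread, expected to be waiting on the spill condition, and the spill thread, 
whose frames show where the time is going.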

> The wait for spill completion should call Condition.awaitNanos(long 
> nanosTimeout)
> ---------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2177
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2177
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.20.2
>            Reporter: Ted Yu
>
> We sometimes saw map tasks time out in cdh3b2. Here is the log from one of 
> the map tasks:
> 2010-11-04 10:34:23,820 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
> 2010-11-04 10:34:23,820 INFO org.apache.hadoop.mapred.MapTask: bufstart = 119534169; bufend = 59763857; bufvoid = 298844160
> 2010-11-04 10:34:23,820 INFO org.apache.hadoop.mapred.MapTask: kvstart = 438913; kvend = 585320; length = 983040
> 2010-11-04 10:34:41,615 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
> 2010-11-04 10:35:45,352 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
> 2010-11-04 10:35:45,547 INFO org.apache.hadoop.mapred.MapTask: bufstart = 59763857; bufend = 298837899; bufvoid = 298844160
> 2010-11-04 10:35:45,547 INFO org.apache.hadoop.mapred.MapTask: kvstart = 585320; kvend = 731585; length = 983040
> 2010-11-04 10:45:41,289 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4
> Note how long the last spill took.
> In MapTask.java, the following code waits for spill to finish:
> while (kvstart != kvend) { reporter.progress(); spillDone.await(); }
> The code in trunk is similar.
> Condition.await() has no timeout mechanism. If the SpillThread takes a long 
> time before calling spillDone.signal(), the task times out.
> Condition.awaitNanos(long nanosTimeout) should be called.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
