[
https://issues.apache.org/jira/browse/MAPREDUCE-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929423#action_12929423
]
Chris Douglas commented on MAPREDUCE-2177:
------------------------------------------
It is forced to block because the buffer is full. Returning from collect
without serializing the emitted record would be an error, as would serializing
the record over data allocated to the spill. Changing the call as you suggest
would affect correctness, unless you're arguing that the task should fail if
the spill takes more than some set amount of time. If the task timeout is
killing the task, then it's working as designed, and equivalently to the
proposed mechanism.
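To make the correctness point concrete, here is a minimal sketch of what a timed wait would have to look like. It is not the MapTask code: the class SpillWaitSketch, the spillInProgress flag, and the method names are hypothetical stand-ins. Only ReentrantLock, Condition.awaitNanos, and Condition.signalAll are real java.util.concurrent APIs. When the timeout expires the buffer is still full, so the caller can only wait again or fail.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the collect/spill handshake; not the MapTask code.
class SpillWaitSketch {
  private final ReentrantLock spillLock = new ReentrantLock();
  private final Condition spillDone = spillLock.newCondition();
  private boolean spillInProgress = true; // guarded by spillLock

  /** Called by the (hypothetical) spill thread once the spilled region is free. */
  void finishSpill() {
    spillLock.lock();
    try {
      spillInProgress = false;
      spillDone.signalAll();
    } finally {
      spillLock.unlock();
    }
  }

  /**
   * Waits for the spill to finish, up to the given timeout.
   * Returns false on timeout; the buffer is still full in that case,
   * so the caller has nowhere to put the record and must fail.
   */
  boolean awaitSpill(long timeout, TimeUnit unit) throws InterruptedException {
    long nanos = unit.toNanos(timeout);
    spillLock.lock();
    try {
      while (spillInProgress) {
        if (nanos <= 0L) {
          return false; // timed out, spill still running
        }
        // awaitNanos returns the remaining time; it may also wake spuriously,
        // so the predicate is re-checked on every iteration.
        nanos = spillDone.awaitNanos(nanos);
      }
      return true;
    } finally {
      spillLock.unlock();
    }
  }

  /** Runs one wait that times out and one that is released by finishSpill(). */
  static boolean[] demo() {
    SpillWaitSketch s = new SpillWaitSketch();
    try {
      boolean first = s.awaitSpill(20, TimeUnit.MILLISECONDS); // nobody signals
      Thread spiller = new Thread(s::finishSpill);
      spiller.start();
      boolean second = s.awaitSpill(5, TimeUnit.SECONDS);
      spiller.join();
      return new boolean[] { first, second };
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

In this sketch demo() yields {false, true}: the first wait times out with the spill still pending, and the second returns once finishSpill() runs. A false return leaves collect with no place to serialize the record, which is the same outcome the task timeout already produces.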
There are many reasons the spill could take a long time: running with a
combiner, using a non-{{RawComparator}}, spilling to a failing/slow disk, etc.
It's possible you're seeing a race condition that causes the collection thread
to miss the signal, but the fix for that would be to repair the locking, not to
add a timeout to the wait. Can you get a stack trace from a map task stuck in this
state? If the job is rerun over the same data, do the same tasks hang? Do the
timeouts occur on particular machines? Does the task succeed on later attempts
on different machines?
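For reference, a missed signal is normally ruled out by the guarded-wait pattern sketched below. This is a generic illustration, not the locking in MapTask: GuardedWaitSketch, the finished flag, and the method names are hypothetical. The state change and the signal happen under the same lock, and the waiter checks the state before every await, so a signal that fires before the waiter arrives is not lost.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of a guarded wait; names are not from MapTask.
class GuardedWaitSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition done = lock.newCondition();
  private boolean finished = false; // guarded by lock

  void markFinished() {
    lock.lock();
    try {
      finished = true;  // state change and signal under the same lock
      done.signalAll();
    } finally {
      lock.unlock();
    }
  }

  void awaitFinished() throws InterruptedException {
    lock.lock();
    try {
      while (!finished) { // predicate is checked before every wait
        done.await();
      }
    } finally {
      lock.unlock();
    }
  }

  /** Signals first, waits second; returns true if the waiter does not hang. */
  static boolean signalBeforeWaitCompletes() {
    GuardedWaitSketch g = new GuardedWaitSketch();
    g.markFinished(); // the signal happens before anyone is waiting
    Thread waiter = new Thread(() -> {
      try {
        g.awaitFinished();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    waiter.setDaemon(true); // do not keep the JVM alive if the waiter hangs
    waiter.start();
    try {
      waiter.join(2000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !waiter.isAlive();
  }
}
```

If the waiter tested the flag outside the lock, or the signaller changed it without holding the lock, the signal could slip between the check and the await. That is a locking bug, and a timeout on the wait would only hide it.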
> The wait for spill completion should call Condition.awaitNanos(long
> nanosTimeout)
> ---------------------------------------------------------------------------------
>
> Key: MAPREDUCE-2177
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2177
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: tasktracker
> Affects Versions: 0.20.2
> Reporter: Ted Yu
>
> We sometimes saw map tasks time out in cdh3b2. Here is the log from one of
> the map tasks:
> 2010-11-04 10:34:23,820 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
> 2010-11-04 10:34:23,820 INFO org.apache.hadoop.mapred.MapTask: bufstart = 119534169; bufend = 59763857; bufvoid = 298844160
> 2010-11-04 10:34:23,820 INFO org.apache.hadoop.mapred.MapTask: kvstart = 438913; kvend = 585320; length = 983040
> 2010-11-04 10:34:41,615 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
> 2010-11-04 10:35:45,352 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
> 2010-11-04 10:35:45,547 INFO org.apache.hadoop.mapred.MapTask: bufstart = 59763857; bufend = 298837899; bufvoid = 298844160
> 2010-11-04 10:35:45,547 INFO org.apache.hadoop.mapred.MapTask: kvstart = 585320; kvend = 731585; length = 983040
> 2010-11-04 10:45:41,289 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4
> Note how long the last spill took: nearly ten minutes (10:35:45 to
> 10:45:41), versus about 18 seconds for spill 3.
> In MapTask.java, the following code waits for the spill to finish:
> while (kvstart != kvend) {
>   reporter.progress();
>   spillDone.await();
> }
> The code in trunk is similar.
> Condition.await() has no timeout. If the SpillThread takes a long time
> before calling spillDone.signal(), the task times out.
> Condition.awaitNanos(long nanosTimeout) should be called instead.