[
https://issues.apache.org/jira/browse/HADOOP-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607499#action_12607499
]
Arun C Murthy commented on HADOOP-3604:
---------------------------------------
Ok, this turned out to be very murky indeed...
It looks like many threads in the JVM are hung in a call to
Mutex::lock_without_safepoint_check. When I ran jstack over the hung reducer
(with -F to force a dump from the unresponsive process and -m for mixed
Java/native frames):
$ jstack -F -m <pid>
I saw:
{noformat}
----------------- 22175 -----------------
0xffffe410 ????????
0x00000002 ????????
0x064e5344 _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57 _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x064fc204 _ZN2os5sleepEP6Threadxb + 0x244
0x063d5d20 JVM_Sleep + 0x320
0xf4d863aa * java.lang.Thread.sleep(long) bci:0 (Interpreted frame)
0xf4d7ef0d * org.apache.hadoop.mapred.Task$1.run() bci:31 line:301 (Interpreted frame)
0xf4d7f3e9 * java.lang.Thread.run() bci:11 line:619 (Interpreted frame)
...
...
----------------- 22178 -----------------
0xffffe410 ????????
0x00000002 ????????
0x064e5344 _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57 _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x065b79f4 _ZN10JavaThread44check_safepoint_and_suspend_for_native_transEPS_ + 0x2b4
0x06398e9c jni_SetByteArrayRegion + 0x7c
0xd034b55b Java_java_net_SocketInputStream_socketRead0 + 0xfb
0xf4d863aa * java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) bci:0 (Interpreted frame)
0xf4d7ede9 * java.net.SocketInputStream.read(byte[], int, int) bci:84 line:129 (Interpreted frame)
0xf4d7ede9 * java.io.BufferedInputStream.fill() bci:175 line:218 (Interpreted frame)
0xf4d7ef0d * java.io.BufferedInputStream.read1(byte[], int, int) bci:44 line:258 (Interpreted frame)
0xf4d7ede9 * java.io.BufferedInputStream.read(byte[], int, int) bci:49 line:317 (Interpreted frame)
0xf4d7ede9 * sun.net.www.http.HttpClient.parseHTTPHeader(sun.net.www.MessageHeader, sun.net.ProgressSource, sun.net.www.protocol.http.HttpURLConnection) bci:51 line:687 (Interpreted frame)
0xf4d7ece1 * sun.net.www.http.HttpClient.parseHTTP(sun.net.www.MessageHeader, sun.net.ProgressSource, sun.net.www.protocol.http.HttpURLConnection) bci:30 line:632 (Interpreted frame)
0xf4d7ece1 * sun.net.www.protocol.http.HttpURLConnection.getInputStream() bci:290 line:1000 (Interpreted frame)
0xf4d7eda7 * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(java.net.URLConnection, int, int) bci:71 line:1230 (Interpreted frame)
0xf4d7eda7 * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation, org.apache.hadoop.fs.Path) bci:54 line:1068 (Interpreted frame)
...
...
----------------- 22182 -----------------
0xffffe410 ????????
0x00000002 ????????
0x064e5344 _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57 _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x06594250 _ZN13ObjectMonitor5enterEP6Thread + 0x2a0
0x0659253a _ZN18ObjectSynchronizer10slow_enterE6HandleP9BasicLockP6Thread + 0x6a
0x06365bde _ZN18InterpreterRuntime12monitorenterEP10JavaThreadP15BasicObjectLock + 0x6e
0xf4d8f263 * org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) bci:13 line:202 (Interpreted frame)
0xf4d7ef0d * org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) bci:14 line:388 (Interpreted frame)
0xf4d7ef0d * org.apache.log4j.Category.log(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) bci:34 line:853 (Interpreted frame)
0xf4d7ef0d * org.apache.commons.logging.impl.Log4JLogger.info(java.lang.Object) bci:21 line:133 (Interpreted frame)
0xf4d7f3e9 * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation, org.apache.hadoop.fs.Path) bci:311 line:1114 (Interpreted frame)
0xf4d7eda7 * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation) bci:181 line:976 (Interpreted frame)
...
...
----------------- 22185 -----------------
0xffffe410 ????????
0x00000002 ????????
0x064e5344 _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57 _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x063639cc _ZN18InterpreterRuntime8newarrayEP10JavaThread9BasicTypei + 0xfc
0xf4d8eb97 * java.lang.AbstractStringBuilder.<init>(int) bci:6 line:45 (Interpreted frame)
0xf4d7ef0d * java.lang.StringBuffer.<init>() bci:3 line:79 (Interpreted frame)
0xf4d7ef0d * java.text.NumberFormat.format(long) bci:6 line:280 (Interpreted frame)
0xf4d7eda7 * org.apache.hadoop.mapred.JobID.toStringWOPrefix() bci:30 line:107 (Interpreted frame)
0xf4d7eda7 * org.apache.hadoop.mapred.TaskID.toStringWOPrefix() bci:13 line:135 (Interpreted frame)
0xf4d7eda7 * org.apache.hadoop.mapred.TaskAttemptID.toStringWOPrefix() bci:13 line:124 (Interpreted frame)
0xf4d7eda7 * org.apache.hadoop.mapred.TaskAttemptID.toString() bci:20 line:118 (Interpreted frame)
0xf4d7eda7 * java.lang.String.valueOf(java.lang.Object) bci:10 line:2827 (Interpreted frame)
0xf4d7eda7 * java.lang.StringBuilder.append(java.lang.Object) bci:2 line:115 (Interpreted frame)
0xf4d7eda7 * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation, org.apache.hadoop.fs.Path) bci:372 line:1119 (Interpreted frame)
0xf4d7eda7 * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation) bci:181 line:976 (Interpreted frame)
0xf4d7ee2b * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run() bci:93 line:925 (Interpreted frame)
...
...
{noformat}
Note that every one of these stacks bottoms out in
SafepointSynchronize::block, i.e. the VM has initiated a safepoint and these
threads are merely parked waiting for it to complete - so the real culprit is
whatever is preventing the safepoint from finishing, and it isn't visible in
the Java frames at all.
Sigh! This could mean that we are running into a JVM bug - maybe I should
upgrade to the latest stable JVM and check.
Oh, the one other point to note is that the native codecs are compiled on a
machine with a 1.6.0 JVM but run on nodes with 1.6.0_06...
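As a quick way to verify that mismatch, a throwaway check like the one below
(illustrative only - the class name is mine, not part of Hadoop) could be run
on the build machine and on a few task nodes to compare the exact JVMs:
{noformat}
// Illustrative sketch: print the standard JVM system properties so the
// build-time and run-time JVM versions can be compared side by side.
public class JvmVersionCheck {
  public static void main(String[] args) {
    System.out.println("java.version    = " + System.getProperty("java.version"));
    System.out.println("java.vm.version = " + System.getProperty("java.vm.version"));
    System.out.println("java.vm.name    = " + System.getProperty("java.vm.name"));
    System.out.println("os.arch         = " + System.getProperty("os.arch"));
  }
}
{noformat}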
> Reduce stuck at shuffling phase
> -------------------------------
>
> Key: HADOOP-3604
> URL: https://issues.apache.org/jira/browse/HADOOP-3604
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.18.0
> Reporter: Runping Qi
> Assignee: Arun C Murthy
> Priority: Blocker
> Fix For: 0.18.0
>
> Attachments: HADOOP-3604_0_20080623.patch
>
>
> I was running gridmix with Hadoop 0.18.
> I set the map output compression to true.
> Most of the jobs completed just fine.
> Three jobs, however, got stuck, each with one reducer stuck at the
> shuffling phase.
> Here is the log:
> 2008-06-20 00:06:01,264 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=SHUFFLE, sessionId=
> 2008-06-20 00:06:01,415 INFO org.apache.hadoop.streaming.PipeMapRed:
> PipeMapRed exec [/bin/cat]
> 2008-06-20 00:06:01,463 INFO org.apache.hadoop.mapred.ReduceTask:
> ShuffleRamManager: MemoryLimit=134217728, MaxSingleShuffleLimit=33554432
> 2008-06-20 00:06:01,474 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded
> the native-hadoop library
> 2008-06-20 00:06:01,475 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory:
> Successfully loaded & initialized native-zlib library
> 2008-06-20 00:06:01,476 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,477 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,477 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,478 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,478 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,486 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,486 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,487 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,487 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,488 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,488 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,489 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,489 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,489 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,493 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,496 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,496 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,496 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,497 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,497 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new decompressor
> 2008-06-20 00:06:01,500 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_200806192318_0450_r_000016_0 Thread started: Thread for merging
> on-disk files
> 2008-06-20 00:06:01,500 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_200806192318_0450_r_000016_0 Thread waiting: Thread for merging
> on-disk files
> 2008-06-20 00:06:01,502 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_200806192318_0450_r_000016_0 Need another 270 map output(s) where 0
> is already in progress
> 2008-06-20 00:06:01,503 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_200806192318_0450_r_000016_0 Thread started: Thread for merging in
> memory files
> 2008-06-20 00:06:01,503 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_200806192318_0450_r_000016_0: Got 0 new map-outputs & number of known
> map outputs is 0
> 2008-06-20 00:06:01,504 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_200806192318_0450_r_000016_0 Scheduled 0 of 0 known outputs (0 slow
> hosts and 0 dup hosts)
> 2008-06-20 00:06:06,654 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_200806192318_0450_r_000016_0: Got 269 new map-outputs & number of
> known map outputs is 269
> 2008-06-20 00:06:06,656 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_200806192318_0450_r_000016_0 Scheduled 229 of 269 known outputs (0
> slow hosts and 40 dup hosts)
> 2008-06-20 00:06:07,163 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling
> 846183 bytes (210104 raw bytes) into RAM-FS from
> attempt_200806192318_0450_m_000089_0
> 2008-06-20 00:06:07,163 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling
> 820890 bytes (204371 raw bytes) into RAM-FS from
> attempt_200806192318_0450_m_000083_0
> 2008-06-20 00:06:07,166 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling
> 835672 bytes (208085 raw bytes) into RAM-FS from
> attempt_200806192318_0450_m_000122_0
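For reference, the map-output compression mentioned in the description is
normally switched on via JobConf; a minimal sketch under the 0.18-era API
(illustrative job setup, not the actual gridmix configuration):
{noformat}
// Minimal sketch: enable compression of intermediate map outputs.
// The codec choice below is illustrative; gridmix's actual setup may differ.
JobConf conf = new JobConf();
conf.setCompressMapOutput(true);  // mapred.compress.map.output=true
conf.setMapOutputCompressorClass(
    org.apache.hadoop.io.compress.DefaultCodec.class);  // native zlib when available
{noformat}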
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.