[ https://issues.apache.org/jira/browse/HADOOP-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607499#action_12607499 ]

Arun C Murthy commented on HADOOP-3604:
---------------------------------------

Ok, this has turned very murky indeed... 

Looks like lots of threads in the JVM are hung in a call to 
Mutex::lock_without_safepoint_check; when I ran jstack over the hung reducer 
with

$ jstack -F -m <pid>

I saw:

{noformat}
----------------- 22175 -----------------
0xffffe410  ????????
0x00000002  ????????
0x064e5344  _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57  _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x064fc204  _ZN2os5sleepEP6Threadxb + 0x244
0x063d5d20  JVM_Sleep + 0x320
0xf4d863aa  * java.lang.Thread.sleep(long) bci:0 (Interpreted frame)
0xf4d7ef0d  * org.apache.hadoop.mapred.Task$1.run() bci:31 line:301 
(Interpreted frame)
0xf4d7f3e9  * java.lang.Thread.run() bci:11 line:619 (Interpreted frame)
...
...


----------------- 22178 -----------------
0xffffe410  ????????
0x00000002  ????????
0x064e5344  _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57  _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x065b79f4  _ZN10JavaThread44check_safepoint_and_suspend_for_native_transEPS_ + 
0x2b4
0x06398e9c  jni_SetByteArrayRegion + 0x7c
0xd034b55b  Java_java_net_SocketInputStream_socketRead0 + 0xfb
0xf4d863aa  * java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, 
byte[], int, int, int) bci:0 (Interpreted frame)
0xf4d7ede9  * java.net.SocketInputStream.read(byte[], int, int) bci:84 line:129 
(Interpreted frame)
0xf4d7ede9  * java.io.BufferedInputStream.fill() bci:175 line:218 (Interpreted 
frame)
0xf4d7ef0d  * java.io.BufferedInputStream.read1(byte[], int, int) bci:44 
line:258 (Interpreted frame)
0xf4d7ede9  * java.io.BufferedInputStream.read(byte[], int, int) bci:49 
line:317 (Interpreted frame)
0xf4d7ede9  * 
sun.net.www.http.HttpClient.parseHTTPHeader(sun.net.www.MessageHeader, 
sun.net.ProgressSource, sun.net.www.protocol.http.HttpURLConnection) bci:51 
line:687 (Interpreted frame)
0xf4d7ece1  * sun.net.www.http.HttpClient.parseHTTP(sun.net.www.MessageHeader, 
sun.net.ProgressSource, sun.net.www.protocol.http.HttpURLConnection) bci:30 
line:632 (Interpreted frame)
0xf4d7ece1  * sun.net.www.protocol.http.HttpURLConnection.getInputStream() 
bci:290 line:1000 (Interpreted frame)
0xf4d7eda7  * 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(java.net.URLConnection,
 int, int) bci:71 line:1230 (Interpreted frame)
0xf4d7eda7  * 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation,
 org.apache.hadoop.fs.Path) bci:54 line:1068 (Interpreted frame)
...
...


----------------- 22182 -----------------
0xffffe410  ????????
0x00000002  ????????
0x064e5344  _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57  _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x06594250  _ZN13ObjectMonitor5enterEP6Thread + 0x2a0
0x0659253a  _ZN18ObjectSynchronizer10slow_enterE6HandleP9BasicLockP6Thread + 
0x6a
0x06365bde  
_ZN18InterpreterRuntime12monitorenterEP10JavaThreadP15BasicObjectLock + 0x6e
0xf4d8f263  * 
org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) 
bci:13 line:202 (Interpreted frame)
0xf4d7ef0d  * org.apache.log4j.Category.forcedLog(java.lang.String, 
org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) bci:14 
line:388 (Interpreted frame)
0xf4d7ef0d  * org.apache.log4j.Category.log(java.lang.String, 
org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) bci:34 
line:853 (Interpreted frame)
0xf4d7ef0d  * 
org.apache.commons.logging.impl.Log4JLogger.info(java.lang.Object) bci:21 
line:133 (Interpreted frame)
0xf4d7f3e9  * 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation,
 org.apache.hadoop.fs.Path) bci:311 line:1114 (Interpreted frame)
0xf4d7eda7  * 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation)
 bci:181 line:976 (Interpreted frame)
...
...


----------------- 22185 -----------------
0xffffe410  ????????
0x00000002  ????????
0x064e5344  _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57  _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x063639cc  _ZN18InterpreterRuntime8newarrayEP10JavaThread9BasicTypei + 0xfc
0xf4d8eb97  * java.lang.AbstractStringBuilder.<init>(int) bci:6 line:45 
(Interpreted frame)
0xf4d7ef0d  * java.lang.StringBuffer.<init>() bci:3 line:79 (Interpreted frame)
0xf4d7ef0d  * java.text.NumberFormat.format(long) bci:6 line:280 (Interpreted 
frame)
0xf4d7eda7  * org.apache.hadoop.mapred.JobID.toStringWOPrefix() bci:30 line:107 
(Interpreted frame)
0xf4d7eda7  * org.apache.hadoop.mapred.TaskID.toStringWOPrefix() bci:13 
line:135 (Interpreted frame)
0xf4d7eda7  * org.apache.hadoop.mapred.TaskAttemptID.toStringWOPrefix() bci:13 
line:124 (Interpreted frame)
0xf4d7eda7  * org.apache.hadoop.mapred.TaskAttemptID.toString() bci:20 line:118 
(Interpreted frame)
0xf4d7eda7  * java.lang.String.valueOf(java.lang.Object) bci:10 line:2827 
(Interpreted frame)
0xf4d7eda7  * java.lang.StringBuilder.append(java.lang.Object) bci:2 line:115 
(Interpreted frame)
0xf4d7eda7  * 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation,
 org.apache.hadoop.fs.Path) bci:372 line:1119 (Interpreted frame)
0xf4d7eda7  * 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation)
 bci:181 line:976 (Interpreted frame)
0xf4d7ee2b  * 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run() bci:93 
line:925 (Interpreted frame)
...
...
{noformat}
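(Side note, in case it helps anyone reading the dump: the mangled C++ frames above can be demangled with c++filt from GNU binutils, which makes the safepoint blockage much easier to see at a glance. A quick sketch:)

```shell
# Demangle the JVM-internal frames from the jstack -m output above
# (assumes c++filt from GNU binutils is on the PATH).
echo _ZN5Mutex28lock_without_safepoint_checkEv | c++filt
# Mutex::lock_without_safepoint_check()
echo _ZN20SafepointSynchronize5blockEP10JavaThread | c++filt
# SafepointSynchronize::block(JavaThread*)
echo _ZN2os5sleepEP6Threadxb | c++filt
# os::sleep(Thread*, long long, bool)
```

So every one of these threads is stuck in SafepointSynchronize::block waiting on the same mutex, i.e. parked at a safepoint that apparently never completes.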


Sigh! This could possibly mean that we are running into a JVM bug - maybe I 
should upgrade to the latest stable JVM and check.

Oh, the one other point to note is that the native codecs are compiled on a 
machine with a 1.6.0 JVM and are run on nodes with 1.6.0_06...

> Reduce stuck at shuffling phase
> -------------------------------
>
>                 Key: HADOOP-3604
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3604
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.0
>            Reporter: Runping Qi
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3604_0_20080623.patch
>
>
> I was running gridmix with Hadoop 0.18.
> I set the map output compression to true.
> Most of the jobs completed just fine.
> Three jobs, however, got stuck.
> Each has one reducer stuck at shuffling phase.
> Here is the log:
> 2008-06-20 00:06:01,264 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
> Initializing JVM Metrics with processName=SHUFFLE, sessionId=
> 2008-06-20 00:06:01,415 INFO org.apache.hadoop.streaming.PipeMapRed: 
> PipeMapRed exec [/bin/cat]
> 2008-06-20 00:06:01,463 INFO org.apache.hadoop.mapred.ReduceTask: 
> ShuffleRamManager: MemoryLimit=134217728, MaxSingleShuffleLimit=33554432
> 2008-06-20 00:06:01,474 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded 
> the native-hadoop library
> 2008-06-20 00:06:01,475 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: 
> Successfully loaded & initialized native-zlib library
> 2008-06-20 00:06:01,476 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,477 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,477 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,478 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,478 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,486 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,486 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,487 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,487 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,488 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,488 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,489 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,489 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,489 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,493 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,496 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,496 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,496 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,497 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,497 INFO org.apache.hadoop.io.compress.CodecPool: Got 
> brand-new decompressor
> 2008-06-20 00:06:01,500 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_200806192318_0450_r_000016_0 Thread started: Thread for merging 
> on-disk files
> 2008-06-20 00:06:01,500 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_200806192318_0450_r_000016_0 Thread waiting: Thread for merging 
> on-disk files
> 2008-06-20 00:06:01,502 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_200806192318_0450_r_000016_0 Need another 270 map output(s) where 0 
> is already in progress
> 2008-06-20 00:06:01,503 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_200806192318_0450_r_000016_0 Thread started: Thread for merging in 
> memory files
> 2008-06-20 00:06:01,503 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_200806192318_0450_r_000016_0: Got 0 new map-outputs & number of known 
> map outputs is 0
> 2008-06-20 00:06:01,504 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_200806192318_0450_r_000016_0 Scheduled 0 of 0 known outputs (0 slow 
> hosts and 0 dup hosts)
> 2008-06-20 00:06:06,654 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_200806192318_0450_r_000016_0: Got 269 new map-outputs & number of 
> known map outputs is 269
> 2008-06-20 00:06:06,656 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_200806192318_0450_r_000016_0 Scheduled 229 of 269 known outputs (0 
> slow hosts and 40 dup hosts)
> 2008-06-20 00:06:07,163 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
> 846183 bytes (210104 raw bytes) into RAM-FS from 
> attempt_200806192318_0450_m_000089_0
> 2008-06-20 00:06:07,163 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
> 820890 bytes (204371 raw bytes) into RAM-FS from 
> attempt_200806192318_0450_m_000083_0
> 2008-06-20 00:06:07,166 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
> 835672 bytes (208085 raw bytes) into RAM-FS from 
> attempt_200806192318_0450_m_000122_0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.