After stopping/restarting DFS, an fsck shows that the file system is healthy, with no under- or over-replicated data and no missing files/blocks.
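For reference, the 4,538 figure quoted below came from counting JavaThread entries in the HotSpot fatal-error log. A minimal sketch of that count (the stand-in file and its two sample lines are hypothetical, so this runs anywhere; the real hs_err_pid4246.log lives on the crashed node):

```shell
# Recreate a tiny stand-in for the HotSpot fatal-error log; the real file
# contains thousands of these "Java Threads" entries.
cat > /tmp/hs_err_sample.log <<'EOF'
0x08a1cc00 JavaThread "PacketResponder 0 for Block blk_-6186975972786687394" daemon [_thread_blocked, id=26769]
0x852e5000 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26768]
EOF

# Count live JavaThread entries; prints 2 for the stand-in file
# (the same grep against the real crash log is where 4,538 came from).
grep -c 'JavaThread' /tmp/hs_err_sample.log
```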
Ted Dunning <[EMAIL PROTECTED]> wrote:

What does hadoop dfs -fsck show you?

On 5/27/08 8:58 AM, "C G" wrote:

> Hi All:
>
> I'm seeing an inability to run one of our applications over a reasonably
> small dataset (~200G input) while running 0.16.4. Previously we were on
> 0.15.0 and the same application ran fine with the same dataset.
>
> A lengthy description follows, including log file output, etc. The failure
> mode smells like a bug in 0.16.4, but I'm not 100% positive about that.
>
> My questions are:
>
> 1. Any known issues upgrading from 0.15.0 to 0.16.4? Our code runs just
> fine over small datasets, but dies on these larger ones. We followed the
> upgrade instructions in the wiki, etc.
>
> 2. Would an upgrade to 0.17.0 help resolve these problems?
>
> 3. Would a re-format/re-load of HDFS help correct these issues? This is
> the thing I hope for the least, in that I have 3T of data on-board HDFS
> and it will take days to dump it all and reload it.
>
> 4. Any other advice or help?
>
> I've been looking at this for the past few days and have been unable to
> make progress solving it. I would hate to have to fall back to 0.15.0
> (see above regarding 3T data reloads, not to mention being stuck on an
> old release). Any help, thoughts, comments, etc., would be very helpful.
> Thanks!
>
> Description:
>
> Following an upgrade from 0.15.0 to 0.16.4 (and after recompiling our
> apps, etc.), a job that used to run correctly on our grid now fails. The
> failure occurs after the map phase is complete, and about 2/3rds of the
> way through the reduce phase.
> The error which gets kicked out from the application perspective is:
>
> 08/05/27 11:30:08 INFO mapred.JobClient: map 100% reduce 89%
> 08/05/27 11:30:41 INFO mapred.JobClient: map 100% reduce 90%
> 08/05/27 11:32:45 INFO mapred.JobClient: map 100% reduce 86%
> 08/05/27 11:32:45 INFO mapred.JobClient: Task Id :
> task_200805271056_0001_r_000007_0, Status : FAILED
> java.io.IOException: Could not get block locations. Aborting...
>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1832)
>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1487)
>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1579)
>
> I then discovered that one or more DataNode instances on the slave nodes
> are down (we run one DataNode instance per machine). The cause of at
> least some of the DataNode failures is a JVM internal error raised in a
> complete out-of-memory scenario (on a 4G, 4-way machine).
>
> Watching the DataNodes run, I can see them consuming more and more
> memory. For those failures for which there is a JVM traceback, I see (in
> part):
>
> #
> # java.lang.OutOfMemoryError: requested 16 bytes for CHeapObj-new. Out of
> # swap space?
> #
> # Internal Error (414C4C4F434154494F4E0E494E4C494E450E4850500017),
> # pid=4246, tid=2283883408
> #
> # Java VM: Java HotSpot(TM) Server VM (1.6.0_02-b05 mixed mode)
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
> --------------- T H R E A D ---------------
> Current thread (0x8a942000): JavaThread "[EMAIL PROTECTED]"
> daemon [_thread_in_Java, id=15064]
> Stack: [0x881c4000,0x88215000), sp=0x882139e0, free space=318k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V [libjvm.so+0x53b707]
> V [libjvm.so+0x225fe1]
> V [libjvm.so+0x16fdc5]
> V [libjvm.so+0x22aef3]
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> v blob 0xf4f235a7
> J java.io.DataInputStream.readInt()I
> j org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(Ljava/io/DataOutputStream;Ljava/io/DataInputStream;Ljava/io/DataOutputStream;Ljava/lang/String;Lorg/apache/hadoop/dfs/DataNode$Throttler;I)V+126
> j org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(Ljava/io/DataInputStream;)V+746
> j org.apache.hadoop.dfs.DataNode$DataXceiver.run()V+174
> j java.lang.Thread.run()V+11
> v ~StubRoutines::call_stub
> --------------- P R O C E S S ---------------
> Java Threads: ( => current thread )
> 0x0ae3f400 JavaThread "process reaper" daemon [_thread_blocked, id=26870]
> 0x852e6000 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_vm, id=26869]
> 0x08a1cc00 JavaThread "PacketResponder 0 for Block blk_-6186975972786687394" daemon [_thread_blocked, id=26769]
> 0x852e5000 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26768]
> 0x0956e000 JavaThread "PacketResponder 0 for Block blk_-2322514873363546651" daemon [_thread_blocked, id=26767]
> 0x852e4400 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26766]
> 0x09d3a400 JavaThread "PacketResponder 0 for Block blk_8926941945313450801" daemon [_thread_blocked, id=26764]
> 0x852e3c00 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26763]
> 0x0953d000 JavaThread "PacketResponder 0 for Block blk_4785883052769066976" daemon [_thread_blocked, id=26762]
> 0xb13a5c00 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26761]
>
> The interesting part here is that if I count the number of JavaThreads
> running org.apache.hadoop.dfs.DataNode, I see 4,538 (!) in the traceback.
> The number of threads was surprising.
>
> Other DataNodes just exit without panicking the JVM. In either failure
> mode, the last few lines of the DataNode log file are apparently
> innocuous:
>
> 2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2
> got response for connect ack from downstream datanode with firstbadlink as
> 2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2
> forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:48,268 INFO org.apache.hadoop.dfs.DataNode: Receiving
> block blk_-2241766430103062484 src: /10.2.14.10:33626 dest: /10.2.14.10:50010
> 2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Receiving
> block blk_313239508245918539 src: /10.2.14.24:37836 dest: /10.2.14.24:50010
> 2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Datanode 0
> forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Receiving
> block blk_1684581399908730353 src: /10.2.14.16:51605 dest: /10.2.14.16:50010
> 2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Datanode 0
> forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:49,509 INFO org.apache.hadoop.dfs.DataNode: Receiving
> block blk_2493969670086107736 src: /10.2.14.18:47557 dest: /10.2.14.18:50010
> 2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1
> got response for connect ack from downstream datanode with firstbadlink as
> 2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1
> forwarding connect ack to upstream firstbadlink is
>
> Finally, the task-level output (in userlogs) doesn't reveal much either:
>
> 2008-05-27 11:38:30,724 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 Need 34 map output(s)
> 2008-05-27 11:38:30,753 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 done copying
> task_200805271056_0001_m_001976_0 output from worker9.
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 Got 33 known map output location(s);
> scheduling...
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 Scheduled 1 of 33 known outputs (0 slow
> hosts and 32 dup hosts)
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 Copying task_200805271056_0001_m_001248_0
> output from worker8.
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 Need 33 map output(s)
> 2008-05-27 11:38:31,752 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 done copying
> task_200805271056_0001_m_001248_0 output from worker8.
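One detail worth calling out from the crash dump above: HotSpot's "Out of swap space?" wording usually means the process exhausted virtual address space rather than physical RAM, and the 32-bit pointer values in the dump (0x8a942000 and friends) show a 32-bit JVM. With thousands of live DataXceiver/PacketResponder threads, the per-thread stack reservations alone are substantial. A back-of-the-envelope sketch (the 512 KiB default -Xss for a 32-bit HotSpot server VM is an assumption here; substitute your configured value):

```shell
# Rough estimate of virtual address space consumed by thread stacks alone,
# at the thread count seen in the crash dump.
THREADS=4538   # JavaThread count from the hs_err traceback
STACK_KB=512   # assumed default -Xss for a 32-bit HotSpot server VM
echo "$(( THREADS * STACK_KB / 1024 )) MB of stack reservations"
# prints: 2269 MB of stack reservations
```

If that arithmetic holds, roughly 2.2 GB of a 32-bit process's usable address space goes to thread stacks before any heap or native buffers, which would explain an OOM on a 4 GB machine. Shrinking the stacks (e.g. a smaller -Xss passed via HADOOP_DATANODE_OPTS in conf/hadoop-env.sh, if your build exposes that hook) only buys headroom; the underlying question is why the 0.16.4 DataNode accumulates thousands of these threads when 0.15.0 did not.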
