After stopping/restarting DFS, an fsck shows that the file system is healthy, with no under- or over-replicated data and no missing files/blocks.
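For reference, the 4,538 figure quoted below came from counting JavaThread entries in the HotSpot fatal-error log. A minimal sketch of that count (the stand-in file and its two sample lines are hypothetical, so this runs anywhere; the real hs_err_pid4246.log lives on the crashed node):

```shell
# Recreate a tiny stand-in for the HotSpot fatal-error log; the real file
# contains thousands of these "Java Threads" entries.
cat > /tmp/hs_err_sample.log <<'EOF'
0x08a1cc00 JavaThread "PacketResponder 0 for Block blk_-6186975972786687394" daemon [_thread_blocked, id=26769]
0x852e5000 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26768]
EOF

# Count live JavaThread entries; prints 2 for the stand-in file
# (the same grep against the real crash log is where 4,538 came from).
grep -c 'JavaThread' /tmp/hs_err_sample.log
```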
Ted Dunning <[EMAIL PROTECTED]> wrote:

What does hadoop dfs -fsck show you?

On 5/27/08 8:58 AM, "C G" wrote:

> Hi All:
>
> I'm seeing an inability to run one of our applications over a reasonably
> small dataset (~200G input) while running 0.16.4. Previously we were on
> 0.15.0 and the same application ran fine with the same dataset.
>
> A lengthy description follows, including log file output, etc. The failure
> mode smells like a bug in 0.16.4, but I'm not 100% positive about that.
>
> My questions are:
>
> 1. Any known issues upgrading from 0.15.0 to 0.16.4? Our code runs just
> fine over small datasets, but dies on these larger ones. We followed the
> upgrade instructions in the wiki, etc.
>
> 2. Would an upgrade to 0.17.0 help resolve these problems?
>
> 3. Would a re-format/re-load of HDFS help correct these issues? This is
> the thing I hope for the least, in that I have 3T of data on-board HDFS
> and it will take days to dump it all and reload it.
>
> 4. Any other advice or help?
>
> I've been looking at this for the past few days and have been unable to
> make progress solving it. I would hate to have to fall back to 0.15.0
> (see above regarding 3T data reloads, not to mention being stuck on an
> old release). Any help, thoughts, comments, etc., would be very helpful.
> Thanks!
>
> Description:
>
> Following an upgrade from 0.15.0 to 0.16.4 (and after recompiling our
> apps, etc.), a job that used to run correctly on our grid now fails. The
> failure occurs after the map phase is complete, and about 2/3rds of the
> way through the reduce phase.
> The error which gets kicked out from the application perspective is:
>
> 08/05/27 11:30:08 INFO mapred.JobClient: map 100% reduce 89%
> 08/05/27 11:30:41 INFO mapred.JobClient: map 100% reduce 90%
> 08/05/27 11:32:45 INFO mapred.JobClient: map 100% reduce 86%
> 08/05/27 11:32:45 INFO mapred.JobClient: Task Id :
> task_200805271056_0001_r_000007_0, Status : FAILED
> java.io.IOException: Could not get block locations. Aborting...
>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1832)
>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1487)
>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1579)
>
> I then discovered that one or more DataNode instances on the slave nodes
> are down (we run one DataNode instance per machine). The cause of at
> least some of the DataNode failures is a JVM internal error raised in a
> complete out-of-memory scenario (on a 4G, 4-way machine).
>
> Watching the DataNodes run, I can see them consuming more and more
> memory. For those failures for which there is a JVM traceback, I see (in
> part):
>
> #
> # java.lang.OutOfMemoryError: requested 16 bytes for CHeapObj-new. Out of
> # swap space?
> #
> # Internal Error (414C4C4F434154494F4E0E494E4C494E450E4850500017),
> # pid=4246, tid=2283883408
> #
> # Java VM: Java HotSpot(TM) Server VM (1.6.0_02-b05 mixed mode)
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
> --------------- T H R E A D ---------------
> Current thread (0x8a942000): JavaThread "[EMAIL PROTECTED]"
> daemon [_thread_in_Java, id=15064]
> Stack: [0x881c4000,0x88215000), sp=0x882139e0, free space=318k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V [libjvm.so+0x53b707]
> V [libjvm.so+0x225fe1]
> V [libjvm.so+0x16fdc5]
> V [libjvm.so+0x22aef3]
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> v blob 0xf4f235a7
> J java.io.DataInputStream.readInt()I
> j org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(Ljava/io/DataOutputStream;Ljava/io/DataInputStream;Ljava/io/DataOutputStream;Ljava/lang/String;Lorg/apache/hadoop/dfs/DataNode$Throttler;I)V+126
> j org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(Ljava/io/DataInputStream;)V+746
> j org.apache.hadoop.dfs.DataNode$DataXceiver.run()V+174
> j java.lang.Thread.run()V+11
> v ~StubRoutines::call_stub
> --------------- P R O C E S S ---------------
> Java Threads: ( => current thread )
> 0x0ae3f400 JavaThread "process reaper" daemon [_thread_blocked, id=26870]
> 0x852e6000 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_vm, id=26869]
> 0x08a1cc00 JavaThread "PacketResponder 0 for Block blk_-6186975972786687394" daemon [_thread_blocked, id=26769]
> 0x852e5000 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26768]
> 0x0956e000 JavaThread "PacketResponder 0 for Block blk_-2322514873363546651" daemon [_thread_blocked, id=26767]
> 0x852e4400 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26766]
> 0x09d3a400 JavaThread "PacketResponder 0 for Block blk_8926941945313450801" daemon [_thread_blocked, id=26764]
> 0x852e3c00 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26763]
> 0x0953d000 JavaThread "PacketResponder 0 for Block blk_4785883052769066976" daemon [_thread_blocked, id=26762]
> 0xb13a5c00 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26761]
>
> The interesting part here is that if I count the number of JavaThreads
> running org.apache.hadoop.dfs.DataNode, I see 4,538 (!) in the traceback.
> The number of threads was surprising.
>
> Other DataNodes just exit without panicking the JVM. In either failure
> mode, the last few lines of the DataNode log file are apparently
> innocuous:
>
> 2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2
> got response for connect ack from downstream datanode with firstbadlink as
> 2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2
> forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:48,268 INFO org.apache.hadoop.dfs.DataNode: Receiving
> block blk_-2241766430103062484 src: /10.2.14.10:33626 dest: /10.2.14.10:50010
> 2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Receiving
> block blk_313239508245918539 src: /10.2.14.24:37836 dest: /10.2.14.24:50010
> 2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Datanode 0
> forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Receiving
> block blk_1684581399908730353 src: /10.2.14.16:51605 dest: /10.2.14.16:50010
> 2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Datanode 0
> forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:49,509 INFO org.apache.hadoop.dfs.DataNode: Receiving
> block blk_2493969670086107736 src: /10.2.14.18:47557 dest: /10.2.14.18:50010
> 2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1
> got response for connect ack from downstream datanode with firstbadlink as
> 2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1
> forwarding connect ack to upstream firstbadlink is
>
> Finally, the task-level output (in userlogs) doesn't reveal much either:
>
> 2008-05-27 11:38:30,724 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 Need 34 map output(s)
> 2008-05-27 11:38:30,753 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 done copying
> task_200805271056_0001_m_001976_0 output from worker9.
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 Got 33 known map output location(s);
> scheduling...
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 Scheduled 1 of 33 known outputs (0 slow
> hosts and 32 dup hosts)
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 Copying task_200805271056_0001_m_001248_0
> output from worker8.
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 Need 33 map output(s)
> 2008-05-27 11:38:31,752 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200805271056_0001_r_000007_1 done copying
> task_200805271056_0001_m_001248_0 output from worker8.
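One detail worth calling out from the crash dump above: HotSpot's "Out of swap space?" wording usually means the process exhausted virtual address space rather than physical RAM, and the 32-bit pointer values in the dump (0x8a942000 and friends) show a 32-bit JVM. With thousands of live DataXceiver/PacketResponder threads, the per-thread stack reservations alone are substantial. A back-of-the-envelope sketch (the 512 KiB default -Xss for a 32-bit HotSpot server VM is an assumption here; substitute your configured value):

```shell
# Rough estimate of virtual address space consumed by thread stacks alone,
# at the thread count seen in the crash dump.
THREADS=4538   # JavaThread count from the hs_err traceback
STACK_KB=512   # assumed default -Xss for a 32-bit HotSpot server VM
echo "$(( THREADS * STACK_KB / 1024 )) MB of stack reservations"
# prints: 2269 MB of stack reservations
```

If that arithmetic holds, roughly 2.2 GB of a 32-bit process's usable address space goes to thread stacks before any heap or native buffers, which would explain an OOM on a 4 GB machine. Shrinking the stacks (e.g. a smaller -Xss passed via HADOOP_DATANODE_OPTS in conf/hadoop-env.sh, if your build exposes that hook) only buys headroom; the underlying question is why the 0.16.4 DataNode accumulates thousands of these threads when 0.15.0 did not.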
