Still more weirdness. When I run my code on a single-node Hadoop installation it runs fine, albeit with terrible performance. When I move to 2 nodes, I see the various error messages I've previously described. The system is also writing extra directories into my output directory when I expect to see only reducer output, and it logs those writes in the jobtracker log as follows:

2007-10-30 16:14:33,264 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200710301552_0002_r_000007_0' to tip tip_200710301552_0002_r_000007, for tracker 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:33,628 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:41,359 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:41,722 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200710301552_0002_r_000007_0' has completed tip_200710301552_0002_r_000007 successfully.
2007-10-30 16:14:41,729 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200710301552_0002_r_000007_0' to hdfs://10.2.11.1:54310/import/raw_logs/20070929

So instead of reducer output alone, there is a combination of reducer output and task directories:

/import/raw_logs/20070929/_task_200710301552_0002_r_000001_0  <dir>   2007-10-30 16:15
/import/raw_logs/20070929/_task_200710301552_0002_r_000002_0  <dir>   2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000003_0  <dir>   2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000004_0  <dir>   2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000006_0  <dir>   2007-10-30 16:14
/import/raw_logs/20070929/part-00000  <r 3>  2323852  2007-10-30 16:13
/import/raw_logs/20070929/part-00001  <r 3>  2271699  2007-10-30 16:13
/import/raw_logs/20070929/part-00003  <r 3>  2358158  2007-10-30 16:14
/import/raw_logs/20070929/part-00004  <r 3>  2341299  2007-10-30 16:14
/import/raw_logs/20070929/part-00005  <r 3>  2265270  2007-10-30 16:14
/import/raw_logs/20070929/part-00006  <r 3>  2305368  2007-10-30 16:14
/import/raw_logs/20070929/part-00007  <r 3>  2292513  2007-10-30 16:14
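(As an aside, if it's useful, a rough sketch of pulling just the part files out of the output directory programmatically would look something like the code below. ListPartFiles is only a placeholder name, and the sketch assumes FileSystem.listStatus(Path) is available in the release you're on; on 0.14 the listing call may be spelled differently.)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListPartFiles {
    public static void main(String[] args) throws IOException {
        Path outputDir = new Path(args[0]);   // e.g. /import/raw_logs/20070929
        Configuration conf = new Configuration();
        FileSystem fs = outputDir.getFileSystem(conf);

        // Reducer output files are named part-00000, part-00001, ...;
        // anything whose name starts with "_" (like the _task_* directories
        // above) is skipped.
        for (FileStatus stat : fs.listStatus(outputDir)) {
            String name = stat.getPath().getName();
            if (!name.startsWith("_")) {
                System.out.println(stat.getPath() + "\t" + stat.getLen());
            }
        }
    }
}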
Those _task directories contain output that appears, at least in file size, to be identical to the reducer output:

bin/hadoop dfs -ls /import/raw_logs/20070929/_task_200710301552_0002_r_000001_0
Found 1 items
/import/raw_logs/20070929/_task_200710301552_0002_r_000001_0/part-00001  <r 3>  2271699  2007-10-30 16:15

Has anybody seen this behavior before? It happens for every map/reduce program I run - we have several. All the code is basic stuff - nothing exotic happening at all. Any advice much appreciated...

Thanks,
C G

C G <[EMAIL PROTECTED]> wrote:

Each node runs 1 datanode instance. I did notice this on "node4" trying to send to "node2":

2007-10-30 14:35:16,876 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_7852052048957161201 to 10.2.11.2:50010 got java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
        at java.lang.Thread.run(Thread.java:619)

and then on node2:

2007-10-30 14:35:16,918 INFO org.apache.hadoop.dfs.DataNode: Received block blk_7852052048957161201 from /10.2.11.3
2007-10-30 14:35:16,919 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_7852052048957161201 is valid, and cannot be written to.
        at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:515)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:822)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
        at java.lang.Thread.run(Thread.java:619)

Ted Dunning wrote:

Is it possible you have some nodes running multiple instances of datanodes?

On 10/30/07 11:06 AM, "C G" wrote:

> Hi All:
>
> Environment: 4-node grid running hadoop-0.14.1.
>
> With the system shut down I wiped out the old HDFS directory structure and created an empty directory. Did a namenode format, and then brought up the system with start-all.sh.
>
> I then set up a directory structure and ran the first job. The job runs 100% of the map tasks, completes ~87% of the reduce tasks, and then hangs. There are no user-level error messages. All systems go idle.
>
> I started looking at the Hadoop logs; the first strange message is from the namenode log:
>
> 2007-10-30 13:48:01,991 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
> 2007-10-30 13:48:01,992 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310, call create(/import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001, DFSClient_task_200710301345_0001_r_000001_0, true, 3, 67108864) from 10.2.11.4:34016: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
> org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
>         at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:788)
>         at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:725)
>         at org.apache.hadoop.dfs.NameNode.create(NameNode.java:307)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)
>
> The second strange error message comes from the jobtracker log:
>
> 2007-10-30 13:59:26,190 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
>
> I'm curious about how to proceed. I suspect that my code is OK, as I've run it numerous times in both single-node and multi-node environments, and I've never seen these error messages before.
>
> Any help much appreciated....
>
> Thanks,
> C G
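P.S. To give a sense of how basic the jobs are, a stripped-down stand-in for the kind of driver we run would look something like the sketch below. SampleJob is a throwaway name and the identity mapper/reducer stand in for our real classes; it also assumes the 0.14-era JobConf.setInputPath/setOutputPath calls and the mapred.speculative.execution property, so treat it as illustrative rather than our exact code.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SampleJob {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(SampleJob.class);
        conf.setJobName("sample-job");

        // Plain text keys and values; nothing exotic.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // Stand-ins for our real map/reduce classes.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));   // e.g. /import/raw_logs/20070929

        // One experiment I'm considering: disabling speculative execution so only
        // a single attempt of each reduce writes its output file.
        // conf.setBoolean("mapred.speculative.execution", false);

        JobClient.runJob(conf);
    }
}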
