Still more weirdness. When I run my code on a single-node Hadoop installation it runs fine, albeit with terrible performance. When I move to 2 nodes, I see the various error messages I've previously described. The system is also writing extra directories into my output directory when I expect to see only reducer output, and it logs those writes in the jobtracker log as follows:

2007-10-30 16:14:33,264 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200710301552_0002_r_000007_0' to tip tip_200710301552_0002_r_000007, for tracker 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:33,628 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:41,359 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:41,722 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200710301552_0002_r_000007_0' has completed tip_200710301552_0002_r_000007 successfully.
2007-10-30 16:14:41,729 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200710301552_0002_r_000007_0' to hdfs://10.2.11.1:54310/import/raw_logs/20070929

So instead of reducer output alone, there is a combination of reducer output and task directories:

/import/raw_logs/20070929/_task_200710301552_0002_r_000001_0  <dir>   2007-10-30 16:15
/import/raw_logs/20070929/_task_200710301552_0002_r_000002_0  <dir>   2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000003_0  <dir>   2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000004_0  <dir>   2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000006_0  <dir>   2007-10-30 16:14
/import/raw_logs/20070929/part-00000  <r 3>  2323852  2007-10-30 16:13
/import/raw_logs/20070929/part-00001  <r 3>  2271699  2007-10-30 16:13
/import/raw_logs/20070929/part-00003  <r 3>  2358158  2007-10-30 16:14
/import/raw_logs/20070929/part-00004  <r 3>  2341299  2007-10-30 16:14
/import/raw_logs/20070929/part-00005  <r 3>  2265270  2007-10-30 16:14
/import/raw_logs/20070929/part-00006  <r 3>  2305368  2007-10-30 16:14
/import/raw_logs/20070929/part-00007  <r 3>  2292513  2007-10-30 16:14
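(As an aside, if it's useful, a rough sketch of pulling just the part files out of the output directory programmatically would look something like the code below. ListPartFiles is only a placeholder name, and the sketch assumes FileSystem.listStatus(Path) is available in the release you're on; on 0.14 the listing call may be spelled differently.)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListPartFiles {
    public static void main(String[] args) throws IOException {
        Path outputDir = new Path(args[0]);   // e.g. /import/raw_logs/20070929
        Configuration conf = new Configuration();
        FileSystem fs = outputDir.getFileSystem(conf);

        // Reducer output files are named part-00000, part-00001, ...;
        // anything whose name starts with "_" (like the _task_* directories
        // above) is skipped.
        for (FileStatus stat : fs.listStatus(outputDir)) {
            String name = stat.getPath().getName();
            if (!name.startsWith("_")) {
                System.out.println(stat.getPath() + "\t" + stat.getLen());
            }
        }
    }
}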
Those _task directories contain output that appears, at least in file size, to be identical to the reducer output:

bin/hadoop dfs -ls /import/raw_logs/20070929/_task_200710301552_0002_r_000001_0
Found 1 items
/import/raw_logs/20070929/_task_200710301552_0002_r_000001_0/part-00001  <r 3>  2271699  2007-10-30 16:15

Has anybody seen this behavior before? It happens for every map/reduce program I run - we have several. All the code is basic stuff - nothing exotic happening at all. Any advice much appreciated...

Thanks,
C G

C G <[EMAIL PROTECTED]> wrote:

Each node runs 1 datanode instance. I did notice this on "node4" trying to send to "node2":

2007-10-30 14:35:16,876 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_7852052048957161201 to 10.2.11.2:50010 got java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
        at java.lang.Thread.run(Thread.java:619)

and then on node2:

2007-10-30 14:35:16,918 INFO org.apache.hadoop.dfs.DataNode: Received block blk_7852052048957161201 from /10.2.11.3
2007-10-30 14:35:16,919 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_7852052048957161201 is valid, and cannot be written to.
        at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:515)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:822)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
        at java.lang.Thread.run(Thread.java:619)

Ted Dunning wrote:

Is it possible you have some nodes running multiple instances of datanodes?

On 10/30/07 11:06 AM, "C G" wrote:

> Hi All:
>
> Environment: 4-node grid running hadoop-0.14.1.
>
> With the system shut down I wiped out the old HDFS directory structure and created an empty directory. Did a namenode format, and then brought up the system with start-all.sh.
>
> I then set up a directory structure and ran the first job. The job runs 100% of the map tasks, completes ~87% of the reduce tasks, and then hangs. There are no user-level error messages. All systems go idle.
>
> I started looking at the Hadoop logs; the first strange message is from the namenode log:
>
> 2007-10-30 13:48:01,991 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
> 2007-10-30 13:48:01,992 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310, call create(/import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001, DFSClient_task_200710301345_0001_r_000001_0, true, 3, 67108864) from 10.2.11.4:34016: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
> org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
>         at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:788)
>         at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:725)
>         at org.apache.hadoop.dfs.NameNode.create(NameNode.java:307)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)
>
> The second strange error message comes from the jobtracker log:
>
> 2007-10-30 13:59:26,190 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
>
> I'm curious about how to proceed. I suspect that my code is OK, as I've run it numerous times in both single-node and multi-node environments, and I've never seen these error messages before.
>
> Any help much appreciated....
>
> Thanks,
> C G
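P.S. To give a sense of how basic the jobs are, a stripped-down stand-in for the kind of driver we run would look something like the sketch below. SampleJob is a throwaway name and the identity mapper/reducer stand in for our real classes; it also assumes the 0.14-era JobConf.setInputPath/setOutputPath calls and the mapred.speculative.execution property, so treat it as illustrative rather than our exact code.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SampleJob {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(SampleJob.class);
        conf.setJobName("sample-job");

        // Plain text keys and values; nothing exotic.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // Stand-ins for our real map/reduce classes.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));   // e.g. /import/raw_logs/20070929

        // One experiment I'm considering: disabling speculative execution so only
        // a single attempt of each reduce writes its output file.
        // conf.setBoolean("mapred.speculative.execution", false);

        JobClient.runJob(conf);
    }
}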
