In my cluster, node04 is the NameNode and the other nodes are DataNodes.

- If successful, it looks like:

d...@node06:~/v-0.18.0$ hadoop-0.18.0/bin/hadoop jar testDFSIO.jar -write -fileSize 10000 -nrFiles 1
TestFDSIO.0.0.4
09/01/25 21:37:49 INFO mapred.FileInputFormat: nrFiles = 1
09/01/25 21:37:49 INFO mapred.FileInputFormat: fileSize (MB) = 10000
09/01/25 21:37:49 INFO mapred.FileInputFormat: bufferSize = 1000000
09/01/25 21:37:50 INFO mapred.FileInputFormat: creating control file: 10000 mega bytes, 1 files
09/01/25 21:37:50 INFO mapred.FileInputFormat: created control files for: 1 files
09/01/25 21:37:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/01/25 21:37:50 INFO mapred.FileInputFormat: Total input paths to process : 1
09/01/25 21:37:50 INFO mapred.FileInputFormat: Total input paths to process : 1
09/01/25 21:37:51 INFO mapred.JobClient: Running job: job_200901252026_0003
09/01/25 21:37:52 INFO mapred.JobClient:  map 0% reduce 0%
09/01/25 21:43:04 INFO mapred.JobClient:  map 100% reduce 0%
09/01/25 21:43:16 INFO mapred.JobClient: Job complete: job_200901252026_0003
09/01/25 21:43:16 INFO mapred.JobClient: Counters: 16
09/01/25 21:43:16 INFO mapred.JobClient:   File Systems
09/01/25 21:43:16 INFO mapred.JobClient:     HDFS bytes read=113
09/01/25 21:43:16 INFO mapred.JobClient:     HDFS bytes written=10485760079
09/01/25 21:43:16 INFO mapred.JobClient:     Local bytes read=113
09/01/25 21:43:16 INFO mapred.JobClient:     Local bytes written=262
09/01/25 21:43:16 INFO mapred.JobClient:   Job Counters 
09/01/25 21:43:16 INFO mapred.JobClient:     Launched reduce tasks=1
09/01/25 21:43:16 INFO mapred.JobClient:     Rack-local map tasks=1
09/01/25 21:43:16 INFO mapred.JobClient:     Launched map tasks=1
09/01/25 21:43:16 INFO mapred.JobClient:   Map-Reduce Framework
09/01/25 21:43:16 INFO mapred.JobClient:     Reduce input groups=5
09/01/25 21:43:16 INFO mapred.JobClient:     Combine output records=10
09/01/25 21:43:16 INFO mapred.JobClient:     Map input records=1
09/01/25 21:43:16 INFO mapred.JobClient:     Reduce output records=5
09/01/25 21:43:16 INFO mapred.JobClient:     Map output bytes=89
09/01/25 21:43:16 INFO mapred.JobClient:     Map input bytes=27
09/01/25 21:43:16 INFO mapred.JobClient:     Combine input records=10
09/01/25 21:43:16 INFO mapred.JobClient:     Map output records=5
09/01/25 21:43:16 INFO mapred.JobClient:     Reduce input records=5
09/01/25 21:43:16 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
09/01/25 21:43:16 INFO mapred.FileInputFormat:            Date & time: Sun Jan 25 21:43:16 CET 2009
09/01/25 21:43:16 INFO mapred.FileInputFormat:        Number of files: 1
09/01/25 21:43:16 INFO mapred.FileInputFormat: Total MBytes processed: 10000
09/01/25 21:43:16 INFO mapred.FileInputFormat:      Throughput mb/sec: 32.34801286156991
09/01/25 21:43:16 INFO mapred.FileInputFormat: Average IO rate mb/sec: 32.3480110168457
09/01/25 21:43:16 INFO mapred.FileInputFormat:  IO rate std deviation: 0.004232947670390486
09/01/25 21:43:16 INFO mapred.FileInputFormat:     Test exec time sec: 326.033
09/01/25 21:43:16 INFO mapred.FileInputFormat: 
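As a rough cross-check of the summary above (my own arithmetic, not TestDFSIO's exact internals): inverting the reported throughput gives the time spent purely writing, which comes out close to, but below, the total test execution time, as expected once job setup and task scheduling overhead are included.

```python
# Rough cross-check of the reported numbers (not TestDFSIO's exact formulas)
total_mb = 10000.0
throughput_mb_s = 32.34801286156991   # "Throughput mb/sec" from the log above
write_time_s = total_mb / throughput_mb_s
print(round(write_time_s, 1))          # ~309.1 s of pure writing vs 326.033 s total exec time
```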

+ The blocks (at the default block size of 64 MB) are spread over the nodes as follows:

node04  0 
node05  4
node06  4
node07  8
node08  3
node09  137

-> So for 10,000 MB we get 156 blocks, and they are spread across all the
data nodes! The script that produced this count is:

for i in `seq 4 9`; do
  # print the node name locally, then count the 64 MB block files
  # under the HDFS data directory on that node
  echo -n "node0$i  "
  ssh node0$i 'ls -lah `find /tmp/hadoop-dinh -type f` | grep -c 64M'
done
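As a sanity check on the count of 156 (my own arithmetic): a 10,000 MB file at a 64 MB block size yields 156 full blocks plus one 16 MB partial block, and since the script greps for "64M" it counts only the full ones.

```python
# Sanity check: blocks in a 10,000 MB file with 64 MB HDFS blocks
file_mb, block_mb = 10000, 64
full_blocks = file_mb // block_mb    # full 64 MB blocks: what the grep for "64M" counts
tail_mb = file_mb % block_mb         # size of the final partial block
print(full_blocks, tail_mb)          # 156 16
```

This matches the per-node counts above, which sum to 156.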


- If unsuccessful, it looks like:

d...@node06:~/v-0.18.0$ hadoop-0.18.0/bin/hadoop jar testDFSIO.jar -write -fileSize 10000 -nrFiles 1
TestFDSIO.0.0.4
09/01/25 22:28:21 INFO mapred.FileInputFormat: nrFiles = 1
09/01/25 22:28:21 INFO mapred.FileInputFormat: fileSize (MB) = 10000
09/01/25 22:28:21 INFO mapred.FileInputFormat: bufferSize = 1000000
09/01/25 22:28:22 INFO mapred.FileInputFormat: creating control file: 10000 mega bytes, 1 files
09/01/25 22:28:22 INFO mapred.FileInputFormat: created control files for: 1 files
09/01/25 22:28:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/01/25 22:28:22 INFO mapred.FileInputFormat: Total input paths to process : 1
09/01/25 22:28:22 INFO mapred.FileInputFormat: Total input paths to process : 1
09/01/25 22:28:23 INFO mapred.JobClient: Running job: job_200901252228_0001
09/01/25 22:28:24 INFO mapred.JobClient:  map 0% reduce 0%
###################### TaskAttemptID ######################
java.io.IOException: All datanodes 10.0.0.9:50010 are bad. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

attempt_200901252228_0001_m_000000_0: ###################### TaskAttemptID ######################
...
attempt_200901252228_0001_m_000000_0: ###################### TaskAttemptID ######################
attempt_200901252228_0001_m_000000_0: Exception closing file /benchmarks/TestDFSIO/io_data/test_io_0
attempt_200901252228_0001_m_000000_0: java.io.IOException: All datanodes 10.0.0.9:50010 are bad. Aborting...
attempt_200901252228_0001_m_000000_0:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
attempt_200901252228_0001_m_000000_0:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
attempt_200901252228_0001_m_000000_0:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

09/01/25 22:39:17 INFO mapred.JobClient: Task Id : attempt_200901252228_0001_m_000000_1, Status : FAILED
###################### TaskAttemptID ######################
java.io.IOException: All datanodes 10.0.0.6:50010 are bad. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

attempt_200901252228_0001_m_000000_1: ###################### TaskAttemptID ######################
...
attempt_200901252228_0001_m_000000_1: ###################### TaskAttemptID ######################
attempt_200901252228_0001_m_000000_1: Exception closing file /benchmarks/TestDFSIO/io_data/test_io_0
attempt_200901252228_0001_m_000000_1: java.io.IOException: All datanodes 10.0.0.6:50010 are bad. Aborting...
attempt_200901252228_0001_m_000000_1:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
attempt_200901252228_0001_m_000000_1:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
attempt_200901252228_0001_m_000000_1:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

09/01/25 22:46:13 INFO mapred.JobClient: Task Id : attempt_200901252228_0001_m_000000_2, Status : FAILED
###################### TaskAttemptID ######################
java.io.IOException: All datanodes 10.0.0.8:50010 are bad. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

attempt_200901252228_0001_m_000000_2: ###################### TaskAttemptID ######################
...
attempt_200901252228_0001_m_000000_2: ###################### TaskAttemptID ######################
attempt_200901252228_0001_m_000000_2: Exception closing file /benchmarks/TestDFSIO/io_data/test_io_0
attempt_200901252228_0001_m_000000_2: java.io.IOException: All datanodes 10.0.0.8:50010 are bad. Aborting...
attempt_200901252228_0001_m_000000_2:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
attempt_200901252228_0001_m_000000_2:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
attempt_200901252228_0001_m_000000_2:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1118)
        at org.apache.hadoop.fs.TestDFSIO.runIOTest(TestDFSIO.java:247)
        at org.apache.hadoop.fs.TestDFSIO.writeTest(TestDFSIO.java:219)
        at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:450)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

-> So Hadoop tried to write the file first via node09, then node06, then
node08, and the job failed!

The block distribution across the nodes after the failed run is:

node04  0
node05  97
node06  0
node07  0
node08  0
node09  0
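For comparison, here is the same arithmetic over both tables (my own calculation, assuming each counted file is a full 64 MB block): the failed run put all of its blocks on a single node and covered only about 6.2 of the 10 GB before aborting.

```python
# Per-node 64 MB block counts from the two runs above
ok_run  = {"node04": 0, "node05": 4, "node06": 4, "node07": 8, "node08": 3, "node09": 137}
bad_run = {"node04": 0, "node05": 97, "node06": 0, "node07": 0, "node08": 0, "node09": 0}
print(sum(ok_run.values()))          # 156 blocks ~ the whole 10,000 MB file
print(sum(bad_run.values()) * 64)    # 6208 MB written, all of it on node05, before the abort
```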

What I don't understand is that writing a huge file sometimes works and
sometimes does not.

Thanks for the link; I had read it (and the source code) before, but I
still don't get it.


-- 
View this message in context: 
http://www.nabble.com/Job-failed-when-writing-a-huge-file-tp21647888p21658995.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.