Hello,
I am currently running nutch 0.9 on 4 nodes; one node is the
master, in addition to being a slave.
I am running the nutch crawl command.
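For reference, the invocation looks roughly like this (the urls
directory, depth, and topN values here are placeholders for what we
actually use; the crawl directory matches the paths in the logs below):
-----
bin/nutch crawl urls -dir /var/nutch/crawl -depth 10 -topN 1000
-----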
Everything runs fine until it reaches the dedup step. The output
from the command is as follows:
-----
Dedup: starting
Dedup: adding indexes in: /var/nutch/crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at
org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
-----
Can anyone please point me in the right direction to get this
working? I have excerpts of the relevant logs below. I have spent
hours reading posts on these errors, where I could find any; many
of the posts suggest some of these messages are innocuous, since
they are logged at the WARN level.
I did turn on log4j DEBUG logging for the dedup process to see if
I could find anything else amiss, but I was unable to determine
the cause of the problem.
Everything worked fine when we ran on a single machine, with
everything set to local and no distributed filesystem.
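For what it's worth, the settings we changed in hadoop-site.xml when
moving off local mode are roughly the following (the jobtracker port
here is a placeholder; the namenode address matches what the job
client reports in the logs below):
-----
<property>
  <name>fs.default.name</name>
  <value>sunset2:50000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>sunset2:50020</value>
</property>
-----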
Thank you in advance for any assistance or pointers you can
provide.
The namenode log on the master has the following errors, which
occurred at approximately the same time:
-----
2008-01-10 18:28:03,358 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/var/nutch/crawl/indexes/part-00012 because it does not exist
2008-01-10 18:28:07,145 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/var/nutch/crawl/indexes/part-00011 because it does not exist
2008-01-10 18:28:10,562 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/var/nutch/crawl/indexes/part-00015 because it does not exist
2008-01-10 18:28:12,616 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/var/nutch/crawl/indexes/part-00013 because it does not exist
2008-01-10 18:28:13,955 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove
/var/nutch/crawl/indexes/part-00014 because it does not exist
2008-01-10 18:28:16,526 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove /var/mapred/system/job_0018
because it does not exist
2008-01-10 18:28:22,028 WARN fs.FSNamesystem - Not able to place enough
replicas, still in need of 1
2008-01-10 18:28:22,114 WARN fs.FSNamesystem - Not able to place enough
replicas, still in need of 1
2008-01-10 18:28:22,207 WARN fs.FSNamesystem - Not able to place enough
replicas, still in need of 1
2008-01-10 18:29:16,724 WARN dfs.StateChange - DIR*
FSDirectory.unprotectedDelete: failed to remove /var/mapred/system/job_0019
because it does not exist
-----
The datanode log on the master has the following errors, which
occurred at approximately the same time:
-----
2008-01-10 18:28:29,742 WARN dfs.DataNode - Failed to transfer
blk_-2596562194274011404 to /76.250.98.171:50010
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1020)
at java.lang.Thread.run(Thread.java:619)
2008-01-10 18:28:31,412 WARN dfs.DataNode - Failed to transfer
blk_-2596562194274011404 to /76.250.98.171:50010
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1020)
at java.lang.Thread.run(Thread.java:619)
-----
The jobtracker, tasktracker, and secondarynamenode logs appear to be normal.
The hadoop.log file contains the following interesting entries
(I have filtered out the thousands of DEBUG ipc calls and results):
-----
2008-01-10 18:28:18,233 INFO indexer.DeleteDuplicates - Dedup: starting
2008-01-10 18:28:18,234 DEBUG conf.Configuration - java.io.IOException:
config(config)
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:102)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:77)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:88)
at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:27)
at
org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:418)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
2008-01-10 18:28:18,367 INFO indexer.DeleteDuplicates - Dedup: adding indexes
in: /var/nutch/crawl/indexes
2008-01-10 18:28:18,382 DEBUG mapred.JobClient - default FileSystem:
hdfs://sunset2:50000
2008-01-10 18:28:21,672 INFO mapred.InputFormatBase - Total input paths to
process : 16
2008-01-10 18:28:21,674 DEBUG mapred.JobClient - Creating splits at
hdfs://sunset2:50000/var/mapred/system/submit_qb31lw/job.split
2008-01-10 18:28:24,145 INFO mapred.JobClient - Running job: job_0019
2008-01-10 18:28:25,156 INFO mapred.JobClient - map 0% reduce 0%
2008-01-10 18:28:33,267 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:33,304 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)
2008-01-10 18:28:33,516 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:33,553 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)
2008-01-10 18:28:35,485 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)
2008-01-10 18:28:35,657 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)
2008-01-10 18:28:36,858 DEBUG mapred.MapTask - Started thread: Sort progress
reporter for task task_0019_m_000004_0
2008-01-10 18:28:37,406 DEBUG mapred.MapTask - Started thread: Sort progress
reporter for task task_0019_m_000000_0
2008-01-10 18:28:38,133 WARN mapred.TaskTracker - Error running child
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
at
org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2008-01-10 18:28:38,787 DEBUG mapred.MapTask - opened spill0.out
2008-01-10 18:28:39,335 INFO mapred.JobClient - map 6% reduce 0%
2008-01-10 18:28:41,142 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:41,179 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)
2008-01-10 18:28:41,358 INFO mapred.JobClient - Task Id :
task_0019_m_000001_0, Status : FAILED
2008-01-10 18:28:41,494 INFO mapred.JobClient - Task Id :
task_0019_m_000004_0, Status : FAILED
2008-01-10 18:28:42,738 INFO mapred.JobClient - Task Id :
task_0019_m_000005_0, Status : FAILED
2008-01-10 18:28:42,757 INFO mapred.JobClient - Task Id :
task_0019_m_000002_0, Status : FAILED
2008-01-10 18:28:43,338 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)
2008-01-10 18:28:43,716 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:43,758 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)
2008-01-10 18:28:44,494 DEBUG mapred.MapTask - Started thread: Sort progress
reporter for task task_0019_m_000007_0
2008-01-10 18:28:44,798 INFO mapred.JobClient - Task Id :
task_0019_m_000006_0, Status : FAILED
2008-01-10 18:28:45,749 WARN mapred.TaskTracker - Error running child
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
at
org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2008-01-10 18:28:45,912 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)
2008-01-10 18:28:47,047 DEBUG mapred.MapTask - Started thread: Sort progress
reporter for task task_0019_m_000001_1
2008-01-10 18:28:48,253 WARN mapred.TaskTracker - Error running child
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
at
org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2008-01-10 18:28:49,879 INFO mapred.JobClient - Task Id :
task_0019_m_000007_0, Status : FAILED
2008-01-10 18:28:50,908 INFO mapred.JobClient - Task Id :
task_0019_m_000008_0, Status : FAILED
2008-01-10 18:28:50,920 INFO mapred.JobClient - Task Id :
task_0019_m_000004_1, Status : FAILED
2008-01-10 18:28:50,949 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:50,986 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)
2008-01-10 18:28:51,938 INFO mapred.JobClient - Task Id :
task_0019_m_000001_1, Status : FAILED
2008-01-10 18:28:52,969 INFO mapred.JobClient - Task Id :
task_0019_m_000005_1, Status : FAILED
2008-01-10 18:28:53,123 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)
2008-01-10 18:28:53,713 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:53,753 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)
2008-01-10 18:28:54,009 INFO mapred.JobClient - Task Id :
task_0019_m_000009_0, Status : FAILED
2008-01-10 18:28:54,317 DEBUG mapred.MapTask - Started thread: Sort progress
reporter for task task_0019_m_000006_1
2008-01-10 18:28:55,614 WARN mapred.TaskTracker - Error running child
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
at
org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2008-01-10 18:28:55,960 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)
2008-01-10 18:28:57,080 DEBUG mapred.MapTask - Started thread: Sort progress
reporter for task task_0019_m_000008_1
2008-01-10 18:28:58,067 INFO mapred.JobClient - Task Id :
task_0019_m_000003_0, Status : FAILED
2008-01-10 18:28:58,303 WARN mapred.TaskTracker - Error running child
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
at
org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2008-01-10 18:28:59,087 INFO mapred.JobClient - Task Id :
task_0019_m_000007_1, Status : FAILED
2008-01-10 18:28:59,099 INFO mapred.JobClient - Task Id :
task_0019_m_000006_1, Status : FAILED
2008-01-10 18:28:59,112 INFO mapred.JobClient - Task Id :
task_0019_m_000002_1, Status : FAILED
2008-01-10 18:29:02,157 INFO mapred.JobClient - Task Id :
task_0019_m_000008_1, Status : FAILED
2008-01-10 18:29:02,168 INFO mapred.JobClient - Task Id :
task_0019_m_000001_2, Status : FAILED
2008-01-10 18:29:08,247 INFO mapred.JobClient - Task Id :
task_0019_m_000004_2, Status : FAILED
2008-01-10 18:29:17,365 INFO mapred.JobClient - map 100% reduce 100%
2008-01-10 18:29:17,367 INFO mapred.JobClient - Task Id :
task_0019_m_000001_3, Status : FAILED
2008-01-10 18:29:20,870 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:910)
2008-01-10 18:29:25,582 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:910)
-----
If you need me to post log excerpts from the other slaves, please
let me know and I'll put them up.
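I can also post a listing of the index directory if that would help;
I would generate it with something like:
-----
bin/hadoop dfs -ls /var/nutch/crawl/indexes
-----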
Thanks!
JohnM
--
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services