Hello,

I am currently running Nutch 0.9 on 4 nodes.  One node is the
master and also acts as a slave.

I am running the nutch crawl command.
Everything runs fine until it gets to the dedup
command.  The output from the command is as follows:

-----
Dedup: starting
Dedup: adding indexes in: /var/nutch/crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
-----

Can anyone please point me in the right direction to get this
working?  I have included excerpts of the interesting logs below.
I have spent hours reading posts about these errors, where I could
find any.  Many of those posts suggest that some of these messages
are innocuous, since they are only logged at the WARN level.

I also turned on log4j debug logging for the dedup process to
see whether anything else was amiss.  However, I was unable to
determine the cause of the problem.

Everything worked fine when we ran on a single machine, with
everything set to local and no distributed file system.

Thank you in advance for any assistance or pointers you can
provide.

The namenode log on the master has the following errors
which occurred at approximately the same time:

-----
2008-01-10 18:28:03,358 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /var/nutch/crawl/indexes/part-00012 because it does not exist
2008-01-10 18:28:07,145 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /var/nutch/crawl/indexes/part-00011 because it does not exist
2008-01-10 18:28:10,562 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /var/nutch/crawl/indexes/part-00015 because it does not exist
2008-01-10 18:28:12,616 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /var/nutch/crawl/indexes/part-00013 because it does not exist
2008-01-10 18:28:13,955 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /var/nutch/crawl/indexes/part-00014 because it does not exist
2008-01-10 18:28:16,526 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /var/mapred/system/job_0018 because it does not exist
2008-01-10 18:28:22,028 WARN  fs.FSNamesystem - Not able to place enough replicas, still in need of 1
2008-01-10 18:28:22,114 WARN  fs.FSNamesystem - Not able to place enough replicas, still in need of 1
2008-01-10 18:28:22,207 WARN  fs.FSNamesystem - Not able to place enough replicas, still in need of 1
2008-01-10 18:29:16,724 WARN  dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove /var/mapred/system/job_0019 because it does not exist
-----

The datanode log on the master has the following errors
which occurred at approximately the same time:

-----
2008-01-10 18:28:29,742 WARN  dfs.DataNode - Failed to transfer blk_-2596562194274011404 to /76.250.98.171:50010
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1020)
        at java.lang.Thread.run(Thread.java:619)
2008-01-10 18:28:31,412 WARN  dfs.DataNode - Failed to transfer blk_-2596562194274011404 to /76.250.98.171:50010
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1020)
        at java.lang.Thread.run(Thread.java:619)
-----

The jobtracker, tasktracker, and secondarynamenode logs appear to be normal.

The hadoop.log file contains the following interesting entries:
(I have filtered out the thousands of debug ipc calls and results.)

-----
2008-01-10 18:28:18,233 INFO  indexer.DeleteDuplicates - Dedup: starting
2008-01-10 18:28:18,234 DEBUG conf.Configuration - java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:102)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:77)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:88)
        at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:27)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:418)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

2008-01-10 18:28:18,367 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: /var/nutch/crawl/indexes
2008-01-10 18:28:18,382 DEBUG mapred.JobClient - default FileSystem: hdfs://sunset2:50000
2008-01-10 18:28:21,672 INFO  mapred.InputFormatBase - Total input paths to process : 16
2008-01-10 18:28:21,674 DEBUG mapred.JobClient - Creating splits at hdfs://sunset2:50000/var/mapred/system/submit_qb31lw/job.split
2008-01-10 18:28:24,145 INFO  mapred.JobClient - Running job: job_0019
2008-01-10 18:28:25,156 INFO  mapred.JobClient -  map 0% reduce 0%
2008-01-10 18:28:33,267 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:33,304 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)

2008-01-10 18:28:33,516 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:33,553 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)

2008-01-10 18:28:35,485 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)

2008-01-10 18:28:35,657 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)

2008-01-10 18:28:36,858 DEBUG mapred.MapTask - Started thread: Sort progress reporter for task task_0019_m_000004_0
2008-01-10 18:28:37,406 DEBUG mapred.MapTask - Started thread: Sort progress reporter for task task_0019_m_000000_0
2008-01-10 18:28:38,133 WARN  mapred.TaskTracker - Error running child
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2008-01-10 18:28:38,787 DEBUG mapred.MapTask - opened spill0.out
2008-01-10 18:28:39,335 INFO  mapred.JobClient -  map 6% reduce 0%
2008-01-10 18:28:41,142 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:41,179 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)

2008-01-10 18:28:41,358 INFO  mapred.JobClient - Task Id : task_0019_m_000001_0, Status : FAILED
2008-01-10 18:28:41,494 INFO  mapred.JobClient - Task Id : task_0019_m_000004_0, Status : FAILED
2008-01-10 18:28:42,738 INFO  mapred.JobClient - Task Id : task_0019_m_000005_0, Status : FAILED
2008-01-10 18:28:42,757 INFO  mapred.JobClient - Task Id : task_0019_m_000002_0, Status : FAILED
2008-01-10 18:28:43,338 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)

2008-01-10 18:28:43,716 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:43,758 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)

2008-01-10 18:28:44,494 DEBUG mapred.MapTask - Started thread: Sort progress reporter for task task_0019_m_000007_0
2008-01-10 18:28:44,798 INFO  mapred.JobClient - Task Id : task_0019_m_000006_0, Status : FAILED
2008-01-10 18:28:45,749 WARN  mapred.TaskTracker - Error running child
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2008-01-10 18:28:45,912 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)

2008-01-10 18:28:47,047 DEBUG mapred.MapTask - Started thread: Sort progress reporter for task task_0019_m_000001_1
2008-01-10 18:28:48,253 WARN  mapred.TaskTracker - Error running child
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2008-01-10 18:28:49,879 INFO  mapred.JobClient - Task Id : task_0019_m_000007_0, Status : FAILED
2008-01-10 18:28:50,908 INFO  mapred.JobClient - Task Id : task_0019_m_000008_0, Status : FAILED
2008-01-10 18:28:50,920 INFO  mapred.JobClient - Task Id : task_0019_m_000004_1, Status : FAILED
2008-01-10 18:28:50,949 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:50,986 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)

2008-01-10 18:28:51,938 INFO  mapred.JobClient - Task Id : task_0019_m_000001_1, Status : FAILED
2008-01-10 18:28:52,969 INFO  mapred.JobClient - Task Id : task_0019_m_000005_1, Status : FAILED
2008-01-10 18:28:53,123 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)

2008-01-10 18:28:53,713 DEBUG mapred.TaskTracker - Child starting
2008-01-10 18:28:53,753 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:58)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1425)

2008-01-10 18:28:54,009 INFO  mapred.JobClient - Task Id : task_0019_m_000009_0, Status : FAILED
2008-01-10 18:28:54,317 DEBUG mapred.MapTask - Started thread: Sort progress reporter for task task_0019_m_000006_1
2008-01-10 18:28:55,614 WARN  mapred.TaskTracker - Error running child
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2008-01-10 18:28:55,960 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:107)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:99)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1435)

2008-01-10 18:28:57,080 DEBUG mapred.MapTask - Started thread: Sort progress reporter for task task_0019_m_000008_1
2008-01-10 18:28:58,067 INFO  mapred.JobClient - Task Id : task_0019_m_000003_0, Status : FAILED
2008-01-10 18:28:58,303 WARN  mapred.TaskTracker - Error running child
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2008-01-10 18:28:59,087 INFO  mapred.JobClient - Task Id : task_0019_m_000007_1, Status : FAILED
2008-01-10 18:28:59,099 INFO  mapred.JobClient - Task Id : task_0019_m_000006_1, Status : FAILED
2008-01-10 18:28:59,112 INFO  mapred.JobClient - Task Id : task_0019_m_000002_1, Status : FAILED
2008-01-10 18:29:02,157 INFO  mapred.JobClient - Task Id : task_0019_m_000008_1, Status : FAILED
2008-01-10 18:29:02,168 INFO  mapred.JobClient - Task Id : task_0019_m_000001_2, Status : FAILED
2008-01-10 18:29:08,247 INFO  mapred.JobClient - Task Id : task_0019_m_000004_2, Status : FAILED
2008-01-10 18:29:17,365 INFO  mapred.JobClient -  map 100% reduce 100%
2008-01-10 18:29:17,367 INFO  mapred.JobClient - Task Id : task_0019_m_000001_3, Status : FAILED
2008-01-10 18:29:20,870 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:910)

2008-01-10 18:29:25,582 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:910)
-----
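
For anyone who wants to trim a hadoop.log the same way, here is a
minimal sketch of that kind of filtering.  It is not the exact
command used for the excerpt above.  It assumes the ipc debug
records are logged under a logger whose name starts with "ipc."
(an assumption, since those lines are not shown here); the function
name filter_log and its drop_prefix parameter are made up for
illustration.

```python
import re

# Start of a log record, e.g.
# "2008-01-10 18:28:18,233 INFO  indexer.DeleteDuplicates - Dedup: starting"
RECORD_START = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+) - "
)


def filter_log(lines, drop_prefix="ipc."):
    """Return lines with DEBUG records from loggers starting with
    drop_prefix removed, together with their continuation lines
    (stack traces, wrapped text)."""
    kept = []
    dropping = False
    for line in lines:
        match = RECORD_START.match(line)
        if match:
            # A new record begins: decide whether to drop it.
            dropping = (
                match.group("level") == "DEBUG"
                and match.group("logger").startswith(drop_prefix)
            )
        # Continuation lines inherit the decision of their record.
        if not dropping:
            kept.append(line)
    return kept
```

Typical use would be to read hadoop.log with readlines(), pass the
result through filter_log(), and write the returned lines to a new
file.  Only DEBUG records are dropped, so WARN and INFO entries
from the same logger are kept.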

If you need me to post log excerpts from the other slaves, please
let me know and I'll put them up.

Thanks!

JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
