Hi,
I have 4 boxes (1 master, 3 slaves), about 33GB worth of segment data
and 4.6M fetched URLs in my crawldb. I'm using the mapred code from
trunk (revision 374061, Wed, 01 Feb 2006).
I was able to generate the indexes from the crawldb and linkdb, but I
started to see this error recently while running a dedup on my indexes:
....
060210 061707 reduce 9%
060210 061710 reduce 10%
060210 061713 reduce 11%
060210 061717 reduce 12%
060210 061719 reduce 11%
060210 061723 reduce 10%
060210 061725 reduce 11%
060210 061726 reduce 10%
060210 061729 reduce 11%
060210 061730 reduce 9%
060210 061732 reduce 10%
060210 061736 reduce 11%
060210 061739 reduce 12%
060210 061742 reduce 10%
060210 061743 reduce 9%
060210 061745 reduce 10%
060210 061746 reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:310)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:329)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:349)
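For reference, I'm invoking the dedup step roughly like this (the index directory name is from my setup, so treat it as a placeholder):

```shell
# Run Nutch's DeleteDuplicates job over the generated Lucene indexes.
# "indexes" is the directory produced by the indexing step.
bin/nutch dedup indexes
```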
I can see a lot of these messages in the jobtracker log on the master:
...
060210 061743 Task 'task_r_4t50k4' has been lost.
060210 061743 Task 'task_r_79vn7i' has been lost.
...
On every single slave, I get this file not found exception in the
tasktracker log:
060210 061749 Server handler 0 on 50040 caught: java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_273opj/part-4.out
java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_273opj/part-4.out
        at org.apache.nutch.fs.LocalFileSystem.openRaw(LocalFileSystem.java:121)
        at org.apache.nutch.fs.NFSDataInputStream$Checker.&lt;init&gt;(NFSDataInputStream.java:45)
        at org.apache.nutch.fs.NFSDataInputStream.&lt;init&gt;(NFSDataInputStream.java:226)
        at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
        at org.apache.nutch.mapred.MapOutputFile.write(MapOutputFile.java:93)
        at org.apache.nutch.io.ObjectWritable.writeObject(ObjectWritable.java:121)
        at org.apache.nutch.io.ObjectWritable.write(ObjectWritable.java:68)
        at org.apache.nutch.ipc.Server$Handler.run(Server.java:215)
I used to be able to complete the index dedupping successfully when my
segments and crawldb were smaller, but I don't see how the size would be
related to the FileNotFoundException. I'm nowhere near running out of
disk space, and my disks are working properly.
Has anyone encountered a similar issue, or does anyone have a clue about what's happening?
Thanks,
Florent
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general