(nutch-nightly, hadoop 0.9.1)

Got this during a nightly crawl that adds ~40K pages to a ~150K-page nutch db. The crawl has run fine the past five nights with the same settings and script. The failure happens during the nutch mergesegs step of the re-crawl cycle, which crashes with:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
        at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:547)
        at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:595)

The subsequent re-crawl commands (invertlinks, index, dedup) then fail as well, leaving me with a corrupt webdb/index.
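
For reference, the cycle is the standard sequence of nutch commands. A minimal sketch of what the script runs (directory names here are illustrative, not my actual layout):

    # merge the night's segments into one
    bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
    # then rebuild the link db, the index, and dedup
    bin/nutch invertlinks crawl/linkdb -dir crawl/MERGEDsegments
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/MERGEDsegments/*
    bin/nutch dedup crawl/indexes

Since mergesegs dies partway through, everything downstream is presumably operating on a half-written merged segment, which would explain why the later steps fail too.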

The error below is from my hadoop log. The file it flagged (bad_files/data.-931801681) is a 255MB binary file; running strings on it shows a lot of URIs. There's also a 2MB .data.crc-931801681 file next to it, also all binary.
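
The crc file size is at least consistent with the standard ChecksumFileSystem layout, if I have that right: 4 bytes of CRC-32 per 512-byte chunk of a 255MB data file comes to roughly 2MB. For anyone who wants to poke at the file, I've been inspecting the bytes around the offset the exception reports (2387968) with something like:

    # dump the region around the failing offset reported in the log below
    dd if=bad_files/data.-931801681 bs=1 skip=2387968 count=256 2>/dev/null | od -c | head -20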

Any idea how this happened, or how to avoid it?


2007-01-15 01:56:52,303 INFO  mapred.MapTask - opened part-0.out
2007-01-15 01:56:52,696 WARN  dfs.DistributedFileSystem - Moving bad file /array/nutch-nightly/crawl/segments/20070114192132/content/part-00000/data to /array/nutch-nightly/bad_files/data.-931801681
2007-01-15 01:56:52,739 WARN  mapred.LocalJobRunner - job_u5iokg
org.apache.hadoop.fs.ChecksumException: Checksum error: /array/nutch-nightly/crawl/segments/20070114192132/content/part-00000/data at 2387968
        at org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:138)
        at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:114)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:189)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1280)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1191)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1237)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:71)
        at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:123)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:203)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
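
Before re-running the merge I'm planning to check whether the damage is confined to that one part file by dumping each remaining segment with readseg; a rough sketch, assuming a dump that completes cleanly means the merger can read that segment too:

    # try to read every segment end-to-end (the bad part file has
    # already been moved aside by hadoop)
    for seg in crawl/segments/*; do
        bin/nutch readseg -dump "$seg" "/tmp/segcheck/$(basename "$seg")" \
            || echo "unreadable: $seg"
    done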


--
http://variogr.am/
[EMAIL PROTECTED]



