Nutch does not seem to handle a larger number of URLs well: it uses several tens of gigabytes of disk space (about 80 GB) to fetch, merge segments, and index only about 30-40k URLs. Is that normal?
Please let me know where the problem is.
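
For reference, going by the hadoop.log excerpt below, the failing part of the run corresponds roughly to the following commands (the exact arguments here are only my sketch to make the sequence concrete; they may not match my crawl script word for word):

  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb crawl/segments/20090305013912
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/20090305013912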

I see this in hadoop.log:
2009-03-05 01:39:10,227 INFO  crawl.CrawlDb - CrawlDb update: done
2009-03-05 01:39:12,150 INFO  segment.SegmentMerger - Merging 2 segments to crawl/MERGEDsegments/20090305013912
2009-03-05 01:39:12,160 INFO  segment.SegmentMerger - SegmentMerger:   adding crawl/segments/20090304102421
2009-03-05 01:39:12,190 INFO  segment.SegmentMerger - SegmentMerger:   adding crawl/segments/20090304165203
2009-03-05 01:39:12,195 INFO  segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
2009-03-05 01:39:12,250 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-03-05 15:52:23,822 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
2009-03-05 15:52:28,831 INFO  crawl.LinkDb - LinkDb: starting
2009-03-05 15:52:28,832 INFO  crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
2009-03-05 15:52:28,832 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2009-03-05 15:52:28,833 INFO  crawl.LinkDb - LinkDb: URL filter: true
2009-03-05 15:52:28,873 INFO  crawl.LinkDb - LinkDb: adding segment: crawl/segments/20090305013912
2009-03-05 15:52:32,950 WARN  mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/fss/nutch/crawl/segments/20090305013912/parse_data/part-00000/data at 6814720
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:211)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:238)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:177)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:194)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
2009-03-05 15:52:33,937 FATAL crawl.LinkDb - LinkDb: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)

2009-03-05 15:52:34,808 INFO  indexer.Indexer - Indexer: starting
2009-03-05 15:52:34,822 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2009-03-05 15:52:34,822 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2009-03-05 15:52:34,822 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20090305013912
2009-03-05 15:52:36,797 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-03-05 15:52:36,898 WARN  mime.MimeTypesReader - Not a <mime-info/> configuration document
2009-03-05 15:52:36,898 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2009-03-05 15:52:37,620 WARN  mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/fss/nutch/crawl/segments/20090305013912/crawl_fetch/part-00000/data at 1047552
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:211)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:238)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:177)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:194)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
2009-03-05 15:52:37,627 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)
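
Everything runs in local mode (LocalJobRunner), so for the OutOfMemoryError during the segment merge I assume the relevant knob is the client JVM heap rather than any per-task memory setting. This is only a guess on my part, but if I read bin/nutch correctly it honours NUTCH_HEAPSIZE (in megabytes) for the client's -Xmx, so I would try something like:

  # raise the heap used by the bin/nutch client JVM (the 2000 is only illustrative)
  export NUTCH_HEAPSIZE=2000
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments

and then rerun invertlinks and index against a freshly merged segment, since the ChecksumExceptions suggest the existing segment data is corrupt. Is that the right direction, or does the 80 GB of disk usage point at something else?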
