Nutch does not seem to cope with a large number of URLs: it needs several tens of gigabytes of disk space (about 80 GB) to fetch, merge segments, and index only 30,000-40,000 URLs. Is that normal? Please help me figure out where the problem is.
I see this in hadoop.log:

2009-03-05 01:39:10,227 INFO crawl.CrawlDb - CrawlDb update: done
2009-03-05 01:39:12,150 INFO segment.SegmentMerger - Merging 2 segments to crawl/MERGEDsegments/20090305013912
2009-03-05 01:39:12,160 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20090304102421
2009-03-05 01:39:12,190 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20090304165203
2009-03-05 01:39:12,195 INFO segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
2009-03-05 01:39:12,250 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-03-05 15:52:23,822 WARN mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
2009-03-05 15:52:28,831 INFO crawl.LinkDb - LinkDb: starting
2009-03-05 15:52:28,832 INFO crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
2009-03-05 15:52:28,832 INFO crawl.LinkDb - LinkDb: URL normalize: true
2009-03-05 15:52:28,833 INFO crawl.LinkDb - LinkDb: URL filter: true
2009-03-05 15:52:28,873 INFO crawl.LinkDb - LinkDb: adding segment: crawl/segments/20090305013912
2009-03-05 15:52:32,950 WARN mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/fss/nutch/crawl/segments/20090305013912/parse_data/part-00000/data at 6814720
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:211)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:238)
    at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:177)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:194)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
2009-03-05 15:52:33,937 FATAL crawl.LinkDb - LinkDb: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
2009-03-05 15:52:34,808 INFO indexer.Indexer - Indexer: starting
2009-03-05 15:52:34,822 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2009-03-05 15:52:34,822 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2009-03-05 15:52:34,822 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20090305013912
2009-03-05 15:52:36,797 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-03-05 15:52:36,898 WARN mime.MimeTypesReader - Not a <mime-info/> configuration document
2009-03-05 15:52:36,898 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2009-03-05 15:52:37,620 WARN mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/fss/nutch/crawl/segments/20090305013912/crawl_fetch/part-00000/data at 1047552
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:211)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:238)
    at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:177)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:194)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
2009-03-05 15:52:37,627 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
    at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)
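In case it helps to narrow down the ChecksumException, here is a small scanner I can run against a single part file (e.g. crawl/segments/20090305013912/parse_data/part-00000/data) to see how many records are readable before the checksum error hits. It is only a rough sketch against the plain SequenceFile reader API of the Hadoop version bundled with my Nutch checkout; the class name is my own, and the Nutch jars have to be on the classpath so the stored key/value classes can be instantiated.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Reads one segment part file record by record and reports how far it
 * gets before a checksum (or any other) error is thrown.
 */
public class SegmentPartScanner {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // e.g. crawl/segments/20090305013912/parse_data/part-00000/data
    Path part = new Path(args[0]);
    // Resolves to the local (checksummed) filesystem for a file: path, as in the log.
    FileSystem fs = part.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    // Instantiate the key/value classes recorded in the file header.
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long records = 0;
    try {
      while (reader.next(key, value)) {
        records++;
      }
      System.out.println("read " + records + " records cleanly");
    } catch (Exception e) {
      System.out.println("failed after " + records + " records, at byte "
          + reader.getPosition() + ": " + e);
    } finally {
      reader.close();
    }
  }
}

If it stops around the same offset that appears in the log, the data file itself is presumably damaged on disk rather than being misread by the merge and index jobs.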