I received the following error during the linkdb stage of indexing. Has
anyone encountered this before? Is there a way to increase the memory for
this stage in the config file? Is there a known memory leak in linkdb?
2007-10-09 10:56:37,787 INFO crawl.LinkDb - LinkDb: starting
2007-10-09 10:56:37,788 INFO crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
2007-10-09 10:56:37,788 INFO crawl.LinkDb - LinkDb: URL normalize: true
2007-10-09 10:56:37,788 INFO crawl.LinkDb - LinkDb: URL filter: true
2007-10-09 10:56:37,886 INFO crawl.LinkDb - LinkDb: adding segment: /user/daclark/crawl/segments/20071008185033
2007-10-09 10:56:39,977 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2007-10-09 10:56:42,495 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2007-10-09 10:56:51,415 WARN mapred.TaskTracker - Error running child
java.lang.OutOfMemoryError: Java heap space
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.Text.writeString(Text.java:399)
at org.apache.nutch.crawl.Inlink.write(Inlink.java:48)
at org.apache.nutch.crawl.Inlinks.write(Inlinks.java:54)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:315)
at org.apache.nutch.crawl.LinkDb.map(LinkDb.java:167)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
2007-10-09 10:57:40,654 FATAL crawl.LinkDb - LinkDb: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:377)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:333)
~~~~~~~~~~~~~~~~~~~~~
Daniel Clark, President
DAC Systems, Inc.
(703) 403-0340
~~~~~~~~~~~~~~~~~~~~~