Hi Doğacan,
thanks for your reply!
I applied the patch, but I still get the same error message.
I also tried to merge the old crawldb into a new one and then run readdb on that, but
even the merge step fails with the following error message:
2009-02-11 08:35:31,520 INFO jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2009-02-11 08:35:31,707 INFO mapred.FileInputFormat - Total input paths to process : 1
2009-02-11 08:35:32,004 INFO mapred.JobClient - Running job: job_local_0001
2009-02-11 08:35:32,004 INFO mapred.FileInputFormat - Total input paths to process : 1
2009-02-11 08:35:32,082 INFO mapred.MapTask - numReduceTasks: 1
2009-02-11 08:35:32,082 INFO mapred.MapTask - io.sort.mb = 100
2009-02-11 08:35:32,191 INFO mapred.MapTask - data buffer = 79691776/99614720
2009-02-11 08:35:32,191 INFO mapred.MapTask - record buffer = 262144/327680
2009-02-11 08:35:32,222 WARN mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: java.lang.NullPointerException
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
        at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
        ... 13 more
2009-02-11 08:35:33,003 FATAL crawl.CrawlDbMerger - CrawlDb merge: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
        at org.apache.nutch.crawl.CrawlDbMerger.merge(CrawlDbMerger.java:119)
        at org.apache.nutch.crawl.CrawlDbMerger.run(CrawlDbMerger.java:178)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.CrawlDbMerger.main(CrawlDbMerger.java:150)
I ran the merge step in a debugger and saw that the new code in CrawlDbMerger
is never executed. The error occurs earlier, somewhere in the merge method.
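For what it's worth, the bottom of the trace suggests what is going on: this is a minimal, self-contained sketch (not actual Hadoop or Nutch code), based only on my reading of the stack trace, of why it ends in a NullPointerException. If MapWritable resolves a stored class id to null (e.g. the id was written by a different trunk version) and passes that to ReflectionUtils.newInstance, the lookup in its ConcurrentHashMap-based constructor cache throws, because ConcurrentHashMap rejects null keys. The class and method names below are made up for the demo.

```java
import java.util.concurrent.ConcurrentHashMap;

public class NullKeyLookup {
    // Stand-in for ReflectionUtils' constructor cache, which is a ConcurrentHashMap.
    static final ConcurrentHashMap<Class<?>, Object> CONSTRUCTOR_CACHE =
            new ConcurrentHashMap<Class<?>, Object>();

    // Returns true if looking up a null key throws NullPointerException.
    static boolean lookupThrowsNpe() {
        Class<?> resolved = null; // stand-in for a class id the reader could not resolve
        try {
            CONSTRUCTOR_CACHE.get(resolved); // ConcurrentHashMap.get(null) throws NPE
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE on null key: " + lookupThrowsNpe());
    }
}
```

So the NPE would be a symptom of a mismatched or unreadable class-id table in the old data, not of the cache itself.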
Kind regards,
Martina
-----Original Message-----
From: Doğacan Güney [mailto:[email protected]]
Sent: Tuesday, 10 February 2009 22:54
To: [email protected]
Subject: Re: "old" crawldb not readable with current trunk
On Tue, Feb 10, 2009 at 4:47 PM, Koch Martina <[email protected]> wrote:
> Hi,
>
> I just upgraded from trunk version 28.12.2008 to trunk version 04.02.2009.
> Now I'm trying to read my old crawldbs, e.g. with the command "bin/nutch
> readdb <crawldb> -stats", but I always get the following error:
>
> 2009-02-10 15:41:05,541 DEBUG mapred.MapTask - Writing local split to /tmp/CRAWLNAME.default.xyz/mapred/local/localRunner/split.dta
> 2009-02-10 15:41:05,588 DEBUG mapred.TaskRunner - attempt_local_0001_m_000000_0 Progress/ping thread started
> 2009-02-10 15:41:05,588 INFO mapred.MapTask - numReduceTasks: 1
> 2009-02-10 15:41:05,588 INFO mapred.MapTask - io.sort.mb = 100
> 2009-02-10 15:41:05,698 INFO mapred.MapTask - data buffer = 79691776/99614720
> 2009-02-10 15:41:05,698 INFO mapred.MapTask - record buffer = 262144/327680
> 2009-02-10 15:41:05,713 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
> 2009-02-10 15:41:05,713 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
> 2009-02-10 15:41:05,713 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
> 2009-02-10 15:41:05,713 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
> 2009-02-10 15:41:05,713 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
> 2009-02-10 15:41:05,713 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
> 2009-02-10 15:41:05,713 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
> 2009-02-10 15:41:05,729 WARN mapred.LocalJobRunner - job_local_0001
> java.lang.RuntimeException: java.lang.NullPointerException
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>         at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>         at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>         at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> Caused by: java.lang.NullPointerException
>         at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>         ... 13 more
>
> With the older version of the trunk I can read the crawldb without difficulty.
>
> Have the old files become unreadable with the new trunk version since the
> upgrade to Lucene 2.4?
> Is there anything I can do to re-use my old data with the new version?
>
Try again in a couple of days. This is a known bug (NUTCH-683). I will
commit the patch very soon. Meanwhile, you can apply the patch from that
issue manually.
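[Editor's note: for readers unfamiliar with applying a JIRA patch by hand, this is a self-contained demonstration of the usual `patch(1)` workflow. With a real checkout you would download the attachment from the NUTCH-683 issue and run something like `patch -p0 < NUTCH-683.patch` from the trunk root; all file names below are invented for the demo.]

```shell
# Build a tiny "before" and "after" tree and generate a unified diff,
# standing in for a patch attached to a JIRA issue.
mkdir -p a b
printf 'public class CrawlDbMerger {}\n' > a/CrawlDbMerger.java
printf 'public class CrawlDbMerger { /* fixed */ }\n' > b/CrawlDbMerger.java
diff -u a/CrawlDbMerger.java b/CrawlDbMerger.java > demo.patch || true  # diff exits 1 on differences

# Apply the patch to the old tree, naming the target file explicitly.
patch a/CrawlDbMerger.java < demo.patch

# Confirm the fix landed.
grep -q 'fixed' a/CrawlDbMerger.java && echo "patch applied"
```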
> Kind regards,
> Martina
>
--
Doğacan Güney