Hi Andrzej,
Thanks for your advice!
I've introduced a new metadata entry which I add to the CrawlDb for each record
using:

  CrawlDatum.getMetaData().put(new Text("_fft_"),
      new Text(String.valueOf(firstFoundLong)));
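In context, my update code looks roughly like this (a simplified sketch; the
helper method name and the way firstFoundLong is computed are placeholders, only
the "_fft_" key and the put() call are what I actually use):

  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;

  // Simplified sketch: attach the first-found timestamp to a datum as a
  // string-valued metadata entry under my custom "_fft_" key.
  public static void setFirstFound(CrawlDatum datum, long firstFoundLong) {
    datum.getMetaData().put(new Text("_fft_"),
        new Text(String.valueOf(firstFoundLong)));
  }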
This works as expected, and during debugging I can see all the metadata in each
CrawlDatum. But when I then use the CrawlDbReader to dump the records, it throws
a NullPointerException whenever a CrawlDatum contains more than one entry in its
metadata MapWritable:
java.lang.RuntimeException: java.lang.NullPointerException
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
  at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
  at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
  at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
  at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
  at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
  at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: java.lang.NullPointerException
  at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
  ... 13 more
Surprisingly, a CrawlDatum that didn't store any metadata initially is dumped
without any errors, even after I have added the "_fft_" field. Do you have any
idea what I may have implemented wrong?
Thanks in advance.
Kind regards,
Martina
-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]]
Sent: Friday, May 8, 2009 23:14
To: [email protected]
Subject: Re: Add new field to CrawlDatum
Koch Martina wrote:
> Hi all,
>
> I'd like to add a new field to the CrawlDatum to capture the date when a URL
> was first found. The field should be called FoundFirst. Can anyone tell me
> which classes I need to modify in order to achieve this? In my opinion, it
> should be sufficient to change the CrawlDatum and CrawlDbReader classes, but I
> think I've missed something because the CrawlDbMerger crashes now. I know
> that I lose compatibility with Nutch, but still...
The easiest (and compatible) way to do this is to use
CrawlDatum.getMetaData(), which is a MapWritable that can store
arbitrary key/value pairs.
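For example, something like this (a minimal sketch; the key name and the string
encoding of the value are arbitrary choices, not something Nutch prescribes):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.nutch.crawl.CrawlDatum;

  // Store a custom value under a Text key of your choosing ...
  CrawlDatum datum = new CrawlDatum();
  datum.getMetaData().put(new Text("foundFirst"),
      new Text(String.valueOf(System.currentTimeMillis())));

  // ... and read it back later, e.g. when dumping the CrawlDb.
  Writable w = datum.getMetaData().get(new Text("foundFirst"));
  if (w != null) {
    long foundFirst = Long.parseLong(w.toString());
  }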
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com