Hi Andrzej,
Thanks for your advice!
I've introduced a new metadata entry which I add to the CrawlDb for each record
using:

  CrawlDatum.getMetaData().put(new Text("_fft_"),
      new Text(String.valueOf(firstFoundLong)));
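In context, my update code looks roughly like this (a simplified sketch; the
helper method name and the way firstFoundLong is computed are placeholders, only
the "_fft_" key and the put() call are what I actually use):

  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;

  // Simplified sketch: attach the first-found timestamp to a datum as a
  // string-valued metadata entry under my custom "_fft_" key.
  public static void setFirstFound(CrawlDatum datum, long firstFoundLong) {
    datum.getMetaData().put(new Text("_fft_"),
        new Text(String.valueOf(firstFoundLong)));
  }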
This works as expected, and during debugging I can see all the metadata in each
CrawlDatum. But when I then use the CrawlDbReader to dump the records, it throws
a NullPointerException whenever a CrawlDatum contains more than one entry in its
metadata MapWritable:
java.lang.RuntimeException: java.lang.NullPointerException
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
  at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
  at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
  at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
  at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
  at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
  at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: java.lang.NullPointerException
  at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
  ... 13 more
Surprisingly, a CrawlDatum that didn't store any metadata initially is dumped
without any errors, even after I have added the "_fft_" field. Do you have any
idea what I may have implemented wrong?
Thanks in advance.
Kind regards,
Martina
-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]]
Sent: Friday, May 8, 2009 23:14
To: [email protected]
Subject: Re: Add new field to CrawlDatum
Koch Martina wrote:
> Hi all,
>
> I'd like to add a new field to the CrawlDatum to capture the date when a URL
> was first found. The field should be called FoundFirst. Can anyone tell me
> which classes I need to modify in order to achieve this? In my opinion, it
> should be sufficient to change the CrawlDatum and CrawlDbReader classes, but I
> think I've missed something because the CrawlDbMerger crashes now. I know
> that I lose compatibility with Nutch, but still...
The easiest (and compatible) way to do this is to use
CrawlDatum.getMetaData(), which is a MapWritable that can store
arbitrary key/value pairs.
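For example, something like this (a minimal sketch; the key name and the string
encoding of the value are arbitrary choices, not something Nutch prescribes):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.nutch.crawl.CrawlDatum;

  // Store a custom value under a Text key of your choosing ...
  CrawlDatum datum = new CrawlDatum();
  datum.getMetaData().put(new Text("foundFirst"),
      new Text(String.valueOf(System.currentTimeMillis())));

  // ... and read it back later, e.g. when dumping the CrawlDb.
  Writable w = datum.getMetaData().get(new Text("foundFirst"));
  if (w != null) {
    long foundFirst = Long.parseLong(w.toString());
  }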
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com