Hi,
Any ideas on this?


> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: problem ending crawl nutch 1.0 - DeleteDuplicates
> Date: Fri, 2 Oct 2009 19:36:06 +0000
> 
> 
> 
> Hi,
> 
> Two days ago I tried to rename two meta fields in the index-basic plugin:
> I renamed the fields 'url' and 'content' to 'web.url' and 'web.content'
> in BasicIndexingFilter.java.
> 
> 
> 
> After that I ran 'ant' to build the project.
> 
> I then copied the plugin folder 'index-basic' from nutch-1.0/build/plugins/
> to /nutch-1.0/plugins.
> 
> Since that change I get this error when crawling:
> 
> 
> 
> 2009-10-02 15:15:44,145 INFO  indexer.DeleteDuplicates - Dedup: starting
> 2009-10-02 15:15:44,147 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl_dc/indexes
> 2009-10-02 15:15:44,153 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-10-02 15:15:45,518 WARN  mapred.LocalJobRunner - job_local_0013
> java.lang.NullPointerException
>         at org.apache.hadoop.io.Text.encode(Text.java:388)
>         at org.apache.hadoop.io.Text.set(Text.java:178)
>         at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:191)
>         at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:157)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> 
> 
> 
> 
> 
> Even after rolling the change back I still get the same problem.
> 
> What should I do, please?
> 
> 
> 
> public class BasicIndexingFilter implements IndexingFilter {
> 
> .....
> 
>     doc.add("web.url", reprUrlString == null ? urlString : reprUrlString);
>     doc.add("web.content", parse.getText());
> 
> ....
> 
>     public void addIndexBackendOptions(Configuration conf) {
> 
> ....
> 
>         // url is both stored and indexed, so it's both searchable and returned
>         LuceneWriter.addFieldOptions("web.url", LuceneWriter.STORE.YES,
>             LuceneWriter.INDEX.TOKENIZED, conf);
> 
>         // content is indexed, so that it's searchable, but not stored in the index
>         LuceneWriter.addFieldOptions("web.content", LuceneWriter.STORE.NO,
>             LuceneWriter.INDEX.TOKENIZED, conf);
> 
> .....
>     } // end of addIndexBackendOptions
> 
> ....
> } // end of class
> 
> 
> 
> thx
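
For what it's worth, my guess at the cause (not verified against the Nutch source): the stack trace shows Text.set() being called from DeleteDuplicates$InputFormat$DDRecordReader.next(), which suggests DeleteDuplicates reads the index field literally named 'url' for each document; after the rename that lookup returns null, and passing null into Text.set() throws exactly this NullPointerException. A minimal, Nutch-free Java sketch of that failure mode (RenamedFieldDemo and its field values are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class (not Nutch code): shows why a consumer that
// still looks up the old field name "url" fails after the fields were
// renamed to "web.url"/"web.content".
public class RenamedFieldDemo {

    static String lookupOldField() {
        // Simulated indexed document after the rename described above.
        Map<String, String> doc = new HashMap<>();
        doc.put("web.url", "http://example.com/");
        doc.put("web.content", "page text");

        // A DeleteDuplicates-style consumer still asks for "url" -> null,
        // and using that null (as Text.set() would) throws the NPE.
        String url = doc.get("url");
        try {
            return "url length = " + url.length();
        } catch (NullPointerException e) {
            return "NPE: field 'url' missing after rename";
        }
    }

    public static void main(String[] args) {
        System.out.println(lookupOldField());
    }
}
```

If that guess is right, the rollback still failing could mean the old build is still on the plugin path, or that the index being deduped was built while the renamed fields were in effect.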
> 
