RE: problem ending crawl nutch 1.0 - DeleteDuplicates

BELLINI ADAM Tue, 06 Oct 2009 06:59:54 -0700

Hi,

I forgot to say that when the error happens and the crawl stops, it creates 
the folder 'dedup-urls-485515157'.
Can someone tell me what needs to be done after running 'ant', concerning 
the jars, the build, etc.?


Thanks



> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: problem ending crawl nutch 1.0 - DeleteDuplicates
> Date: Sun, 4 Oct 2009 16:21:13 +0000
> 
> 
> Hi,
> Any ideas?
> 
> 
> 
> > From: mbel...@msn.com
> > To: nutch-user@lucene.apache.org
> > Subject: problem ending crawl nutch 1.0 - DeleteDuplicates
> > Date: Fri, 2 Oct 2009 19:36:06 +0000
> > 
> > 
> > 
> > Hi,
> > 
> > Two days ago I tried to change the names of 2 fields in the index-basic 
> > plugin:
> > I renamed the 2 fields 'url' and 'content' to 'web.url' and 'web.content' 
> > in BasicIndexingFilter.java (see the code below).
> > 
> > After that I ran 'ant' to build the project.
> > 
> > I copied the plugin folder 'index-basic' from nutch-1.0/build/plugins/ 
> > to /nutch-1.0/plugins.
> > 
> > Since those changes I get this error when crawling:
> > 
> > 
> > 
> > 2009-10-02 15:15:44,145 INFO  indexer.DeleteDuplicates - Dedup: starting
> > 2009-10-02 15:15:44,147 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl_dc/indexes
> > 2009-10-02 15:15:44,153 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > 2009-10-02 15:15:45,518 WARN  mapred.LocalJobRunner - job_local_0013
> > java.lang.NullPointerException
> >         at org.apache.hadoop.io.Text.encode(Text.java:388)
> >         at org.apache.hadoop.io.Text.set(Text.java:178)
> >         at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:191)
> >         at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:157)
> >         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> >         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> >         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> > 
> > 
> > 
> > 
> > 
> > Even after rolling back the change I still have the same problem.
> > 
> > What should I do, please?
> > 
> > 
> > 
> > public class BasicIndexingFilter implements IndexingFilter {
> > 
> >   .....
> > 
> >   doc.add("web.url", reprUrlString == null ? urlString : reprUrlString);
> >   doc.add("web.content", parse.getText());
> > 
> >   ....
> > 
> >   public void addIndexBackendOptions(Configuration conf) {
> > 
> >     ....
> > 
> >     // url is both stored and indexed, so it's both searchable and returned
> >     LuceneWriter.addFieldOptions("web.url", LuceneWriter.STORE.YES,
> >         LuceneWriter.INDEX.TOKENIZED, conf);
> > 
> >     // content is indexed, so that it's searchable, but not stored in index
> >     LuceneWriter.addFieldOptions("web.content", LuceneWriter.STORE.NO,
> >         LuceneWriter.INDEX.TOKENIZED, conf);
> > 
> >     .....
> >   } // end of method addIndexBackendOptions
> > 
> >   ....
> > } // end of class
> > 
> > 
> > 
> > Thanks
> > 

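The trace shows Text.set receiving a null value inside DDRecordReader.next. One plausible reading, which is an assumption and is not confirmed anywhere in this thread, is that the dedup step looks up a stored field named 'url' on each indexed document and gets null for documents that were indexed with 'web.url' instead. If so, an index built with the renamed fields and still sitting in crawl_dc/indexes could keep triggering the error even after the code is rolled back. The sketch below only illustrates that failure mode with plain Java collections; the class and method names are hypothetical stand-ins, not Nutch, Lucene, or Hadoop APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a Map stands in for the stored fields
// of one indexed document.
class DedupKeyLookup {

    // Stand-in for a stored-field lookup: returns null when no field
    // with that name was stored in the document.
    static String storedField(Map<String, String> doc, String name) {
        return doc.get(name);
    }

    // Stand-in for setting a record key from a string: like any code
    // that has to encode the value, it fails on null.
    static int keyLength(String value) {
        return value.length(); // NullPointerException if value is null
    }

    // True when looking up the hard-coded field name "url" ends in a
    // NullPointerException for this document.
    static boolean lookupFails(Map<String, String> doc) {
        try {
            keyLength(storedField(doc, "url"));
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> original = new HashMap<>();
        original.put("url", "http://example.com/");

        Map<String, String> renamed = new HashMap<>();
        renamed.put("web.url", "http://example.com/");

        System.out.println("original fails: " + lookupFails(original)); // false
        System.out.println("renamed fails: " + lookupFails(renamed));   // true
    }
}
```

If this reading is right, any consumer that expects the field name 'url' would need to be changed together with the indexing filter, and indexes written with the renamed fields would need to be rebuilt after a rollback.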