[Nutch-dev] Re: Problem with latest SVN during reduce phase

Byron Miller Fri, 13 Jan 2006 06:18:14 -0800

I'll pull it down today and give it a shot.

thanks,
-byron


--- Lukas Vlcek <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> Get the latest svn version. Andrzej committed some patches yesterday
> and now this issue is gone (at least it works fine for me). I believe
> that revision# 368167 is what we were about.
> 
> Regards,
> Lukas
> 
> On 1/13/06, Pashabhai <[EMAIL PROTECTED]>
> wrote:
> > Hi ,
> >
> >    You are right, the Parse object is not null even though
> > the page has no content and title.
> >
> >    Could it be the FetcherOutput object?
> >
> >
> > P
> >
> > --- Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > > I think this issue can be more complex. If I remember my test
> > > correctly then the parse object was not null. Also parse.getText() was
> > > not null (it just contained an empty String).
> > > If a document is not parsed correctly then an "empty" parse is returned
> > > instead: parseStatus.getEmptyParse(); which should be OK, but I didn't
> > > have a chance to check if this can cause any trouble during
> > > index optimization.
> > > Lukas
> > >
> > > On 1/12/06, Pashabhai <[EMAIL PROTECTED]>
> > > wrote:
> > > > Hi ,
> > > >
> > > >    A very similar exception occurs while indexing a
> > > > page which does not have body content (and sometimes no title).
> > > >
> > > > 051223 194717 Optimizing index.
> > > > java.lang.NullPointerException
> > > >         at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
> > > >         at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
> > > >         at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
> > > >         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> > > >
> > > >         at
> > > >
> > > >
> > > >  Looking into the source code of BasicIndexingFilter,
> > > > it is trying to
> > > > doc.add(Field.UnStored("content", parse.getText()));
> > > >
> > > > I guess adding a check for null on the parse object,
> > > > if(parse!=null), should solve the problem.
> > > >
> > > > I can confirm once I have tested locally.
> > > >
> > > > Thanks
> > > > P
> > > >
> > > >
> > > >
> > > >
> > > > --- Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi,
> > > > > > I am facing this error as well. Now I have located one particular
> > > > > > document which is causing it (it is an msword document which can't
> > > > > > be properly parsed by the parser). I have sent it to Andrzej in a
> > > > > > separate email. Let's see if that helps...
> > > > > Lukas
> > > > >
> > > > > On 1/11/06, Dominik Friedrich
> > > > > <[EMAIL PROTECTED]> wrote:
> > > > > > > I got this exception a lot, too. I haven't tested the patch by Andrzej
> > > > > > > yet but instead I just put the doc.add() lines in the indexer reduce
> > > > > > > function in a try-catch block. This way the indexing finishes even with
> > > > > > > a null value and I can see which documents haven't been indexed in the
> > > > > > > log file.
> > > > > >
> > > > > > > Wouldn't it be a good idea to catch every exception that only affects
> > > > > > > one document in loops like this? At least I don't like it if an indexing
> > > > > > > process dies after a few hours because one document triggers such an
> > > > > > > exception.
> > > > > >
> > > > > > best regards,
> > > > > > Dominik
> > > > > >
> > > > > > Byron Miller wrote:
> > > > > > > 60111 103432 reduce > reduce
> > > > > > > 060111 103432 Optimizing index.
> > > > > > > 060111 103433 closing > reduce
> > > > > > > 060111 103434 closing > reduce
> > > > > > > 060111 103435 closing > reduce
> > > > > > > > java.lang.NullPointerException: value cannot be null
> > > > > > > >         at org.apache.lucene.document.Field.<init>(Field.java:469)
> > > > > > > >         at org.apache.lucene.document.Field.<init>(Field.java:412)
> > > > > > > >         at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
> > > > > > > >         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
> > > > > > > >         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> > > > > > > >         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> > > > > > > > Exception in thread "main" java.io.IOException: Job failed!
> > > > > > >         at
> > > > > > >
> 
=== message truncated ===
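
The null check discussed in the thread above can be sketched in plain Java.
This is only a minimal illustration, not the patch Andrzej committed.
SketchDocument is a hypothetical stand-in for a Lucene Document that rejects
null values the way the "value cannot be null" trace suggests, and
NullSafeFields, addIfPresent and textOrEmpty are made-up names, not Nutch or
Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Lucene Document (assumption, not the real class).
// It refuses null values, mirroring the exception message in the stack trace.
class SketchDocument {
    private final Map<String, String> fields = new LinkedHashMap<>();

    void add(String name, String value) {
        if (value == null) {
            throw new NullPointerException("value cannot be null");
        }
        fields.put(name, value);
    }

    boolean has(String name) {
        return fields.containsKey(name);
    }

    String get(String name) {
        return fields.get(name);
    }
}

// Hypothetical helpers showing two ways to guard a field value.
class NullSafeFields {
    // Adds the field only when a value is present.
    // Returns false when the field was skipped because the value was null.
    static boolean addIfPresent(SketchDocument doc, String name, String value) {
        if (value == null) {
            return false;
        }
        doc.add(name, value);
        return true;
    }

    // Replaces a missing text with an empty string, similar in spirit to the
    // "empty" parse that Lukas describes.
    static String textOrEmpty(String text) {
        return text == null ? "" : text;
    }
}
```

With these helpers, addIfPresent(doc, "title", null) leaves the document
untouched, while doc.add("content", textOrEmpty(null)) stores an empty string
instead of throwing. Which of the two is appropriate depends on whether a
missing field or an empty field is easier to handle at search time.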

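Dominik's suggestion of catching exceptions that affect only one document can
be sketched as follows. This is an illustration under assumptions, not the
Nutch Indexer code: PerDocumentGuard and indexAll are hypothetical names, and
the per-document work is passed in as a plain java.util.function.Consumer
rather than a real reduce() call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not part of Nutch): runs a per-document action over a
// list and keeps going when a single document fails.
class PerDocumentGuard {
    // Applies indexOne to every document. A RuntimeException raised for one
    // document is logged and that document is recorded as failed, so the
    // remaining documents are still processed.
    static <T> List<T> indexAll(List<T> docs, Consumer<T> indexOne) {
        List<T> failed = new ArrayList<>();
        for (T doc : docs) {
            try {
                indexOne.accept(doc);
            } catch (RuntimeException e) {
                // Log enough to find the offending document later.
                System.err.println("Skipping document " + doc + ": " + e);
                failed.add(doc);
            }
        }
        return failed;
    }
}
```

The returned list plays the role of the log file Dominik mentions: it shows
which documents were not indexed. The sketch catches only RuntimeException, so
errors such as OutOfMemoryError still stop the run, and a caller would probably
want to abort anyway if most documents end up in the failed list.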


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
