Hi,
Get the latest svn version. Andrzej committed some patches yesterday
and now this issue is gone (at least it works fine for me). I believe
revision #368167 is the one we were after.
Regards,
Lukas
On 1/13/06, Pashabhai <[EMAIL PROTECTED]> wrote:
> Hi ,
>
> You are right, the Parse object is not null even though
> the page has no content and no title.
>
> Could it be the FetcherOutput object?
>
>
> P
>
> --- Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> > I think this issue may be more complex. If I remember my test
> > correctly, the parse object was not null. Also, parse.getText() was
> > not null (it just contained an empty String).
> > If a document is not parsed correctly, an "empty" parse is returned
> > instead (parseStatus.getEmptyParse()), which should be OK, but I
> > haven't had a chance to check whether this can cause any trouble
> > during index optimization.
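> > To illustrate the case I saw (a hypothetical snippet, not actual
> > Nutch code; "parse" stands for whatever the indexing filter receives):
> >
> >   // the parse object itself was not null in my test...
> >   String text = parse.getText();
> >   // ...and neither was its text -- it was just an empty String,
> >   // i.e. the "empty" parse produced via parseStatus.getEmptyParse()
> >   boolean emptyParse = (text != null && text.length() == 0);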
> > Lukas
> >
> > On 1/12/06, Pashabhai <[EMAIL PROTECTED]> wrote:
> > > Hi ,
> > >
> > > A very similar exception occurs while indexing a
> > > page which has no body content (and sometimes no
> > > title).
> > >
> > > 051223 194717 Optimizing index.
> > > java.lang.NullPointerException
> > >         at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
> > >         at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
> > >         at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
> > >         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> > >         at
> > > Looking into the source code of BasicIndexingFilter,
> > > it is trying to do:
> > > doc.add(Field.UnStored("content", parse.getText()));
> > >
> > > I guess adding a null check on the parse object,
> > > if (parse != null), should solve the problem.
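> > > Something like this (an untested sketch of the guard I mean; the
> > > rest of filter() stays as it is):
> > >
> > >   if (parse != null && parse.getText() != null) {
> > >     doc.add(Field.UnStored("content", parse.getText()));
> > >   }
> > >   // else: skip the content field instead of hitting the NPE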
> > >
> > > I can confirm once I have tested it locally.
> > >
> > > Thanks
> > > P
> > >
> > >
> > >
> > >
> > > --- Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > > I am facing this error as well. I have now located one
> > > > particular document which is causing it (an MS Word
> > > > document which can't be properly parsed by the parser).
> > > > I have sent it to Andrzej in a separate email. Let's
> > > > see if that helps...
> > > > Lukas
> > > >
> > > > On 1/11/06, Dominik Friedrich <[EMAIL PROTECTED]> wrote:
> > > > > I got this exception a lot, too. I haven't tested the patch by
> > > > > Andrzej yet; instead I just put the doc.add() lines in the
> > > > > indexer reduce function in a try-catch block. This way the
> > > > > indexing finishes even with a null value, and I can see in the
> > > > > log file which documents haven't been indexed.
> > > > >
> > > > > Wouldn't it be a good idea to catch every exception that only
> > > > > affects one document in loops like this? At least I don't like
> > > > > it when an indexing process dies after a few hours because one
> > > > > document triggers such an exception.
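> > > > > Roughly what my change looks like (paraphrased from memory; the
> > > > > log call and message are approximate, not the exact code):
> > > > >
> > > > >   try {
> > > > >     doc.add(Field.UnStored("content", parse.getText()));
> > > > >     // ... the other doc.add() lines ...
> > > > >   } catch (RuntimeException e) {
> > > > >     LOG.warning("skipping document " + key + ": " + e);
> > > > >     return; // drop just this document, keep the job running
> > > > >   }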
> > > > >
> > > > > best regards,
> > > > > Dominik
> > > > >
> > > > > Byron Miller wrote:
> > > > > > 060111 103432 reduce > reduce
> > > > > > 060111 103432 Optimizing index.
> > > > > > 060111 103433 closing > reduce
> > > > > > 060111 103434 closing > reduce
> > > > > > 060111 103435 closing > reduce
> > > > > > java.lang.NullPointerException: value cannot be null
> > > > > >         at org.apache.lucene.document.Field.<init>(Field.java:469)
> > > > > >         at org.apache.lucene.document.Field.<init>(Field.java:412)
> > > > > >         at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
> > > > > >         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
> > > > > >         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> > > > > >         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> > > > > > Exception in thread "main" java.io.IOException: Job failed!
> > > > > >         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> > > > > >         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
> > > > > >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
> > > > > > [EMAIL PROTECTED]:/data/nutch/trunk$
> > > > > >
> > > > > >
> > > > > > Pulled today's build and got the above error. No problems
> > > > > > running out of disk space or anything like that. This is a
> > > > > > single instance, local file systems.
> > > > > >
> > > > > > Any way to recover the crawl / finish the reduce job from
> > > > > > where it failed?
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> >
>
>
>