Hi Markus,

Just so I understand: are the problems you've highlighted
acceptable considering what we know about Nutch behaviour?

We know about the problems with parse-feed and parser time-outs. Can you
explain what you mean by bad PDFs?

Thank you

On Mon, Nov 21, 2011 at 8:01 PM, Markus Jelsma
<[email protected]> wrote:

> I can't dump the DB right now since it's far too large for a single node,
> but from the log output I can see that these records without a signature
> were not parsable with Tika, such as RSS feeds, bad PDFs, or timed-out parses.
>
>  > > On 15/11/2011 20:33, Markus Jelsma wrote:
>
> > > > It's back again! Last try if someone has a pointer for this.
>
> > > > Cheers
>
> > > >
>
> > > >> After some DB updates, they're gone! Anyone recognizes this
>
> > > >> phenomenon?
>
> > > >>
>
> > > >> On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:
>
> > > >>> On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:
>
> > > >>>> Hi guys,
>
> > > >>>>
>
> > > > >>>> I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records
>
> > > >>>> and their signatures. I had to add a sanity check on signature to
>
> > > >>>> avoid a NPE. I had the assumption any record with such DB_ status
>
> > > >>>> has to have a signature, right?
>
> > > >>>>
>
> > > > >>>> Why does roughly 0.0001625% of my records exist without a signature?
>
> > > >>>
>
> > > >>> Now with correct metrics:
>
> > > >>> Why does roughly 0.000084% of my records exist without a signature?
>
> > >
>
> > > This could be somehow related to pages that come from redirects so that
>
> > > > when they are fetched they are accounted for under different URLs, which
>
> > > in turn may confuse the update code in CrawlDbReducer... Do you notice
>
> > > any pattern to these pages? What's their origin?
>
> >
>
> > Ah, this seems like a useful pointer. I'll add debug lines to identify the
>
> > bad records and check them with a CrawlDB dump.
>
> >
>
> > Can't use the reader since it seems to stumble over records with metadata.
>
> >
>
> > Will report back here and maybe with a new ticket.
>
> >
>
> > Thanks
>
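
For reference, the sanity check described in the quoted thread (select only
DB_FETCHED and DB_NOTMODIFIED records, skip those without a signature, and
count how many were skipped) might look roughly like the sketch below. It is
not Markus's actual M/R job: the tuple layout, the function name
select_signatures, and the status strings are assumptions made for
illustration, and the real job would work on Nutch's Java CrawlDatum objects.

```python
from collections import Counter

# Stand-in status names for this sketch only; the real constants are
# defined in the Java class org.apache.nutch.crawl.CrawlDatum.
DB_FETCHED = "db_fetched"
DB_NOTMODIFIED = "db_notmodified"
WANTED = {DB_FETCHED, DB_NOTMODIFIED}


def select_signatures(records):
    """Return ((url, signature) pairs, counters) for the wanted statuses.

    `records` is an iterable of (url, status, signature) tuples, where
    signature is bytes or None. This layout is hypothetical and is not
    the CrawlDb on-disk format.
    """
    counters = Counter()
    selected = []
    for url, status, signature in records:
        if status not in WANTED:
            continue  # only DB_FETCHED / DB_NOTMODIFIED are of interest
        counters["selected_status"] += 1
        if not signature:
            # The sanity check: count and skip the record instead of
            # failing on a missing signature.
            counters["missing_signature"] += 1
            continue
        selected.append((url, signature))
    return selected, counters
```

Dividing counters["missing_signature"] by counters["selected_status"] gives
the kind of ratio quoted in the thread, and logging the skipped URLs would
show whether they are feeds, unparsable PDFs, or timed-out parses.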



-- 
*Lewis*
