Hi Markus,

Just so I understand: are the problems you've highlighted acceptable, considering what we know about Nutch behaviour?
We know about the problems with parse-feed and parser time-outs, but can you explain what you mean by bad PDFs?

Thank you

On Mon, Nov 21, 2011 at 8:01 PM, Markus Jelsma <[email protected]> wrote:

> I can't dump the DB right now since it's far too large for a single node
> but from log output I can see that these records without signature were not
> parsable with Tika such as RSS feeds, bad PDFs or timed out parses.
>
> > > > On 15/11/2011 20:33, Markus Jelsma wrote:
> > > > > It's back again! Last try if someone has a pointer for this.
> > > > > Cheers
> > > > >
> > > > >> After some DB updates, they're gone! Anyone recognizes this
> > > > >> phenomenon?
> > > > >>
> > > > >> On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:
> > > > >>> On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:
> > > > >>>> Hi guys,
> > > > >>>>
> > > > >>>> I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records
> > > > >>>> and their signatures. I had to add a sanity check on signature to
> > > > >>>> avoid a NPE. I had the assumption any record with such DB_ status
> > > > >>>> has to have a signature, right?
> > > > >>>>
> > > > >>>> Why does roughly 0.0001625% of my records exist without a signature?
> > > > >>>
> > > > >>> Now with correct metrics:
> > > > >>> Why does roughly 0.000084% of my records exist without a signature?
> > > >
> > > > This could be somehow related to pages that come from redirects so that
> > > > when they are fetched they are accounted for under different urls, which
> > > > in turn may confuse the update code in CrawlDbReducer... Do you notice
> > > > any pattern to these pages? What's their origin?
> > >
> > > Ah, this seems like a useful pointer. I'll add debug lines to identify the
> > > bad records and check them with a CrawlDB dump.
> > >
> > > Can't use the reader since it seems to stumble over records with meta data.
> > >
> > > Will report back here and maybe with a new ticket.
> > >
> > > Thanks

--
*Lewis*

