I can't dump the DB right now since it's far too large for a single node, but from the log output I can see that the records without a signature are ones Tika could not parse, such as RSS feeds, bad PDFs, or parses that timed out.
> > On 15/11/2011 20:33, Markus Jelsma wrote:
> > > It's back again! Last try if someone has a pointer for this.
> > > Cheers
> > >
> > >> After some DB updates, they're gone! Anyone recognize this
> > >> phenomenon?
> > >>
> > >> On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:
> > >>> On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:
> > >>>> Hi guys,
> > >>>>
> > >>>> I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records
> > >>>> and their signatures. I had to add a sanity check on signature to
> > >>>> avoid a NPE. I had the assumption any record with such DB_ status
> > >>>> has to have a signature, right?
> > >>>>
> > >>>> Why does roughly 0.0001625% of my records exist without a signature?
> > >>>
> > >>> Now with correct metrics:
> > >>> Why does roughly 0.000084% of my records exist without a signature?
> >
> > This could be somehow related to pages that come from redirects so that
> > when they are fetched they are accounted for under different urls, which
> > in turn may confuse the update code in CrawlDbReducer... Do you notice
> > any pattern to these pages? What's their origin?
>
> Ah, this seems like a useful pointer. I'll add debug lines to identify the
> bad records and check them with a CrawlDB dump.
>
> Can't use the reader since it seems to stumble over records with metadata.
>
> Will report back here and maybe with a new ticket.
>
> Thanks
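
For anyone following along, here is a rough sketch of the kind of sanity check described in the quoted thread: keep only DB_FETCHED and DB_NOTMODIFIED records, skip and count the ones with no signature, and log their URLs so a pattern can be spotted. This is plain Python rather than the actual Nutch M/R job; the `Record` type, the status strings, and `select_signatures` are all hypothetical names made up for illustration, not the Nutch API.

```python
import logging
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple

log = logging.getLogger("signature_check")

# Hypothetical status labels standing in for the DB_ statuses in the thread.
DB_FETCHED = "db_fetched"
DB_NOTMODIFIED = "db_notmodified"
WANTED = {DB_FETCHED, DB_NOTMODIFIED}


class Record(NamedTuple):
    """Hypothetical stand-in for one CrawlDB entry."""
    url: str
    status: str
    signature: Optional[bytes]


def select_signatures(
    records: Iterable[Record], counters: Counter
) -> Iterator[Tuple[str, str]]:
    """Yield (url, hex signature) for wanted records that have a signature."""
    for rec in records:
        if rec.status not in WANTED:
            continue  # not one of the two statuses of interest
        if rec.signature is None:
            # Sanity check: skip instead of failing on a missing signature,
            # but count and log the URL so the bad records can be inspected.
            counters["missing_signature"] += 1
            log.debug("no signature for %s (status %s)", rec.url, rec.status)
            continue
        counters["emitted"] += 1
        yield rec.url, rec.signature.hex()
```

Comparing `counters["missing_signature"]` with `counters["emitted"]` gives the fraction of affected records, and the debug lines give the URLs to check against redirects or failed parses.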

