> On 15/11/2011 20:33, Markus Jelsma wrote:
> > It's back again! Last try if someone has a pointer for this.
> > Cheers
> >
> >> After some DB updates, they're gone! Anyone recognizes this phenomenon?
> >>
> >> On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:
> >>> On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:
> >>>> Hi guys,
> >>>>
> >>>> I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records
> >>>> and their signatures. I had to add a sanity check on signature to
> >>>> avoid a NPE. I had the assumption any record with such DB_ status has
> >>>> to have a signature, right?
> >>>>
> >>>> Why does roughly 0.0001625% of my records exit without a signature?
> >>>
> >>> Now with correct metrics:
> >>> Why does roughly 0.000084% of my records exist without a signature?
>
> This could be somehow related to pages that come from redirects so that
> when they are fetched they are accounted for under different urls, which
> in turn may confuse the update code in CrawlDbReducer... Do you notice
> any pattern to these pages? What's their origin?

Ah, this seems like a useful pointer. I'll add debug lines to identify the bad records and check them against a CrawlDB dump. I can't use the reader since it seems to stumble over records with metadata. I'll report back here, and maybe open a new ticket. Thanks

