> On 15/11/2011 20:33, Markus Jelsma wrote:
> > It's back again! Last try if someone has a pointer for this.
> > Cheers
> > 
> >> After some DB updates, they're gone! Anyone recognizes this phenomenon?
> >> 
> >> On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:
> >>> On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:
> >>>> Hi guys,
> >>>> 
> >>>> I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records
> >>>> and their signatures. I had to add a sanity check on signature to
> >>>> avoid a NPE. I had the assumption any record with such DB_ status has
> >>>> to have a signature, right?
> >>>> 
> >>>> Why does roughly 0.0001625% of my records exist without a signature?
> >>> 
> >>> Now with correct metrics:
> >>> Why does roughly 0.000084% of my records exist without a signature?
> 
> This could be somehow related to pages that come from redirects so that
> when they are fetched they are accounted for under different urls, which
> in turn may confuse the update code in CrawlDbReducer... Do you notice
> any pattern to these pages? What's their origin?

Ah, this seems like a useful pointer. I'll add debug lines to identify the
bad records and check them with a CrawlDB dump.
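
A minimal sketch of such a check, written in Python for brevity rather than
as the actual Nutch M/R job. The `Record` layout and the `missing_signature`
helper are hypothetical names used only for illustration; they are not part
of the Nutch API, and the status strings simply mirror the names used in
this thread.

```python
from typing import Iterable, List, NamedTuple, Optional


class Record(NamedTuple):
    """Hypothetical stand-in for a CrawlDB entry (not the Nutch CrawlDatum class)."""
    url: str
    status: str
    signature: Optional[bytes]


# Statuses assumed, per this thread, to always carry a signature.
EXPECT_SIGNATURE = {"DB_FETCHED", "DB_NOTMODIFIED"}


def missing_signature(records: Iterable[Record]) -> List[Record]:
    """Return records whose status implies a signature but which have none."""
    return [
        r for r in records
        if r.status in EXPECT_SIGNATURE and not r.signature
    ]


if __name__ == "__main__":
    sample = [
        Record("http://a.example/", "DB_FETCHED", b"\x01\x02"),
        Record("http://b.example/", "DB_FETCHED", None),
        Record("http://c.example/", "DB_UNFETCHED", None),
    ]
    # Debug line for each suspicious record, to cross-check against a dump.
    for r in missing_signature(sample):
        print(f"no signature: {r.url} ({r.status})")
```

Logging the URL alongside the status makes it easier to look for a pattern
in the affected pages, for example whether they all originate from redirects.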

I can't use the reader, since it seems to stumble over records with metadata.

Will report back here and maybe with a new ticket.

Thanks

