I can't dump the DB right now since it's far too large for a single node, but
from the log output I can see that the records without a signature were ones
that Tika could not parse, such as RSS feeds, bad PDFs, or timed-out parses.
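
For reference, here is a minimal sketch of the kind of sanity check described
in the quoted thread below. It assumes a simplified stand-in record rather
than Nutch's actual CrawlDatum: the class, field, and status names
(SignatureCheck, Record, DB_FETCHED, and so on) are hypothetical, the status
values are placeholders, and the URLs in any usage are made up. It only
illustrates selecting records by status and skipping those with a null
signature instead of hitting an NPE.

```java
import java.util.ArrayList;
import java.util.List;

class SignatureCheck {
    // Placeholder status codes (assumption); the real constants are defined by Nutch.
    static final int DB_UNFETCHED = 1;
    static final int DB_FETCHED = 2;
    static final int DB_NOTMODIFIED = 3;

    /** Hypothetical, simplified stand-in for a CrawlDb record. */
    static final class Record {
        final String url;
        final int status;
        final byte[] signature; // may be null, e.g. when the parse failed or timed out

        Record(String url, int status, byte[] signature) {
            this.url = url;
            this.status = status;
            this.signature = signature;
        }
    }

    /** True for the statuses the job selects. */
    static boolean isSelected(Record r) {
        return r.status == DB_FETCHED || r.status == DB_NOTMODIFIED;
    }

    /** Returns the URLs of selected records that have no signature. */
    static List<String> missingSignatures(List<Record> records) {
        List<String> bad = new ArrayList<>();
        for (Record r : records) {
            // Null check first, so later signature handling cannot throw an NPE.
            if (isSelected(r) && r.signature == null) {
                bad.add(r.url);
            }
        }
        return bad;
    }
}
```

Collecting the offending URLs this way, rather than silently dropping them,
makes it easier to look for a pattern in where they come from.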


> > On 15/11/2011 20:33, Markus Jelsma wrote:
> > > It's back again! Last try if someone has a pointer for this.
> > > Cheers
> > > 
> > >> After some DB updates, they're gone! Does anyone recognize this
> > >> phenomenon?
> > >> 
> > >> On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:
> > >>> On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:
> > >>>> Hi guys,
> > >>>> 
> > >>>> I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records
> > >>>> and their signatures. I had to add a sanity check on signature to
> > >>>> avoid a NPE. I had the assumption any record with such DB_ status
> > >>>> has to have a signature, right?
> > >>>> 
> > >>>> Why does roughly 0.0001625% of my records exist without a signature?
> > >>> 
> > >>> Now with correct metrics:
> > >>> Why does roughly 0.000084% of my records exist without a signature?
> > 
> > This could be somehow related to pages that come from redirects so that
> > when they are fetched they are accounted for under different urls, which
> > in turn may confuse the update code in CrawlDbReducer... Do you notice
> > any pattern to these pages? What's their origin?
> 
> Ah, this seems like a useful pointer. I'll add debug lines to identify the
> bad records and check them with a CrawlDB dump.
> 
> Can't use the reader since it seems to stumble over records with metadata.
> 
> Will report back here and maybe with a new ticket.
> 
> Thanks
