On Tuesday 22 November 2011 12:13:51 Lewis John Mcgibbney wrote:
> Hi Markus,
> 
> Just so I am understanding here, are the problems you've highlighted
> acceptable considering what we know about Nutch behaviour?
> 
> We know about the problems with parse-feed and parser time-outs; can you
> explain what you mean by bad PDFs?

Records that throw a parse exception, i.e. documents that fail to parse, such
as some PDFs. These records do not seem to get a signature, although at first
glance the code paths tell me they should.

I do not know whether this is intended behaviour or acceptable.
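For concreteness, here is a minimal sketch of the kind of sanity check the
M/R job below needs (written against the old mapred API that Nutch 1.x uses;
the class name and counter names are illustrative, not actual Nutch code):

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.util.StringUtil;

  // Hypothetical mapper over the CrawlDB: emits url -> hex signature for
  // DB_FETCHED / DB_NOTMODIFIED records, and counts (rather than NPEs on)
  // records that carry such a status but no signature.
  public class SignatureMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, Text> {

    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      byte status = datum.getStatus();
      if (status != CrawlDatum.STATUS_DB_FETCHED
          && status != CrawlDatum.STATUS_DB_NOTMODIFIED) {
        return;
      }
      byte[] signature = datum.getSignature();
      if (signature == null) {
        // The case under discussion: fetched/notmodified without a
        // signature. Count it so the bad records are easy to quantify.
        reporter.incrCounter("signatures", "missing", 1);
        return;
      }
      output.collect(url, new Text(StringUtil.toHexString(signature)));
    }
  }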

> 
> Thank you
> 
> On Mon, Nov 21, 2011 at 8:01 PM, Markus Jelsma
> <[email protected]> wrote:
> > 
> > I can't dump the DB right now since it's far too large for a single node,
> > but from the log output I can see that these records without a signature
> > were not parsable with Tika, such as RSS feeds, bad PDFs, or timed-out
> > parses.
> > 
> > > > On 15/11/2011 20:33, Markus Jelsma wrote:
> > > > > It's back again! Last try, in case someone has a pointer for this.
> > > > > 
> > > > > Cheers
> > > > > 
> > > > >> After some DB updates, they're gone! Does anyone recognize this
> > > > >> phenomenon?
> > > > >> 
> > > > >> On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:
> > > > >>> On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:
> > > > >>>> Hi guys,
> > > > >>>> 
> > > > >>>> I have an M/R job selecting only DB_FETCHED and DB_NOTMODIFIED
> > > > >>>> records and their signatures. I had to add a sanity check on the
> > > > >>>> signature to avoid an NPE. I was under the assumption that any
> > > > >>>> record with such a DB_ status has to have a signature, right?
> > > > >>>> 
> > > > >>>> Why do roughly 0.0001625% of my records exist without a
> > > > >>>> signature?
> > > > >>> 
> > > > >>> Now with correct metrics:
> > > > >>> 
> > > > >>> Why do roughly 0.000084% of my records exist without a
> > > > >>> signature?
> > > > 
> > > > This could be somehow related to pages that come from redirects, so
> > > > that when they are fetched they are accounted for under different
> > > > URLs, which in turn may confuse the update code in CrawlDbReducer...
> > > > Do you notice any pattern to these pages? What's their origin?
> > > 
> > > Ah, this seems like a useful pointer. I'll add debug lines to identify
> > > the bad records and check them with a CrawlDB dump.
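
For what it's worth, the debug line could be as small as a guard in
CrawlDbReducer.reduce(), just before the final output.collect(key, result)
(variable names follow the 1.x code; the LOG field and exact placement are
assumptions on my part):

  // Flag records about to be written with a fetched/notmodified status
  // but no signature.
  byte status = result.getStatus();
  if ((status == CrawlDatum.STATUS_DB_FETCHED
      || status == CrawlDatum.STATUS_DB_NOTMODIFIED)
      && result.getSignature() == null) {
    LOG.info("Record without signature: " + key);
  }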
> > > 
> > > Can't use the reader since it seems to stumble over records with
> > > metadata. Will report back here and maybe with a new ticket.
> > > 
> > > Thanks
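
For anyone following along: spot-checking a single suspect record, rather
than dumping the whole DB, can be done with the CrawlDb reader, e.g.
(the paths here are placeholders):

  bin/nutch readdb crawl/crawldb -url http://example.com/some/page
  bin/nutch readdb crawl/crawldb -stats

-url prints the stored CrawlDatum (status, signature, metadata) for one URL,
and -stats gives the aggregate status counts.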

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
