On Tuesday 22 November 2011 12:13:51 Lewis John Mcgibbney wrote:
> Hi Markus,
>
> Just so I am understanding here, are the problems you've highlighted
> acceptable considering what we know about Nutch behaviour?
>
> We know the problem with parse-feed and parser time-outs, but can you
> explain what you mean by bad PDFs?
Records throwing a parse exception, i.e. failing to parse a document such as
some PDFs. These records do not seem to get a signature, although at first
glance the code paths tell me they should. I do not know whether this is
intended or acceptable behaviour.

> Thank you
>
> On Mon, Nov 21, 2011 at 8:01 PM, Markus Jelsma
> <[email protected]> wrote:
> > I can't dump the DB right now since it's far too large for a single
> > node, but from log output I can see that these records without a
> > signature were not parsable with Tika, such as RSS feeds, bad PDFs or
> > timed-out parses.
> >
> > > On 15/11/2011 20:33, Markus Jelsma wrote:
> > > > It's back again! Last try if someone has a pointer for this.
> > > >
> > > > Cheers
> > > >
> > > > > After some DB updates, they're gone! Does anyone recognize this
> > > > > phenomenon?
> > > > >
> > > > > On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:
> > > > > > On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED
> > > > > > > records and their signatures. I had to add a sanity check on
> > > > > > > the signature to avoid an NPE. I had the assumption that any
> > > > > > > record with such a DB_* status has to have a signature, right?
> > > > > > >
> > > > > > > Why does roughly 0.0001625% of my records exist without a
> > > > > > > signature?
> > > > > >
> > > > > > Now with correct metrics:
> > > > > >
> > > > > > Why does roughly 0.000084% of my records exist without a
> > > > > > signature?
> > >
> > > This could be somehow related to pages that come from redirects, so
> > > that when they are fetched they are accounted for under different
> > > URLs, which in turn may confuse the update code in CrawlDbReducer...
> > > Do you notice any pattern to these pages? What's their origin?
> >
> > Ah, this seems like a useful pointer. I'll add debug lines to identify
> > the bad records and check them with a CrawlDB dump.
> >
> > Can't use the reader since it seems to stumble over records with
> > metadata. Will report back here and maybe with a new ticket.
> >
> > Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
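
For reference, a minimal sketch of the kind of job being described above: a
mapper over the CrawlDB that keeps only DB_FETCHED and DB_NOTMODIFIED records
and guards against the missing-signature NPE. This is an illustration, not the
actual job from the thread; the class and counter names are made up, while
CrawlDatum, its status constants, and the old mapred API are real Nutch 1.x /
Hadoop interfaces.

    // Hypothetical illustration only -- not the job from this thread.
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.nutch.crawl.CrawlDatum;

    public class SignatureSanityMapper extends MapReduceBase
        implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {

      public void map(Text url, CrawlDatum datum,
          OutputCollector<Text, CrawlDatum> output, Reporter reporter)
          throws IOException {
        byte status = datum.getStatus();
        // Select only DB_FETCHED and DB_NOTMODIFIED records.
        if (status != CrawlDatum.STATUS_DB_FETCHED
            && status != CrawlDatum.STATUS_DB_NOTMODIFIED) {
          return;
        }
        // Sanity check: a tiny fraction of records carry no signature
        // (e.g. failed or timed-out parses), so guard against the NPE
        // and count the offenders instead of crashing.
        if (datum.getSignature() == null) {
          reporter.incrCounter("SignatureSanity", "missing-signature", 1);
          return;
        }
        output.collect(url, datum);
      }
    }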
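
And since the stock reader reportedly stumbles over records with metadata, a
possible workaround is to scan a CrawlDB part file directly: the data files
under crawldb/current/part-*/ are plain SequenceFiles of Text keys and
CrawlDatum values. The class name and the example path are assumptions for
illustration.

    // Hypothetical workaround sketch: scan one CrawlDB part file and
    // print the URLs of DB_FETCHED/DB_NOTMODIFIED records lacking a
    // signature. Pass e.g. crawldb/current/part-00000/data as args[0].
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;

    public class FindMissingSignatures {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        while (reader.next(url, datum)) {
          byte status = datum.getStatus();
          if ((status == CrawlDatum.STATUS_DB_FETCHED
              || status == CrawlDatum.STATUS_DB_NOTMODIFIED)
              && datum.getSignature() == null) {
            // These are the suspect records to cross-check against the logs.
            System.out.println(url);
          }
        }
        reader.close();
      }
    }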

