Dennis,

> I will need to look deeper, but I think there is a subtle logic bug in 
> Fetcher.
> 
> Redirect statuses, both temp and perm, get output by the fetchers, even 
> when redirecting immediately, so if a URL goes through multiple 
> redirects you would have multiple outputs in the crawl_fetch output 
> from the segments.  The outputs are keyed by URL, but the URLs change 
> with each redirect.
> 
> When updating the crawldb, the latest (in terms of time) update is 
> used.  But that is per URL, and I don't think it updates the redirected 
> URLs.  That means the updated crawldb would contain both the 
> successfully fetched URLs and the intermediate redirected URLs.  At 
> least that is what I think is happening.
> 
> The final number indexed should be the successfully fetched URLs, which 
> would be db_fetched.
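A toy sketch of the merge behavior described above (illustrative Python only, not Nutch's actual CrawlDb reducer; the URLs, statuses, and timestamps are made up for the example):

```python
# Toy model of the updatedb merge Dennis describes: the fetcher emits one
# crawl_fetch entry per URL it touches, and the update keeps the newest
# entry per URL.  A redirect chain therefore leaves its intermediate URLs
# behind in the crawldb as db_redir_* entries.

# crawl_fetch output for one fetch of a two-hop redirect chain a -> b -> c
segment = [
    ("http://example.com/a", "db_redir_perm", 1),  # (url, status, timestamp)
    ("http://example.com/b", "db_redir_temp", 2),
    ("http://example.com/c", "db_fetched",    3),
]

def updatedb(crawldb, segment):
    """Merge segment entries into the crawldb; newest entry per URL wins."""
    for url, status, ts in segment:
        if url not in crawldb or crawldb[url][1] < ts:
            crawldb[url] = (status, ts)
    return crawldb

crawldb = updatedb({}, segment)

# Only one page was actually fetched, but the db now holds three URLs:
# the two redirect hops were never folded into the final URL's record.
fetched = [u for u, (s, _) in crawldb.items() if s == "db_fetched"]
redirs  = [u for u, (s, _) in crawldb.items() if s.startswith("db_redir")]
print(len(crawldb), len(fetched), len(redirs))  # 3 1 2
```

If this matches what updatedb actually does, the db_redir_* counts would grow with every redirect hop ever fetched, which lines up with the status breakdowns in this thread.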
> 
> Dennis

Anything I can do to help debug this?

JohnM

> John Mendenhall wrote:
> >On Thu, 04 Dec 2008, Dennis Kubes wrote:
> >
> >>Forget my last email.  I went back and read your original email.  What 
> >>type of webpages are you trying to fetch?  This doesn't seem like a 
> >>configuration issue to me.
> >
> >Most of this particular URL set are redirects.
> >The pages are dynamic pages, all from a single
> >website, served by hundreds of web servers, and
> >requested through a proxy server local to the
> >nutch server.
> >
> >I believe the urlfilter is set correctly.
> >The logs state the pages are getting fetched
> >now.  I have debug logging set for the fetcher
> >and generator, and I am not seeing messages
> >about URLs being skipped.  Perhaps I need to
> >debug elsewhere to find out why we are not
> >getting the URLs fetched and indexed.
> >
> >I did increase topN to 1000000.  We got quite
> >a bit more, but still not all of the pages.
> >The current status breakdown after an additional
> >cycle is:
> >
> >status 1 (db_unfetched):     19265
> >status 2 (db_fetched):      159912
> >status 3 (db_gone):              2
> >status 4 (db_redir_temp):   299024
> >status 5 (db_redir_perm):   230154
> >status 6 (db_notmodified):  159418
> >
> >Searchable results are now about 287k.
> >Total URLs in the crawldb is listed as 867775.
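Incidentally, the six status counts above sum exactly to the reported crawldb total, and the two redirect statuses account for roughly 61% of it:

```python
# Sanity check on the reported status breakdown.
counts = {
    "db_unfetched":   19265,
    "db_fetched":     159912,
    "db_gone":        2,
    "db_redir_temp":  299024,
    "db_redir_perm":  230154,
    "db_notmodified": 159418,
}

total = sum(counts.values())
redirects = counts["db_redir_temp"] + counts["db_redir_perm"]

print(total)                               # 867775, matching the reported total
print(round(100 * redirects / total, 1))   # 61.0 -- redirects dominate the db
```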
> >
> >Thanks for your assistance.
> >
> >JohnM
> >
> >
> >>Dennis Kubes wrote:
> >>>Hi John,
> >>>
> >>>If the http.redirect.max config variable in nutch-*.xml is set to 0, then 
> >>>any redirect is queued to be fetched during the next fetching round, 
> >>>similar to new URLs we parse off of a webpage.  Try setting it to 3 and 
> >>>your redirect counts should go down.
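For reference, that setting would look something like the following in conf/nutch-site.xml (the description text here is my paraphrase of this thread; check your nutch-default.xml for the authoritative wording):

```xml
<!-- conf/nutch-site.xml: follow up to 3 redirects immediately in the
     fetcher instead of queuing each hop for the next fetch round. -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>The maximum number of redirects the fetcher will follow
  when fetching a page.  If set to 0, the fetcher does not immediately
  follow redirected URLs; they are queued for a later fetch round.
  </description>
</property>
```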
> >>>
> >>>Dennis
> >>>
> >>>John Mendenhall wrote:
> >>>>>We are using nutch version nutch-2008-07-22_04-01-29.
> >>>>>We have a crawldb with over 500k urls.
> >>>>>
> >>>>>The status breakdown was as follows:
> >>>>>
> >>>>>status 1 (db_unfetched):     19261
> >>>>>status 2 (db_fetched):       71628
> >>>>>status 4 (db_redir_temp):   274899
> >>>>>status 5 (db_redir_perm):   148220
> >>>>>status 6 (db_notmodified):     822
> >>>>>
> >>>>>We had set the http.redirect.max property to 7, in
> >>>>>the nutch-site.xml file.  We are currently using
> >>>>>bin/nutch fetch.
> >>>>>
> >>>>>We have set the logging level to debug for the Fetcher.  We see the 
> >>>>>"fetching" log entries.  We see the protocol
> >>>>>"redirect to" log entries.  However, after a complete
> >>>>>cycle of fetch/update/merge/index, etc., all the numbers
> >>>>>above stayed the same.
> >>>>>
> >>>>>We originally fetched with the obsolete setting
> >>>>>of db.default.fetch.interval set to 365 days.
> >>>>>We now have db.fetch.interval.default set to 35 days.
> >>>>>We run the generate cycle with -add-days 370 to
> >>>>>refetch all those urls.  This does not appear to be
> >>>>>working.
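One thing worth double-checking here: the obsolete db.default.fetch.interval was specified in days, but its replacement db.fetch.interval.default takes its value in seconds (verify against your nutch-default.xml), so a literal value of 35 would mean 35 seconds, not 35 days. A 35-day interval expressed in seconds would be:

```xml
<!-- conf/nutch-site.xml: refetch interval of 35 days.  Note the unit:
     db.fetch.interval.default is in seconds (35 * 24 * 60 * 60), unlike
     the obsolete db.default.fetch.interval, which was in days. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>3024000</value>
</property>
```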
> >>>>>
> >>>>>We found db.update.additions.allowed set to false
> >>>>>from a previous run type.  We tried setting it to
> >>>>>true.  Now, after several more cycles, our status
> >>>>>breakdown is:
> >>>>>
> >>>>>status 1 (db_unfetched):     19270
> >>>>>status 2 (db_fetched):      100230
> >>>>>status 3 (db_gone):              1
> >>>>>status 4 (db_redir_temp):   278559
> >>>>>status 5 (db_redir_perm):   168816
> >>>>>status 6 (db_notmodified):   64366
> >>>>>
> >>>>>The number of fetched and searchable pages has gone
> >>>>>from about 72000 to about 140k (per the search
> >>>>>interface).
> >>>>>
> >>>>>Our question is: what can we do to get these redirected
> >>>>>pages indexed?  Do we need to increase add-days?
> >>>>>Do we need to increase our topN (currently
> >>>>>100k)?  Or do we just need to start over?
> >>>>After a couple more days, our status breakdown is as
> >>>>follows:
> >>>>
> >>>>status 1 (db_unfetched):     19261
> >>>>status 2 (db_fetched):      100198
> >>>>status 3 (db_gone):              2
> >>>>status 4 (db_redir_temp):   279591
> >>>>status 5 (db_redir_perm):   171076
> >>>>status 6 (db_notmodified):   80227
> >>>>
> >>>>Fetched has gone down, while not modified has gone up.
> >>>>Search results through the search interface have gone up
> >>>>to about 148500 results.
> >>>>
> >>>>There must be some configuration variable we have not
> >>>>set properly.  We use a topN of 100000.  We run through
> >>>>the cycle about 8 times per day, with only small
> >>>>incremental progress each round.  Should topN be higher?
> >>>>
> >>>>Or, do we need to rebuild the entire crawl database?
> >>>>
> >>>>Please let me know if there is any information I need to
> >>>>provide.
> >>>>
> >>>>Thanks in advance for any assistance provided.
> >>>>
> >>>>JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
