On Thu, 04 Dec 2008, Dennis Kubes wrote:

> Forget my last email. I went back and read your original email. What
> type of webpages are you trying to fetch? This doesn't seem like a
> configuration issue to me.
Most of this particular URL set are redirects. The pages are dynamic pages, all
from a single website, served by hundreds of web servers and requested through
a proxy server local to the Nutch machine. I believe the urlfilter is set
correctly, and the logs now state that the pages are being fetched. I have
debug logging enabled for the fetcher and the generator, and I am not seeing
messages about URLs being skipped. Perhaps I need to enable debugging elsewhere
to find out why these URLs are not being fetched and indexed.

I did increase topN to 1000000. We got quite a bit more, but still not all of
the pages. The current status breakdown, after an additional cycle, is:

status 1 (db_unfetched):   19265
status 2 (db_fetched):     159912
status 3 (db_gone):        2
status 4 (db_redir_temp):  299024
status 5 (db_redir_perm):  230154
status 6 (db_notmodified): 159418

Searchable results are now about 287k. The total number of URLs in the crawldb
is listed as 867775.

Thanks for your assistance.

JohnM

> Dennis Kubes wrote:
> > Hi John,
> >
> > If the http.redirect.max config variable in nutch-*.xml is set to 0,
> > then any redirect is queued to be fetched during the next fetching
> > round, similar to the new URLs we parse off of a webpage. Try setting
> > it to 3 and your redirects should go down.
> >
> > Dennis
> >
> > John Mendenhall wrote:
> >>> We are using nutch version nutch-2008-07-22_04-01-29.
> >>> We have a crawldb with over 500k URLs.
> >>>
> >>> The status breakdown was as follows:
> >>>
> >>> status 1 (db_unfetched):   19261
> >>> status 2 (db_fetched):     71628
> >>> status 4 (db_redir_temp):  274899
> >>> status 5 (db_redir_perm):  148220
> >>> status 6 (db_notmodified): 822
> >>>
> >>> We had set the http.redirect.max property to 7 in
> >>> the nutch-site.xml file. We are currently using
> >>> bin/nutch fetch.
> >>>
> >>> We have set the logging level to debug for the Fetcher. We see the
> >>> fetching log entries. We see the protocol "redirect to" log entries.
> >>> However, after a complete cycle of fetch/update/merge/index, etc.,
> >>> all the numbers above stayed the same.
> >>>
> >>> We originally fetched with the obsolete setting
> >>> db.default.fetch.interval set to 365 days.
> >>> We now have db.fetch.interval.default set to 35 days.
> >>> We run the generate cycle with -add-days 370 to
> >>> refetch all those URLs. This does not appear to be working.
> >>>
> >>> We found db.update.additions.allowed set to false
> >>> from a previous run type. We tried setting it to
> >>> true. Now, after several more cycles, our status
> >>> breakdown is:
> >>>
> >>> status 1 (db_unfetched):   19270
> >>> status 2 (db_fetched):     100230
> >>> status 3 (db_gone):        1
> >>> status 4 (db_redir_temp):  278559
> >>> status 5 (db_redir_perm):  168816
> >>> status 6 (db_notmodified): 64366
> >>>
> >>> The number of fetched and searchable pages has gone
> >>> from about 72000 to about 140k (per the search interface).
> >>>
> >>> Our question is: what can we do to get these redirected
> >>> pages indexed? Do we need to increase add-days?
> >>> Do we need to increase our topN (currently 100k)?
> >>> Or do we just need to start over?
> >>
> >> After a couple more days, our status breakdown is as follows:
> >>
> >> status 1 (db_unfetched):   19261
> >> status 2 (db_fetched):     100198
> >> status 3 (db_gone):        2
> >> status 4 (db_redir_temp):  279591
> >> status 5 (db_redir_perm):  171076
> >> status 6 (db_notmodified): 80227
> >>
> >> Fetched has gone down, with not-modified going up.
> >> Search results through the search interface have gone up
> >> to about 148500 results.
> >>
> >> There must be some configuration variable we have not
> >> set properly. We use a topN of 100000. We run it through
> >> the cycle about 8 times per day, with only small incremental
> >> progress each round. Should topN be higher?
> >>
> >> Or do we need to rebuild the entire crawl database?
> >>
> >> Please let me know if there is any information I need to provide.
> >>
> >> Thanks in advance for any assistance provided.
> >>
> >> JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
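For anyone following along, the three properties discussed in this thread all
go in conf/nutch-site.xml. A minimal sketch, with the values mentioned in the
thread rather than recommendations:

```xml
<?xml version="1.0"?>
<configuration>

  <!-- Follow up to 3 redirects immediately during a fetch instead of
       queueing the redirect target for the next round (0 = queue it). -->
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>

  <!-- Default re-fetch interval. Check the unit in your build's
       nutch-default.xml: the older db.default.fetch.interval took days,
       while some builds express db.fetch.interval.default in seconds. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>35</value>
  </property>

  <!-- Allow URLs discovered during parsing (and redirect targets) to be
       added to the crawldb when updatedb runs. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>true</value>
  </property>

</configuration>
```

Settings in nutch-site.xml override nutch-default.xml, so only the properties
being changed need to appear here.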
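The generate/fetch/update cycle being discussed can be sketched as below. The
directory layout (crawl/crawldb, crawl/segments, crawl/linkdb, crawl/indexes)
is illustrative, and the exact option spellings vary between Nutch versions
(e.g. -adddays vs. -add-days), so check bin/nutch's usage output for your
build:

```shell
# One generate/fetch/update round, roughly as described in the thread.

# Select up to 100k top-scoring URLs; the add-days option shifts the clock
# forward so URLs whose fetch interval has not yet elapsed are still selected.
bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -adddays 370

# Fetch the newest segment (segment names are timestamps).
SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)
bin/nutch fetch $SEGMENT

# Fold the fetch results (including discovered redirect targets) back into
# the crawldb; db.update.additions.allowed controls whether new URLs are
# added during this step.
bin/nutch updatedb crawl/crawldb $SEGMENT

# Rebuild the link database and index the segment.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $SEGMENT
```

With http.redirect.max at 0, each hop of a redirect chain costs one full round
of this cycle before the target is even queued, which matches the slow
incremental progress described above.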
