I was going to inject a single simple page that redirects to
another page, with zero links on the pages themselves, then
fetch/update this against a clean crawldb and see what the dump
output of the crawldb looks like after the update.
If it works the way I think it does, then there should be multiple
urls in the crawldb. If you could try that and let me know the
output, that would be great. I am trying to finish up getting
patches committed for the 1.0 release, so I am a little busy.
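For reference, a sketch of that test cycle (paths, the seed url,
and the segment-selection line are placeholders/assumptions; the
redirect page itself would be served by something you control):

```shell
# Inject one page that 301/302-redirects to a second, link-free
# page, run one fetch/update round against a clean crawldb, then
# dump the crawldb to count its urls. Paths are placeholders.
echo "http://example.com/redirect-page" > urls/seed.txt

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment

# Dump to a text directory and inspect how many urls it contains.
bin/nutch readdb crawl/crawldb -dump crawldb-dump
```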
Dennis
John Mendenhall wrote:
Dennis,
I will need to look deeper but I think there is a subtle logic bug in
Fetcher.
Redirect statuses, both temp and perm, get output by the fetchers,
even when redirecting immediately, so if you have multiple redirects
you would have multiple entries in the crawl_fetch output from the
segments. The outputs are keyed by url, but the urls change with
each redirect.
When updating the crawldb, the latest (in terms of time) update is
used. But that is per url, and I don't think it collapses the
redirected urls. Meaning the updated crawldb would contain both the
successfully fetched urls and the intermediate redirect urls. At
least that is what I think is happening.
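A toy illustration of that point (not Nutch code; just the per-url
"keep the newest entry" merge applied to a hypothetical three-hop
chain a -> b -> c, with made-up timestamps and statuses):

```shell
# Fetcher output: url, timestamp, status. A redirect chain produces
# one entry per url, and since the urls are all distinct keys, the
# "keep the newest per url" reduction removes none of them.
printf '%s\n' \
  'http://a/ 100 redir_perm' \
  'http://b/ 101 redir_temp' \
  'http://c/ 102 fetched' \
  'http://c/ 50 unfetched' |
sort -k1,1 -k2,2nr | awk '!seen[$1]++' | sort -k1,1
```

All three urls survive, with c's newest (fetched) entry winning over
its older one, which is why the db_redir_* counts keep accumulating
alongside db_fetched.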
The final number indexed should be the successfully fetched urls, which
would be db_fetched.
Dennis
Anything I can do to help debug this?
JohnM
John Mendenhall wrote:
On Thu, 04 Dec 2008, Dennis Kubes wrote:
Forget my last email. I went back and read your original email. What
type of webpages are you trying to fetch? This doesn't seem like a
configuration issue to me.
Most of this particular url set are redirects.
The pages are dynamic pages, all from a single
website, served by hundreds of web servers,
requested through a local proxy server (local to
nutch server).
I believe the urlfilter is set correctly.
The logs state the pages are getting fetched now. I have debug
set for the fetcher and generator, and I am not seeing messages
about urls being skipped. Perhaps I need to debug elsewhere to
find out why we are not getting the urls fetched and indexed.
I did increase topN to 1000000. We got quite
a bit more, but still not all of the pages.
The current status breakdown after an additional
cycle is:
status 1 (db_unfetched): 19265
status 2 (db_fetched): 159912
status 3 (db_gone): 2
status 4 (db_redir_temp): 299024
status 5 (db_redir_perm): 230154
status 6 (db_notmodified): 159418
Searchable results are now about 287k.
Total urls in crawldb is listed as 867775.
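As a quick sanity check (arithmetic only, using the counts copied
from the breakdown above), the six status counts sum exactly to the
reported crawldb total, so the redirect entries are being retained
rather than dropped:

```shell
# Sum of the six crawldb status counts listed above.
echo $((19265 + 159912 + 2 + 299024 + 230154 + 159418))
# prints 867775, matching the reported total
```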
Thanks for your assistance.
JohnM
Dennis Kubes wrote:
Hi John,
If the http.redirect.max config variable in nutch-*.xml is set to 0,
then any redirect is queued to be fetched during the next fetching
round, similar to new urls we parse off of a webpage. Try setting it
to 3 and your redirect counts should go down.
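A sketch of that override in conf/nutch-site.xml (the value 3 is
the suggestion above; the description paraphrases the property's
purpose rather than quoting nutch-default.xml verbatim):

```xml
<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>Maximum number of redirects the fetcher will follow
  immediately when fetching a page; at or below 0, redirects are
  instead queued for a later fetching round.</description>
</property>
```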
Dennis
John Mendenhall wrote:
We are using nutch version nutch-2008-07-22_04-01-29.
We have a crawldb with over 500k urls.
The status breakdown was as follows:
status 1 (db_unfetched): 19261
status 2 (db_fetched): 71628
status 4 (db_redir_temp): 274899
status 5 (db_redir_perm): 148220
status 6 (db_notmodified): 822
We had set the http.redirect.max property to 7, in
the nutch-site.xml file. We are currently using
bin/nutch fetch.
We have set the logging level to debug for the Fetcher. We see the
'fetching' log entries. We see the protocol 'redirect to' log
entries. However, after a complete cycle of fetch/update/merge/index,
etc., all the numbers above stayed the same.
We originally fetched with the obsolete setting
of db.default.fetch.interval set to 365 days.
We now have db.fetch.interval.default set to 35 days.
We run the generate cycle with -add-days 370 to
refetch all those urls. This does not appear to be
working.
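One thing worth double-checking is the flag spelling: in the
generator's usage for Nutch builds of this era the option is
-adddays (no hyphen), so a generate invocation for this setup might
look like the following (paths are placeholders):

```shell
# Advance the generator's clock by 370 days so urls fetched under
# the old 365-day interval become eligible for refetch again.
bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -adddays 370
```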
We found db.update.additions.allowed, set to false
from a previous run. We tried setting it to
true. Now, after several more cycles, our status
breakdown is:
status 1 (db_unfetched): 19270
status 2 (db_fetched): 100230
status 3 (db_gone): 1
status 4 (db_redir_temp): 278559
status 5 (db_redir_perm): 168816
status 6 (db_notmodified): 64366
The number of fetched and searchable pages has gone
from about 72000 to about 140k (per the search
interface).
Our question is, what can we do to get these redirected
pages indexed? Do we need to increase add-days?
Do we need to increase our topN (currently using
100k)? Or, do we just need to start over?
After a couple more days, our status breakdown is as
follows:
status 1 (db_unfetched): 19261
status 2 (db_fetched): 100198
status 3 (db_gone): 2
status 4 (db_redir_temp): 279591
status 5 (db_redir_perm): 171076
status 6 (db_notmodified): 80227
Fetched has gone down slightly, while not modified has gone up.
Search results through the search interface have gone up to about
148500.
There must be some configuration variable we have not
set properly. We use topN of 100000. We run it through
the cycle about 8 times per day, with only small incremental
progress each round. Should topN be higher?
Or, do we need to rebuild the entire crawl database?
Please let me know if there is any information I need to
provide.
Thanks in advance for any assistance provided.
JohnM