Forget my last email. I went back and read your original email. What
type of webpages are you trying to fetch? This doesn't seem like a
configuration issue to me.
Dennis
Dennis Kubes wrote:
Hi John,
If the http.redirect.max config variable in nutch-*.xml is set to 0, then
any redirect is queued to be fetched during the next fetching round,
similar to new URLs we parse off of a webpage. Try setting it to 3 and
your redirect counts should go down.
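For reference, a property override like this would go in conf/nutch-site.xml (the value 3 here follows the suggestion above; adjust to taste):

```xml
<!-- Follow up to 3 redirects within the same fetch, instead of
     queueing each redirect target for the next fetch round (0). -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>
```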
Dennis
John Mendenhall wrote:
We are using nutch version nutch-2008-07-22_04-01-29.
We have a crawldb with over 500k urls.
The status breakdown was as follows:
status 1 (db_unfetched): 19261
status 2 (db_fetched): 71628
status 4 (db_redir_temp): 274899
status 5 (db_redir_perm): 148220
status 6 (db_notmodified): 822
We had set the http.redirect.max property to 7, in
the nutch-site.xml file. We are currently using
bin/nutch fetch.
We have set the logging level to debug for the Fetcher. We see the
fetching log entries, and we see the protocol "redirect to" log
entries. However, after a complete cycle of
fetch/update/merge/index, etc., all the numbers
above stayed the same.
We originally fetched with the obsolete setting
of db.default.fetch.interval set to 365 days.
We now have db.fetch.interval.default set to 35 days.
We run the generate cycle with -add-days 370 to
refetch all those urls. This does not appear to be
working.
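As a sanity check, a generate invocation with the add-days option might look like the following. The paths are illustrative, and note that in the Generator's usage message the flag is spelled -adddays (no hyphen between "add" and "days"), which may be worth double-checking against what your script passes:

```shell
# -topN and -adddays values as described above; crawl/ paths are examples.
bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -adddays 370
```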
We found db.update.additions.allowed set to false
from a previous type of run. We tried setting it to
true. Now, after several more cycles, our status
breakdown is:
status 1 (db_unfetched): 19270
status 2 (db_fetched): 100230
status 3 (db_gone): 1
status 4 (db_redir_temp): 278559
status 5 (db_redir_perm): 168816
status 6 (db_notmodified): 64366
The number of fetched and searchable pages has gone
from about 72k to about 140k (per the search
interface).
Our question is, what can we do to get these redirected
pages indexed? Do we need to increase add-days?
Do we need to increase our topN (currently using
100k)? Or, do we just need to start over?
After a couple more days, our status breakdown is as
follows:
status 1 (db_unfetched): 19261
status 2 (db_fetched): 100198
status 3 (db_gone): 2
status 4 (db_redir_temp): 279591
status 5 (db_redir_perm): 171076
status 6 (db_notmodified): 80227
Fetched has gone down slightly, with not-modified going up.
Search results through the search interface have gone up
to about 148500 results.
There must be some configuration variable we have not
set properly. We use topN of 100000. We run it through
the cycle about 8 times per day, with only small incremental
progress each round. Should topN be higher?
Or, do we need to rebuild the entire crawl database?
Please let me know if there is any information I need to
provide.
Thanks in advance for any assistance provided.
JohnM