We are using Nutch version nutch-2008-07-22_04-01-29.
Our crawldb holds over 500k URLs.

The status breakdown was as follows:

status 1 (db_unfetched):     19261
status 2 (db_fetched):       71628
status 4 (db_redir_temp):   274899
status 5 (db_redir_perm):   148220
status 6 (db_notmodified):     822

We have set the http.redirect.max property to 7 in
the nutch-site.xml file.  We are currently fetching
with bin/nutch fetch.

We have set the logging level to debug for the Fetcher.
We see the fetching log entries, and we see the
"protocol redirect to" log entries.  However, after a
complete cycle of fetch/update/merge/index, etc., all of
the numbers above stayed the same.

We originally fetched with the obsolete setting
of db.default.fetch.interval set to 365 days.
We now have db.fetch.interval.default set to 35 days.
We run the generate cycle with -add-days 370 to
refetch all those urls.  This does not appear to be
working.

We found db.update.additions.allowed set to false,
left over from a previous type of run.  We tried
setting it to true.  After several more cycles, our
status breakdown is now:

status 1 (db_unfetched):     19270
status 2 (db_fetched):      100230
status 3 (db_gone):              1
status 4 (db_redir_temp):   278559
status 5 (db_redir_perm):   168816
status 6 (db_notmodified):   64366
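
To make the change between the two breakdowns easier to read, here is a
rough sketch in Python that subtracts the first set of counts from the
second.  The numbers are copied from the listings above.  The only
assumption is that db_gone, which does not appear in the first listing,
was 0 at that point.  The function name status_deltas is just ours.

```python
# Status counts copied from the two breakdowns in this post.
before = {
    "db_unfetched": 19261,
    "db_fetched": 71628,
    "db_gone": 0,  # assumption: not listed in the first breakdown, taken as 0
    "db_redir_temp": 274899,
    "db_redir_perm": 148220,
    "db_notmodified": 822,
}
after = {
    "db_unfetched": 19270,
    "db_fetched": 100230,
    "db_gone": 1,
    "db_redir_temp": 278559,
    "db_redir_perm": 168816,
    "db_notmodified": 64366,
}


def status_deltas(before, after):
    """Return the per-status change (after minus before)."""
    return {status: after[status] - before.get(status, 0) for status in after}


deltas = status_deltas(before, after)
for status, delta in deltas.items():
    # Columns: status name, current count, signed change.
    print(f"{status:<16}{after[status]:>8}{delta:>+8}")
print(f"{'total':<16}{sum(after.values()):>8}{sum(deltas.values()):>+8}")
```

By this arithmetic the crawldb grew by a little over 116k entries, while
the two redirect statuses did not shrink at all.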

The number of fetched and searchable pages has gone
from about 72000 to about 140k (per the search
interface).

Our question is, what can we do to get these redirected
pages indexed?  Do we need to increase add-days?
Do we need to increase our topN (currently using
100k)?  Or, do we just need to start over?

We are not seeing errors in the logs, at least not
very many.

In the fetch map task list interface, we are seeing
entries like this:

-----
7502 pages, 3 errors, 29.4 pages/s, 24844 kb/s,
7409 pages, 0 errors, 32.2 pages/s, 26922 kb/s,
7464 pages, 3 errors, 28.4 pages/s, 23778 kb/s,
7366 pages, 2 errors, 35.2 pages/s, 29802 kb/s,
7344 pages, 0 errors, 31.8 pages/s, 26700 kb/s, 
-----

We were unable to easily find the errors listed
above.  Any pointers on finding these errors, to ensure
they are not something serious?

Of course, these few errors do not come close to
accounting for the pages we are missing.
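
To put the error counts from the task list in proportion, here is a
small Python sketch that totals the five sample lines quoted earlier.
The line format is assumed from those samples only, and the names
LINE_RE and summarize are ours, not anything from Nutch.

```python
import re

# Status lines copied from the fetch map task list in this post.
task_lines = [
    "7502 pages, 3 errors, 29.4 pages/s, 24844 kb/s,",
    "7409 pages, 0 errors, 32.2 pages/s, 26922 kb/s,",
    "7464 pages, 3 errors, 28.4 pages/s, 23778 kb/s,",
    "7366 pages, 2 errors, 35.2 pages/s, 29802 kb/s,",
    "7344 pages, 0 errors, 31.8 pages/s, 26700 kb/s,",
]

# Assumed format: "<N> pages, <M> errors, ..." at the start of each line.
LINE_RE = re.compile(r"(\d+) pages, (\d+) errors")


def summarize(lines):
    """Sum the page and error counts over all lines that match the format."""
    pages = errors = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that does not look like a status line
        pages += int(match.group(1))
        errors += int(match.group(2))
    return pages, errors


pages, errors = summarize(task_lines)
print(f"{pages} pages, {errors} errors ({errors / pages:.4%} of pages)")
```

For these five tasks that comes to 8 errors in 37085 pages, far too few
to explain several hundred thousand URLs stuck in the redirect statuses.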

Thanks in advance for any assistance or pointers
to other resources or ideas.

JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
