We are using nutch version nutch-2008-07-22_04-01-29. We have a crawldb with over 500k urls.
The status breakdown was as follows:

status 1 (db_unfetched): 19261
status 2 (db_fetched): 71628
status 4 (db_redir_temp): 274899
status 5 (db_redir_perm): 148220
status 6 (db_notmodified): 822

We had set the http.redirect.max property to 7 in the nutch-site.xml file. We are currently using bin/nutch fetch. We have set the logging level to debug for the Fetcher. We see the "fetching" log entries. We see the "protocol redirect to" log entries. However, after a complete cycle of fetch/update/merge/index, etc., all the numbers above stayed the same.

We originally fetched with the obsolete setting db.default.fetch.interval set to 365 days. We now have db.fetch.interval.default set to 35 days. We run the generate cycle with -add-days 370 to refetch all those urls. This does not appear to be working.

We found that db.update.additions.allowed was set to false, left over from a previous type of run. We tried setting it to true. After several more cycles, our status breakdown is now:

status 1 (db_unfetched): 19270
status 2 (db_fetched): 100230
status 3 (db_gone): 1
status 4 (db_redir_temp): 278559
status 5 (db_redir_perm): 168816
status 6 (db_notmodified): 64366

The number of fetched and searchable pages has gone from about 72000 to about 140k (per the search interface).

Our question is, what can we do to get these redirected pages indexed? Do we need to increase add-days? Do we need to increase our topN (currently using 100k)? Or do we just need to start over?

We are not seeing errors in the logs, at least not very many. In the fetch map task list interface, we are seeing entries like this:

-----
7502 pages, 3 errors, 29.4 pages/s, 24844 kb/s,
7409 pages, 0 errors, 32.2 pages/s, 26922 kb/s,
7464 pages, 3 errors, 28.4 pages/s, 23778 kb/s,
7366 pages, 2 errors, 35.2 pages/s, 29802 kb/s,
7344 pages, 0 errors, 31.8 pages/s, 26700 kb/s,
-----

We were unable to easily find the errors listed above. Any pointers on finding these errors, to ensure they are not something serious?

Of course, these error counts are not even close to accounting for the missing numbers we should be seeing.

Thanks in advance for any assistance or pointers to other resources or ideas.

JohnM

--
john mendenhall
[EMAIL PROTECTED]
surf utopia internet services
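
For reference, the change between the two status breakdowns quoted in this message can be tallied with a short Python sketch. The counts are copied from the message; db_gone is assumed to be 0 in the first breakdown because it was not listed there, and the variable names are only illustrative.

```python
# Status counts copied from the two breakdowns in this message.
# Assumption: db_gone was 0 in the first breakdown (it was not listed).
before = {
    "db_unfetched": 19261,
    "db_fetched": 71628,
    "db_gone": 0,
    "db_redir_temp": 274899,
    "db_redir_perm": 148220,
    "db_notmodified": 822,
}
after = {
    "db_unfetched": 19270,
    "db_fetched": 100230,
    "db_gone": 1,
    "db_redir_temp": 278559,
    "db_redir_perm": 168816,
    "db_notmodified": 64366,
}

# Per-status change between the two runs.
delta = {status: after[status] - before[status] for status in before}

total_before = sum(before.values())
total_after = sum(after.values())

for status, change in delta.items():
    print(f"{status:<16} {before[status]:>8} -> {after[status]:>8} ({change:+d})")
print(f"{'total':<16} {total_before:>8} -> {total_after:>8} ({total_after - total_before:+d})")
```

Under these assumptions the crawldb grew from 514830 to 631242 entries, with db_notmodified (+63544) and db_fetched (+28602) accounting for most of the change, while the two redirect statuses also grew rather than shrank.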
