> We are using nutch version nutch-2008-07-22_04-01-29.
> We have a crawldb with over 500k urls.
>
> The status breakdown was as follows:
>
> status 1 (db_unfetched):   19261
> status 2 (db_fetched):     71628
> status 4 (db_redir_temp): 274899
> status 5 (db_redir_perm): 148220
> status 6 (db_notmodified):   822
>
> We had set the http.redirect.max property to 7 in
> the nutch-site.xml file. We are currently using
> bin/nutch fetch.
>
> We have set the logging level to debug for the Fetcher.
> We see the fetching log entries, and we see the
> protocol "redirect to" log entries. However, after a
> complete cycle of fetch/update/merge/index, etc., all
> the numbers above stayed the same.
>
> We originally fetched with the obsolete setting
> db.default.fetch.interval set to 365 days. We now
> have db.fetch.interval.default set to 35 days. We
> run the generate cycle with -add-days 370 to refetch
> all those urls. This does not appear to be working.
>
> We found db.update.additions.allowed set to false
> from a previous run. We tried setting it to true.
> Now, after several more cycles, our status breakdown
> is:
>
> status 1 (db_unfetched):   19270
> status 2 (db_fetched):    100230
> status 3 (db_gone):            1
> status 4 (db_redir_temp): 278559
> status 5 (db_redir_perm): 168816
> status 6 (db_notmodified): 64366
>
> The number of fetched and searchable pages has gone
> from about 72000 to about 140k (per the search
> interface).
>
> Our question is: what can we do to get these redirected
> pages indexed? Do we need to increase add-days? Do we
> need to increase our topN (currently 100k)? Or do we
> just need to start over?
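For reference, the properties mentioned in the quoted message would live together in conf/nutch-site.xml. A minimal fragment might look like the following (values taken from the post; note that depending on the build, db.fetch.interval.default may be expected in days or in seconds, so check your nutch-default.xml before copying this):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Follow up to 7 redirects during the fetch itself, rather than
       only recording them for a later generate/fetch round. -->
  <property>
    <name>http.redirect.max</name>
    <value>7</value>
  </property>
  <!-- Default re-fetch interval (replaces the obsolete
       db.default.fetch.interval); 35 days per the post. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>35</value>
  </property>
  <!-- Allow updatedb to add newly discovered urls, including
       redirect targets, to the crawldb. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>true</value>
  </property>
</configuration>
```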
After a couple more days, our status breakdown is as follows:

status 1 (db_unfetched):   19261
status 2 (db_fetched):    100198
status 3 (db_gone):            2
status 4 (db_redir_temp): 279591
status 5 (db_redir_perm): 171076
status 6 (db_notmodified): 80227

Fetched has gone down slightly, while not modified has gone up. Search results through the search interface have gone up to about 148500 results.

There must be some configuration variable we have not set properly. We use a topN of 100000, and we run the full cycle about 8 times per day, with only small incremental progress each round. Should topN be higher? Or do we need to rebuild the entire crawl database?

Please let me know if there is any other information I need to provide. Thanks in advance for any assistance.

JohnM

--
john mendenhall
[EMAIL PROTECTED]
surf utopia internet services
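For context, one round of the cycle described above would look roughly like the sketch below. This is an assumption-laden illustration, not your actual script: the crawl/crawldb and crawl/segments paths are placeholders, and the -topN/-adddays flags are from the generate tool of that era, so verify them against the usage output of bin/nutch generate on your build.

```
#!/bin/sh
# Sketch of one generate/fetch/update round (placeholder paths).

# Select up to 100k urls, treating everything as if 370 days had
# passed, so urls fetched under the old 365-day interval become
# eligible for refetch.
bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -adddays 370

# Fetch the newest segment.
segment=crawl/segments/$(ls crawl/segments | tail -1)
bin/nutch fetch $segment

# Fold the fetch results (including redirect targets, now that
# db.update.additions.allowed is true) back into the crawldb.
bin/nutch updatedb crawl/crawldb $segment

# Print the new status breakdown for comparison with the counts above.
bin/nutch readdb crawl/crawldb -stats
```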
