On Thu, 04 Dec 2008, Dennis Kubes wrote:

> Forget my last email. I went back and read your original email. What
> type of webpages are you trying to fetch? This doesn't seem like a
> configuration issue to me.
Most of this particular URL set are redirects. The pages are dynamic pages, all
from a single website, served by hundreds of web servers and requested through
a proxy server local to the Nutch machine. I believe the urlfilter is set
correctly, and the logs now state that the pages are being fetched. I have
debug logging enabled for the fetcher and the generator, and I am not seeing
messages about URLs being skipped. Perhaps I need to enable debugging elsewhere
to find out why these URLs are not being fetched and indexed.

I did increase topN to 1000000. We got quite a bit more, but still not all of
the pages. The current status breakdown, after an additional cycle, is:

status 1 (db_unfetched):   19265
status 2 (db_fetched):     159912
status 3 (db_gone):        2
status 4 (db_redir_temp):  299024
status 5 (db_redir_perm):  230154
status 6 (db_notmodified): 159418

Searchable results are now about 287k. The total number of URLs in the crawldb
is listed as 867775.

Thanks for your assistance.

JohnM

> Dennis Kubes wrote:
> > Hi John,
> >
> > If the http.redirect.max config variable in nutch-*.xml is set to 0,
> > then any redirect is queued to be fetched during the next fetching
> > round, similar to the new URLs we parse off of a webpage. Try setting
> > it to 3 and your redirects should go down.
> >
> > Dennis
> >
> > John Mendenhall wrote:
> >>> We are using nutch version nutch-2008-07-22_04-01-29.
> >>> We have a crawldb with over 500k URLs.
> >>>
> >>> The status breakdown was as follows:
> >>>
> >>> status 1 (db_unfetched):   19261
> >>> status 2 (db_fetched):     71628
> >>> status 4 (db_redir_temp):  274899
> >>> status 5 (db_redir_perm):  148220
> >>> status 6 (db_notmodified): 822
> >>>
> >>> We had set the http.redirect.max property to 7 in
> >>> the nutch-site.xml file. We are currently using
> >>> bin/nutch fetch.
> >>>
> >>> We have set the logging level to debug for the Fetcher. We see the
> >>> fetching log entries. We see the protocol "redirect to" log entries.
> >>> However, after a complete cycle of fetch/update/merge/index, etc.,
> >>> all the numbers above stayed the same.
> >>>
> >>> We originally fetched with the obsolete setting
> >>> db.default.fetch.interval set to 365 days.
> >>> We now have db.fetch.interval.default set to 35 days.
> >>> We run the generate cycle with -add-days 370 to
> >>> refetch all those URLs. This does not appear to be working.
> >>>
> >>> We found db.update.additions.allowed set to false
> >>> from a previous run type. We tried setting it to
> >>> true. Now, after several more cycles, our status
> >>> breakdown is:
> >>>
> >>> status 1 (db_unfetched):   19270
> >>> status 2 (db_fetched):     100230
> >>> status 3 (db_gone):        1
> >>> status 4 (db_redir_temp):  278559
> >>> status 5 (db_redir_perm):  168816
> >>> status 6 (db_notmodified): 64366
> >>>
> >>> The number of fetched and searchable pages has gone
> >>> from about 72000 to about 140k (per the search interface).
> >>>
> >>> Our question is: what can we do to get these redirected
> >>> pages indexed? Do we need to increase add-days?
> >>> Do we need to increase our topN (currently 100k)?
> >>> Or do we just need to start over?
> >>
> >> After a couple more days, our status breakdown is as follows:
> >>
> >> status 1 (db_unfetched):   19261
> >> status 2 (db_fetched):     100198
> >> status 3 (db_gone):        2
> >> status 4 (db_redir_temp):  279591
> >> status 5 (db_redir_perm):  171076
> >> status 6 (db_notmodified): 80227
> >>
> >> Fetched has gone down, with not-modified going up.
> >> Search results through the search interface have gone up
> >> to about 148500 results.
> >>
> >> There must be some configuration variable we have not
> >> set properly. We use a topN of 100000. We run it through
> >> the cycle about 8 times per day, with only small incremental
> >> progress each round. Should topN be higher?
> >>
> >> Or do we need to rebuild the entire crawl database?
> >>
> >> Please let me know if there is any information I need to provide.
> >>
> >> Thanks in advance for any assistance provided.
> >>
> >> JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
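For anyone following along, the three properties discussed in this thread all
go in conf/nutch-site.xml. A minimal sketch, with the values mentioned in the
thread rather than recommendations:

```xml
<?xml version="1.0"?>
<configuration>

  <!-- Follow up to 3 redirects immediately during a fetch instead of
       queueing the redirect target for the next round (0 = queue it). -->
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>

  <!-- Default re-fetch interval. Check the unit in your build's
       nutch-default.xml: the older db.default.fetch.interval took days,
       while some builds express db.fetch.interval.default in seconds. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>35</value>
  </property>

  <!-- Allow URLs discovered during parsing (and redirect targets) to be
       added to the crawldb when updatedb runs. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>true</value>
  </property>

</configuration>
```

Settings in nutch-site.xml override nutch-default.xml, so only the properties
being changed need to appear here.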
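The generate/fetch/update cycle being discussed can be sketched as below. The
directory layout (crawl/crawldb, crawl/segments, crawl/linkdb, crawl/indexes)
is illustrative, and the exact option spellings vary between Nutch versions
(e.g. -adddays vs. -add-days), so check bin/nutch's usage output for your
build:

```shell
# One generate/fetch/update round, roughly as described in the thread.

# Select up to 100k top-scoring URLs; the add-days option shifts the clock
# forward so URLs whose fetch interval has not yet elapsed are still selected.
bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -adddays 370

# Fetch the newest segment (segment names are timestamps).
SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)
bin/nutch fetch $SEGMENT

# Fold the fetch results (including discovered redirect targets) back into
# the crawldb; db.update.additions.allowed controls whether new URLs are
# added during this step.
bin/nutch updatedb crawl/crawldb $SEGMENT

# Rebuild the link database and index the segment.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $SEGMENT
```

With http.redirect.max at 0, each hop of a redirect chain costs one full round
of this cycle before the target is even queued, which matches the slow
incremental progress described above.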
