Dennis,

> I will need to look deeper, but I think there is a subtle logic bug in
> Fetcher.
>
> Redirect statuses, both temp and perm, get output by the fetchers, even
> when redirecting immediately, so if you have multiple redirects you
> would have multiple outputs in the crawl_fetch output from segments.
> The outputs are by url, but the urls change with redirects.
>
> When updating the crawldb, the latest (in terms of time) update is used.
> But that is per url, and I don't think it updates the redirected urls.
> Meaning the updated crawldb would contain the successfully fetched urls
> and the redirected intermediate urls. At least that is what I think is
> happening.
>
> The final number indexed should be the successfully fetched urls, which
> would be db_fetched.
>
> Dennis
Anything I can do to help debug this?

JohnM

> John Mendenhall wrote:
> >On Thu, 04 Dec 2008, Dennis Kubes wrote:
> >
> >>Forget my last email. I went back and read your original email. What
> >>type of webpages are you trying to fetch? This doesn't seem like a
> >>configuration issue to me.
> >
> >Most of this particular url set are redirects.
> >The pages are dynamic pages, all from a single
> >website, served by hundreds of web servers,
> >requested through a local proxy server (local to
> >the nutch server).
> >
> >I believe the urlfilter is set correctly.
> >The logs are stating the pages are getting
> >fetched now. I have debug set for the fetcher
> >and generator. I am not seeing messages
> >of the urls being skipped. Perhaps I need
> >to debug elsewhere to find out why we are
> >not getting the urls fetched and indexed.
> >
> >I did increase topN to 1000000. We got quite
> >a bit more, but still not all of the pages.
> >The current status breakdown after an additional
> >cycle is:
> >
> >status 1 (db_unfetched):    19265
> >status 2 (db_fetched):     159912
> >status 3 (db_gone):             2
> >status 4 (db_redir_temp):  299024
> >status 5 (db_redir_perm):  230154
> >status 6 (db_notmodified): 159418
> >
> >Searchable results are now about 287k.
> >Total urls in the crawldb is listed as 867775.
> >
> >Thanks for your assistance.
> >
> >JohnM
> >
> >
> >>Dennis Kubes wrote:
> >>>Hi John,
> >>>
> >>>If the http.redirect.max config variable in nutch-*.xml is set to 0, then
> >>>any redirect is queued to be fetched during the next fetching round,
> >>>similar to new urls we parse off of a webpage. Try setting it to 3 and
> >>>your redirects should go down.
> >>>
> >>>Dennis
> >>>
> >>>John Mendenhall wrote:
> >>>>>We are using nutch version nutch-2008-07-22_04-01-29.
> >>>>>We have a crawldb with over 500k urls.
> >>>>>
> >>>>>The status breakdown was as follows:
> >>>>>
> >>>>>status 1 (db_unfetched):    19261
> >>>>>status 2 (db_fetched):      71628
> >>>>>status 4 (db_redir_temp):  274899
> >>>>>status 5 (db_redir_perm):  148220
> >>>>>status 6 (db_notmodified):    822
> >>>>>
> >>>>>We had set the http.redirect.max property to 7 in
> >>>>>the nutch-site.xml file. We are currently using
> >>>>>bin/nutch fetch.
> >>>>>
> >>>>>We have set the logging level to debug for the Fetcher. We see the
> >>>>>"fetching" log entries. We see the protocol
> >>>>>"redirect to" log entries. However, after a complete
> >>>>>cycle of fetch/update/merge/index, etc., all the numbers
> >>>>>above stayed the same.
> >>>>>
> >>>>>We originally fetched with the obsolete setting
> >>>>>of db.default.fetch.interval set to 365 days.
> >>>>>We now have db.fetch.interval.default set to 35 days.
> >>>>>We run the generate cycle with -add-days 370 to
> >>>>>refetch all those urls. This does not appear to be
> >>>>>working.
> >>>>>
> >>>>>We found db.update.additions.allowed set to false
> >>>>>from a previous run type. We tried setting it to
> >>>>>true. Now, after several more cycles, our status
> >>>>>breakdown is:
> >>>>>
> >>>>>status 1 (db_unfetched):    19270
> >>>>>status 2 (db_fetched):     100230
> >>>>>status 3 (db_gone):             1
> >>>>>status 4 (db_redir_temp):  278559
> >>>>>status 5 (db_redir_perm):  168816
> >>>>>status 6 (db_notmodified):  64366
> >>>>>
> >>>>>The number of fetched and searchable pages has gone
> >>>>>from about 72000 to about 140k (per the search
> >>>>>interface).
> >>>>>
> >>>>>Our question is, what can we do to get these redirected
> >>>>>pages indexed? Do we need to increase add-days?
> >>>>>Do we need to increase our topN (currently using
> >>>>>100k)? Or do we just need to start over?
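The generate/-add-days mechanics John describes (a 35-day db.fetch.interval.default plus -add-days 370) can be sketched as a toy model. This is not Nutch's actual Generator code; the function name and the simplification to a single per-url interval are illustrative only:

```python
import time

DAY = 24 * 60 * 60  # seconds per day

def eligible(last_fetch_time, fetch_interval_days, add_days, now=None):
    """Toy model of the generator's due-for-fetch check: a url is
    selected when its next scheduled fetch time falls on or before
    now + add_days (the -add-days offset shifts the cutoff forward)."""
    now = time.time() if now is None else now
    next_fetch = last_fetch_time + fetch_interval_days * DAY
    return next_fetch <= now + add_days * DAY

now = time.time()
# A url fetched 10 days ago under the old 365-day interval is not due...
print(eligible(now - 10 * DAY, 365, 0, now))    # False
# ...but -add-days 370 pushes the cutoff far enough to reselect it.
print(eligible(now - 10 * DAY, 365, 370, now))  # True
```

In this model, urls stamped with the old 365-day interval stay ineligible for months unless the add-days offset exceeds the remaining interval, which matches why 370 was chosen above.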
> >>>>After a couple more days, our status breakdown is as
> >>>>follows:
> >>>>
> >>>>status 1 (db_unfetched):    19261
> >>>>status 2 (db_fetched):     100198
> >>>>status 3 (db_gone):             2
> >>>>status 4 (db_redir_temp):  279591
> >>>>status 5 (db_redir_perm):  171076
> >>>>status 6 (db_notmodified):  80227
> >>>>
> >>>>Fetched has gone down slightly, while not-modified has gone up.
> >>>>Search results through the search interface have gone up
> >>>>to about 148500 results.
> >>>>
> >>>>There must be some configuration variable we have not
> >>>>set properly. We use a topN of 100000. We run it through
> >>>>the cycle about 8 times per day, with only small incremental
> >>>>progress each round. Should topN be higher?
> >>>>
> >>>>Or do we need to rebuild the entire crawl database?
> >>>>
> >>>>Please let me know if there is any information I need to
> >>>>provide.
> >>>>
> >>>>Thanks in advance for any assistance provided.
> >>>>
> >>>>JohnM

--
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
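Dennis's earlier point about http.redirect.max can be illustrated with a small sketch. This is illustrative Python, not Nutch's Fetcher; the redirect-chain dict is hypothetical data standing in for real HTTP 301/302 responses:

```python
def fetch(url, redirects, http_redirect_max):
    """Toy model of fetcher redirect handling.

    With http.redirect.max = 0, a redirect is not followed in this
    round; its target is queued for a later fetch round instead. With
    a positive value, up to that many redirects are followed within
    the same round. `redirects` maps a url to its redirect target.
    """
    queued = []  # urls deferred to the next fetching round
    hops = 0
    while url in redirects:
        if hops >= http_redirect_max:
            queued.append(redirects[url])
            return None, queued      # gave up for now; fetch later
        url = redirects[url]
        hops += 1
    return url, queued               # fetched the final target now

chain = {"http://a/": "http://b/", "http://b/": "http://c/"}
print(fetch("http://a/", chain, 0))  # (None, ['http://b/']) - deferred
print(fetch("http://a/", chain, 3))  # ('http://c/', []) - followed now
```

Under this model, a max of 0 stretches an n-hop redirect chain across n fetch/update cycles, which is consistent with the slow, incremental progress reported in the thread.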

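The suspected Fetcher/updatedb interaction Dennis describes at the top of the thread can also be modeled in a few lines. This is a toy simulation built only on his description (latest update per url wins; redirect hops are emitted under their own urls), not Nutch's actual CrawlDb update code:

```python
def update_crawldb(db, fetch_output):
    """Toy model of updatedb: for each url, keep only the latest
    status. Because each redirect hop was emitted under its own url,
    the intermediate urls remain in the db as db_redir_* entries even
    after the final target is fetched successfully."""
    for url, status in fetch_output:
        db[url] = status  # latest update per url wins
    return db

# One fetch round following a -> b -> c: the fetcher emits a record
# for every hop, each keyed by its own url.
round_output = [
    ("http://a/", "db_redir_perm"),
    ("http://b/", "db_redir_temp"),
    ("http://c/", "db_fetched"),
]
db = update_crawldb({}, round_output)
print(db)
# {'http://a/': 'db_redir_perm', 'http://b/': 'db_redir_temp',
#  'http://c/': 'db_fetched'}
```

If this model is right, only the db_fetched urls are indexable, while every intermediate hop inflates the db_redir_temp and db_redir_perm counts, which would explain the large redirect totals relative to the searchable-results count.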