On Sat, 19 Jan 2008, Dennis Kubes wrote:

> There are a few different things that could be causing this.

Thanks for the response!

> One, there is a variable called generate.max.per.host in the
> nutch-default.xml file. If this is set to a value instead of -1 then
> it will limit the number of urls from that host.

Variable generate.max.per.host is set to -1.

> Two, have you set the http.agent.name? If you didn't it probably
> wouldn't have fetched anything at all. The job would complete but the
> output would be 0.

Variable http.agent.name is set. Nutch definitely fetches documents.
No problem there.

> Three, you could be maxing out your bandwidth and only 1/10th of urls
> are actually getting through before timeout, or the site is blocking
> most of the urls you are trying to fetch through robots.txt. Look at
> the JobTracker admin screen for the fetch job and see how many errors
> are in each fetch task.

We work with the site, and robots.txt is allowing us through. It is
definitely getting different pages each time. We have 100000 urls in
the crawldb, but it is only getting about 3% new pages each
generate-fetch-update cycle. The most recent completed run had 97 map
tasks and 17 reduce tasks, all of which completed fine, with 0
failures.

> It could also be a url-filter problem with a bad regex filter.

I doubt this is the problem. Each cycle still lets new urls in; the
number just seems capped for each run.

> My guess would be from the info you have given that you are maxing
> your bandwidth. This would cause the number fetched to fluctuate some
> but be about the same. What is your bandwidth for fetching and what
> do you have mapred.map.tasks set to and fetcher.threads.fetch set to?

I will have to check on the bandwidth available for fetching. The
relevant settings (overrides shown below) are:

Variable mapred.map.tasks is set to 97.
Variable mapred.reduce.tasks is set to 17.
Variable fetcher.threads.fetch is set to 10.
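In case it helps, here is roughly what those overrides look like in
our conf/nutch-site.xml (values as stated above; we also set the
http.agent.* properties, elided here):

    <property>
      <name>generate.max.per.host</name>
      <value>-1</value>
      <!-- -1 means no per-host cap when generating a segment -->
    </property>

    <property>
      <name>mapred.map.tasks</name>
      <value>97</value>
    </property>

    <property>
      <name>mapred.reduce.tasks</name>
      <value>17</value>
    </property>

    <property>
      <name>fetcher.threads.fetch</name>
      <value>10</value>
    </property>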
Thanks again for any pointers you can provide.

JohnM

> John Mendenhall wrote:
> > Hello,
> >
> > I am running nutch 0.9 currently. I am running on 4 nodes; one is
> > the master, in addition to being a slave.
> >
> > I have injected 100k urls into nutch. All urls are on the same
> > host.
> >
> > I am running a generate/fetch/update cycle with topN set at 100k.
> >
> > However, after each cycle, it only fetches between 2588 and 2914
> > urls each time. I have run this over 8 times, all with the same
> > result.
> >
> > I have tried using nutch fetch and nutch fetch2.
> >
> > My hypothesis is that this is due to all urls being on the same
> > host (www.example.com/some/path).
> >
> > Do I need to set the fetcher.threads.per.host to something higher
> > than the default of 2?
>
> The fetcher.threads.per.host variable is just the number of threads
> (fetchers) that can fetch from a single host at a given time. If you
> own/run the domain it is okay to crawl it faster; if not, the default
> politeness settings are best so as not to overwhelm the server you
> are crawling.
>
> > Is there something in the logs I should look for to determine the
> > exact cause of this problem?
> >
> > Thank you in advance for any assistance that can be provided.
> >
> > If you need any additional information, please let me know and I'll
> > send it.
> >
> > Thanks!
> >
> > JohnM

--
john mendenhall
[EMAIL PROTECTED]
surf utopia internet services
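P.S. In case the exact invocation matters, each round is the standard
generate/fetch/update sequence, roughly as below (the crawl/ paths are
just our local layout, nothing special):

    # generate a new segment from the top 100k urls by score
    bin/nutch generate crawl/crawldb crawl/segments -topN 100000

    # pick up the segment that was just created (timestamp-named dir)
    segment=`ls -d crawl/segments/2* | tail -1`

    # fetch the segment, then fold the results back into the crawldb
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment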

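For anyone who wants to see the numbers themselves, the per-cycle
growth can be checked by comparing crawldb stats before and after each
round:

    bin/nutch readdb crawl/crawldb -stats

The DB_fetched count in the stats output is the number of urls fetched
so far; in our case it only climbs by roughly the 2588-2914 urls
fetched per cycle, i.e. about 3% of the 100k in the db.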