Indeed it's doubtful, but I don't think there is an exact value for politeness. Interestingly, Nutch is described as "aggressively polite" here: http://opensourceconnections.com/blog/2014/05/24/crawling-with-nutch/ . So maybe Nutch is polite anyway in the end. :D
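
For what it's worth, here is a minimal sketch of how the settings discussed in this thread could look in conf/nutch-site.xml. The values are illustrative assumptions only, not recommendations. I have also included fetcher.server.min.delay, which as far as I know is the delay Nutch uses in place of fetcher.server.delay once fetcher.threads.per.queue is greater than 1, so please check it against the nutch-default.xml shipped with your version.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Threads allowed to fetch from the same host queue at once.
       Values above 1 mean Crawl-Delay from robots.txt is ignored.
       The value 2 is an assumed example. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>

  <!-- Delay in seconds between successive requests to the same server
       when there is one thread per queue. 5.0 is an assumed example. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

  <!-- Minimum delay in seconds between requests to the same server,
       applied when fetcher.threads.per.queue is greater than 1.
       5.0 is an assumed example. -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```

Starting from a delay in the 5 to 15 second range mentioned below and lowering it gradually while watching for fetch errors seems a safer way to find a workable value than guessing one up front.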

On Mon, Feb 16, 2015 at 12:52 AM, Swati Kothari <[email protected]> wrote:

> Thanks Jiaxin. We are already trying to vary the parameters as you said,
> but what values would be appropriate for the properties that we are
> changing is still doubtful.
>
> On Sun, Feb 15, 2015 at 11:34 PM, Jiaxin Ye <[email protected]> wrote:
>
>> Hi Swati,
>>
>> I am also a student in Prof Mattmann's class. I think the politeness
>> depends on the crawl-delay to the same server. Usually in the robots.txt
>> the crawl-delay will be set to 5 to 15 seconds. It's true that setting
>> fetcher.threads.per.queue to be bigger than 1 will cause the Crawl-Delay
>> value from robots.txt to be ignored, but you can set the
>> fetcher.server.delay to be 5 to 15 seconds to rebalance the successive
>> requests time.
>>
>> I also think we should change the content in suffix_urlfilter as well,
>> as our task is to collect as much data as we can from the three websites.
>>
>> Jiaxin
>>
>> On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari <[email protected]> wrote:
>>
>>> Hi,
>>> We are working on a project under Professor Chris Mattmann as part of
>>> Information Retrieval course.
>>> We are trying to edit different properties to change politeness and do
>>> url filtering.
>>>
>>> We are trying more than 1 thread, which makes it impolite, but we are
>>> not sure how impolite it should be made for better results.
>>> Also, url filtering blocks almost all image, audio, video formats in
>>> suffix_urlfilter.xml, should that be tampered with or not?

