Hi Swati,

I am also a student in Prof. Mattmann's class. I think politeness depends on the crawl delay between successive requests to the same server. Usually the Crawl-Delay in robots.txt is set to somewhere between 5 and 15 seconds. It's true that setting fetcher.threads.per.queue to a value greater than 1 causes the Crawl-Delay value from robots.txt to be ignored, but you can set a fetcher delay of 5 to 15 seconds yourself to space out successive requests. As far as I know, with more than one thread per queue Nutch uses fetcher.server.min.delay for this instead of fetcher.server.delay, so that is the property to raise.
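For what it's worth, here is roughly how those settings could look in conf/nutch-site.xml. This is only a sketch: the values (2 threads, 5 seconds) are my own assumptions and not something from the assignment, and I include both delay properties because, as far as I know, Nutch uses fetcher.server.min.delay rather than fetcher.server.delay once there is more than one thread per queue.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- More than 1 thread per host queue: Crawl-Delay from robots.txt is ignored. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value> <!-- assumed value, for illustration only -->
  </property>

  <!-- Delay in seconds between successive requests to the same server
       when there is a single thread per queue. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value> <!-- assumed value, low end of the 5 to 15 s range -->
  </property>

  <!-- Delay in seconds that applies when fetcher.threads.per.queue > 1. -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>5.0</value> <!-- assumed value, kept equal to the delay above -->
  </property>
</configuration>
```

Keeping the two delays equal means the per-server wait stays in the usual robots.txt range whichever one the fetcher ends up using.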
I also think we should change the contents of suffix_urlfilter as well, since our task is to collect as much data as we can from the three websites.

Jiaxin

On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari <[email protected]> wrote:
> Hi,
> We are working on a project under Professor Chris Mattmann as part of
> Information Retrieval course.
> We are trying to edit different properties to change politeness and do url
> filtering.
>
> We are trying more than 1 thread, which makes it impolite, but we are not
> sure how impolite it should be made for better results.
> Also, url filtering blocks almost all image, audio, video formats in
> suffix_urlfilter.xml, should that be tampered with or not?
>

