Hi Swati,

I am also a student in Prof. Mattmann's class. I think politeness depends
mainly on the delay between successive requests to the same server.
Usually robots.txt sets Crawl-Delay to somewhere between 5 and 15 seconds.
It's true that setting fetcher.threads.per.queue to a value greater than 1
causes the Crawl-Delay value from robots.txt to be ignored. In that case,
as far as I can tell from the property descriptions in nutch-default.xml,
the delay that applies is fetcher.server.min.delay rather than
fetcher.server.delay, so you can set fetcher.server.min.delay to 5 to 15
seconds to keep successive requests to the same server spaced out.
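
A rough sketch of what I mean, assuming the overrides go in
conf/nutch-site.xml. As far as I can tell from nutch-default.xml, once
fetcher.threads.per.queue is greater than 1 the delay Nutch uses between
requests to the same server is fetcher.server.min.delay. The values 2 and
5.0 below are only illustrative, not recommendations:

```xml
<!-- conf/nutch-site.xml: local overrides for nutch-default.xml -->
<configuration>
  <!-- More than one thread per host queue; Crawl-Delay from
       robots.txt is then ignored. The value 2 is an example. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
  <!-- Delay in seconds between successive requests to the same
       server when threads per queue > 1. 5.0 is an example taken
       from the low end of the 5 to 15 second range. -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```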

I also think we should change the contents of suffix_urlfilter, since our
task is to collect as much data as we can from the three websites.

Jiaxin

On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari <[email protected]> wrote:

> Hi,
> We are working on a project under Professor Chris Mattmann as part of
> Information Retrieval course.
> We are trying to edit different properties to change politeness and do url
> filtering.
>
> We are trying more than 1 thread, which makes it impolite, but we are not
> sure how impolite it should be made for better results.
> Also, url filtering blocks almost all image, audio, video formats in
> suffix_urlfilter.xml, should that be tampered with or not?
>
