Indeed it's doubtful, but I don't think there is an exact value for
politeness. Interestingly, Nutch is described as "aggressively polite" here:
http://opensourceconnections.com/blog/2014/05/24/crawling-with-nutch/
So maybe Nutch is polite anyway in the end. :D
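
For reference, here is a minimal sketch of how the properties discussed in
this thread could be overridden in conf/nutch-site.xml. The values are
illustrative assumptions, not recommendations. As far as I understand the
property descriptions in nutch-default.xml, once fetcher.threads.per.queue is
greater than 1 the delay that applies is fetcher.server.min.delay rather than
fetcher.server.delay, so both are shown:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Seconds between successive requests to the same server
       when a single thread works on each queue. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

  <!-- A value above 1 turns off per-host blocking, and the
       Crawl-Delay from robots.txt is then ignored. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>

  <!-- Minimum seconds between requests to the same server;
       only applies when fetcher.threads.per.queue is above 1. -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```

Keeping the minimum delay in the 5 to 15 second range mentioned below would
roughly match what a typical robots.txt Crawl-Delay asks for.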

On Mon, Feb 16, 2015 at 12:52 AM, Swati Kothari wrote:

> Thanks, Jiaxin. We are already trying to vary the parameters as you said,
> but it is still doubtful which values would be appropriate for the
> properties that we are changing.
>
> On Sun, Feb 15, 2015 at 11:34 PM, Jiaxin Ye wrote:
>
>> Hi Swati,
>>
>> I am also a student in Prof. Mattmann's class. I think politeness
>> depends on the crawl delay applied to the same server. Usually the
>> Crawl-Delay in robots.txt is set to 5 to 15 seconds. It's true that
>> setting fetcher.threads.per.queue to a value greater than 1 will cause
>> the Crawl-Delay value from robots.txt to be ignored, but you can set
>> fetcher.server.delay to 5 to 15 seconds to space out successive
>> requests to the same server.
>>
>> I also think we should change the contents of suffix_urlfilter.xml as
>> well, since our task is to collect as much data as we can from the three
>> websites.
>>
>> Jiaxin
>>
>> On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari wrote:
>>
>>> Hi,
>>> We are working on a project under Professor Chris Mattmann as part of
>>> an Information Retrieval course.
>>> We are trying to edit different properties to change politeness and to
>>> do URL filtering.
>>>
>>> We are trying more than 1 thread per queue, which makes the crawl
>>> impolite, but we are not sure how impolite it should be to get better
>>> results.
>>> Also, the URL filter in suffix_urlfilter.xml blocks almost all image,
>>> audio, and video formats. Should we modify it or leave it as is?
>>>
>>
>>
>
