Exactly, Jiaxin, great answer.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Jiaxin Ye <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, February 15, 2015 at 11:34 PM
To: "[email protected]" <[email protected]>
Subject: Re:

>Hi Swati,
>
>
>I am also a student in Prof. Mattmann's class. I think politeness
>depends on the crawl delay between requests to the same server. In
>robots.txt, the Crawl-Delay is usually set to 5 to 15 seconds. It's true
>that setting fetcher.threads.per.queue to a value greater than 1 causes
>the Crawl-Delay value from robots.txt to be ignored, but you can set
>fetcher.server.delay to 5 to 15 seconds to space out successive requests
>to the same server.
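>
>As a rough sketch of these settings, assuming they are overridden in
>conf/nutch-site.xml: the values below are illustrative only, not
>recommendations. fetcher.server.min.delay is not mentioned in this thread;
>it is included because its description in nutch-default.xml says it is the
>delay applied when fetcher.threads.per.queue is greater than 1, so it is
>worth checking against the Nutch version in use.
>
>```xml
><!-- conf/nutch-site.xml (sketch; all values are assumptions) -->
><configuration>
>  <!-- More than one thread per host queue: robots.txt Crawl-Delay is ignored -->
>  <property>
>    <name>fetcher.threads.per.queue</name>
>    <value>2</value>
>  </property>
>  <!-- Delay in seconds between successive requests to the same server -->
>  <property>
>    <name>fetcher.server.delay</name>
>    <value>5.0</value>
>  </property>
>  <!-- Assumed to be the delay used when threads.per.queue is greater than 1 -->
>  <property>
>    <name>fetcher.server.min.delay</name>
>    <value>5.0</value>
>  </property>
></configuration>
>```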
>
>
>I also think we should change the contents of suffix_urlfilter as well,
>since our task is to collect as much data as we can from the three websites.
>
>
>Jiaxin
>
>
>On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari
><[email protected]> wrote:
>
>Hi,
>We are working on a project under Professor Chris Mattmann as part of
>the Information Retrieval course.
>We are trying to edit different properties to change politeness and do
>URL filtering.
>
>
>We are trying more than one thread, which makes the crawl impolite, but we
>are not sure how impolite it should be to get better results.
>Also, the URL filtering in suffix_urlfilter.xml blocks almost all image,
>audio, and video formats. Should we change that or not?
