Exactly, Jiaxin, great answer.

Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Jiaxin Ye <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, February 15, 2015 at 11:34 PM
To: "[email protected]" <[email protected]>
Subject: Re:

>Hi Swati,
>
>I am also a student in Prof. Mattmann's class. I think politeness
>depends on the crawl delay between requests to the same server. Usually
>the Crawl-Delay in robots.txt is set to 5 to 15 seconds. It's true that
>setting fetcher.threads.per.queue to a value greater than 1 causes the
>Crawl-Delay value from robots.txt to be ignored, but you can set
>fetcher.server.delay to 5 to 15 seconds to restore the spacing between
>successive requests.
>
>I also think we should change the content of suffix_urlfilter, since
>our task is to collect as much data as we can from the three websites.
>
>Jiaxin
>
>On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari <[email protected]> wrote:
>
>Hi,
>We are working on a project under Professor Chris Mattmann as part of
>the Information Retrieval course.
>We are trying to edit different properties to change politeness and do
>URL filtering.
>
>We are trying more than 1 thread, which makes the crawl impolite, but we
>are not sure how impolite it should be made for better results.
>Also, URL filtering blocks almost all image, audio, and video formats in
>suffix_urlfilter.xml. Should that be tampered with or not?
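
For reference, the fetcher settings discussed in this thread might look roughly like the following override in conf/nutch-site.xml. This is only a sketch: the values are illustrative, not recommendations. The fetcher.server.min.delay property is not mentioned anywhere in the thread; it is included on the assumption, based on the property descriptions in nutch-default.xml, that it is the delay consulted when fetcher.threads.per.queue is greater than 1. Please check the descriptions in your own Nutch version before relying on it.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- More than one thread per host queue: faster, but less polite. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>

  <!-- Delay in seconds between successive requests to the same server,
       chosen from the 5 to 15 second range suggested in the thread. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

  <!-- Assumption, not from the thread: minimum delay in seconds that is
       believed to apply when fetcher.threads.per.queue is above 1. -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```

Setting both delay properties to the same value keeps the spacing between requests consistent whichever one the fetcher ends up using.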

