I started a crawl on July 30 like this:

nohup time nutch crawl /usr/tmp/urls.txt -dir /usr/tmp/85sites -threads 20 -depth 10 -topN 103103

I included 85 different sites in the seed URLs.  However, my crawl has slowed
down to about one site every three seconds, which is the value I have for
fetcher.server.delay.  The console output looks something like this:

fetching http://www.topix.net/soccer-fifa/world-cup/2007/04/news-fixtures-results-match-reports-stats
fetching http://www.topix.net/city/tarpon-springs-fl/2007/04/city/tarpon-springs-fl
fetching http://sportsillustrated.cnn.com/si_online/scorecard/news/2002/07/02/sc/
fetching http://www.topix.net/classifieds/rockford-mi/WVZ850H4sIAAAAAAAAA2NhYzE0MjZhYWHhYOZmYGBgZQISpYmszEAqP7OAlQVIl6QWl4AFSjNTwPIlxWBuDpDLCqQLkvNTUrkEkjNLKvWL8pOz0/KLUnRzM0Fa88AyzqbBzkZOQcGBwQaebq5mFiCZgrLMFC5Z/eScxOLizLTM1JRiZL36BaklxWAri3K4BJEVARWU5rIBZRKTSzLz8wCKjZwMwAAAAA__
fetching http://www.topix.net/forum/football-players/steve-mcnair/TOAGFF3LREQQ0C40H/p4
fetching http://www.bbc.co.uk/wales/southeast/webguide/pages/books.shtml
fetching http://www.topix.net/football-players/dede-dorsey/2007/07/ready-to-roll
fetching http://www.bbc.co.uk/wales/raiseyourgame/preparation/
fetching http://www.topix.net/forum/news/terri-schiavo/TGNF3ITGCIOGCQG74/post14
fetching http://www.topix.net/forum/world/canada/TLDSECEPRGQCAOSJI/post11

Often when it hits topix.net, the site is slow to respond.  I purposely started
my seed with a large number of sites so that the crawl wouldn't get stuck at
this kind of low speed.  I anticipated that nutch would initiate many downloads
from other sites during its three-second window per site.  However, the horizon
seems to be fixated on references to topix.net, so it's really just slower than
ooze.
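
To put a rough number on that, here is a small Python sketch (not Nutch code)
that estimates how long a fetch list must take when each host can only be hit
once per delay window.  The function names and the sample list are made up for
illustration; the 3-second default is my fetcher.server.delay value, and the
model ignores threads, response time, and anything else the real fetcher does.

```python
from collections import Counter
from urllib.parse import urlparse


def host_counts(urls):
    """Count how many URLs in the list belong to each host."""
    return Counter(urlparse(u).netloc for u in urls)


def estimated_fetch_seconds(urls, delay_per_host=3.0):
    """Rough estimate of the time needed to fetch `urls`.

    Assumes one request per host per `delay_per_host` seconds, so the
    busiest host sets the pace no matter how many threads are running.
    """
    counts = host_counts(urls)
    if not counts:
        return 0.0
    return max(counts.values()) * delay_per_host


# Hypothetical fetch list with the same kind of skew as my console output.
sample = [
    "http://www.topix.net/city/tarpon-springs-fl/2007/04/city/tarpon-springs-fl",
    "http://www.topix.net/forum/football-players/steve-mcnair/TOAGFF3LREQQ0C40H/p4",
    "http://www.topix.net/football-players/dede-dorsey/2007/07/ready-to-roll",
    "http://www.topix.net/forum/world/canada/TLDSECEPRGQCAOSJI/post11",
    "http://www.bbc.co.uk/wales/raiseyourgame/preparation/",
    "http://sportsillustrated.cnn.com/si_online/scorecard/news/2002/07/02/sc/",
]

if __name__ == "__main__":
    print(host_counts(sample).most_common())
    print(estimated_fetch_seconds(sample))
```

If that model is anywhere near right, a fetch list that is mostly topix.net
runs at roughly one page per delay window regardless of -threads.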

Should I have set the topN lower?  Depth higher?  Why is it behaving this way?


      