I started a crawl on July 30 like this:

    nohup time nutch crawl /usr/tmp/urls.txt -dir /usr/tmp/85sites -threads 20 -depth 10 -topN 103103

I included 85 different sites in the seed URLs. However, my crawl has slowed down to about one page every three seconds, which is the value I have for fetcher.server.delay. The console output looks something like this:

fetching http://www.topix.net/soccer-fifa/world-cup/2007/04/news-fixtures-results-match-reports-stats
fetching http://www.topix.net/city/tarpon-springs-fl/2007/04/city/tarpon-springs-fl
fetching http://sportsillustrated.cnn.com/si_online/scorecard/news/2002/07/02/sc/
fetching http://www.topix.net/classifieds/rockford-mi/WVZ850H4sIAAAAAAAAA2NhYzE0MjZhYWHhYOZmYGBgZQISpYmszEAqP7OAlQVIl6QWl4AFSjNTwPIlxWBuDpDLCqQLkvNTUrkEkjNLKvWL8pOz0/KLUnRzM0Fa88AyzqbBzkZOQcGBwQaebq5mFiCZgrLMFC5Z/eScxOLizLTM1JRiZL36BaklxWAri3K4BJEVARWU5rIBZRKTSzLz8wCKjZwMwAAAAA__
fetching http://www.topix.net/forum/football-players/steve-mcnair/TOAGFF3LREQQ0C40H/p4
fetching http://www.bbc.co.uk/wales/southeast/webguide/pages/books.shtml
fetching http://www.topix.net/football-players/dede-dorsey/2007/07/ready-to-roll
fetching http://www.bbc.co.uk/wales/raiseyourgame/preparation/
fetching http://www.topix.net/forum/news/terri-schiavo/TGNF3ITGCIOGCQG74/post14
fetching http://www.topix.net/forum/world/canada/TLDSECEPRGQCAOSJI/post11

Often when it hits topix.net, the server is slow to respond. I purposely started my seed list with a large number of sites so that the crawl wouldn't get stuck at this kind of low speed. I anticipated that Nutch would initiate many downloads from other sites during its three-second window per site. However, the horizon seems to be dominated by references to topix.net, so the crawl is slower than ooze.

Should I have set topN lower? Depth higher? Why is it behaving this way?
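
To put a rough number on what I'm seeing, here is a back-of-envelope sketch in Python. It is not Nutch code. It assumes that each host is fetched one request at a time and that the delay is counted after each response. The function name estimate_fetch_seconds and the avg_fetch figure are made up for illustration, and the URL paths are placeholders. Only the host mix (7 topix, 2 bbc, 1 si), the 3-second delay, and the 20 threads come from the crawl above.

```python
from collections import Counter
from urllib.parse import urlparse


def estimate_fetch_seconds(urls, server_delay=3.0, threads=20, avg_fetch=1.0):
    """Rough lower bound, in seconds, on fetching `urls` politely.

    Assumptions (mine, not taken from Nutch): one request at a time per
    host, `server_delay` seconds of waiting after every response, and a
    hypothetical average response time of `avg_fetch` seconds.
    """
    if not urls:
        return 0.0
    per_host = Counter(urlparse(u).netloc for u in urls)
    # The busiest host is fetched strictly one page after another.
    busiest = max(per_host.values())
    host_bound = busiest * (avg_fetch + server_delay)
    # Best case if every thread were busy the whole time.
    thread_bound = len(urls) * avg_fetch / threads
    return max(host_bound, thread_bound)


# Host mix copied from the ten console lines above; paths are placeholders.
sample = (
    ["http://www.topix.net/page%d" % i for i in range(7)]
    + ["http://www.bbc.co.uk/page%d" % i for i in range(2)]
    + ["http://sportsillustrated.cnn.com/page0"]
)
print(estimate_fetch_seconds(sample))
```

Under these assumptions the busiest host sets the pace: once one site makes up most of the fetch list, the number of threads hardly matters, which matches the slowdown described above.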
