Otis,

Thanks, I'll try the generate.optimal.url.ordering property on my next crawl. As for JobStream.py, see my reply to Dennis.
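
As a rough illustration of the reordering idea in the quoted message below (sorting on the reversed URL before splitting), here is a minimal Python sketch. It is not taken from JobStream.py: the function names and sample URLs are hypothetical, and it assumes the URL list fits in memory.

```python
from urllib.parse import urlparse


def sort_by_reversed_url(urls):
    """Order URLs by their reversed string, so the sort key starts with
    the end of the path rather than with the scheme and host."""
    return sorted(urls, key=lambda url: url[::-1])


def split_contiguous(urls, n_splits):
    """Cut the list into n_splits contiguous chunks of near-equal size."""
    size, extra = divmod(len(urls), n_splits)
    splits, start = [], 0
    for i in range(n_splits):
        end = start + size + (1 if i < extra else 0)
        splits.append(urls[start:end])
        start = end
    return splits


def hosts_in(split):
    """Return the set of hosts that appear in one split."""
    return {urlparse(url).netloc for url in split}


if __name__ == "__main__":
    # Hypothetical sample, already in alphabetical order as in a db dump.
    sample = [
        "http://a.com/1",
        "http://a.com/2",
        "http://b.com/1",
        "http://b.com/2",
    ]
    for label, ordering in (("alphabetical", sample),
                            ("reversed", sort_by_reversed_url(sample))):
        for part in split_contiguous(ordering, 2):
            print(label, sorted(hosts_in(part)))
```

With this sample, the alphabetical order puts one host in each split, while the reversed-URL order puts both hosts in each split. This is only a heuristic: URLs that share a long common suffix will still end up next to each other.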

Regards,
Svein Willassen

2008/4/15, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:
>
> Svein,
> Could you update the page with JobStream.py with your corrections?
>
> As for URL generation, have you seen
> https://issues.apache.org/jira/browse/NUTCH-570 ?
>
> Want to try it and see if it improves things for you?
>
> Otis
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Svein Yngvar Willassen <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Sent: Tuesday, April 15, 2008 7:57:14 AM
> Subject: JobStream.py
>
> I don't know how many are actually using this script at the moment, but I
> found it needed some modifications to run with nutch-1.0-dev:
>
> - "DB_unfetched" must be changed to "db_unfetched", as this seems to be
>   the current format in the dump
> - "-rm" must be changed to "-rmr", otherwise directories in the DFS won't
>   be deleted and the script won't function.
>
> I also notice that the url list from the db is split directly without any
> sorting. Isn't this suboptimal? The urllist as it is dumped from the
> database is sorted alphabetically. This means that all urls from the same
> host will end up in the same split. This translates into a slower crawl,
> because the fetcher will hit the "politeness" limitation quite often. So,
> from my point of view, it would be better if the list were sorted on some
> criterion ensuring random distribution of hosts, for example sorting on
> the url reversed.
> Do you agree or am I overlooking something here?
>
> --
> Best Regards,
>
> Svein Y. Willassen
> http://willassen.blogspot.com/
>

--
Best Regards,

Svein Y. Willassen
http://willassen.blogspot.com/