Re: JobStream.py

Svein Yngvar Willassen Tue, 15 Apr 2008 10:31:13 -0700

Otis,

Thanks, I'll try the generate.optimal.url.ordering property on my next
crawl.  As for JobStream.py, see my reply to Dennis.


Regards,

Svein Willassen


2008/4/15, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:
>
> Svein,
> Could you update the page with JobStream.py with your corrections?
>
> As for URL generation, have you seen
> https://issues.apache.org/jira/browse/NUTCH-570 ?
>
> Want to try it and see if it improves things for you?
>
> Otis
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Svein Yngvar Willassen <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Sent: Tuesday, April 15, 2008 7:57:14 AM
> Subject: JobStream.py
>
> I don't know how many are actually using this script at the moment, but I
> found it needed some modifications to run with nutch-1.0-dev:
>
> - "DB_unfetched" must be changed to "db_unfetched", as this seems to be the
> current format in the dump
> - "-rm" must be changed to "-rmr", otherwise directories in the DFS won't be
> deleted and the script won't function.
>
> I also notice that the url list from the db is split directly without any
> sorting. Isn't this suboptimal?  The urllist as it is dumped from the
> database is sorted alphabetically. This means that all urls from the same
> host will end up in the same split. This translates into a slower crawl,
> because the fetcher will hit the "politeness" limitation quite often. So,
> from my point of view, it would be better if the list were sorted on some
> criterion ensuring random distribution of hosts, for example sorting on the
> url reversed.
> Do you agree or am I overlooking something here?
>
> --
> Best Regards,
>
> Svein Y. Willassen
> http://willassen.blogspot.com/
>
>
>
>
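The reversed-URL ordering suggested in the quoted message can be sketched
roughly as below. This is only an illustration under assumptions: the helper
names (reversed_url_order, split_contiguous, hosts_per_chunk) and the sample
URLs are hypothetical, and the contiguous split is a guess at how a script
might cut the list, not a copy of what JobStream.py actually does.

```python
from urllib.parse import urlparse


def reversed_url_order(urls):
    """Sort URLs by their reversed string, so the host no longer leads the key."""
    return sorted(urls, key=lambda u: u[::-1])


def split_contiguous(urls, parts):
    """Cut the list into `parts` contiguous chunks of near-equal size.

    Assumption: the crawl script splits the list in order, without shuffling.
    """
    size, extra = divmod(len(urls), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(urls[start:end])
        start = end
    return chunks


def hosts_per_chunk(chunks):
    """Count the distinct hosts that appear in each chunk."""
    return [len({urlparse(u).netloc for u in chunk}) for chunk in chunks]


# Hypothetical sample data: two hosts with two pages each.
urls = [
    "http://a.example/1.html", "http://a.example/2.html",
    "http://b.example/1.html", "http://b.example/2.html",
]

alphabetical = split_contiguous(sorted(urls), 2)
by_reversed = split_contiguous(reversed_url_order(urls), 2)

print(hosts_per_chunk(alphabetical))  # one host per split
print(hosts_per_chunk(by_reversed))   # both hosts in every split
```

With the alphabetical order each split holds a single host, so a fetcher
working on one split would keep waiting on the politeness delay. With the
reversed order each split mixes both hosts. On real data the spread depends on
how varied the URL endings are, so this is not a guarantee of even
distribution.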


-- 
Best Regards,

Svein Y. Willassen
http://willassen.blogspot.com/
