Re: The numFetchers option

Andrzej Bialecki Sun, 08 Mar 2009 09:39:53 -0700

[email protected] wrote:

On Mar 8, 2009 11:13am, Andrzej Bialecki <[email protected]> wrote:
Michael Chan wrote:
On Fri, Feb 27, 2009 at 5:14 PM, Andrzej Bialecki <[email protected]> wrote:
Michael Chan wrote:
Hi,
I'm trying to generate multiple segments so that I can run several fetching
tasks on a *single* machine. This is just to reduce the effort needed to
refetch after a crash. Is the -numFetchers option still available in 0.9?
When I use -numFetchers 4, it seems to be ignored and the generator
generates one partition. Has it been deprecated? If so, is there an
alternative?
The numFetchers option is poorly named - it still works with the current
code but not in the same way as with Nutch 0.7: now it determines the number
of fetching tasks, and this happens ONLY when you run in distributed mode
(on a Hadoop cluster). In local mode it has no effect.
Currently there is no support for generating multiple segments in one go.
However, if you set generator.update.crawldb to true, you can generate
multiple segments in multiple runs of Generator, and then fetch / update
these segments in arbitrary order.
Is it recommended to run several fetchers using these segments on a single
machine at once? Thanks.
It's not recommended - if you run everything on a single machine it's better to increase the number of threads. If your machine can take the load you could try to run multiple fetchers at once, but it consumes more resources than 1 fetcher using more threads. Usually the load is too high (in terms of CPU, disk IO and network traffic) on a single machine, which is why it's better to set up a cluster.
But in principle, if several fetchers are run on a single machine, would the filesystem be corrupted, e.g., by several fetchers or parsers writing to it at once?

No, it wouldn't become corrupted. If you really started writing to the same locations (e.g. by starting two jobs fetching the same segment), one of the jobs would throw an exception (it would discover that output files it's going to create already exist), and the other job would continue.
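
To illustrate the behaviour described above, here is a minimal Python sketch of a job that refuses to start when its output location already exists. This is not Nutch or Hadoop code: start_job, OutputExistsError and run_two_jobs_on_same_segment are hypothetical names, and the "crawl_fetch" directory name is used only as an example of an output path inside a segment.

```python
import os
import tempfile


class OutputExistsError(Exception):
    """Raised when a job's output location is already taken (hypothetical)."""


def start_job(output_dir: str) -> str:
    """Claim output_dir for a job, failing if it already exists."""
    try:
        # os.mkdir raises FileExistsError if the directory is already there,
        # so only the first caller can claim the path.
        os.mkdir(output_dir)
    except FileExistsError as exc:
        raise OutputExistsError(f"output already exists: {output_dir}") from exc
    return output_dir


def run_two_jobs_on_same_segment() -> list:
    """Start two jobs against the same output path and record what happens."""
    outcomes = []
    with tempfile.TemporaryDirectory() as segment:
        # Illustrative output path inside a throwaway "segment" directory.
        out = os.path.join(segment, "crawl_fetch")
        for job in ("job-1", "job-2"):
            try:
                start_job(out)
                outcomes.append((job, "running"))
            except OutputExistsError:
                outcomes.append((job, "failed"))
    return outcomes
```

In this sketch the first job claims the output path and carries on, while the second one fails up front instead of overwriting anything, which is the reason the data is not corrupted.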



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
