Robin Haswell wrote:
On Fri, 2006-12-08 at 11:01 +0100, Andrzej Bialecki wrote:
Ad 1.

I suspect that it's sorting the reduce output now ... in 0.8.x this operation has poor performance, especially when run on a single server. So, I advise patience, and giving as much CPU and RAM as possible. For the future, it's also much much better to run the fetcher in non-parsing mode and run "nutch parse" afterwards as a separate step.

Okay, I'll give it a while and see what happens. Is it possible to get
any information on what's going on? I'm running 0.8 pretty much
out-of-the-box on a single server. I've seen people mentioning phases of
Hadoop - can it tell me what's going on?

This should be shown in the logs: the map xx% or reduce xx% progress is printed there.

The reduce phase consists of copying map outputs (reduce 0-33%), then sorting them (33%-66%), which is where most of the CPU, disk I/O and time is spent, and finally copying the sorted outputs to form the final result (the remaining 66%-100%).
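
As a rough illustration of the percentage ranges above, here is a minimal Python sketch that maps a reduce percentage to the phase it falls in. The log fragment it parses ("reduce NN%") is an assumption based on the "reduce xx%" wording above; the exact log line format may differ between versions. The function names and phase labels are hypothetical, chosen only for this example.

```python
import re

# Assumed log fragment such as "map 100% reduce 45%"; the real format may differ.
REDUCE_PROGRESS = re.compile(r"reduce\s+(\d{1,3})%")


def reduce_phase(percent):
    """Return the reduce sub-phase for a given reduce progress percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent < 33:
        return "copy"   # copying map outputs
    if percent < 66:
        return "sort"   # sorting: most CPU, disk I/O and time goes here
    return "final"      # copying sorted outputs to form the final result


def phase_from_log_line(line):
    """Extract the reduce percentage from a log line, or return None."""
    match = REDUCE_PROGRESS.search(line)
    if match is None:
        return None
    return reduce_phase(int(match.group(1)))
```

For example, a job that sits at "reduce 45%" for a long time would be reported as being in the sort phase, which matches the slow step described above.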

You can also do a kill -SIGQUIT <pid> to get a thread dump - you will be able to see what the threads are really doing.
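
If you would rather trigger the dump from a script, the same signal can be sent from Python on a Unix-like system. This is only a sketch: the function name and the injectable `send` parameter are hypothetical conveniences, and the pid is whatever your Java process happens to be. The JVM prints the thread dump to its own standard output, not to the caller.

```python
import os
import signal


def request_thread_dump(pid, send=os.kill):
    """Send SIGQUIT to a process; a JVM responds by printing a thread dump."""
    if pid <= 0:
        # pid 0 or a negative pid would signal a whole process group.
        raise ValueError("pid must be a positive process id")
    send(pid, signal.SIGQUIT)
```

Calling `request_thread_dump(12345)` (with your real pid) is equivalent to the kill command above.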

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

