Doug Cook wrote:
I've been planning to spend some time looking at this, but haven't gotten
round to it yet -- I see the same (serious) performance problems on a
single-machine setup: reduce takes quite a bit longer than the fetch (map)
operation in my case, and this is on a very fast 4-CPU machine with a ton of
memory. It just doesn't seem like it should take this long. I'm using 0.8 +
some patches & local mods.

If you find some things, please let me know. Likewise, when I get round to
it, I will post my findings.

I was talking about this slowness months ago, so I'm glad someone else has the same problems. We also run on a single machine, and the reduce task takes hours to complete. The funny thing is that the CPU is loaded at 100%, yet when we run searches on this server there is no difference in search speed. Still, it would be great if things went faster.

When fetching I get 20 to 30 pages per second, but then I have to wait for the reduce task to finish. I tried debug logging, and the only thing I can see is a gap of about 1 to 3 seconds between reduce log messages. I know that map/reduce is meant to be used across multiple nodes.
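One hedged thought, not a verified fix: under Hadoop's local job runner the reduce-side sort/merge is disk-bound, and the 0.x-era sort buffers are small by default. The snippet below is only a sketch using Hadoop 0.x property names for conf/hadoop-site.xml (as bundled with Nutch 0.8); the values are illustrative and the defaults vary by version:

  <property>
    <name>io.sort.mb</name>
    <!-- in-memory buffer used while sorting map output; stock default is around 100 MB -->
    <value>200</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <!-- number of spill files merged in one pass; stock default is 10 -->
    <value>50</value>
  </property>

In local mode everything runs in one JVM, so the heap would have to be raised on the client side (e.g. the NUTCH_HEAPSIZE variable read by bin/nutch) rather than via mapred.child.java.opts, which only applies to separately spawned task JVMs.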

regards

Uros
Thanks,

Doug



AJ Chen-2 wrote:
Sorry for repeating this question, but I have to find a solution; otherwise
the crawling is too slow to be practical. I'm using nutch 0.9-dev on one
Linux server to crawl millions of pages. The fetching itself is reasonable,
but the map-reduce operations are killing the performance. For example,
fetching takes 10 hours and map-reduce also takes 10 hours, which makes the
overall performance very slow. Can anyone share experience on how to speed
up map-reduce for single-server crawling? A single server uses the local
file system, so it should spend very little time doing map and reduce,
shouldn't it?

Thanks,
--
AJ Chen, PhD
http://web2express.org
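Regarding the single-server question above, one hedged avenue (property names are Hadoop 0.x-era; host/port values are purely illustrative): with mapred.job.tracker left at "local", Hadoop's LocalJobRunner runs the whole job serially in a single JVM with a single reduce, so extra CPUs sit idle. A pseudo-distributed setup on the same box lets several task JVMs run in parallel. A minimal conf/hadoop-site.xml sketch:

  <configuration>
    <property>
      <name>fs.default.name</name>
      <!-- single-node DFS on the same machine; later Hadoop releases expect an hdfs:// URI here -->
      <value>localhost:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <!-- any value other than "local" switches off the serial LocalJobRunner -->
      <value>localhost:9001</value>
    </property>
    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>4</value>
    </property>
  </configuration>

Whether this actually beats the local runner for a one-box crawl depends on how much of the time goes to sorting versus DFS overhead, so treat it as something to measure rather than a recommendation.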



