brainstorm wrote:
On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
brainstorm wrote:

This is one example crawled segment:

/user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000

As you can see, just one part-NNNNN file is generated... in the conf file
(nutch-site.xml), mapred.map.tasks is set to 2 (the default value, as
suggested in previous emails).
First of all - for a 7-node cluster, mapred.map.tasks should be set to at
least something around 23 or 31, or even higher, and the number of reduce
tasks to e.g. 11.
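(For reference, a minimal sketch of where those totals could go - assuming they are set via the standard mapred.map.tasks / mapred.reduce.tasks properties in nutch-site.xml or hadoop-site.xml, with the values suggested above:)

  <!-- job-wide task totals; values are just the ones suggested above -->
  <property>
    <name>mapred.map.tasks</name>
    <value>23</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>
  </property>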



I see, now it makes more sense to me than just assigning 2 maps by
default as suggested before... then, according to:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces

Maps:

Given:
64MB DFS blocks
500MB RAM per node
500MB for the hadoop-env.sh HADOOP_HEAPSIZE variable (otherwise
OutOfMemoryError: Java heap space exceptions occur; see the hadoop-env.sh
sketch below)

31 maps... we'll see if it works. It would be cool to have a more
precise "formula" to calculate this number in the Nutch case. I assume
that "23 to 31 or higher" is empirically determined by you: thanks for
sharing your knowledge!
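(For completeness, a rough sketch of the heap setting mentioned above - assuming it refers to the standard HADOOP_HEAPSIZE variable in conf/hadoop-env.sh, which is given in MB:)

  # conf/hadoop-env.sh
  # Maximum heap for the Hadoop daemons (DataNode, TaskTracker, ...), in MB.
  # 500 matches the figure above; the stock default is 1000.
  export HADOOP_HEAPSIZE=500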

That's already described on the Wiki page that you mention above ...


Reduces:
1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) = 135

Is this number the total number of reduces running across the cluster nodes?

Hmm .. did you actually try running 11 simultaneous reduce tasks on each node? It very much depends on the CPU, the amount of available RAM and the heap size of each task (mapred.child.java.opts). My experience is that it takes beefy hardware to run more than ~4-5 reduce tasks per node - load avg is above 20, the CPU is pegged at 100% and the disks are thrashing. YMMV of course.
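(Roughly, these are the two knobs that interact here - sketched with the property names from the stock config of that era; depending on the Hadoop version the per-node maximum may instead be split into mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum:)

  <!-- hadoop-site.xml on each node: how many tasks one TaskTracker runs at once -->
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- heap handed to every spawned map/reduce task JVM (the Hadoop default is -Xmx200m);
       slots * this value has to fit into the node's physical RAM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>
  </property>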

Regarding the number - what you calculated is the upper bound of all possible simultaneous tasks, assuming you have 7 nodes and each will run 11 tasks at the same time. This is not what I meant - I meant that you should set the total number of reduces to 11 or so.

What that page doesn't discuss is that there is also some cost in job startup/finish, so there is a sweet-spot number somewhere that fits your current data size and your current cluster. In other words, it's better not to run too many reduces - just the right number so that individual sort operations run quickly and the tasks occupy most of the available slots.
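(To put rough numbers on that distinction, using the figures from this thread:)

  cluster capacity : 7 nodes * 4-5 reduce slots each = 28-35 reduces running at once
  reduces = ~11    : finishes in a single wave, each reduce sorts a modest chunk
  reduces = 135    : needs several waves, paying the per-task startup/finish cost each time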


In conclusion, as you predicted (and if the script is not horribly
broken), the non-DMOZ sample is quite homogeneous (there are lots of
URLs coming from auto-generated ad sites, for instance)... adding the
fact that *a lot* of them lead to "Unknown host" exceptions, the crawl
ends up being extremely slow.

But that does not solve the fact that only a few nodes are actually
fetching on the DMOZ-based crawl. So the next thing to try is to raise
mapred.map.tasks.maximum as you suggested, which should fix my issues... I
hope so :/

I suggest that you first try a value of 4-5 there, with #maps = 23 and #reduces = 7.
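(In config terms that would look roughly like this - a sketch only, with the same property-name caveats as above:)

  <!-- per node (hadoop-site.xml on the slaves) -->
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>4</value>
  </property>

  <!-- job-wide totals (nutch-site.xml / hadoop-site.xml) -->
  <property>
    <name>mapred.map.tasks</name>
    <value>23</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>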

Just to be sure ... are you sure you are running a distributed JobTracker? Can you see the JobTracker UI in the browser?
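(With a stock config the JobTracker web UI is served on port 50030 of the master - assuming the default mapred.job.tracker.http.address - i.e. roughly:)

  http://<jobtracker-host>:50030/jobtracker.jsp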

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
