brainstorm wrote:
On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
brainstorm wrote:

This is one example crawled segment:

/user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000

As you can see, just one part-NNNNN file is generated... in the conf file
(nutch-site.xml), mapred.map.tasks is set to 2 (the default value, as
suggested in previous emails).
First of all - for a 7-node cluster, mapred.map.tasks should be set to at
least something around 23 or 31, or even higher, and the number of reduce
tasks to e.g. 11.
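(For reference, a minimal sketch of where those totals could go - assuming they are set via the standard mapred.map.tasks / mapred.reduce.tasks properties in nutch-site.xml or hadoop-site.xml, with the values suggested above:)

  <!-- job-wide task totals; values are just the ones suggested above -->
  <property>
    <name>mapred.map.tasks</name>
    <value>23</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>
  </property>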



I see, now it makes more sense to me than just assigning 2 maps by
default as suggested before... then, according to:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces

Maps:

Given:
64MB DFS blocks
500MB RAM per node
500MB for the hadoop-env.sh HADOOP_HEAPSIZE variable (otherwise
OutOfMemoryError: Java heap space exceptions occur; see the hadoop-env.sh
sketch below)

31 maps... we'll see if it works. It would be cool to have a more
precise "formula" to calculate this number in the Nutch case. I assume
that "23 to 31 or higher" is empirically determined by you: thanks for
sharing your knowledge!
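(For completeness, a rough sketch of the heap setting mentioned above - assuming it refers to the standard HADOOP_HEAPSIZE variable in conf/hadoop-env.sh, which is given in MB:)

  # conf/hadoop-env.sh
  # Maximum heap for the Hadoop daemons (DataNode, TaskTracker, ...), in MB.
  # 500 matches the figure above; the stock default is 1000.
  export HADOOP_HEAPSIZE=500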

That's already described on the Wiki page that you mention above ...


Reduces:
1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) = 135

Is this number the total number of reduces running across the cluster nodes?

Hmm .. did you actually try running 11 simultaneous reduce tasks on each node? It very much depends on the CPU, the amount of available RAM and the heap size of each task (mapred.child.java.opts). My experience is that it takes beefy hardware to run more than ~4-5 reduce tasks per node - load avg is above 20, the CPU is pegged at 100% and the disks are thrashing. YMMV of course.
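(Roughly, these are the two knobs that interact here - sketched with the property names from the stock config of that era; depending on the Hadoop version the per-node maximum may instead be split into mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum:)

  <!-- hadoop-site.xml on each node: how many tasks one TaskTracker runs at once -->
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- heap handed to every spawned map/reduce task JVM (the Hadoop default is -Xmx200m);
       slots * this value has to fit into the node's physical RAM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>
  </property>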

Regarding the number - what you calculated is the upper bound of all possible simultaneous tasks, assuming you have 7 nodes and each will run 11 tasks at the same time. This is not what I meant - I meant that you should set the total number of reduces to 11 or so.

What that page doesn't discuss is that there is also some cost in job startup/finish, so there is a sweet-spot number somewhere that fits your current data size and your current cluster. In other words, it's better not to run too many reduces - just the right number so that individual sort operations run quickly and the tasks occupy most of the available slots.
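(To put rough numbers on that distinction, using the figures from this thread:)

  cluster capacity : 7 nodes * 4-5 reduce slots each = 28-35 reduces running at once
  reduces = ~11    : finishes in a single wave, each reduce sorts a modest chunk
  reduces = 135    : needs several waves, paying the per-task startup/finish cost each time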


In conclusion, as you predicted (and if the script is not horribly
broken), the non-DMOZ sample is quite homogeneous (there are lots of
URLs coming from auto-generated ad sites, for instance)... adding the
fact that *a lot* of them lead to "Unknown host" exceptions, the crawl
ends up being extremely slow.

But that does not solve the fact that only a few nodes are actually
fetching on the DMOZ-based crawl. So the next thing to try is to raise
mapred.map.tasks.maximum as you suggested, which should fix my issues... I
hope so :/

I suggest that you first try a value of 4-5 there, with #maps = 23 and #reduces = 7.
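(In config terms that would look roughly like this - a sketch only, with the same property-name caveats as above:)

  <!-- per node (hadoop-site.xml on the slaves) -->
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>4</value>
  </property>

  <!-- job-wide totals (nutch-site.xml / hadoop-site.xml) -->
  <property>
    <name>mapred.map.tasks</name>
    <value>23</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>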

Just to be sure ... are you sure you are running a distributed JobTracker? Can you see the JobTracker UI in the browser?
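(With a stock config the JobTracker web UI is served on port 50030 of the master - assuming the default mapred.job.tracker.http.address - i.e. roughly:)

  http://<jobtracker-host>:50030/jobtracker.jsp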

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
