At last, fixed, thanks to Andrzej! Fellow nutchers, please revise your
hadoop-site.xml file, especially these settings:

mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

You should set these values to something like 4 maps and 2 reduces per node,

*and*

mapred.map.tasks
mapred.reduce.tasks

You should set these values to something like 23 maps and 13 reduces in
total. Assuming you have an 8-node cluster ;)
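For reference, this is roughly what those four properties would look like
inside the <configuration> element of hadoop-site.xml - a minimal sketch
using the values suggested above, which you should of course tune to your
own hardware:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.map.tasks</name>
      <value>23</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>13</value>
    </property>

The two *.maximum settings cap how many map/reduce tasks each tasktracker
runs concurrently, while mapred.map.tasks and mapred.reduce.tasks are hints
for the total number of tasks per job.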
Regards,
Roman

On Thu, Aug 14, 2008 at 9:43 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> Sorry for the late reply... summer power outages in the building
> prevented me from running more tests on the cluster, now I'm back
> online... replying below.
>
> On Mon, Aug 11, 2008 at 5:59 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> brainstorm wrote:
>>>
>>> On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <[EMAIL PROTECTED]>
>>> wrote:
>>>>
>>>> brainstorm wrote:
>>>>
>>>>> This is one example crawled segment:
>>>>>
>>>>> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
>>>>>
>>>>> As you can see, just one part-NNNNN file is generated... in the conf
>>>>> file (nutch-site.xml) mapred.map.tasks is set to 2 (the default
>>>>> value, as suggested in previous emails).
>>>>
>>>> First of all - for a 7-node cluster, mapred.map.tasks should be set
>>>> at least to something around 23 or 31 or even higher, and the number
>>>> of reduce tasks to e.g. 11.
>>>
>>> I see, now it makes more sense to me than just assigning 2 maps by
>>> default as suggested before... then, according to:
>>>
>>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>>
>>> Maps:
>>>
>>> Given:
>>> 64MB DFS blocks
>>> 500MB RAM per node
>>> 500MB in the hadoop-env.sh HEAPSIZE variable (otherwise "Java heap
>>> space" OutOfMemoryErrors occur)
>>>
>>> 31 maps... we'll see if it works. It would be cool to have a more
>>> precise "formula" to calculate this number in the Nutch case. I assume
>>> that "23 to 31 or higher" was determined empirically by you: thanks
>>> for sharing your knowledge!
>>
>> That's already described on the wiki page that you mention above...
>>
>>> Reduces:
>>>
>>> 1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) = 135
>>>
>>> Is this number the total number of reduces running across the cluster
>>> nodes?
>>
>> Hmm... did you actually try running 11 simultaneous reduce tasks on
>> each node? It very much depends on the CPU, the amount of available RAM
>> and the heap size of each task (mapred.child.java.opts). My experience
>> is that it takes beefy hardware to run more than ~4-5 reduce tasks per
>> node - load avg goes above 20, the CPU is pegged at 100% and the disks
>> are thrashing. YMMV, of course.
>>
>> Regarding the number - what you calculated is the upper bound of all
>> possible simultaneous tasks, assuming you have 7 nodes and each will
>> run 11 tasks at the same time. This is not what I meant - I meant that
>> you should set the total number of reduces to 11 or so. What that page
>> doesn't discuss is that there is also some cost in job startup/finish,
>> so there is a sweet-spot number somewhere that fits your current data
>> size and your current cluster. In other words, it's better not to run
>> too many reduces - just the right number, so that individual sort
>> operations run quickly and tasks occupy most of the available slots.
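To make that distinction concrete, a quick worked comparison (a sketch
only, assuming the 7-node cluster and the per-node maxima discussed above):

    wiki formula (upper bound of simultaneous reduce *slots*):
        ceil(1.75 * 7 nodes * 11 slots/node) = ceil(134.75) = 135

    what is actually being suggested here:
        mapred.tasktracker.reduce.tasks.maximum = 4-5 per node
            -> 28-35 reduce slots cluster-wide
        mapred.reduce.tasks = ~11 in total for the job

The first figure sizes the cluster's theoretical capacity; the second is
the per-job task count, which should fit comfortably inside that capacity.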
>>> In conclusion, as you predicted (and if the script is not horribly
>>> broken), the non-DMOZ sample is quite homogeneous (there are lots of
>>> URLs coming from auto-generated ad sites, for instance)... adding the
>>> fact that *a lot* of them lead to "unknown host" exceptions, the
>>> crawl ends up being extremely slow.
>>>
>>> But that does not solve the fact that few nodes are actually fetching
>>> on the DMOZ-based crawl. So the next thing to try is to raise
>>> mapred.tasktracker.map.tasks.maximum as you suggested, which should
>>> fix my issues... I hope so :/
>>
>> I suggest that you first try a value of 4-5, #maps = 23 and
>> #reduces = 7.
>>
>> Just to be sure... are you sure you are running a distributed
>> JobTracker? Can you see the JobTracker UI in the browser?
>
> Yes, the distributed JobTracker is running (full cluster mode), and I
> can see all the tasks via :50030... but I'm getting the same results
> with your maps/reduces values: just two nodes are fetching.
>
> Could it be that, since the DMOZ url input file is only 31KB, it is not
> being split across all nodes because of the 64MB DFS block size? (just
> one block "slot" for a 31KB file)... just wondering :/
>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
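A closing note on that last question: with the FileInputFormat of that
Hadoop generation, the number of splits is not simply one per DFS block -
the requested map count also enters the computation. Roughly (a sketch of
the classic split-size logic, not the literal Hadoop source):

    goal_size  = total_size / requested_maps = 31KB / 23 ~ 1.3KB
    split_size = max(min_split_size, min(goal_size, block_size))

So even a 31KB seed list can be divided among many map tasks once
mapred.map.tasks is raised well above the default of 2, which is consistent
with the settings that finally fixed the problem at the top of this thread.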
