Sorry for the late reply... summer power outages in the building
prevented me from running more tests on the cluster. Now I'm back
online... replying below.

On Mon, Aug 11, 2008 at 5:59 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> brainstorm wrote:
>>
>> On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>>>
>>> brainstorm wrote:
>>>
>>>> This is one example crawled segment:
>>>>
>>>> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
>>>>
>>>> As you see, just one part-NNNN file is generated... in the conf file
>>>> (nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
>>>> suggested in previous emails).
>>>
>>> First of all - for a 7-node cluster, mapred.map.tasks should be set to
>>> at least something around 23 or 31, or even higher, and the number of
>>> reduce
>>> tasks to e.g. 11.
>>
>>
>>
>> I see, that makes more sense to me than just assigning 2 maps by
>> default as suggested before... then, according to:
>>
>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>
>> Maps:
>>
>> Given:
>> 64MB DFS blocks
>> 500MB RAM per node
>> 500MB for the hadoop-env.sh HEAPSIZE variable (otherwise OutOfMemoryError:
>> Java heap space exceptions occur)
>>
>> 31 maps... we'll see if it works. It would be cool to have a more
>> precise "formula" to calculate this number in the Nutch case. I assume
>> that "23 to 31 or higher" is empirically determined by you: thanks for
>> sharing your knowledge!
>
> That's already described on the Wiki page that you mention above ...
>
>
>> Reduces:
>> 1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) =
>> 135
>>
>> Is this number the total number of reduces running across the cluster
>> nodes?
>
> Hmm .. did you actually try running 11 simultaneous reduce tasks on each
> node? It very much depends on the CPU, the amount of available RAM and the
> heapsize of each task (mapred.child.java.opts). My experience is that it
> takes a beefy hardware to run more than ~4-5 reduce tasks per node - load
> avg is above 20, CPU is pegged at 100% and disks are thrashing. YMMV of
> course.
>
> Regarding the number - what you calculated is the upper bound of all
> possible simultaneous tasks, assuming you have 7 nodes and each will run 11
> tasks at the same time. This is not what I meant - I meant that you should
> set the total number of reduces to 11 or so. What that page doesn't discuss
> is that there is also some cost in job startup / finish, so there is a sweet
> spot number somewhere that fits your current data size and your current
> cluster. In other words, it's better not to run too many reduces, just the
> right number so that individual sort operations run quickly, and tasks
> occupy most of the available slots.
>
>
>> In conclusion, as you predicted (and if the script is not horribly
>> broken), the non-DMOZ sample is quite homogeneous (there are lots of
>> URLs coming from auto-generated ad sites, for instance)... add to that
>> the fact that *a lot* of them lead to UnknownHostExceptions, and the
>> crawl ends up being extremely slow.
>>
>> But that does not explain why only a few nodes are actually fetching
>> on the DMOZ-based crawl. So the next thing to try is to raise
>> mapred.map.tasks.maximum as you suggested, which should fix my issues...
>> I hope so :/
>
> I suggest that you first try a value of 4-5, #maps = 23, and #reduces = 7.
>
> Just to be sure ... are you sure you are running a distributed JobTracker?
> Can you see the JobTracker UI in the browser?



Yes, the distributed JobTracker is running (full cluster mode); I can see
all the tasks via :50030... but I'm getting the same results with your
maps/reduces values: just two nodes are fetching.
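
For reference, this is more or less what I put in nutch-site.xml for this
run, with the values you suggested (I'm still using the single
mapred.tasktracker.tasks.maximum property from the wiki page, so take it
as a sketch rather than my exact file):

  <!-- per-node limit on concurrent tasks, as you suggested (4-5) -->
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>4</value>
  </property>

  <!-- total maps/reduces for the job -->
  <property>
    <name>mapred.map.tasks</name>
    <value>23</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>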

Could it be that, given the size of the DMOZ URL input file (31KB), it is
not being split across all the nodes because of the 64MB DFS block size?
(just one block "slot" for a 31KB file)... just wondering :/



> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
