Hello, I still have limited knowledge of Nutch, but I can share some of my
experience, as I am crawling a similar number of URLs.

On Nov 8, 2007 7:46 PM, Daniel Clark <[EMAIL PROTECTED]> wrote:

> I have a nine box cluster using hadoop and I want to get the optimum
> performance.  I'm crawling 5 million sites.  There are three settings in
> the
> hadoop-site.xml that I'm not clear on.  Please, help.
>
>
>
> Mapred Map & Reduce Tasks
>
> ========================
>
> I have the following based on the description note, but the wiki said to
> use
> multiples of the number of slave hosts.  Can I up this to 36 or even more
> to
> speed up my crawl?  What is recommended?
>
>
>
> <property>
>
>  <name>mapred.map.tasks</name>
>  <value>9</value>
>  <description>
>    define mapred.map tasks to be number of slave hosts
>  </description>
> </property>
>

After some trial and error I found that setting this to a high number helps
a lot when fetching a large number of URLs. Map tasks fail quite often, and
with many small tasks only a small part of the data needs to be refetched
when one does. It also seems that the memory limit can become a problem when
there are only a few map tasks. I've set it to 99 on a cluster of 3 machines.
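
For illustration, a setting along those lines would look roughly like this
in hadoop-site.xml. The value 99 is just what worked on the 3-machine
cluster mentioned above, not a general recommendation, and the description
text is illustrative wording:

<!-- illustrative sketch: 99 map tasks on a 3-machine cluster -->
<property>
  <name>mapred.map.tasks</name>
  <value>99</value>
  <description>
    many more map tasks than slave hosts, so that a failed task
    only forces a small part of the data to be refetched
  </description>
</property>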


> <property>
>  <name>mapred.reduce.tasks</name>
>  <value>9</value>
>  <description>
>    define mapred.reduce tasks to be number of slave hosts
>  </description>
> </property>
>

I've set it to 2 * the number of nodes, because all my machines have
dual-core processors and each task runs on a single thread.
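
As a sketch of that rule: with 3 dual-core machines it works out to
2 * 3 = 6 reduce tasks. On a nine-box cluster the same rule would give 18,
but only under the assumption that those boxes are dual-core as well. The
description text is illustrative wording:

<!-- illustrative sketch: 2 reduce tasks per node, 3 dual-core nodes -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
  <description>
    two reduce tasks per slave host, one per core
  </description>
</property>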


>
>
> Replication
>
> ===========
>
> The wiki said to use 2 or 3.  Why?  What is recommended for the best
> performance?
>
> <property>
>  <name>dfs.replication</name>
>  <value>2</value>
> </property>
>
This parameter defines how many copies of each block of data are stored on
the DFS. For raw performance 1 would be best, but then you risk losing your
index if one of the nodes fails.

>
> ~~~~~~~~~~~~~~~~~~~~~
>
> Daniel Clark, President
>
> DAC Systems, Inc.
>
>  (703) 403-0340
>
> ~~~~~~~~~~~~~~~~~~~~~
>
>
>
>


-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology
and Management
+48(17)8661277
