Thanks a lot for the detailed explanation, Jason, this was most definitely useful!

J.G.Konrad wrote:
I have upgraded to 0.19.2, but the capacity scheduler feature is also
available in 0.19.1. You will need to download the Hadoop common
package ( http://hadoop.apache.org/common/releases.html ) to get the
capacity scheduler jar. It is not included in the Nutch releases.

  After downloading you will want to copy the
contrib/capacity-scheduler/hadoop-0.19.x-capacity-scheduler.jar into
the lib directory where you have your Nutch code. There is also an
example configuration file that comes with the Hadoop package and is
quite self-explanatory ( conf/capacity-scheduler.xml ). The docs can
be found here:
http://hadoop.apache.org/common/docs/r0.19.1/capacity_scheduler.html

 The first thing to do is to set the scheduler and define the queues.
This is done in the hadoop-site.xml file that is used to start the
jobtracker. This is what mine looks like for defining two queues (plus
the default).

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>

<property>
  <name>mapred.queue.names</name>
  <value>nutchFetchCycle,nutchIndex,default</value>
</property>

The other file to add is conf/capacity-scheduler.xml. This is where
the properties of the queues are defined. Here is part of my scheduler
configuration:

  <property>
    <name>mapred.capacity-scheduler.queue.nutchFetchCycle.guaranteed-capacity</name>
    <value>50</value>
    <description>Percentage of the number of slots in the cluster that are
      guaranteed to be available for jobs in this queue.
    </description>
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.nutchIndex.guaranteed-capacity</name>
    <value>50</value>
    <description>Percentage of the number of slots in the cluster that are
      guaranteed to be available for jobs in this queue.
    </description>
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.default.guaranteed-capacity</name>
    <value>0</value>
    <description>Percentage of the number of slots in the cluster that are
      guaranteed to be available for jobs in this queue.
    </description>
  </property>
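The example file that ships with the contrib package also documents a
few other per-queue settings beyond guaranteed-capacity. For instance
(check the exact property names against the example file in your
release), a queue can be told to honor job priorities:

  <property>
    <name>mapred.capacity-scheduler.queue.nutchFetchCycle.supports-priority</name>
    <value>true</value>
    <description>If true, job priorities are taken into account when
      scheduling jobs in this queue.
    </description>
  </property>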

  You will need to restart your jobtracker in order for these changes
to be applied. You will be able to see the queues if you visit the web
interface of the jobtracker.

To use the different queues you will need to set the
mapred.job.queue.name property. To accomplish this I have two
directories, nutch-fetch and nutch-index. Each directory has its own
conf/hadoop-site.xml file with the different queue names ( I also use
a different number of map/reduce tasks in each ).

nutch-index:
  <property>
    <name>mapred.job.queue.name</name>
    <value>nutchIndex</value>
  </property>

nutch-fetch:
  <property>
    <name>mapred.job.queue.name</name>
    <value>nutchFetchCycle</value>
  </property>


When a job is started from one of the directories, it will be placed
in the corresponding queue and will be able to run simultaneously with
jobs in the other queue. For this to work there needs to be capacity
for at least 2 map and 2 reduce tasks, so that each queue is
guaranteed 1 of each ( in this example, since it's a 50/50
distribution ).
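To make the slot math concrete, here is a small illustrative sketch
(plain Python, not Nutch or Hadoop code; guaranteed_slots is a made-up
helper) of how the guaranteed-capacity percentages turn into whole
slots:

```python
# Illustrative only -- not Nutch/Hadoop code. guaranteed_slots is a
# made-up helper showing how a capacity percentage maps to whole slots.
def guaranteed_slots(total_slots, capacity_percent):
    # The scheduler can only guarantee whole slots, so round down.
    return int(total_slots * capacity_percent / 100)

# With a 50/50 split, each queue gets a guaranteed slot only once the
# cluster has at least 2 slots of that kind.
for total in (1, 2, 4):
    print(total, guaranteed_slots(total, 50))
# -> 1 0
#    2 1
#    4 2
```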

Good luck with your concurrent fetches and don't forget to set the
generate.update.crawldb property to 'true' so that you will generate
different fetch lists for each instance.
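For reference, that property can go in each instance's
conf/nutch-site.xml ( description paraphrased from memory -- check
nutch-default.xml for the exact wording ):

  <property>
    <name>generate.update.crawldb</name>
    <value>true</value>
    <description>If true, generate updates the CrawlDb so that URLs
      handed out to one fetch list are not selected again by a
      concurrent generate.
    </description>
  </property>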

Enjoy,
  Jason


On Thu, Dec 17, 2009 at 1:03 PM, Yves Petinot <[email protected]> wrote:
Jason,

that sounds really good! ... did you have to upgrade the default version of
Hadoop, or were you able to get the distro that comes with Nutch (0.19.1 for
me, which I assume is the standard) to accept it? If the latter worked for
you, can you take us through your configuration changes?

thanks a bunch ;-)

-y

J.G.Konrad wrote:
I have integrated the CapacityTaskScheduler into my Nutch 1.0 setup,
although not for doing concurrent fetching; it could be used for that
purpose, though. Using the capacity scheduler you set up two separate
queues and allocate half of the available resources (map/reduce tasks)
to each queue. Technically there are three queues in my setup, two
specialized ones and the 'default'. The capacity scheduler requires
the 'default' queue to be defined although you don't have to send any
jobs to it.

The only catch is that you are not guaranteed a symmetrical
distribution if you have multiple machines in the cluster. That may or
may not be an issue depending on your requirements.

-Jason


On Thu, Dec 17, 2009 at 11:00 AM, Yves Petinot <[email protected]> wrote:

Just one comment though: I think hadoop will serialize your jobs
anyhow, so you won't get a parallel execution of your hadoop jobs
unless you run them from different hardware.

I'm actually wondering if someone on this list has been able to use
Hadoop's Fair Scheduler (or any other scheduler for that matter). This
would definitely solve this problem (which I'm experiencing too). Is
it at all possible to change the default scheduler with Nutch 1.0 (and
the version of Hadoop that comes with it), or do we have to wait until
the next Hadoop upgrade?

hopefully someone on the list can shed some light on this issue,

cheers,

-y


MilleBii wrote:

I guess because of the different nutch-site.xml & url filters that you
want to use it won't work... but you could try installing nutch twice
and running the crawl/fetch/parse from those two locations, then
joining the segments to recreate a unified searchable index (make sure
you put all your segments under the same location).

Just one comment though: I think hadoop will serialize your jobs
anyhow, so you won't get a parallel execution of your hadoop jobs
unless you run them from different hardware.

2009/12/16 Christopher Bader <[email protected]>



Felix,

I've had trouble running multiple instances.  I would be interested in
hearing from anyone who has done it successfully.

CB


On Wed, Dec 16, 2009 at 4:26 PM, Felix Zimmermann <[email protected]>
wrote:



Hi,

I would like to run at least two instances of nutch at one time, ONLY
for crawling; one for very frequently updated sites and one for other
sites. Will the nutch instances get in trouble when running several
crawl scripts, especially with the nutch confdir variable?

Thanks!
Felix.