OK. I understand that Crawl.java is for experimental purposes only, and that in a real crawl cluster I should run Injector, Generator, Fetcher, ParseSegment, etc. separately. I also understand that the Fetcher runs as a single fetch job with a parallel, multithreaded architecture.

But what is the architecture of such an ensemble?

I have a 10-node cluster. How do I orchestrate such an ensemble across those 10 nodes using the above command-line tools separately?

For example, I have completed the Inject and Generate phases (producing 10 segments), executed from one script.

How do the 10 cluster nodes know that they must start, and which segment each one should fetch?
What is the right architectural point of view?

How do this work distribution and synchronization happen in enterprise Nutch clusters?

Then each node runs parse and updatedb against the crawldb and linkdb.
But again, how does the master node, which generates new partitions into new segments, know that all nodes have completed the fetch, parse, and update phases?

How is this loop closed for the next iteration?

What is the right architecture for such a distributed enterprise Nutch cluster?

1. The master node's algorithm for calling the Nutch command-line tools.
2. Synchronization from the master to the slaves.
3. The slave's execution trigger, its algorithm for calling the Nutch command-line tools, and how it reports completion status.
4. Synchronization of the master by slave events.

And from a technology point of view?
Cron on the master and slaves? Scheduled intervals on the master and slaves?
What technologies synchronize the two directions?
Master and slave shell scripts calling the Nutch command-line tools?

And how is all of this done right
(based on real enterprise Nutch cluster examples)?

With Best Regards,
Alexander Zazhigin

I have done a deep study of Apache Nutch (1.5.1).

I understood all the ideas laid out there except for one thing, which led me to contradictions.

The distributed architecture in your slides:
http://www.slideshare.net/abial/nutch-webscale-search-engine-toolkit

with its essence captured in this diagram:
http://mmcg.z52.ru/drupal/sites/default/files/Nutch_schema.png

describes multiple segments being generated and fetched in parallel.

But the main loop in Crawl.java, which generates and fetches only one segment per iteration:

Crawl.java is just a simple tool to get started without spending too much time learning the individual tools in Nutch. For crawling multiple segments in parallel you should use the individual tools.
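For reference, a minimal sketch of one crawl iteration driven by the individual tools rather than Crawl.java. The paths (crawl/crawldb, crawl/segments, crawl/linkdb), the urls seed directory, and the -topN/-threads values are illustrative assumptions; the segment lookup assumes a local filesystem, and on HDFS you would list the segments directory with "hadoop fs -ls" instead.

  #!/bin/bash
  # One crawl iteration with the individual Nutch 1.x tools (sketch).
  CRAWLDB=crawl/crawldb      # illustrative paths
  SEGMENTS=crawl/segments
  LINKDB=crawl/linkdb

  # Seed the crawldb from a directory of seed URL lists
  # (run once, before the first iteration).
  bin/nutch inject $CRAWLDB urls

  # Generate a new segment (fetchlist) from the crawldb.
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 100000

  # Pick the newest segment (local FS; use "hadoop fs -ls" on HDFS).
  SEGMENT=$SEGMENTS/$(ls "$SEGMENTS" | sort | tail -1)

  # Fetch and parse it; the Fetcher runs as a MapReduce job on the cluster.
  bin/nutch fetch $SEGMENT -threads 50
  bin/nutch parse $SEGMENT

  # Fold the results back into the crawldb and linkdb, then repeat the loop.
  bin/nutch updatedb $CRAWLDB $SEGMENT
  bin/nutch invertlinks $LINKDB $SEGMENT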


I understood that a single Fetcher is designed to work (map and reduce)
within one node.
How do I modify the above code to run multiple Fetchers over all generated
segments in parallel?

It's best done with the individual tools. There are many reasons why you should not use Crawl for anything other than quick tests.


Abstract question: how does Nutch really work on the cluster when fetching
segments in parallel?

It splits the fetchlist into N parts, and each part is fetched by a separate map task on the cluster, in parallel. Additionally, each Fetcher map task uses multiple threads, to further increase parallelism.
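To make that concrete, here is how the two levels of parallelism are typically controlled with Nutch 1.x options (the paths, segment name, -topN value and thread counts below are illustrative assumptions):

  # -numFetchers controls how many partitions the Generator writes, and hence
  # how many fetch map tasks run in parallel (e.g. one per node on 10 nodes).
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -numFetchers 10

  # Each fetch map task also runs many fetcher threads; -threads overrides the
  # fetcher.threads.fetch property from nutch-site.xml.
  bin/nutch fetch crawl/segments/20120801120000 -threads 50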

If you have spare capacity on your cluster, you can also start several Fetcher jobs in parallel - and here's where you would generate multiple segments to start several Fetcher jobs, each job fetching its own segment.
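A rough sketch of that pattern, assuming your Generator build supports the -maxNumSegments option; the segment names are made up, so list crawl/segments to get the real ones:

  # Generate up to 3 segments in a single Generator run.
  bin/nutch generate crawl/crawldb crawl/segments -topN 300000 \
      -maxNumSegments 3 -numFetchers 10

  # Start one Fetcher job per segment; each is an independent MapReduce job,
  # so they run concurrently as long as the cluster has spare slots.
  for seg in crawl/segments/20120801120000 \
             crawl/segments/20120801120001 \
             crawl/segments/20120801120002
  do
    bin/nutch fetch "$seg" -threads 50 &
  done
  wait   # block until all fetch jobs have finished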


Is there some Nutch configuration/execution trick on a real multi-node
cluster
(to work on the generated segments in parallel)?

Please see the command-line tools - generator, fetcher, parse, updatedb, updatelinkdb. They provide options to handle multiple segments.
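For example (a sketch using the same illustrative paths as above; in Nutch 1.x the linkdb update is done with the invertlinks tool), the -dir option lets updatedb and invertlinks consume every segment in one pass:

  # After all segments are fetched and parsed, update the databases once,
  # picking up every segment under crawl/segments via -dir.
  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments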

