OK. I understand that Crawl.java is for experimental purposes only, and that in a real crawl cluster I should run Injector, Generator, Fetcher, ParseSegment, etc. separately. I also understand that the Fetcher runs as a single fetch job with a parallel, multithreaded architecture.

But what is the architecture of such an ensemble?

I have a 10-node cluster. How do I orchestrate such an ensemble across those 10 nodes using the above command-line tools separately?

For example, I have completed the Inject and Generate phases (producing 10 segments), executed from one script.

How do the 10 cluster nodes know that they must start, and which segment each one should fetch?
What is the right architectural point of view?

How do this work distribution and synchronization happen in enterprise Nutch clusters?

Then each node runs parse and updatedb against the crawldb and linkdb.
But again, how does the master node, which generates new partitions into new segments, know that all nodes have completed the fetch, parse, and update phases?

How is this loop closed for the next iteration?

What is the right architecture for such a distributed enterprise Nutch cluster?

1. The master node's algorithm for calling the Nutch command-line tools.
2. Synchronization from the master to the slaves.
3. The slave's execution trigger, its algorithm for calling the Nutch command-line tools, and how it reports completion status.
4. Synchronization of the master by slave events.

And from a technology point of view?
Cron on the master and slaves? Scheduled intervals on the master and slaves?
What technologies synchronize the two directions?
Master and slave shell scripts calling the Nutch command-line tools?

And how is all of this done right
(based on real enterprise Nutch cluster examples)?

With Best Regards,
Alexander Zazhigin

I have done a deep study of Apache Nutch (1.5.1).

I understood all the ideas laid out there except for one thing, which led me to contradictions.

The distributed architecture in your slides:
http://www.slideshare.net/abial/nutch-webscale-search-engine-toolkit

with its essence captured in this diagram:
http://mmcg.z52.ru/drupal/sites/default/files/Nutch_schema.png

describes multiple segments being generated and fetched in parallel.

But the main loop in Crawl.java, which generates and fetches only one segment per iteration:

Crawl.java is just a simple tool to get started without spending too much time learning the individual tools in Nutch. For crawling multiple segments in parallel you should use the individual tools.
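For reference, a minimal sketch of one crawl iteration driven by the individual tools rather than Crawl.java. The paths (crawl/crawldb, crawl/segments, crawl/linkdb), the urls seed directory, and the -topN/-threads values are illustrative assumptions; the segment lookup assumes a local filesystem, and on HDFS you would list the segments directory with "hadoop fs -ls" instead.

  #!/bin/bash
  # One crawl iteration with the individual Nutch 1.x tools (sketch).
  CRAWLDB=crawl/crawldb      # illustrative paths
  SEGMENTS=crawl/segments
  LINKDB=crawl/linkdb

  # Seed the crawldb from a directory of seed URL lists
  # (run once, before the first iteration).
  bin/nutch inject $CRAWLDB urls

  # Generate a new segment (fetchlist) from the crawldb.
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 100000

  # Pick the newest segment (local FS; use "hadoop fs -ls" on HDFS).
  SEGMENT=$SEGMENTS/$(ls "$SEGMENTS" | sort | tail -1)

  # Fetch and parse it; the Fetcher runs as a MapReduce job on the cluster.
  bin/nutch fetch $SEGMENT -threads 50
  bin/nutch parse $SEGMENT

  # Fold the results back into the crawldb and linkdb, then repeat the loop.
  bin/nutch updatedb $CRAWLDB $SEGMENT
  bin/nutch invertlinks $LINKDB $SEGMENT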


I understood that a single Fetcher is designed to work (map and reduce)
within one node.
How do I modify the above code to run multiple Fetchers over all generated
segments in parallel?

It's best done with the individual tools. There are many reasons why you should not use Crawl for anything other than quick tests.


Abstract question: how does Nutch really work on the cluster when fetching
segments in parallel?

It splits the fetchlist into N parts, and each part is fetched by a separate map task on the cluster, in parallel. Additionally, each Fetcher map task uses multiple threads, to further increase parallelism.
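To make that concrete, here is how the two levels of parallelism are typically controlled with Nutch 1.x options (the paths, segment name, -topN value and thread counts below are illustrative assumptions):

  # -numFetchers controls how many partitions the Generator writes, and hence
  # how many fetch map tasks run in parallel (e.g. one per node on 10 nodes).
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -numFetchers 10

  # Each fetch map task also runs many fetcher threads; -threads overrides the
  # fetcher.threads.fetch property from nutch-site.xml.
  bin/nutch fetch crawl/segments/20120801120000 -threads 50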

If you have spare capacity on your cluster, you can also start several Fetcher jobs in parallel - and here's where you would generate multiple segments to start several Fetcher jobs, each job fetching its own segment.
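A rough sketch of that pattern, assuming your Generator build supports the -maxNumSegments option; the segment names are made up, so list crawl/segments to get the real ones:

  # Generate up to 3 segments in a single Generator run.
  bin/nutch generate crawl/crawldb crawl/segments -topN 300000 \
      -maxNumSegments 3 -numFetchers 10

  # Start one Fetcher job per segment; each is an independent MapReduce job,
  # so they run concurrently as long as the cluster has spare slots.
  for seg in crawl/segments/20120801120000 \
             crawl/segments/20120801120001 \
             crawl/segments/20120801120002
  do
    bin/nutch fetch "$seg" -threads 50 &
  done
  wait   # block until all fetch jobs have finished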


Is there some Nutch configuration/execution trick on a real multi-node
cluster
(to work on the generated segments in parallel)?

Please see the command-line tools - generator, fetcher, parse, updatedb, updatelinkdb. They provide options to handle multiple segments.
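For example (a sketch using the same illustrative paths as above; in Nutch 1.x the linkdb update is done with the invertlinks tool), the -dir option lets updatedb and invertlinks consume every segment in one pass:

  # After all segments are fetched and parsed, update the databases once,
  # picking up every segment under crawl/segments via -dir.
  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments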

