Don't worry, Hadoop takes care of all of that for you: running Nutch's various jobs on Hadoop is identical to running them locally. A job is scheduled by Hadoop onto a configured number of nodes, whether that is 1 (local), 100, or 10000.
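For example, the same Injector step runs unchanged in both modes (paths below are placeholders; the bin/nutch script in runtime/deploy submits the job to the cluster for you whenever Hadoop is on the PATH):

  # Local, single JVM:
  bin/nutch inject crawl/crawldb urls

  # Submitted to the Hadoop cluster (this is roughly what the deploy
  # script does for you; the exact .job file name depends on your build):
  hadoop jar apache-nutch-1.5.1.job org.apache.nutch.crawl.Injector crawl/crawldb urls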
Check out the Nutch on Hadoop tutorial on the Nutch wiki. Please use the
user@nutch mailing list for user questions.

-----Original message-----
> From: Alexander Zazhigin <[email protected]>
> Sent: Mon 10-Sep-2012 19:24
> To: [email protected]
> Subject: Apache Nutch - Parallel segs execution
>
> OK. I understand that Crawl.java is for experimental purposes, and that
> in a real crawl cluster I should run Injector, Generator, Fetcher,
> ParseSegment, ... separately. I also understand that the Fetcher runs
> as a single fetch job with a parallel, multithreaded architecture.
>
> But what is the architecture of such an ensemble?
>
> I have a 10-node cluster. How do I orchestrate those 10 nodes while
> using the command-line tools above separately?
>
> For example, say I have completed the Inject and Generate (into 10
> segments) phases from one script.
>
> How do the 10 nodes of the cluster know that they must start, and which
> segment each of them is to fetch? What is the right architectural point
> of view?
>
> How do such work and its synchronization happen in enterprise Nutch
> clusters?
>
> Then each node parses and runs updatedb against the crawldb and linkdb.
> But again, how does the master node, which generates the new partitions
> into new segments, know that all nodes have completed the fetch, parse,
> and update phases?
>
> How is this loop closed for the next iteration?
>
> What is the right architecture of such a distributed enterprise Nutch
> cluster?
>
> 1. The master node's algorithm for calling the Nutch command-line tools.
> 2. Synchronization from master to slaves.
> 3. The slave-side trigger to execute, the algorithm for calling the
>    Nutch command-line tools, and the reporting of completion status.
> 4. Synchronization of the master by slave events.
>
> What is the technological point of view? Cron on the master and slaves?
> Scheduled intervals on the master and slaves? Which technologies for
> synchronizing in both directions? Master and slave shell scripts that
> call the Nutch command-line tools?
>
> And how is all of this done right?
> (from real enterprise Nutch cluster examples)
>
> With Best Regards,
> Alexander Zazhigin
>
> >> I have done deep research on Apache Nutch (1.5.1).
> >>
> >> I understood all the ideas laid out there except one thing, which led
> >> me to contradictions.
> >>
> >> The distributed architecture in your slides:
> >> http://www.slideshare.net/abial/nutch-webscale-search-engine-toolkit
> >>
> >> with its essence in this scheme:
> >> http://mmcg.z52.ru/drupal/sites/default/files/Nutch_schema.png
> >>
> >> - talks about multiple segments generated to be fetched in parallel.
> >>
> >> But the main loop in Crawl.java:
> >
> > Crawl.java is just a simple tool to get started without spending too
> > much time learning the individual tools in Nutch. For crawling
> > multiple segments in parallel you should use the individual tools.
> >
> >> I understood that a single Fetcher is designed to work (Map and
> >> Reduce) within one node. How do I modify the code above to run
> >> multiple Fetchers over all generated segments in parallel?
> >
> > It's best done with the individual tools. There are many reasons why
> > you should not use Crawl for anything other than quick tests.
> >
> >> Abstract idea - how does Nutch really work on a cluster, fetching
> >> segments in parallel?
> >
> > It splits the fetchlist into N parts, and each part is fetched by a
> > separate map task on the cluster, in parallel. Additionally, each
> > Fetcher map task uses multiple threads, to further increase
> > parallelism.
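To make that concrete, here is a rough, untested sketch of one crawl
iteration driven from a single control node (paths, -topN, and thread
counts are placeholders; the segment listing assumes a local filesystem,
on HDFS you would use hadoop fs -ls instead):

  #!/bin/bash
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments

  bin/nutch inject $CRAWLDB urls
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 50000
  # Pick up the segment that generate just created (newest directory).
  SEGMENT=$SEGMENTS/$(ls $SEGMENTS | tail -1)

  bin/nutch fetch $SEGMENT -threads 50
  bin/nutch parse $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT
  bin/nutch invertlinks crawl/linkdb -dir $SEGMENTS

Each command blocks until its MapReduce job has finished on every node,
so Hadoop itself provides all the master/slave scheduling and
synchronization asked about above - there is nothing extra to build.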
> >
> > If you have spare capacity on your cluster, you can also start
> > several Fetcher jobs in parallel - and here's where you would
> > generate multiple segments to start several Fetcher jobs, each job
> > fetching its own segment.
> >
> >> Is there some configuration/execution trick for Nutch on a real
> >> multi-node cluster?
> >> (to work on the generated segments in parallel)
> >
> > Please see the command-line tools - generate, fetch, parse, updatedb,
> > invertlinks. They provide options to handle multiple segments.
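A hedged sketch of that multi-segment approach (flag values and paths
are placeholders; -maxNumSegments and -numFetchers are options of the
Nutch 1.x Generator, and the shell globbing again assumes a local
filesystem):

  # Generate several segments in one pass; -numFetchers controls how many
  # fetchlist partitions (i.e. map tasks) each segment is split into.
  bin/nutch generate crawl/crawldb crawl/segments \
      -topN 50000 -maxNumSegments 3 -numFetchers 10

  # One Fetcher job per segment, running in parallel.
  for seg in crawl/segments/*; do
    bin/nutch fetch "$seg" -threads 50 &
  done
  wait   # block until every Fetcher job has completed

  # Then parse and update the crawldb with each finished segment.
  for seg in crawl/segments/*; do
    bin/nutch parse "$seg"
    bin/nutch updatedb crawl/crawldb "$seg"
  done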