Don't worry, Hadoop takes care of all of that for you: running Nutch's various jobs on Hadoop is identical to running them locally. A job is scheduled by Hadoop onto a configured number of nodes, whether that is 1 (local), 100, or 10000.
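For example, the same Injector step runs unchanged in both modes (paths below are placeholders; the bin/nutch script in runtime/deploy submits the job to the cluster for you whenever Hadoop is on the PATH):

  # Local, single JVM:
  bin/nutch inject crawl/crawldb urls

  # Submitted to the Hadoop cluster (this is roughly what the deploy
  # script does for you; the exact .job file name depends on your build):
  hadoop jar apache-nutch-1.5.1.job org.apache.nutch.crawl.Injector crawl/crawldb urls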
Check out the Nutch on Hadoop tutorial on the Nutch wiki. Please use the
user@nutch mailing list for user questions.

-----Original message-----
> From: Alexander Zazhigin <[email protected]>
> Sent: Mon 10-Sep-2012 19:24
> To: [email protected]
> Subject: Apache Nutch - Parallel segs execution
>
> OK. I understand that Crawl.java is for experimental purposes, and that
> in a real crawl cluster I should run Injector, Generator, Fetcher,
> ParseSegment, ... separately. I also understand that the Fetcher runs
> as a single fetch job with a parallel, multithreaded architecture.
>
> But what is the architecture of such an ensemble?
>
> I have a 10-node cluster. How do I orchestrate those 10 nodes while
> using the command-line tools above separately?
>
> For example, say I have completed the Inject and Generate (into 10
> segments) phases from one script.
>
> How do the 10 nodes of the cluster know that they must start, and which
> segment each of them is to fetch? What is the right architectural point
> of view?
>
> How do such work and its synchronization happen in enterprise Nutch
> clusters?
>
> Then each node parses and runs updatedb against the crawldb and linkdb.
> But again, how does the master node, which generates the new partitions
> into new segments, know that all nodes have completed the fetch, parse,
> and update phases?
>
> How is this loop closed for the next iteration?
>
> What is the right architecture of such a distributed enterprise Nutch
> cluster?
>
> 1. The master node's algorithm for calling the Nutch command-line tools.
> 2. Synchronization from master to slaves.
> 3. The slave-side trigger to execute, the algorithm for calling the
>    Nutch command-line tools, and the reporting of completion status.
> 4. Synchronization of the master by slave events.
>
> What is the technological point of view? Cron on the master and slaves?
> Scheduled intervals on the master and slaves? Which technologies for
> synchronizing in both directions? Master and slave shell scripts that
> call the Nutch command-line tools?
>
> And how is all of this done right?
> (from real enterprise Nutch cluster examples)
>
> With Best Regards,
> Alexander Zazhigin
>
> >> I have done deep research on Apache Nutch (1.5.1).
> >>
> >> I understood all the ideas laid out there except one thing, which led
> >> me to contradictions.
> >>
> >> The distributed architecture in your slides:
> >> http://www.slideshare.net/abial/nutch-webscale-search-engine-toolkit
> >>
> >> with its essence in this scheme:
> >> http://mmcg.z52.ru/drupal/sites/default/files/Nutch_schema.png
> >>
> >> - talks about multiple segments generated to be fetched in parallel.
> >>
> >> But the main loop in Crawl.java:
> >
> > Crawl.java is just a simple tool to get started without spending too
> > much time learning the individual tools in Nutch. For crawling
> > multiple segments in parallel you should use the individual tools.
> >
> >> I understood that a single Fetcher is designed to work (Map and
> >> Reduce) within one node. How do I modify the code above to run
> >> multiple Fetchers over all generated segments in parallel?
> >
> > It's best done with the individual tools. There are many reasons why
> > you should not use Crawl for anything other than quick tests.
> >
> >> Abstract idea - how does Nutch really work on a cluster, fetching
> >> segments in parallel?
> >
> > It splits the fetchlist into N parts, and each part is fetched by a
> > separate map task on the cluster, in parallel. Additionally, each
> > Fetcher map task uses multiple threads, to further increase
> > parallelism.
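To make that concrete, here is a rough, untested sketch of one crawl
iteration driven from a single control node (paths, -topN, and thread
counts are placeholders; the segment listing assumes a local filesystem,
on HDFS you would use hadoop fs -ls instead):

  #!/bin/bash
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments

  bin/nutch inject $CRAWLDB urls
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 50000
  # Pick up the segment that generate just created (newest directory).
  SEGMENT=$SEGMENTS/$(ls $SEGMENTS | tail -1)

  bin/nutch fetch $SEGMENT -threads 50
  bin/nutch parse $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT
  bin/nutch invertlinks crawl/linkdb -dir $SEGMENTS

Each command blocks until its MapReduce job has finished on every node,
so Hadoop itself provides all the master/slave scheduling and
synchronization asked about above - there is nothing extra to build.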
> >
> > If you have spare capacity on your cluster, you can also start
> > several Fetcher jobs in parallel - and here's where you would
> > generate multiple segments to start several Fetcher jobs, each job
> > fetching its own segment.
> >
> >> Is there some configuration/execution trick for Nutch on a real
> >> multi-node cluster?
> >> (to work on the generated segments in parallel)
> >
> > Please see the command-line tools - generate, fetch, parse, updatedb,
> > invertlinks. They provide options to handle multiple segments.
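A hedged sketch of that multi-segment approach (flag values and paths
are placeholders; -maxNumSegments and -numFetchers are options of the
Nutch 1.x Generator, and the shell globbing again assumes a local
filesystem):

  # Generate several segments in one pass; -numFetchers controls how many
  # fetchlist partitions (i.e. map tasks) each segment is split into.
  bin/nutch generate crawl/crawldb crawl/segments \
      -topN 50000 -maxNumSegments 3 -numFetchers 10

  # One Fetcher job per segment, running in parallel.
  for seg in crawl/segments/*; do
    bin/nutch fetch "$seg" -threads 50 &
  done
  wait   # block until every Fetcher job has completed

  # Then parse and update the crawldb with each finished segment.
  for seg in crawl/segments/*; do
    bin/nutch parse "$seg"
    bin/nutch updatedb crawl/crawldb "$seg"
  done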