I was thinking that Nutch needs some sort of workflow
manager. That way you could build jobs from specific
workflows and, hopefully, recover jobs based on the
portion of the workflow they are stuck in (or restart a
job if it failed or its processing time is > x hours, and
other such workflow rules).
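
A rough sketch of the restart/recover rule, in Python purely
for illustration. JobState, should_restart and resume_point
are made-up names, not anything in Nutch or the mapred
branch, and the step names are only examples.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobState:
    """Hypothetical record a workflow manager might keep per job."""
    steps: List[str]       # ordered workflow steps (example names only)
    current_step: int      # index of the step the job is currently on
    failed: bool           # True if the current step reported a failure
    hours_in_step: float   # how long the current step has been running


def should_restart(job: JobState, max_hours: float) -> bool:
    """Rule: restart if the job failed or has run longer than max_hours."""
    return job.failed or job.hours_in_step > max_hours


def resume_point(job: JobState) -> str:
    """Name of the step the job is stuck in, i.e. where recovery would begin."""
    return job.steps[job.current_step]


if __name__ == "__main__":
    job = JobState(
        steps=["generate", "fetch", "updatedb", "index"],
        current_step=1,
        failed=False,
        hours_in_step=30.0,
    )
    if should_restart(job, max_hours=24.0):
        print("restart from step:", resume_point(job))
```

A real manager would also need to persist this state somewhere
so it survives a crash; that part is left out here.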

Something like that could also send notifications when
jobs are done, trigger other events, and provide a
management interface showing what your cluster is up to.
It could also apply configuration types defined per
batch job/workflow process "in process". For example,
if I'm building a blog index I may want more, smaller
segments based on daily fetches, while for other jobs
I may want fewer, larger segments.
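
To make the per-workflow configuration idea concrete, here is
a small sketch in Python. The profile names, the
max_urls_per_segment setting and the numbers are all
assumptions made up for the example; none of them are real
Nutch configuration properties.

```python
import math

# Hypothetical per-workflow profiles: a daily blog crawl gets many small
# segments, a large crawl gets a few big ones. The values are invented.
PROFILES = {
    "blog-daily": {"max_urls_per_segment": 10_000},
    "full-web": {"max_urls_per_segment": 1_000_000},
}


def segment_count(workflow: str, total_urls: int) -> int:
    """Number of segments needed to hold total_urls under the workflow's profile."""
    size = PROFILES[workflow]["max_urls_per_segment"]
    return math.ceil(total_urls / size)


if __name__ == "__main__":
    for name in PROFILES:
        print(name, segment_count(name, 25_000))
```

The point is only that the segment size comes from the
workflow, not from one global setting.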

Does something like that make sense given where the
mapred branch is going?

Is "workflow" the right term for such a beast?

-byron



--- "Goldschmidt, Dave" <[EMAIL PROTECTED]>
wrote:

> Could you also just copy segments out of NDFS to
> local -- perform merges
> in local -- then copy segments back into NDFS?
> 
> DaveG
> 
> 
> -----Original Message-----
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, January 12, 2006 2:14 PM
> To: [email protected]
> Subject: Re: MapReduce and segment merging
> 
> Mike Alulin wrote:
> > Then how do people use the new version if they need,
> > let's say, daily crawls of the new/updated pages? I crawl
> > updated pages every 24 hours, and if I do not merge the
> > segments, soon I will have hundreds of them. What is the
> > best solution in this case?
> >
> > A full recrawl is not a good option, as I have millions of
> > documents and I DO know which of them were updated without
> > requesting them.
> >
> 
> This is a development version, nobody said it's
> feature complete. 
> Patience, my friend... or spend some effort to
> improve it. ;-)
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 
