I was thinking that Nutch needs some sort of workflow manager. That way you could build jobs from specific workflows and, hopefully, recover a job based on the portion of the workflow where it is stuck (or restart a job if it failed, if its processing time exceeds x hours, or under other such workflow rules).
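The restart rules described above (retry on failure, restart when a job runs past a time budget) could be sketched as a small policy check. This is only a sketch of the idea; the `WorkflowRule` class and its names are hypothetical and not part of Nutch:

```java
// Hypothetical sketch of a per-job workflow rule: restart a job when it
// has failed, or when it has been running longer than a cutoff.
public class WorkflowRule {
    public enum Status { RUNNING, DONE, FAILED }

    private final long maxRuntimeMillis;

    public WorkflowRule(long maxRuntimeMillis) {
        this.maxRuntimeMillis = maxRuntimeMillis;
    }

    /** Decide whether a job should be restarted from its last completed step. */
    public boolean shouldRestart(Status status, long startedAtMillis, long nowMillis) {
        if (status == Status.FAILED) {
            return true;  // failed jobs are always retried
        }
        // still running, but past the allowed processing time -> restart
        return status == Status.RUNNING
            && (nowMillis - startedAtMillis) > maxRuntimeMillis;
    }

    public static void main(String[] args) {
        WorkflowRule rule = new WorkflowRule(4 * 3600 * 1000L); // 4-hour cutoff
        long start = 0L;
        System.out.println(rule.shouldRestart(Status.FAILED, start, 1000L));             // true
        System.out.println(rule.shouldRestart(Status.RUNNING, start, 5 * 3600 * 1000L)); // true: over cutoff
        System.out.println(rule.shouldRestart(Status.RUNNING, start, 3600 * 1000L));     // false: within budget
        System.out.println(rule.shouldRestart(Status.DONE, start, 9999999L));            // false
    }
}
```

A real manager would also need to persist which workflow step each job last completed, so that "recover from the stuck portion" means re-running only the remaining steps rather than the whole pipeline.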
Something like that could also send notifications when jobs are done, trigger other events, and provide a management interface showing what your cluster is up to, or allow configuration profiles to be defined per batch job/workflow currently in process. For example, if I'm building a blog index I may want more, smaller segments based on daily fetches, while for other jobs I may want fewer, larger segments.

Does something like that make much sense for where the mapred branch is going? Is "workflow" the right term for such a beast?

-byron

--- "Goldschmidt, Dave" <[EMAIL PROTECTED]> wrote:

> Could you also just copy segments out of NDFS to local -- perform merges
> in local -- then copy segments back into NDFS?
>
> DaveG
>
> -----Original Message-----
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, January 12, 2006 2:14 PM
> To: [email protected]
> Subject: Re: MapReduce and segment merging
>
> Mike Alulin wrote:
> > Then how do people use the new version if they need, let's say, daily
> > crawls of new/updated pages? I crawl updated pages every 24 hours,
> > and if I do not merge the segments, soon I will have hundreds of them.
> > What is the best solution in this case?
> >
> > A full recrawl is not a good option, as I have millions of documents
> > and I DO know which of them were updated without requesting them.
>
> This is a development version; nobody said it's feature complete.
> Patience, my friend... or spend some effort to improve it. ;-)
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
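Byron's per-workflow configuration idea above (many small segments for a daily blog index, fewer large segments for other jobs) might be expressed as a per-job property overlay applied when the workflow starts. Every property name below is invented for illustration -- none of these are real Nutch configuration keys:

```
# Hypothetical overlay for a blog index: daily fetches, many small segments
workflow.name=blog-daily
workflow.fetch.interval.hours=24
workflow.segment.max.pages=50000

# Hypothetical overlay for an archive crawl: infrequent fetches, few large segments
workflow.name=archive-monthly
workflow.fetch.interval.hours=720
workflow.segment.max.pages=2000000
```

The manager would merge the matching overlay into the base configuration before submitting the job, so the same crawl code serves both segment-sizing policies.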
