Hi Tarjei,

You should take a look at Nutch. It's a search engine built on Lucene, and it can be set up on top of Hadoop. Take a look:

<http://lucene.apache.org/nutch/>
<http://wiki.apache.org/nutch/NutchHadoopTutorial>
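
As for the chaining question: the common pattern is a driver program that runs the jobs back-to-back, pointing each job's input at the previous job's output directory, rather than relying on endNotificationUri callbacks. Here is a rough sketch against the old org.apache.hadoop.mapred API; the IdentityMapper stand-ins and the paths are only placeholders for your own fetch/extract mappers:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class CrawlDriver {
  public static void main(String[] args) throws Exception {
    // Step 1: fetch pages for the current list of URLs.
    // IdentityMapper is a placeholder; a real job would use your fetch mapper.
    JobConf fetch = new JobConf(CrawlDriver.class);
    fetch.setJobName("fetch-pages");
    fetch.setMapperClass(IdentityMapper.class);
    // With the default TextInputFormat, IdentityMapper emits LongWritable/Text;
    // set these to whatever your own mapper actually outputs.
    fetch.setOutputKeyClass(LongWritable.class);
    fetch.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(fetch, new Path("urls/current"));
    FileOutputFormat.setOutputPath(fetch, new Path("pages"));
    JobClient.runJob(fetch);   // blocks until the job completes

    // Step 2: extract new URLs from the fetched pages.
    // Its input directory is step 1's output directory.
    JobConf extract = new JobConf(CrawlDriver.class);
    extract.setJobName("extract-urls");
    extract.setMapperClass(IdentityMapper.class);
    extract.setOutputKeyClass(LongWritable.class);
    extract.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(extract, new Path("pages"));
    FileOutputFormat.setOutputPath(extract, new Path("urls/next"));
    JobClient.runJob(extract);
  }
}

Since JobClient.runJob() blocks, the driver itself acts as the workflow controller; Cascading essentially gives you a nicer way to express the same kind of wiring, and a scheduler like Quartz can kick off the driver for the recurring runs.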
Hope this helps!

Alex

On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm planning to use Hadoop for a set of typical crawler/indexer tasks.
> The basic flow is:
>
> input: array of urls
> actions:
>    |
>    1. get pages
>    |
>    2. extract new urls from pages -> start new job
>       extract text -> index / filter (as new jobs)
>
> What I'm considering is how I should build this application to fit into
> the map/reduce context. I'm thinking that steps 1 and 2 should be
> separate map/reduce tasks that then pipe things on to the next step.
>
> This is where I am a bit at a loss to see how it is smart to organize
> the code into logical units and also how to spawn new tasks when an old
> one is over.
>
> Is the usual way to control the flow of a set of tasks to have an
> external application running that listens for jobs ending via the
> endNotificationUri and then spawns new tasks, or should the job itself
> contain code to create new jobs? Would it be a good idea to use
> Cascading here?
>
> I'm also considering how I should do job scheduling (I have a lot of
> recurring tasks). Has anyone found a good framework for job control of
> recurring tasks, or should I plan to build my own using Quartz?
>
> Any tips/best practices with regard to the issues described above are
> most welcome. Feel free to ask further questions if you find my
> descriptions of the issues lacking.
>
> Kind regards,
> Tarjei
