Hi Tarjei,

You should take a look at Nutch.  It's a search engine built on Lucene,
and it can be set up on top of Hadoop.  Take a look:

<http://lucene.apache.org/nutch/>
-and-
<http://wiki.apache.org/nutch/NutchHadoopTutorial>

Hope this helps!

Alex

On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse <[EMAIL PROTECTED]> wrote:

> Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
> tasks. The basic flow is:
>
> input:     array of urls
> actions:         |
> 1.           get pages
>                  |
> 2.           extract new urls from pages -> start new job
>              extract text                -> index / filter (as new jobs)
>
> What I'm considering is how to build this application to fit into the
> map/reduce model. I'm thinking that steps 1 and 2 should be separate
> map/reduce jobs that then pipe their output on to the next step.
>
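For what it's worth, one map task per step maps nicely onto that flow.
Here is a rough sketch of step 1 with the plain org.apache.hadoop.mapred
API (this is not Nutch's fetcher; FetchMapper is just an illustrative
name, and robots.txt handling, politeness and error handling are left
out):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Step 1: one URL per input line in, (url, page content) out.
public class FetchMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String url = line.toString().trim();
    StringBuilder page = new StringBuilder();
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(url).openStream()));
    String ln;
    while ((ln = in.readLine()) != null) {
      page.append(ln).append('\n');
    }
    in.close();
    // Emit the URL as key and the raw page as value for step 2.
    out.collect(new Text(url), new Text(page.toString()));
  }
}
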
> This is where I'm a bit at a loss: I'm not sure how best to organize
> the code into logical units, or how to spawn new jobs when an old one
> finishes.
>
> Is the usual way to control the flow of a set of jobs to have an
> external application that listens for jobs ending via the
> endNotificationUri and then spawns new jobs, or should the job itself
> contain code to create new jobs? Would it be a good idea to use
> Cascading here?
>
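Since JobClient.runJob() blocks until the submitted job finishes, one
simple option is a single driver program that runs the jobs back to
back, without relying on the end-notification URI at all. A minimal
sketch along those lines, again with the old mapred API (FetchMapper
and ExtractMapper are the hypothetical mappers for steps 1 and 2):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class CrawlDriver {
  public static void main(String[] args) throws Exception {
    // Step 1: map over a text file of URLs and fetch each page (map-only).
    JobConf fetch = new JobConf(CrawlDriver.class);
    fetch.setJobName("fetch-pages");
    fetch.setMapperClass(FetchMapper.class);
    fetch.setNumReduceTasks(0);
    fetch.setOutputKeyClass(Text.class);
    fetch.setOutputValueClass(Text.class);
    // SequenceFile output keeps multi-line page content intact.
    fetch.setOutputFormat(SequenceFileOutputFormat.class);
    FileInputFormat.setInputPaths(fetch, new Path(args[0]));
    FileOutputFormat.setOutputPath(fetch, new Path("fetched"));
    JobClient.runJob(fetch);   // blocks until step 1 has finished

    // Step 2: map over the fetched (url, page) records, extract links and text.
    JobConf extract = new JobConf(CrawlDriver.class);
    extract.setJobName("extract");
    extract.setMapperClass(ExtractMapper.class);
    extract.setInputFormat(SequenceFileInputFormat.class);
    extract.setOutputKeyClass(Text.class);
    extract.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(extract, new Path("fetched"));
    FileOutputFormat.setOutputPath(extract, new Path("extracted"));
    JobClient.runJob(extract);
  }
}

Cascading gives you essentially this kind of chaining, plus dependency
handling between flows, as a library, so it may well be a good fit once
the flows get more complex.
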
> I'm also considering how I should do job scheduling (I have a lot of
> recurring tasks). Has anyone found a good framework for controlling
> recurring jobs, or should I plan to build my own using Quartz?
>
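If you end up rolling your own, a Quartz cron trigger is pretty much
all you need to kick the driver off on a schedule. A minimal sketch,
assuming the Quartz 1.x API (CrawlJob is a hypothetical org.quartz.Job
whose execute() submits the Hadoop driver, e.g. by running CrawlDriver):

import org.quartz.CronTrigger;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class CrawlScheduler {
  public static void main(String[] args) throws Exception {
    Scheduler scheduler = new StdSchedulerFactory().getScheduler();
    scheduler.start();

    // CrawlJob (hypothetical) wraps the submission of the crawl jobs.
    JobDetail job = new JobDetail("crawl", Scheduler.DEFAULT_GROUP,
                                  CrawlJob.class);
    CronTrigger trigger = new CronTrigger("crawl-trigger",
                                          Scheduler.DEFAULT_GROUP,
                                          "0 0 3 * * ?");  // every night at 03:00
    scheduler.scheduleJob(job, trigger);
  }
}
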
> Any tips/best practices with regard to the issues described above are most
> welcome. Feel free to ask further questions if you find my descriptions of
> the issues lacking.
>
> Kind regards,
> Tarjei
>
>
>
