Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
tasks. The basic flow is:
input: array of URLs
actions:
  1. get pages
  2. extract new URLs from the pages -> start new job
     extract text -> index / filter (as new jobs)
What I'm considering is how to structure this application to fit the
map/reduce model. My thinking is that steps 1 and 2 should be separate
map/reduce jobs that pipe their output on to the next step. This is
where I'm a bit at a loss: how best to organize the code into logical
units, and how to spawn new jobs when an old one finishes.
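To make the question a bit more concrete, step 1 is roughly this in my
head: a map-only job that reads one URL per line and emits (url, page
content) for the next job to pick up. This is only a sketch using the
org.apache.hadoop.mapreduce API; the class name and the error handling
are just placeholders.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String url = value.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        try (InputStream in = new URL(url).openStream()) {
            // Read the whole page into memory; fine for ordinary HTML pages.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            // Emit (url, raw page) so the extract/index jobs can consume it.
            context.write(new Text(url), new Text(buf.toString("UTF-8")));
        } catch (IOException e) {
            // A dead link shouldn't fail the whole task; count it and move on.
            context.getCounter("fetch", "failed").increment(1);
        }
    }
}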
Is the usual way to control the flow of a set of jobs to have an
external application running that listens for jobs ending via the
endNotificationUri and then spawns new jobs, or should the job itself
contain code to create the follow-up jobs? Would it be a good idea to
use Cascading here?
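Concretely, I assume the second option would look something like a
driver that blocks on each job and then submits the next one itself,
along these lines (again just a sketch reusing the FetchMapper above;
the paths, job names and the later steps are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawlDriver {

    /** Runs one crawl pass; returns false if any step fails. */
    public static boolean runOnce() throws Exception {
        Configuration conf = new Configuration();

        // Step 1: fetch the pages listed in the seed file (map-only job).
        Job fetch = Job.getInstance(conf, "fetch-pages");
        fetch.setJarByClass(CrawlDriver.class);
        fetch.setMapperClass(FetchMapper.class);
        fetch.setNumReduceTasks(0);
        fetch.setOutputKeyClass(Text.class);
        fetch.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(fetch, new Path("crawl/urls"));
        FileOutputFormat.setOutputPath(fetch, new Path("crawl/pages"));

        // Block until the fetch job finishes before deciding what to do next.
        if (!fetch.waitForCompletion(true)) {
            return false;
        }

        // Step 2 (extract URLs) and the index/filter jobs would be configured
        // the same way, reading from crawl/pages, and submitted here once the
        // fetch job has succeeded.
        return true;
    }

    public static void main(String[] args) throws Exception {
        System.exit(runOnce() ? 0 : 1);
    }
}

In other words, the control flow would live in the driver rather than in
an external notification listener.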
I'm also considering how to do job scheduling (I have a lot of
recurring tasks). Has anyone found a good framework for job control of
recurring tasks, or should I plan to build my own using Quartz?
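On the Quartz side, I imagine "building my own" would amount to little
more than a cron-style trigger that fires the driver above on a
schedule, e.g. (Quartz builder API; all names and the schedule are made
up):

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class CrawlScheduler {

    // Quartz job wrapper that just launches one pass of the crawl chain.
    public static class CrawlQuartzJob implements Job {
        @Override
        public void execute(JobExecutionContext ctx) throws JobExecutionException {
            try {
                CrawlDriver.runOnce(); // the driver sketched above
            } catch (Exception e) {
                throw new JobExecutionException(e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(CrawlQuartzJob.class)
                .withIdentity("crawl", "crawler")
                .build();

        // Re-run the whole crawl every night at 02:00.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("nightly-crawl", "crawler")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}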
Any tips/best practices with regard to the issues described above are
most welcome. Feel free to ask further questions if you find my
descriptions of the issues lacking.
Kind regards,
Tarjei