Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
tasks. The basic flow is:
input: array of URLs
actions:
  1. get pages
  2. extract new URLs from the pages -> start new job
     extract text -> index / filter (as new jobs)
What I'm considering is how to structure this application to fit the
map/reduce model. My thinking is that steps 1 and 2 should be separate
map/reduce jobs that pipe their output on to the next step. This is
where I'm a bit at a loss: how best to organize the code into logical
units, and how to spawn new jobs when an old one finishes.
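To make the question a bit more concrete, step 1 is roughly this in my
head: a map-only job that reads one URL per line and emits (url, page
content) for the next job to pick up. This is only a sketch using the
org.apache.hadoop.mapreduce API; the class name and the error handling
are just placeholders.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String url = value.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        try (InputStream in = new URL(url).openStream()) {
            // Read the whole page into memory; fine for ordinary HTML pages.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            // Emit (url, raw page) so the extract/index jobs can consume it.
            context.write(new Text(url), new Text(buf.toString("UTF-8")));
        } catch (IOException e) {
            // A dead link shouldn't fail the whole task; count it and move on.
            context.getCounter("fetch", "failed").increment(1);
        }
    }
}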
Is the usual way to control the flow of a set of jobs to have an
external application running that listens for jobs ending via the
endNotificationUri and then spawns new jobs, or should the job itself
contain code to create the follow-up jobs? Would it be a good idea to
use Cascading here?
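Concretely, I assume the second option would look something like a
driver that blocks on each job and then submits the next one itself,
along these lines (again just a sketch reusing the FetchMapper above;
the paths, job names and the later steps are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawlDriver {

    /** Runs one crawl pass; returns false if any step fails. */
    public static boolean runOnce() throws Exception {
        Configuration conf = new Configuration();

        // Step 1: fetch the pages listed in the seed file (map-only job).
        Job fetch = Job.getInstance(conf, "fetch-pages");
        fetch.setJarByClass(CrawlDriver.class);
        fetch.setMapperClass(FetchMapper.class);
        fetch.setNumReduceTasks(0);
        fetch.setOutputKeyClass(Text.class);
        fetch.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(fetch, new Path("crawl/urls"));
        FileOutputFormat.setOutputPath(fetch, new Path("crawl/pages"));

        // Block until the fetch job finishes before deciding what to do next.
        if (!fetch.waitForCompletion(true)) {
            return false;
        }

        // Step 2 (extract URLs) and the index/filter jobs would be configured
        // the same way, reading from crawl/pages, and submitted here once the
        // fetch job has succeeded.
        return true;
    }

    public static void main(String[] args) throws Exception {
        System.exit(runOnce() ? 0 : 1);
    }
}

In other words, the control flow would live in the driver rather than in
an external notification listener.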
I'm also considering how to do job scheduling (I have a lot of
recurring tasks). Has anyone found a good framework for job control of
recurring tasks, or should I plan to build my own using Quartz?
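On the Quartz side, I imagine "building my own" would amount to little
more than a cron-style trigger that fires the driver above on a
schedule, e.g. (Quartz builder API; all names and the schedule are made
up):

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class CrawlScheduler {

    // Quartz job wrapper that just launches one pass of the crawl chain.
    public static class CrawlQuartzJob implements Job {
        @Override
        public void execute(JobExecutionContext ctx) throws JobExecutionException {
            try {
                CrawlDriver.runOnce(); // the driver sketched above
            } catch (Exception e) {
                throw new JobExecutionException(e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(CrawlQuartzJob.class)
                .withIdentity("crawl", "crawler")
                .build();

        // Re-run the whole crawl every night at 02:00.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("nightly-crawl", "crawler")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}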
Any tips/best practices with regard to the issues described above are
most welcome. Feel free to ask further questions if you find my
descriptions of the issues lacking.
Kind regards,
Tarjei