Burak
       If you have a continuous inflow of data, you can use Flume to
aggregate the small files into larger sequence files, and once you have a
substantial chunk of data (roughly an HDFS block size) you can push it on
to HDFS. Based on your SLAs you need to schedule your jobs using Oozie or a
simple shell script. In very simple terms the flow would be (a rough shell
sketch follows the list):
- push the input data (could be from a Flume collector) into a staging hdfs dir
- before triggering the job (hadoop jar), copy the input from the staging dir
to the main input dir
- execute the job
- archive the input and output into archive dirs (any other dirs); the output
archive dir can then serve as the source of the output data downstream
- delete the output dir and empty the input dir
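
To make that concrete, here is a rough shell sketch of the cycle above. The
paths (/data/staging, /data/input, /data/output, /data/archive) and the
wordcount.jar / WordCount names are just placeholders I made up, so adjust
them to your own job:

#!/bin/bash
# Rough sketch of the staging -> input -> job -> archive cycle described above.
# All paths and the jar/class names are placeholders.

HADOOP=hadoop
STAGING=/data/staging      # Flume (or any collector) drops files here
INPUT=/data/input          # the job always reads from this dir
OUTPUT=/data/output        # MR output dir (must not exist before the job runs)
ARCHIVE=/data/archive      # keeps every processed batch
TS=$(date +%Y%m%d%H%M%S)

# 1. move the staged files into the main input dir
$HADOOP fs -mv "$STAGING/*" "$INPUT/" || exit 1

# 2. execute the job
$HADOOP jar wordcount.jar WordCount "$INPUT" "$OUTPUT" || exit 1

# 3. archive this run's input and output, which also empties the input dir
#    and removes the output dir so the next run starts clean
$HADOOP fs -mkdir -p "$ARCHIVE/$TS/input"
$HADOOP fs -mv "$INPUT/*" "$ARCHIVE/$TS/input/"
$HADOOP fs -mv "$OUTPUT" "$ARCHIVE/$TS/output"

A cron entry (or an Oozie coordinator) can then fire this script at whatever
interval your SLA allows, e.g. every 15 minutes.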

Hope it helps!...

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 2:19 AM, burakkk <burak.isi...@gmail.com> wrote:

> Hi everyone,
> I want to run an MR job continuously, because I have streaming data and I
> try to analyze it all the time in my own way (algorithm). For example, take
> the wordcount problem. It's the simplest one :) If you have multiple files
> and new files keep arriving, how do you handle it? You could execute an MR
> job per file, but you would have to do it repeatedly. So what do you think?
>
> Thanks
> Best regards...
>
> --
>
> *BURAK ISIKLI* | *http://burakisikli.wordpress.com*
>
