Burak,

If you have a continuous inflow of data, you can use Flume to aggregate the files into larger sequence files (or similar) if they are small. Once you have a substantial chunk of data (around the HDFS block size), you can push it on to HDFS. Based on your SLAs, you then need to schedule your jobs using Oozie or a simple shell script.

In very simple terms:
- push input data (could be from a Flume collector) into a staging HDFS dir
- before triggering the job (hadoop jar), copy the input from staging to the main input dir
- execute the job
- archive the input and output into archive dirs (or any other dirs)
- the output archive dir can then serve as the source of output data
- delete the output dir and empty the input dir
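
The steps above could be sketched roughly as follows in Python. This is only an illustration, not a tested production script: the directory layout, `wordcount.jar`, `WordCount`, and the helper names `build_cycle_commands` and `run_cycle` are all made up for the example. It also assumes the archive root already exists on HDFS, and it moves (rather than copies) files out of staging so the next cycle does not pick them up again.

```python
import subprocess


def build_cycle_commands(staging, input_dir, output_dir, archive_dir, run_id,
                         jar="wordcount.jar", main_class="WordCount"):
    """Return the hadoop commands for one batch cycle, in execution order.

    All paths, the jar name, and the main class are hypothetical placeholders.
    """
    run_archive = f"{archive_dir}/{run_id}"
    return [
        # 1. Pull everything collected so far from staging into the input dir.
        #    The glob is passed through as-is for `hadoop fs` to expand.
        ["hadoop", "fs", "-mv", f"{staging}/*", input_dir],
        # 2. Run the MapReduce job over the input dir.
        ["hadoop", "jar", jar, main_class, input_dir, output_dir],
        # 3. Archive this run's input and output under a per-run directory.
        #    Moving the output dir also clears it for the next run.
        ["hadoop", "fs", "-mkdir", run_archive],
        ["hadoop", "fs", "-mv", input_dir, f"{run_archive}/input"],
        ["hadoop", "fs", "-mv", output_dir, f"{run_archive}/output"],
        # 4. Recreate an empty input dir for the next cycle.
        ["hadoop", "fs", "-mkdir", input_dir],
    ]


def run_cycle(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        runner(argv, check=True)
```

A scheduler (cron, or an Oozie action) would then call something like `run_cycle(build_cycle_commands("/data/staging", "/data/input", "/data/output", "/data/archive", run_id))` once per interval, with a fresh `run_id` such as a timestamp each time.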

Hope it helps!

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 2:19 AM, burakkk <burak.isi...@gmail.com> wrote:

> Hi everyone,
> I want to run a MR job continuously, because I have streaming data and I
> am trying to analyze it all the time with my own algorithm. For example,
> say you want to solve the wordcount problem. It's the simplest one :) If
> you have multiple files and new files keep arriving, how do you handle it?
> You could execute a MR job per file, but you would have to do it
> repeatedly. So what do you think?
>
> Thanks
> Best regards...
>
> --
>
> BURAK ISIKLI | http://burakisikli.wordpress.com