On Nov 19, 2007, at 10:45 AM, Eugeny N Dzhurinsky wrote:
1) I cannot know the number of records. In fact it is something like an endless loop, and the code that populates records from a database into a stream is a bit complicated; there can be cases where it takes a few hours until new data is prepared by a third-party application for processing, so the producer thread (which fetches the records and passes them to the Hadoop handlers) will just block and wait for the data.
2) I would like to maintain a fixed number of jobs at a time, and not spawn a new one until one of the running jobs ends - in other words, I would like to have some kind of job pool of fixed size (something similar to the PoolingExecutor from the java.util.concurrent package). I assume it would not be hard to implement such logic on top of Hadoop, but if there is something within Hadoop that would ease this task, that would be great.
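
Just to illustrate what 2) is asking for: a fixed pool of threads making blocking JobClient.runJob() calls caps the number of in-flight jobs. This is only a sketch - the pool size and buildJobConf() are placeholders for whatever actually prepares the next batch:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class FixedJobPool {
  private static final int MAX_JOBS = 4;   // arbitrary cap on concurrent jobs

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(MAX_JOBS);
    final Semaphore slots = new Semaphore(MAX_JOBS);
    while (true) {
      // Placeholder for the producer side: fetch the next batch of records
      // (possibly blocking for hours) and describe the job that processes it.
      final JobConf conf = buildJobConf();
      slots.acquire();                     // wait here until a job slot frees up
      pool.submit(new Runnable() {
        public void run() {
          try {
            JobClient.runJob(conf);        // blocks until this job finishes
          } catch (Exception e) {
            e.printStackTrace();
          } finally {
            slots.release();
          }
        }
      });
    }
  }

  private static JobConf buildJobConf() {
    return new JobConf(FixedJobPool.class);  // set input, output, mapper, reducer here
  }
}
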
With map/reduce, you really want to process items in a batch. I've seen instances where that is expressed as an "update" directory and a "current" directory. (I'm going to express it in terms of files, but it generalizes pretty easily to non-files.) The outside process puts new things in the update directory, and when the map/reduce job is about to run, it finds the files that are ready to be processed and generates input splits for those. It also generates input splits for the current directory and does a join between the two data sets, with the reduce doing the update. By running these jobs one after another, you get the effect you are looking for with a minimum of delay.
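
A rough sketch of that pattern against the org.apache.hadoop.mapred API could look like the following; the tab-separated record layout, the class names, and the /data/* paths are just placeholders, not a finished implementation:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class UpdateJoin {

  // Tag each record with the directory it came from, so the reducer can
  // tell current records from updates. Assumes "key<TAB>value" text lines.
  public static class TagMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String tag;

    public void configure(JobConf job) {
      // map.input.file holds the path of the file this split came from
      tag = job.get("map.input.file", "").contains("/update/") ? "U" : "C";
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2) {
        out.collect(new Text(parts[0]), new Text(tag + "\t" + parts[1]));
      }
    }
  }

  // For each key, an update (if present) wins over the current value.
  public static class MergeReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String current = null, update = null;
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("U\t")) {
          update = v.substring(2);
        } else {
          current = v.substring(2);
        }
      }
      out.collect(key, new Text(update != null ? update : current));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(UpdateJoin.class);
    conf.setJobName("apply-updates");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(TagMapper.class);
    conf.setReducerClass(MergeReducer.class);
    FileInputFormat.setInputPaths(conf,
        new Path("/data/current"), new Path("/data/update"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/current-next"));
    JobClient.runJob(conf);
  }
}

After the job finishes, the output directory becomes the new "current" and the processed update files are moved aside, so the next run starts from a clean state.
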
If you do write an InputFormat to read from a database, let us know. *smile* If I were doing it, I'd probably generate InputSplits based on key ranges. If my table were indexed on names, for example, I'd use sampling to generate split points that divide the table into the desired number of roughly equal splits. Then each RecordReader does a "select * from Tbl where key >= minKey and key < maxKey" to read just its own input.
-- Owen