On Nov 19, 2007, at 10:45 AM, Eugeny N Dzhurinsky wrote:

1) I cannot know the number of records. In fact it is something like an endless loop: the code that populates records from a database into a stream is a bit complicated, and there can be cases where it takes a few hours until new data is prepared by a third-party application for processing, so the producer thread (which fetches the records and passes them to the Hadoop handlers) will just block and wait for the data.

2) I would like to maintain a fixed number of jobs at a time and not spawn a new one until one of the running jobs ends - that is, I would like some kind of job pool of fixed size (something similar to a ThreadPoolExecutor from the java.util.concurrent package). I assume it would not be hard to implement such logic on top of Hadoop, but if there is something within Hadoop that would ease this task, it would be great.

With map/reduce, you really want to process items in a batch. I've seen instances where that is expressed as an "update" directory and a "current" directory. (I'm going to express it as files, but it generalizes pretty easily to non-files.) The outside process puts new things in the update directory, and when the map/reduce job is about to run, it finds the files that are ready to be processed and generates input splits for those. It also generates input splits for the current directory and does a join between the two data sets, with the reduce doing the update. By running these jobs one after another, you get the effect you are looking for with a minimum of delay.
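
Something like the following sketch, against the old org.apache.hadoop.mapred API. The paths, the tab-separated record layout, and the "update record wins" rule are just placeholder assumptions to show the shape of the job, not a fixed recipe:

// Sketch of the current/update join as a single job. TagMapper marks each
// record with which directory it came from; UpdateReducer keeps the update
// version of a key if one exists, otherwise the current version.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class UpdateJoinJob {

  public static class TagMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private boolean fromUpdate;

    public void configure(JobConf conf) {
      // "map.input.file" holds the file this split was read from.
      fromUpdate = conf.get("map.input.file", "").contains("/update/");
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] fields = line.toString().split("\t", 2);  // key \t rest-of-record
      if (fields.length < 2) {
        return;                                          // skip malformed lines
      }
      out.collect(new Text(fields[0]),
                  new Text((fromUpdate ? "U" : "C") + fields[1]));
    }
  }

  public static class UpdateReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String keep = null;
      while (values.hasNext()) {
        String v = values.next().toString();
        if (keep == null || v.charAt(0) == 'U') {        // an update replaces current
          keep = v;
        }
      }
      out.collect(key, new Text(keep.substring(1)));     // drop the tag character
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(UpdateJoinJob.class);
    conf.setJobName("current-update-join");

    // Both the stable data set and the freshly arrived updates feed one job.
    FileInputFormat.addInputPath(conf, new Path("/data/current"));
    FileInputFormat.addInputPath(conf, new Path("/data/update"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/next"));

    conf.setMapperClass(TagMapper.class);
    conf.setReducerClass(UpdateReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    JobClient.runJob(conf);
    // Afterwards /data/next becomes the new "current" and the consumed update
    // files are moved aside, so the next run sees only new arrivals.
  }
}

The point is only that both directories go into one job and the reduce applies the updates; how you tag records and resolve conflicts is up to your data.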

If you do write an InputFormat to read from a database, let us know. *smile* If I were doing it, I'd probably generate InputSplits based on key ranges. For example, if my table were indexed on names, I'd use sampling to pick split points that divide the table into the desired number of roughly equal splits. Then each RecordReader does a "select * from Tbl where key >= minKey and key < maxKey" to read just its input.
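
Very roughly, the split could look something like this (old mapred interfaces; I'm assuming the sampled boundary keys get passed in through the JobConf, and the JDBC RecordReader that would actually run getQuery() is left out):

// Sketch of a key-range InputSplit plus the split-generation step an
// InputFormat.getSplits() might use. Property names and key quoting are
// placeholders for illustration.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class KeyRangeSplit implements InputSplit {
  private String minKey;   // inclusive lower bound
  private String maxKey;   // exclusive upper bound

  public KeyRangeSplit() {}                      // needed for deserialization

  public KeyRangeSplit(String minKey, String maxKey) {
    this.minKey = minKey;
    this.maxKey = maxKey;
  }

  // The range query this split's RecordReader would run.
  public String getQuery(String table, String keyColumn) {
    return "select * from " + table + " where " + keyColumn + " >= '" + minKey
        + "' and " + keyColumn + " < '" + maxKey + "'";
  }

  public long getLength() { return 0; }                     // rows, not bytes; unknown

  public String[] getLocations() { return new String[0]; }  // no data locality

  public void write(DataOutput out) throws IOException {
    out.writeUTF(minKey);
    out.writeUTF(maxKey);
  }

  public void readFields(DataInput in) throws IOException {
    minKey = in.readUTF();
    maxKey = in.readUTF();
  }

  // How getSplits() might turn sampled boundary keys (e.g. "a,f,m,s,z")
  // into one split per adjacent pair of boundaries.
  public static InputSplit[] fromBoundaries(JobConf conf) {
    String[] bounds = conf.get("db.key.boundaries", "").split(",");
    InputSplit[] splits = new InputSplit[bounds.length - 1];
    for (int i = 0; i < splits.length; i++) {
      splits[i] = new KeyRangeSplit(bounds[i], bounds[i + 1]);
    }
    return splits;
  }
}

The sampling pass that picks the boundaries is the part that keeps the splits roughly equal in size, so it's worth doing against the indexed column.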

-- Owen
