On Nov 19, 2007, at 10:45 AM, Eugeny N Dzhurinsky wrote:
1) I cannot know the number of records. In fact it is something like an endless loop, and the code that populates records from a database into a stream is a bit complicated; there can be cases where it takes a few hours until new data is prepared by a third-party application for processing, so the producer thread (which fetches the records and passes them to the Hadoop handlers) will just block and wait for the data.
2) I would like to maintain a fixed number of jobs at a time, and not spawn a new one until one of the running jobs ends - in other words, I would like to have some kind of job pool of fixed size (something similar to the PoolingExecutor from the java.util.concurrent package). I assume it would not be hard to implement such logic on top of Hadoop, but if there is something within Hadoop that would ease this task, that would be great.
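
Just to illustrate what 2) is asking for: a fixed pool of threads making blocking JobClient.runJob() calls caps the number of in-flight jobs. This is only a sketch - the pool size and buildJobConf() are placeholders for whatever actually prepares the next batch:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class FixedJobPool {
  private static final int MAX_JOBS = 4;   // arbitrary cap on concurrent jobs

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(MAX_JOBS);
    final Semaphore slots = new Semaphore(MAX_JOBS);
    while (true) {
      // Placeholder for the producer side: fetch the next batch of records
      // (possibly blocking for hours) and describe the job that processes it.
      final JobConf conf = buildJobConf();
      slots.acquire();                     // wait here until a job slot frees up
      pool.submit(new Runnable() {
        public void run() {
          try {
            JobClient.runJob(conf);        // blocks until this job finishes
          } catch (Exception e) {
            e.printStackTrace();
          } finally {
            slots.release();
          }
        }
      });
    }
  }

  private static JobConf buildJobConf() {
    return new JobConf(FixedJobPool.class);  // set input, output, mapper, reducer here
  }
}
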
With map/reduce, you really want to process items in a batch. I've seen instances where that is expressed as an "update" directory and a "current" directory. (I'm going to express it in terms of files, but it generalizes pretty easily to non-files.) The outside process puts new things in the update directory, and when the map/reduce job is about to run, it finds the files that are ready to be processed and generates input splits for those. It also generates input splits for the current directory and does a join between the two data sets, with the reduce doing the update. By running these jobs one after another, you get the effect you are looking for with a minimum of delay.
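
A rough sketch of that pattern against the org.apache.hadoop.mapred API could look like the following; the tab-separated record layout, the class names, and the /data/* paths are just placeholders, not a finished implementation:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class UpdateJoin {

  // Tag each record with the directory it came from, so the reducer can
  // tell current records from updates. Assumes "key<TAB>value" text lines.
  public static class TagMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String tag;

    public void configure(JobConf job) {
      // map.input.file holds the path of the file this split came from
      tag = job.get("map.input.file", "").contains("/update/") ? "U" : "C";
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2) {
        out.collect(new Text(parts[0]), new Text(tag + "\t" + parts[1]));
      }
    }
  }

  // For each key, an update (if present) wins over the current value.
  public static class MergeReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String current = null, update = null;
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("U\t")) {
          update = v.substring(2);
        } else {
          current = v.substring(2);
        }
      }
      out.collect(key, new Text(update != null ? update : current));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(UpdateJoin.class);
    conf.setJobName("apply-updates");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(TagMapper.class);
    conf.setReducerClass(MergeReducer.class);
    FileInputFormat.setInputPaths(conf,
        new Path("/data/current"), new Path("/data/update"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/current-next"));
    JobClient.runJob(conf);
  }
}

After the job finishes, the output directory becomes the new "current" and the processed update files are moved aside, so the next run starts from a clean state.
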
If you do write an InputFormat to read from a database, let us know. *smile* If I were doing it, I'd probably generate InputSplits based on key ranges. If my table were indexed on names, for example, I'd use sampling to generate split points that divide the table into the desired number of roughly equal splits. Then each RecordReader does a "select * from Tbl where key >= minKey and key < maxKey" to read just its own input.
-- Owen