On Mon, Nov 19, 2007 at 10:40:54PM +0530, Arun C Murthy wrote:
> Yes Eugene, you are on the right track. As you have noted, you would need
> custom InputFormat/InputSplit/RecordReader implementations, and we seem to
> have only file-based ones in our source tree.
>
> One option is clearly to pre-fetch the records from the database and store
> them on HDFS or any other filesystem (e.g. the local filesystem).
>
> However, it really shouldn't be very complicated to process data directly
> from the database.
>
> I'd imagine that your custom InputFormat would return as many InputSplit(s)
> as the number of maps you want to spawn (e.g.
> COUNT(*)/no_desired_records_per_map), and your RecordReader.next(k, v) would
> work with the InputSplit to track the record ranges (rows) it is responsible
> for, be cognizant of the 'record' type, fetch it from the database, and feed
> it to the map. You need to ensure that your InputSplit tracks the
> record ranges (rows) you want each map to process and that it implements the
> Writable interface for serialization.
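If I understand correctly, something along these lines - just a minimal
sketch assuming the classic org.apache.hadoop.mapred interfaces, where the
"records" table, the "db.url" property, and the hard-coded row count are all
placeholders for illustration:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.*;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// A split covering rows [start, end) of a hypothetical "records" table.
class RowRangeSplit implements InputSplit {
    long start, end;

    public RowRangeSplit() { }   // no-arg constructor needed for deserialization
    public RowRangeSplit(long start, long end) { this.start = start; this.end = end; }

    public long getLength() { return end - start; }
    public String[] getLocations() { return new String[0]; }  // no locality hints for a DB

    // Writable: the framework ships splits to the tasktrackers in serialized form
    public void write(DataOutput out) throws IOException {
        out.writeLong(start); out.writeLong(end);
    }
    public void readFields(DataInput in) throws IOException {
        start = in.readLong(); end = in.readLong();
    }
}

class DbInputFormat implements InputFormat<LongWritable, Text> {

    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        long total = 1000000;   // in reality: SELECT COUNT(*) FROM records, or a config value
        long perMap = Math.max(1, (total + numSplits - 1) / numSplits);
        InputSplit[] splits = new InputSplit[(int) ((total + perMap - 1) / perMap)];
        for (int i = 0; i < splits.length; i++) {
            long start = i * perMap;
            splits[i] = new RowRangeSplit(start, Math.min(start + perMap, total));
        }
        return splits;
    }

    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new DbRecordReader((RowRangeSplit) split, job);
    }
}

class DbRecordReader implements RecordReader<LongWritable, Text> {
    private RowRangeSplit split;
    private ResultSet rs;
    private long pos;

    DbRecordReader(RowRangeSplit split, JobConf job) throws IOException {
        this.split = split;
        this.pos = split.start;
        try {
            Connection conn = DriverManager.getConnection(job.get("db.url"));
            PreparedStatement st = conn.prepareStatement(
                "SELECT id, payload FROM records WHERE id >= ? AND id < ?");
            st.setLong(1, split.start);
            st.setLong(2, split.end);
            rs = st.executeQuery();
        } catch (SQLException e) { throw new IOException(e.toString()); }
    }

    // Pulls one row from the result set and feeds it to the map as (id, payload)
    public boolean next(LongWritable key, Text value) throws IOException {
        try {
            if (!rs.next()) return false;
            key.set(rs.getLong(1));
            value.set(rs.getString(2));
            pos++;
            return true;
        } catch (SQLException e) { throw new IOException(e.toString()); }
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue()       { return new Text(); }
    public long getPos()            { return pos - split.start; }
    public float getProgress() {
        return (float) (pos - split.start) / (split.end - split.start);
    }
    public void close() throws IOException {
        try { rs.close(); } catch (SQLException e) { /* best effort */ }
    }
}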
Thank you for the prompt response, it is really helpful - at least now I know
I'm not trying to do this task with the wrong tool. However, there are a few
things I probably forgot to mention:

1) I cannot know the number of records in advance. In fact it is something
like an endless loop: the code which populates records from the database into
a stream is a bit complicated, and there can be cases when it takes a few
hours until new data is prepared by a third-party application for processing,
so the producer thread (which fetches the records and passes them to the
Hadoop handlers) will simply block and wait for the data.

2) I would like to maintain a fixed number of jobs at a time, and not spawn a
new one until one of the running jobs finishes - that is, I would like to
have some kind of job pool of fixed size (something similar to
ThreadPoolExecutor from the java.util.concurrent package; a rough sketch of
what I mean is in the P.S. below). I assume it would not be hard to implement
such logic on top of Hadoop, however if there is something within Hadoop that
would ease this task, that would be great.

Thank you in advance!

--
Eugene N Dzhurinsky
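P.S. To make point 2 concrete, here is roughly what I have in mind - just a
sketch on top of the standard java.util.concurrent classes, with the pool
size picked arbitrarily. Since JobClient.runJob() blocks until the submitted
job finishes, a fixed-size thread pool naturally caps the number of jobs in
flight:

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

class BoundedJobRunner {
    private static final int MAX_CONCURRENT_JOBS = 4;   // arbitrary limit for the example
    private final ExecutorService pool =
        Executors.newFixedThreadPool(MAX_CONCURRENT_JOBS);

    // Queues a job; at most MAX_CONCURRENT_JOBS run at once, the rest wait.
    public void submit(final JobConf conf) {
        pool.execute(new Runnable() {
            public void run() {
                try {
                    JobClient.runJob(conf);   // blocks until this job completes
                } catch (IOException e) {
                    e.printStackTrace();      // real code would need proper handling
                }
            }
        });
    }

    public void shutdown() { pool.shutdown(); }
}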
