On Mon, Nov 19, 2007 at 06:43:40PM +0200, Eugeny N Dzhurinsky wrote:
>Hello, gentlemen!
>
>I would like to implement a custom data provider which will create records
>to start map jobs with. For example, I would like to create a thread which
>will extract some data from a storage (e.g. a relational database) and start a
>new job, which will take a single record and start map/reduce processing. Each
>such record will produce a lot of results, which will be processed by a
>reduce task later.
>
>The question is - how to implement such interfaces? As far as I learned, I
>would need to implement the interfaces InputSplit, RecordReader and
>InputFormat. However, after looking at the sources and javadocs, I found that
>all operations seem to be file-based, and a file could be split between
>several hosts, which isn't my case. I would deal with a single stream I need
>to parse in order to start a job.
>

Yes Eugeny, you are on the right track. As you have noted, you would need custom 
InputFormat/InputSplit/RecordReader implementations, and we only have 
file-based ones in our source tree today.

One option is clearly to pre-fetch the records from the database, store them 
on HDFS or any other filesystem (e.g. the local filesystem), and then run a 
normal file-based job over that.
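For instance, a rough, untested sketch of the pre-fetch step (the JDBC url, 
table and column names are just placeholders); a stock SequenceFileInputFormat 
job can then consume the resulting file:

  import java.sql.*;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class DbToHdfs {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Dump every row as an <id, payload> pair into a SequenceFile on HDFS.
      SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
          new Path("/user/eugeny/records.seq"), LongWritable.class, Text.class);
      Connection con = DriverManager.getConnection(
          "jdbc:mysql://dbhost/mydb", "user", "password");
      try {
        ResultSet rs = con.createStatement()
            .executeQuery("SELECT id, payload FROM records");
        LongWritable key = new LongWritable();
        Text value = new Text();
        while (rs.next()) {
          key.set(rs.getLong(1));
          value.set(rs.getString(2));
          writer.append(key, value);
        }
      } finally {
        writer.close();
        con.close();
      }
    }
  }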

However, it really shouldn't be very complicated to process data directly from 
the database. 

I'd imagine that your custom InputFormat would return as many InputSplit(s) as 
the number of maps you want to spawn (e.g. COUNT(*)/desired_records_per_map). 
Your RecordReader.next(k, v) would then work with the InputSplit to track the 
record-range (rows) it is responsible for, be cognizant of the 'record' type, 
fetch each record from the database and feed it to the map. You need to ensure 
that your InputSplit tracks the record-range (rows) you want each map to 
process, and that it implements the Writable interface so the framework can 
serialize it and ship it to the tasks. A sketch of all three pieces follows.
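Something like this (completely untested, written against the old mapred 
interfaces; the table/columns, the OFFSET-based paging and the mydb.* config 
keys are all made up for illustration):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.sql.*;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class DatabaseInputFormat implements InputFormat<LongWritable, Text> {

    /** A split is just a row range [start, end); it has to implement Writable
     *  so the framework can serialize it out to the map tasks. */
    public static class RowRangeSplit implements InputSplit {
      long start, end;

      public RowRangeSplit() {}  // needed for deserialization
      RowRangeSplit(long start, long end) { this.start = start; this.end = end; }

      public long getLength() { return end - start; }
      public String[] getLocations() { return new String[0]; } // no locality hints
      public void write(DataOutput out) throws IOException {
        out.writeLong(start);
        out.writeLong(end);
      }
      public void readFields(DataInput in) throws IOException {
        start = in.readLong();
        end = in.readLong();
      }
    }

    // One split per desired map, i.e. roughly COUNT(*)/numSplits rows each.
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
      try {
        Connection con = connect(job);
        ResultSet rs = con.createStatement()
            .executeQuery("SELECT COUNT(*) FROM records");
        rs.next();
        long total = rs.getLong(1);
        con.close();

        long chunk = (total + numSplits - 1) / numSplits;
        InputSplit[] splits = new InputSplit[numSplits];
        for (int i = 0; i < numSplits; ++i) {
          splits[i] =
              new RowRangeSplit(i * chunk, Math.min((i + 1) * chunk, total));
        }
        return splits;
      } catch (SQLException e) {
        throw new IOException(e.toString());
      }
    }

    public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      final RowRangeSplit range = (RowRangeSplit) split;
      try {
        final Connection con = connect(job);
        // Fetch exactly the rows this split is responsible for.
        final ResultSet rs = con.createStatement().executeQuery(
            "SELECT id, payload FROM records ORDER BY id LIMIT "
            + range.getLength() + " OFFSET " + range.start);
        return new RecordReader<LongWritable, Text>() {
          long pos = 0;
          public boolean next(LongWritable key, Text value) throws IOException {
            try {
              if (!rs.next()) return false;
              key.set(rs.getLong(1));
              value.set(rs.getString(2));
              pos++;
              return true;
            } catch (SQLException e) {
              throw new IOException(e.toString());
            }
          }
          public LongWritable createKey() { return new LongWritable(); }
          public Text createValue() { return new Text(); }
          public long getPos() { return pos; }
          public float getProgress() {
            return range.getLength() == 0 ? 1.0f : pos / (float) range.getLength();
          }
          public void close() throws IOException {
            try { con.close(); } catch (SQLException e) { /* ignore on close */ }
          }
        };
      } catch (SQLException e) {
        throw new IOException(e.toString());
      }
    }

    private static Connection connect(JobConf job) throws SQLException {
      // Connection details from the job config; the mydb.* keys are made up.
      return DriverManager.getConnection(
          job.get("mydb.url"), job.get("mydb.user"), job.get("mydb.password"));
    }
  }

The ORDER BY is only there so that every split sees a stable row numbering; on 
a big table you'd rather page on a primary-key range than use OFFSET.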

hth,
Arun

