Eugeny N Dzhurinsky wrote:
I would like to implement a custom data provider which will create records to
start map jobs with. For example, I would like to create a thread which
extracts data from storage (e.g. a relational database) and starts a new job,
which takes a single record and starts map/reduce processing. Each such record
will produce a lot of results, which will be processed by the reduce task
later.
The question is: how do I implement such interfaces? As far as I've learned, I
would need to implement the interfaces InputSplit, RecordReader and
InputFormat. However, after looking at the sources and javadocs, I found that
all operations seem to be file-based, and such a file could be split between
several hosts, which isn't my case. I would be dealing with a single stream
that I need to parse before starting a job.
InputFormat and OutputFormat do not require files. There are a few known bugs
where Path or FileSystem still appear in these interfaces, but those uses are
optional, and it is simple to implement non-file-based inputs and outputs
for mapreduce.
An example of a non-file-based InputFormat is in HBase:
http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/contrib/hbase/src/java/org/apache/hadoop/hbase/mapred/TableInputFormat.java?view=markup
And a non-file-based OutputFormat is in:
http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/contrib/hbase/src/java/org/apache/hadoop/hbase/mapred/TableOutputFormat.java?view=markup
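For concreteness, below is a minimal sketch of a non-file-based InputFormat
against the old org.apache.hadoop.mapred API (the generic signatures match
Hadoop 0.17 and later; earlier releases used non-generic interfaces, so adjust
to your version). The names IdRangeSplit and IdRangeInputFormat and the
"idrange.total" configuration key are hypothetical, and the reader emits
placeholder values where a real implementation would fetch rows from the
database:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/** Hypothetical non-file split: just a range of record ids, serialized by hand. */
class IdRangeSplit implements InputSplit {
  long start, end;

  public IdRangeSplit() {}                                 // no-arg ctor for deserialization
  public IdRangeSplit(long start, long end) { this.start = start; this.end = end; }

  public long getLength() { return end - start; }          // scheduling hint, not bytes
  public String[] getLocations() { return new String[0]; } // no host affinity

  public void write(DataOutput out) throws IOException {
    out.writeLong(start); out.writeLong(end);
  }
  public void readFields(DataInput in) throws IOException {
    start = in.readLong(); end = in.readLong();
  }
}

/** Hypothetical InputFormat that never touches Path or FileSystem. */
public class IdRangeInputFormat implements InputFormat<LongWritable, Text> {

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // Partition an id range taken from the job configuration; a real
    // implementation might instead ask the database for its min/max keys.
    long total = job.getLong("idrange.total", 1000);       // hypothetical config key
    long chunk = (total + numSplits - 1) / numSplits;
    InputSplit[] splits = new InputSplit[numSplits];
    for (int i = 0; i < numSplits; i++) {
      long start = Math.min(i * chunk, total);
      splits[i] = new IdRangeSplit(start, Math.min(start + chunk, total));
    }
    return splits;
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final IdRangeSplit s = (IdRangeSplit) split;
    return new RecordReader<LongWritable, Text>() {
      private long pos = s.start;

      public boolean next(LongWritable key, Text value) throws IOException {
        if (pos >= s.end) return false;
        key.set(pos);
        // A real reader would fetch row 'pos' from the database here;
        // this sketch just emits a placeholder value.
        value.set("record-" + pos);
        pos++;
        return true;
      }
      public LongWritable createKey() { return new LongWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return pos - s.start; }
      public void close() {}
      public float getProgress() {
        return s.end == s.start ? 1.0f : (float) (pos - s.start) / (s.end - s.start);
      }
    };
  }
}

The key point is that a split is just a Writable describing a unit of work:
here a range of ids, but it could equally carry a query, a table name, or
nothing at all. Returning an empty array from getLocations() simply tells the
scheduler there is no host affinity.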
Doug