Eugeny N Dzhurinsky wrote:
I would like to implement a custom data provider which will create records
to start map jobs with. For example, I would like to create a thread which
will extract some data from storage (e.g. a relational database) and start a
new job, which will take a single record and start map/reduce processing. Each
such record will produce many results, which will be processed by a
reduce task later.

The question is - how do I implement such interfaces? As far as I have learned, I
would need to implement the interfaces InputSplit, RecordReader and
InputFormat. However, after looking at the sources and javadocs, I found that all
operations seem to be file-based, with the file possibly split between
several hosts, which isn't my case. I would be dealing with a single stream that
I need to parse and start a job from.

InputFormat and OutputFormat do not require files. There are a few bugs where Path or FileSystem appear in these interfaces, but those uses are optional, and it is simple to implement non-file-based inputs and outputs for MapReduce.

An example of a non-file-based InputFormat is in HBase:

http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/contrib/hbase/src/java/org/apache/hadoop/hbase/mapred/TableInputFormat.java?view=markup
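
For concreteness, here is a minimal sketch of a non-file-based InputFormat along the same lines. StreamInputFormat and StreamSplit are made-up names, the record source is faked with a counter where a real implementation would hold a JDBC cursor or stream, and the signatures assume the generified org.apache.hadoop.mapred interfaces of later Hadoop releases; older releases use raw types and may also require a validateInput() method:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class StreamInputFormat implements InputFormat<LongWritable, Text> {

  // A split that carries no file information at all.
  public static class StreamSplit implements InputSplit {
    public long getLength() { return 0; }                     // size unknown
    public String[] getLocations() { return new String[0]; } // no locality hints
    public void write(DataOutput out) throws IOException { } // nothing to serialize
    public void readFields(DataInput in) throws IOException { }
  }

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // One map task consumes the whole stream; return more splits
    // only if the source can be partitioned (e.g. by key range).
    return new InputSplit[] { new StreamSplit() };
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new RecordReader<LongWritable, Text>() {
      private long row = 0;
      private final long limit = 100;  // stand-in for end-of-stream

      public boolean next(LongWritable key, Text value) throws IOException {
        if (row >= limit) return false;  // real code: cursor.next()
        key.set(row);
        value.set("record-" + row);      // real code: parse the row
        row++;
        return true;
      }

      public LongWritable createKey() { return new LongWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return row; }
      public float getProgress() { return row / (float) limit; }
      public void close() throws IOException { }
    };
  }
}

The job would then be configured with conf.setInputFormat(StreamInputFormat.class).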

And a non-file-based OutputFormat is in:

http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/contrib/hbase/src/java/org/apache/hadoop/hbase/mapred/TableOutputFormat.java?view=markup
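
A minimal non-file-based OutputFormat could look like the sketch below, under the same assumptions as above (StreamOutputFormat is a made-up name; records go to stdout where a real implementation would insert into a database). Note that the FileSystem parameter, one of the appearances mentioned above, can simply be ignored:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class StreamOutputFormat implements OutputFormat<Text, Text> {

  public void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException {
    // Nothing to validate for a non-file sink; a real implementation
    // might check that the target table exists.
  }

  public RecordWriter<Text, Text> getRecordWriter(
      FileSystem ignored, JobConf job, String name, Progressable progress)
      throws IOException {
    return new RecordWriter<Text, Text>() {
      public void write(Text key, Text value) throws IOException {
        System.out.println(key + "\t" + value);  // real code: INSERT the row
      }
      public void close(Reporter reporter) throws IOException {
        // real code: commit and close the connection
      }
    };
  }
}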

Doug
