Multiple, generic InputFormats for MapReduce
--------------------------------------------
Key: HADOOP-3926
URL: https://issues.apache.org/jira/browse/HADOOP-3926
Project: Hadoop Core
Issue Type: Improvement
Components: mapred
Reporter: Vuk Ercegovac
Priority: Minor
The feature that allows an InputFormat to be specified per path for a MapReduce
job (see http://issues.apache.org/jira/browse/HADOOP-372) should be generalized
to support InputFormats other than FileInputFormat (e.g., an HBase table). This
is needed when joining or co-grouping multiple inputs. Even with multiple
FileInputFormats, if a subclass sets and configures itself from the JobConf,
the inputs must ensure that their configuration names do not clash. In general,
the child InputFormats should not need to be aware of each other.
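
To make the name-clash concern concrete, here is a minimal sketch in which
plain maps stand in for the JobConf. The key name "sketch.input.table", the
"union.<index>." prefix, and the class and method names are all assumptions
made up for illustration; none of them come from Hadoop or from the Jaql
implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Plain maps stand in for JobConf; every name here is hypothetical.
final class ConfClashSketch {
    // Made-up key that every child of the same InputFormat class would read.
    static final String KEY = "sketch.input.table";

    // Shared configuration: each child writes the same key, so the last write wins.
    static Map<String, String> sharedConf(String... tables) {
        Map<String, String> conf = new HashMap<>();
        for (String table : tables) {
            conf.put(KEY, table);
        }
        return conf;
    }

    // Isolated configuration: child i gets its own key space via an index prefix.
    static Map<String, String> isolatedConf(String... tables) {
        Map<String, String> conf = new HashMap<>();
        for (int i = 0; i < tables.length; i++) {
            conf.put("union." + i + "." + KEY, tables[i]);
        }
        return conf;
    }

    // Rebuild the private view for child i by stripping its prefix.
    static Map<String, String> childView(Map<String, String> conf, int i) {
        String prefix = "union." + i + ".";
        Map<String, String> view = new HashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                view.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return view;
    }

    public static void main(String[] args) {
        // Only one table survives in the shared configuration.
        System.out.println(sharedConf("users", "orders"));
        // Each child recovers its own setting from the isolated configuration.
        Map<String, String> conf = isolatedConf("users", "orders");
        System.out.println(childView(conf, 0));
        System.out.println(childView(conf, 1));
    }
}
```

With the shared map, the second child silently overwrites the first child's
setting. With a per-child key space, each child sees only its own values and
needs no knowledge of its siblings.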
We've implemented this for Jaql but would like to remove dependencies on other
libs (json) and see how it can be integrated with the HADOOP-372 changes. It
works similarly to HADOOP-372. A UnionInputFormat consists of multiple child
InputFormats. The UnionInputFormat records an array of <InputFormat, name-value
pairs for JobConf> in the JobConf. To create splits, it collects the child
splits (similar to DelegatingInputFormat) and wraps each child's split with its
index into the array (similar to TaggedInputSplit). Given a split, the
UnionInputFormat can then look up the corresponding InputFormat by its index,
instantiate it, and return its RecordReader. Each child InputFormat depends on
setting up an empty JobConf prior to its instantiation. An alternative is to
use a string version of an InputFormat's setup JobConf. The analog to
DelegatingMapper simply exposes the child split's index to drive per-input
logic (in our case, it's a script rather than a Map class). As with HADOOP-372,
these are lib-level changes, not core.
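
The split-wrapping and dispatch steps described above can be sketched roughly
as follows. This is not the Jaql code and does not use the real Hadoop
interfaces: Split, Format, IndexedSplit, UnionFormat, and ListFormat are
simplified, hypothetical stand-ins, and the children are held in memory rather
than being re-instantiated from the JobConf as the real design would require.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for Hadoop's InputFormat/InputSplit/RecordReader; not the real API.
final class UnionSketch {
    interface Split {}

    interface Format {
        List<Split> getSplits();
        List<String> read(Split split); // stands in for obtaining a RecordReader
    }

    // A child split tagged with the index of the child that produced it.
    static final class IndexedSplit implements Split {
        final int childIndex;
        final Split inner;
        IndexedSplit(int childIndex, Split inner) {
            this.childIndex = childIndex;
            this.inner = inner;
        }
    }

    // Union of several child formats; children never see each other.
    static final class UnionFormat implements Format {
        private final List<Format> children;
        UnionFormat(List<Format> children) { this.children = children; }

        // Collect every child's splits and wrap each with the child's index.
        public List<Split> getSplits() {
            List<Split> all = new ArrayList<>();
            for (int i = 0; i < children.size(); i++) {
                for (Split s : children.get(i).getSplits()) {
                    all.add(new IndexedSplit(i, s));
                }
            }
            return all;
        }

        // Use the index to find the owning child, then delegate with the unwrapped split.
        public List<String> read(Split split) {
            IndexedSplit tagged = (IndexedSplit) split;
            return children.get(tagged.childIndex).read(tagged.inner);
        }
    }

    // Toy child format: one split per record.
    static final class ListFormat implements Format {
        static final class OneRecord implements Split {
            final String value;
            OneRecord(String value) { this.value = value; }
        }

        private final List<String> records;
        ListFormat(List<String> records) { this.records = records; }

        public List<Split> getSplits() {
            List<Split> splits = new ArrayList<>();
            for (String r : records) {
                splits.add(new OneRecord(r));
            }
            return splits;
        }

        public List<String> read(Split split) {
            return Arrays.asList(((OneRecord) split).value);
        }
    }

    public static void main(String[] args) {
        UnionFormat union = new UnionFormat(Arrays.asList(
                new ListFormat(Arrays.asList("a", "b")),
                new ListFormat(Arrays.asList("c"))));
        for (Split s : union.getSplits()) {
            // The child index is what would drive per-input logic in the mapper.
            int index = ((IndexedSplit) s).childIndex;
            System.out.println("input " + index + ": " + union.read(s));
        }
    }
}
```

The index carried on each wrapped split is the only coupling between the
union and its children, which is also what a DelegatingMapper analog would
expose to per-input logic.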