[jira] Created: (HADOOP-3926) Multiple, generic InputFormats for MapReduce

Vuk Ercegovac (JIRA) Fri, 08 Aug 2008 00:01:09 -0700

Multiple, generic InputFormats for MapReduce
--------------------------------------------


                 Key: HADOOP-3926
                 URL: https://issues.apache.org/jira/browse/HADOOP-3926
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: Vuk Ercegovac
            Priority: Minor


The feature that allows an InputFormat per path to be specified for a MapReduce
job (see http://issues.apache.org/jira/browse/HADOOP-372) should be generalized
to support InputFormats other than FileInputFormat (e.g., an HBase table). This
is needed when joining or co-grouping multiple inputs. Even for the case of
multiple FileInputFormats, it seems that if a sub-class sets and configures
itself from the JobConf, the inputs will need to ensure that they do not have
name clashes. In general, the child InputFormats should not be aware of each
other.
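
To illustrate the name-clash concern, here is a minimal sketch in plain Java.
It does not use the Hadoop API: a Map stands in for the JobConf, and the key
"example.input.path" is a hypothetical name chosen only for this example. It
contrasts two inputs writing the same key into one shared configuration with
each input keeping its own name-value pairs under its index.

```java
import java.util.HashMap;
import java.util.Map;

public class NameClashSketch {
    // Hypothetical key that a child input format reads to configure itself.
    public static final String PATH_KEY = "example.input.path";

    // Shared configuration: both inputs write the same key, so the second
    // value overwrites the first.
    public static Map<String, String> sharedConf(String firstPath, String secondPath) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PATH_KEY, firstPath);
        conf.put(PATH_KEY, secondPath);
        return conf;
    }

    // Isolated configuration: each input keeps its own name-value pairs,
    // stored under its index, so identical keys cannot collide.
    public static Map<Integer, Map<String, String>> isolatedConf(String... paths) {
        Map<Integer, Map<String, String>> perChild = new HashMap<>();
        for (int i = 0; i < paths.length; i++) {
            Map<String, String> childConf = new HashMap<>();
            childConf.put(PATH_KEY, paths[i]);
            perChild.put(i, childConf);
        }
        return perChild;
    }

    public static void main(String[] args) {
        // The shared map has lost the first path.
        System.out.println(sharedConf("/data/a", "/data/b").get(PATH_KEY));
        // The isolated maps still hold both paths.
        System.out.println(isolatedConf("/data/a", "/data/b").get(0).get(PATH_KEY));
        System.out.println(isolatedConf("/data/a", "/data/b").get(1).get(PATH_KEY));
    }
}
```

With per-input maps, neither input needs to know which keys the other uses.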

We've implemented this for Jaql but would like to remove dependencies on other 
libs (json) and see how it can be integrated with the HADOOP-372 changes. It
works similarly to HADOOP-372. A UnionInputFormat consists of multiple child
InputFormats. The UnionInputFormat records an array of <InputFormat, name-value 
pairs for JobConf> in the JobConf. For creating splits, it collects child 
splits (similar to DelegatingInputFormat) and wraps each child's split with its 
index into the array (similar to TaggedInputSplit). The UnionInputFormat, given 
a split, can then dig out the corresponding InputFormat given its index, 
instantiate it, and return its RecordReader. Each child InputFormat depends on 
setting up an empty JobConf prior to its instantiation. An alternative is to 
use a string version of an InputFormat's setup JobConf. The analog to 
DelegatingMapper simply exposes the child split's index to drive per-input
logic (in our case, it's a script rather than a Map class). As with HADOOP-372,
these are lib-level changes, not core.
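
The flow described above can be sketched roughly as follows. This is a
simplified model in plain Java, not the Jaql implementation and not the Hadoop
API: ChildFormat, ChildSpec, IndexedSplit and ListFormat are hypothetical
stand-ins, a Map replaces the JobConf, splits are plain strings, and children
are held as instances rather than instantiated from a class name. It only shows
how wrapping each child split with its array index lets the union route a split
back to the right child and that child's own configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnionSketch {
    /** Simplified stand-in for an InputFormat; not the Hadoop interface. */
    public interface ChildFormat {
        List<String> getSplits(Map<String, String> conf);
        List<String> readRecords(String split, Map<String, String> conf);
    }

    /** One array entry: a child format plus its own name-value pairs. */
    public static final class ChildSpec {
        public final ChildFormat format;
        public final Map<String, String> conf;
        public ChildSpec(ChildFormat format, Map<String, String> conf) {
            this.format = format;
            this.conf = conf;
        }
    }

    /** A child split wrapped with the index of the child that produced it. */
    public static final class IndexedSplit {
        public final int childIndex;
        public final String childSplit;
        public IndexedSplit(int childIndex, String childSplit) {
            this.childIndex = childIndex;
            this.childSplit = childSplit;
        }
    }

    /** Collect every child's splits, tagging each with its child's index. */
    public static List<IndexedSplit> getSplits(List<ChildSpec> children) {
        List<IndexedSplit> all = new ArrayList<>();
        for (int i = 0; i < children.size(); i++) {
            ChildSpec child = children.get(i);
            for (String split : child.format.getSplits(child.conf)) {
                all.add(new IndexedSplit(i, split));
            }
        }
        return all;
    }

    /** Use the split's index to find the child and read with its own conf. */
    public static List<String> readRecords(List<ChildSpec> children, IndexedSplit split) {
        ChildSpec child = children.get(split.childIndex);
        return child.format.readRecords(split.childSplit, child.conf);
    }

    /** Hypothetical child: one split per comma-separated entry in "names". */
    public static final class ListFormat implements ChildFormat {
        public List<String> getSplits(Map<String, String> conf) {
            return Arrays.asList(conf.get("names").split(","));
        }
        public List<String> readRecords(String split, Map<String, String> conf) {
            return Arrays.asList(conf.get("prefix") + split);
        }
    }

    private static Map<String, String> conf(String names, String prefix) {
        Map<String, String> m = new HashMap<>();
        m.put("names", names);
        m.put("prefix", prefix);
        return m;
    }

    /** Two children that use the same keys with different values. */
    public static List<ChildSpec> exampleChildren() {
        return Arrays.asList(
            new ChildSpec(new ListFormat(), conf("a,b", "file:")),
            new ChildSpec(new ListFormat(), conf("t1", "table:")));
    }

    public static void main(String[] args) {
        List<ChildSpec> children = exampleChildren();
        for (IndexedSplit split : getSplits(children)) {
            // The index is what per-input logic would branch on.
            System.out.println(split.childIndex + " " + readRecords(children, split));
        }
    }
}
```

Both example children use the keys "names" and "prefix" without interfering,
because each one only ever sees its own name-value pairs.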

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
