Implement MultipleTableInputs which is analogous to MultipleInputs in Hadoop
----------------------------------------------------------------------------

                 Key: HBASE-2965
                 URL: https://issues.apache.org/jira/browse/HBASE-2965
             Project: HBase
          Issue Type: New Feature
          Components: mapred, mapreduce
            Reporter: Adam Warrington
            Priority: Minor


This feature would be helpful for reduce-side joins, or for passing 
similarly structured data from multiple tables through MapReduce. The API I 
envision would be very similar to the existing MultipleInputs, parts of 
which could be reused.

MultipleTableInputs would have a public API like:

class MultipleTableInputs {
  public static void addInputTable(Job job, Table table, Scan scan,
      Class<? extends TableInputFormatBase> inputFormatClass,
      Class<? extends Mapper> mapperClass);
}

MultipleTableInputs would build a mapping of Tables to configured 
TableInputFormats the same way MultipleInputs builds a mapping between Paths 
and InputFormats. Most people will probably use TableInputFormat.class as the 
input format class, and TableInputFormat configures TableInputFormatBase's 
private scan and table members from within its setConf() method when an 
instance is created. The MultipleTableInputs implementation will therefore 
have to replace those members by calling setScan and setHTable with the table 
and scan that are passed into addInputTable above. MultipleTableInputs' 
addInputTable() method would also set the input format for the job to 
DelegatingTableInputFormat, described below.
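
As a rough illustration of the bookkeeping described above, the sketch below 
records each table's input format and mapper as strings in a configuration 
map, in the spirit of how MultipleInputs keeps its Path mappings in the job 
configuration. Everything in it is an assumption made for illustration: the 
class name TableInputRegistry, the two key names, and the plain Map standing 
in for the Hadoop Configuration. The Scan is left out; a real implementation 
would also have to store one per table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a plain Map stands in for the Hadoop job Configuration.
class TableInputRegistry {
    // Assumed key names; not part of any existing HBase or Hadoop API.
    static final String FORMATS_KEY = "multipletableinputs.table.formats";
    static final String MAPPERS_KEY = "multipletableinputs.table.mappers";

    // Appends "table;className" to a comma-separated list stored under key.
    private static void append(Map<String, String> conf, String key,
                               String table, String className) {
        String entry = table + ";" + className;
        String existing = conf.get(key);
        conf.put(key, existing == null ? entry : existing + "," + entry);
    }

    // Records the input format and mapper class names for one table.
    static void addInputTable(Map<String, String> conf, String table,
                              String inputFormatClass, String mapperClass) {
        append(conf, FORMATS_KEY, table, inputFormatClass);
        append(conf, MAPPERS_KEY, table, mapperClass);
    }

    // Parses a stored list back into an ordered table -> class name mapping.
    static Map<String, String> parse(Map<String, String> conf, String key) {
        Map<String, String> result = new LinkedHashMap<>();
        String value = conf.get(key);
        if (value == null || value.isEmpty()) {
            return result;
        }
        for (String entry : value.split(",")) {
            String[] parts = entry.split(";", 2);
            result.put(parts[0], parts[1]);
        }
        return result;
    }
}
```

Keeping the mapping as strings in the configuration means it travels with the 
job to the tasks, where the delegating input format can parse it back.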

A new class called DelegatingTableInputFormat would be analogous to 
DelegatingInputFormat. Its getSplits() would return TaggedInputSplits (the 
same TaggedInputSplit class that the Hadoop DelegatingInputFormat uses), each 
of which tags a split with its InputFormat and Mapper. These are created by 
looping through the HTable-to-InputFormat mappings, calling getSplits on each 
input format, and passing the split, the input format, and the mapper as 
constructor arguments to TaggedInputSplit.
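
The loop described above could look roughly like the following. This is a 
self-contained sketch with stand-in types, not Hadoop or HBase code: 
SplitSource, TaggedSplit, TableEntry, and DelegatingSplitter are hypothetical 
names, and splits, input formats, and mappers are represented by plain strings 
rather than InputSplit, InputFormat, and Mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an input format that can enumerate its splits.
interface SplitSource {
    List<String> getSplits();
}

// Stand-in for TaggedInputSplit: a split plus the format and mapper to use.
class TaggedSplit {
    final String split;
    final String inputFormat;
    final String mapper;

    TaggedSplit(String split, String inputFormat, String mapper) {
        this.split = split;
        this.inputFormat = inputFormat;
        this.mapper = mapper;
    }
}

// One configured table: its input format, the format's name, and its mapper.
class TableEntry {
    final SplitSource format;
    final String formatName;
    final String mapperName;

    TableEntry(SplitSource format, String formatName, String mapperName) {
        this.format = format;
        this.formatName = formatName;
        this.mapperName = mapperName;
    }
}

class DelegatingSplitter {
    // Asks each table's input format for its splits and tags every split
    // with the input format and mapper registered for that table.
    static List<TaggedSplit> getSplits(Map<String, TableEntry> tables) {
        List<TaggedSplit> tagged = new ArrayList<>();
        for (TableEntry entry : tables.values()) {
            for (String split : entry.format.getSplits()) {
                tagged.add(new TaggedSplit(split, entry.formatName, entry.mapperName));
            }
        }
        return tagged;
    }
}
```

Because every split carries its own tag, the framework can later hand each 
one to the right input format and mapper without consulting the table 
mapping again.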

The createRecordReader() method in DelegatingTableInputFormat could have the 
same implementation as the one in Hadoop's DelegatingInputFormat.


