[jira] Created: (MAPREDUCE-2070) Cartesian product file split

Paul Burkhardt (JIRA) Wed, 15 Sep 2010 16:05:16 -0700

Cartesian product file split
----------------------------

                 Key: MAPREDUCE-2070
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2070
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
    Affects Versions: 0.22.0
            Reporter: Paul Burkhardt
            Priority: Minor



Generates the Cartesian product of the files in two input directories as file 
pairs and enables a RecordReader to read each split in tuple order, eliminating 
redundant read operations.

The new InputFormat generates splits composed of file combinations as tuples. 
The size of a split is configurable. A RecordReader employs the convenience 
class, CartesianProductFileSplitReader, to generate file pairs in tuple 
order. The actual read operations are delegated to the RecordReader, which 
must implement the CartesianProductTupleReader interface. An implementor of a 
RecordReader can therefore perform file manipulations without restriction and 
still benefit from the tuple-ordering optimization.
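
The tuple ordering described above can be illustrated with a minimal sketch. 
This is not the proposed Hadoop code: it is plain Python, the function names 
cartesian_splits and count_left_reads are hypothetical, and files are stood in 
for by strings. It only shows why emitting pairs in left-major order lets a 
reader reuse the left file across consecutive pairs within a split.

```python
from itertools import islice, product


def cartesian_splits(left_files, right_files, split_size):
    """Yield lists of (left, right) pairs, at most split_size pairs each.

    Pairs are produced in left-major (tuple) order, so all pairs that share
    a left file are adjacent.
    """
    pairs = product(left_files, right_files)
    while True:
        chunk = list(islice(pairs, split_size))
        if not chunk:
            return
        yield chunk


def count_left_reads(split):
    """Count how often the left file changes, i.e. must be (re)read."""
    reads = 0
    previous = None
    for left, _right in split:
        if left != previous:
            reads += 1
            previous = left
    return reads


# Illustrative input: X = 2 left files, Y = 4 right files, N = 2 pairs per split.
splits = list(cartesian_splits(["a1", "a2"], ["b1", "b2", "b3", "b4"], 2))
left_reads = sum(count_left_reads(s) for s in splits)
```

With these illustrative inputs the eight pairs fall into four splits, and each 
split needs its left file read only once.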

In the Cartesian product of two sets with cardinalities X and Y, each element 
x of the first set need only be referenced once rather than Y times, saving 
X(Y-1) references. If the Cartesian product is divided into splits of N pairs 
each (taking N to divide Y), there are X(Y/N) references instead of XY, a 
difference of XY(N-1)/N. If every x has the same size, s bytes, this saves 
reading sXY(N-1)/N bytes.
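
As a worked check of the arithmetic, the short Python sketch below evaluates 
the counts for one set of numbers. The function name reference_savings and the 
sample values are assumptions chosen for illustration, and the sketch assumes 
N divides Y.

```python
def reference_savings(x, y, n, s=1):
    """Return (naive_refs, ordered_refs, bytes_saved) for the left-hand set.

    x, y: cardinalities of the two sets
    n:    number of pairs per split (assumed to divide y)
    s:    size in bytes of each left-hand element
    """
    naive_refs = x * y             # one reference per pair
    ordered_refs = x * (y // n)    # one reference per split in tuple order
    bytes_saved = s * (naive_refs - ordered_refs)
    return naive_refs, ordered_refs, bytes_saved


# Illustrative values: X = 10, Y = 100, N = 20, s = 64 bytes.
naive, ordered, saved = reference_savings(10, 100, 20, s=64)
```

Here XY = 1000 and X(Y/N) = 50, so 950 references are avoided, matching 
XY(N-1)/N = 1000 * 19 / 20.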

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
