Cartesian product file split
----------------------------
Key: MAPREDUCE-2070
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2070
Project: Hadoop Map/Reduce
Issue Type: New Feature
Affects Versions: 0.22.0
Reporter: Paul Burkhardt
Priority: Minor
Generates a Cartesian product of file pairs from two directory inputs and
enables a RecordReader to optimally read the split in tuple order, eliminating
extraneous read operations.
The new InputFormat generates a split composed of file combinations as tuples.
The size of the split is configurable. A RecordReader employs the convenience
class, CartesianProductFileSplitReader, to generate file pairs in tuple
order. The actual read operations are delegated to the RecordReader, which
must implement the CartesianProductTupleReader interface. An implementor of a
RecordReader can therefore perform file manipulations without restriction and
still benefit from the optimization of tuple ordering.
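
To illustrate what tuple ordering buys, here is a minimal sketch in plain Java.
It does not use the Hadoop API or the classes named above; TupleOrderSketch,
FilePair, tupleOrder, split and leftReads are hypothetical names chosen for
this example, and the file names are made up. It enumerates the pairs in tuple
order, cuts them into splits of a fixed size, and counts how often the left
file changes within each split, which is how often it would have to be read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: plain Java, no Hadoop classes, all names hypothetical. */
final class TupleOrderSketch {

    /** One (left file, right file) combination. */
    static final class FilePair {
        final String left;
        final String right;

        FilePair(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    /** Enumerates the product in tuple order: every pair for left[0], then left[1], and so on. */
    static List<FilePair> tupleOrder(List<String> left, List<String> right) {
        List<FilePair> pairs = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                pairs.add(new FilePair(l, r));
            }
        }
        return pairs;
    }

    /** Cuts the ordered pairs into consecutive splits of at most n pairs each. */
    static List<List<FilePair>> split(List<FilePair> pairs, int n) {
        List<List<FilePair>> splits = new ArrayList<>();
        for (int i = 0; i < pairs.size(); i += n) {
            splits.add(new ArrayList<>(pairs.subList(i, Math.min(i + n, pairs.size()))));
        }
        return splits;
    }

    /** Counts left-file reads for one split read in order: one read per run of equal left names. */
    static int leftReads(List<FilePair> split) {
        int reads = 0;
        String current = null;
        for (FilePair p : split) {
            if (!p.left.equals(current)) {
                reads++;
                current = p.left;
            }
        }
        return reads;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a0", "a1", "a2", "a3");
        List<String> right = Arrays.asList("b0", "b1", "b2", "b3", "b4", "b5");
        int total = 0;
        for (List<FilePair> s : split(tupleOrder(left, right), 3)) {
            total += leftReads(s);
        }
        System.out.println("left-file reads: " + total);
    }
}
```

With 4 left files, 6 right files and splits of 3 pairs, every split holds pairs
for a single left file, so the left side is read 8 times in total rather than
once for each of the 24 pairs.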
In the Cartesian product of two sets with cardinalities X and Y, each element
x of the first set need only be referenced once rather than Y times, saving
X(Y-1) references. If the Cartesian product is split into subsets of size N,
there are then X(Y/N) references instead of XY, a difference of XY(N-1)/N.
If every x has the same size, s, this saves reading sXY(N-1)/N bytes.
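
The arithmetic above can be checked with a small worked example. The sketch
below is plain Java with hypothetical names (ReadSavingsSketch and its
methods), and it assumes that N divides Y evenly, which the description does
not state. The sample values X = 4, Y = 6, N = 3 and s = 1024 are made up for
illustration.

```java
/** Illustrative arithmetic only; class and method names are hypothetical. */
final class ReadSavingsSketch {

    /** References to elements of the first set under the split scheme: X(Y/N). */
    static long splitReferences(long x, long y, long n) {
        if (n <= 0 || y % n != 0) {
            // Assumption of this sketch, not a stated requirement of the feature.
            throw new IllegalArgumentException("this sketch assumes N > 0 and N divides Y");
        }
        return x * (y / n);
    }

    /** References saved relative to XY, which equals XY(N-1)/N. */
    static long savedReferences(long x, long y, long n) {
        return x * y - splitReferences(x, y, n);
    }

    /** Bytes not re-read when every x has the same size s: sXY(N-1)/N. */
    static long savedBytes(long s, long x, long y, long n) {
        return s * savedReferences(x, y, n);
    }

    public static void main(String[] args) {
        long x = 4, y = 6, n = 3, s = 1024;
        System.out.println("references with splits: " + splitReferences(x, y, n));
        System.out.println("references saved:       " + savedReferences(x, y, n));
        System.out.println("bytes saved:            " + savedBytes(s, x, y, n));
    }
}
```

For these values XY = 24, X(Y/N) = 8, and the saving is 24(3-1)/3 = 16
references, or 16384 bytes. Setting N = Y gives the unsplit case, where the
saving is X(Y-1) = 20 references.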