[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Burkhardt updated MAPREDUCE-2070:
--------------------------------------

    Attachment: MAPREDUCE-2070

Patched against the trunk.

> Cartesian product file split
> ----------------------------
>
>                 Key: MAPREDUCE-2070
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2070
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.22.0
>            Reporter: Paul Burkhardt
>            Priority: Minor
>         Attachments: MAPREDUCE-2070
>
>
> Generates a Cartesian product of file pairs from two directory inputs and 
> enables a RecordReader to optimally read the split in tuple order, 
> eliminating extraneous read operations.
> The new InputFormat generates a split composed of file combinations as 
> tuples. The size of the split is configurable. A RecordReader employs the 
> convenience class, CartesianProductFileSplitReader, to generate file pairs in 
> tuple ordering. The actual read operations are delegated to the RecordReader 
> which must implement the CartesianProductTupleReader interface. An 
> implementor of a RecordReader can perform file manipulations without 
> restriction and also benefit from the optimization of tuple ordering.
> In the Cartesian product of two sets with cardinalities X and Y, each 
> element x of the first set need only be referenced once rather than Y times, 
> saving X(Y-1) references. If the Cartesian product is split into subsets of 
> size N, there are X(Y/N) references instead of XY, a difference of 
> XY(N-1)/N. If every x has the same size s, this saves reading sXY(N-1)/N 
> bytes.
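
The reference counts in the description can be checked with a small simulation. The sketch below is not taken from the attached patch: the class and method names are hypothetical, and it assumes tuples are enumerated in x-major order, that splits hold N consecutive tuples, and that N divides Y. Under those assumptions a reader only re-reads x at the start of a split or when x changes.

```java
public class TupleOrderReadCount {

    /** Reads of first-set elements when every tuple re-reads its x: X * Y. */
    static long naiveReads(int x, int y) {
        return (long) x * y;
    }

    /**
     * Reads of first-set elements when tuples are visited in x-major order
     * and grouped into splits of n consecutive tuples. An element x is read
     * again only when a new split starts or the x index changes.
     */
    static long tupleOrderReads(int x, int y, int n) {
        long total = (long) x * y;
        long reads = 0;
        int lastX = -1;
        for (long t = 0; t < total; t++) {
            boolean newSplit = (t % n == 0);   // first tuple of a split
            int curX = (int) (t / y);          // x index of tuple t
            if (newSplit || curX != lastX) {
                reads++;
                lastX = curX;
            }
        }
        return reads;
    }

    public static void main(String[] args) {
        int x = 4, y = 6, n = 3;               // illustrative sizes only
        long naive = naiveReads(x, y);
        long ordered = tupleOrderReads(x, y, n);
        // prints: naive=24 ordered=8 saved=16
        System.out.println("naive=" + naive + " ordered=" + ordered
                + " saved=" + (naive - ordered));
    }
}
```

With X=4, Y=6, N=3 the simulation gives X(Y/N) = 8 references and a saving of XY(N-1)/N = 16, matching the formulas above. Setting N equal to Y yields X references, which is the "each x referenced once" case.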

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
