[
https://issues.apache.org/jira/browse/MAPREDUCE-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Burkhardt updated MAPREDUCE-2070:
--------------------------------------
Attachment: MAPREDUCE-2070
Patched against the trunk.
> Cartesian product file split
> ----------------------------
>
> Key: MAPREDUCE-2070
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2070
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Affects Versions: 0.22.0
> Reporter: Paul Burkhardt
> Priority: Minor
> Attachments: MAPREDUCE-2070
>
>
> Generates a Cartesian product of file pairs from two directory inputs and
> enables a RecordReader to optimally read the split in tuple order,
> eliminating extraneous read operations.
> The new InputFormat generates a split composed of file combinations as
> tuples. The size of the split is configurable. A RecordReader uses the
> convenience class, CartesianProductFileSplitReader, to generate file pairs in
> tuple order. The actual read operations are delegated to the RecordReader,
> which must implement the CartesianProductTupleReader interface. An
> implementor of a RecordReader can perform file manipulations without
> restriction and still benefit from the optimization of tuple ordering.
> In the Cartesian product of two sets with cardinalities X and Y, each
> element x of the first set need only be referenced once, saving X(Y-1)
> references. If the Cartesian product is split into subsets of size N, there
> are X(Y/N) references instead of XY, a difference of XY(N-1)/N.
> If every x has the same size s, this saves reading sXY(N-1)/N bytes.
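
The counting argument in the description can be checked with a small, standalone sketch. It does not use the patch's classes or any Hadoop API. The class name TupleOrderSketch and its methods are hypothetical, and it assumes pairs are enumerated with the first set as the outer loop and that a reader keeps the current x cached only within one split. Under those assumptions the measured count matches X(Y/N) when N evenly divides Y.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of tuple ordering; not part of the attached patch. */
public class TupleOrderSketch {

    /** Enumerates index pairs (x, y) with x as the outer loop (tuple order). */
    public static List<int[]> pairsInTupleOrder(int xCount, int yCount) {
        List<int[]> pairs = new ArrayList<>();
        for (int x = 0; x < xCount; x++) {
            for (int y = 0; y < yCount; y++) {
                pairs.add(new int[] {x, y});
            }
        }
        return pairs;
    }

    /** A naive reader re-reads the left element for every pair: XY reads. */
    public static long naiveLeftReads(int xCount, int yCount) {
        return (long) xCount * yCount;
    }

    /**
     * Counts left-element reads when the ordered pairs are cut into splits of
     * splitSize pairs and each split caches only the x it read most recently.
     */
    public static long orderedLeftReads(int xCount, int yCount, int splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<int[]> pairs = pairsInTupleOrder(xCount, yCount);
        long reads = 0;
        int cachedX = -1;
        for (int i = 0; i < pairs.size(); i++) {
            if (i % splitSize == 0) {
                cachedX = -1; // a new split starts with nothing cached
            }
            int x = pairs.get(i)[0];
            if (x != cachedX) {
                reads++;      // x must be read again
                cachedX = x;
            }
        }
        return reads;
    }

    /** Closed form from the description: XY(N-1)/N reads saved. */
    public static long closedFormSavings(int xCount, int yCount, int splitSize) {
        return (long) xCount * yCount * (splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        int x = 4, y = 6, n = 3; // n evenly divides y
        long naive = naiveLeftReads(x, y);
        long ordered = orderedLeftReads(x, y, n);
        System.out.println("naive reads:   " + naive);
        System.out.println("ordered reads: " + ordered);
        System.out.println("saved:         " + (naive - ordered));
        System.out.println("closed form:   " + closedFormSavings(x, y, n));
    }
}
```

When N does not divide Y, a split can straddle two values of x, so the measured count may differ from the closed form. The sketch counts reads directly rather than relying on the formula.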