align map splits on sorted files with key boundaries
----------------------------------------------------
Key: HADOOP-2921
URL: https://issues.apache.org/jira/browse/HADOOP-2921
Project: Hadoop Core
Issue Type: New Feature
Affects Versions: 0.16.0
Reporter: Joydeep Sen Sarma
(This is something we have implemented in the application layer; it may be
useful to have in Hadoop itself.)
Long-term log storage systems often keep data sorted by some sort key. Future
computations on such files can often benefit from this sort order: records
with the same sort key are adjacent within a file, so if a job groups by the
sort key, it should be possible to do the reduction in the map stage itself.
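
As a rough illustration of reduction in the map stage, the sketch below sums
values over consecutive records that share a key. It is plain Java with no
Hadoop dependency; the class name SortedRunReducer, the method sumRuns, and
the choice of a sum as the reduction are all assumptions made for this
example, not part of Hadoop or of the implementation mentioned above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: folds each run of equal keys into one (key, sum) pair.
final class SortedRunReducer {
    /**
     * Input must be sorted (or at least grouped) by key. Only one running
     * total is held in memory at a time, because a key never reappears
     * once its run has ended.
     */
    static List<Map.Entry<String, Long>> sumRuns(List<Map.Entry<String, Long>> sorted) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        boolean open = false; // true once at least one record has been seen
        for (Map.Entry<String, Long> rec : sorted) {
            if (open && !rec.getKey().equals(currentKey)) {
                // Key changed: the previous run is complete, so emit it.
                out.add(new SimpleEntry<String, Long>(currentKey, sum));
                sum = 0;
            }
            currentKey = rec.getKey();
            sum += rec.getValue();
            open = true;
        }
        if (open) {
            out.add(new SimpleEntry<String, Long>(currentKey, sum)); // flush the last run
        }
        return out;
    }
}
```

The result is only a complete reduction if every record for a key is seen by
the same map task, which is the alignment problem described next.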
Hadoop does not support this natively (except in the degenerate case of one
file per map task), since a split boundary can fall in the middle of a run of
records that share a sort key. However, aligning the data read by each map
task to sort-key boundaries is straightforward, and this would be a useful
capability to have in Hadoop.
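
One way to picture the alignment, as a minimal sketch: treat a file as an
array of sort keys and a split as a range of record indices, then push each
split boundary forward to the next point where the key changes. The names
KeyAlignedSplit, align, and alignSplit are hypothetical. A real reader works
on byte offsets and would also need the key of the record just before the
split start, which this in-memory version simply looks up.

```java
// Hypothetical sketch: aligns record-index splits to sort-key boundaries.
final class KeyAlignedSplit {
    /**
     * Returns the smallest index j >= i at which a new key run starts:
     * j == 0, j == keys.length, or keys[j] differs from keys[j - 1].
     */
    static int align(String[] keys, int i) {
        int j = i;
        while (j > 0 && j < keys.length && keys[j].equals(keys[j - 1])) {
            j++; // still inside the run that began before the boundary
        }
        return j;
    }

    /** Maps a raw split [start, end) to the aligned range {alignedStart, alignedEnd}. */
    static int[] alignSplit(String[] keys, int start, int end) {
        // Both ends use the same rule, so adjacent splits stay contiguous:
        // a run that straddles a boundary goes entirely to the earlier split.
        return new int[] { align(keys, start), align(keys, end) };
    }
}
```

For keys a a a b b c c c c d and raw splits [0,4), [4,8), [8,10), this yields
[0,5), [5,9), [9,10): every record is read exactly once and no key run is
divided between map tasks. A split that lies wholly inside one run becomes
empty.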
The definition of the sort key should be left to the application through a
generic interface (it is not necessarily the key field in a SequenceFile).
Given such an interface, the SequenceFile and text file readers can use the
extracted sort key to align map task data with key boundaries.
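
A possible shape for such an interface, purely as a sketch: the application
supplies a function from a record to its sort key, and the reader compares
the keys of neighbouring records to find boundaries. SortKeyExtractor,
TabPrefixExtractor, and BoundaryDetector are invented names for this
example and do not exist in Hadoop. The tab-delimited layout is likewise an
assumption.

```java
// Hypothetical application-supplied hook: extracts the sort key from a record.
interface SortKeyExtractor<R> {
    String sortKey(R record);
}

// Example extractor for text lines, assuming the sort key is the text
// before the first tab (or the whole line if there is no tab).
final class TabPrefixExtractor implements SortKeyExtractor<String> {
    @Override
    public String sortKey(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? line : line.substring(0, tab);
    }
}

// Reader-side logic that needs no knowledge of the record layout.
final class BoundaryDetector {
    /** True if current begins a new key run; previous == null means start of file. */
    static <R> boolean startsNewRun(SortKeyExtractor<R> extractor, R previous, R current) {
        return previous == null
                || !extractor.sortKey(previous).equals(extractor.sortKey(current));
    }
}
```

Keeping the extractor separate from the boundary check lets the same reader
code serve both text and SequenceFile input, with only the extractor
changing per application.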