[ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574693#action_12574693 ]
Doug Cutting commented on HADOOP-2921:
--------------------------------------
We could implement this by adding an abstract method to
SequenceFileRecordReader that's called when it is first opened and that could
scan forward to a key boundary, right? Then one could define a subclass of
SequenceFileInputFormat that uses this RecordReader. Similarly for
TextInputFormat.
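
To make the idea concrete, here is a rough sketch of the hook, not actual Hadoop code. AlignedReader, KeyBoundaryReader and alignToKeyBoundary are hypothetical names, and records are addressed by list index rather than by byte offset, so sync markers and real file I/O are left out. It assumes the input is already sorted by key.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a record reader over a sorted file.
// Records are addressed by list index instead of byte offset.
abstract class AlignedReader<K, V> {
    protected final List<K> keys;
    protected final List<V> values;

    AlignedReader(List<K> keys, List<V> values) {
        this.keys = keys;
        this.values = values;
    }

    // Hook called when a split is opened: returns the first index at or
    // after pos where a new key group begins.
    protected abstract int alignToKeyBoundary(int pos);

    // Reads the split [start, end), shifted so that it begins and ends on
    // key boundaries. Assumes 0 <= start <= end <= keys.size().
    List<V> read(int start, int end) {
        int from = alignToKeyBoundary(start);
        // The next split starts at the aligned end, so this one stops there.
        int to = alignToKeyBoundary(end);
        return new ArrayList<>(values.subList(from, to));
    }
}

class KeyBoundaryReader<K, V> extends AlignedReader<K, V> {
    KeyBoundaryReader(List<K> keys, List<V> values) {
        super(keys, values);
    }

    @Override
    protected int alignToKeyBoundary(int pos) {
        int i = pos;
        // Index 0 is always a boundary. Otherwise scan forward while the
        // current key equals the previous one.
        while (i > 0 && i < keys.size() && keys.get(i).equals(keys.get(i - 1))) {
            i++;
        }
        return i;
    }
}
```

Because every split applies the same rule to both its start and its end, adjacent splits neither overlap nor drop records, and each key group is seen by exactly one map task.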
> the definition of the sort key should be left up to the application (it's not
> necessarily the key field in a Sequencefile)
[ ... ]
> we don't use the key at all - the sort field is embedded in the value itself.
Side note: wouldn't it make more sense to not use the value and to just sort on
part of the key? Then you could pass a Comparator to SequenceFile and the
definition of the sort key is the same. We already have a generic means for
specifying sort keys. I don't see the need for a new one. Why do you prefer
using values to keys?
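
As an illustration of sorting on part of the key, here is a minimal sketch that uses plain java.util.Comparator rather than Hadoop's own comparator classes. LogKey and SortFieldComparator are hypothetical names; the assumption is a composite key whose first field is the sort field.

```java
import java.util.Comparator;

// Hypothetical composite key: the sort field plus extra data that is
// ignored for ordering purposes.
final class LogKey {
    final String sortField;
    final long sequence;

    LogKey(String sortField, long sequence) {
        this.sortField = sortField;
        this.sequence = sequence;
    }
}

// Orders keys by the sort field only, so two keys that differ only in
// their sequence number compare as equal and fall in the same group.
final class SortFieldComparator implements Comparator<LogKey> {
    @Override
    public int compare(LogKey a, LogKey b) {
        return a.sortField.compareTo(b.sortField);
    }
}
```

The same comparator can then define both the sort order and the key boundaries, so no separate sort-key interface is needed.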
> align map splits on sorted files with key boundaries
> ----------------------------------------------------
>
> Key: HADOOP-2921
> URL: https://issues.apache.org/jira/browse/HADOOP-2921
> Project: Hadoop Core
> Issue Type: New Feature
> Affects Versions: 0.16.0
> Reporter: Joydeep Sen Sarma
>
> (this is something that we have implemented in the application layer - may be
> useful to have in hadoop itself).
> long term log storage systems often keep data sorted (by some sort-key).
> future computations on such files can often benefit from this sort order. if
> the job requires grouping by the sort-key - then it should be possible to do
> reduction in the map stage itself.
> this is not natively supported by hadoop (except in the degenerate case of 1
> map file per task) since splits can span the sort-key. however aligning the
> data read by the map task to sort key boundaries is straightforward - and
> this would be a useful capability to have in hadoop.
> the definition of the sort key should be left up to the application (it's not
> necessarily the key field in a Sequencefile) through a generic interface -
> but otherwise - the sequencefile and text file readers can use the extracted
> sort key to align map task data with key boundaries.