[
https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574223#action_12574223
]
Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------
no - didn't override getSplits. i have an inputformat that opens sequencefile
readers for two splits: one is the split handed down from the map task, the
other is a split that covers the rest of the file (positioned right after the
map split).
we skip the first set of records in the map split (unless it starts at offset
0), and we process the first set of records in the next split. (this is the
same way sequencefiles work with sync markers, except that sort key boundaries
serve as the sync positions.)
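
a minimal sketch of the skip/read-ahead rule described above, written in python rather than against the hadoop api. it assumes the file is an in-memory list of (key, value) records sorted by key, and that split offsets are record indices rather than byte offsets. the names group_end and read_aligned are hypothetical and are not part of hadoop or of the implementation mentioned in this comment.

```python
def group_end(records, i, sort_key):
    """Return the index just past the run of equal-key records starting at i."""
    n = len(records)
    if i >= n:
        return n
    key = sort_key(records[i])
    while i < n and sort_key(records[i]) == key:
        i += 1
    return i


def read_aligned(records, start, end, sort_key=lambda rec: rec[0]):
    """Records a map task would see for the split [start, end).

    Assumption: 'records' is sorted by sort_key and offsets are record
    indices (a simplification of byte offsets in a real file).
    """
    # Skip the first set of records, unless the split starts at offset 0.
    # The previous split is responsible for that set.
    first = 0 if start == 0 else group_end(records, start, sort_key)
    # Read past the split end to finish the first set of the next split.
    last = group_end(records, end, sort_key)
    return records[first:last]


records = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5),
           ("c", 6), ("d", 7), ("d", 8)]
splits = [(0, 3), (3, 6), (6, 8)]

for start, end in splits:
    print((start, end), read_aligned(records, start, end))
```

with this rule every record is read by exactly one split and no key is shared between two splits. a split can come back empty when its first set of records runs to or past its end.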
> align map splits on sorted files with key boundaries
> ----------------------------------------------------
>
> Key: HADOOP-2921
> URL: https://issues.apache.org/jira/browse/HADOOP-2921
> Project: Hadoop Core
> Issue Type: New Feature
> Affects Versions: 0.16.0
> Reporter: Joydeep Sen Sarma
>
> (this is something that we have implemented in the application layer - it may
> be useful to have in hadoop itself.)
> long term log storage systems often keep data sorted (by some sort-key).
> future computations on such files can often benefit from this sort order. if
> the job requires grouping by the sort-key - then it should be possible to do
> reduction in the map stage itself.
> this is not natively supported by hadoop (except in the degenerate case of
> one file per map task), since a split boundary can fall inside a run of
> records that share a sort key. however, aligning the data read by the map
> task to sort key boundaries is straightforward - and this would be a useful
> capability to have in hadoop.
> the definition of the sort key should be left up to the application, through
> a generic interface (it's not necessarily the key field in a SequenceFile).
> given that interface, the sequencefile and text file readers can use the
> extracted sort key to align map task data with key boundaries.
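
a rough sketch of the two ideas in the description: an application-supplied sort key extractor, and reduction done in the map stage once the map input is aligned to key boundaries. it is plain python, not hadoop code. map_side_reduce, extract_user and total_bytes are hypothetical names, and the tab-separated "user, bytes" log format is made up for the example.

```python
from itertools import groupby


def map_side_reduce(aligned_records, extract_sort_key, reduce_fn):
    """Group consecutive records by sort key and reduce each group.

    Assumption: 'aligned_records' is sorted by the sort key and no key
    continues into another map task's input.
    """
    return [(key, reduce_fn(key, list(group)))
            for key, group in groupby(aligned_records, key=extract_sort_key)]


def extract_user(line):
    # The application decides what the sort key is; here, the first field.
    return line.split("\t", 1)[0]


def total_bytes(key, lines):
    # Example reduction: sum the second field over all records of one key.
    return sum(int(line.split("\t")[1]) for line in lines)


lines = ["alice\t10", "alice\t5", "bob\t7"]
print(map_side_reduce(lines, extract_user, total_bytes))
```

because groupby only merges adjacent records, this gives complete per-key results only when the input is both sorted and aligned on key boundaries.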