[ 
https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574246#action_12574246
 ] 

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

As I mentioned in a different JIRA, we don't use the key at all; the sort 
field is embedded in the value itself. An interface like the Partitioner 
interface, one that takes both the key and the value and returns an object, 
would do the job for us. (The reader can invoke the equals method on the 
returned objects to determine the boundaries.)
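
To make the idea concrete, here is a rough sketch of what such an interface 
and the boundary check could look like. The names SortKeyExtractor, 
SplitAligner and alignedRecords are hypothetical and are not part of Hadoop; 
the in-memory list stands in for a sorted file, and record indices stand in 
for the byte offsets a real reader would work with.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical interface (not a Hadoop API): like a partitioner, it sees both
// the key and the value, but it returns an opaque sort-key object.
interface SortKeyExtractor<K, V> {
    Object extractSortKey(K key, V value);
}

final class SplitAligner {
    private SplitAligner() {}

    // Returns the records a reader would emit for the nominal split
    // [start, end) so that no sort-key group is divided between two splits.
    // Boundaries are found only by calling equals() on extracted sort keys.
    static <K, V> List<Map.Entry<K, V>> alignedRecords(
            List<Map.Entry<K, V>> sorted, int start, int end,
            SortKeyExtractor<K, V> extractor) {
        end = Math.min(end, sorted.size());
        int from = start;
        // Skip the tail of a group that began in an earlier split.
        if (start > 0 && start <= sorted.size()) {
            Object prev = keyAt(sorted, start - 1, extractor);
            while (from < sorted.size()
                    && prev.equals(keyAt(sorted, from, extractor))) {
                from++;
            }
        }
        // The whole split lies inside a group owned by an earlier split.
        if (from >= end) {
            return Collections.emptyList();
        }
        // Read past the nominal end until the sort key changes.
        int to = end;
        Object last = keyAt(sorted, end - 1, extractor);
        while (to < sorted.size()
                && last.equals(keyAt(sorted, to, extractor))) {
            to++;
        }
        return new ArrayList<>(sorted.subList(from, to));
    }

    private static <K, V> Object keyAt(
            List<Map.Entry<K, V>> records, int i,
            SortKeyExtractor<K, V> extractor) {
        Map.Entry<K, V> e = records.get(i);
        return extractor.extractSortKey(e.getKey(), e.getValue());
    }
}

final class SplitAlignerDemo {
    public static void main(String[] args) {
        // The key is unused (null); the sort field is the prefix of the value.
        List<Map.Entry<Object, String>> recs = new ArrayList<>();
        for (String v : new String[] {"a\t1", "a\t2", "b\t3", "b\t4", "b\t5", "c\t6"}) {
            recs.add(new SimpleImmutableEntry<>(null, v));
        }
        SortKeyExtractor<Object, String> byPrefix =
                (k, v) -> v.substring(0, v.indexOf('\t'));
        // Nominal splits [0,3) and [3,6) cut the "b" group in half.
        System.out.println(SplitAligner.alignedRecords(recs, 0, 3, byPrefix).size()); // 5
        System.out.println(SplitAligner.alignedRecords(recs, 3, 6, byPrefix).size()); // 1
    }
}
```

In this sketch a group belongs to the split in which its first record falls: 
the first split reads on through the rest of the "b" records, and the second 
split skips them and starts at "c".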

Yeah, we shouldn't change the default semantics of the current reader. Either 
add an option that alters the semantics or provide a new reader.

What about text files? We don't use them much directly (we always embed our 
data in SequenceFiles), but I imagine other folks do, and the same 
considerations can apply.

> align map splits on sorted files with key boundaries
> ----------------------------------------------------
>
>                 Key: HADOOP-2921
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2921
>             Project: Hadoop Core
>          Issue Type: New Feature
>    Affects Versions: 0.16.0
>            Reporter: Joydeep Sen Sarma
>
> (This is something that we have implemented in the application layer; it may 
> be useful to have in Hadoop itself.)
> Long-term log storage systems often keep data sorted (by some sort key). 
> Future computations on such files can often benefit from this sort order. If 
> the job requires grouping by the sort key, then it should be possible to do 
> the reduction in the map stage itself.
> This is not natively supported by Hadoop (except in the degenerate case of 1 
> map file per task), since splits can span the sort key. However, aligning the 
> data read by the map task to sort-key boundaries is straightforward, and 
> this would be a useful capability to have in Hadoop.
> The definition of the sort key should be left up to the application (it's not 
> necessarily the key field in a SequenceFile) through a generic interface. 
> Otherwise, the SequenceFile and text file readers can use the extracted 
> sort key to align map task data with key boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
