[jira] Commented: (HADOOP-287) Speed up SequenceFile sort with memory reduction

Pi Song (JIRA) Tue, 19 Feb 2008 05:17:17 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570206#action_12570206
 ]


Pi Song commented on HADOOP-287:
--------------------------------

This should be measurable only if the output key set of a map task is big 
enough. This may happen when the initial Map generates many  {K2,V2} for one 
input tuple {K1,V1}.

Still I think merge sort in some cases is worth considering (but not for the 
current implementation)

1. For small data sets,  performance characteristic of merge sort and quick 
sort should be almost the same in many ways 
- Neglect the fact that merge sort requires more memory which is small in this 
case
- Merge sort on index array doesn't incur more objects for GC to be collected

2. For medium-sized data sets (I don't indicate how big is medium), quick sort 
should be better (assume good pivoting so worst cases are rare).

3. For very large data sets, a good implementation of merge sort which allows  
max memory setting and really does multi-pass pipelined merging on the capped 
memory buffer should yield better performance plus better control over memory 
usage.

> Speed up SequenceFile sort with memory reduction
> ------------------------------------------------
>
>                 Key: HADOOP-287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-287
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.17.0
>            Reporter: Benjamin Reed
>            Assignee: Chris Douglas
>         Attachments: 287-1.patch, 287-2.patch, 287-2.patch, s.patch, 
> zoom-sort.patch, zoom-sort.patch
>
>
> I replaced the merge sort with a quick sort and it yielded approx 30% 
> improvement in sort time. It also reduced the memory requirement for sorting 
> because the sort is done in place.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-287) Speed up SequenceFile sort with memory reduction

Reply via email to