[ 
https://issues.apache.org/jira/browse/HADOOP-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570364#action_12570364
 ] 

Chris Douglas commented on HADOOP-287:
--------------------------------------

The difference in time taken by the respective sort algorithms is not 
significant compared to the overhead incurred in the sort phase, i.e. the time 
it requires to recreate the BlahSorter/BlahSort objects, recreate and resize 
all the accounting arrays, and effect the indexed compares/swaps via 
IntWritables. There is an additional copy for the index array during mergesort, 
but- again- this is insignificant when considering the cost of the sort/spill. 
Note that the time taken for each invocation of sort is proportional to the 
number of keys collected for each partition, which itself is a small fraction 
of the time spent in that phase; the number of invocations of sort is roughly 
proportional to the number of reducers, or more specifically: the distribution 
of output keys effected from the partitioner from a given map within the 
io.sort.mb limit. Given the current structure of the sort/spill, the particular 
sort algorithm is unlikely to have a notable impact in the general case. The 
size of the dataset or the proportion of input to output keys should have 
little bearing, since the merge code for spilled data is the same, regardless 
of the sort algorithm selected.

> Speed up SequenceFile sort with memory reduction
> ------------------------------------------------
>
>                 Key: HADOOP-287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-287
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.17.0
>            Reporter: Benjamin Reed
>            Assignee: Chris Douglas
>         Attachments: 287-1.patch, 287-2.patch, 287-2.patch, s.patch, 
> zoom-sort.patch, zoom-sort.patch
>
>
> I replaced the merge sort with a quick sort and it yielded approx 30% 
> improvement in sort time. It also reduced the memory requirement for sorting 
> because the sort is done in place.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to