[
https://issues.apache.org/jira/browse/HADOOP-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570364#action_12570364
]
Chris Douglas commented on HADOOP-287:
--------------------------------------
The difference in time taken by the respective sort algorithms is not
significant compared to the overhead incurred in the sort phase, i.e. the time
it requires to recreate the BlahSorter/BlahSort objects, recreate and resize
all the accounting arrays, and effect the indexed compares/swaps via
IntWritables. There is an additional copy for the index array during mergesort,
but- again- this is insignificant when considering the cost of the sort/spill.
Note that the time taken for each invocation of sort is proportional to the
number of keys collected for each partition, which itself is a small fraction
of the time spent in that phase; the number of invocations of sort is roughly
proportional to the number of reducers, or more specifically: the distribution
of output keys effected from the partitioner from a given map within the
io.sort.mb limit. Given the current structure of the sort/spill, the particular
sort algorithm is unlikely to have a notable impact in the general case. The
size of the dataset or the proportion of input to output keys should have
little bearing, since the merge code for spilled data is the same, regardless
of the sort algorithm selected.
> Speed up SequenceFile sort with memory reduction
> ------------------------------------------------
>
> Key: HADOOP-287
> URL: https://issues.apache.org/jira/browse/HADOOP-287
> Project: Hadoop Core
> Issue Type: Improvement
> Components: io
> Affects Versions: 0.17.0
> Reporter: Benjamin Reed
> Assignee: Chris Douglas
> Attachments: 287-1.patch, 287-2.patch, 287-2.patch, s.patch,
> zoom-sort.patch, zoom-sort.patch
>
>
> I replaced the merge sort with a quick sort and it yielded approx 30%
> improvement in sort time. It also reduced the memory requirement for sorting
> because the sort is done in place.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.