[
https://issues.apache.org/jira/browse/DRILL-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers resolved DRILL-5027.
--------------------------------
Resolution: Fixed
> ExternalSortBatch is inefficient: rewrites data unnecessarily
> -------------------------------------------------------------
>
> Key: DRILL-5027
> URL: https://issues.apache.org/jira/browse/DRILL-5027
> Project: Apache Drill
> Issue Type: Sub-task
> Affects Versions: 1.8.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Minor
>
> The {{ExternalSortBatch}} (ESB) operator sorts data while spilling to disk as
> needed to operate within a memory budget.
> The sort happens in two phases:
> 1. Gather the incoming batches from the upstream operator, sort them, and
> spill to disk as needed.
> 2. Merge the "runs" spilled in step 1.
> In most cases, the second step should run within the memory available for the
> first step (which is why severity is only Minor). Large queries need multiple
> sort "phases" in which previously spilled runs are read back into memory,
> merged, and again spilled. It is here that ESB has an issue. This process
> correctly limit the amount of memory used, but at the cost or rewriting the
> same data over and over.
> Consider current Drill behavior:
> {code}
> a b c d (re-spill)
> abcd e f g h (re-spill)
> abcefgh i j k
> {code}
> That is batches, a, b, c and d are re-spilled to create the combined abcd,
> and so on. The same data is rewritten over and over.
> Note that spilled batches take no (direct) memory in Drill, and require only
> a small on-heap memento. So, maintaining data on disk s "free". So, better
> would be to re-spill only newer data:
> {code}
> a b c d (re-spill)
> abcd | e f g h (re-spill)
> abcd efgh | i j k
> {code}
> Where the bar indicates a moving point at which we've already merged and do
> not need to do so again. If each letter is one unit of disk I/O, the original
> method uses 35 units while the revised method uses 27 units.
> At some point the process may have to repeat by merging the second-generation
> spill files and so on.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)