[ 
https://issues.apache.org/jira/browse/DRILL-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654495#comment-15654495
 ] 

Paul Rogers commented on DRILL-5027:
------------------------------------

Proposed solution, which seems reachable by rearranging the existing pieces of 
ESB:

Assume three "generations":

* An in-memory generation of sorted batches received from upstream, not yet 
spilled.
* A "new" spill generation: those files created directly by spilling the 
in-memory generation.
* An "old" spill generation: those files created by re-spilling new generation 
files.

The spill would work as follows:

* For each upstream batch, sort it and add it to the in-memory generation.
* When the in-memory generation reaches the spill threshold, merge the 
in-memory batches and write to a spill file. Add the spill file to the new 
spill generation. At this point, ESB memory is empty.
* If the new spill generation has reached the spill threshold, merge the 
spilled batches and write to another spill file, then delete the files that 
were just merged. Add the newly created file to the old spill generation. The 
new spill generation is now empty (as is memory.)
* If the old spill generation has reached the spill threshold, transfer it to 
the new generation and spill as above. The old generation now has a single 
file. (The other two generations are empty.)
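The generational bookkeeping above can be sketched as follows. This is an illustrative model only, not the actual ExternalSortBatch code: the class and method names (GenerationalSpiller, addBatch, merge) are hypothetical, and strings stand in for sorted batches and spill files.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the proposed three-generation spill bookkeeping.
 * Strings stand in for sorted batches and spill files; merge()
 * stands in for "merge these runs and write one spill file".
 */
public class GenerationalSpiller {
    final int threshold;                              // spill threshold per generation
    final List<String> inMemory = new ArrayList<>();  // sorted batches, not yet spilled
    final List<String> newGen   = new ArrayList<>();  // files from spilling memory
    final List<String> oldGen   = new ArrayList<>();  // files from re-spilling newGen

    public GenerationalSpiller(int threshold) { this.threshold = threshold; }

    /** Simulate receiving one sorted upstream batch. */
    public void addBatch(String batch) {
        inMemory.add(batch);
        if (inMemory.size() >= threshold) {   // spill: memory -> new generation
            newGen.add(merge(inMemory));
            inMemory.clear();                 // ESB memory is now empty
        }
        if (newGen.size() >= threshold) {     // re-spill: new -> old generation
            oldGen.add(merge(newGen));
            newGen.clear();                   // (the consumed files would be deleted here)
        }
        if (oldGen.size() >= threshold) {     // collapse old generation to one file
            newGen.addAll(oldGen);
            oldGen.clear();
            oldGen.add(merge(newGen));
            newGen.clear();
        }
    }

    static String merge(List<String> runs) { return String.join("", runs); }
}
```

With a threshold of 2, feeding batches a, b, c, d leaves a single old-generation file "abcd": a and b merge into one new-generation file, c and d into a second, and the two new-generation files re-spill into one old-generation file.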

The spill threshold is defined as:

* Start with the memory budget for the ESB.
* Define a target spill-batch size. (The smaller of 32K rows or some defined 
size in bytes.)
* Define the maximum number of in-memory batches as memory budget / spill-batch 
size.
* Set the spill threshold to some number less than the maximum number of 
in-memory batches.

When gathering incoming batches in memory, or reading batches from disk, the 
above ensures that total memory used is less than the budget.

Benefits of this approach:

* Minimizes read/writes of existing spilled data (overcomes the re-spill issue 
above.)
* Ensures that disk files are deleted as soon as possible.
* Ensures that ESB operates within a defined memory budget.
* Handles data of any size; the algorithm above simply continues to combine 
generations as needed. Trades off performance (more disk I/O) for a fixed 
memory budget.
* Limits disk use to no more than twice the amount of spilled data (to account 
for merging the old generation).

> ExternalSortBatch is inefficient, leaks files for large queries
> ---------------------------------------------------------------
>
>                 Key: DRILL-5027
>                 URL: https://issues.apache.org/jira/browse/DRILL-5027
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> The {{ExternalSortBatch}} (ESB) operator sorts data while spilling to disk as 
> needed to operate within a memory budget.
> The sort happens in two phases:
> 1. Gather the incoming batches from the upstream operator, sort them, and 
> spill to disk as needed.
> 2. Merge the "runs" spilled in step 1.
> In most cases, the second step should run within the memory available for the 
> first step (which is why severity is only Minor). Large queries need multiple 
> sort "phases" in which previously spilled runs are read back into memory, 
> merged, and again spilled. It is here that ESB has an issue. This process 
> correctly limits the amount of memory used, but at the cost of rewriting the 
> same data over and over.
> Consider current Drill behavior:
> {code}
> a b c d (re-spill)
> abcd e f g h (re-spill)
> abcdefgh i j k
> {code}
> That is, batches a, b, c, and d are re-spilled to create the combined abcd, 
> and so on. The same data is rewritten over and over.
> Note that spilled batches take no (direct) memory in Drill, and require only 
> a small on-heap memento, so maintaining data on disk is "free". Better, then, 
> would be to re-spill only newer data:
> {code}
> a b c d (re-spill)
> abcd | e f g h (re-spill)
> abcd efgh | i j k
> {code}
> Where the bar indicates a moving point at which we've already merged and do 
> not need to do so again. If each letter is one unit of disk I/O, the original 
> method uses 35 units while the revised method uses 27 units.
> At some point the process may have to repeat by merging the second-generation 
> spill files and so on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
