[
https://issues.apache.org/jira/browse/DRILL-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers updated DRILL-5027:
-------------------------------
Description:
The {{ExternalSortBatch}} (ESB) operator sorts data while spilling to disk as
needed to operate within a memory budget.
The sort happens in two phases:
1. Gather the incoming batches from the upstream operator, sort them, and spill
to disk as needed.
2. Merge the "runs" spilled in step 1.
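As a rough illustration of the pattern (lists of ints stand in for record
batches and spill files; hypothetical names, not the actual ESB code):
{code}
import java.util.*;

public class TwoPhaseSortSketch {
  static final int MEMORY_BUDGET = 4;  // max records held in memory

  static List<Integer> sort(Iterator<Integer> upstream) {
    List<List<Integer>> runs = new ArrayList<>();
    List<Integer> inMemory = new ArrayList<>();
    // Phase 1: gather incoming records, sort, and spill as the budget fills.
    while (upstream.hasNext()) {
      inMemory.add(upstream.next());
      if (inMemory.size() >= MEMORY_BUDGET) {
        Collections.sort(inMemory);
        runs.add(inMemory);              // "spill" one sorted run
        inMemory = new ArrayList<>();
      }
    }
    if (!inMemory.isEmpty()) { Collections.sort(inMemory); runs.add(inMemory); }
    // Phase 2: merge the runs with a priority queue of per-run cursors.
    PriorityQueue<int[]> heap = new PriorityQueue<>(
        Comparator.comparingInt((int[] c) -> runs.get(c[0]).get(c[1])));
    for (int r = 0; r < runs.size(); r++) heap.add(new int[] {r, 0});
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] c = heap.poll();
      out.add(runs.get(c[0]).get(c[1]));
      if (++c[1] < runs.get(c[0]).size()) heap.add(c);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Integer> data = Arrays.asList(9, 3, 7, 1, 8, 2, 6, 4, 5);
    System.out.println(sort(data.iterator()));  // [1, 2, 3, 4, 5, 6, 7, 8, 9]
  }
}
{code}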
In most cases, the second step should run within the memory available for the
first step (which is why severity is only Minor). Large queries need multiple
sort "phases" in which previously spilled runs are read back into memory,
merged, and again spilled. It is here that ESB has an issue: the process
correctly limits the amount of memory used, but at the cost of rewriting the
same data over and over.
Consider current Drill behavior:
{code}
a b c d (re-spill)
abcd e f g h (re-spill)
abcdefgh i j k
{code}
That is, batches a, b, c, and d are re-spilled to create the combined run
abcd, and so on. The same data is rewritten over and over.
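In sketch form (again with lists of ints standing in for spilled runs, and
hypothetical names rather than actual ESB code), the current consolidation
re-reads every run spilled so far:
{code}
import java.util.*;
import java.util.stream.*;

public class RespillAllSketch {
  // Consolidation, current scheme: ALL runs spilled so far are re-read,
  // merged, and rewritten as a single new run.
  static List<List<Integer>> respillAll(List<List<Integer>> runs) {
    List<Integer> combined = runs.stream()       // re-reads a, b, c, d, ...
        .flatMap(List::stream)
        .sorted()
        .collect(Collectors.toList());
    return new ArrayList<>(List.of(combined));   // one run replaces them all
  }

  public static void main(String[] args) {
    List<List<Integer>> runs = new ArrayList<>(List.of(
        List.of(1, 5), List.of(2, 6), List.of(3, 7), List.of(4, 8)));
    System.out.println(respillAll(runs));  // [[1, 2, 3, 4, 5, 6, 7, 8]]
  }
}
{code}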
Note that spilled batches take no (direct) memory in Drill and require only a
small on-heap memento, so keeping data on disk is essentially "free". It would
be better, then, to re-spill only the newer data:
{code}
a b c d (re-spill)
abcd | e f g h (re-spill)
abcd efgh | i j k
{code}
Here the bar marks a moving point to the left of which runs have already been
merged and need not be merged again. If each letter is one unit of disk I/O,
the original method uses 35 units while the revised method uses 27.
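The counts can be checked mechanically. The tally below charges one unit for
each run read and each run written during a re-spill, plus the reads of the
final merge; the initial spill of the eleven batches is identical in both
schemes and is left out of both totals:
{code}
public class RespillCost {
  public static void main(String[] args) {
    // Current scheme: each re-spill re-reads and rewrites ALL runs so far.
    int current = (4 + 4)   // read a b c d, write abcd
                + (8 + 8)   // read abcd e f g h, write abcdefgh
                + 11;       // final merge reads abcdefgh i j k
    // Revised scheme: runs left of the bar are never re-read.
    int revised = (4 + 4)   // read a b c d, write abcd
                + (4 + 4)   // read only e f g h, write efgh
                + 11;       // final merge reads abcd, efgh, i j k
    System.out.println(current + " vs " + revised);  // prints "35 vs 27"
  }
}
{code}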
At some point the process may have to repeat by merging the second-generation
spill files and so on.
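A minimal sketch of the revised strategy (illustrative only, not a proposed
patch): a "merge point" index plays the role of the bar, and only the runs to
its right are ever re-read:
{code}
import java.util.*;
import java.util.stream.*;

public class RespillNewerSketch {
  static int mergePoint = 0;  // index of the bar in the diagrams above

  static void respillNewer(List<List<Integer>> runs) {
    List<List<Integer>> newer = runs.subList(mergePoint, runs.size());
    List<Integer> combined = newer.stream()  // re-read only the newer runs
        .flatMap(List::stream).sorted().collect(Collectors.toList());
    newer.clear();                           // remove the runs just merged...
    runs.add(combined);                      // ...and replace them with one
    mergePoint = runs.size();                // advance the bar past it
  }

  public static void main(String[] args) {
    List<List<Integer>> runs = new ArrayList<>(List.of(
        List.of(1, 5), List.of(2, 6), List.of(3, 7), List.of(4, 8)));
    respillNewer(runs);       // first consolidation: bar moves to 1
    runs.add(List.of(0, 9));  // a newer run arrives
    respillNewer(runs);       // re-reads only [0, 9], not the earlier run
    System.out.println(runs + "  bar at " + mergePoint);
  }
}
{code}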
was:
The {{ExternalSortBatch}} (ESB) operator sorts data while spilling to disk as
needed to operate within a memory budget.
The sort happens in two phases:
1. Gather the incoming batches from the upstream operator, sort them, and spill
to disk as needed.
2. Merge the "runs" spilled in step 1.
In most cases, the second step should run within the memory available for the
first step (which is why severity is only Minor). However, if the query is
exceptionally large, then the second step can use excessive memory.
Here is why.
* In step 1, we create a series of n spill files. Each file contains batches of
some maximum size, say 20 MB.
* In step 2, we must simultaneously open all the spilled files and read the
first batch into memory in preparation for merging.
Suppose the query has 1 TB of data, 10 GB of memory is available, and each
spilled run is about 1 GB. The result will be 1 TB / 1 GB = 1000 spill files.
If each batch in each file is 20 MB, a single-pass merge needs 20 MB * 1000 =
20 GB to operate. But we only have 10 GB.
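Spelled out (sizes in MB, decimal units; the run and batch sizes are the
example's assumptions, not Drill constants):
{code}
public class MergeMemoryMath {
  public static void main(String[] args) {
    long dataMb = 1_000_000;    // 1 TB of input
    long runMb = 1_000;         // each spilled run is about 1 GB
    long batchMb = 20;          // first batch read from each run
    long memoryMb = 10_000;     // 10 GB budget
    long spillFiles = dataMb / runMb;          // 1000 spill files
    long singlePassMb = spillFiles * batchMb;  // 20,000 MB = 20 GB
    System.out.println(singlePassMb > memoryMb
        ? "single-pass merge will not fit"     // 20 GB > 10 GB
        : "single-pass merge fits");
  }
}
{code}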
The typical solution is to perform the merge in multiple phases, each phase
reading and re-spilling only as many runs as will fit in memory. In the case
above, the first phase would make two iterations, each merging 500 GB using
10 GB of memory. The second phase would then make a single iteration to merge
the two first-phase files.
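A quick sketch of that plan, using the numbers above (again, the sizes are the
example's assumptions):
{code}
public class MergePhasePlan {
  public static void main(String[] args) {
    long memoryMb = 10_000, batchMb = 20;
    long fanIn = memoryMb / batchMb;  // 500 runs can be merged at once
    long files = 1000;                // spill files left by step 1
    int phase = 1;
    while (files > fanIn) {
      long next = (files + fanIn - 1) / fanIn;   // ceil(files / fanIn)
      System.out.println("phase " + phase++ + ": " + next
          + " iteration(s), re-spilling all " + files + " runs");
      files = next;
    }
    System.out.println("phase " + phase + ": final merge of " + files + " runs");
    // phase 1: 2 iteration(s), re-spilling all 1000 runs
    // phase 2: final merge of 2 runs
  }
}
{code}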
The result is a great deal of disk I/O, plus a requirement for enough disk
space to store two sets of spill files (one from phase i-1, another for
phase i). But the query will complete within its memory budget.
> ExternalSortBatch is inefficient, leaks files for large queries
> ---------------------------------------------------------------
>
> Key: DRILL-5027
> URL: https://issues.apache.org/jira/browse/DRILL-5027
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.8.0
> Reporter: Paul Rogers
> Priority: Minor
>
> The {{ExternalSortBatch}} (ESB) operator sorts data while spilling to disk as
> needed to operate within a memory budget.
> The sort happens in two phases:
> 1. Gather the incoming batches from the upstream operator, sort them, and
> spill to disk as needed.
> 2. Merge the "runs" spilled in step 1.
> In most cases, the second step should run within the memory available for the
> first step (which is why severity is only Minor). Large queries need multiple
> sort "phases" in which previously spilled runs are read back into memory,
> merged, and again spilled. It is here that ESB has an issue: the process
> correctly limits the amount of memory used, but at the cost of rewriting the
> same data over and over.
> Consider current Drill behavior:
> {code}
> a b c d (re-spill)
> abcd e f g h (re-spill)
> abcdefgh i j k
> {code}
> That is, batches a, b, c, and d are re-spilled to create the combined run
> abcd, and so on. The same data is rewritten over and over.
> Note that spilled batches take no (direct) memory in Drill and require only
> a small on-heap memento, so keeping data on disk is essentially "free". It
> would be better, then, to re-spill only the newer data:
> {code}
> a b c d (re-spill)
> abcd | e f g h (re-spill)
> abcd efgh | i j k
> {code}
> Here the bar marks a moving point to the left of which runs have already
> been merged and need not be merged again. If each letter is one unit of disk
> I/O, the original method uses 35 units while the revised method uses 27.
> At some point the process may have to repeat by merging the second-generation
> spill files and so on.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)