[ 
https://issues.apache.org/jira/browse/DRILL-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5008:
-------------------------------
    Description: 
ExternalSortBatch provides a spillable sort operator for Drill. The code works 
fine, but can be hard to follow and understand. Make the following changes to 
improve ease-of-use for developers:

1. Refactor the large methods into bite-sized chunks to aid understanding.
2. Provide additional explanation of the theory and operation of this operator.

The memory limit cases for the spill-needed test seem inconsistent:

For the test for in-memory sort:
{code}
    long currentlyAvailable = popConfig.getMaxAllocation() - oAllocator.getAllocatedMemory();
{code}

For reaching the memory limit:
{code}
oAllocator.getAllocatedMemory() > .95 * oAllocator.getLimit()
{code}

That is, one test uses `oAllocator.getLimit()` ("the current maximum limit 
this allocator imposes"), while the other uses `popConfig.getMaxAllocation()` 
("The maximum memory this operator can allocate").
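One way to resolve the inconsistency is to route both checks through a single 
bound. A minimal sketch follows; the {{MemoryBudget}} class and its methods are 
hypothetical illustrations, not Drill's actual allocator API:

```java
// Hypothetical sketch: both the in-memory-sort headroom test and the
// spill trigger consult the same limit, rather than mixing
// popConfig.getMaxAllocation() with oAllocator.getLimit().
public class MemoryBudget {
    private final long limit;   // the one memory bound for this operator
    private long allocated;     // bytes currently allocated

    public MemoryBudget(long limit) { this.limit = limit; }

    public void allocate(long bytes) { allocated += bytes; }

    // Headroom for the in-memory sort test, measured against `limit`.
    public long available() { return limit - allocated; }

    // Spill trigger, measured against the same `limit`; the 0.95
    // threshold mirrors the constant used in ExternalSortBatch.
    public boolean nearLimit() { return allocated > 0.95 * limit; }
}
```

With one bound, the two tests can no longer disagree about how much memory the 
operator may use.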

3. The spill operation is based on an estimated record width, but the estimate 
is based on only the first record in the first batch. The record width is then 
used to ensure the spilled records fit within a memory limit. However, a single 
unusual record in the first position will throw off the calculation, possibly 
leading to excess memory use.

A better approach would be to sample a set of records to determine an average 
width, or to find a way to work within the memory limit without a record-width 
calculation (since record width is just an intermediate step toward computing 
memory use).
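The sampling idea can be sketched as follows; the {{estimateWidth}} helper and 
its inputs are illustrative stand-ins, not existing Drill code:

```java
public class WidthEstimator {
    // Average record width over up to sampleSize records. Averaging means
    // a single unusual record in the first position no longer determines
    // the whole estimate.
    public static long estimateWidth(long[] recordSizes, int sampleSize) {
        int n = Math.min(sampleSize, recordSizes.length);
        if (n == 0) {
            return 0;
        }
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += recordSizes[i];
        }
        return total / n;
    }
}
```

For example, given a 10,000-byte outlier followed by nine 100-byte records, 
sampling only the first record estimates 10,000 bytes per record, while 
averaging all ten estimates 1,090.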

  was:
ExternalSortBatch provides a spillable sort operator for Drill. The code works 
fine, but can be hard to follow and understand. Make the following changes to 
improve ease-of-use for developers:

1. Refactor the large methods into bite-sized chunks to aid understanding.
2. Provide additional explanation of the theory and operation of this operator.

The memory limit cases for the spill-needed test seem inconsistent:

For the test for in-memory sort:
{code}
    long currentlyAvailable = popConfig.getMaxAllocation() - oAllocator.getAllocatedMemory();
{code}

For reaching the memory limit:
{code}
oAllocator.getAllocatedMemory() > .95 * oAllocator.getLimit()
{code}

That is, one test uses `oAllocator.getLimit()` ("the current maximum limit 
this allocator imposes"), while the other uses `popConfig.getMaxAllocation()` 
("The maximum memory this operator can allocate").


> Refactor, document and simplify ExternalSortBatch
> -------------------------------------------------
>
>                 Key: DRILL-5008
>                 URL: https://issues.apache.org/jira/browse/DRILL-5008
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> ExternalSortBatch provides a spillable sort operator for Drill. The code 
> works fine, but can be hard to follow and understand. Make the following 
> changes to improve ease-of-use for developers:
> 1. Refactor the large methods into bite-sized chunks to aid understanding.
> 2. Provide additional explanation of the theory and operation of this 
> operator.
> The memory limit cases for the spill-needed test seem inconsistent:
> For the test for in-memory sort:
> {code}
>     long currentlyAvailable = popConfig.getMaxAllocation() - oAllocator.getAllocatedMemory();
> {code}
> For reaching the memory limit:
> {code}
> oAllocator.getAllocatedMemory() > .95 * oAllocator.getLimit()
> {code}
> That is, one test uses `oAllocator.getLimit()` ("the current maximum limit 
> this allocator imposes"), while the other uses `popConfig.getMaxAllocation()` 
> ("The maximum memory this operator can allocate").
> 3. The spill operation is based on an estimated record width, but the 
> estimate is based on only the first record in the first batch. The record 
> width is then used to ensure the spilled records fit within a memory limit. 
> However, a single unusual record in the first position will throw off the 
> calculation, possibly leading to excess memory use.
> A better approach would be to sample a set of records to determine an 
> average width, or to find a way to work within the memory limit without a 
> record-width calculation (since record width is just an intermediate step 
> toward computing memory use).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)