[
https://issues.apache.org/jira/browse/DRILL-5601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093514#comment-16093514
]
ASF GitHub Bot commented on DRILL-5601:
---------------------------------------
Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/860#discussion_r128133649
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortMemoryManager.java ---
@@ -19,7 +19,125 @@
import com.google.common.annotations.VisibleForTesting;
+/**
+ * Computes the memory needs for input batches, spill batches and merge
+ * batches. The key challenges that this code tries to overcome are:
+ * <ul>
+ * <li>Drill is not designed for small memory allocations,
+ * but the planner may provide such allocations because the memory per
+ * query is divided among slices (minor fragments) and among buffering
+ * operators, leaving very little per operator.</li>
+ * <li>Drill does not provide the detailed memory information needed to
+ * carefully manage memory in tight constraints.</li>
+ * <li>But, Drill has a death penalty for going over the memory limit.</li>
+ * </ul>
+ * As a result, this class is a bit of a hack: it attempts to consider a
+ * number of ill-defined factors in order to divide up memory use in a
+ * way that prevents OOM errors.
+ * <p>
+ * First, it is necessary to differentiate two concepts:
+ * <ul>
+ * <li>The <i>data size</i> of a batch: the amount of memory needed to hold
+ * the data itself. The data size is constant for any given batch.</li>
+ * <li>The <i>buffer size</i> of the buffers that hold the data. The buffer
+ * size varies wildly depending on how the batch was produced.</li>
+ * </ul>
+ * The three kinds of buffer layouts seen to date include:
+ * <ul>
+ * <li>One buffer per vector component (data, offsets, null flags, etc.)
+ * – created by readers, project and other operators.</li>
+ * <li>One buffer for the entire batch, with each vector component using
+ * a slice of the overall buffer – the case for batches deserialized from
+ * exchanges.</li>
+ * <li>One buffer for each top-level vector, with component vectors
+ * using slices of the overall vector buffer – the result of reading
+ * spilled batches from disk.</li>
+ * </ul>
+ * In each case, buffer sizes are power-of-two rounded from the data size.
+ * But since the data is grouped differently in each case, the resulting
+ * buffer sizes vary considerably.
+ * <p>
+ * As a result, we can never be sure of the amount of memory needed for a
+ * batch. So, we have to estimate based on a number of factors:
+ * <ul>
+ * <li>The {@link RecordBatchSizer} estimate of the data size and
+ * buffer size of each incoming batch.</li>
+ * <li>An estimate of the internal fragmentation due to power-of-two
+ * rounding.</li>
+ * <li>Configured preferences for spill and output batches.</li>
+ * The code handles "normal" and "low" memory conditions.
+ * <ul>
+ * <li>In normal memory, we simply work out the number of preferred-size
+ * batches that fit in memory (based on the predicted buffer size).</li>
+ * <li>In low memory, we divide up the available memory to produce the
+ * spill and merge batch sizes. The sizes will be less than the configured
+ * preference.</li>
+ * </ul>
+ */
+
public class SortMemoryManager {
static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ExternalSortBatch.class);
+
+ /**
+ * Estimate for typical internal fragmentation in a buffer due to
+ * power-of-two rounding on vectors.
+ * <p>
+ * <pre>[____|__$__]</pre>
+ * In the above, the brackets represent the whole vector. The
+ * first half is always full. When the first half filled up, the second
+ * half was allocated. On average, the second half will be half full.
+ * This means that, on average, 1/4 of the allocated space is
+ * unused (the definition of internal fragmentation).
+ */
+
+ public static final double INTERNAL_FRAGMENTATION_ESTIMATE = 1.0/4.0;
+
+ /**
+ * Given a buffer, this is the assumed amount of space
+ * available for data. (Adding more will double the buffer
+ * size half the time.)
+ */
+
+ public static final double PAYLOAD_MULTIPLIER = 1 - INTERNAL_FRAGMENTATION_ESTIMATE;
+
+ /**
+ * Given a data size, this is the multiplier to create the buffer
+ * size estimate. (Note: since we work with aggregate batches, we
+ * cannot simply round up to the next power of two: rounding is done
+ * on a vector-by-vector basis. Here we need to estimate the aggregate
+ * effect of rounding.)
+ */
+
+ public static final double BUFFER_MULTIPLIER = 1/PAYLOAD_MULTIPLIER;
+ public static final double WORST_CASE_MULTIPLIER = 2.0;
+
+ public static final int MIN_ROWS_PER_SORT_BATCH = 100;
+ public static final double LOW_MEMORY_MERGE_BATCH_RATIO = 0.25;
+
+ public static class BatchSizeEstimate {
+ int dataSize;
+ int expectedBufferSize;
+ int maxBufferSize;
+
+ public void setFromData(int dataSize) {
+ this.dataSize = dataSize;
+ expectedBufferSize = multiply(dataSize, BUFFER_MULTIPLIER);
+ maxBufferSize = multiply(dataSize, WORST_CASE_MULTIPLIER);
+ }
+
+ public void setFromBuffer(int bufferSize) {
+ expectedBufferSize = bufferSize;
+ dataSize = multiply(bufferSize, PAYLOAD_MULTIPLIER);
+ maxBufferSize = multiply(dataSize, WORST_CASE_MULTIPLIER);
+ }
+
+ public void setFromWorstCaseBuffer(int bufferSize) {
+ maxBufferSize = bufferSize;
+ dataSize = multiply(maxBufferSize, 1 / WORST_CASE_MULTIPLIER);
+ expectedBufferSize = multiply(dataSize, BUFFER_MULTIPLIER);
--- End diff ---
Right. To make sure the numbers are right this time...
* The buffer size given is already 2x the data size.
* The data size is half that.
* The expected size is 3/2 of the data size.
For example, assume we have the following buffer, 75% filled:
```
[xxxx] <-- actual
[xxxx] <-- best case buffer
[xxxx----] <-- worst case buffer
[xxxx--] <-- average (expected) case buffer
```
So, yes, the numbers are off. BUFFER_MULTIPLIER should be 3/2. That is, for
2 units of data we will, on average, need 3 units of memory.
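The corrected arithmetic can be checked with a small standalone sketch. Note this is not the actual Drill code: the `multiply` helper is a stand-in for the one the diff's `setFromWorstCaseBuffer` assumes, and the constants use the corrected BUFFER_MULTIPLIER of 3/2:

```java
// Sketch of the corrected setFromWorstCaseBuffer arithmetic:
// data size = worst-case buffer / 2; expected buffer = data size * 3/2.
public class MultiplierCheck {
    static final double WORST_CASE_MULTIPLIER = 2.0;
    static final double BUFFER_MULTIPLIER = 3.0 / 2.0; // corrected value

    // Stand-in for the multiply() helper the diff's code calls.
    static int multiply(int size, double factor) {
        return (int) Math.floor(size * factor);
    }

    public static void main(String[] args) {
        int worstCaseBuffer = 1024;
        // Data size is half the worst-case (2x) buffer.
        int dataSize = multiply(worstCaseBuffer, 1 / WORST_CASE_MULTIPLIER);
        // Expected buffer is 3/2 of the data size.
        int expectedBufferSize = multiply(dataSize, BUFFER_MULTIPLIER);
        System.out.println(dataSize);           // 512
        System.out.println(expectedBufferSize); // 768
    }
}
```

With a 1024-byte worst-case buffer, the data estimate comes out to 512 and the expected buffer to 768, i.e. 3 units of memory per 2 units of data.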
> Rollup of External Sort memory management fixes
> -----------------------------------------------
>
> Key: DRILL-5601
> URL: https://issues.apache.org/jira/browse/DRILL-5601
> Project: Apache Drill
> Issue Type: Task
> Affects Versions: 1.11.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Rollup of a set of specific JIRA entries that all relate to the very
> difficult problem of managing memory within Drill in order for the external
> sort to stay within a memory budget. In general, the fixes relate to better
> estimating memory used by the three ways that Drill allocates vector memory
> (see DRILL-5522) and to predicting the size of vectors that the sort will
> create, to avoid repeated realloc-copy cycles (see DRILL-5594).
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)