[
https://issues.apache.org/jira/browse/DRILL-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957987#comment-15957987
]
Paul Rogers edited comment on DRILL-5416 at 4/5/17 11:17 PM:
-------------------------------------------------------------
The original design for serialization is that each vector serializes to a
single buffer. This is simple for single-buffer vectors (a required int, say).
For composite vectors (nullable int, Varchar), serialization combines all of
the vector's buffers into a single write buffer, and on read the corresponding
read buffer is sliced back into the vector's individual component buffers.
For a Varchar:
{code}
Data:           [FredBarneyWilma_]
Offsets:        [0 4 10 15]
Output buffer:  [0 4 10 15 FredBarneyWilma_]
Input buffer:   [0 4 10 15 FredBarneyWilma_]
New Offsets:     ^^^^^^^^^
New Data:                  ^^^^^^^^^^^^^^^
{code}
Notice that, in the original, the empty space (denoted with "_") is allocated
per vector. After serialization, the free space sits in a buffer shared by the
two vectors and is not "owned" by (or visible to) either.
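To make the combine-on-write, slice-on-read flow concrete, here is a minimal,
self-contained sketch in plain Java. It uses {{java.nio.ByteBuffer}} rather
than Drill's actual {{DrillBuf}} and serializer classes, and the class and
method names ({{VarcharRoundTrip}}, {{serialize}}, {{deserialize}}) are made up
for illustration only:
{code}
import java.nio.ByteBuffer;

// Simplified illustration only -- not Drill's DrillBuf or VectorSerializer code.
public class VarcharRoundTrip {

  // Write path: concatenate the offsets buffer and the data buffer into a
  // single output buffer, as in the "Output buffer" line above.
  static ByteBuffer serialize(ByteBuffer offsets, ByteBuffer data) {
    ByteBuffer out = ByteBuffer.allocate(offsets.remaining() + data.remaining());
    out.put(offsets.duplicate());
    out.put(data.duplicate());
    out.flip();
    return out;
  }

  // Read path: slice the combined input buffer back into offsets and data
  // views. Both slices share the same backing buffer.
  static ByteBuffer[] deserialize(ByteBuffer in, int offsetsLen, int dataLen) {
    ByteBuffer offsets = in.duplicate();
    offsets.position(0);
    offsets.limit(offsetsLen);
    ByteBuffer data = in.duplicate();
    data.position(offsetsLen);
    data.limit(offsetsLen + dataLen);
    return new ByteBuffer[] { offsets.slice(), data.slice() };
  }
}
{code}
The point of the sketch is the read path: both slices are views over the same
backing buffer, so any free space at the end of that buffer is counted by
neither slice.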
> Vectors read from disk report incorrect memory sizes
> ----------------------------------------------------
>
> Key: DRILL-5416
> URL: https://issues.apache.org/jira/browse/DRILL-5416
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.8.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Minor
> Fix For: 1.11.0
>
>
> The external sort and revised hash agg operators spill to disk using a vector
> serialization mechanism. This mechanism serializes each vector as a (length,
> bytes) pair.
> Before spilling, if we check the memory used by a vector (using the new
> {{RecordBatchSizer}} class), we see the actual memory consumed by the vector,
> including any unused space within it.
> If we spill the vector, then reread it, the reported storage size is wrong.
> On reading, the code allocates a buffer, based on the saved length, rounded
> up to the next power of two. Then, when building the vector, we "slice" the
> read buffer, setting the memory size to the data size.
> For example, suppose we save 20 one-byte fields. The size on disk is 20
> bytes. The read buffer is rounded up to 32 bytes (the size of the original,
> pre-spill buffer). We read the 20 bytes and create a vector. Creating the
> vector reports the memory size as 20, "hiding" the extra, unused 12 bytes.
> As a result, when computing memory sizes, we receive incorrect numbers.
> Working with false numbers means that the code cannot safely operate within a
> memory budget, causing the user to receive an unexpected OOM error.
> As it turns out, the code path that does the slicing is used only for reads
> from disk. This ticket asks to remove the slicing step: just use the
> allocated buffer directly so that the after-read vector reports the correct
> memory usage; same as the before-spill vector.
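To make the arithmetic in the example above concrete, here is a small,
self-contained sketch. The class and method names ({{SpillSizeExample}},
{{roundUpToPowerOfTwo}}) are hypothetical stand-ins for illustration, not
Drill's allocator or {{ValueVector}} APIs:
{code}
// Illustrative arithmetic only; names here are hypothetical, not Drill APIs.
public class SpillSizeExample {

  // Allocators typically round a request up to the next power of two.
  static int roundUpToPowerOfTwo(int size) {
    int highBit = Integer.highestOneBit(size);
    return (highBit == size) ? size : highBit << 1;
  }

  public static void main(String[] args) {
    int savedBytes = 20;                              // bytes written to disk
    int allocated = roundUpToPowerOfTwo(savedBytes);  // 32-byte read buffer

    // Slicing the read buffer to the data length makes the vector report 20,
    // hiding the 12 unused bytes the allocator actually handed out.
    int reportedAfterSlice = savedBytes;              // 20
    int hiddenBytes = allocated - reportedAfterSlice; // 12

    System.out.printf("allocated=%d reported=%d hidden=%d%n",
        allocated, reportedAfterSlice, hiddenBytes);
  }
}
{code}
Dropping the slicing step, as this ticket proposes, would make the after-read
vector report the allocated size (32) rather than the data size (20), matching
what the before-spill vector reports.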
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)