[ 
https://issues.apache.org/jira/browse/DRILL-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067303#comment-16067303
 ] 

Paul Rogers commented on DRILL-5416:
------------------------------------

A major portion of the changes in DRILL-5601 addresses this issue by revamping 
the "record batch sizer" to consider actual allocated memory, not vector sizes.

> Vectors read from disk report incorrect memory sizes
> ----------------------------------------------------
>
>                 Key: DRILL-5416
>                 URL: https://issues.apache.org/jira/browse/DRILL-5416
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> The external sort and revised hash agg operators spill to disk using a vector 
> serialization mechanism. This mechanism serializes each vector as a (length, 
> bytes) pair.
> Before spilling, if we check the memory used for a vector (using the new 
> {{RecordBatchSizer}} class), we learn of the actual memory consumed by the 
> vector, including any unused space in the vector.
> If we spill the vector, then reread it, the reported storage size is wrong.
> On reading, the code allocates a buffer, based on the saved length, rounded 
> up to the next power of two. Then, when building the vector, we "slice" the 
> read buffer, setting the memory size to the data size.
> For example, suppose we save 20 1-byte fields. The size on disk is 20 bytes. 
> The read buffer is rounded up to 32 bytes (the size of the original, 
> pre-spill buffer). We read the 20 bytes and create a vector. The new vector 
> reports its memory size as 20, "hiding" the extra, unused 12 bytes.
> As a result, when computing memory sizes, we receive incorrect numbers. 
> Working with false numbers means that the code cannot safely operate within a 
> memory budget, causing the user to receive an unexpected OOM error.
> As it turns out, the code path that does the slicing is used only for reads 
> from disk. This ticket asks to remove the slicing step: just use the 
> allocated buffer directly so that the after-read vector reports the correct 
> memory usage, the same as the before-spill vector.
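
The 20-byte example in the description can be reproduced with a small, 
self-contained sketch. This is not Drill code: it uses java.nio.ByteBuffer as 
a stand-in for Drill's buffers, and the class and method names 
(SpillSizeSketch, spill, readBack, sizesAfterRoundTrip) are hypothetical, 
chosen only for illustration. It assumes the (length, bytes) layout and the 
power-of-two rounding described above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Illustrative stand-in only; none of these names exist in Drill.
public class SpillSizeSketch {

    /** Rounds a length up to the next power of two (20 becomes 32). */
    static int roundUpToPowerOfTwo(int length) {
        if (length <= 1) {
            return 1;
        }
        return Integer.highestOneBit(length - 1) << 1;
    }

    /** Serializes the data as a (length, bytes) pair. */
    static byte[] spill(byte[] data) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(data.length);
            out.write(data, 0, data.length);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the pair back into a buffer allocated at the rounded-up size. */
    static ByteBuffer readBack(byte[] spilled) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(spilled));
            int length = in.readInt();
            ByteBuffer buffer = ByteBuffer.allocate(roundUpToPowerOfTwo(length));
            in.readFully(buffer.array(), 0, length);
            buffer.limit(length); // only the first `length` bytes hold data
            return buffer;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns {size reported by a slice, size actually allocated}. */
    static int[] sizesAfterRoundTrip(int dataLength) {
        ByteBuffer buffer = readBack(spill(new byte[dataLength]));
        int sliced = buffer.slice().capacity(); // the slice sees only the data bytes
        int allocated = buffer.capacity();      // the allocation is the rounded size
        return new int[] {sliced, allocated};
    }

    public static void main(String[] args) {
        int[] sizes = sizesAfterRoundTrip(20);
        System.out.println("sliced=" + sizes[0] + " allocated=" + sizes[1]
            + " hidden=" + (sizes[1] - sizes[0]));
    }
}
```

Running main prints "sliced=20 allocated=32 hidden=12". Memory accounting 
based on the slice undercounts by the 12 unused bytes, while accounting based 
on the allocated buffer matches what the allocator actually handed out.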



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
