Paul Rogers created DRILL-5416:
----------------------------------

             Summary: Vectors read from disk report incorrect memory sizes
                 Key: DRILL-5416
                 URL: https://issues.apache.org/jira/browse/DRILL-5416
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.8.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers
            Priority: Minor
             Fix For: 1.11.0


The external sort and revised hash agg operators spill to disk using a vector 
serialization mechanism. This mechanism serializes each vector as a (length, 
bytes) pair.
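
A minimal sketch of such a (length, bytes) format — illustrative only, not Drill's actual serializer; the class and method names are invented for this example:

```java
import java.nio.ByteBuffer;

// Illustrative (length, bytes) pair format for spilling a vector's data.
class VectorSerde {
    // Serialize: a 4-byte length prefix followed by the vector's bytes.
    static byte[] write(byte[] data) {
        ByteBuffer buf = ByteBuffer.allocate(4 + data.length);
        buf.putInt(data.length);
        buf.put(data);
        return buf.array();
    }

    // Deserialize: read the length prefix, then exactly that many bytes.
    static byte[] read(byte[] serialized) {
        ByteBuffer buf = ByteBuffer.wrap(serialized);
        int len = buf.getInt();
        byte[] data = new byte[len];
        buf.get(data);
        return data;
    }
}
```

Note that only the data length is written; the original buffer's allocated capacity is lost, which is what forces the reader to choose an allocation size on its own.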

Before spilling, if we check the memory used for a vector (using the new 
{{RecordBatchSizer}} class), we learn of the actual memory consumed by the 
vector, including any unused space in the vector.

If we spill the vector, then reread it, the reported storage size is wrong.

On reading, the code allocates a buffer sized to the saved length rounded up 
to the next power of two. Then, when building the vector, we "slice" the read 
buffer, setting the reported memory size to the data size rather than the 
allocated size.

For example, suppose we save 20 1-byte fields. The size on disk is 20. The read 
buffer is rounded to 32 bytes (the size of the original, pre-spill buffer). We 
read the 20 bytes and create a vector. Creating the vector reports the memory 
size as 20, "hiding" the extra, unused 12 bytes.
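
The arithmetic above can be sketched as follows. This is an illustration of the accounting gap, not Drill's allocator code; `nextPowerOfTwo` and `hiddenBytes` are hypothetical helpers:

```java
// Sketch of the accounting gap: the allocator rounds the read buffer up to
// the next power of two, but the slice reports only the data length.
class SpillSizing {
    // Smallest power of two >= n (for n > 0).
    static int nextPowerOfTwo(int n) {
        int high = Integer.highestOneBit(n);
        return (high == n) ? high : high << 1;
    }

    // Allocated-but-unreported bytes after the slice.
    static int hiddenBytes(int dataSize) {
        return nextPowerOfTwo(dataSize) - dataSize;
    }
}
```

For a 20-byte read, the allocation is `nextPowerOfTwo(20)` = 32, the slice reports 20, and `hiddenBytes(20)` = 12 bytes go unaccounted.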

As a result, memory-size computations receive incorrect numbers. Working from 
false numbers, the code cannot safely operate within a memory budget, and the 
user may receive an unexpected OOM error.

As it turns out, the code path that does the slicing is used only for reads 
from disk. This ticket asks to remove the slicing step: just use the allocated 
buffer directly so that the after-read vector reports the same (correct) 
memory usage as the before-spill vector.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
