Boaz Ben-Zvi created DRILL-5446:
-----------------------------------
Summary: Offset Vector in VariableLengthVectors may waste up to
256KB per value vector
Key: DRILL-5446
URL: https://issues.apache.org/jira/browse/DRILL-5446
Project: Apache Drill
Issue Type: Bug
Components: Execution - Relational Operators
Affects Versions: 1.10.0
Reporter: Boaz Ben-Zvi
Assignee: Boaz Ben-Zvi
Fix For: 1.11.0
In exec/vector/src/main/codegen/templates/VariableLengthVectors.java, the
implementation uses an "offset vector" to record the BEGINNING of each
variable-length element. To find the length of an element (i.e. its END), the
code has to look at the offset of the FOLLOWING element.
This requires the "offset vector" to have ONE MORE entry than the total
number of elements -- in order to find the END of the LAST element.
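The layout described above can be illustrated with a minimal sketch. This is
not Drill's generated code; the class and method names (StartOffsetSketch,
buildOffsets, lengthOf) are hypothetical, and a plain int[] stands in for the
UInt4 offset vector.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of a "start offset" layout; not Drill's actual code.
final class StartOffsetSketch {

    // offsets[i] is the start of value i; offsets[n] is the end of the last value,
    // so n values need n + 1 offset entries.
    static int[] buildOffsets(String[] values) {
        int[] offsets = new int[values.length + 1];
        for (int i = 0; i < values.length; i++) {
            int byteLength = values[i].getBytes(StandardCharsets.UTF_8).length;
            offsets[i + 1] = offsets[i] + byteLength;
        }
        return offsets;
    }

    // The length of value i is found by looking at the FOLLOWING entry.
    static int lengthOf(int[] offsets, int index) {
        return offsets[index + 1] - offsets[index];
    }

    public static void main(String[] args) {
        int[] offsets = buildOffsets(new String[] {"drill", "", "vector"});
        System.out.println(Arrays.toString(offsets)); // [0, 5, 5, 11]
        System.out.println(lengthOf(offsets, 2));     // 6
    }
}
```

Three values produce four offset entries; the last entry exists only to mark
the end of the last value.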
Some places in the code (e.g., the hash table) use the maximum number of
elements, 64K ( = 65536 ). Each entry in the "offset vector" is a 4-byte
UInt4, so the vector would appear to need 256KB.
However, because of that "ONE MORE" entry, the code in this case allocates for
65537 entries (262148 bytes), which rounds up to the next power of 2, i.e.
512KB, of which nearly half is never used.
(This happens for every varchar value vector in every batch; e.g., in the QA
test Functional/aggregates/tpcds_variants/text/aggregate25.q, where there are 10
key columns, each hash-table batch wastes about 2.5MB.)
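The arithmetic behind these numbers can be checked with a short sketch. It
assumes only what is stated above: 4-byte offset entries, one extra entry, and
rounding of the allocation up to the next power of 2. The class and method
names are hypothetical and do not correspond to Drill's allocator API.

```java
// Hypothetical arithmetic check; not Drill's allocator.
final class OffsetAllocationSketch {

    static final int OFFSET_WIDTH_BYTES = 4; // each offset entry is a 4-byte UInt4

    // Smallest power of 2 that is >= n (for n >= 1).
    static long nextPowerOfTwo(long n) {
        if (n <= 1) {
            return 1;
        }
        return Long.highestOneBit(n - 1) << 1;
    }

    // Bytes actually needed: one entry per value, plus ONE MORE.
    static long neededBytes(int valueCount) {
        return (long) (valueCount + 1) * OFFSET_WIDTH_BYTES;
    }

    // Bytes allocated after rounding up to a power of 2.
    static long allocatedBytes(int valueCount) {
        return nextPowerOfTwo(neededBytes(valueCount));
    }

    static long unusedBytes(int valueCount) {
        return allocatedBytes(valueCount) - neededBytes(valueCount);
    }

    public static void main(String[] args) {
        int valueCount = 65536;
        System.out.println(neededBytes(valueCount));      // 262148
        System.out.println(allocatedBytes(valueCount));   // 524288 (512KB)
        System.out.println(unusedBytes(valueCount));      // 262140 (just under 256KB)
        System.out.println(10 * unusedBytes(valueCount)); // 2621400 (about 2.5MB for 10 columns)
        System.out.println(unusedBytes(65535));           // 0 (65536 entries fit 256KB exactly)
    }
}
```

With one value fewer (65535), the 65536 offset entries fit exactly in 256KB
and nothing is lost to rounding.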
Possible fix: change the logic in VariableLengthVectors.java to keep the END
point of each variable-length element instead. The first element's beginning is
always ZERO, so it need not be stored.
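A minimal sketch of the proposed "end offset" layout follows. It only
illustrates the idea in this issue, not an actual patch; the names
(EndOffsetSketch, buildEndOffsets, startOf, lengthOf) are hypothetical, and a
plain int[] again stands in for the UInt4 vector.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of an "end offset" layout; not an actual Drill patch.
final class EndOffsetSketch {

    // ends[i] is the END of value i, so n values need exactly n entries.
    static int[] buildEndOffsets(String[] values) {
        int[] ends = new int[values.length];
        int position = 0;
        for (int i = 0; i < values.length; i++) {
            position += values[i].getBytes(StandardCharsets.UTF_8).length;
            ends[i] = position;
        }
        return ends;
    }

    // The first value always begins at ZERO; every other value begins
    // where the previous one ended.
    static int startOf(int[] ends, int index) {
        return index == 0 ? 0 : ends[index - 1];
    }

    static int lengthOf(int[] ends, int index) {
        return ends[index] - startOf(ends, index);
    }

    public static void main(String[] args) {
        int[] ends = buildEndOffsets(new String[] {"drill", "", "vector"});
        System.out.println(Arrays.toString(ends)); // [5, 5, 11]
        System.out.println(lengthOf(ends, 0));     // 5
        System.out.println(lengthOf(ends, 2));     // 6
    }
}
```

With this layout, 65536 values need 65536 entries of 4 bytes each, i.e. exactly
256KB, so rounding to a power of 2 adds nothing. The trade-off is the extra
index check when reading the start of an element.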