Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/1175#discussion_r175619764
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java
---
@@ -321,10 +321,8 @@ public ColumnSize(ValueVector v, String prefix) {
// Calculate pure data size.
if (isVariableWidth) {
- UInt4Vector offsetVector = ((RepeatedValueVector)
v).getOffsetVector();
- int innerValueCount =
offsetVector.getAccessor().get(valueCount);
VariableWidthVector dataVector = ((VariableWidthVector)
((RepeatedValueVector) v).getDataVector());
- totalDataSize =
dataVector.getOffsetVector().getAccessor().get(innerValueCount);
+ totalDataSize = dataVector.getCurrentSizeInBytes();
--- End diff --
Good improvement. The original code exposes far too much of the
implementation.
After all these changes, does the "dataSize" include both the offset vector
and bytes? It should, else calls will be wrong. There are supposed to be three
sizes:
* Payload size: actual data bytes.
* Data size: data + offsets + bits
* Overall size: full length of all vectors.
Payload size is what the user sees. Data size is how we calculate row width
(since the rows must contain the overhead bytes). Vector length, here, only
helps compute density, but is generated elsewhere. The point is, keep all three
in mind, but keep the code separate. Otherwise, it is *very* easy to get
confused and have the calculations blow up...
---