Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/1228#discussion_r184244479
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java
---
@@ -277,18 +286,29 @@ public boolean isRepeatedList() {
/**
* This is the average per entry width, used for vector allocation.
*/
- public int getEntryWidth() {
+ private int getEntryWidthForAlloc() {
int width = 0;
if (isVariableWidth) {
- width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH;
+ width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH;
// Subtract out the bits (is-set) vector width
- if (metadata.getDataMode() == DataMode.OPTIONAL) {
+ if (isOptional) {
width -= BIT_VECTOR_WIDTH;
}
+
+ if (isRepeated && getValueCount() == 0) {
+ return (safeDivide(width, STD_REPETITION_FACTOR));
+ }
}
- return (safeDivide(width, cardinality));
+ return (safeDivide(width, getEntryCardinalityForAlloc()));
+ }
+
+ /**
+ * This is the average per entry cardinality, used for vector allocation.
+ */
+ private float getEntryCardinalityForAlloc() {
+ return getCardinality() == 0 ? (isRepeated ? STD_REPETITION_FACTOR : 1) :getCardinality();
--- End diff --
Makes sense, but why would a batch be empty unless that path hit EOF? The
only other cause I can think of is an empty input file, and in that case we'd
just skip the batch and move on until we find one with data. Any reason the
"get next batch" code can't just loop to become "get next non-empty batch"?
Otherwise, we can't really do any effective batch sizing, as we have no data
to go on...
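
To make the suggestion concrete, here is a minimal sketch of what a "get next non-empty batch" loop might look like. It is not Drill code: `Batch`, `rowCount` and `nextNonEmptyBatch` are hypothetical names made up for illustration, and the upstream is modeled as a plain `Iterator` rather than a real operator. A real operator would presumably also need to release the buffers of any batch it skips.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class NonEmptyBatchSketch {

  /** Hypothetical stand-in for a record batch; only the row count matters here. */
  static final class Batch {
    final int rowCount;

    Batch(int rowCount) {
      this.rowCount = rowCount;
    }
  }

  /**
   * Returns the next batch that has at least one row, or null once the
   * upstream is exhausted (EOF). Empty batches are skipped because they
   * give the sizer no data to work from.
   */
  static Batch nextNonEmptyBatch(Iterator<Batch> upstream) {
    while (upstream.hasNext()) {
      Batch batch = upstream.next();
      if (batch.rowCount > 0) {
        return batch;
      }
      // Empty batch (for example, from an empty input file): skip it.
    }
    return null; // EOF: no batch with data remains.
  }

  public static void main(String[] args) {
    List<Batch> input = Arrays.asList(new Batch(0), new Batch(0), new Batch(42));
    Iterator<Batch> upstream = input.iterator();

    Batch first = nextNonEmptyBatch(upstream);
    System.out.println(first == null ? "EOF" : "rows=" + first.rowCount);

    Batch second = nextNonEmptyBatch(upstream);
    System.out.println(second == null ? "EOF" : "rows=" + second.rowCount);
  }
}
```

With a loop like this, the sizer would only ever see batches that carry data, so the zero-cardinality fallback would matter only if the whole input turned out to be empty.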
---