Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1101#discussion_r165516121

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java ---
    @@ -215,6 +206,7 @@ public BatchHolder() {
             MaterializedField outputField = materializedValueFields[i];
             // Create a type-specific ValueVector for this value
             vector = TypeHelper.getNewVector(outputField, allocator);
    +       int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
    --- End diff --

    Thanks for the info @paul-rogers. Since we want the RecordBatchSizer to be a one-stop shop for size-related information, I propose the following to address @ppadma's concerns:

    - Remove the changes to estSize in the RecordBatchSizer that I introduced.
    - Add a field called knownSize to the RecordBatchSizer. This field will contain the known column width for FixedWidthVectors and will be -1 for things like varchars and repeated columns. Then we can add a getKnownSize() method for a column size, which returns the known size if available and throws an exception if the column is not fixed-width. Callers should use this method only when they know they are dealing with fixed-width columns, as is the case with HashAgg for now (this will probably change when strings are stored in varchars instead of objects). The rest of the RecordBatchSizer will remain as is.
    - Add a constant called FRAGMENTATION_FACTOR (1.33) that people can multiply their custom-crafted batch sizes by when they estimate batch sizes from scratch. In HashAgg I would multiply the custom batch size I compute by this factor.

    Thoughts?
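    The proposal above could be sketched roughly as follows. This is a hypothetical illustration of the proposed API shape, not actual Drill code: the class name ColumnSizeSketch and the constructor taking a raw width are stand-ins for RecordBatchSizer.ColumnSize, and the knownSize / getKnownSize() / FRAGMENTATION_FACTOR names simply mirror the proposal.

    ```java
    // Hypothetical sketch of the proposed RecordBatchSizer additions.
    public class ColumnSizeSketch {
      // Proposed constant: multiply hand-crafted batch-size estimates by this
      // factor to account for allocator fragmentation.
      public static final double FRAGMENTATION_FACTOR = 1.33;

      // Known column width in bytes for fixed-width vectors; -1 for
      // variable-width columns (e.g. varchars) and repeated columns.
      public final int knownSize;

      // In the real proposal this would be derived from the ValueVector;
      // here we take the width directly for illustration.
      public ColumnSizeSketch(int knownSize) {
        this.knownSize = knownSize;
      }

      // Returns the known width, or throws if the column is not fixed-width.
      // Callers should invoke this only when they know the column is
      // fixed-width, as HashAgg does today.
      public int getKnownSize() {
        if (knownSize < 0) {
          throw new IllegalStateException(
              "Column is not fixed-width; known size is unavailable");
        }
        return knownSize;
      }

      public static void main(String[] args) {
        // A fixed-width column, e.g. a 4-byte INT.
        ColumnSizeSketch intCol = new ColumnSizeSketch(4);
        System.out.println(intCol.getKnownSize());

        // Estimating a batch size from scratch: rows * width, padded by the
        // fragmentation factor.
        long estBatchSize = (long) (1024 * intCol.getKnownSize() * FRAGMENTATION_FACTOR);
        System.out.println(estBatchSize);
      }
    }
    ```

    With this shape, HashAgg would call getKnownSize() on its fixed-width value columns and apply FRAGMENTATION_FACTOR to its computed batch size, while the existing estSize logic in RecordBatchSizer stays untouched.
    
    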
---