Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1101#discussion_r165263759
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java ---
    @@ -215,6 +206,7 @@ public BatchHolder() {
               MaterializedField outputField = materializedValueFields[i];
               // Create a type-specific ValueVector for this value
               vector = TypeHelper.getNewVector(outputField, allocator);
    +          int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
    --- End diff --
    
    I can think of three reasons to use the sizer:
    
    * Type logic is complex: we have multiple sets of rules depending on the 
data type. Best to encapsulate the logic in a single place. So, either 1) use 
the "sizer", or 2) move the logic from the "sizer" to a common utility.
    * Column size is tricky because it depends on `DataMode`. The size of a `Required INT` is 4 bytes. The (total memory) size of an `Optional INT` is 5. For a `Repeated INT`? You need to know the average array cardinality, which the "sizer" provides (by analyzing an input batch). See the sketch after this list.
    * As discussed, variable-width columns (`VARCHAR`, `VARBINARY` for HBase) 
have no known size. We really have to completely forget about that awful "50" 
estimate. We can only estimate size from input, which is, again, what the 
"sizer" does.
    
    Of course, all of the above only works if you actually sample the input.
    
    A current limitation (and a good enhancement) is that the "sizer" is aware of just one batch. The sort (the first user of the "sizer") needed only the aggregate row size, so it just kept track of the widest row ever seen. If you need detailed column information, you may want another layer: one that aggregates information across batches. (For arrays and variable-width columns, you can take the weighted average or the maximum, depending on your needs.)
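
    A rough sketch of what such a layer might look like (hypothetical class and names, fed with whatever per-batch numbers the "sizer" reports, e.g. `ColumnSize.estSize` plus the batch row count):

    ```java
    // Hypothetical cross-batch accumulator; nothing like this exists in the sizer today.
    class ColumnSizeAccumulator {
      private long totalRows;
      private long totalBytes;   // per-row estimate * row count, summed over batches
      private int maxEstSize;

      void update(int estSizePerRow, int rowCount) {
        totalRows += rowCount;
        totalBytes += (long) estSizePerRow * rowCount;
        maxEstSize = Math.max(maxEstSize, estSizePerRow);
      }

      // Row-weighted average across all batches seen so far
      int weightedAvgSize() {
        return totalRows == 0 ? 0 : (int) Math.ceil((double) totalBytes / totalRows);
      }

      // Widest per-row estimate ever seen, for worst-case planning
      int maxSize() { return maxEstSize; }
    }
    ```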
    
    Remember, if the purpose of this number is to estimate memory use, then you 
have to add a 33% (average) allowance for internal fragmentation. (Each vector 
is, on average, 75% full.)
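
    In code form (using that 75%-full figure), the adjustment is just a tiny helper like:

    ```java
    // If vectors are on average 75% full, allocated memory is roughly
    // data size / 0.75, i.e. about a 33% allowance on top of the data estimate.
    static long estAllocatedBytes(long estDataBytes) {
      return (long) Math.ceil(estDataBytes / 0.75);
    }
    ```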

