[ 
https://issues.apache.org/jira/browse/HIVE-24234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209327#comment-17209327
 ] 

Rajesh Balamohan commented on HIVE-24234:
-----------------------------------------

Thanks [~mustafaiman]. 

>> (outputRecords) / (inputRecords * 1.0f) can be larger than 1 when grouping 
>> sets are present. 

No, it is other way around. {{sumBatchSize} already includes the computation 
needed for grouping sets. So in worst possible case, the max ratio would be 
"1.0". Since "1.0 > 1.0" would be false, the config still holds good. (i.e 
setting 1.0 would never move to streaming mode.)

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java#L206
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java#L494


Basic idea is to ensure that, hashing with groupingsets has to be super 
effective (otherwise we end up paying the penalty of JVM mem pressure). 
Otherwise, it needs to bail out quickly and move to streaming mode. 

> Improve checkHashModeEfficiency in VectorGroupByOperator
> --------------------------------------------------------
>
>                 Key: HIVE-24234
>                 URL: https://issues.apache.org/jira/browse/HIVE-24234
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-24234.wip.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, {{VectorGroupByOperator::checkHashModeEfficiency}} compares the 
> number of entries with the number input records that have been processed. For 
> grouping sets, it accounts for grouping set length as well.
> Issue is that, the condition becomes invalid after processing large number of 
> input records. This prevents the system from switching over to streaming 
> mode. 
> e.g Assume 500,000 input records processed, with 9 grouping sets, with 
> 100,000 entries in hashtable. Hashtable would never cross 4,500,0000 entries 
> as the max size itself is 1M by default. 
> It would be good to compare the input records (adjusted for grouping sets) 
> with number of output records (along with size of hashtable size) to 
> determine hashing or streaming mode.
> E.g Q67.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to