[ 
https://issues.apache.org/jira/browse/HIVE-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165491#comment-17165491
 ] 

Rajesh Balamohan commented on HIVE-23936:
-----------------------------------------

E.g., in Hive, this is one place where the approximate input records counter could be used:
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java#L128]
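To illustrate the idea, here is a minimal sketch of how a consumer might pre-size a hashtable from an approximate record count so that loading does not trigger rehashing. It uses only java.util.HashMap. The class name, the helper methods, and the approxInputRecords value are hypothetical and stand in for whatever counter Tez would expose; this is not the Hive or Tez implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedHashTable {
    // Default load factor of java.util.HashMap.
    private static final float LOAD_FACTOR = 0.75f;

    // Smallest initial capacity that holds approxRecords entries without a
    // resize, clamped to [16, 2^30] (HashMap's default and maximum capacity).
    static int initialCapacityFor(long approxRecords) {
        long needed = (long) Math.ceil(approxRecords / (double) LOAD_FACTOR);
        return (int) Math.min(Math.max(needed, 16L), 1L << 30);
    }

    // Creates a map sized up front for the approximate number of records.
    static <K, V> Map<K, V> newTableFor(long approxRecords) {
        return new HashMap<>(initialCapacityFor(approxRecords), LOAD_FACTOR);
    }

    public static void main(String[] args) {
        // Hypothetical value: in the proposal this would come from the
        // approximate input records counter, not from a constant.
        long approxInputRecords = 1_000_000L;

        Map<Long, String> table = newTableFor(approxInputRecords);
        for (long i = 0; i < approxInputRecords; i++) {
            table.put(i, "v"); // no resize as long as the estimate holds
        }
        System.out.println(table.size());
    }
}
```

If the estimate is too low, the map still grows as usual; the estimate only removes the rehash cost when it is close to the real count.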

> Provide approximate number of input records to be processed in broadcast 
> reader
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-23936
>                 URL: https://issues.apache.org/jira/browse/HIVE-23936
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>
> There are cases where broadcast data is loaded into a hashtable in upstream 
> applications (e.g., Hive). Apps try to predict the number of entries in the 
> hashtable diligently, but there are cases where these estimates are very 
> hard to compute at compile time.
>  
> Tez can help in such cases by providing an "approximate number of input 
> records" counter for the data to be processed in UnorderedKVInput. This avoids 
> expensive rehashing when hashtable sizes are not estimated correctly. It would 
> be good to start with the broadcast case and then move on to the unordered 
> partitioned case later.
>  
> This would help in predicting the number of entries at runtime and give 
> better estimates for the hashtable size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
