[
https://issues.apache.org/jira/browse/HIVE-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165491#comment-17165491
]
Rajesh Balamohan commented on HIVE-23936:
-----------------------------------------
For example, Hive could use an approximate input records counter here:
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java#L128]
> Provide approximate number of input records to be processed in broadcast
> reader
> -------------------------------------------------------------------------------
>
> Key: HIVE-23936
> URL: https://issues.apache.org/jira/browse/HIVE-23936
> Project: Hive
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Priority: Major
>
> In some cases, upstream applications (e.g. Hive) load broadcast data into a
> hash table. Applications try to predict the number of entries in the hash
> table diligently, but these estimates can be very hard to get right at
> compile time.
>
> Tez can help in such cases by providing an "approximate number of input
> records" counter for the records to be processed in UnorderedKVInput. This
> avoids expensive rehashing when hash table sizes are not estimated correctly.
> It would be good to start with the broadcast case and move on to the
> unordered partitioned case later.
>
> This would help predict the number of entries at runtime and yield better
> estimates for hash table sizing.
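
As a rough illustration of the idea above, the sketch below sizes a hash table
from a runtime record-count hint so that it does not need to grow while it is
being loaded. It is not Hive or Tez code: the class HashTableSizing, its
methods, and the approxInputRecords parameter are hypothetical names, and
preferring the runtime hint over the compile-time estimate whenever the hint is
positive is an assumption.

```java
final class HashTableSizing {
    private HashTableSizing() {}

    /** Smallest power of two that is >= n, clamped to the range [1, 2^30]. */
    static int nextPowerOfTwo(long n) {
        if (n <= 1) {
            return 1;
        }
        if (n >= (1L << 30)) {
            return 1 << 30;
        }
        return Integer.highestOneBit((int) (n - 1)) << 1;
    }

    /**
     * Picks an initial capacity large enough to hold the expected entries
     * without exceeding the load factor, so no rehash is needed during load.
     *
     * @param compileTimeEstimate entry count predicted at compile time
     * @param approxInputRecords  hypothetical runtime hint; <= 0 means "not available"
     * @param loadFactor          maximum fill ratio before the table would grow
     */
    static int initialCapacity(long compileTimeEstimate,
                               long approxInputRecords,
                               float loadFactor) {
        // Assumption: trust the runtime hint when present, else fall back.
        long expected = approxInputRecords > 0 ? approxInputRecords : compileTimeEstimate;
        long needed = (long) Math.ceil(expected / (double) loadFactor);
        return nextPowerOfTwo(needed);
    }

    public static void main(String[] args) {
        System.out.println(initialCapacity(100, 1000, 0.75f));
        System.out.println(initialCapacity(100, 0, 0.75f));
    }
}
```

With a compile-time estimate of 100 entries, a runtime hint of 1000 records,
and a load factor of 0.75, this picks a capacity of 2048. Falling back to the
compile-time estimate alone gives 256, which would force several rounds of
rehashing if 1000 records actually arrive.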
--
This message was sent by Atlassian Jira
(v8.3.4#803005)