[ 
https://issues.apache.org/jira/browse/TEZ-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176856#comment-17176856
 ] 

Rajesh Balamohan commented on TEZ-4207:
---------------------------------------

Thanks for the review [~ashutoshc]. Committed to master. 

> Provide approximate number of input records to be processed in 
> UnorderedKVInput
> -------------------------------------------------------------------------------
>
>                 Key: TEZ-4207
>                 URL: https://issues.apache.org/jira/browse/TEZ-4207
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>         Attachments: TEZ-4207.1.patch, TEZ-4207.wip.patch
>
>
> There are cases when broadcasted data is loaded into hashtable in upstream 
> applications (e.g Hive). Apps tends to predict the number of entries in the 
> hashtable diligently, but there are cases where these estimates can be very 
> complicated at compile time.
>  
> Tez can help in such cases, by providing "approximate number of input records 
> counter", to be processed in UnorderedKVInput. This is to avoid expensive 
> rehash when hashtable sizes are not estimated correctly. It would be good to 
> start with broadcast first and then to move on to unordered partitioned case 
> later.
>  
> This would help in predicting the number of entries at runtime & can get 
> better estimates for hashtable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to