[
https://issues.apache.org/jira/browse/TEZ-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan reassigned TEZ-4207:
-------------------------------------
Fix Version/s: 0.10.1
Assignee: Rajesh Balamohan
Resolution: Fixed
> Provide approximate number of input records to be processed in
> UnorderedKVInput
> -------------------------------------------------------------------------------
>
> Key: TEZ-4207
> URL: https://issues.apache.org/jira/browse/TEZ-4207
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Major
> Fix For: 0.10.1
>
> Attachments: TEZ-4207.1.patch, TEZ-4207.wip.patch
>
>
> There are cases where broadcast data is loaded into a hashtable in upstream
> applications (e.g. Hive). Apps try to estimate the number of entries in the
> hashtable carefully, but there are cases where these estimates are very hard
> to compute at compile time.
>
> Tez can help in such cases by providing an "approximate number of input
> records" counter for the records to be processed in UnorderedKVInput. This
> avoids expensive rehashing when hashtable sizes are not estimated correctly.
> It would be good to start with the broadcast case first and then move on to
> the unordered partitioned case later.
>
> This would help predict the number of entries at runtime and give better
> size estimates for the hashtable.
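
To illustrate why a runtime record estimate helps, here is a minimal sketch of
presizing a hashtable from an approximate record count. It is not taken from
the attached patches: the class HashTableSizing, its methods, and the
approxRecords parameter are hypothetical names, and approxRecords only stands
in for whatever value the proposed counter would report. How that value is
read from Tez is not shown. The sketch uses java.util.HashMap, which resizes
once its size exceeds capacity * loadFactor.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Tez or Hive.
final class HashTableSizing {
    // Largest table capacity java.util.HashMap will use.
    private static final int MAX_CAPACITY = 1 << 30;

    private HashTableSizing() {}

    /**
     * Returns an initial capacity large enough for a java.util.HashMap to
     * hold approxRecords entries at the given load factor without resizing.
     */
    static int initialCapacityFor(long approxRecords, float loadFactor) {
        if (approxRecords < 0) {
            throw new IllegalArgumentException("approxRecords must be >= 0");
        }
        if (!(loadFactor > 0f) || loadFactor > 1f) {
            throw new IllegalArgumentException("loadFactor must be in (0, 1]");
        }
        // HashMap resizes when size > capacity * loadFactor, so the capacity
        // must be at least records / loadFactor.
        double needed = Math.ceil(approxRecords / (double) loadFactor);
        long capped = (long) Math.min(needed, (double) MAX_CAPACITY);
        return (int) Math.max(1L, capped);
    }

    /** Builds a map presized for roughly approxRecords entries. */
    static <K, V> Map<K, V> newPresizedMap(long approxRecords) {
        float loadFactor = 0.75f; // HashMap's default load factor
        return new HashMap<>(initialCapacityFor(approxRecords, loadFactor), loadFactor);
    }
}
```

If the estimate is too low, the map still works but pays for the rehash the
issue wants to avoid. If it is too high, memory is wasted, so an approximate
count is only as useful as it is close to the real number of records.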