[
https://issues.apache.org/jira/browse/HIVE-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eugene Koifman resolved HIVE-11672.
-----------------------------------
Resolution: Duplicate
> Hive Streaming API handles bucketing incorrectly
> ------------------------------------------------
>
> Key: HIVE-11672
> URL: https://issues.apache.org/jira/browse/HIVE-11672
> Project: Hive
> Issue Type: Bug
> Components: HCatalog, Hive, Transactions
> Affects Versions: 1.2.1
> Reporter: Raj Bains
> Assignee: Roshan Naik
> Priority: Critical
>
> Hive Streaming API allows the clients to get a random bucket and then insert
> data into it. However, this leads to incorrect bucketing as Hive expects data
> to be distributed into buckets based on a hash function applied to bucket
> key. The data is inserted randomly by the clients right now. They have no way
> of
> # Knowing what bucket a row (tuple) belongs to
> # Asking for a specific bucket
> There are optimization such as Sort Merge Join and Bucket Map Join that rely
> on the data being correctly distributed across buckets and these will cause
> incorrect read results if the data is not distributed correctly.
> There are two obvious design choices
> # Hive Streaming API should fix this internally by distributing the data
> correctly
> # Hive Streaming API should expose data distribution scheme to the clients
> and allow them to distribute the data correctly
> The first option will mean every client thread will write to many buckets,
> causing many small files in each bucket and too many connections open. this
> does not seem feasible. The second option pushes more functionality into the
> client of the Hive Streaming API, but can maintain high throughput and write
> good sized ORC files. This option seems preferable.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)