Raj Bains created HIVE-11672:
--------------------------------

             Summary: Hive Streaming API handles bucketing incorrectly
                 Key: HIVE-11672
                 URL: https://issues.apache.org/jira/browse/HIVE-11672
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.2.1
            Reporter: Raj Bains
            Assignee: Roshan Naik
            Priority: Critical
             Fix For: 1.2.2


The Hive Streaming API allows clients to get a random bucket and then insert 
data into it. However, this leads to incorrect bucketing, because Hive expects 
data to be distributed into buckets based on a hash function applied to the 
bucket key. Right now the clients insert the data randomly. They have no way of:
# Knowing what bucket a row (tuple) belongs to
# Asking for a specific bucket
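
To make the expected distribution concrete, here is a minimal sketch of a 
hash-based bucket assignment. It is an illustration only: the class 
BucketSketch and the method bucketFor are hypothetical names, Integer.hashCode 
stands in for whatever hash Hive applies to the bucket key, and the 
sign-bit mask followed by a modulo is an assumed convention rather than a 
confirmed description of Hive's internals.

```java
final class BucketSketch {
    private BucketSketch() {}

    /**
     * Maps a hash of the bucket key to a bucket id in [0, numBuckets).
     * Hypothetical helper: the hash itself is supplied by the caller.
     */
    static int bucketFor(int keyHash, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        // Clear the sign bit so negative hashes still map to a valid bucket.
        return (keyHash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        // Integer.hashCode is a stand-in for the real bucket-key hash.
        for (int key : new int[] {7, 8, -1}) {
            int bucket = bucketFor(Integer.hashCode(key), numBuckets);
            System.out.println("key " + key + " -> bucket " + bucket);
        }
    }
}
```

The point of the sketch is that the bucket is a pure function of the key, so 
a row written to a randomly chosen bucket will usually land in the wrong one.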

There are optimizations such as Sort Merge Join and Bucket Map Join that rely 
on the data being correctly distributed across buckets, and these will return 
incorrect results if the data is not distributed correctly.

There are two obvious design choices:
# Hive Streaming API should fix this internally by distributing the data 
correctly
# Hive Streaming API should expose data distribution scheme to the clients and 
allow them to distribute the data correctly

The first option means every client thread will write to many buckets, 
causing many small files in each bucket and too many open connections. This 
does not seem feasible. The second option pushes more functionality into the 
client of the Hive Streaming API, but can maintain high throughput and write 
good-sized ORC files. This option seems preferable.
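
A rough sketch of what the second option could look like on the client side 
follows. It assumes the API exposes enough of the distribution scheme for the 
client to compute a bucket id per row. All names here (BucketRouterSketch, 
bucketFor, groupByBucket) are hypothetical, Integer.hashCode is a stand-in for 
the real bucket-key hash, and no actual Hive Streaming API calls are shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BucketRouterSketch {
    private BucketRouterSketch() {}

    /** Assumed scheme: mask the sign bit, then take the hash modulo the bucket count. */
    static int bucketFor(int keyHash, int numBuckets) {
        return (keyHash & Integer.MAX_VALUE) % numBuckets;
    }

    /**
     * Groups bucket keys by target bucket so that each batch can be handed
     * to a single writer that owns exactly one bucket.
     */
    static Map<Integer, List<Integer>> groupByBucket(List<Integer> keys, int numBuckets) {
        Map<Integer, List<Integer>> batches = new TreeMap<>();
        for (int key : keys) {
            int bucket = bucketFor(Integer.hashCode(key), numBuckets);
            batches.computeIfAbsent(bucket, b -> new ArrayList<>()).add(key);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Each entry is one batch destined for one bucket.
        System.out.println(groupByBucket(Arrays.asList(1, 2, 5, 6), 4));
    }
}
```

With this shape, each writer only ever appends to one bucket, which is what 
lets the client keep a small number of connections and produce larger files.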




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
