Prasanth Jayachandran created HIVE-19205:
--------------------------------------------

             Summary: Hive streaming ingest improvements (v2)
                 Key: HIVE-19205
                 URL: https://issues.apache.org/jira/browse/HIVE-19205
             Project: Hive
          Issue Type: Improvement
          Components: Streaming
    Affects Versions: 3.0.0, 3.1.0
            Reporter: Prasanth Jayachandran
            Assignee: Prasanth Jayachandran


This is an umbrella JIRA to track Hive streaming ingest improvements. At a high 
level, the improvements are:
- Support for dynamic partitioning
- API changes (simple streaming connection builder)
- Hide the transaction batches from clients (clients can tune the transaction 
batch but do not have to know the transaction batch size)
- Support auto rollover to next transaction batch (clients don't have to worry 
about closing a transaction batch and opening a new one)
- All record writers will be strict, meaning the schema of the record has to 
match the table schema. This avoids the repeated serialization/deserialization 
needed to reorder columns when there is a schema mismatch
- Automatic distribution for non-bucketed tables so that the compactor can have 
more parallelism
- Create delta files with all ORC overhead disabled (no compression, no 
dictionary). The compactor will recreate the ORC files with compression and 
dictionary encoding.
- Automatic memory management via auto-flushing (will yield smaller stripes for 
delta files but is more scalable and clients don't have to worry about 
distributing the data across writers)
- Support for more writers (Avro specifically. ORC passthrough format?)
- Support for accepting an input stream instead of a record byte[]
- Removing the HCatalog dependency (the old streaming API will stay in the 
hcatalog package for backward compatibility; the new streaming API will be in 
its own Hive module)
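
To illustrate the "hide the transaction batches" and "auto rollover" items 
above, here is a minimal, self-contained sketch in plain Java. It does not use 
any Hive classes. The names AutoRolloverSketch, Connection, txnsPerBatch, 
beginTransaction, write and commitTransaction are assumptions made for this 
example only and are not taken from this issue. The only point is that the 
caller loops over transactions while the connection opens the next batch on 
its own.

```java
import java.util.ArrayList;
import java.util.List;

public class AutoRolloverSketch {

    /** Hypothetical connection that hides transaction batches from the caller. */
    static final class Connection {
        private final int txnsPerBatch;     // tunable; callers never see batch boundaries
        private int remainingInBatch = 0;   // transactions left in the current batch
        private int batchesOpened = 0;      // how many batches were opened so far
        private boolean inTxn = false;
        private final List<String> pending = new ArrayList<>();
        private final List<String> committed = new ArrayList<>();

        Connection(int txnsPerBatch) {
            if (txnsPerBatch < 1) {
                throw new IllegalArgumentException("txnsPerBatch must be >= 1");
            }
            this.txnsPerBatch = txnsPerBatch;
        }

        void beginTransaction() {
            if (inTxn) {
                throw new IllegalStateException("transaction already open");
            }
            if (remainingInBatch == 0) {
                // Current batch is exhausted (or none exists yet): roll over.
                batchesOpened++;
                remainingInBatch = txnsPerBatch;
            }
            remainingInBatch--;
            inTxn = true;
        }

        void write(String record) {
            if (!inTxn) {
                throw new IllegalStateException("no open transaction");
            }
            pending.add(record);
        }

        void commitTransaction() {
            if (!inTxn) {
                throw new IllegalStateException("no open transaction");
            }
            committed.addAll(pending);
            pending.clear();
            inTxn = false;
        }

        int batchesOpened() {
            return batchesOpened;
        }

        List<String> committed() {
            return committed;
        }
    }

    public static void main(String[] args) {
        Connection conn = new Connection(2); // two transactions per batch
        for (int i = 0; i < 5; i++) {
            conn.beginTransaction();         // rolls over to a new batch when needed
            conn.write("row-" + i);
            conn.commitTransaction();
        }
        // prints "3 batches, 5 rows"
        System.out.println(conn.batchesOpened() + " batches, "
                + conn.committed().size() + " rows");
    }
}
```

With a batch size of 2, five transactions need three batches, and the client 
code never closes or opens a batch itself.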



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
