Prasanth Jayachandran created HIVE-19205:

             Summary: Hive streaming ingest improvements (v2)
                 Key: HIVE-19205
             Project: Hive
          Issue Type: Improvement
          Components: Streaming
    Affects Versions: 3.0.0, 3.1.0
            Reporter: Prasanth Jayachandran
            Assignee: Prasanth Jayachandran

This is an umbrella JIRA to track Hive streaming ingest improvements. At a high 
level, the improvements are:
- Support for dynamic partitioning
- API changes (simple streaming connection builder)
- Hide the transaction batches from clients (a client can tune the transaction 
batch size but doesn't have to know about it)
- Support auto rollover to next transaction batch (clients don't have to worry 
about closing a transaction batch and opening a new one)
- Record writers will all be strict, meaning the schema of the record has to 
match the table schema. This avoids the multiple serialization/deserialization 
passes needed to reorder columns when there is a schema mismatch
- Automatic distribution for non-bucketed tables so that the compactor can have 
more parallelism
- Create delta files with all ORC overhead disabled (no compression, no 
dictionary). The compactor will recreate the ORC files with compression and 
dictionary encoding.
- Automatic memory management via auto-flushing (will yield smaller stripes for 
delta files but is more scalable and clients don't have to worry about 
distributing the data across writers)
- Support for more writers (Avro specifically. ORC passthrough format?)
- Support for accepting an input stream instead of a record byte[]
- Removing HCatalog dependency (old streaming API will be in the hcatalog 
package for backward compatibility, new streaming API will be in its own hive 
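
The items on hiding transaction batches and auto rollover can be illustrated with a minimal, self-contained sketch. It is not the Hive API: the class StreamingConnectionSketch and all of its methods are hypothetical names invented for this illustration, and records are held in memory rather than written to delta files. It only shows the idea that the client calls begin/write/commit while the connection opens the next transaction batch on its own once the current one is used up.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (not the Hive streaming API): a connection that hides
 * transaction batches and rolls over to a new batch automatically.
 */
final class StreamingConnectionSketch {
    private final int batchSize;          // transactions per batch (tunable, but hidden from callers)
    private int remainingInBatch = 0;     // unused transactions left in the current batch
    private int batchesOpened = 0;        // how many batches were opened so far
    private boolean inTransaction = false;
    private final List<String> pending = new ArrayList<>();   // records of the open transaction
    private final List<String> committed = new ArrayList<>(); // records of committed transactions

    StreamingConnectionSketch(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    /** Starts a transaction, opening a new batch first if the current one is exhausted. */
    void beginTransaction() {
        if (inTransaction) {
            throw new IllegalStateException("a transaction is already open");
        }
        if (remainingInBatch == 0) {
            // Auto rollover: the client never closes or reopens a batch itself.
            batchesOpened++;
            remainingInBatch = batchSize;
        }
        remainingInBatch--;
        inTransaction = true;
    }

    /** Buffers a record in the open transaction. */
    void write(String record) {
        requireOpenTransaction();
        pending.add(record);
    }

    /** Makes the buffered records visible and ends the transaction. */
    void commitTransaction() {
        requireOpenTransaction();
        committed.addAll(pending);
        pending.clear();
        inTransaction = false;
    }

    /** Discards the buffered records and ends the transaction. */
    void abortTransaction() {
        requireOpenTransaction();
        pending.clear();
        inTransaction = false;
    }

    int batchesOpened() {
        return batchesOpened;
    }

    List<String> committedRecords() {
        return new ArrayList<>(committed);
    }

    private void requireOpenTransaction() {
        if (!inTransaction) {
            throw new IllegalStateException("no open transaction");
        }
    }
}
```

With a batch size of 2, five begin/write/commit cycles on this sketch open three batches without the caller ever referring to a batch.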

This message was sent by Atlassian JIRA
