[ https://issues.apache.org/jira/browse/HIVE-19205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438141#comment-16438141 ]
Prasanth Jayachandran commented on HIVE-19205:
----------------------------------------------
bq. Will batch size still be configurable somehow?
Yes, it will remain configurable for advanced use cases, but it will default to 10.
bq. One can set compaction specific properties on a table
The current expectation is that the table already exists, so any
compactor-specific properties would go directly to the table via an ALTER TABLE
command or at creation time. The streaming connection API does not currently
have a way to propagate custom properties to the table.
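For example, a table-level compaction override could look like the following. This is a sketch using the standard compaction table properties Hive already supports (the table name is a placeholder):

```sql
-- Keep auto-compaction on, but lower the number of delta files
-- that triggers a minor compaction for this table.
ALTER TABLE acid_events SET TBLPROPERTIES (
  'NO_AUTO_COMPACTION'='false',
  'compactorthreshold.hive.compactor.delta.num.threshold'='4'
);
```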
> Hive streaming ingest improvements (v2)
> ---------------------------------------
>
> Key: HIVE-19205
> URL: https://issues.apache.org/jira/browse/HIVE-19205
> Project: Hive
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Prasanth Jayachandran
> Assignee: Prasanth Jayachandran
> Priority: Major
>
> This is an umbrella JIRA to track Hive streaming ingest improvements. At a
> high level, the improvements are:
> - Support for dynamic partitioning
> - API changes (simple streaming connection builder)
> - Hide transaction batches from clients (a client can tune the transaction
> batch size but doesn't have to know about transaction batches)
> - Support auto rollover to the next transaction batch (clients don't have to
> worry about closing a transaction batch and opening a new one)
> - Record writers will all be strict, meaning the schema of the record has to
> match the table schema. This avoids the multiple
> serialization/deserialization passes needed to re-order columns when there is
> a schema mismatch
> - Automatic distribution for non-bucketed tables so that compactor can have
> more parallelism
> - Create delta files with all ORC overhead disabled (no index, no
> compression, no dictionary). The compactor will recreate the ORC files with
> index, compression, and dictionary encoding.
> - Automatic memory management via auto-flushing (will yield smaller stripes
> for delta files but is more scalable and clients don't have to worry about
> distributing the data across writers)
> - Support for more writers (Avro specifically. ORC passthrough format?)
> - Support to accept input stream instead of record byte[]
> - Removing HCatalog dependency (old streaming API will be in the hcatalog
> package for backward compatibility, new streaming API will be in its own hive
> module)
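Putting the builder API, hidden transaction batches, and strict writers together, client code might look like the sketch below. The class and method names (HiveStreamingConnection, StrictDelimitedInputWriter, withTransactionBatchSize, and so on) are assumed from the new org.apache.hive.streaming module being built here, the database/table names are placeholders, and running it requires a live HiveConf/metastore:

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.streaming.HiveStreamingConnection;
import org.apache.hive.streaming.StrictDelimitedInputWriter;

public class StreamingExample {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();

    // Strict writer: the record layout must match the table schema exactly,
    // so no serialize/deserialize pass is needed to re-order columns.
    StrictDelimitedInputWriter writer = StrictDelimitedInputWriter.newBuilder()
        .withFieldDelimiter(',')
        .build();

    HiveStreamingConnection connection = HiveStreamingConnection.newBuilder()
        .withDatabase("default")            // placeholder database
        .withTable("alerts")                // placeholder table
        .withAgentInfo("example-agent-1")
        .withTransactionBatchSize(10)       // advanced knob; defaults to 10
        .withRecordWriter(writer)
        .withHiveConf(conf)
        .connect();

    // Transaction batches are hidden: when the current batch is exhausted,
    // the connection rolls over to a new one automatically.
    connection.beginTransaction();
    connection.write("1,critical".getBytes());
    connection.write("2,warning".getBytes());
    connection.commitTransaction();

    connection.close();
  }
}
```

The client only ever sees begin/write/commit on the connection; batch allocation and rollover stay internal to the library.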
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)