hudi-bot opened a new issue, #15964:
URL: https://github.com/apache/hudi/issues/15964
This EPIC tracks changes to the Hudi storage format.
h2. Proposals
A format change is anything that changes any bits related to:
* *Timeline*: active or archived timeline contents, file names.
* {*}Base Files{*}: file format versions, any changes to any data types,
file footers, file names.
* {*}Log Files{*}: Block structure, content, names.
* {*}Metadata Table{*}: (should we call this index table instead?)
partition names, number of file groups, key/value schema and metadata to MDT
row mappings.
* {*}Table properties{*}: What's written to hoodie.properties.
* *Marker files*: Can be left to the writer implementation.
h2. Change summary:
At a minimum, the new format tech specs should be able to support the
following functionality:
Flexibility:
* [Pending] Ability to mix different types of base files within a single
table or even a single file group (e.g. images, JSON, vectors, ...)
* [Pending] Easy integration of metadata for JVM and non-JVM clients
(Parquet as MT format, HFile native APIs)
Metafields:
* [Resolved] Should _recordkey get special handling when it is a UUID?
* Semantics of _hoodie_commit_time, given the completion time changes.
Additional Info:
* Support encoding of watermarks/event time fields as first-class citizens,
for handling late-arriving data.
* [Resolved] Position-based skipping of base files
* [Pending] Additional metadata to avoid more RPCs to scan base file/log
blocks.
* [Pending] ML/Column family use-case?
* [Resolved] Support having a changeset of columns in each write, plus other
headers
Log:
* [No change needed] Support writing updates as deletes and inserts,
instead of logging them as updates to the base file.
* [Pending] CDC format is GA.
Table organization:
* [Pending] Support different logical partitions on the same data
* [Pending] RFC-60/Storage of table spread across buckets/root folders
* [Pending] Decouple table location from timeline and metadata. They can all
be in different places.
Concurrency/Timeline:
* [Pending] Ability to support general purpose multi-table transactions,
esp between data and metadata tables.
* [Pending] Support lockless/non-blocking transactions, where writers don't
block each other even in the face of conflicts.
* [Resolved] Support for long-lived instants in the timeline; break down the
distinction between active and archived timelines
* [Pending] Support checking of uniqueness constraints, even in the face of
two concurrent insert transactions.
* [Pending] Support precise time-travel queries
* [Pending] Support time-travel writes.
* [Pending] Support schema history tracking and aid in the schema evolution
implementation.
* [Resolved] TrueTime store/support for instant times
* [Pending] No more separate rollback action; make it a new state.
Metadata table:
* Encode filegroup ID and commit time along with file metadata
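
A rough illustration of the item above, not the actual metadata table schema: the sketch assumes base file names follow the `<fileId>_<writeToken>_<instantTime>.<ext>` pattern and shows a file listing entry that carries the file group ID and commit time as explicit fields, so readers would not need to re-parse file names. `FileEntry` and `make_entry` are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical metadata-table value; NOT Hudi's actual schema."""
    file_name: str
    size_bytes: int
    file_group_id: str  # assumed to be the first name component
    commit_time: str    # assumed to be the last name component


def make_entry(file_name: str, size_bytes: int) -> FileEntry:
    """Build an entry, assuming '<fileId>_<writeToken>_<instantTime>.<ext>'."""
    # Drop the extension, then split the remaining stem on underscores.
    stem, _, _ext = file_name.rpartition(".")
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected base file name: {file_name!r}")
    file_group_id, _write_token, commit_time = parts
    return FileEntry(file_name, size_bytes, file_group_id, commit_time)


# Illustrative, made-up file name and size.
entry = make_entry("abc-0_1-2-3_20230101120000.parquet", 1024)
print(entry.file_group_id, entry.commit_time)
```

Storing both values next to the file metadata would let a reader group files by file group and filter by commit time without touching the file names at all.
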
Table Properties:
* Partitioning information/indexing info
Marker Files:
* Write marker files for logs as well, based on new marker format.
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-6242
- Type: Epic
- Fix version(s): 1.1.0