[
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-6242:
---------------------------------
Epic Colour: ghx-label-2 (was: ghx-label-8)
> Format changes for Hudi 1.X release line
> ----------------------------------------
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> h2. Proposals
> A format change is anything that changes any bits related to:
> * *Timeline*: active or archived timeline contents, file names.
> * *Base Files*: file format versions, any changes to any data types,
> file footers, file names.
> * *Log Files*: Block structure, content, names.
> * *Metadata Table*: (should we call this index table instead?) partition
> names, number of file groups, key/value schema and metadata to MDT row
> mappings.
> * *Table properties*: What's written to hoodie.properties.
> * *Marker files*: Can be left to the writer implementation.
> h2. Change summary:
> The new format tech specs should support, at a minimum, the following
> functionality:
> Flexibility :
> * [Pending] Ability to mix different types of base files within a single
> table or even a single file group (e.g. images, JSON, vectors ...)
> * [Pending] Easy integration of metadata for JVM and non-JVM clients
> (Parquet as MT format, HFile native APIs)
> Metafields :
> * [Resolved] Should _recordkey get special handling when it is a UUID?
> * Semantics of _hoodie_commit_time, with completion time changes.
> Additional Info:
> * Support encoding of watermarks/event time fields as first-class citizens,
> for handling late-arriving data.
> * [Resolved] Position-based skipping of base file
> * [Pending] Additional metadata to avoid more RPCs to scan base file/log
> blocks.
> * [Pending] ML/Column family use-case?
> * [Resolved] Support having a changeset of columns in each write, other headers
> Log :
> * [No change needed] Support writing updates as deletes and inserts, instead
> of logging as update to base file.
> * [Pending] CDC format is GA.
> Table organization:
> * [Pending] Support different logical partitions on the same data
> * [Pending] RFC-60/Storage of table spread across buckets/root folders
> * [Pending] Decouple table location from timeline, metadata. They can all be
> in different places
> Concurrency/Timeline:
> * [Pending] Ability to support general purpose multi-table transactions,
> especially between data and metadata tables.
> * [Pending] Support lockless/non-blocking transactions, where writers don't
> block each other even in the face of conflicts.
> * [Resolved] Support for long-lived instants in timeline, break down the
> distinction between active/archived
> * [Pending] Support checking of uniqueness constraints, even in the face of
> two concurrent insert transactions.
> * [Pending] Support precise time-travel queries
> * [Pending] Support time-travel writes.
> * [Pending] Support schema history tracking and aid in schema evolution
> implementation.
> * [Resolved] TrueTime store/support for instant times
> * [Pending] No more separate rollback action. Make it a new state.
> Metadata table :
> * Encode filegroup ID and commit time along with file metadata
> Table Properties:
> * Partitioning information/indexing info
> Marker Files:
> * Write marker files for logs as well, based on new marker format.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)