[
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-6242:
---------------------------------
Description:
This EPIC tracks changes to the Hudi storage format.
h2. Proposals
Format change is anything that changes any bits related to
* *Timeline* : active or archived timeline contents, file names.
* {*}Base Files{*}: file format versions, any changes to any data types, file
footers, file names.
* {*}Log Files{*}: Block structure, content, names.
* {*}Metadata Table{*}: (should we call this index table instead?) partition
names, number of file groups, key/value schema and metadata to MDT row mappings.
* {*}Table properties{*}: What's written to hoodie.properties.
* *Marker files* : Can be left to the writer implementation.
h2. Change summary:
The following functionality should be supportable by the new format tech specs
(at a minimum)
Flexibility :
* [Pending] Ability to mix different types of base files within a
single table or even a single file group (e.g. images, JSON, vectors ...)
* [Pending] Easy integration of metadata for JVM and non-JVM clients (parquet
as MT format, HFile native APIs)
Metafields :
* [Resolved] Should _recordkey be uuid special handling?
* Semantics of _hoodie_commit_time, with completion time changes.
Additional Info:
* Support encoding of watermarks/event time fields as first-class
citizens, for handling late-arriving data.
* [Resolved] Position based skipping of base file
* [Pending] Additional metadata to avoid more RPCs to scan base file/log
blocks.
* [Pending] ML/Column family use-case?
* [Resolved] Support having changeset of columns in each write, other headers
Log :
* [No change needed] Support writing updates as deletes and inserts,
instead of logging them as updates to the base file.
* [Pending] CDC format is GA.
Table organization:
* [Pending] Support different logical partitions on the
same data
* [Pending] RFC-60/Storage of table spread across buckets/root folders
* [Pending] Decouple table location from timeline, metadata. They can all be
in different places
Concurrency/Timeline:
* [Pending] Ability to support general purpose
multi-table transactions, esp between data and metadata tables.
* [Pending] Support lockless/non-blocking transactions, where writers don't
block each other even in face of conflicts.
* [Resolved] Support for long lived instants in timeline, break down
distinction between active/archived
* [Pending] Support checking of uniqueness constraints, even in face of two
concurrent insert transactions.
* [Pending] Support precise time-travel queries
* [Pending] Support time-travel writes.
* [Pending] Support schema history tracking and aid in schema evolution implementation.
* [Resolved] TrueTime store/support for instant times
* [Pending] No more separate rollback action; make it a new state.
Metadata table :
* Encode filegroup ID and commit time along with file metadata
Table Properties:
* Partitioning information/indexing info
Marker Files:
* Write marker files for logs as well, based on new marker format.
was:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- {*}Base Files{*}: file format versions, any changes to any data types, file
footers, file names.
- {*}Log Files{*}: Block structure, content, names.
- {*}Metadata Table{*}: (should we call this index table instead?) partition
names, number of file groups, key/value schema and metadata to MDT row mappings.
- {*}Table properties{*}: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.
h2. Change summary:
The following functionality should be supportable by the new format tech specs
(at a minimum)
Flexibility :
- Ability to mix different types of base files within a single table or even a
single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients
Metafields :
- Should _recordkey be uuid special handling?
Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for
handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers
Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.
Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in
different places
Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between
data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each
other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between
active/archived
- Support checking of uniqueness constraints, even in face of two concurrent
insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times
- No more separate rollback action. make it a new state.
Metadata table :
- Encode filegroup ID and commit time along with file metadata
Table Properties:
- Partitioning information/indexing info
> Format changes for Hudi 1.X release line
> ----------------------------------------
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0