[ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description: 
This EPIC tracks changes to the Hudi storage format.
h2. Proposals

A format change is anything that changes any bits related to:

 * *Timeline*: active or archived timeline contents, file names.
 * *Base Files*: file format versions, any changes to any data types, file
footers, file names.
 * *Log Files*: block structure, content, names.
 * *Metadata Table*: (should we call this index table instead?) partition
names, number of file groups, key/value schema and metadata to MDT row mappings.
 * *Table properties*: what's written to hoodie.properties.
 * *Marker files*: can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs
(at a minimum):
Flexibility:
 * [Pending] Ability to mix different types of base files within a single table
or even a single file group (e.g. images, JSON, vectors ...)
 * [Pending] Easy integration of metadata for JVM and non-JVM clients (parquet
as MT format, HFile native APIs)

Metafields:
 * [Resolved] Should _recordkey have special handling for UUIDs?
 * Semantics of _hoodie_commit_time, with completion time changes.

Additional Info:
 * Support encoding of watermarks/event time fields as first-class citizens,
for handling late-arriving data.
 * [Resolved] Position-based skipping of base files
 * [Pending] Additional metadata to avoid more RPCs to scan base file/log
blocks.
 * [Pending] ML/Column family use-case?
 * [Resolved] Support having changeset of columns in each write, other headers

Log:
 * [No change needed] Support writing updates as deletes and inserts, instead
of logging them as updates to the base file.
 * [Pending] CDC format is GA.

Table organization:

 * [Pending] Support different logical partitions on the same data
 * [Pending] RFC-60/Storage of table spread across buckets/root folders
 * [Pending] Decouple table location from timeline, metadata. They can all be
in different places.

Concurrency/Timeline:

 * [Pending] Ability to support general-purpose multi-table transactions,
especially between data and metadata tables.
 * [Pending] Support lockless/non-blocking transactions, where writers don't
block each other even in the face of conflicts.
 * [Resolved] Support for long-lived instants in timeline, break down
distinction between active/archived
 * [Pending] Support checking of uniqueness constraints, even in the face of two
concurrent insert transactions.
 * [Pending] Support precise time-travel queries
 * [Pending] Support time-travel writes.
 * [Pending] Support schema history tracking and aid in schema evolution
implementation.
 * [Resolved] TrueTime store/support for instant times
 * [Pending] No more separate rollback action; make it a new state.

Metadata table:

 * Encode filegroup ID and commit time along with file metadata

Table Properties:

 * Partitioning information/indexing info

Marker Files:

 * Write marker files for logs as well, based on new marker format.


> Format changes for Hudi 1.X release line
> ----------------------------------------
>
>                 Key: HUDI-6242
>                 URL: https://issues.apache.org/jira/browse/HUDI-6242
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: core
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>              Labels: hudi-umbrellas
>             Fix For: 1.0.0
>



