[
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-8076:
---------------------------------
Description:
(!) Work in Progress.
h3. Basic Idea:
Introduce support for the Hudi writer code to produce the storage format for the
last 2-3 table versions. This enables older readers to continue reading the
table even when writers are upgraded, as long as the writer produces storage
compatible with the latest table version the reader can read. The readers can
then be rolling-upgraded easily, at different cadences, without any tight
coordination. Additionally, the reader should have the ability to "dynamically"
deduce the table version based on table properties, so that when the writer is
switched to the latest table version, subsequent reads simply adapt and read it
as the latest table version.
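A minimal sketch of what "dynamically" deducing the table version could look like: the reader consults the table's properties (e.g. {{hoodie.table.version}} in hoodie.properties) on each refresh instead of pinning a version at startup. The enum values, the default, and the method names below are illustrative assumptions, not actual Hudi code.

```java
import java.util.Map;

// Illustrative sketch: deduce the table version from table properties on each
// read, so readers adapt automatically after the writer switches versions.
public class VersionAwareReader {

    enum TableVersion { SIX, SEVEN, EIGHT }

    static TableVersion deduce(Map<String, String> tableProps) {
        // An absent property is treated as an older table (assumption).
        String v = tableProps.getOrDefault("hoodie.table.version", "6");
        switch (v) {
            case "8": return TableVersion.EIGHT;
            case "7": return TableVersion.SEVEN;
            default:  return TableVersion.SIX;
        }
    }
}
```

The key design point is that the deduction happens per read (or per metadata refresh), so no reader restart is needed when the writer flips the table version.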
Operators still need to ensure all readers have the latest binary that supports
a given table version before switching the writer to that version. Special
consideration goes to table services, which, as reader/writer processes, should
be able to manage the tables as well. Queries should fail gracefully during
table version switches and eventually start succeeding once the writer completes
the switch. Writers/table services should fail if working with an unsupported
table version, without which one cannot start switching writers to a new version
(this may still need a minor release on the last 2-3 table versions?)
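The fail-fast behavior for writers/table services could be sketched as below: the binary declares which table versions it can produce storage for and refuses to proceed otherwise. The supported-version set and the error message are illustrative assumptions, not actual Hudi code.

```java
import java.util.Set;

// Hypothetical guard: a writer or table-service process checks the table's
// version against the versions this binary can write, and fails fast rather
// than producing storage an older reader cannot consume.
public class WriterVersionGuard {

    // Illustrative: the last 2-3 table versions this binary supports.
    static final Set<Integer> SUPPORTED_WRITE_VERSIONS = Set.of(6, 7, 8);

    static void checkWritable(int tableVersion) {
        if (!SUPPORTED_WRITE_VERSIONS.contains(tableVersion)) {
            throw new IllegalStateException(
                "Writer binary does not support table version " + tableVersion
                    + "; upgrade before switching the writer.");
        }
    }
}
```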
h3. High level approach:
We need to introduce table-version-aware reading/writing inside the core layers
of Hudi, as well as in query engines like Spark/Flink.
To this effect, we need a HoodieStorageFormat abstraction that can cover the
following layers.
# Timeline: The timeline already has a timeline layout version, which can be
extended to write both older and newer timelines. The ArchivedTimeline can be
old style or LSM style, depending on the table version. {*}TBD{*}: whether or
not completion-time-based changes apply.
# WriteHandle: This layer may or may not need changes, as base files don't
really change and the log format is already versioned (see next point). However,
it's prudent to ensure we have a mechanism for this, since there could be
different ways of encoding records or footers etc. (e.g., HFile k-v pairs, ...)
# Metadata table: Encoding of k-v pairs, their schemas, etc., and which
partitions are supported in which table versions. {*}TBD{*} to see if any of the
code that
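The three layers above could hang off a per-version descriptor, sketched below as a registry keyed by table version. The RFC only names the HoodieStorageFormat concept; every field, version number, and partition name in this sketch is an illustrative assumption, not actual Hudi code.

```java
import java.util.List;
import java.util.Map;

// Sketch: one HoodieStorageFormat descriptor per table version, capturing the
// timeline layout, write-handle/log encoding, and metadata-table layers.
public class StorageFormats {

    record HoodieStorageFormat(
        int tableVersion,
        boolean lsmArchivedTimeline,    // layer 1: archived timeline style
        int logFormatVersion,           // layer 2: log/record encoding version
        List<String> metadataPartitions // layer 3: enabled metadata partitions
    ) {}

    // Illustrative entries; real values would come from the spec per version.
    static final Map<Integer, HoodieStorageFormat> REGISTRY = Map.of(
        6, new HoodieStorageFormat(6, false, 1, List.of("files")),
        8, new HoodieStorageFormat(8, true, 2, List.of("files", "column_stats"))
    );

    // Writers look up the format for the *table's* version, not the binary's
    // latest, so an upgraded writer keeps producing older-compatible storage.
    static HoodieStorageFormat forTableVersion(int v) {
        HoodieStorageFormat f = REGISTRY.get(v);
        if (f == null) {
            throw new IllegalArgumentException("Unsupported table version: " + v);
        }
        return f;
    }
}
```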
> RFC for backwards compatible writer mode in Hudi 1.0
> ----------------------------------------------------
>
> Key: HUDI-8076
> URL: https://issues.apache.org/jira/browse/HUDI-8076
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo
> Assignee: Vinoth Chandar
> Priority: Major
> Fix For: 1.0.0
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)