[ https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Y Ethan Guo updated HUDI-8076:
------------------------------
Remaining Estimate: 10h
Original Estimate: 10h
> RFC for backwards compatible writer mode in Hudi 1.0
> ----------------------------------------------------
>
> Key: HUDI-8076
> URL: https://issues.apache.org/jira/browse/HUDI-8076
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo (this is the old account; please use "yihua")
> Assignee: Vinoth Chandar
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Original Estimate: 10h
> Remaining Estimate: 10h
>
> *(!) Work in Progress.*
> h3. Basic Idea:
> Introduce support in the Hudi writer code for producing the storage format
> of the last 2-3 table versions (0.14/6, 1.0/7,8). This lets older readers
> continue reading the table even after writers are upgraded, as long as the
> writer produces storage compatible with the latest table version the reader
> can read. Readers can then be upgraded in a rolling fashion, at different
> cadences, without any tight coordination. Additionally, the reader should
> be able to "dynamically" deduce the table version from the table
> properties, so that when the writer is switched to the latest table version,
> subsequent reads simply adapt and read the table as the latest version.
> Operators still need to ensure all readers have the latest binary that
> supports a given table version before switching the writer to that version.
> Table services need special consideration, since they act as reader/writer
> processes and should be able to manage the tables as well. Queries should
> fail gracefully during table version switches and eventually start
> succeeding once the writer completes the switch. Writers/table services
> should fail if working with an unsupported table version, without which one
> cannot start switching writers to a new version (this may still need a minor
> release on the last 2-3 table versions?)
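The version check described above can be sketched as follows, in Python for brevity (Hudi itself is written in Java). This is only an illustration under stated assumptions: it assumes the table version is stored under the `hoodie.table.version` key in the table properties, and the function names and supported-version sets are hypothetical, not Hudi APIs.

```python
# Hypothetical sketch: function names and version sets are illustrative
# assumptions, not Hudi APIs.

def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve_table_version(props, supported_versions):
    """Deduce the table version from the table properties.

    Raises RuntimeError when this binary does not support the version, so the
    reader/writer/table service fails instead of misinterpreting the table.
    """
    version = int(props["hoodie.table.version"])  # assumed property key
    if version not in supported_versions:
        raise RuntimeError(
            f"table version {version} is not supported; "
            f"supported versions: {sorted(supported_versions)}"
        )
    return version


# An older reader binary that only understands table version 6 (0.14).
OLD_READER_VERSIONS = {6}
# An upgraded binary that understands table versions 6 and 8.
NEW_BINARY_VERSIONS = {6, 8}

props_v6 = parse_properties("hoodie.table.name=trips\nhoodie.table.version=6\n")
props_v8 = parse_properties("hoodie.table.name=trips\nhoodie.table.version=8\n")
```

Because the version is re-read from the properties on each call, an upgraded binary in this sketch picks up a switch from version 6 to 8 without a restart, while the older binary starts failing until it is upgraded.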
> h3. High level approach:
> We need to introduce table-version-aware reading/writing inside the core
> layers of Hudi, as well as in query engines like Spark/Flink.
> To this effect, we need a HoodieStorageFormat abstraction that can cover the
> following layers.
> h4. *Table Properties*
> All table properties need to be table version aware.
> h4. *Timeline*
> The timeline already has a timeline layout version, which can be extended to
> write both older and newer timelines. The ArchivedTimeline can be old style
> or LSM style, depending on the table version. So far we have only written
> the timeline with the latest schema, so we may have to version the .avsc
> files themselves and write using the Avro specific class for that table
> version? We also need a flexible mapping of the action names used in each
> table version, to handle things like replacecommit to cluster.
> {*}TBD{*}: whether or not completion time based changes can be retained,
> assuming instant file creation timestamp.
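One possible shape for the action-name mapping mentioned above is a per-table-version lookup, sketched here in Python for brevity. The table contents and the `action_name` helper are hypothetical and chosen only to illustrate the replacecommit/cluster case; they are not Hudi's actual mapping.

```python
# Hypothetical sketch: the mapping values and the helper below are
# illustrative assumptions, not Hudi's actual action names per version.
ACTION_NAMES_BY_TABLE_VERSION = {
    6: {"clustering": "replacecommit"},
    8: {"clustering": "clustering"},
}


def action_name(table_version, operation):
    """Return the timeline action name to use for an operation in a version."""
    per_version = ACTION_NAMES_BY_TABLE_VERSION.get(table_version)
    if per_version is None:
        raise ValueError(f"no action mapping for table version {table_version}")
    # Operations without a version-specific entry keep their own name.
    return per_version.get(operation, operation)
```

Keeping the mapping as data rather than scattered conditionals would let a new table version be added by extending one table.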
> h4. *FileSystemView*
> We need to support file slice grouping based on the old/new timeline + file
> naming combination. We also need abstractions for how files are named in
> each table version, to handle things like log file name changes.
> h4. *WriteHandle*
> This layer may or may not need changes, as base files don't really change
> and the log format is already versioned (see LogFormat Reader/Writer below).
> However, it's prudent to ensure we have a mechanism for this, since there
> could be different ways of encoding records or footers, etc. (e.g., HFile
> k-v pairs, ...)
> h4. Metadata table
> Encoding of k-v pairs, their schemas, etc., and which partitions are
> supported in which table versions. *TBD* to see if any of the code around
> recent simplification needs to be undone.
> h4. LogFormat Reader/Writer
> The blocks and the format itself are versioned, but this is not yet tied to
> the table version overall. We need these links so that the reader can, e.g.,
> decide how to read/assemble log file scanning? FileGroupReader abstractions
> may need to change as well.
> h4. Table Service
> This should largely be independent and built on the layers above, but some
> cases exist, depending on whether the code relies on format specifics for
> generating plans or for execution. This may be a mix of code changes,
> acceptable behavior changes and format-specific specializations.
> TBD: to see if behavior like writing rollback block needs to be
--
This message was sent by Atlassian Jira
(v8.20.10#820010)