[ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Due Date: 13/Sep/23  (was: 30/Jun/23)

> Format changes for Hudi 1.X release line
> ----------------------------------------
>
>                 Key: HUDI-6242
>                 URL: https://issues.apache.org/jira/browse/HUDI-6242
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: core
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>              Labels: hudi-umbrellas
>             Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> A format change is anything that changes any bits related to:
>  - *Timeline*: active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*: block structure, content, names.
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings.
>  - *Table Properties*: what's written to hoodie.properties.
>  - *Marker Files*: how would we treat these?
> At a minimum, the new format tech specs should be able to support the 
> following functionality:
> Flexibility:
>  - Ability to mix different types of base files within a single table or even 
> a single file group (e.g., images, JSON, vectors, ...)
>  - Easy integration of metadata for JVM and non-JVM clients
> Metafields:
>  - Should _recordkey be a UUID with special handling?
> Additional Info:
>  - Support encoding of watermarks/event time fields as a first-class citizen, 
> for handling late-arriving data.
>  - Position-based skipping of base file
>  - Additional metadata to avoid more RPCs to scan base file/log blocks.
>  - ML/Column family use-case?
>  - Support having changeset of columns in each write, other headers
> Log:
>  - Support writing updates as deletes and inserts, instead of logging them as 
> updates to the base file.
>  - CDC format is GA.
> Table organization:
>  - Support different logical partitions on the same data
>  - Storage of table spread across buckets/root folders
>  - Decouple table location from timeline, metadata. They can all be in 
> different places
> Concurrency/Timeline: 
>  - Ability to support general-purpose multi-table transactions, especially 
> between data and metadata tables.
>  - Support lockless/non-blocking transactions, where writers don't block each 
> other even in the face of conflicts.
>  - Support long-lived instants in the timeline, breaking down the distinction 
> between active and archived.
>  - Support checking of uniqueness constraints, even in the face of two 
> concurrent insert transactions.
>  - Support precise time-travel queries
>  - Support time-travel writes.
>  - Support schema history tracking and aid in schema evolution implementation.
>  - TrueTime store/support for instant times
> Metadata Table:
>  - Encode filegroup ID and commit time along with file metadata
> Table Properties: 
>  - Partitioning information/indexing info



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
