[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-6242:
---------------------------------
Description:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file
footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition
names, number of file groups, key/value schema and metadata to MDT row
mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?
The following functionality should be supportable by the new format tech specs
(at a minimum)
Flexibility :
- Ability to mix different types of base files within a single table or even a
single file group (e.g. images, JSON, vectors ...)
- Easy integration of metadata for JVM and non-JVM clients
Metafields :
- Should _recordkey get special handling when it is a UUID?
Additional Info:
- Support encoding of watermarks/event time fields as first-class citizens, for
handling late-arriving data.
- Position-based skipping of base files
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers
Log :
- Support writing updates as deletes and inserts, instead of logging them as
updates to the base file.
- CDC format is GA.
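As a rough illustration of the first Log item above, an update can be
decomposed into a delete of the old record followed by an insert of the new
image. The sketch below is hypothetical: LogEntry, split_update, and the field
names are invented for illustration and are not Hudi APIs or part of any agreed
log format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LogEntry:
    op: str                   # "delete" or "insert" (hypothetical op names)
    record_key: str
    payload: Optional[dict]   # None for deletes


def split_update(record_key: str, new_payload: dict) -> List[LogEntry]:
    """Express an update as a delete followed by an insert of the new image."""
    return [
        LogEntry("delete", record_key, None),
        LogEntry("insert", record_key, dict(new_payload)),
    ]
```

A reader replaying such a log would apply the delete before the insert, so the
order of the two entries matters.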
Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in
different places
Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, especially
between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each
other even in the face of conflicts.
- Support for long-lived instants in the timeline, breaking down the
distinction between active/archived
- Support checking of uniqueness constraints, even in the face of two
concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in the schema evolution
implementation.
- TrueTime store/support for instant times
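To make the uniqueness item above more concrete, here is a minimal sketch of
detecting duplicate keys between insert transactions whose execution windows
overlap. Everything in it is an assumption made for illustration: InsertTxn,
its integer start/end times, and conflicting_keys are hypothetical names, not
Hudi classes. It covers only the concurrent-writer part of the check; keys
already present in the table would still need to be checked separately.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class InsertTxn:
    txn_id: str
    start: int             # hypothetical logical start time
    end: int               # hypothetical logical completion time
    keys: FrozenSet[str]   # record keys this transaction inserts


def overlaps(a: InsertTxn, b: InsertTxn) -> bool:
    """True when the two transactions ran concurrently for some interval."""
    return a.start < b.end and b.start < a.end


def conflicting_keys(committed: List[InsertTxn], candidate: InsertTxn) -> Set[str]:
    """Keys the candidate shares with any committed, overlapping transaction."""
    dupes: Set[str] = set()
    for txn in committed:
        if overlaps(txn, candidate):
            dupes |= txn.keys & candidate.keys
    return dupes
```

A non-empty result would mean the candidate cannot commit as-is without
violating uniqueness; how to resolve it (abort, retry, merge) is left open.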
Metadata table :
- Encode filegroup ID and commit time along with file metadata
Table Properties:
- Partitioning information/indexing info
was:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file
footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition
names, number of file groups, key/value schema and metadata to MDT row
mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?
The following functionality should be supportable by the new format tech specs
(at a minimum)
Flexibility :
- Ability to mix different types of base files within a single table or even a
single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients
Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for
handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers
Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.
Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in
different places
Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between
data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each
other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between
active/archived
- Support checking of uniqueness constraints, even in face of two concurrent
insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times
Metadata table :
- Encode filegroup ID and commit time along with file metadata
Table Properties:
- Partitioning information/indexing info
> Format changes for Hudi 1.X release line
> ----------------------------------------
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>