This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new ab5b3752ceb3 docs: RFC-100 - Unstructured Data Storage in Hudi
(Initial strawman proposal) (#13924)
ab5b3752ceb3 is described below
commit ab5b3752ceb38e8814c3f4a86c6e3f68bfd3dbc6
Author: vinoth chandar <[email protected]>
AuthorDate: Thu Oct 9 19:27:19 2025 -0700
docs: RFC-100 - Unstructured Data Storage in Hudi (Initial strawman
proposal) (#13924)
---
rfc/README.md | 4 +-
rfc/rfc-100/rfc-100-autonomous-driving.png | Bin 0 -> 64224 bytes
rfc/rfc-100/rfc-100.md | 238 +++++++++++++++++++++++++++++
rfc/rfc-80/rfc-80.md | 193 +++++++++++++++--------
rfc/template.md | 4 +-
5 files changed, 371 insertions(+), 68 deletions(-)
diff --git a/rfc/README.md b/rfc/README.md
index 1867381ce1e1..68e7866a8abd 100644
--- a/rfc/README.md
+++ b/rfc/README.md
@@ -115,7 +115,7 @@ The list of all RFCs can be found here.
| 77 | [Secondary Index](./rfc-77/rfc-77.md)
|
:white_check_mark: `COMPLETED` |
| 78 | [1.0 Migration](./rfc-78/rfc-78.md)
|
:hammer_and_wrench: `IN PROGRESS` |
| 79 | [Robust handling of spark task retries and
failures](./rfc-79/rfc-79.md)
| :x: `ABANDONED` |
-| 80 | [Column Families](./rfc-80/rfc-80.md)
|
:hammer_and_wrench: `IN PROGRESS` |
+| 80 | [Column Groups](./rfc-80/rfc-80.md)
|
:hammer_and_wrench: `IN PROGRESS` |
| 81 | [Log Compaction with Merge Sort](./rfc-81/rfc-81.md)
| :eyes:
`UNDER REVIEW` |
| 82 | [Concurrent schema evolution detection](./rfc-82/rfc-82.md)
|
:white_check_mark: `COMPLETED` |
| 83 | [Incremental Table Service](./rfc-83/rfc-83.md)
|
:white_check_mark: `COMPLETED` |
@@ -136,4 +136,4 @@ The list of all RFCs can be found here.
| 98 | [Spark Datasource V2 Read](./rfc-98/rfc-98.md)
| :eyes:
`UNDER REVIEW` |
| 99 | [Hudi Type System Redesign](./rfc-99/rfc-99.md)
| :eyes:
`UNDER REVIEW` |
| 100 | [Unstructured Data Storage in Hudi](./rfc-100/rfc-100.md)
| :eyes:
`UNDER REVIEW` |
-| 100 | [Updates to the HoodieRecordMerger API](./rfc-101/rfc-101.md)
|
:hammer_and_wrench: `IN PROGRESS` |
\ No newline at end of file
+| 101 | [Updates to the HoodieRecordMerger API](./rfc-101/rfc-101.md)
|
:hammer_and_wrench: `IN PROGRESS` |
\ No newline at end of file
diff --git a/rfc/rfc-100/rfc-100-autonomous-driving.png
b/rfc/rfc-100/rfc-100-autonomous-driving.png
new file mode 100644
index 000000000000..87e5c0317bbb
Binary files /dev/null and b/rfc/rfc-100/rfc-100-autonomous-driving.png differ
diff --git a/rfc/rfc-100/rfc-100.md b/rfc/rfc-100/rfc-100.md
new file mode 100644
index 000000000000..0802f2f67935
--- /dev/null
+++ b/rfc/rfc-100/rfc-100.md
@@ -0,0 +1,238 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-100: Unstructured Data Storage in Hudi
+
+## Proposers
+
+- @rahil-c
+- @the-other-tim-brown
+- @vinothchandar
+
+## Approvers
+ - @balaji-varadarajan-ai
+ - @yihua
+
+## Status
+
+Issue: <Link to GH feature issue>
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+This RFC proposes extending Apache Hudi storage to support unstructured data
storage alongside traditional structured data within a unified table format.
Building on RFC-80's column groups and RFC-99's BLOB type system,
+this feature enables users to create tables with mixed structured and
unstructured columns, with intelligent storage strategies for different data
types. The proposal introduces hybrid storage formats within
+file groups, dynamic inline/out-of-line storage decisions for BLOB data, and
seamless integration with existing Hudi table services and other table
lifecycle operations.
+
+## Background
+
+The modern data landscape is rapidly evolving beyond traditional structured
data. In the era of AI and machine learning, organizations need to manage
diverse data types including images, videos, audio files, documents,
+embeddings, and other unstructured content alongside their traditional tabular
data. Current lakehouse architectures, including Hudi, are primarily optimized
for structured data storage and querying,
+creating significant limitations for AI-driven workloads.
+
+### Lakehouses + Unstructured Data. Why?
+
+**AI/ML Workload Requirements**: Modern AI applications require co-location of
structured metadata with unstructured content. For example, a computer vision
pipeline might need product metadata (structured)
+alongside product images (unstructured) in the same table. Currently, users
must maintain separate storage systems, leading to data consistency issues and
complex pipeline orchestration.
+
+**Unified Data Management**: Organizations benefit from applying the same
governance, versioning, and ACID properties to both structured and unstructured
data. Hudi's ACID semantics, metadata tracking, indexing,
+incremental processing, and table services should extend to unstructured data
to provide a unified data management experience.
+
+**Performance and Scalability**: Unstructured data stored as raw files in object storage suffers from the same bottlenecks -- too many small objects, costly cloud storage GET calls, misaligned partitioning schemes -- that Hudi already solves for structured data storage.
+
+**Leap to AI-enabled open data storage**: By extending Hudi to unstructured data storage, in a way that seamlessly co-exists with current structured data columns, users can seamlessly adapt their workflows to AI. For example, an SEO company now pivoting to optimize content for AI search can store raw Internet documents right inside the same table, by simply adding and backfilling a new BLOB column `doc` that is populated by reading a `url`
+column that already exists.
+
+
+
+### Building on Existing Foundation
+
+This RFC leverages two key foundation pieces:
+
+1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
+
+2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+
+## Requirements
+
+Below are the high-level requirements for this feature.
+
+1. Users must be able to define tables with a mix of structured (current types) and unstructured (BLOB type) columns.
+2. Records are distributed across file groups as in the regular Hudi storage layout. But within each file group, structured and unstructured columns are split into different column groups. This way the table can also scalably grow in terms of number of columns.
+3. Unstructured data can be stored inline (e.g. small images right inside the column group) or out-of-line (e.g. a pointer to a multi-GB video file someplace). This decision should be made dynamically during write/storage time.
+4. All table life-cycle operations and table services work seamlessly across both column types. For example, cleaning the file slices should reclaim both inline and out-of-line blob data. Clustering should be able to re-organize records across file groups or even redistribute columns across column groups within the same file group.
+5. Storage should support different column group distributions, i.e. different membership of columns across column groups, across file groups, to ensure users or table services can flexibly reconfigure all this as the table grows, without re-writing all of the data.
+6. Hudi should expose controls at the writer level, to control whether new columns are written to new column groups or expand an existing column group within a file group.
+
+
+## High-Level Design
+
+The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
+Parquet format, while unstructured columns can use specialized formats such as Lance, optimized Parquet configurations, or HFile for random access.
+
+### 1. Mixed Base File Format Support
+
+**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
+- **Structured Column Groups**: Continue using Parquet with standard
optimizations
+- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
+
+**Format Configuration**: File format is determined at column group creation time (per the current RFC-80).
+But ideally, all these configurations should be automatic, and Hudi should auto-generate column group names and mappings.
+
+
+```sql
+CREATE TABLE multimedia_catalog (
+ id BIGINT,
+ product_name STRING,
+ category STRING,
+ image BINARY,
+ video LARGE_BINARY,
+ embeddings ARRAY<FLOAT>
+) USING HUDI
+TBLPROPERTIES (
+ 'hoodie.table.type' = 'MERGE_ON_READ',
+ 'hoodie.bucket.index.hash.field' = 'id',
+ 'hoodie.columngroup.structured' = 'id,product_name,category;id',
+ 'hoodie.columngroup.images' = 'id,image;id',
+ 'hoodie.columngroup.videos' = 'id,video;id',
+ 'hoodie.columngroup.ml' = 'id,embeddings;id',
+ 'hoodie.columngroup.images.format' = 'parquet',
+ 'hoodie.columngroup.videos.format' = 'lance',
+ 'hoodie.columngroup.ml.format' = 'hfile'
+)
+```
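+The `hoodie.columngroup.*` values above follow the `col1,col2,...,colN;precombineCol` format from RFC-80. As a rough sketch of how such a value could be parsed (the class below is hypothetical, not Hudi's actual parser):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative parser for the "col1,col2,...,colN;precombineCol" value format
// used by the hoodie.columngroup.* table properties above. This helper class
// is hypothetical and not part of Hudi's codebase.
public class ColumnGroupSpec {
    public final List<String> columns;
    public final String preCombineField;

    public ColumnGroupSpec(List<String> columns, String preCombineField) {
        this.columns = columns;
        this.preCombineField = preCombineField;
    }

    public static ColumnGroupSpec parse(String value) {
        String[] parts = value.split(";", 2);
        List<String> cols = Arrays.asList(parts[0].split(","));
        // Per RFC-80, if no preCombine field is given, the primary key is
        // taken by default (assumed here to be the first listed column).
        String preCombine = (parts.length == 2 && !parts[1].trim().isEmpty())
                ? parts[1].trim() : cols.get(0);
        return new ColumnGroupSpec(cols, preCombine);
    }

    public static void main(String[] args) {
        ColumnGroupSpec images = parse("id,image;id");
        System.out.println(images.columns + " / " + images.preCombineField);
    }
}
```

+The fallback to the first listed column mirrors RFC-80's default of the primary key, under the assumption that the key is listed first.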
+
+### 2. Dynamic Inline/Out-of-Line Storage
+
+**Storage Decision Logic**: During write time, Hudi determines the storage strategy based on BLOB size:
+- **Inline Storage**: BLOB data < 1MB stored directly in the column group
file, to avoid excessive cloud storage API calls.
+- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store
locations with pointers in the main file, to avoid write amplification during
updates.
+
+
+**Storage Pointer Schema**:
+```json
+{
+ "type": "record",
+ "name": "BlobPointer",
+ "fields": [
+ {"name": "storage_type", "type": "string"},
+ {"name": "size", "type": "long"},
+ {"name": "checksum", "type": "string"},
+ {"name": "external_path", "type": ["null", "string"]},
+ {"name": "compression", "type": ["null", "string"]}
+ ]
+}
+```
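+A minimal sketch of the write-time decision combined with the pointer schema above (the 1MB threshold is from this section; the class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the write-time inline/out-of-line decision described above.
// The 1 MB threshold matches the text; this class is a hypothetical
// illustration, not a Hudi API.
public class BlobStorageDecider {
    static final long INLINE_THRESHOLD_BYTES = 1_048_576L; // 1 MB

    // Returns a BlobPointer-shaped record (see the Avro schema above).
    public static Map<String, Object> decide(long blobSizeBytes, String externalPathIfLarge) {
        Map<String, Object> pointer = new HashMap<>();
        boolean inline = blobSizeBytes < INLINE_THRESHOLD_BYTES;
        pointer.put("storage_type", inline ? "inline" : "external");
        pointer.put("size", blobSizeBytes);
        // external_path is null for inline blobs, per the union type in the schema
        pointer.put("external_path", inline ? null : externalPathIfLarge);
        return pointer;
    }

    public static void main(String[] args) {
        System.out.println(decide(512 * 1024, null)); // small image: stored inline
        System.out.println(decide(4L * 1024 * 1024 * 1024,
                "s3://bucket/table/.hoodie/blobs/...")); // multi-GB video: stored out-of-line
    }
}
```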
+
+**External Storage Layout**:
+```
+{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_group}/{instant}/{blob_id}
+```
+Alternatively, the user should be able to specify an external storage location per BLOB during writes, as needed.
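+As an illustration, the layout template could be materialized like this (a hypothetical helper doing simple string joining, not part of Hudi):

```java
// Builds an out-of-line blob path following the layout template above.
// Purely illustrative; Hudi does not currently expose such a helper.
public class BlobPathLayout {
    public static String blobPath(String tablePath, String partition, String fileGroupId,
                                  String columnGroup, String instant, String blobId) {
        return String.join("/", tablePath, ".hoodie", "blobs",
                partition, fileGroupId, columnGroup, instant, blobId);
    }

    public static void main(String[] args) {
        // Example values only; instant and IDs are made up for illustration.
        System.out.println(blobPath("s3://bucket/tbl", "2025-10-09", "fg-1",
                "videos", "20251009192719", "blob-42"));
    }
}
```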
+
+### 3. Parquet Optimization for BLOB Storage
+
+For unstructured column groups using Parquet:
+- **Disable Compression**: Avoid double compression of already compressed
media files
+- **Plain Encoding**: Use PLAIN encoding instead of dictionary encoding for
BLOB columns
+- **Large Page Sizes**: Configure larger page sizes to optimize for sequential
BLOB access
+- **Metadata Index**: Maintain BLOB metadata in Hudi metadata table for
efficient retrieval of a single blob value.
+- **Disable stats**: Column statistics are not very useful for BLOB columns
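+These optimizations roughly map to standard parquet-mr writer settings. A sketch collecting them into a configuration map (the keys are parquet-mr's Hadoop configuration names; how Hudi would scope them per column group is an open design question):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of parquet-mr settings matching the bullet points above, collected
// into a plain map. The keys are standard parquet-mr Hadoop configuration
// names; wiring them per column group into Hudi is illustrative only.
public class BlobParquetSettings {
    public static Map<String, String> forBlobColumnGroup() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("parquet.compression", "UNCOMPRESSED");   // avoid double compression of media
        conf.put("parquet.enable.dictionary", "false");    // PLAIN encoding for BLOB columns
        conf.put("parquet.page.size", String.valueOf(8 * 1024 * 1024)); // larger pages for sequential access
        return conf;
    }

    public static void main(String[] args) {
        forBlobColumnGroup().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```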
+
+### 4. Lance Format Integration
+
+**Lance Advantages for Unstructured Data**:
+- Native support for high-dimensional vectors and embeddings
+- Efficient columnar storage for mixed structured/unstructured data
+- Better compression for certain unstructured data types
+
+Supporting Lance, by working across the Hudi and Lance communities, will help users unlock the benefits of Hudi's currently supported file formats (Parquet, ORC) along with the benefits of Lance. Over time, we could also incorporate newer emerging file formats in the space, as well as other well-established unstructured file formats.
+
+### 5. Enhanced Table Services
+
+**Cleaning Service Extensions**:
+- Track external BLOB references in metadata table
+- Implement cascading deletion of external BLOB files during cleaning
+- Add BLOB-specific retention policies, using reference counting to reclaim
out-of-line blobs.
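+The reference-counting idea for out-of-line blobs can be sketched as follows (a toy in-memory model, not the actual cleaner design):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the reference-counting idea for reclaiming out-of-line blobs:
// an external blob file is physically deleted only once no file slice
// references it. All names are hypothetical, not Hudi's cleaner implementation.
public class BlobRefCounter {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // Called when a new file slice records a pointer to blobId.
    public void addReference(String blobId) {
        refCounts.merge(blobId, 1, Integer::sum);
    }

    // Called when cleaning removes a file slice that pointed at blobId.
    // Returns true if the external blob file can now be reclaimed.
    public boolean removeReference(String blobId) {
        int remaining = refCounts.merge(blobId, -1, Integer::sum);
        if (remaining <= 0) {
            refCounts.remove(blobId);
            return true; // last reference gone; safe to delete the blob file
        }
        return false;
    }
}
```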
+
+**Compaction Service Extensions**:
+- Support cross-format compaction (merge Lance and Parquet column groups)
+- Implement BLOB deduplication during major compaction
+- Optimize external BLOB consolidation
+
+**Clustering Service Extensions**:
+- Enable redistribution of BLOB data across file groups
+- Support column group reconfiguration during clustering
+- Implement BLOB-aware data skipping strategies
+
+### 6. Flexible Column Group Management
+
+**Dynamic Column Group Creation**:
+```java
+// Writer API extensions
+HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
+ .withColumnGroupStrategy(ADAPTIVE) // AUTO, FIXED, ADAPTIVE
+ .withNewColumnGroupThreshold(100_000_000L) // 100MB
+ .withBlobStorageThreshold(1_048_576L) // 1MB
+ .build();
+```
+
+**Column Group Reconfiguration**:
+- Support splitting existing column groups when they grow too large
+- Enable merging small column groups during maintenance operations
+- Allow migration of columns between column groups
+
+### 7. Query Engine Integration
+
+**Spark Integration**:
+- Extend DataSource API to handle mixed column group formats
+- Implement vectorized readers for new file formats like Lance
+- Support predicate pushdown across different storage formats
+- Dynamically and lazily fetch BLOB values to avoid shuffling large BLOBs.
+
+**Ray Integration**:
+- Native support for reading unstructured data into Ray datasets using
Ray/Hudi integration.
+- Efficient BLOB streaming for distributed ML workloads
+- Integration with Ray's object store for large BLOB caching
+
+### 8. Metadata Table Extensions
+
+- Track BLOB references for garbage collection
+- Maintain indexes for Parquet-based BLOB storage
+- Maintain size statistics for storage optimization
+- Support BLOB-based query optimization
+
+## Rollout/Adoption Plan
+
+WIP
+
+## Test Plan
+
+WIP
\ No newline at end of file
diff --git a/rfc/rfc-80/rfc-80.md b/rfc/rfc-80/rfc-80.md
index 18abd2a1e9ea..9ec39f36ee7b 100644
--- a/rfc/rfc-80/rfc-80.md
+++ b/rfc/rfc-80/rfc-80.md
@@ -14,7 +14,7 @@
See the License for the specific language governing permissions and
limitations under the License.
-->
-# RFC-80: Support column families for wide tables
+# RFC-80: Support column groups for wide tables
## Proposers
- @xiarixiaoyao
@@ -27,14 +27,14 @@
## Status
-JIRA: https://issues.apache.org/jira/browse/HUDI-7947
+Issue: https://github.com/apache/hudi/issues/13922
## Abstract
In streaming processing, there are often scenarios where the table is widened.
The current mainstream real-time wide table concatenation is completed through
Flink's multi-layer join;
Flink's join will cache a large amount of data in the state backend. As the
data set increases, the pressure on the Flink task state backend will gradually
increase, and may even become unavailable.
In multi-layer join scenarios, this problem is more obvious.
-1.x also supports partial updates being encoded in logfiles. That should be
able to handle this scenario. But even with partial-update, the column families
will reduce write amplification on compaction.
+1.x also supports partial updates being encoded in logfiles. That should be
able to handle this scenario. But even with partial-update, the column groups
will reduce write amplification on compaction.
So, main gains of clustering columns for wide tables are:
Write performance:
@@ -45,14 +45,14 @@ Read performance:
Since the data is already sorted when it is written, the SortMerge method can
be used directly to merge the data; compared with the native bucket data
reading performance is improved a lot, and the memory consumption is reduced
significantly.
Compaction performance:
-The logic of compaction and reading is the same. Compaction costs across
column families is where there real savings are.
+The logic of compaction and reading is the same. Compaction costs across column groups are where the real savings are.
The log merge we can make it pluggable to decide between hash or sort merge -
we need to introduce new log headers or standard mechanism for merging to
determine if base file or log files are sorted.
## Background
-Currently, Hudi organizes data according to fileGroup granularity. The
fileGroup is further divided into column clusters to introduce the columnFamily
concept.
+Currently, Hudi organizes data according to fileGroup granularity. The
fileGroup is further divided into column clusters to introduce the columngroup
concept.
The organizational form of Hudi files is divided according to the following
rules:
-The data in the partition is divided into buckets according to hash (each
bucket maps to a file group); the files in each bucket are divided according to
columnFamily; multiple colFamily files in the bucket form a completed
fileGroup; when there is only one columnFamily, it degenerates into the native
Hudi bucket table.
+The data in the partition is divided into buckets according to hash (each bucket maps to a file group); the files in each bucket are divided according to column group; multiple column group files in the bucket form a complete fileGroup; when there is only one column group, it degenerates into the native Hudi bucket table.

@@ -61,84 +61,84 @@ This feature should be implemented for both Spark and
Flink. So, a table written
### Constraints and Restrictions
1. The overall design relies on the non-blocking concurrent writing feature of
Hudi 1.0.
-2. Lower version Hudi cannot read and write column family tables.
-3. Only MOR bucketed tables support setting column families.
- MOR+Bucket is more suitable because it has higher write performance, but
this does not mean that column family is incompatible with other indexes and
cow tables.
-4. Column families do not support repartitioning and renaming.
-5. Schema evolution does not take effect on the current column family table.
+2. Lower version Hudi cannot read and write column group tables.
+3. Only MOR bucketed tables support setting column groups.
+ MOR+Bucket is more suitable because it has higher write performance, but
this does not mean that column group is incompatible with other indexes and cow
tables.
+4. Column groups do not support repartitioning and renaming.
+5. Schema evolution does not take effect on the current column group table.
Not supporting Schema evolution does not mean users can not add/delete
columns in their table, they just need to do it explicitly.
6. Like native bucket tables, clustering operations are not supported.
### Model change
-After the column family is introduced, the storage structure of the entire
Hudi bucket table changes:
+After the column group is introduced, the storage structure of the entire Hudi
bucket table changes:

-The bucket is divided into multiple columnFamilies by column cluster. When
columnFamily is 1, it will automatically degenerate into the native bucket
table.
+The bucket is divided into multiple columngroups by column cluster. When
columngroup is 1, it will automatically degenerate into the native bucket table.

### Proposed Storage Format Changes
-After splitting the fileGroup by columnFamily, the naming rules for base files
and log files change. We add the cfName suffix at the end of all file names to
facilitate Hudi itself to distinguish column families. If it's not present, we
assume default column family.
+After splitting the fileGroup by columngroup, the naming rules for base files
and log files change. We add the cfName suffix at the end of all file names to
facilitate Hudi itself to distinguish column groups. If it's not present, we
assume default column group.
So, new file name templates will be as follows:
- Base file: [file_id]\_[write_token]\_[begin_time][_cfName].[extension]
- Log file:
[file_id]\_[begin_instant_time][_cfName].log.[version]_[write_token]
-Also, we should evolve the metadata table files schema to additionally track a
column family name.
+Also, we should evolve the metadata table files schema to additionally track a
column group name.
-### Specifying column families when creating a table
-In the table creation statement, column family division is specified in the
options/tblproperties attribute;
-Column family attributes are specified in key-value mode:
-* Key is the column family name. Format: hoodie.colFamily. Column family name
naming rules specified.
-* Value is the specific content of the column family: it consists of all the
columns included in the column family plus the preCombine field. Format: "
col1,col2...colN; precombineCol", the column family list and the preCombine
field are separated by ";"; in the column family list the columns are split by
",".
+### Specifying column groups when creating a table
+In the table creation statement, column group division is specified in the
options/tblproperties attribute;
+Column group attributes are specified in key-value mode:
+* Key is the column group name. Format: hoodie.colgroup. Column group name
naming rules specified.
+* Value is the specific content of the column group: it consists of all the
columns included in the column group plus the preCombine field. Format: "
col1,col2...colN; precombineCol", the column group list and the preCombine
field are separated by ";"; in the column group list the columns are split by
",".
-Constraints: The column family list must contain the primary key, and columns
contained in different column families cannot overlap except for the primary
key. The preCombine field does not need to be specified. If it is not
specified, the primary key will be taken by default.
+Constraints: The column group list must contain the primary key, and columns
contained in different column groups cannot overlap except for the primary key.
The preCombine field does not need to be specified. If it is not specified, the
primary key will be taken by default.
-After the table is created, the column family attributes will be persisted to
hoodie's metadata for subsequent use.
+After the table is created, the column group attributes will be persisted to
hoodie's metadata for subsequent use.
-### Adding and deleting column families in existing table
-Use the SQL alter command to modify the column family attributes and persist
it:
-* Execute ALTER TABLE table_name SET TBLPROPERTIES
('hoodie.columnFamily.k'='a,b,c;a'); to add a new column family.
-* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columnFamily.k');
to delete the column family.
+### Adding and deleting column groups in existing table
+Use the SQL alter command to modify the column group attributes and persist
it:
+* Execute ALTER TABLE table_name SET TBLPROPERTIES
('hoodie.columngroup.k'='a,b,c;a'); to add a new column group.
+* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columngroup.k');
to delete the column group.
Specific steps are as follows:
-1. Execute the ALTER command to modify the column family
-2. Verify whether the column family modified by alter is legal. Column family
modification must meet the following conditions, otherwise the verification
will not pass:
- * The column family name of an existing column family cannot be modified.
- * Columns in other column families cannot be divided into new column
families.
- * When creating a new column family, it must meet the format requirements
from previous chapter.
-3. Save the modified column family to the .hoodie directory.
+1. Execute the ALTER command to modify the column group
+2. Verify whether the column group modified by alter is legal. Column group
modification must meet the following conditions, otherwise the verification
will not pass:
+ * The column group name of an existing column group cannot be modified.
+ * Columns in other column groups cannot be divided into new column groups.
+ * When creating a new column group, it must meet the format requirements
from previous chapter.
+3. Save the modified column group to the .hoodie directory.
### Writing data
-The Hudi kernel divides the input data according to column families; the data
belonging to a certain column family is sorted and directly written to the
corresponding column family log file.
+The Hudi kernel divides the input data according to column groups; the data
belonging to a certain column group is sorted and directly written to the
corresponding column group log file.

Specific steps:
1. The engine divides the written data into buckets according to hash and
shuffles the data (the writing engine completes it by itself and is consistent
with the current writing of the native bucket).
2. The Hudi kernel sorts the data to be written to each bucket by primary key
(both Spark and Flink has its own ExternalSorter, we can refer those
ExternalSorter to finish sort).
-3. After sorting, split the data into column families.
-4. Write the segmented data into the log file of the corresponding column
family.
+3. After sorting, split the data into column groups.
+4. Write the segmented data into the log file of the corresponding column
group.
#### Common API interface
-After the table columns are clustered, the writing process includes the
process of sorting and splitting the data compared to the original bucket
bucketing. A new append interface needs to be introduced to support column
families.
-Introduce ColumnFamilyAppendHandle extend AppendHandle to implement column
family writing.
+After the table columns are clustered, the writing process includes the
process of sorting and splitting the data compared to the original bucket
bucketing. A new append interface needs to be introduced to support column
groups.
+Introduce ColumngroupAppendHandle, extending AppendHandle, to implement column group writing.

### Reading data
-#### ColumnFamilyReader and RowReader
+#### ColumngroupReader and RowReader

Hudi internal row reader reading steps:
-1. Hudi organizes files by column families to be read.
-2. Introduce familyReader to merge and read each column family's own baseFile
and logfile to achieve column family-level data reading.
- * Since log files are written after being sorted by primary key,
familyReader merges its own baseFile and logFile by primary key using sortMerge.
- * familyReader supports upstream incoming column pruning to reduce IO
overhead when reading data.
- * During the merging process, if the user specifies the precombie field
for the column family, the merging strategy will be selected based on the
precombie field. This logic reuses Hudi's own precombine logic and does not
need to be modified.
-3. Row reader merges the data read by multiple familyReaders according to the
primary key.
+1. Hudi organizes files by column groups to be read.
+2. Introduce groupReader to merge and read each column group's own baseFile
and logfile to achieve column group-level data reading.
+ * Since log files are written after being sorted by primary key,
groupReader merges its own baseFile and logFile by primary key using sortMerge.
+ * groupReader supports upstream incoming column pruning to reduce IO
overhead when reading data.
+    * During the merging process, if the user specifies the preCombine field for the column group, the merging strategy will be selected based on the preCombine field. This logic reuses Hudi's own precombine logic and does not need to be modified.
+3. Row reader merges the data read by multiple groupReaders according to the
primary key.
-Since the data read by each familyReader is sorted by the primary key, the row
reader merges the data read by each familyReader in the form of sortMergeJoin
and returns the complete data.
+Since the data read by each groupReader is sorted by the primary key, the row
reader merges the data read by each groupReader in the form of sortMergeJoin
and returns the complete data.
The entire reading process involves a large amount of data merging, but
because the data itself is sorted, the memory consumption of the entire merging
process is very low and the merging is fast. Compared with Hudi's native
merging method, the memory pressure and the merging time are significantly
reduced.
@@ -150,38 +150,103 @@ The entire reading process involves a large amount of
data merging, but because
3) The Hudi kernel completes the data reading of rowReader and returns
complete data. The data format is Avro.
4) The engine gets the Avro format data and needs to convert it into the data
format it needs. For example, spark needs to be converted into unsaferow, hetu
into block, flink into row, and hive into arrayWritable.
-### Column family level compaction
-Extend Hudi's compaction schedule module to merge each column family's own
base file and log file:
+### Column group level compaction
+Extend Hudi's compaction schedule module to merge each column group's own base
file and log file:
-
+
### Full compaction
-Extend Hudi's compaction schedule module to merge and update all column
families in the entire table.
-After merging at the column family level, multiple column families are finally
merged into a complete row and saved.
+Extend Hudi's compaction schedule module to merge and update all column groups
in the entire table.
+After merging at the column group level, multiple column groups are finally
merged into a complete row and saved.

-Full compaction will be optional, only column family level compaction is
required.
-Different users might need different columns, some might need columns come
from multiple column families, some might need columns from only one column
family.
+Full compaction will be optional, only column group level compaction is
required.
+Different users might need different columns; some might need columns that come from multiple column groups, some might need columns from only one column group.
It's better to allow users to choose whether enable full compaction or not.
-Besides, after full compaction, projection on the reader side is less
efficient because the projection could only be based on the full parquet file
with all complete fields instead of based on column family names.
+Besides, after full compaction, projection on the reader side is less
efficient because the projection could only be based on the full parquet file
with all complete fields instead of based on column group names.
-ColumnFamily can be used not only in ML scenarios and AI feature table, but
also can be used to simulate multi-stream join and concatenation of wide
tables. 。
+Column groups can be used not only in ML scenarios and AI feature tables, but also to simulate multi-stream joins and the concatenation of wide tables.
In simulation of multi-stream join scenario, Hudi should produce complete
rows, so full compaction is needed in this case.
## Rollout/Adoption Plan
-This feature itself is a brand-new feature. If you don’t actively turn it on,
you will not be able to reach the logic of the column families.
-Business behavior compatibility: No impact, this function will not be actively
turned on, and column family logic will not be enabled.
-Syntax compatibility: No impact, the column family attributes are in the table
attributes and are executed through SQL standard syntax.
+This is a brand-new feature; unless it is actively enabled, the column group code paths are never reached.
+Business behavior compatibility: No impact, this function will not be actively
turned on, and column group logic will not be enabled.
+Syntax compatibility: No impact, the column group attributes are in the table
attributes and are executed through SQL standard syntax.
Compatibility of data type processing methods: No impact, this design will not
modify the bottom-level data field format.
## Test Plan
List to check that the implementation works as expected:
-1. All existing tests pass successfully on tables without column families
defined.
-2. Hudi SQL supports setting column families when creating MOR bucketed
tables.
-3. Column families support adding and deleting in SQL for MOR bucketed tables.
-4. Hudi supports writing data by column families.
-5. Hudi supports reading data by column families.
-7. Hudi supports compaction by column family.
-8. Hudi supports full compaction, merging the data of all column families to
achieve data widening.
\ No newline at end of file
+1. All existing tests pass successfully on tables without column groups
defined.
+2. Hudi SQL supports setting column groups when creating MOR bucketed tables.
+3. Column groups support adding and deleting in SQL for MOR bucketed tables.
+4. Hudi supports writing data by column groups.
+5. Hudi supports reading data by column groups.
+6. Hudi supports compaction by column group.
+7. Hudi supports full compaction, merging the data of all column groups to
achieve data widening.
+
+
+## Potential Approaches
+
+### Approach A: Column Groups under File Group
+
+We map each record key to a single file group, consistently across column
groups. Each column group has file slices,
+like we do today in file groups. How columns are split into column groups is
fluid and can be different across file groups.
+
+
+
+```
+records 1-25 ==> file group 1 ==> [column group : c1-c10], [column group :
c11-c74], [column group : c75-c100]
+
+records 26-50 ==> file group 2 ==> [column group : c1-c40], [column group :
c41-c100]
+
+records 51-100 ==> file group 3 ==> [column group : c1-c20], [column group :
c21-c60], [column group : c61-c80], [column group : c81-c100]
+
+```
+_Layout for table with 100 records, 100 columns_
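The layout above can be modeled with a small Python sketch (the file group names, ranges, and `locate` helper are illustrative, not Hudi APIs): every key resolves to one file group, and each file group carries its own column split.

```python
# Approach A sketch: record keys map to exactly one file group, and each
# file group owns its own (possibly different) split of columns into
# column groups. Mirrors the layout above; illustrative only.
FILE_GROUPS = {
    # file group -> (record key range, column group splits)
    "fg1": (range(1, 26),   {"cg1": range(1, 11), "cg2": range(11, 75),
                             "cg3": range(75, 101)}),
    "fg2": (range(26, 51),  {"cg1": range(1, 41), "cg2": range(41, 101)}),
    "fg3": (range(51, 101), {"cg1": range(1, 21), "cg2": range(21, 61),
                             "cg3": range(61, 81), "cg4": range(81, 101)}),
}

def locate(record, column):
    """Find the (file group, column group) holding one record's column."""
    for fg, (records, splits) in FILE_GROUPS.items():
        if record in records:
            for cg, cols in splits.items():
                if column in cols:
                    return fg, cg
    raise KeyError((record, column))

# Record 30, column c50 lives in file group 2's second column group...
assert locate(30, 50) == ("fg2", "cg2")
# ...while the same column for record 10 sits under a different split.
assert locate(10, 50) == ("fg1", "cg2")
```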
+
+**Indexing**: works as is, since the mapping from key to file-group is intact.
+
+**Cleaning**: Column groups can be updated at different rates. i.e. one column
group can receive more updates than others.
+To retain versions belonging to the last `x` writes, each column group can
simply enforce retention on its own file slices,
like today. This should work since it's based on the same shared timeline anyway.
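A minimal sketch of this per-group retention, assuming file slices identified by commit time (the `clean` helper and data shapes are hypothetical):

```python
# Cleaning sketch: each column group independently retains the file
# slices of its last `retain_last` writes, just as cleaning works per
# file group today. Commit times come from the shared timeline.
def clean(file_slices, retain_last):
    """Keep only the newest `retain_last` slices per column group."""
    return {cg: sorted(slices)[-retain_last:]
            for cg, slices in file_slices.items()}

# cg1 received more updates than cg2; retention applies per group.
slices = {"cg1": [10, 20, 30, 40], "cg2": [10, 40]}
assert clean(slices, 2) == {"cg1": [30, 40], "cg2": [10, 40]}
```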
+
+**Queries**: Time-travel / Snapshot queries should work as-is, filtering each
column group like a normal file group today, just reading the
+columns in the projection/filter from the right column group. CDC /
Incremental queries can likewise work by reconciling commit times across column
groups.
+(Column-level change tracking is a separate problem.)
+
+**Compaction**: Works on column groups based on existing strategies. We may
need to add a few different strategies for tables with blob columns.
+
+**Clustering**: This is where we take a hit. Even when clustering across only
a few column groups, we may need to rewrite all columns to preserve the
+file-group -> column group hierarchy. Otherwise, some columns of a record may
be in one file group while others are in another if clustering created new file
groups.
+⸻
+
+### Approach B: File Groups under Column Groups
+We treat column groups as separate virtual tables sharing the same timeline.
But this needs pre-splitting columns into groups at the table level, losing
flexibility to evolve the table.
+Managing the different ways column combinations can split across records may be
overwhelming.
+
+```
+column group 1 (columns c1-c25) ==> [file group : records 1-40], [file group :
records 41-100]
+
+column group 2 (columns c26-c50) ==> [file group : records 1-100]
+
+column group 3 (columns c51-c100) ==> [file group : records 1-25], [file group :
records 26-60], [file group : records 61-100]
+
+```
+_Layout for table with 100 records, 100 columns_
+
+
+**Indexing**: RLI (Record Level Index) needs to track multiple positions
per record key, since it can be in different file groups in each column group.
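This multi-position requirement can be sketched as a mapping from key to one location per column group (the structure and names below are illustrative, not the real RLI):

```python
# Approach B sketch: the record index must hold one location per column
# group, since the same key can land in different file groups in each
# column group. Illustrative data structure only.
record_index = {
    # record key -> {column group -> file group holding that key}
    "key-42": {"cg1": "cg1-fg2", "cg2": "cg2-fg7"},
}

def locations(key):
    """All (column group -> file group) positions for one record key."""
    return record_index[key]

# A single key now resolves to multiple positions, one per column group.
assert locations("key-42") == {"cg1": "cg1-fg2", "cg2": "cg2-fg7"}
```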
+
+**Cleaning**: Achieved independently by each virtual table, enforcing cleaner
retention.
+
+**Queries**: CDC / Incremental / Snapshot / Time-travel queries are all UNIONs
over query results from relevant column groups.
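An incremental query under this model might look like the sketch below: each column group is queried as its own virtual table and the per-group results are unioned, merging changed cells per record key (all names and data shapes are hypothetical):

```python
# Incremental query sketch for Approach B: union the changes emitted by
# each column group since `since_ts`, merged per record key in commit
# order. Illustrative structures only, not a Hudi API.
def incremental(column_groups, since_ts):
    rows = {}
    for cg, commits in column_groups.items():
        for ts in sorted(commits):          # apply commits in time order
            if ts > since_ts:
                for key, cols in commits[ts].items():
                    rows.setdefault(key, {}).update(cols)
    return rows

commits = {
    "cg1": {5: {"k1": {"c1": "a"}}},
    "cg2": {7: {"k1": {"c9": "b"}, "k2": {"c9": "c"}}},
}
# Changes from both column groups are unioned into per-key results.
assert incremental(commits, since_ts=4) == {
    "k1": {"c1": "a", "c9": "b"},
    "k2": {"c9": "c"},
}
```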
+
+**Compaction**: works like Approach A.
+
+**Clustering**: each column group can be clustered independently, without
causing any write amplification.
+
+This "virtual table" abstraction can enable other cool things e.g.
Materializing the same data in multiple ways.
+
+
diff --git a/rfc/template.md b/rfc/template.md
index 384928d7b138..8be30dd2765c 100644
--- a/rfc/template.md
+++ b/rfc/template.md
@@ -27,7 +27,7 @@
## Status
-JIRA: <link to umbrella JIRA>
+Issue: <Link to GH feature issue>
> Please keep the status updated in `rfc/README.md`.
@@ -36,7 +36,7 @@ JIRA: <link to umbrella JIRA>
Describe the problem you are trying to solve and a brief description of why
it’s needed
## Background
-Introduce any much background context which is relevant or necessary to
understand the feature and design choices.
+Introduce any background context that is relevant or necessary to understand
the feature and design choices.
## Implementation
Describe the new thing you want to do in appropriate detail, how it fits into
the project architecture.