This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new ab5b3752ceb3 docs: RFC-100 - Unstructured Data Storage in Hudi
(Initial strawman proposal) (#13924)
ab5b3752ceb3 is described below
commit ab5b3752ceb38e8814c3f4a86c6e3f68bfd3dbc6
Author: vinoth chandar <[email protected]>
AuthorDate: Thu Oct 9 19:27:19 2025 -0700
docs: RFC-100 - Unstructured Data Storage in Hudi (Initial strawman
proposal) (#13924)
---
rfc/README.md | 4 +-
rfc/rfc-100/rfc-100-autonomous-driving.png | Bin 0 -> 64224 bytes
rfc/rfc-100/rfc-100.md | 238 +++++++++++++++++++++++++++++
rfc/rfc-80/rfc-80.md | 193 +++++++++++++++--------
rfc/template.md | 4 +-
5 files changed, 371 insertions(+), 68 deletions(-)
diff --git a/rfc/README.md b/rfc/README.md
index 1867381ce1e1..68e7866a8abd 100644
--- a/rfc/README.md
+++ b/rfc/README.md
@@ -115,7 +115,7 @@ The list of all RFCs can be found here.
| 77 | [Secondary Index](./rfc-77/rfc-77.md)
|
:white_check_mark: `COMPLETED` |
| 78 | [1.0 Migration](./rfc-78/rfc-78.md)
|
:hammer_and_wrench: `IN PROGRESS` |
| 79 | [Robust handling of spark task retries and
failures](./rfc-79/rfc-79.md)
| :x: `ABANDONED` |
-| 80 | [Column Families](./rfc-80/rfc-80.md)
|
:hammer_and_wrench: `IN PROGRESS` |
+| 80 | [Column Groups](./rfc-80/rfc-80.md)
|
:hammer_and_wrench: `IN PROGRESS` |
| 81 | [Log Compaction with Merge Sort](./rfc-81/rfc-81.md)
| :eyes:
`UNDER REVIEW` |
| 82 | [Concurrent schema evolution detection](./rfc-82/rfc-82.md)
|
:white_check_mark: `COMPLETED` |
| 83 | [Incremental Table Service](./rfc-83/rfc-83.md)
|
:white_check_mark: `COMPLETED` |
@@ -136,4 +136,4 @@ The list of all RFCs can be found here.
| 98 | [Spark Datasource V2 Read](./rfc-98/rfc-98.md)
| :eyes:
`UNDER REVIEW` |
| 99 | [Hudi Type System Redesign](./rfc-99/rfc-99.md)
| :eyes:
`UNDER REVIEW` |
| 100 | [Unstructured Data Storage in Hudi](./rfc-100/rfc-100.md)
| :eyes:
`UNDER REVIEW` |
-| 100 | [Updates to the HoodieRecordMerger API](./rfc-101/rfc-101.md)
|
:hammer_and_wrench: `IN PROGRESS` |
\ No newline at end of file
+| 101 | [Updates to the HoodieRecordMerger API](./rfc-101/rfc-101.md)
|
:hammer_and_wrench: `IN PROGRESS` |
\ No newline at end of file
diff --git a/rfc/rfc-100/rfc-100-autonomous-driving.png
b/rfc/rfc-100/rfc-100-autonomous-driving.png
new file mode 100644
index 000000000000..87e5c0317bbb
Binary files /dev/null and b/rfc/rfc-100/rfc-100-autonomous-driving.png differ
diff --git a/rfc/rfc-100/rfc-100.md b/rfc/rfc-100/rfc-100.md
new file mode 100644
index 000000000000..0802f2f67935
--- /dev/null
+++ b/rfc/rfc-100/rfc-100.md
@@ -0,0 +1,238 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-100: Unstructured Data Storage in Hudi
+
+## Proposers
+
+- @rahil-c
+- @the-other-tim-brown
+- @vinothchandar
+
+## Approvers
+ - @balaji-varadarajan-ai
+ - @yihua
+
+## Status
+
+Issue: <Link to GH feature issue>
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+This RFC proposes extending Apache Hudi storage to support unstructured data
storage alongside traditional structured data within a unified table format.
Building on RFC-80's column groups and RFC-99's BLOB type system,
+this feature enables users to create tables with mixed structured and
unstructured columns, with intelligent storage strategies for different data
types. The proposal introduces hybrid storage formats within
+file groups, dynamic inline/out-of-line storage decisions for BLOB data, and
seamless integration with existing Hudi table services and other table
lifecycle operations.
+
+## Background
+
+The modern data landscape is rapidly evolving beyond traditional structured
data. In the era of AI and machine learning, organizations need to manage
diverse data types including images, videos, audio files, documents,
+embeddings, and other unstructured content alongside their traditional tabular
data. Current lakehouse architectures, including Hudi, are primarily optimized
for structured data storage and querying,
+creating significant limitations for AI-driven workloads.
+
+### Lakehouses + Unstructured Data. Why?
+
+**AI/ML Workload Requirements**: Modern AI applications require co-location of
structured metadata with unstructured content. For example, a computer vision
pipeline might need product metadata (structured)
+alongside product images (unstructured) in the same table. Currently, users
must maintain separate storage systems, leading to data consistency issues and
complex pipeline orchestration.
+
+**Unified Data Management**: Organizations benefit from applying the same
governance, versioning, and ACID properties to both structured and unstructured
data. Hudi's ACID semantics, metadata tracking, indexing,
+incremental processing, and table services should extend to unstructured data
to provide a unified data management experience.
+
+**Performance and Scalability**: Unstructured data stored as raw files in object storage suffers from the same bottlenecks -- too many small objects, costly cloud storage GET calls, misaligned partitioning schemes -- that Hudi already solves for structured data storage.
+
+**Leap to AI-enabled open data storage**: By extending Hudi to unstructured data storage, in a way that seamlessly co-exists with current structured data columns, users can seamlessly adapt their workflows to AI. For example, an SEO company now pivoting to optimize content for AI search can store raw Internet documents right inside the same table, by simply adding and backfilling a new BLOB column `doc` that is populated by reading a `url`
+column that already exists.
+
+
+
+### Building on Existing Foundation
+
+This RFC leverages two key foundation pieces:
+
+1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
+
+2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+
+## Requirements
+
+Below are the high-level requirements for this feature.
+
+1. Users must be able to define tables with a mix of structured (current types) and unstructured (BLOB type) columns.
+2. Records are distributed across file groups as in the regular Hudi storage layout. But within each file group, structured and unstructured columns are split into different column groups. This way the table can also scalably grow in terms of number of columns.
+3. Unstructured data can be stored inline (e.g. small images right inside the column group) or out-of-line (e.g. a pointer to a multi-GB video file someplace). This decision should be made dynamically during write/storage time.
+4. All table life-cycle operations and table services work seamlessly across both column types. For example, cleaning the file slices should reclaim both inline and out-of-line blob data. Clustering should be able to re-organize records across file groups or even redistribute columns across column groups within the same file group.
+5. Storage should support different column group distributions, i.e. different membership of columns across column groups, across file groups, to ensure users or table services can flexibly reconfigure all this as the table grows, without re-writing all of the data.
+6. Hudi should expose controls at the writer level, to control whether new columns are written to new column groups or expand an existing column group within a file group.
+
+
+## High-Level Design
+
+The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
+Parquet format, while unstructured columns can use specialized formats such as Lance, optimized Parquet configurations, or HFile for random access.
+
+### 1. Mixed Base File Format Support
+
+**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
+- **Structured Column Groups**: Continue using Parquet with standard
optimizations
+- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
+
+**Format Configuration**: File format is determined at column group creation time (per the current RFC-80).
+But ideally, all these configurations should be automatic, and Hudi should auto-generate column group names and mappings.
+
+
+```sql
+CREATE TABLE multimedia_catalog (
+ id BIGINT,
+ product_name STRING,
+ category STRING,
+ image BINARY,
+ video LARGE_BINARY,
+ embeddings ARRAY<FLOAT>
+) USING HUDI
+TBLPROPERTIES (
+ 'hoodie.table.type' = 'MERGE_ON_READ',
+ 'hoodie.bucket.index.hash.field' = 'id',
+ 'hoodie.columngroup.structured' = 'id,product_name,category;id',
+ 'hoodie.columngroup.images' = 'id,image;id',
+ 'hoodie.columngroup.videos' = 'id,video;id',
+ 'hoodie.columngroup.ml' = 'id,embeddings;id',
+ 'hoodie.columngroup.images.format' = 'parquet',
+ 'hoodie.columngroup.videos.format' = 'lance',
+ 'hoodie.columngroup.ml.format' = 'hfile'
+)
+```
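+The `hoodie.columngroup.*` values above follow the `col1,col2,...,colN;precombineCol` format from RFC-80. As a rough sketch of how such a value could be parsed (the class below is hypothetical, not Hudi's actual parser):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative parser for the "col1,col2,...,colN;precombineCol" value format
// used by the hoodie.columngroup.* table properties above. This helper class
// is hypothetical and not part of Hudi's codebase.
public class ColumnGroupSpec {
    public final List<String> columns;
    public final String preCombineField;

    public ColumnGroupSpec(List<String> columns, String preCombineField) {
        this.columns = columns;
        this.preCombineField = preCombineField;
    }

    public static ColumnGroupSpec parse(String value) {
        String[] parts = value.split(";", 2);
        List<String> cols = Arrays.asList(parts[0].split(","));
        // Per RFC-80, if no preCombine field is given, the primary key is
        // taken by default (assumed here to be the first listed column).
        String preCombine = (parts.length == 2 && !parts[1].trim().isEmpty())
                ? parts[1].trim() : cols.get(0);
        return new ColumnGroupSpec(cols, preCombine);
    }

    public static void main(String[] args) {
        ColumnGroupSpec images = parse("id,image;id");
        System.out.println(images.columns + " / " + images.preCombineField);
    }
}
```

+The fallback to the first listed column mirrors RFC-80's default of the primary key, under the assumption that the key is listed first.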
+
+### 2. Dynamic Inline/Out-of-Line Storage
+
+**Storage Decision Logic**: During write time, Hudi determines the storage strategy based on BLOB size:
+- **Inline Storage**: BLOB data < 1MB stored directly in the column group
file, to avoid excessive cloud storage API calls.
+- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store
locations with pointers in the main file, to avoid write amplification during
updates.
+
+
+**Storage Pointer Schema**:
+```json
+{
+ "type": "record",
+ "name": "BlobPointer",
+ "fields": [
+ {"name": "storage_type", "type": "string"},
+ {"name": "size", "type": "long"},
+ {"name": "checksum", "type": "string"},
+ {"name": "external_path", "type": ["null", "string"]},
+ {"name": "compression", "type": ["null", "string"]}
+ ]
+}
+```
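+A minimal sketch of the write-time decision combined with the pointer schema above (the 1MB threshold is from this section; the class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the write-time inline/out-of-line decision described above.
// The 1 MB threshold matches the text; this class is a hypothetical
// illustration, not a Hudi API.
public class BlobStorageDecider {
    static final long INLINE_THRESHOLD_BYTES = 1_048_576L; // 1 MB

    // Returns a BlobPointer-shaped record (see the Avro schema above).
    public static Map<String, Object> decide(long blobSizeBytes, String externalPathIfLarge) {
        Map<String, Object> pointer = new HashMap<>();
        boolean inline = blobSizeBytes < INLINE_THRESHOLD_BYTES;
        pointer.put("storage_type", inline ? "inline" : "external");
        pointer.put("size", blobSizeBytes);
        // external_path is null for inline blobs, per the union type in the schema
        pointer.put("external_path", inline ? null : externalPathIfLarge);
        return pointer;
    }

    public static void main(String[] args) {
        System.out.println(decide(512 * 1024, null)); // small image: stored inline
        System.out.println(decide(4L * 1024 * 1024 * 1024,
                "s3://bucket/table/.hoodie/blobs/...")); // multi-GB video: stored out-of-line
    }
}
```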
+
+**External Storage Layout**:
+```
+{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_group}/{instant}/{blob_id}
+```
+Alternatively, the user should be able to specify an external storage location per BLOB during writes, as needed.
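+As an illustration, the layout template could be materialized like this (a hypothetical helper doing simple string joining, not part of Hudi):

```java
// Builds an out-of-line blob path following the layout template above.
// Purely illustrative; Hudi does not currently expose such a helper.
public class BlobPathLayout {
    public static String blobPath(String tablePath, String partition, String fileGroupId,
                                  String columnGroup, String instant, String blobId) {
        return String.join("/", tablePath, ".hoodie", "blobs",
                partition, fileGroupId, columnGroup, instant, blobId);
    }

    public static void main(String[] args) {
        // Example values only; instant and IDs are made up for illustration.
        System.out.println(blobPath("s3://bucket/tbl", "2025-10-09", "fg-1",
                "videos", "20251009192719", "blob-42"));
    }
}
```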
+
+### 3. Parquet Optimization for BLOB Storage
+
+For unstructured column groups using Parquet:
+- **Disable Compression**: Avoid double compression of already compressed
media files
+- **Plain Encoding**: Use PLAIN encoding instead of dictionary encoding for
BLOB columns
+- **Large Page Sizes**: Configure larger page sizes to optimize for sequential
BLOB access
+- **Metadata Index**: Maintain BLOB metadata in Hudi metadata table for
efficient retrieval of a single blob value.
+- **Disable stats**: Column statistics are not very useful for BLOB columns
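+These optimizations roughly map to standard parquet-mr writer settings. A sketch collecting them into a configuration map (the keys are parquet-mr's Hadoop configuration names; how Hudi would scope them per column group is an open design question):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of parquet-mr settings matching the bullet points above, collected
// into a plain map. The keys are standard parquet-mr Hadoop configuration
// names; wiring them per column group into Hudi is illustrative only.
public class BlobParquetSettings {
    public static Map<String, String> forBlobColumnGroup() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("parquet.compression", "UNCOMPRESSED");   // avoid double compression of media
        conf.put("parquet.enable.dictionary", "false");    // PLAIN encoding for BLOB columns
        conf.put("parquet.page.size", String.valueOf(8 * 1024 * 1024)); // larger pages for sequential access
        return conf;
    }

    public static void main(String[] args) {
        forBlobColumnGroup().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```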
+
+### 4. Lance Format Integration
+
+**Lance Advantages for Unstructured Data**:
+- Native support for high-dimensional vectors and embeddings
+- Efficient columnar storage for mixed structured/unstructured data
+- Better compression for certain unstructured data types
+
+Supporting Lance, by working across the Hudi and Lance communities, will help users unlock the benefits of Hudi's currently supported file formats (Parquet, ORC) along with the benefits of Lance. Over time, we could also incorporate newer emerging file formats in the space, as well as other well-established unstructured file formats.
+
+### 5. Enhanced Table Services
+
+**Cleaning Service Extensions**:
+- Track external BLOB references in metadata table
+- Implement cascading deletion of external BLOB files during cleaning
+- Add BLOB-specific retention policies, using reference counting to reclaim
out-of-line blobs.
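+The reference-counting idea for out-of-line blobs can be sketched as follows (a toy in-memory model, not the actual cleaner design):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the reference-counting idea for reclaiming out-of-line blobs:
// an external blob file is physically deleted only once no file slice
// references it. All names are hypothetical, not Hudi's cleaner implementation.
public class BlobRefCounter {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // Called when a new file slice records a pointer to blobId.
    public void addReference(String blobId) {
        refCounts.merge(blobId, 1, Integer::sum);
    }

    // Called when cleaning removes a file slice that pointed at blobId.
    // Returns true if the external blob file can now be reclaimed.
    public boolean removeReference(String blobId) {
        int remaining = refCounts.merge(blobId, -1, Integer::sum);
        if (remaining <= 0) {
            refCounts.remove(blobId);
            return true; // last reference gone; safe to delete the blob file
        }
        return false;
    }
}
```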
+
+**Compaction Service Extensions**:
+- Support cross-format compaction (merge Lance and Parquet column groups)
+- Implement BLOB deduplication during major compaction
+- Optimize external BLOB consolidation
+
+**Clustering Service Extensions**:
+- Enable redistribution of BLOB data across file groups
+- Support column group reconfiguration during clustering
+- Implement BLOB-aware data skipping strategies
+
+### 6. Flexible Column Group Management
+
+**Dynamic Column Group Creation**:
+```java
+// Writer API extensions
+HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
+ .withColumnGroupStrategy(ADAPTIVE) // AUTO, FIXED, ADAPTIVE
+ .withNewColumnGroupThreshold(100_000_000L) // 100MB
+ .withBlobStorageThreshold(1_048_576L) // 1MB
+ .build();
+```
+
+**Column Group Reconfiguration**:
+- Support splitting existing column groups when they grow too large
+- Enable merging small column groups during maintenance operations
+- Allow migration of columns between column groups
+
+### 7. Query Engine Integration
+
+**Spark Integration**:
+- Extend DataSource API to handle mixed column group formats
+- Implement vectorized readers for new file formats like Lance
+- Support predicate pushdown across different storage formats
+- Dynamically and lazily fetch BLOB values to avoid shuffling large BLOBs.
+
+**Ray Integration**:
+- Native support for reading unstructured data into Ray datasets using
Ray/Hudi integration.
+- Efficient BLOB streaming for distributed ML workloads
+- Integration with Ray's object store for large BLOB caching
+
+### 8. Metadata Table Extensions
+
+- Track BLOB references for garbage collection
+- Maintain indexes for Parquet-based BLOB storage
+- Maintain size statistics for storage optimization
+- Support BLOB-based query optimization
+
+## Rollout/Adoption Plan
+
+WIP
+
+## Test Plan
+
+WIP
\ No newline at end of file
diff --git a/rfc/rfc-80/rfc-80.md b/rfc/rfc-80/rfc-80.md
index 18abd2a1e9ea..9ec39f36ee7b 100644
--- a/rfc/rfc-80/rfc-80.md
+++ b/rfc/rfc-80/rfc-80.md
@@ -14,7 +14,7 @@
See the License for the specific language governing permissions and
limitations under the License.
-->
-# RFC-80: Support column families for wide tables
+# RFC-80: Support column groups for wide tables
## Proposers
- @xiarixiaoyao
@@ -27,14 +27,14 @@
## Status
-JIRA: https://issues.apache.org/jira/browse/HUDI-7947
+Issue: https://github.com/apache/hudi/issues/13922
## Abstract
In streaming processing, there are often scenarios where the table is widened.
The current mainstream real-time wide table concatenation is completed through
Flink's multi-layer join;
Flink's join will cache a large amount of data in the state backend. As the
data set increases, the pressure on the Flink task state backend will gradually
increase, and may even become unavailable.
In multi-layer join scenarios, this problem is more obvious.
-1.x also supports partial updates being encoded in logfiles. That should be
able to handle this scenario. But even with partial-update, the column families
will reduce write amplification on compaction.
+1.x also supports partial updates being encoded in logfiles. That should be
able to handle this scenario. But even with partial-update, the column groups
will reduce write amplification on compaction.
So, main gains of clustering columns for wide tables are:
Write performance:
@@ -45,14 +45,14 @@ Read performance:
Since the data is already sorted when it is written, the SortMerge method can
be used directly to merge the data; compared with the native bucket data
reading performance is improved a lot, and the memory consumption is reduced
significantly.
Compaction performance:
-The logic of compaction and reading is the same. Compaction costs across
column families is where there real savings are.
+The logic of compaction and reading is the same. Compaction costs across column groups are where the real savings are.
The log merge we can make it pluggable to decide between hash or sort merge -
we need to introduce new log headers or standard mechanism for merging to
determine if base file or log files are sorted.
## Background
-Currently, Hudi organizes data according to fileGroup granularity. The
fileGroup is further divided into column clusters to introduce the columnFamily
concept.
+Currently, Hudi organizes data according to fileGroup granularity. The
fileGroup is further divided into column clusters to introduce the columngroup
concept.
The organizational form of Hudi files is divided according to the following
rules:
-The data in the partition is divided into buckets according to hash (each
bucket maps to a file group); the files in each bucket are divided according to
columnFamily; multiple colFamily files in the bucket form a completed
fileGroup; when there is only one columnFamily, it degenerates into the native
Hudi bucket table.
+The data in the partition is divided into buckets according to hash (each bucket maps to a file group); the files in each bucket are divided according to column group; multiple column group files in the bucket form a complete fileGroup; when there is only one column group, it degenerates into the native Hudi bucket table.

@@ -61,84 +61,84 @@ This feature should be implemented for both Spark and
Flink. So, a table written
### Constraints and Restrictions
1. The overall design relies on the non-blocking concurrent writing feature of
Hudi 1.0.
-2. Lower version Hudi cannot read and write column family tables.
-3. Only MOR bucketed tables support setting column families.
- MOR+Bucket is more suitable because it has higher write performance, but
this does not mean that column family is incompatible with other indexes and
cow tables.
-4. Column families do not support repartitioning and renaming.
-5. Schema evolution does not take effect on the current column family table.
+2. Lower version Hudi cannot read and write column group tables.
+3. Only MOR bucketed tables support setting column groups.
+ MOR+Bucket is more suitable because it has higher write performance, but
this does not mean that column group is incompatible with other indexes and cow
tables.
+4. Column groups do not support repartitioning and renaming.
+5. Schema evolution does not take effect on the current column group table.
Not supporting Schema evolution does not mean users can not add/delete
columns in their table, they just need to do it explicitly.
6. Like native bucket tables, clustering operations are not supported.
### Model change
-After the column family is introduced, the storage structure of the entire
Hudi bucket table changes:
+After the column group is introduced, the storage structure of the entire Hudi
bucket table changes:

-The bucket is divided into multiple columnFamilies by column cluster. When
columnFamily is 1, it will automatically degenerate into the native bucket
table.
+The bucket is divided into multiple columngroups by column cluster. When
columngroup is 1, it will automatically degenerate into the native bucket table.

### Proposed Storage Format Changes
-After splitting the fileGroup by columnFamily, the naming rules for base files
and log files change. We add the cfName suffix at the end of all file names to
facilitate Hudi itself to distinguish column families. If it's not present, we
assume default column family.
+After splitting the fileGroup by columngroup, the naming rules for base files
and log files change. We add the cfName suffix at the end of all file names to
facilitate Hudi itself to distinguish column groups. If it's not present, we
assume default column group.
So, new file name templates will be as follows:
- Base file: [file_id]\_[write_token]\_[begin_time][_cfName].[extension]
- Log file:
[file_id]\_[begin_instant_time][_cfName].log.[version]_[write_token]
-Also, we should evolve the metadata table files schema to additionally track a
column family name.
+Also, we should evolve the metadata table files schema to additionally track a
column group name.
-### Specifying column families when creating a table
-In the table creation statement, column family division is specified in the
options/tblproperties attribute;
-Column family attributes are specified in key-value mode:
-* Key is the column family name. Format: hoodie.colFamily. Column family name
naming rules specified.
-* Value is the specific content of the column family: it consists of all the
columns included in the column family plus the preCombine field. Format: "
col1,col2...colN; precombineCol", the column family list and the preCombine
field are separated by ";"; in the column family list the columns are split by
",".
+### Specifying column groups when creating a table
+In the table creation statement, column group division is specified in the
options/tblproperties attribute;
+Column group attributes are specified in key-value mode:
+* Key is the column group name. Format: hoodie.colgroup. Column group name
naming rules specified.
+* Value is the specific content of the column group: it consists of all the
columns included in the column group plus the preCombine field. Format: "
col1,col2...colN; precombineCol", the column group list and the preCombine
field are separated by ";"; in the column group list the columns are split by
",".
-Constraints: The column family list must contain the primary key, and columns
contained in different column families cannot overlap except for the primary
key. The preCombine field does not need to be specified. If it is not
specified, the primary key will be taken by default.
+Constraints: The column group list must contain the primary key, and columns
contained in different column groups cannot overlap except for the primary key.
The preCombine field does not need to be specified. If it is not specified, the
primary key will be taken by default.
-After the table is created, the column family attributes will be persisted to
hoodie's metadata for subsequent use.
+After the table is created, the column group attributes will be persisted to
hoodie's metadata for subsequent use.
-### Adding and deleting column families in existing table
-Use the SQL alter command to modify the column family attributes and persist
it:
-* Execute ALTER TABLE table_name SET TBLPROPERTIES
('hoodie.columnFamily.k'='a,b,c;a'); to add a new column family.
-* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columnFamily.k');
to delete the column family.
+### Adding and deleting column groups in existing table
+Use the SQL alter command to modify the column group attributes and persist
it:
+* Execute ALTER TABLE table_name SET TBLPROPERTIES
('hoodie.columngroup.k'='a,b,c;a'); to add a new column group.
+* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columngroup.k');
to delete the column group.
Specific steps are as follows:
-1. Execute the ALTER command to modify the column family
-2. Verify whether the column family modified by alter is legal. Column family
modification must meet the following conditions, otherwise the verification
will not pass:
- * The column family name of an existing column family cannot be modified.
- * Columns in other column families cannot be divided into new column
families.
- * When creating a new column family, it must meet the format requirements
from previous chapter.
-3. Save the modified column family to the .hoodie directory.
+1. Execute the ALTER command to modify the column group
+2. Verify whether the column group modified by alter is legal. Column group
modification must meet the following conditions, otherwise the verification
will not pass:
+ * The column group name of an existing column group cannot be modified.
+ * Columns in other column groups cannot be divided into new column groups.
+ * When creating a new column group, it must meet the format requirements
from previous chapter.
+3. Save the modified column group to the .hoodie directory.
### Writing data
-The Hudi kernel divides the input data according to column families; the data
belonging to a certain column family is sorted and directly written to the
corresponding column family log file.
+The Hudi kernel divides the input data according to column groups; the data
belonging to a certain column group is sorted and directly written to the
corresponding column group log file.

Specific steps:
1. The engine divides the written data into buckets according to hash and
shuffles the data (the writing engine completes it by itself and is consistent
with the current writing of the native bucket).
2. The Hudi kernel sorts the data to be written to each bucket by primary key
(both Spark and Flink has its own ExternalSorter, we can refer those
ExternalSorter to finish sort).
-3. After sorting, split the data into column families.
-4. Write the segmented data into the log file of the corresponding column
family.
+3. After sorting, split the data into column groups.
+4. Write the segmented data into the log file of the corresponding column
group.
#### Common API interface
-After the table columns are clustered, the writing process includes the
process of sorting and splitting the data compared to the original bucket
bucketing. A new append interface needs to be introduced to support column
families.
-Introduce ColumnFamilyAppendHandle extend AppendHandle to implement column
family writing.
+After the table columns are clustered, the writing process includes the
process of sorting and splitting the data compared to the original bucket
bucketing. A new append interface needs to be introduced to support column
groups.
+Introduce ColumngroupAppendHandle, extending AppendHandle, to implement column group writing.

### Reading data
-#### ColumnFamilyReader and RowReader
+#### ColumngroupReader and RowReader

Hudi internal row reader reading steps:
-1. Hudi organizes files by column families to be read.
-2. Introduce familyReader to merge and read each column family's own baseFile
and logfile to achieve column family-level data reading.
- * Since log files are written after being sorted by primary key,
familyReader merges its own baseFile and logFile by primary key using sortMerge.
- * familyReader supports upstream incoming column pruning to reduce IO
overhead when reading data.
- * During the merging process, if the user specifies the precombie field
for the column family, the merging strategy will be selected based on the
precombie field. This logic reuses Hudi's own precombine logic and does not
need to be modified.
-3. Row reader merges the data read by multiple familyReaders according to the
primary key.
+1. Hudi organizes files by column groups to be read.
+2. Introduce groupReader to merge and read each column group's own baseFile
and logfile to achieve column group-level data reading.
+ * Since log files are written after being sorted by primary key,
groupReader merges its own baseFile and logFile by primary key using sortMerge.
+ * groupReader supports upstream incoming column pruning to reduce IO
overhead when reading data.
+    * During the merging process, if the user specifies the preCombine field for the column group, the merging strategy will be selected based on the preCombine field. This logic reuses Hudi's own precombine logic and does not need to be modified.
+3. Row reader merges the data read by multiple groupReaders according to the
primary key.
-Since the data read by each familyReader is sorted by the primary key, the row
reader merges the data read by each familyReader in the form of sortMergeJoin
and returns the complete data.
+Since the data read by each groupReader is sorted by the primary key, the row
reader merges the data read by each groupReader in the form of sortMergeJoin
and returns the complete data.
The entire reading process involves a large amount of data merging, but
because the data itself is sorted, the memory consumption of the entire merging
process is very low and the merging is fast. Compared with Hudi's native
merging method, the memory pressure and the merging time are significantly
reduced.
@@ -150,38 +150,103 @@ The entire reading process involves a large amount of
data merging, but because
3) The Hudi kernel completes the data reading of rowReader and returns
complete data. The data format is Avro.
4) The engine gets the Avro format data and needs to convert it into the data
format it needs. For example, spark needs to be converted into unsaferow, hetu
into block, flink into row, and hive into arrayWritable.
-### Column family level compaction
-Extend Hudi's compaction schedule module to merge each column family's own
base file and log file:
+### Column group level compaction
+Extend Hudi's compaction schedule module to merge each column group's own base
file and log file:
-
+
### Full compaction
-Extend Hudi's compaction schedule module to merge and update all column
families in the entire table.
-After merging at the column family level, multiple column families are finally
merged into a complete row and saved.
+Extend Hudi's compaction schedule module to merge and update all column groups
in the entire table.
+After merging at the column group level, multiple column groups are finally
merged into a complete row and saved.

-Full compaction will be optional, only column family level compaction is
required.
-Different users might need different columns, some might need columns come
from multiple column families, some might need columns from only one column
family.
+Full compaction will be optional, only column group level compaction is
required.
+Different users might need different columns; some might need columns that come from multiple column groups, some might need columns from only one column group.
It's better to allow users to choose whether enable full compaction or not.
-Besides, after full compaction, projection on the reader side is less
efficient because the projection could only be based on the full parquet file
with all complete fields instead of based on column family names.
+Besides, after full compaction, projection on the reader side is less
efficient because the projection could only be based on the full parquet file
with all complete fields instead of based on column group names.
-ColumnFamily can be used not only in ML scenarios and AI feature table, but
also can be used to simulate multi-stream join and concatenation of wide
tables. 。
+Column groups can be used not only in ML scenarios and AI feature tables, but also to simulate multi-stream joins and the concatenation of wide tables.
In simulation of multi-stream join scenario, Hudi should produce complete
rows, so full compaction is needed in this case.
## Rollout/Adoption Plan
-This feature itself is a brand-new feature. If you don’t actively turn it on,
you will not be able to reach the logic of the column families.
-Business behavior compatibility: No impact, this function will not be actively
turned on, and column family logic will not be enabled.
-Syntax compatibility: No impact, the column family attributes are in the table
attributes and are executed through SQL standard syntax.
+This is a brand-new feature; unless it is actively enabled, the column group code paths are never reached.
+Business behavior compatibility: No impact, this function will not be actively
turned on, and column group logic will not be enabled.
+Syntax compatibility: No impact, the column group attributes are in the table
attributes and are executed through SQL standard syntax.
Compatibility of data type processing methods: No impact, this design will not
modify the bottom-level data field format.
## Test Plan
List to check that the implementation works as expected:
-1. All existing tests pass successfully on tables without column families
defined.
-2. Hudi SQL supports setting column families when creating MOR bucketed
tables.
-3. Column families support adding and deleting in SQL for MOR bucketed tables.
-4. Hudi supports writing data by column families.
-5. Hudi supports reading data by column families.
-7. Hudi supports compaction by column family.
-8. Hudi supports full compaction, merging the data of all column families to
achieve data widening.
\ No newline at end of file
+1. All existing tests pass successfully on tables without column groups
defined.
+2. Hudi SQL supports setting column groups when creating MOR bucketed tables.
+3. Column groups support adding and deleting in SQL for MOR bucketed tables.
+4. Hudi supports writing data by column groups.
+5. Hudi supports reading data by column groups.
+6. Hudi supports compaction by column group.
+7. Hudi supports full compaction, merging the data of all column groups to
achieve data widening.
+
+
+## Potential Approaches
+
+### Approach A: Column Groups under File Group
+
+We map each record key to a single file group, consistently across column
groups. Each column group has file slices,
+like we do today in file groups. How columns are split into column groups is
fluid and can be different across file groups.
+
+
+
+```
+records 1-25 ==> file group 1 ==> [column group : c1-c10], [column group :
c11-c74], [column group : c75-c100]
+
+records 26-50 ==> file group 2 ==> [column group : c1-c40], [column group :
c41-c100]
+
+records 51-100 ==> file group 3 ==> [column group : c1-c20], [column group :
c21-c60], [column group : c61-c80], [column group : c81-c100]
+
+```
+_Layout for table with 100 records, 100 columns_
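The layout above can be modeled with a small Python sketch (the file group names, ranges, and `locate` helper are illustrative, not Hudi APIs): every key resolves to one file group, and each file group carries its own column split.

```python
# Approach A sketch: record keys map to exactly one file group, and each
# file group owns its own (possibly different) split of columns into
# column groups. Mirrors the layout above; illustrative only.
FILE_GROUPS = {
    # file group -> (record key range, column group splits)
    "fg1": (range(1, 26),   {"cg1": range(1, 11), "cg2": range(11, 75),
                             "cg3": range(75, 101)}),
    "fg2": (range(26, 51),  {"cg1": range(1, 41), "cg2": range(41, 101)}),
    "fg3": (range(51, 101), {"cg1": range(1, 21), "cg2": range(21, 61),
                             "cg3": range(61, 81), "cg4": range(81, 101)}),
}

def locate(record, column):
    """Find the (file group, column group) holding one record's column."""
    for fg, (records, splits) in FILE_GROUPS.items():
        if record in records:
            for cg, cols in splits.items():
                if column in cols:
                    return fg, cg
    raise KeyError((record, column))

# Record 30, column c50 lives in file group 2's second column group...
assert locate(30, 50) == ("fg2", "cg2")
# ...while the same column for record 10 sits under a different split.
assert locate(10, 50) == ("fg1", "cg2")
```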
+
+**Indexing**: works as is, since the mapping from key to file-group is intact.
+
+**Cleaning**: Column groups can be updated at different rates. i.e. one column
group can receive more updates than others.
+To retain versions belonging to the last `x` writes, each column group can
simply enforce retention on its own file slices,
like today. This should work since it's based on the same shared timeline anyway.
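A minimal sketch of this per-group retention, assuming file slices identified by commit time (the `clean` helper and data shapes are hypothetical):

```python
# Cleaning sketch: each column group independently retains the file
# slices of its last `retain_last` writes, just as cleaning works per
# file group today. Commit times come from the shared timeline.
def clean(file_slices, retain_last):
    """Keep only the newest `retain_last` slices per column group."""
    return {cg: sorted(slices)[-retain_last:]
            for cg, slices in file_slices.items()}

# cg1 received more updates than cg2; retention applies per group.
slices = {"cg1": [10, 20, 30, 40], "cg2": [10, 40]}
assert clean(slices, 2) == {"cg1": [30, 40], "cg2": [10, 40]}
```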
+
+**Queries**: Time-travel / Snapshot queries should work as-is, filtering each
column group like a normal file group today, just reading the
+columns in the projection/filter from the right column group. CDC /
Incremental queries can likewise work by reconciling commit times across column
groups.
+(Column-level change tracking is a separate problem.)
+
+**Compaction**: Works on column groups based on existing strategies. We may
need to add a few different strategies for tables with blob columns.
+
+**Clustering**: This is where we take a hit. Even when clustering across only
a few column groups, we may need to rewrite all columns to preserve the
+file-group -> column group hierarchy. Otherwise, some columns of a record may
be in one file group while others are in another if clustering created new file
groups.
+⸻
+
+### Approach B: File Groups under Column Groups
+We treat column groups as separate virtual tables sharing the same timeline.
But this needs pre-splitting columns into groups at the table level, losing
flexibility to evolve the table.
+Managing the different ways column combinations can split across records may be
overwhelming.
+
+```
+column group 1 (columns c1-c25) ==> [file group : records 1-40], [file group :
records 41-100]
+
+column group 2 (columns c26-c50) ==> [file group : records 1-100]
+
+column group 3 (columns c51-c100) ==> [file group : records 1-25], [file group :
records 26-60], [file group : records 61-100]
+
+```
+_Layout for table with 100 records, 100 columns_
+
+
+**Indexing**: RLI (Record Level Index) needs to track multiple positions
per record key, since it can be in different file groups in each column group.
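This multi-position requirement can be sketched as a mapping from key to one location per column group (the structure and names below are illustrative, not the real RLI):

```python
# Approach B sketch: the record index must hold one location per column
# group, since the same key can land in different file groups in each
# column group. Illustrative data structure only.
record_index = {
    # record key -> {column group -> file group holding that key}
    "key-42": {"cg1": "cg1-fg2", "cg2": "cg2-fg7"},
}

def locations(key):
    """All (column group -> file group) positions for one record key."""
    return record_index[key]

# A single key now resolves to multiple positions, one per column group.
assert locations("key-42") == {"cg1": "cg1-fg2", "cg2": "cg2-fg7"}
```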
+
+**Cleaning**: Achieved independently by each virtual table, enforcing cleaner
retention.
+
+**Queries**: CDC / Incremental / Snapshot / Time-travel queries are all UNIONs
over query results from relevant column groups.
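An incremental query under this model might look like the sketch below: each column group is queried as its own virtual table and the per-group results are unioned, merging changed cells per record key (all names and data shapes are hypothetical):

```python
# Incremental query sketch for Approach B: union the changes emitted by
# each column group since `since_ts`, merged per record key in commit
# order. Illustrative structures only, not a Hudi API.
def incremental(column_groups, since_ts):
    rows = {}
    for cg, commits in column_groups.items():
        for ts in sorted(commits):          # apply commits in time order
            if ts > since_ts:
                for key, cols in commits[ts].items():
                    rows.setdefault(key, {}).update(cols)
    return rows

commits = {
    "cg1": {5: {"k1": {"c1": "a"}}},
    "cg2": {7: {"k1": {"c9": "b"}, "k2": {"c9": "c"}}},
}
# Changes from both column groups are unioned into per-key results.
assert incremental(commits, since_ts=4) == {
    "k1": {"c1": "a", "c9": "b"},
    "k2": {"c9": "c"},
}
```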
+
+**Compaction**: works like Approach A.
+
+**Clustering**: each column group can be clustered independently, without
causing any write amplification.
+
+This "virtual table" abstraction can enable other cool things e.g.
Materializing the same data in multiple ways.
+
+
diff --git a/rfc/template.md b/rfc/template.md
index 384928d7b138..8be30dd2765c 100644
--- a/rfc/template.md
+++ b/rfc/template.md
@@ -27,7 +27,7 @@
## Status
-JIRA: <link to umbrella JIRA>
+Issue: <Link to GH feature issue>
> Please keep the status updated in `rfc/README.md`.
@@ -36,7 +36,7 @@ JIRA: <link to umbrella JIRA>
Describe the problem you are trying to solve and a brief description of why
it’s needed
## Background
-Introduce any much background context which is relevant or necessary to
understand the feature and design choices.
+Introduce any background context that is relevant or necessary to understand
the feature and design choices.
## Implementation
Describe the new thing you want to do in appropriate detail, how it fits into
the project architecture.