the-other-tim-brown commented on code in PR #13924:
URL: https://github.com/apache/hudi/pull/13924#discussion_r2377618784
##########
rfc/rfc-80/rfc-80.md:
##########
@@ -150,38 +150,103 @@ The entire reading process involves a large amount of data merging, but because
 3) The Hudi kernel completes the data reading of rowReader and returns complete data. The data format is Avro.
 4) The engine gets the Avro format data and needs to convert it into the data format it needs. For example, spark needs to be converted into unsaferow, hetu into block, flink into row, and hive into arrayWritable.
-### Column family level compaction
-Extend Hudi's compaction schedule module to merge each column family's own base file and log file:
+### Column group level compaction
+Extend Hudi's compaction schedule module to merge each column group's own base file and log file:
-
+
 ### Full compaction
-Extend Hudi's compaction schedule module to merge and update all column families in the entire table.
-After merging at the column family level, multiple column families are finally merged into a complete row and saved.
+Extend Hudi's compaction schedule module to merge and update all column groups in the entire table.
+After merging at the column group level, multiple column groups are finally merged into a complete row and saved.
 
-Full compaction will be optional, only column family level compaction is required.
-Different users might need different columns, some might need columns come from multiple column families, some might need columns from only one column family.
+Full compaction will be optional; only column group level compaction is required.
+Different users might need different columns: some might need columns that come from multiple column groups, some might need columns from only one column group.
 It's better to allow users to choose whether enable full compaction or not.
-Besides, after full compaction, projection on the reader side is less efficient because the projection could only be based on the full parquet file with all complete fields instead of based on column family names.
+Besides, after full compaction, projection on the reader side is less efficient, because the projection can only be based on the full parquet file with all fields instead of on column group names.
 
-ColumnFamily can be used not only in ML scenarios and AI feature table, but also can be used to simulate multi-stream join and concatenation of wide tables. 。
+Column groups can be used not only in ML scenarios and AI feature tables, but also to simulate multi-stream joins and the concatenation of wide tables.
 In simulation of multi-stream join scenario, Hudi should produce complete rows, so full compaction is needed in this case.
 
 ## Rollout/Adoption Plan
-This feature itself is a brand-new feature. If you don’t actively turn it on, you will not be able to reach the logic of the column families.
-Business behavior compatibility: No impact, this function will not be actively turned on, and column family logic will not be enabled.
-Syntax compatibility: No impact, the column family attributes are in the table attributes and are executed through SQL standard syntax.
+This is a brand-new feature. If you don’t actively turn it on, you will never reach the column group logic.
+Business behavior compatibility: No impact; unless this feature is actively turned on, column group logic will not be enabled.
+Syntax compatibility: No impact; the column group attributes live in the table attributes and are set through SQL standard syntax.
 Compatibility of data type processing methods: No impact, this design will not modify the bottom-level data field format.
 
 ## Test Plan
 List to check that the implementation works as expected:
-1. All existing tests pass successfully on tables without column families defined.
-2. Hudi SQL supports setting column families when creating MOR bucketed tables.
-3. Column families support adding and deleting in SQL for MOR bucketed tables.
-4. Hudi supports writing data by column families.
-5. Hudi supports reading data by column families.
-7. Hudi supports compaction by column family.
-8. Hudi supports full compaction, merging the data of all column families to achieve data widening.
\ No newline at end of file
+1. All existing tests pass successfully on tables without column groups defined.
+2. Hudi SQL supports setting column groups when creating MOR bucketed tables.
+3. Column groups support adding and deleting in SQL for MOR bucketed tables.
+4. Hudi supports writing data by column groups.
+5. Hudi supports reading data by column groups.
+6. Hudi supports compaction by column group.
+7. Hudi supports full compaction, merging the data of all column groups to achieve data widening.
+
+
+## Potential Approaches
+
+### Approach A: Column Groups under File Group
+
+We map each record key to a single file group, consistently across column groups. Each column group has file slices,
+like we do today in file groups. How columns are split into column groups is fluid and can differ across file groups.
+
+```
+records 1-25   ==> file group 1 ==> [column group : c1-c10], [column group : c11-c74], [column group : c75-c100]
+
+records 26-50  ==> file group 2 ==> [column group : c1-c40], [column group : c41-c100]
+
+records 51-100 ==> file group 3 ==> [column group : c1-c20], [column group : c21-c60], [column group : c61-c80], [column group : c81-c100]
+```
+_Layout for a table with 100 records, 100 columns_
+
+**Indexing**: works as is, since the mapping from key to file group is intact.
+
+**Cleaning**: Column groups can be updated at different rates, i.e. one column group can receive more updates than others.
+To retain versions belonging to the last `x` writes, each column group can simply enforce retention on its own file slices,
+like today. This should work, since it is based off the same timeline anyway.
+
+**Queries**: Time-travel / Snapshot queries should work as-is, filtering each column group like a normal file group today, just reading the
+columns in the projection/filter from the right column group. CDC / Incremental queries can work again by reconciling commit time across column groups.
+(Column-level change tracking is a separate problem.)
+
+**Compaction**: Works on column groups based on existing strategies. We may need to add a few different strategies for tables with blob columns.
+
+**Clustering**: This is where we take a hit. Even when clustering across only a few column groups, we may need to rewrite all columns to preserve the
+file-group -> column group hierarchy. Otherwise, some columns of a record may end up in one file group while others land in another, if clustering created new file groups.
+
+### Approach B: File Groups under Column Groups
+We treat column groups as separate virtual tables sharing the same timeline. But this needs pre-splitting columns into groups at the table level, losing the flexibility to evolve the table.
+Managing the different ways combinations of columns are split across records may be overwhelming.
+
+```
+columns 1-25 ==> column group 1 ==> [file group : c1-c10], [file group : c11-c74], [file group : c75-c100]
+
+columns 26-50 ==> column group 2 ==> [file group : c1-c40], [file group : c41-c100]
+
+columns 51-100 ==> column group 3 ==> [file group : c1-c20], [file group : c21-c60], [file group : c61-c80], [file group : c81-c100]
+```
+_Layout for a table with 100 records, 100 columns_
+
+**Indexing**: the RLI (record-level index) needs to track multiple positions per record key, since the key can be in a different file group in each column group.
+
+**Cleaning**: Achieved independently by each virtual table, enforcing cleaner retention.
+
+**Queries**: CDC / Incremental / Snapshot / Time-travel queries are all UNIONs over query results from the relevant column groups.
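The row-materialization difference between the two proposed layouts can be sketched with toy dictionaries. Everything below is illustrative (the structures and names such as `read_row_a` are invented for this sketch, not Hudi APIs): under Approach A all pieces of a row sit under one file group, while under Approach B the record-level index must hold one location per (key, column group) and a read is a key-aligned join across column groups.

```python
# Approach A: one file group per record key; that file group's columns are
# split into column groups, so every piece of a row lives under the same
# file group and can be concatenated locally.
file_groups_a = {
    "fg1": {  # records mapped to file group 1
        "cg_c1_c10":  {"r1": {"c1": 1}},
        "cg_c11_c74": {"r1": {"c11": 11}},
    },
}

def read_row_a(fg_id, key):
    """Concatenate the key's columns from every column group of one file group."""
    row = {}
    for cg in file_groups_a[fg_id].values():
        row.update(cg.get(key, {}))
    return row

# Approach B: column groups are virtual tables; the same key may land in a
# different file group within each column group, so the record-level index
# tracks multiple positions per key and the read joins across column groups.
column_groups_b = {
    "cg1": {"fgX": {"r1": {"c1": 1}}},
    "cg2": {"fgY": {"r1": {"c11": 11}}},
}
rli_b = {"r1": {"cg1": "fgX", "cg2": "fgY"}}  # one location per column group

def read_row_b(key):
    """Assemble a full row by probing each column group's file group for the key."""
    row = {}
    for cg_id, fg_id in rli_b[key].items():
        row.update(column_groups_b[cg_id][fg_id].get(key, {}))
    return row
```

Both reads produce the same full row here; the difference is that Approach B needs the extra index lookups (and, at scale, potentially a shuffle) to align the pieces by key.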
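The per-column-group cleaning described under Approach A can be sketched the same way: each column group prunes its own file slices against a retention window computed from the shared timeline, so independently updated column groups stay consistent. This is a toy model under assumed names (`clean`, commit strings `t1..t4`), not Hudi's cleaner:

```python
def clean(slices, retained_commits):
    """Keep only file slices whose commit time falls in the retained window."""
    keep = set(retained_commits)
    return [s for s in slices if s["commit"] in keep]

# One shared timeline for the whole file group; retain the last x = 2 writes.
timeline = ["t1", "t2", "t3", "t4"]
retained = timeline[-2:]

# Column groups updated at different rates still clean against the same window.
cg_hot  = [{"commit": t} for t in ["t1", "t2", "t3", "t4"]]  # updated often
cg_cold = [{"commit": t} for t in ["t1", "t3"]]              # updated rarely

hot_kept  = clean(cg_hot, retained)   # slices from t3 and t4 survive
cold_kept = clean(cg_cold, retained)  # only the t3 slice survives
```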
Review Comment:
   Since the record keys span multiple file groups, would we need to shuffle the data on read now to materialize the full rows?
