the-other-tim-brown commented on code in PR #13924:
URL: https://github.com/apache/hudi/pull/13924#discussion_r2414812586


##########
rfc/rfc-80/rfc-80.md:
##########
@@ -61,84 +61,84 @@ This feature should be implemented for both Spark and 
Flink. So, a table written
 
 ### Constraints and Restrictions
 1. The overall design relies on the non-blocking concurrent writing feature of 
Hudi 1.0.  
-2. Lower version Hudi cannot read and write column family tables.  
-3. Only MOR bucketed tables support setting column families.  
-   MOR+Bucket is more suitable because it has higher write performance, but 
this does not mean that column family is incompatible with other indexes and 
cow tables.  
-4. Column families do not support repartitioning and renaming.  
-5. Schema evolution does not take effect on the current column family table.  
+2. Lower versions of Hudi cannot read or write column group tables.  
+3. Only MOR bucketed tables support setting column groups.  
+   MOR+Bucket is more suitable because it has higher write performance, but this does not mean that column groups are incompatible with other index types or COW tables.  
+4. Column groups do not support repartitioning and renaming.  
+5. Schema evolution does not take effect on a column group table.  
   Not supporting schema evolution does not mean users cannot add or delete columns in their table; they just need to do it explicitly.
 6. Like native bucket tables, clustering operations are not supported.
 
 ### Model change
-After the column family is introduced, the storage structure of the entire 
Hudi bucket table changes:
+After the column group is introduced, the storage structure of the entire Hudi 
bucket table changes:
 
 ![bucket](bucket.png)
 
-The bucket is divided into multiple columnFamilies by column cluster. When 
columnFamily is 1, it will automatically degenerate into the native bucket 
table.
+The bucket is divided into multiple column groups by column clustering. When there is only one column group, the table automatically degenerates into the native bucket table.
 
 ![file-group](file-group.png)
 
 ### Proposed Storage Format Changes
-After splitting the fileGroup by columnFamily, the naming rules for base files 
and log files change. We add the cfName suffix at the end of all file names to 
facilitate Hudi itself to distinguish column families. If it's not present, we 
assume default column family.
+After splitting the fileGroup by column group, the naming rules for base files and log files change. We add a cfName suffix at the end of all file names so that Hudi itself can distinguish column groups. If the suffix is not present, we assume the default column group.
 So, new file name templates will be as follows:  
 - Base file: [file_id]\_[write_token]\_[begin_time][_cfName].[extension]  
 - Log file: 
[file_id]\_[begin_instant_time][_cfName].log.[version]_[write_token]  
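As an illustrative sketch only (not the PR's actual code), the two templates above could be rendered as plain string formatting, assuming the bracketed fields are simple strings and that an absent cfName means the default column group:

```python
# Hypothetical helpers rendering the proposed file name templates.
# cf_name is optional; omitting it reproduces today's native bucket names.
def base_file_name(file_id, write_token, begin_time, extension, cf_name=None):
    cf = f"_{cf_name}" if cf_name else ""
    return f"{file_id}_{write_token}_{begin_time}{cf}.{extension}"

def log_file_name(file_id, begin_instant_time, version, write_token, cf_name=None):
    cf = f"_{cf_name}" if cf_name else ""
    return f"{file_id}_{begin_instant_time}{cf}.log.{version}_{write_token}"
```

For example, base_file_name("f1", "1-0-1", "20240101", "parquet", "cg1") yields f1_1-0-1_20240101_cg1.parquet, while omitting cf_name yields f1_1-0-1_20240101.parquet.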
 
-Also, we should evolve the metadata table files schema to additionally track a 
column family name.  
+Also, we should evolve the metadata table files schema to additionally track a 
column group name.  
 
-### Specifying column families when creating a table
-In the table creation statement, column family division is specified in the 
options/tblproperties attribute;
-Column family attributes are specified in key-value mode:  
-* Key is the column family name. Format: hoodie.colFamily. Column family name  
  naming rules specified.  
-* Value is the specific content of the column family: it consists of all the 
columns included in the column family plus the preCombine field. Format: " 
col1,col2...colN; precombineCol", the column family list and the preCombine 
field are separated by ";"; in the column family list the columns are split by 
",".  
+### Specifying column groups when creating a table
+In the table creation statement, column group division is specified in the 
options/tblproperties attribute;
+Column group attributes are specified in key-value mode:  
+* Key is the column group name. Format: hoodie.columngroup.<name>, where the name follows the specified column group naming rules.  
+* Value is the specific content of the column group: all the columns included in the column group plus the preCombine field. Format: "col1,col2,...,colN;precombineCol". The column list and the preCombine field are separated by ";", and columns within the list are separated by ",".  
 
-Constraints: The column family list must contain the primary key, and columns 
contained in different column families cannot overlap except for the primary 
key. The preCombine field does not need to be specified. If it is not 
specified, the primary key will be taken by default.
+Constraints: The column group list must contain the primary key, and columns contained in different column groups cannot overlap except for the primary key. The preCombine field does not need to be specified; if it is not specified, the primary key is used by default.
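The value format and constraints above can be sketched as a small parser and validator; names and helpers here are hypothetical illustrations, not the PR's actual implementation:

```python
# Hypothetical sketch: parse "col1,col2,...;precombineCol" and enforce the
# stated constraints (group contains the PK; no overlap across groups except PK).
def parse_column_group(value, primary_key):
    parts = value.split(";")
    columns = [c.strip() for c in parts[0].split(",") if c.strip()]
    # If no preCombine field is given, the primary key is used by default.
    precombine = parts[1].strip() if len(parts) > 1 and parts[1].strip() else primary_key
    if primary_key not in columns:
        raise ValueError("column group must contain the primary key")
    return columns, precombine

def check_no_overlap(groups, primary_key):
    # Columns in different column groups may not overlap, except the primary key.
    seen = set()
    for cols, _ in groups:
        for c in cols:
            if c != primary_key and c in seen:
                raise ValueError(f"column {c} appears in multiple column groups")
            seen.add(c)
```

Here parse_column_group("id,a,b;ts", "id") yields (["id", "a", "b"], "ts"), and a value with no ";" part falls back to the primary key as the preCombine field.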
 
-After the table is created, the column family attributes will be persisted to 
hoodie's metadata for subsequent use.
+After the table is created, the column group attributes will be persisted to 
hoodie's metadata for subsequent use.
 
-### Adding and deleting column families in existing table
-Use the SQL alter command to modify the column family attributes and persist 
it:    
-* Execute ALTER TABLE table_name SET TBLPROPERTIES 
('hoodie.columnFamily.k'='a,b,c;a'); to add a new column family.  
-* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columnFamily.k'); 
to delete the column family.
+### Adding and deleting column groups in existing table
+Use the SQL ALTER command to modify the column group attributes and persist them:  
+* Execute ALTER TABLE table_name SET TBLPROPERTIES 
('hoodie.columngroup.k'='a,b,c;a'); to add a new column group.  
+* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columngroup.k'); 
to delete the column group.
 
 Specific steps are as follows:
-1. Execute the ALTER command to modify the column family
-2. Verify whether the column family modified by alter is legal. Column family 
modification must meet the following conditions, otherwise the verification 
will not pass:
-    * The column family name of an existing column family cannot be modified.  
-    * Columns in other column families cannot be divided into new column 
families.  
-    * When creating a new column family, it must meet the format requirements 
from previous chapter.  
-3. Save the modified column family to the .hoodie directory.
+1. Execute the ALTER command to modify the column group.
+2. Verify whether the column group modified by ALTER is legal. The modification must meet the following conditions, otherwise verification fails:
+    * The name of an existing column group cannot be modified.  
+    * Columns that already belong to other column groups cannot be moved into new column groups.  
+    * A new column group must meet the format requirements from the previous chapter.  
+3. Save the modified column group to the .hoodie directory.
 
 ### Writing data
-The Hudi kernel divides the input data according to column families; the data 
belonging to a certain column family is sorted and directly written to the 
corresponding column family log file.
+The Hudi kernel divides the input data according to column groups; the data 
belonging to a certain column group is sorted and directly written to the 
corresponding column group log file.
 
 ![process-write](process-write.png)
 
 Specific steps:  
 1. The engine divides the written data into buckets according to hash and shuffles the data (the write engine completes this itself, consistent with the current native bucket write path).  
 2. The Hudi kernel sorts the data to be written to each bucket by primary key (both Spark and Flink have their own ExternalSorter implementations, which we can reuse to perform the sort).  
-3. After sorting, split the data into column families.  
-4. Write the segmented data into the log file of the corresponding column 
family.  
+3. After sorting, split the data into column groups.  
+4. Write the segmented data into the log file of the corresponding column 
group.  
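The write steps above (sort by primary key, then split each record across column groups) can be sketched as follows; the record and group shapes are assumed for illustration and are not taken from the PR:

```python
# Illustrative sketch of the bucket write path: sort records by primary key,
# then project each record onto its column groups (hypothetical shapes).
# records: list of dicts; groups: {group_name: [columns]}, each list incl. the PK.
def split_into_column_groups(records, groups, primary_key):
    per_group_logs = {name: [] for name in groups}
    for rec in sorted(records, key=lambda r: r[primary_key]):  # step 2: sort by PK
        for name, cols in groups.items():                      # steps 3-4: split, append
            per_group_logs[name].append({c: rec[c] for c in cols})
    return per_group_logs
```

A single sort pass suffices because every column group's log then receives records in the same primary-key order, which the sort-merge read path relies on.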
 
 #### Common API interface
-After the table columns are clustered, the writing process includes the 
process of sorting and splitting the data compared to the original bucket 
bucketing. A new append interface needs to be introduced to support column 
families.  
-Introduce ColumnFamilyAppendHandle extend AppendHandle to implement column 
family writing.
+After the table columns are clustered, the write path additionally sorts and splits the data compared with native bucket writing, so a new append interface is needed to support column groups.  
+Introduce ColumngroupAppendHandle, which extends AppendHandle, to implement column group writing.
 
 ![append-handle](append-handle.png)
 
 ### Reading data
-#### ColumnFamilyReader and RowReader
+#### ColumngroupReader and RowReader
 ![row-reader](row-reader.png)
 
 Hudi internal row reader reading steps:  
-1. Hudi organizes files by column families to be read.
-2. Introduce familyReader to merge and read each column family's own baseFile 
and logfile to achieve column family-level data reading.  
-    * Since log files are written after being sorted by primary key, 
familyReader merges its own baseFile and logFile by primary key using sortMerge.
-    * familyReader supports upstream incoming column pruning to reduce IO 
overhead when reading data.  
-    * During the merging process, if the user specifies the precombie field 
for the column family, the merging strategy will be selected based on the 
precombie field. This logic reuses Hudi's own precombine logic and does not 
need to be modified.    
-3. Row reader merges the data read by multiple familyReaders according to the 
primary key.  
+1. Hudi organizes the files to be read by column group.
+2. Introduce groupReader to merge and read each column group's own baseFile and log file, achieving column-group-level data reading.  

Review Comment:
   Assuming that the files also have Hudi metadata fields, then each file will 
overlap on these fields. How do we determine the final value of these fields? 
What about the file name meta field?



##########
rfc/rfc-80/rfc-80.md:
##########
@@ -61,84 +61,84 @@ This feature should be implemented for both Spark and 
Flink. So, a table written
 
 
 ### Writing data
-The Hudi kernel divides the input data according to column families; the data 
belonging to a certain column family is sorted and directly written to the 
corresponding column family log file.
+The Hudi kernel divides the input data according to column groups; the data 
belonging to a certain column group is sorted and directly written to the 
corresponding column group log file.
 
 ![process-write](process-write.png)
 
 Specific steps:  
 1. The engine divides the written data into buckets according to hash and shuffles the data (the write engine completes this itself, consistent with the current native bucket write path).  
 2. The Hudi kernel sorts the data to be written to each bucket by primary key (both Spark and Flink have their own ExternalSorter implementations, which we can reuse to perform the sort).  
-3. After sorting, split the data into column families.  
-4. Write the segmented data into the log file of the corresponding column 
family.  
+3. After sorting, split the data into column groups.  
+4. Write the segmented data into the log file of the corresponding column 
group.  

Review Comment:
   If there is an ordering field, it seems like we will need to always read 
that value. This implies we will potentially read from multiple segments to 
determine how to merge the records. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
