danny0405 commented on code in PR #13924:
URL: https://github.com/apache/hudi/pull/13924#discussion_r2415366636
##########
rfc/rfc-80/rfc-80.md:
##########
@@ -61,84 +61,84 @@ This feature should be implemented for both Spark and
Flink. So, a table written
### Constraints and Restrictions
1. The overall design relies on the non-blocking concurrent writing feature of
Hudi 1.0.
-2. Lower version Hudi cannot read and write column family tables.
-3. Only MOR bucketed tables support setting column families.
- MOR+Bucket is more suitable because it has higher write performance, but
this does not mean that column family is incompatible with other indexes and
cow tables.
-4. Column families do not support repartitioning and renaming.
-5. Schema evolution does not take effect on the current column family table.
+2. Lower version Hudi cannot read and write column group tables.
+3. Only MOR bucketed tables support setting column groups.
+ MOR+Bucket is more suitable because it has higher write performance, but
this does not mean that column group is incompatible with other indexes and cow
tables.
+4. Column groups do not support repartitioning and renaming.
+5. Schema evolution does not take effect on the current column group table.
Not supporting Schema evolution does not mean users can not add/delete
columns in their table, they just need to do it explicitly.
6. Like native bucket tables, clustering operations are not supported.
### Model change
-After the column family is introduced, the storage structure of the entire
Hudi bucket table changes:
+After the column group is introduced, the storage structure of the entire Hudi
bucket table changes:

-The bucket is divided into multiple columnFamilies by column cluster. When
columnFamily is 1, it will automatically degenerate into the native bucket
table.
+The bucket is divided into multiple columngroups by column cluster. When
columngroup is 1, it will automatically degenerate into the native bucket table.

### Proposed Storage Format Changes
-After splitting the fileGroup by columnFamily, the naming rules for base files
and log files change. We add the cfName suffix at the end of all file names to
facilitate Hudi itself to distinguish column families. If it's not present, we
assume default column family.
+After splitting the fileGroup by columngroup, the naming rules for base files
and log files change. We add the cfName suffix at the end of all file names to
facilitate Hudi itself to distinguish column groups. If it's not present, we
assume default column group.
So, new file name templates will be as follows:
- Base file: [file_id]\_[write_token]\_[begin_time][_cfName].[extension]
- Log file:
[file_id]\_[begin_instant_time][_cfName].log.[version]_[write_token]
-Also, we should evolve the metadata table files schema to additionally track a
column family name.
+Also, we should evolve the metadata table files schema to additionally track a
column group name.
-### Specifying column families when creating a table
-In the table creation statement, column family division is specified in the
options/tblproperties attribute;
-Column family attributes are specified in key-value mode:
-* Key is the column family name. Format: hoodie.colFamily. Column family name
naming rules specified.
-* Value is the specific content of the column family: it consists of all the
columns included in the column family plus the preCombine field. Format: "
col1,col2...colN; precombineCol", the column family list and the preCombine
field are separated by ";"; in the column family list the columns are split by
",".
+### Specifying column groups when creating a table
+In the table creation statement, column group division is specified in the
options/tblproperties attribute;
+Column group attributes are specified in key-value mode:
+* Key is the column group name. Format: hoodie.colgroup. Column group name
naming rules specified.
+* Value is the specific content of the column group: it consists of all the
columns included in the column group plus the preCombine field. Format: "
col1,col2...colN; precombineCol", the column group list and the preCombine
field are separated by ";"; in the column group list the columns are split by
",".
-Constraints: The column family list must contain the primary key, and columns
contained in different column families cannot overlap except for the primary
key. The preCombine field does not need to be specified. If it is not
specified, the primary key will be taken by default.
+Constraints: The column group list must contain the primary key, and columns
contained in different column groups cannot overlap except for the primary key.
The preCombine field does not need to be specified. If it is not specified, the
primary key will be taken by default.
-After the table is created, the column family attributes will be persisted to
hoodie's metadata for subsequent use.
+After the table is created, the column group attributes will be persisted to
hoodie's metadata for subsequent use.
-### Adding and deleting column families in existing table
-Use the SQL alter command to modify the column family attributes and persist
it:
-* Execute ALTER TABLE table_name SET TBLPROPERTIES
('hoodie.columnFamily.k'='a,b,c;a'); to add a new column family.
-* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columnFamily.k');
to delete the column family.
+### Adding and deleting column groups in existing table
+Use the SQL alter command to modify the column group attributes and persist
it:
+* Execute ALTER TABLE table_name SET TBLPROPERTIES
('hoodie.columngroup.k'='a,b,c;a'); to add a new column group.
+* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columngroup.k');
to delete the column group.
Specific steps are as follows:
-1. Execute the ALTER command to modify the column family
-2. Verify whether the column family modified by alter is legal. Column family
modification must meet the following conditions, otherwise the verification
will not pass:
- * The column family name of an existing column family cannot be modified.
- * Columns in other column families cannot be divided into new column
families.
- * When creating a new column family, it must meet the format requirements
from previous chapter.
-3. Save the modified column family to the .hoodie directory.
+1. Execute the ALTER command to modify the column group
+2. Verify whether the column group modified by alter is legal. Column group
modification must meet the following conditions, otherwise the verification
will not pass:
+ * The column group name of an existing column group cannot be modified.
+ * Columns in other column groups cannot be divided into new column groups.
+ * When creating a new column group, it must meet the format requirements
from previous chapter.
+3. Save the modified column group to the .hoodie directory.
### Writing data
-The Hudi kernel divides the input data according to column families; the data
belonging to a certain column family is sorted and directly written to the
corresponding column family log file.
+The Hudi kernel divides the input data according to column groups; the data
belonging to a certain column group is sorted and directly written to the
corresponding column group log file.

Specific steps:
1. The engine divides the written data into buckets according to hash and
shuffles the data (the writing engine completes it by itself and is consistent
with the current writing of the native bucket).
2. The Hudi kernel sorts the data to be written to each bucket by primary key
(both Spark and Flink has its own ExternalSorter, we can refer those
ExternalSorter to finish sort).
-3. After sorting, split the data into column families.
-4. Write the segmented data into the log file of the corresponding column
family.
+3. After sorting, split the data into column groups.
+4. Write the segmented data into the log file of the corresponding column
group.
Review Comment:
If the number of column families is small, I don't think the additional
reading should be a significant cost or bottleneck.
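
To make the point concrete, here is a minimal sketch of the read side. It is not Hudi code, and every name in it is hypothetical. It assumes each column group of a bucket yields rows already sorted by primary key, as the write steps in the RFC describe, and that the reader has to stitch the groups back into full rows. Under those assumptions the stitch is a single streaming k-way merge, so the extra work grows with the number of column groups k rather than requiring a re-sort or a hash join:

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_column_groups(groups):
    """Stitch rows from several column groups back together by primary key.

    `groups` is a list of iterables. Each yields (primary_key, {column: value})
    pairs already sorted by primary_key (an assumption taken from the RFC's
    sorted write path; the in-memory representation here is illustrative only).
    """
    # heapq.merge does a streaming k-way merge over already-sorted inputs.
    merged = heapq.merge(*groups, key=itemgetter(0))
    # Entries with the same key are now adjacent, so group them into one row.
    for key, parts in groupby(merged, key=itemgetter(0)):
        row = {}
        for _, columns in parts:
            row.update(columns)
        yield key, row
```

With k column groups and N entries in total, this costs roughly O(N log k) comparisons and keeps only one entry per group in memory, which is why a small k should keep the overhead modest. Real readers also pay per-file open and I/O costs that this sketch ignores.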