the-other-tim-brown commented on code in PR #13924:
URL: https://github.com/apache/hudi/pull/13924#discussion_r2377618784
##########
rfc/rfc-80/rfc-80.md:
##########
@@ -150,38 +150,103 @@ The entire reading process involves a large amount of data merging, but because
 3) The Hudi kernel completes the data reading of rowReader and returns complete data. The data format is Avro.
 4) The engine gets the Avro format data and needs to convert it into the data format it needs. For example, spark needs to be converted into unsaferow, hetu into block, flink into row, and hive into arrayWritable.
-### Column family level compaction
-Extend Hudi's compaction schedule module to merge each column family's own base file and log file:
+### Column group level compaction
+Extend Hudi's compaction schedule module to merge each column group's own base file and log file:
-
+
 ### Full compaction
-Extend Hudi's compaction schedule module to merge and update all column families in the entire table.
-After merging at the column family level, multiple column families are finally merged into a complete row and saved.
+Extend Hudi's compaction schedule module to merge and update all column groups in the entire table.
+After merging at the column group level, multiple column groups are finally merged into a complete row and saved.
 
-Full compaction will be optional, only column family level compaction is required.
-Different users might need different columns, some might need columns come from multiple column families, some might need columns from only one column family.
+Full compaction will be optional; only column group level compaction is required.
+Different users might need different columns: some might need columns that come from multiple column groups, some might need columns from only one column group.
 It's better to allow users to choose whether enable full compaction or not.
-Besides, after full compaction, projection on the reader side is less efficient because the projection could only be based on the full parquet file with all complete fields instead of based on column family names.
+Besides, after full compaction, projection on the reader side is less efficient, because the projection can only be based on the full parquet file with all fields instead of on column group names.
 
-ColumnFamily can be used not only in ML scenarios and AI feature table, but also can be used to simulate multi-stream join and concatenation of wide tables. 。
+Column groups can be used not only in ML scenarios and AI feature tables, but also to simulate multi-stream joins and the concatenation of wide tables.
 In simulation of multi-stream join scenario, Hudi should produce complete rows, so full compaction is needed in this case.
 
 ## Rollout/Adoption Plan
-This feature itself is a brand-new feature. If you don’t actively turn it on, you will not be able to reach the logic of the column families.
-Business behavior compatibility: No impact, this function will not be actively turned on, and column family logic will not be enabled.
-Syntax compatibility: No impact, the column family attributes are in the table attributes and are executed through SQL standard syntax.
+This is a brand-new feature. If you don’t actively turn it on, you will never reach the column group logic.
+Business behavior compatibility: No impact; unless this feature is actively turned on, column group logic will not be enabled.
+Syntax compatibility: No impact; the column group attributes live in the table attributes and are set through SQL standard syntax.
 Compatibility of data type processing methods: No impact, this design will not modify the bottom-level data field format.
 
 ## Test Plan
 List to check that the implementation works as expected:
-1. All existing tests pass successfully on tables without column families defined.
-2. Hudi SQL supports setting column families when creating MOR bucketed tables.
-3. Column families support adding and deleting in SQL for MOR bucketed tables.
-4. Hudi supports writing data by column families.
-5. Hudi supports reading data by column families.
-7. Hudi supports compaction by column family.
-8. Hudi supports full compaction, merging the data of all column families to achieve data widening.
\ No newline at end of file
+1. All existing tests pass successfully on tables without column groups defined.
+2. Hudi SQL supports setting column groups when creating MOR bucketed tables.
+3. Column groups support adding and deleting in SQL for MOR bucketed tables.
+4. Hudi supports writing data by column groups.
+5. Hudi supports reading data by column groups.
+6. Hudi supports compaction by column group.
+7. Hudi supports full compaction, merging the data of all column groups to achieve data widening.
+
+
+## Potential Approaches
+
+### Approach A: Column Groups under File Group
+
+We map each record key to a single file group, consistently across column groups. Each column group has file slices,
+like we do today in file groups. How columns are split into column groups is fluid and can differ across file groups.
+
+```
+records 1-25   ==> file group 1 ==> [column group : c1-c10], [column group : c11-c74], [column group : c75-c100]
+
+records 26-50  ==> file group 2 ==> [column group : c1-c40], [column group : c41-c100]
+
+records 51-100 ==> file group 3 ==> [column group : c1-c20], [column group : c21-c60], [column group : c61-c80], [column group : c81-c100]
+```
+_Layout for a table with 100 records, 100 columns_
+
+**Indexing**: works as is, since the mapping from key to file group is intact.
+
+**Cleaning**: Column groups can be updated at different rates, i.e. one column group can receive more updates than others.
+To retain versions belonging to the last `x` writes, each column group can simply enforce retention on its own file slices,
+like today. This should work, since it is based off the same timeline anyway.
+
+**Queries**: Time-travel / Snapshot queries should work as-is, filtering each column group like a normal file group today, just reading the
+columns in the projection/filter from the right column group. CDC / Incremental queries can work again by reconciling commit time across column groups.
+(Column-level change tracking is a separate problem.)
+
+**Compaction**: Works on column groups based on existing strategies. We may need to add a few different strategies for tables with blob columns.
+
+**Clustering**: This is where we take a hit. Even when clustering across only a few column groups, we may need to rewrite all columns to preserve the
+file-group -> column group hierarchy. Otherwise, some columns of a record may end up in one file group while others land in another, if clustering created new file groups.
+
+### Approach B: File Groups under Column Groups
+We treat column groups as separate virtual tables sharing the same timeline. But this needs pre-splitting columns into groups at the table level, losing the flexibility to evolve the table.
+Managing the different ways combinations of columns are split across records may be overwhelming.
+
+```
+columns 1-25 ==> column group 1 ==> [file group : c1-c10], [file group : c11-c74], [file group : c75-c100]
+
+columns 26-50 ==> column group 2 ==> [file group : c1-c40], [file group : c41-c100]
+
+columns 51-100 ==> column group 3 ==> [file group : c1-c20], [file group : c21-c60], [file group : c61-c80], [file group : c81-c100]
+```
+_Layout for a table with 100 records, 100 columns_
+
+**Indexing**: the RLI (record-level index) needs to track multiple positions per record key, since the key can be in a different file group in each column group.
+
+**Cleaning**: Achieved independently by each virtual table, enforcing cleaner retention.
+
+**Queries**: CDC / Incremental / Snapshot / Time-travel queries are all UNIONs over query results from the relevant column groups.
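The row-materialization difference between the two proposed layouts can be sketched with toy dictionaries. Everything below is illustrative (the structures and names such as `read_row_a` are invented for this sketch, not Hudi APIs): under Approach A all pieces of a row sit under one file group, while under Approach B the record-level index must hold one location per (key, column group) and a read is a key-aligned join across column groups.

```python
# Approach A: one file group per record key; that file group's columns are
# split into column groups, so every piece of a row lives under the same
# file group and can be concatenated locally.
file_groups_a = {
    "fg1": {  # records mapped to file group 1
        "cg_c1_c10":  {"r1": {"c1": 1}},
        "cg_c11_c74": {"r1": {"c11": 11}},
    },
}

def read_row_a(fg_id, key):
    """Concatenate the key's columns from every column group of one file group."""
    row = {}
    for cg in file_groups_a[fg_id].values():
        row.update(cg.get(key, {}))
    return row

# Approach B: column groups are virtual tables; the same key may land in a
# different file group within each column group, so the record-level index
# tracks multiple positions per key and the read joins across column groups.
column_groups_b = {
    "cg1": {"fgX": {"r1": {"c1": 1}}},
    "cg2": {"fgY": {"r1": {"c11": 11}}},
}
rli_b = {"r1": {"cg1": "fgX", "cg2": "fgY"}}  # one location per column group

def read_row_b(key):
    """Assemble a full row by probing each column group's file group for the key."""
    row = {}
    for cg_id, fg_id in rli_b[key].items():
        row.update(column_groups_b[cg_id][fg_id].get(key, {}))
    return row
```

Both reads produce the same full row here; the difference is that Approach B needs the extra index lookups (and, at scale, potentially a shuffle) to align the pieces by key.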
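The per-column-group cleaning described under Approach A can be sketched the same way: each column group prunes its own file slices against a retention window computed from the shared timeline, so independently updated column groups stay consistent. This is a toy model under assumed names (`clean`, commit strings `t1..t4`), not Hudi's cleaner:

```python
def clean(slices, retained_commits):
    """Keep only file slices whose commit time falls in the retained window."""
    keep = set(retained_commits)
    return [s for s in slices if s["commit"] in keep]

# One shared timeline for the whole file group; retain the last x = 2 writes.
timeline = ["t1", "t2", "t3", "t4"]
retained = timeline[-2:]

# Column groups updated at different rates still clean against the same window.
cg_hot  = [{"commit": t} for t in ["t1", "t2", "t3", "t4"]]  # updated often
cg_cold = [{"commit": t} for t in ["t1", "t3"]]              # updated rarely

hot_kept  = clean(cg_hot, retained)   # slices from t3 and t4 survive
cold_kept = clean(cg_cold, retained)  # only the t3 slice survives
```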
Review Comment:
   Since the record keys span multiple file groups, would we need to shuffle the data on read now to materialize the full rows?
