Hi folks, PR 7984 (https://github.com/apache/hudi/pull/7984) implements hash partitioning. As you know, it is often difficult to find an appropriate partition key in existing big data, and hash partitioning solves this problem easily; it can greatly improve the performance of Hudi's big data processing. The idea is to use the hash partition field as one of the partition fields of the ComplexKeyGenerator, so the PR does not touch the core code logic. The changes are easy to review, but I think hash partitioning is very useful and we really need it. For how to use hash partitioning with the Spark data source, please refer to testHashPartition in https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
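
To give a rough idea, here is a minimal write-side sketch with the Spark data source. The exact option keys for the new hash partition config are defined in the PR; the two "hoodie.datasource.write.hash.partition.*" keys below are only placeholders I made up for illustration, please take testHashPartition in the linked test as the authoritative usage. Everything else is standard Hudi/Spark API.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hash-partition-demo")
  .master("local[2]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

val basePath = "/tmp/hudi_hash_partition_demo"
val df = Seq((1, "2023-01-01", 1000L), (2, "2023-01-01", 1001L))
  .toDF("id", "dt", "ts")

df.write.format("hudi")
  .option("hoodie.table.name", "hash_partition_demo")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // _hoodie_hash_partition is used as one of the ComplexKeyGenerator partition fields
  .option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option("hoodie.datasource.write.partitionpath.field", "dt,_hoodie_hash_partition")
  // placeholder keys for the hash partition config (which column to hash, bucket count);
  // see the linked test for the real option names from the PR
  .option("hoodie.datasource.write.hash.partition.fields", "id")
  .option("hoodie.datasource.write.hash.partition.num.buckets", "64")
  .mode(SaveMode.Overwrite)
  .save(basePath)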
There is no public API or user-facing feature change, and no performance impact, if the hash partition parameters are not specified. When hash.partition.fields is specified and partition.fields contains _hoodie_hash_partition, a column named _hoodie_hash_partition is added to the table as one of the partition keys. If a query contains predicates on the hash.partition.fields columns, a matching _hoodie_hash_partition = X predicate is automatically added to the query for partition pruning. Hope folks can help review! Thanks! Lvhu
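
For the read side, the query itself does not change at all; the pruning happens inside Hudi. A quick sketch (reusing spark and basePath from the write example above):

// A normal query with a predicate on the hashed column ("id" here). With the PR,
// the corresponding _hoodie_hash_partition = X predicate is added automatically,
// so only the matching hash buckets are scanned.
val readDf = spark.read.format("hudi").load(basePath)
readDf.createOrReplaceTempView("hash_partition_demo")
spark.sql("select id, dt, ts from hash_partition_demo where id = 1").show()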