Thanks for your contribution, Lvhu! I think we should kick-start this effort with a small RFC outlining the proposed changes first, since this modifies the core read flow for all Hudi tables and we want to make sure our approach there is rock-solid.
On Thu, Feb 16, 2023 at 6:34 AM 吕虎 <lvh...@163.com> wrote:
> Hi folks,
>
> PR 7984 [https://github.com/apache/hudi/pull/7984] implements hash
> partitioning. As you know, it is often difficult to find an appropriate
> partition key in existing big data. Hash partitioning can easily solve
> this problem, and it can greatly improve the performance of Hudi's big
> data processing.
>
> The idea is to use the hash partition field as one of the partition
> fields of the ComplexKeyGenerator, so this PR does not modify the logic
> of core code. The code is easy to review, but I think hash partitioning
> is very useful; we really need it.
>
> How to use hash partitioning with the Spark data source is shown in
> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
> #testHashPartition
>
> There is no public API or user-facing feature change, nor any
> performance impact, if the hash partition parameters are not specified.
>
> When hash.partition.fields is specified and partition.fields contains
> _hoodie_hash_partition, a column named _hoodie_hash_partition will be
> added to the table as one of the partition keys.
>
> If predicates on hash.partition.fields appear in the query statement,
> a _hoodie_hash_partition = X predicate will be automatically added to
> the query statement for partition pruning.
>
> Hope folks can help and review!
> Thanks!
> Lvhu
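For readers following the thread, the mechanism being proposed can be sketched in a few lines. This is an illustrative sketch only, not Hudi's implementation: the column name `_hoodie_hash_partition` comes from the proposal, but the bucket count, the choice of CRC32 as the hash function, and the helper names are assumptions made here purely to show the idea of deriving a synthetic partition column and rewriting an equality predicate into a partition-pruning predicate.

```python
import zlib

# Assumed bucket count for illustration; the real configuration knob in
# the PR may be named and sized differently.
NUM_BUCKETS = 16


def hash_partition(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a key value to a stable bucket id (the synthetic partition value).

    CRC32 is used here only because it is deterministic and portable;
    the actual hash function in the PR may differ.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def prune_predicate(key_value: str) -> str:
    """Rewrite an equality filter on the hashed field into the extra
    partition-pruning predicate the proposal says would be added
    automatically to the query."""
    return f"_hoodie_hash_partition = {hash_partition(key_value)}"


# The same key always hashes to the same bucket, so a query filtering on
# that key only needs to scan a single hash partition.
assert hash_partition("user_42") == hash_partition("user_42")
print(prune_predicate("user_42"))
```

The point of the sketch is the write/read symmetry: because the partition value is a pure function of the key, the reader can recompute it from a query predicate and prune to one bucket without any extra metadata.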