Hi folks,
      PR 7984 (https://github.com/apache/hudi/pull/7984) implements hash
partitioning.
      As you know, it is often difficult to find an appropriate partition key
in existing big data sets. Hash partitioning solves this problem easily, and it
can greatly improve the performance of Hudi's big data processing.
      The idea is to use the hash partition field as one of the partition
fields of the ComplexKeyGenerator, so the implementation in this PR does not
modify any core logic.
      The code is easy to review, and I think hash partitioning is very useful;
we really need it.
      For how to use hash partitioning in the Spark data source, refer to
https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
   #testHashPartition

      There is no public API change, no user-facing feature change, and no
performance impact if the hash partition parameters are not specified.

      When hash.partition.fields is specified and partition.fields contains
_hoodie_hash_partition, a column named _hoodie_hash_partition is added to the
table as one of the partition keys.
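
      To illustrate the paragraph above, here is a minimal sketch of how a
_hoodie_hash_partition value could be derived from the hash partition fields.
It is not the code in the PR: the function names, the CRC32 hash, the bucket
count of 16 and the sample record are all assumptions made for illustration.

```python
import zlib

# Column name taken from the description above.
HASH_PARTITION_COLUMN = "_hoodie_hash_partition"


def hash_partition_value(record, hash_fields, num_buckets):
    """Map the values of hash_fields to a bucket id in [0, num_buckets)."""
    # Join with a separator so that ("a", "bc") and ("ab", "c") do not collide.
    key = "\x00".join(str(record[f]) for f in hash_fields)
    # Assumption: CRC32 stands in for whatever hash the PR really uses.
    # It is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def with_hash_partition(record, hash_fields, num_buckets):
    """Return a copy of record with the extra hash partition column added."""
    out = dict(record)
    out[HASH_PARTITION_COLUMN] = hash_partition_value(
        record, hash_fields, num_buckets
    )
    return out


# Hypothetical record, hash field and bucket count.
row = {"user_id": 42, "region": "eu", "amount": 9.5}
enriched = with_hash_partition(row, ["user_id"], num_buckets=16)
```

      Records with the same values in the hash fields always land in the same
bucket, which is what lets the new column act as an ordinary partition key.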

      If the query statement contains predicates on the hash.partition.fields
columns, a _hoodie_hash_partition = X predicate is automatically added to the
query statement for partition pruning.
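
      The pruning step above can be sketched as follows. This is only an
illustration, not the PR's implementation: it assumes the rewrite applies when
every hash field has an equality predicate, and the helper names, the CRC32
hash and the bucket count are hypothetical.

```python
import zlib


def bucket_of(values, num_buckets):
    """Hash a list of field values to a bucket id (CRC32 is an assumption)."""
    key = "\x00".join(str(v) for v in values)
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def add_hash_partition_predicate(equality_predicates, hash_fields, num_buckets):
    """Add `_hoodie_hash_partition = X` when all hash fields are constrained.

    equality_predicates maps a column name to the literal it is compared to,
    e.g. {"user_id": 42} for `WHERE user_id = 42`.
    """
    out = dict(equality_predicates)
    # Without a literal for every hash field the bucket cannot be computed,
    # so the predicates are returned unchanged and no pruning happens.
    if not all(f in equality_predicates for f in hash_fields):
        return out
    values = [equality_predicates[f] for f in hash_fields]
    out["_hoodie_hash_partition"] = bucket_of(values, num_buckets)
    return out


# `WHERE user_id = 42 AND region = 'eu'` with user_id as the only hash field.
pruned = add_hash_partition_predicate(
    {"user_id": 42, "region": "eu"}, ["user_id"], num_buckets=16
)
# `WHERE region = 'eu'` gives no value for user_id, so nothing is added.
unpruned = add_hash_partition_predicate({"region": "eu"}, ["user_id"], 16)
```

      With the extra predicate in place, the query only needs to read the one
hash partition that can contain matching rows.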

        I hope folks can help review it!
      Thanks!
Lvhu
      
