Thanks for your contribution, Lvhu!

I think we should kick-start this effort with a small RFC outlining the
proposed changes first, since this modifies the core read flow for all
Hudi tables and we want to make sure our approach there is rock-solid.

On Thu, Feb 16, 2023 at 6:34 AM 吕虎 <lvh...@163.com> wrote:

> Hi folks,
>       PR 7984 (https://github.com/apache/hudi/pull/7984) implements hash
> partitioning.
>       As you know, it is often difficult to find an appropriate partition
> key in existing big data. Hash partitioning solves this problem easily
> and can greatly improve the performance of Hudi's big-data processing.
>       The idea is to use the hash partition field as one of the partition
> fields of the ComplexKeyGenerator, so this PR's implementation does not
> involve modifying core logic.
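[The bucketing idea above can be sketched as follows. This is a hedged, hypothetical illustration, not the PR's code: the helper name and the stable-hash-modulo scheme are assumptions for the sake of the example.]

```python
# Hypothetical sketch of deriving a hash-bucket partition value from a
# record key. Assumes a stable-hash-modulo scheme; not the PR's code.
import zlib

def hash_partition(record_key: str, num_buckets: int) -> int:
    # zlib.crc32 is deterministic across processes, unlike Python's
    # built-in hash(), so the same key always lands in the same bucket.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets
```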
>       The code is easy to review, and I think hash partitioning is very
> useful. We really need it.
>       For an example of using hash partitioning through the Spark data
> source, see
> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>  #testHashPartition
>
>       There is no public API or user-facing feature change, and no
> performance impact, if the hash partition parameters are not specified.
>
>       When hash.partition.fields is specified and partition.fields
> contains _hoodie_hash_partition, a column named _hoodie_hash_partition
> will be added to the table as one of the partition keys.
>
>       If a predicate on hash.partition.fields appears in the query
> statement, a _hoodie_hash_partition = X predicate will be automatically
> added to the query statement for partition pruning.
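[The pruning step described above can be illustrated with a small sketch. This is a hedged, hypothetical illustration: the function name, bucket count, and SQL-string form are invented for the example (the real change would operate on Spark predicates), and it assumes the same stable-hash-modulo scheme.]

```python
# Hypothetical illustration of the pruning idea: an equality predicate on
# the hash-partitioned column is extended with a matching
# _hoodie_hash_partition predicate so only one bucket is scanned.
# Names and the SQL-string form are invented for illustration.
import zlib

NUM_BUCKETS = 8  # assumed bucket count for this example

def add_pruning_predicate(column: str, value: str, hash_field: str) -> str:
    if column != hash_field:
        # Predicate does not touch the hash field: leave it unchanged.
        return f"{column} = '{value}'"
    bucket = zlib.crc32(value.encode("utf-8")) % NUM_BUCKETS
    return f"{column} = '{value}' AND _hoodie_hash_partition = {bucket}"
```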
>
>       I hope folks can help review!
>       Thanks!
> Lvhu
>
