Raghvendradubey commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-637414299
Hello Vinoth,
I was just playing with different combinations of shuffle parallelism. By setting
the shuffle partitions up to 20 or so, I am able to reduce the countByKey at
WorkloadProfile.java, but there is no impact on the countByKey at
HoodieBloomIndex.java or the count at HoodieSparkSqlWriter.scala.
Data stats are as follows -
1 - more than 500 keys/record
2 - 7k to 10k records/partition
3 - upsert vs insert ratio is around 70:30, but this can vary; in most cases
it's not fixed
4 - Keys are not ordered per partition. I have ordered the keys while
inserting into the Hudi dataset through Spark Structured Streaming.
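For reference, a minimal sketch of where these parallelism knobs can be set on the write path. The table name and base path are placeholders, and the config keys are assumed from the Hudi 0.5.x documentation; adjust them for your version:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical upsert call illustrating where the shuffle-parallelism knobs
// are set. Only the option keys are meaningful here; names/paths are made up.
def writeToHudi(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("org.apache.hudi")
    .option("hoodie.table.name", "my_table")               // placeholder table name
    .option("hoodie.datasource.write.operation", "upsert")
    // Controls the shuffle on the upsert/insert write path
    .option("hoodie.upsert.shuffle.parallelism", "20")
    .option("hoodie.insert.shuffle.parallelism", "20")
    // Controls the parallelism of the bloom-index lookup (tagLocation)
    .option("hoodie.bloom.index.parallelism", "20")
    .mode(SaveMode.Append)
    .save(basePath)
}
```

Note that the upsert/insert shuffle settings do not govern the bloom-index stage, which may explain why only the WorkloadProfile countByKey responded to the change.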
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]