Raghvendradubey commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-637414299


   Hello Vinoth,
   
   I was just experimenting with different shuffle parallelism settings. I was 
able to reduce the countByKey stage in WorkloadProfile.java by setting the 
shuffle parallelism to around 20, but there is no impact on the countByKey in 
HoodieBloomIndex.java or the count in HoodieSparkSqlWriter.scala.
   Data stats are as follows:
   1 - more than 500 keys/record
   2 - 7k to 10k records/partition
   3 - upsert vs. insert ratio around 70:30, but this can vary; it's not fixed
   4 - keys are not ordered per partition; I have ordered the keys while 
inserting into the Hudi dataset through Spark Structured Streaming.
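   
   For reference, here is roughly how I am setting the parallelism options on 
the writer (a sketch only; the record key / precombine fields and the table 
path are illustrative placeholders, not my actual job config):
   
   ```scala
   // Sketch: tuning Hudi shuffle parallelism via datasource write options.
   // Field names and the save path below are illustrative.
   df.write
     .format("hudi")
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.datasource.write.recordkey.field", "key")  // placeholder field
     .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder field
     .option("hoodie.upsert.shuffle.parallelism", "20")  // the knob I reduced to ~20
     .option("hoodie.insert.shuffle.parallelism", "20")
     .mode("append")
     .save("/path/to/hudi/table")  // placeholder path
   ```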
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
