vinothchandar commented on pull request #1721:
URL: https://github.com/apache/hudi/pull/1721#issuecomment-648885633


   > Regarding sampling, what if some of the partitions are skewed? Will that 
cause more overhead than flushing the file out?
   
   IIRC, the keys in partitionRecordKeyPairRDD should be evenly distributed 
after the precombine step, which just does a `reduceByKey`. We can always add a 
config to increase the sampling rate, right? It all depends on how much the 
computed parallelism differs between samplingRate=0.1 and samplingRate=1.0.
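
   To make that comparison concrete, here is a rough, self-contained sketch in plain Python (no Spark, and not Hudi's actual code). It assumes parallelism is derived from an estimated record count divided by a target number of records per task; `estimate_parallelism`, `records_per_task`, and the synthetic partition paths are all hypothetical names and values chosen for illustration. Deduplicating with a dict stands in for the `reduceByKey` precombine step.

```python
import math
import random


def estimate_parallelism(pairs, sampling_rate, records_per_task, seed=42):
    """Estimate a task count from a Bernoulli sample of (partition, key) pairs.

    Hypothetical helper: the sampled count is scaled back up by the
    sampling rate, then divided by the target records per task.
    """
    rng = random.Random(seed)
    if sampling_rate >= 1.0:
        sampled = len(pairs)  # a rate of 1.0 means a full count
    else:
        sampled = sum(1 for _ in pairs if rng.random() < sampling_rate)
    estimated_total = sampled / sampling_rate
    return max(1, math.ceil(estimated_total / records_per_task))


# Synthetic input: 10 partitions x 5000 keys, plus some duplicate records.
raw = [(f"2020/06/{day:02d}", f"key-{i}") for day in range(1, 11) for i in range(5000)]
raw += raw[:1000]

# Stand-in for the precombine step: keep one record per (partition, key).
pairs = list(dict.fromkeys(raw))

full = estimate_parallelism(pairs, sampling_rate=1.0, records_per_task=1000)
approx = estimate_parallelism(pairs, sampling_rate=0.1, records_per_task=1000)
print(f"samplingRate=1.0 -> {full}, samplingRate=0.1 -> {approx}")
```

   With evenly distributed keys like these, the two estimates should land close together. Running the same comparison on skewed input would show whether a configurable sampling rate is worth adding.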

