vinothchandar commented on pull request #1721: URL: https://github.com/apache/hudi/pull/1721#issuecomment-648885633
> Regarding sampling, what if some of the partitions are skewed? Will that cause more overhead than flush the file out?

IIRC the partitionRecordKeyPairRDD would have an even distribution of keys from the precombine step, which just does a `reduceByKey`. We can always support a config to increase the sampling rate, right? It all depends on how much difference there is in the computed parallelism with samplingRate=0.1 and 1.0.
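To make the samplingRate=0.1 vs 1.0 comparison concrete, here is a rough standalone sketch (plain Python, not Hudi or Spark code). It assumes a hypothetical formula where parallelism is the sum over partitions of ceil(estimated keys / keys_per_task), and it approximates a without-replacement sample by keeping each key independently with probability `sampling_rate`. The names `estimate_parallelism`, `skewed_partitions`, and `keys_per_task` are made up for illustration.

```python
import math
import random
from collections import Counter


def estimate_parallelism(partition_paths, sampling_rate, keys_per_task, seed=7):
    """Estimate a task count from a sample of per-key partition paths.

    The ceil-based sizing and keys_per_task are assumptions for illustration,
    not the formula used in the PR.
    """
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in (0, 1]")
    rng = random.Random(seed)
    # Keep each key independently with probability sampling_rate.
    sampled = Counter(
        p for p in partition_paths
        if sampling_rate >= 1.0 or rng.random() < sampling_rate
    )
    # Scale sampled counts back up, then size each partition's share of tasks.
    return sum(
        math.ceil((count / sampling_rate) / keys_per_task)
        for count in sampled.values()
    )


def skewed_partitions():
    """One hot partition (95,000 keys) plus nine small ones (1,000 keys each)."""
    paths = ["2020/06/24"] * 95_000
    for day in range(1, 10):
        paths += [f"2020/06/{day:02d}"] * 1_000
    return paths
```

Calling `estimate_parallelism(skewed_partitions(), 1.0, 10_000)` and again with `0.1` gives the two numbers to compare. In this toy setup even the skewed partition is large enough that a 10% sample should track the full count closely; the estimate gets noisier mainly for partitions with very few keys, which can be missed by the sample entirely.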
