EdwinGuo commented on pull request #1721:
URL: https://github.com/apache/hudi/pull/1721#issuecomment-648532688


   > @EdwinGuo @nsivabalan let's hash this out. It's an interesting one. Although it may seem like we are computing the fully exploded RDD in both places, if you look closely, we do
   > 
   > ```
   > fileToComparisons = explodeRecordRDDWithFileComparisons(partitionToFileInfo, partitionRecordKeyPairRDD)
   >           .mapToPair(t -> t).countByKey();
   > ```
   > 
   > countByKey() does not shuffle the actual data, just the counts per file. We only pay the compute cost of exploding twice, and all of this is only to estimate the parallelism. Given this is just an estimate, would it be better to introduce an option to simply downsample and estimate, rather than adding caching?
   > 
   > e.g.
   > 
   > ```
   > fileToComparisons = explodeRecordRDDWithFileComparisons(partitionToFileInfo, partitionRecordKeyPairRDD.sample(true, 0.1))
   >           .mapToPair(t -> t).countByKey();
   > ```
   > 
   > would cut the cost down by 90%. We would need to adjust the computations in the map accordingly, of course.
   > Even Spark's sort does some kind of reservoir sampling, so this could be a valid approach overall.
   > 
   > What do you both think? I am a bit concerned about caching this exploded RDD (that's why I chose to recompute to begin with).
   
   I can provide a performance comparison tomorrow. fileComparisonsRDD is computed in different patterns within findMatchingFilesForRecordKeys and computeComparisonsPerFileGroup, so yes, countByKey is light on shuffle but could be heavy on I/O in some cases. I agree that StorageLevel.MEMORY_AND_DISK_SER() could be heavier than recomputing in some scenarios, so let me run some performance tests and get back to you.
   
   Regarding sampling, what if some of the partitions are skewed? Would that cause more overhead than flushing the file out?
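
To make the skew question concrete, here is a minimal plain-Java sketch of the sampled estimate, not the Hudi or Spark code itself. It assumes a hypothetical workload with one hot file and many small files, uses Bernoulli sampling (the quoted snippet uses `sample(true, 0.1)`, which samples with replacement), and scales each sampled count by `1 / fraction`, which is one way the "adjust the computations in the map" step might look. The class name, method names, and file IDs are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Plain-Java stand-in for the sampled countByKey() idea; no Spark or Hudi involved.
final class SampledComparisonEstimate {

    // Exact number of candidate comparisons per file ID (what countByKey() yields today).
    static Map<String, Long> exactCounts(List<String> fileIds) {
        Map<String, Long> counts = new HashMap<>();
        for (String fileId : fileIds) {
            counts.merge(fileId, 1L, Long::sum);
        }
        return counts;
    }

    // Keep each element with probability `fraction`, count the survivors,
    // then scale every count by 1 / fraction to estimate the full count.
    static Map<String, Long> estimatedCounts(List<String> fileIds, double fraction, long seed) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        Random random = new Random(seed);
        Map<String, Long> sampled = new HashMap<>();
        for (String fileId : fileIds) {
            if (random.nextDouble() < fraction) {
                sampled.merge(fileId, 1L, Long::sum);
            }
        }
        Map<String, Long> estimates = new HashMap<>();
        for (Map.Entry<String, Long> entry : sampled.entrySet()) {
            estimates.put(entry.getKey(), Math.round(entry.getValue() / fraction));
        }
        return estimates;
    }

    // Hypothetical skewed workload: one hot file plus many files with few comparisons each.
    static List<String> skewedWorkload(int hotComparisons, int smallFiles, int perSmallFile) {
        List<String> fileIds = new ArrayList<>();
        for (int i = 0; i < hotComparisons; i++) {
            fileIds.add("hot-file");
        }
        for (int f = 0; f < smallFiles; f++) {
            for (int i = 0; i < perSmallFile; i++) {
                fileIds.add("small-file-" + f);
            }
        }
        return fileIds;
    }

    public static void main(String[] args) {
        List<String> workload = skewedWorkload(100_000, 200, 5);
        Map<String, Long> exact = exactCounts(workload);
        Map<String, Long> estimate = estimatedCounts(workload, 0.1, 42L);
        System.out.println("files seen exactly:   " + exact.size());
        System.out.println("files seen in sample: " + estimate.size());
        System.out.println("hot-file exact:       " + exact.get("hot-file"));
        System.out.println("hot-file estimate:    " + estimate.get("hot-file"));
    }
}
```

In this toy setup the hot file is estimated closely, but files with only a handful of comparisons can drop out of the sample entirely, so anything that consumes the map would need a default for missing files. Whether that matters for the parallelism estimate is something the performance tests should show.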



