EdwinGuo commented on pull request #1721: URL: https://github.com/apache/hudi/pull/1721#issuecomment-648532688
> @EdwinGuo @nsivabalan let's hash this out.. it's an interesting one.. Although it may seem like we are computing the fully exploded RDD in both places.. if you look closely, we do
>
> ```
> fileToComparisons = explodeRecordRDDWithFileComparisons(partitionToFileInfo, partitionRecordKeyPairRDD)
>     .mapToPair(t -> t).countByKey();
> ```
>
> countByKey() does not shuffle actual data, but just the counts per file.. We only pay the compute cost of exploding twice.. And all this to estimate the parallelism. Given this is just an estimate, would it be better to introduce an option to simply down sample and estimate, rather than adding caching?
>
> eg.
>
> ```
> fileToComparisons = explodeRecordRDDWithFileComparisons(partitionToFileInfo, partitionRecordKeyPairRDD.sample(true, 0.1))
>     .mapToPair(t -> t).countByKey();
> ```
>
> would cut the cost down by 90% .. we need to adjust the computations in the map accordingly ofc..
> Even spark sort does some kind of reservoir sampling.. So this could be a valid approach overall.
>
> What do you both think? I am a bit concerned about caching this exploded RDD (that's why I chose to recompute to begin with)

I can provide a performance comparison tomorrow. fileComparisonsRDD is computed in different patterns within findMatchingFilesForRecordKeys and computeComparisonsPerFileGroup, so yes, countByKey is light on shuffle, but it could be heavy on IO in some cases. I agree that StorageLevel.MEMORY_AND_DISK_SER() could be heavier than recomputing in some scenarios, so let me run some performance tests and get back to you.

Regarding sampling, what if some of the partitions are skewed? Will that cause more overhead than flushing the file out?
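To make the skew question concrete, here is a minimal plain-Java sketch of the "sample, count per file, scale back up" idea. It is only an illustration under stated assumptions, not the Hudi code: plain collections stand in for the RDDs, the names `SampledComparisonEstimate`, `comparisonsPerFile` and `estimateFromSample` are hypothetical, every record is compared against every file in its partition (no key-range pruning), and it uses Bernoulli sampling without replacement rather than `sample(true, 0.1)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class SampledComparisonEstimate {

    // Hypothetical stand-in for "explode, then countByKey": each record is
    // represented only by its partition, and is counted once against every
    // file listed for that partition.
    static Map<String, Long> comparisonsPerFile(Map<String, List<String>> partitionToFiles,
                                                List<String> recordPartitions) {
        Map<String, Long> counts = new HashMap<>();
        for (String partition : recordPartitions) {
            List<String> files =
                partitionToFiles.getOrDefault(partition, Collections.<String>emptyList());
            for (String fileId : files) {
                counts.merge(fileId, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Keep each record with probability `fraction`, count comparisons on the
    // sample, then scale every count by 1 / fraction to estimate the full count.
    static Map<String, Long> estimateFromSample(Map<String, List<String>> partitionToFiles,
                                                List<String> recordPartitions,
                                                double fraction,
                                                long seed) {
        Random rnd = new Random(seed);
        List<String> sampled = new ArrayList<>();
        for (String partition : recordPartitions) {
            if (rnd.nextDouble() < fraction) {
                sampled.add(partition);
            }
        }
        Map<String, Long> estimate = new HashMap<>();
        comparisonsPerFile(partitionToFiles, sampled)
            .forEach((fileId, n) -> estimate.put(fileId, Math.round(n / fraction)));
        return estimate;
    }

    public static void main(String[] args) {
        // Skewed input: one heavily hit partition, one with only a few records.
        Map<String, List<String>> partitionToFiles = new HashMap<>();
        partitionToFiles.put("hot", Arrays.asList("f1", "f2"));
        partitionToFiles.put("cold", Arrays.asList("f3"));

        List<String> records = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) records.add("hot");
        for (int i = 0; i < 5; i++) records.add("cold");

        System.out.println("exact:    " + comparisonsPerFile(partitionToFiles, records));
        System.out.println("estimate: " + estimateFromSample(partitionToFiles, records, 0.1, 42L));
    }
}
```

In this toy setup the heavily hit files should be estimated closely, since they contribute many sampled records. A file touched by only a handful of records can drop out of the sample entirely: with 5 records and a 10% sample, the chance that none is picked is 0.9^5, roughly 59%. Whether that matters for estimating parallelism is something the performance tests would need to show.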
