danny0405 commented on pull request #2319: URL: https://github.com/apache/hudi/pull/2319#issuecomment-743034840
> Thanks for the PR @danny0405 !
>
> This leaves the code easier to read, but I am unsure of the intended effects. I have actually thought more about how to avoid this, and it seems hard. We do need some evaluation of how much pruning is going to happen in order to size the parallelism, and since that is very workload-specific, it does need an evaluation.
>
> The best idea I could come up with was to see if we can do some sampling, say 10% of the records, to estimate.
>
> Something like below:
>
> ```
> // we will just try exploding the input and then count to determine comparisons
> // FIX(vc): Only do sampling here and extrapolate?
> fileToComparisons = explodeRecordRDDWithFileComparisons(partitionToFileInfo, partitionRecordKeyPairRDD.sample(0.1))
>     .mapToPair(t -> t).countByKey();
> ```

Thanks so much for the review @vinothchandar ~

Can we really do sampling here? If we sample, how can we ensure the `FileId -> HoodieKey` candidates are complete, so that the final built index is accurate (e.g. no `UPDATE` records are tagged as `INSERT`)?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
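[Editor's illustration] The sample-and-extrapolate idea in the quoted review can be sketched outside of Spark. The snippet below is a minimal plain-Python sketch, not Hudi code: `candidate_files` is a hypothetical stand-in for the file-pruning step that `explodeRecordRDDWithFileComparisons` performs, and `estimate_comparisons` counts per-file comparisons on a random sample, then scales by the inverse of the sampling fraction. Whether such an estimate is safe for correctness (the concern raised in the reply) depends on whether the sample is used only to size parallelism or also to build the index itself.

```python
import random

def candidate_files(partition, record_key):
    # Hypothetical pruning: map each key to one of two files per partition.
    # In Hudi this would come from file ranges / bloom filters.
    return [f"{partition}-file-{sum(map(ord, record_key)) % 2}"]

def estimate_comparisons(pairs, fraction=0.1, seed=42):
    """Sample a fraction of (partition, record_key) pairs, count the
    per-file comparisons on the sample, then extrapolate by 1/fraction."""
    rng = random.Random(seed)
    sample = [p for p in pairs if rng.random() < fraction]
    counts = {}
    for partition, key in sample:
        for file_id in candidate_files(partition, key):
            counts[file_id] = counts.get(file_id, 0) + 1
    # Scale the sampled counts back up to estimate the full workload.
    return {f: int(c / fraction) for f, c in counts.items()}

pairs = [("2020/12/01", f"key-{i}") for i in range(10_000)]
estimates = estimate_comparisons(pairs)
```

With 10,000 input pairs and one candidate file per key, the estimated totals come out close to 10,000, with sampling error on the order of a few percent; the estimate is only useful for sizing, since a sampled key set would miss candidates for the unsampled records.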
