danny0405 commented on pull request #2319:
URL: https://github.com/apache/hudi/pull/2319#issuecomment-743034840


   > Thanks for the PR @danny0405 !
   > 
   > This makes the code easier to read, but I'm unsure of the intended effects. I have actually thought more about how to avoid this, and it seems hard. We do need some evaluation of how much pruning is going to happen for us to size the parallelism, and since that is very workload specific, it does need an evaluation.
   > 
   > The best idea I could come up with was to see if we can do some sampling, say 10% of the records, to estimate.
   > 
   > Something like below.
   > 
   > ```
   > // We will just try exploding the input and then count to determine comparisons
   > // FIX(vc): Only do sampling here and extrapolate?
   > fileToComparisons = explodeRecordRDDWithFileComparisons(partitionToFileInfo,
   >         partitionRecordKeyPairRDD.sample(false, 0.1))
   >     .mapToPair(t -> t).countByKey();
   > ```
   
   Thanks so much for the review @vinothchandar ~ Can we really do sampling here? If we only sample, how can we ensure the `FileId -> HoodieKey` candidates are complete, so that the final built index is accurate (e.g., no `UPDATE` records are tagged as `INSERT`)?
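
   For what it's worth, if the sampling were used only to *size the parallelism* (not to build the index itself), the extrapolation step could look like the plain-Java sketch below. This is a hypothetical illustration, not Hudi code: the names (`extrapolate`, `sizeParallelism`, `comparisonsPerTask`) and the sizing heuristic are assumptions.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Hypothetical sketch: scale per-file comparison counts from a sampled
   // countByKey back up by the sampling fraction, then derive a parallelism.
   public class SamplingEstimate {

     // Extrapolate sampled counts to estimated full counts.
     static Map<String, Long> extrapolate(Map<String, Long> sampledCounts, double fraction) {
       Map<String, Long> estimated = new HashMap<>();
       sampledCounts.forEach((fileId, count) ->
           estimated.put(fileId, Math.round(count / fraction)));
       return estimated;
     }

     // Size the join parallelism from the estimated total comparisons.
     static int sizeParallelism(Map<String, Long> estimated, long comparisonsPerTask) {
       long total = estimated.values().stream().mapToLong(Long::longValue).sum();
       return (int) Math.max(1, total / comparisonsPerTask);
     }

     public static void main(String[] args) {
       // 10% sample produced these per-file comparison counts.
       Map<String, Long> sampled = Map.of("file-1", 120L, "file-2", 30L);
       Map<String, Long> estimated = extrapolate(sampled, 0.1);
       System.out.println(sizeParallelism(estimated, 500)); // prints 3
     }
   }
   ```

   The key point either way: the sampled `countByKey` would feed only the parallelism estimate, while the actual tagging would still explode the full `partitionRecordKeyPairRDD`, so no `FileId -> HoodieKey` candidate is lost.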


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
