maheshk114 commented on PR #41860:
URL: https://github.com/apache/spark/pull/41860#issuecomment-1635347890

   @beliefer 
   I'm running some experiments to check how table size affects the 
performance numbers. As far as the bloom filter is concerned, the worst case 
seems to be when the left side (bloom creation side) is as large as allowed and 
the right side (bloom application side) is as small as allowed. The values 
below are for a left-side table of ~10MB (the maximum; beyond this size the 
bloom filter is not applied) and a right-side table of ~10GB (the minimum; 
below this size the bloom filter is not applied). Here the reduction is from 
220M records to 7,952,642 records. I will also run some experiments that vary 
this reduction ratio.
   
   
![image](https://github.com/apache/spark/assets/19660171/69dfccc9-6012-4193-b1de-c741e8672c1f)
   
   
![image](https://github.com/apache/spark/assets/19660171/4ef41bf2-ead7-420f-85b9-5c1f2bc9af00)
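
   To put the numbers quoted above in perspective, here is a minimal sketch of 
the arithmetic. It assumes the 220M figure is a rounded row count and takes the 
~10MB / ~10GB limits exactly as described in this comment; the constant and 
function names are hypothetical and are not Spark APIs.

```python
# Row counts quoted in the comment; 220M is assumed to be a rounded figure.
ROWS_BEFORE = 220_000_000
ROWS_AFTER = 7_952_642

# Size limits as described in the comment (hypothetical names, not Spark APIs).
CREATION_SIDE_MAX_BYTES = 10 * 1024**2      # ~10MB, bloom creation (left) side
APPLICATION_SIDE_MIN_BYTES = 10 * 1024**3   # ~10GB, bloom application (right) side


def bloom_applies(creation_bytes, application_bytes):
    """Return True when both table sizes fall inside the limits described above."""
    return (creation_bytes <= CREATION_SIDE_MAX_BYTES
            and application_bytes >= APPLICATION_SIDE_MIN_BYTES)


def reduction(rows_before, rows_after):
    """Return (fraction of rows kept, fraction filtered out, reduction factor)."""
    kept = rows_after / rows_before
    return kept, 1.0 - kept, rows_before / rows_after


kept, filtered, factor = reduction(ROWS_BEFORE, ROWS_AFTER)
print(f"kept {kept:.2%}, filtered {filtered:.2%}, about {factor:.1f}x fewer rows")
```

   Under these assumptions roughly 3.6% of the rows survive the filter, so the 
benchmark above is a high-selectivity case; a lower reduction ratio may look 
quite different.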
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

