pavel0fadeev commented on PR #48025:
URL: https://github.com/apache/spark/pull/48025#issuecomment-2344382504

   Sorry if I don't understand or am missing some basics, but I have a concern (non-binding).
   
   Let's change the example in the head of the PR a bit and imagine
   t1 with 2150 rows (all 2150 are distinct)
   and
   t2 with 1470064309 rows (all 1470064309 are distinct).
   So we have a small table and a large table with unique rows within each - which looks like a pretty common pattern to me.
   The previous formula works quite well in this scenario:
   (2150 * 1470064309) / 1470064309 = 2150 (and we physically can't have more than 2150 rows after joining t1 and t2),
   but the new formula
   (2150 * 1470064309) / 2150 = 1470064309 greatly overestimates the cardinality of the join, which means that some (already written) jobs will no longer be able to apply reasonable optimisations.
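   To make the comparison concrete, here is a minimal sketch of the two estimates (not the actual Spark code, and I'm assuming the old formula divides the row-count product by the larger distinctCount while the new one divides by the smaller, as the numbers above suggest):

```scala
// Hypothetical sketch of the two inner equi-join cardinality estimates
// discussed above, using the row counts and distinct counts from this example.
object JoinEstimateSketch {
  val t1Rows = 2150L          // t1: 2150 rows, all join keys distinct
  val t1Ndv  = 2150L
  val t2Rows = 1470064309L    // t2: ~1.47B rows, all join keys distinct
  val t2Ndv  = 1470064309L

  // Assumed old formula: rows(t1) * rows(t2) / max(ndv(t1), ndv(t2))
  val oldEstimate: Long = t1Rows * t2Rows / math.max(t1Ndv, t2Ndv)   // 2150

  // Assumed new formula: rows(t1) * rows(t2) / min(ndv(t1), ndv(t2))
  val newEstimate: Long = t1Rows * t2Rows / math.min(t1Ndv, t2Ndv)   // 1470064309

  def main(args: Array[String]): Unit =
    println(s"old estimate = $oldEstimate, new estimate = $newEstimate")
}
```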
   
   To summarise, the old formula works well for a uniform distribution (and gives the exact cardinality of the join when every value in the table with the smaller distinctCount also appears in the other table, which can be strictly proven), while the new one can work much worse.
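   For reference, here is a sketch of why the old formula is exact under those assumptions (my own notation, not from the PR: |t| is the row count of t and ndv(t) is the distinct count of its join key). With ndv(t1) <= ndv(t2), uniform key frequencies in t2, and every t1 key also present in t2, each distinct t2 key occurs |t2| / ndv(t2) times, so each t1 row matches exactly that many t2 rows:

```latex
\[
  |t_1 \bowtie t_2|
    = |t_1| \cdot \frac{|t_2|}{\mathrm{ndv}(t_2)}
    = \frac{|t_1| \cdot |t_2|}{\max\bigl(\mathrm{ndv}(t_1),\, \mathrm{ndv}(t_2)\bigr)}
\]
```

   which is exactly the old estimate; in the example above it evaluates to 2150 * 1 = 2150.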
   
   I understand that in our case it is better to overestimate than to underestimate, but the new formula looks much less precise in some basic cases where the previous formula shines.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

