pavel0fadeev commented on PR #48025: URL: https://github.com/apache/spark/pull/48025#issuecomment-2344382504
Sorry if I'm misunderstanding or missing some basics, but I have a concern (non-binding). Let's change the example in the head of the PR a bit and imagine t1 with 2150 rows (all 2150 distinct) and t2 with 1470064309 rows (all 1470064309 distinct). So we have a small table and a large table, each with unique join keys, which looks like a pretty common pattern to me.

The previous formula works quite well in this scenario: (2150 * 1470064309) / 1470064309 = 2150 (and we physically can't have more than 2150 rows after joining t1 and t2). The new formula, (2150 * 1470064309) / 2150 = 1470064309, greatly overestimates the cardinality of the join, which means that some (already written) jobs will no longer be able to use reasonable optimisations.

To summarise, the old formula works well for a uniform distribution (and gives the exact cardinality of the join if all values in the table with the smaller distinctCount match in the other table, which can be strictly proven), while the new one can work much worse. I understand that in our case it is better to overestimate than to underestimate, but the new formula looks much less precise for some basic cases where the previous formula shines.
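To make the comparison concrete, here is a minimal sketch (not Spark code; the object and method names are mine) of the two estimates described above, dividing the row-count product by the max versus the min of the two distinct counts, evaluated with the example numbers:

```scala
// Illustrative only: the two join-cardinality estimates compared in this comment.
object JoinCardinalitySketch {
  // previous formula: rows1 * rows2 / max(ndv1, ndv2)
  def estimateWithMax(rows1: BigInt, rows2: BigInt, ndv1: BigInt, ndv2: BigInt): BigInt =
    rows1 * rows2 / (ndv1 max ndv2)

  // formula as described above: rows1 * rows2 / min(ndv1, ndv2)
  def estimateWithMin(rows1: BigInt, rows2: BigInt, ndv1: BigInt, ndv2: BigInt): BigInt =
    rows1 * rows2 / (ndv1 min ndv2)

  def main(args: Array[String]): Unit = {
    val t1Rows = BigInt(2150)
    val t2Rows = BigInt(1470064309L)
    // both join keys are unique within their tables, so ndv == row count
    println(estimateWithMax(t1Rows, t2Rows, t1Rows, t2Rows)) // 2150
    println(estimateWithMin(t1Rows, t2Rows, t1Rows, t2Rows)) // 1470064309
  }
}
```

The first result matches the physical upper bound of 2150 output rows; the second overshoots it by several orders of magnitude, which is the concern raised here.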
