wangyum opened a new pull request, #48025: URL: https://github.com/apache/spark/pull/48025
### What changes were proposed in this pull request? This PR improve Join stats estimate when column stats exist. For example: Left side: ``` 36126 bytes, 2150 rows ``` info_name | info_value -- | -- col_name | key-1980 data_type | string comment | NULL min | NULL max | NULL num_nulls | 1 distinct_count | 1980 avg_col_len | 9 max_col_len | 38 histogram | NULL Right side: ``` 13250653950 bytes, 1470064309 rows ``` info_name | info_value -- | -- col_name | key-3896196 data_type | string comment | NULL min | NULL max | NULL num_nulls | 320713790 distinct_count | 3896196 avg_col_len | 8 max_col_len | 69 histogram | NULL Before this PR: ``` +- Join Inner, (key-1980#388 = key-3896196#382), Statistics(sizeInBytes=71.1 MiB, rowCount=8.11E+5) ``` After this PR: ``` +- Join Inner, (key-1980#388 = key-3896196#382), Statistics(sizeInBytes=106.9 GiB, rowCount=1.25E+9) ``` The new statistics are closer to the actual number of rows: [380,857,863](https://issues.apache.org/jira/secure/attachment/13071429/enable%20CBO.png). ### Why are the changes needed? 1. Avoid broadcast join fail. ``` org.apache.spark.SparkException: Cannot broadcast the table that is larger than 4GB: 4GB ``` 2. Avoid Spark driver OOM. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
