[PR] [SPARK-49537][SQL] Improve Join stats estimate [spark]

via GitHub Sat, 07 Sep 2024 09:51:36 -0700


wangyum opened a new pull request, #48025:
URL: https://github.com/apache/spark/pull/48025


   ### What changes were proposed in this pull request?
   
   This PR improve Join stats estimate when column stats exist.
   
   For example:
   Left side:
   ```
   36126 bytes, 2150 rows
   ```
   
   info_name | info_value
   -- | --
   col_name | key-1980
   data_type | string
   comment | NULL
   min | NULL
   max | NULL
   num_nulls | 1
   distinct_count | 1980
   avg_col_len | 9
   max_col_len | 38
   histogram | NULL
   
   Right side:
   ```
   13250653950 bytes, 1470064309 rows
   ```
   
   info_name | info_value
   -- | --
   col_name | key-3896196
   data_type | string
   comment | NULL
   min | NULL
   max | NULL
   num_nulls | 320713790
   distinct_count | 3896196
   avg_col_len | 8
   max_col_len | 69
   histogram | NULL
   
   Before this PR:
   ```
   +- Join Inner, (key-1980#388 = key-3896196#382), Statistics(sizeInBytes=71.1 
MiB, rowCount=8.11E+5)
   ```
   After this PR:
   ```
   +- Join Inner, (key-1980#388 = key-3896196#382), 
Statistics(sizeInBytes=106.9 GiB, rowCount=1.25E+9)
   ```
   The new statistics are closer to the actual number of rows: 
[380,857,863](https://issues.apache.org/jira/secure/attachment/13071429/enable%20CBO.png).
   
   
   ### Why are the changes needed?
   
   1. Avoid broadcast join fail.
      ```
      org.apache.spark.SparkException: Cannot broadcast the table that is 
larger than 4GB: 4GB
      ```
   2. Avoid Spark driver OOM.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-49537][SQL] Improve Join stats estimate [spark]

Reply via email to