[ https://issues.apache.org/jira/browse/SPARK-49537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-49537:
-----------------------------------
    Labels: pull-request-available  (was: )

> Incorrect Join stats estimate
> -----------------------------
>
>                 Key: SPARK-49537
>                 URL: https://issues.apache.org/jira/browse/SPARK-49537
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: diable CBO.png, enable CBO.png
>
>
> Error message:
> {noformat}
> org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 4GB: 4GB
> at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:45)
> at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:340)
> at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:198)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> {noformat}
> Left side stats:
> {noformat}
> 36126 bytes, 2150 rows
> {noformat}
> |info_name|info_value|
> |col_name|brand|
> |data_type|string|
> |comment|NULL|
> |min|NULL|
> |max|NULL|
> |num_nulls|1|
> |distinct_count|1980|
> |avg_col_len|9|
> |max_col_len|38|
> |histogram|NULL|
> Right side stats:
> {noformat}
> 13250653950 bytes, 1470064309 rows
> {noformat}
> |info_name|info_value|
> |col_name|brand|
> |data_type|string|
> |comment|NULL|
> |min|NULL|
> |max|NULL|
> |num_nulls|320713790|
> |distinct_count|3896196|
> |avg_col_len|8|
> |max_col_len|69|
> |histogram|NULL|
> Join plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [brand#612428, leaf_categ_name#612429, leaf_categ_id#612430, 
> GMV_LC_AMT#615773, item_price#615665], Statistics(sizeInBytes=2.41E+25 B)
> +- Join Inner, ((item_id#615802 = item_id#612432) AND (leaf_categ_id#615805 = 
> leaf_categ_id#612430)), Statistics(sizeInBytes=3.07E+25 B)
>    :- Project [brand#612428, leaf_categ_name#612429, leaf_categ_id#612430, 
> item_id#612432], Statistics(sizeInBytes=55.7 MiB, rowCount=8.11E+5)
>    :  +- Join Inner, (brand#612434 = brand#612428), 
> Statistics(sizeInBytes=71.1 MiB, rowCount=8.11E+5)
>    :     :- Project [brand#612428, leaf_categ_name#612429, 
> leaf_categ_id#612430], Statistics(sizeInBytes=136.4 KiB, rowCount=2.15E+3)
>    :     :  +- Filter (isnotnull(leaf_categ_id#612430) AND 
> isnotnull(brand#612428)), Statistics(sizeInBytes=170.0 KiB, rowCount=2.15E+3)
>    :     :     +- Relation 
> spark_catalog.tableA[brand#612428,leaf_categ_name#612429,leaf_categ_id#612430,dom_gmv#612431]
>  parquet, Statistics(sizeInBytes=170.1 KiB, rowCount=2.15E+3)
>    :     +- Project [item_id#612432, brand#612434], 
> Statistics(sizeInBytes=38.5 GiB, rowCount=1.15E+9)
>    :        +- Filter (isnotnull(item_id#612432) AND 
> isnotnull(brand#612434)), Statistics(sizeInBytes=42.8 GiB, rowCount=1.15E+9)
>    :           +- Relation 
> spark_catalog.tableB[item_id#612432,auct_end_dt#612433,brand#612434] parquet, 
> Statistics(sizeInBytes=54.8 GiB, rowCount=1.47E+9)
>    +- Project [item_id#615802, leaf_categ_id#615805, CASE WHEN 
> tax_state#615824 IN (UK,EU) THEN cast(bround((((cast(quantity#615828 as 
> decimal(10,0)) * item_price#615827) + item_sales_tax_amt#615887) / 
> cast(quantity#615828 as decimal(10,0))), 2) as decimal(38,2)) ELSE 
> cast(item_price#615827 as decimal(38,2)) END AS item_price#615665, 
> coalesce(GMV_LC_AMT#615933, 0.000000) AS gmv_lc_amt#615773], 
> Statistics(sizeInBytes=466.4 PiB)
>       +- Join LeftOuter, (cast(byr_curncy_id#615921 as decimal(9,0)) = 
> curncy_id#615796), Statistics(sizeInBytes=799.5 PiB)
>          :- Project [item_id#615802, leaf_categ_id#615805, tax_state#615824, 
> item_price#615827, quantity#615828, item_sales_tax_amt#615887, 
> byr_curncy_id#615921, GMV_LC_AMT#615933], Statistics(sizeInBytes=756.7 TiB)
>          :  +- Join LeftOuter, (cast(lstg_curncy_id#615848 as decimal(9,0)) = 
> curncy_id#612267), Statistics(sizeInBytes=894.2 TiB)
>          :     :- Project [item_id#615802, leaf_categ_id#615805, 
> tax_state#615824, item_price#615827, quantity#615828, lstg_curncy_id#615848, 
> item_sales_tax_amt#615887, byr_curncy_id#615921, GMV_LC_AMT#615933], 
> Statistics(sizeInBytes=846.3 GiB)
>          :     :  +- Filter ((((((((((isnotnull(GMV_DT#615926) AND 
> isnotnull(seller_id#615806)) AND (GMV_DT#615926 >= 2023-09-01)) AND 
> (GMV_DT#615926 <= 2024-08-31)) AND isnotnull(item_id#615802)) AND 
> isnotnull(leaf_categ_id#615805)) AND site_id#615804 IN (0,100)) AND NOT 
> checkout_status#615818 IN (1,3)) AND sale_type#615812 IN (1,2,7,8,9,13)) AND 
> (seller_id#615806 = 118556856)) AND (lower(CASE WHEN (rprtd_wacko_yn#615875 = 
> ) THEN ck_wacko_yn#615845 ELSE coalesce(rprtd_wacko_yn#615875, 
> ck_wacko_yn#615845) END) = n)), Statistics(sizeInBytes=11.5 TiB)
>          :     :     +- Relation 
> spark_catalog.tableC[item_id#615802,auct_end_dt#615803,site_id#615804,leaf_categ_id#615805,seller_id#615806,slr_cntry_id#615807,buyer_id#615808,byr_cntry_id#615809,transaction_id#615810,shipping_address_id#615811,sale_type#615812,created_time#615813,created_dt#615814,last_modified#615815,last_modified_dt#615816,checkout_flags#615817,checkout_status#615818,checkout_status_details#615819,payment_method#615820,shipping_fee#615821,shipping_xfee#615822,tax#615823,tax_state#615824,instruction_flag#615825,...
>  116 more fields] parquet, Statistics(sizeInBytes=11.5 TiB)
>          :     +- Project [curncy_id#612267], Statistics(sizeInBytes=1082.0 B)
>          :        +- Filter isnotnull(curncy_id#612267), 
> Statistics(sizeInBytes=5.0 KiB)
>          :           +- Relation 
> spark_catalog.tableD[CURNCY_ID#612267,CURNCY_PLAN_RATE#612268,CRE_DATE#612269,CRE_USER#612270,UPD_DATE#612271,UPD_USER#612272]
>  parquet, Statistics(sizeInBytes=5.0 KiB)
>          +- Project [curncy_id#615796], Statistics(sizeInBytes=1082.0 B)
>             +- Filter isnotnull(curncy_id#615796), Statistics(sizeInBytes=5.0 
> KiB)
>                +- Relation 
> spark_catalog.tableD[CURNCY_ID#615796,CURNCY_PLAN_RATE#615797,CRE_DATE#615798,CRE_USER#615799,UPD_DATE#615800,UPD_USER#615801]
>  parquet, Statistics(sizeInBytes=5.0 KiB)
> {noformat}
>  
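
The rowCount=8.11E+5 reported for the inner join on brand is consistent with the textbook equi-join estimate, rows(L) * rows(R) / max(ndv(L), ndv(R)), applied to the table-level stats quoted above. The sketch below only reproduces that arithmetic; it is not Spark's JoinEstimation code, the function name is hypothetical, and it deliberately ignores the null filtering and any NDV adjustment the optimizer may apply.

```python
def estimate_equi_join_rows(left_rows, right_rows, left_ndv, right_ndv):
    """Textbook equi-join cardinality: |L| * |R| / max(ndv_L, ndv_R).

    Hypothetical helper for illustration only; nulls and filters are ignored.
    """
    return left_rows * right_rows / max(left_ndv, right_ndv)


# Table-level stats copied from the issue description.
LEFT_ROWS, LEFT_BRAND_NDV = 2150, 1980
RIGHT_ROWS, RIGHT_BRAND_NDV = 1470064309, 3896196

brand_join_rows = estimate_equi_join_rows(
    LEFT_ROWS, RIGHT_ROWS, LEFT_BRAND_NDV, RIGHT_BRAND_NDV
)

# Print in the same scientific notation the plan uses.
print(f"estimated rows: {brand_join_rows:.2E}")
```

Under these assumptions the result lands on the same order of magnitude and leading digits as the plan's estimate for the brand join.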

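The very large sizeInBytes values on the joins that involve tableC (which shows no rowCount) are consistent with a size-only fallback that multiplies the sizes of a join's children. The sketch below checks that reading against three joins in the plan. It is an assumption drawn from the numbers, not a copy of Spark's implementation; the helper name is hypothetical, and the inputs are the rounded figures printed in the plan, so the results agree only approximately.

```python
# Binary unit multipliers, matching the KiB/MiB/... suffixes in the plan.
KIB, MIB, GIB, TIB, PIB = (1024 ** i for i in range(1, 6))


def product_of_children_sizes(left_bytes, right_bytes):
    """Assumed size-only join estimate: the product of both children's sizes."""
    return left_bytes * right_bytes


# Join LeftOuter on lstg_curncy_id: 846.3 GiB side joined with a 1082.0 B side.
inner_left_outer = product_of_children_sizes(846.3 * GIB, 1082.0)

# Join LeftOuter on byr_curncy_id: 756.7 TiB side joined with a 1082.0 B side.
outer_left_outer = product_of_children_sizes(756.7 * TIB, 1082.0)

# Top-level Join Inner: 55.7 MiB side joined with a 466.4 PiB side.
top_inner = product_of_children_sizes(55.7 * MIB, 466.4 * PIB)

print(f"{inner_left_outer / TIB:.1f} TiB (plan shows 894.2 TiB)")
print(f"{outer_left_outer / PIB:.1f} PiB (plan shows 799.5 PiB)")
print(f"{top_inner:.2E} B (plan shows 3.07E+25 B)")
```

If that reading is right, each additional join without row counts multiplies the estimate again, which would explain how a 1082-byte lookup side inflates a TiB-scale input to PiB scale.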


--
This message was sent by Atlassian Jira
(v8.20.10#820010)

