[
https://issues.apache.org/jira/browse/SPARK-49537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-49537:
-----------------------------------
Labels: pull-request-available (was: )
> Incorrect Join stats estimate
> -----------------------------
>
> Key: SPARK-49537
> URL: https://issues.apache.org/jira/browse/SPARK-49537
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Yuming Wang
> Priority: Major
> Labels: pull-request-available
> Attachments: diable CBO.png, enable CBO.png
>
>
> Error message:
> {noformat}
> org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 4GB: 4GB
> at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:45)
> at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:340)
> at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:198)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> {noformat}
> Left side stats:
> {noformat}
> 36126 bytes, 2150 rows
> {noformat}
> |info_name|info_value|
> |col_name|brand|
> |data_type|string|
> |comment|NULL|
> |min|NULL|
> |max|NULL|
> |num_nulls|1|
> |distinct_count|1980|
> |avg_col_len|9|
> |max_col_len|38|
> |histogram|NULL|
> Right side stats:
> {noformat}
> 13250653950 bytes, 1470064309 rows
> {noformat}
> |info_name|info_value|
> |col_name|brand|
> |data_type|string|
> |comment|NULL|
> |min|NULL|
> |max|NULL|
> |num_nulls|320713790|
> |distinct_count|3896196|
> |avg_col_len|8|
> |max_col_len|69|
> |histogram|NULL|
> Join plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [brand#612428, leaf_categ_name#612429, leaf_categ_id#612430, GMV_LC_AMT#615773, item_price#615665], Statistics(sizeInBytes=2.41E+25 B)
> +- Join Inner, ((item_id#615802 = item_id#612432) AND (leaf_categ_id#615805 = leaf_categ_id#612430)), Statistics(sizeInBytes=3.07E+25 B)
>    :- Project [brand#612428, leaf_categ_name#612429, leaf_categ_id#612430, item_id#612432], Statistics(sizeInBytes=55.7 MiB, rowCount=8.11E+5)
>    :  +- Join Inner, (brand#612434 = brand#612428), Statistics(sizeInBytes=71.1 MiB, rowCount=8.11E+5)
>    :     :- Project [brand#612428, leaf_categ_name#612429, leaf_categ_id#612430], Statistics(sizeInBytes=136.4 KiB, rowCount=2.15E+3)
>    :     :  +- Filter (isnotnull(leaf_categ_id#612430) AND isnotnull(brand#612428)), Statistics(sizeInBytes=170.0 KiB, rowCount=2.15E+3)
>    :     :     +- Relation spark_catalog.tableA[brand#612428,leaf_categ_name#612429,leaf_categ_id#612430,dom_gmv#612431] parquet, Statistics(sizeInBytes=170.1 KiB, rowCount=2.15E+3)
>    :     +- Project [item_id#612432, brand#612434], Statistics(sizeInBytes=38.5 GiB, rowCount=1.15E+9)
>    :        +- Filter (isnotnull(item_id#612432) AND isnotnull(brand#612434)), Statistics(sizeInBytes=42.8 GiB, rowCount=1.15E+9)
>    :           +- Relation spark_catalog.tableB[item_id#612432,auct_end_dt#612433,brand#612434] parquet, Statistics(sizeInBytes=54.8 GiB, rowCount=1.47E+9)
>    +- Project [item_id#615802, leaf_categ_id#615805, CASE WHEN tax_state#615824 IN (UK,EU) THEN cast(bround((((cast(quantity#615828 as decimal(10,0)) * item_price#615827) + item_sales_tax_amt#615887) / cast(quantity#615828 as decimal(10,0))), 2) as decimal(38,2)) ELSE cast(item_price#615827 as decimal(38,2)) END AS item_price#615665, coalesce(GMV_LC_AMT#615933, 0.000000) AS gmv_lc_amt#615773], Statistics(sizeInBytes=466.4 PiB)
>       +- Join LeftOuter, (cast(byr_curncy_id#615921 as decimal(9,0)) = curncy_id#615796), Statistics(sizeInBytes=799.5 PiB)
>          :- Project [item_id#615802, leaf_categ_id#615805, tax_state#615824, item_price#615827, quantity#615828, item_sales_tax_amt#615887, byr_curncy_id#615921, GMV_LC_AMT#615933], Statistics(sizeInBytes=756.7 TiB)
>          :  +- Join LeftOuter, (cast(lstg_curncy_id#615848 as decimal(9,0)) = curncy_id#612267), Statistics(sizeInBytes=894.2 TiB)
>          :     :- Project [item_id#615802, leaf_categ_id#615805, tax_state#615824, item_price#615827, quantity#615828, lstg_curncy_id#615848, item_sales_tax_amt#615887, byr_curncy_id#615921, GMV_LC_AMT#615933], Statistics(sizeInBytes=846.3 GiB)
>          :     :  +- Filter ((((((((((isnotnull(GMV_DT#615926) AND isnotnull(seller_id#615806)) AND (GMV_DT#615926 >= 2023-09-01)) AND (GMV_DT#615926 <= 2024-08-31)) AND isnotnull(item_id#615802)) AND isnotnull(leaf_categ_id#615805)) AND site_id#615804 IN (0,100)) AND NOT checkout_status#615818 IN (1,3)) AND sale_type#615812 IN (1,2,7,8,9,13)) AND (seller_id#615806 = 118556856)) AND (lower(CASE WHEN (rprtd_wacko_yn#615875 = ) THEN ck_wacko_yn#615845 ELSE coalesce(rprtd_wacko_yn#615875, ck_wacko_yn#615845) END) = n)), Statistics(sizeInBytes=11.5 TiB)
>          :     :     +- Relation spark_catalog.tableC[item_id#615802,auct_end_dt#615803,site_id#615804,leaf_categ_id#615805,seller_id#615806,slr_cntry_id#615807,buyer_id#615808,byr_cntry_id#615809,transaction_id#615810,shipping_address_id#615811,sale_type#615812,created_time#615813,created_dt#615814,last_modified#615815,last_modified_dt#615816,checkout_flags#615817,checkout_status#615818,checkout_status_details#615819,payment_method#615820,shipping_fee#615821,shipping_xfee#615822,tax#615823,tax_state#615824,instruction_flag#615825,... 116 more fields] parquet, Statistics(sizeInBytes=11.5 TiB)
>          :     +- Project [curncy_id#612267], Statistics(sizeInBytes=1082.0 B)
>          :        +- Filter isnotnull(curncy_id#612267), Statistics(sizeInBytes=5.0 KiB)
>          :           +- Relation spark_catalog.tableD[CURNCY_ID#612267,CURNCY_PLAN_RATE#612268,CRE_DATE#612269,CRE_USER#612270,UPD_DATE#612271,UPD_USER#612272] parquet, Statistics(sizeInBytes=5.0 KiB)
>          +- Project [curncy_id#615796], Statistics(sizeInBytes=1082.0 B)
>             +- Filter isnotnull(curncy_id#615796), Statistics(sizeInBytes=5.0 KiB)
>                +- Relation spark_catalog.tableD[CURNCY_ID#615796,CURNCY_PLAN_RATE#615797,CRE_DATE#615798,CRE_USER#615799,UPD_DATE#615800,UPD_USER#615801] parquet, Statistics(sizeInBytes=5.0 KiB)
> {noformat}
>
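For readers who want to relate the column statistics above to the rowCount shown on the inner join over brand, here is a rough sketch of the textbook NDV-based cardinality estimate. It is not taken from the Spark source. The helper names (scale_ndv, inner_join_rows) are hypothetical, and two rules are assumptions made for illustration: that isnotnull(brand) removes exactly num_nulls rows, and that a filter shrinks distinct_count in proportion to the rows it keeps. Under those assumptions the result should land close to the rowCount=8.11E+5 reported for that join.

```python
import math


def scale_ndv(ndv: int, rows_before: int, rows_after: int) -> int:
    """Shrink a distinct-value count in proportion to the rows a filter keeps.

    Assumed rule for illustration; the real optimizer may differ.
    """
    if rows_after >= rows_before:
        return ndv
    return math.ceil(ndv * rows_after / rows_before)


def inner_join_rows(left_rows: int, right_rows: int,
                    left_ndv: int, right_ndv: int) -> int:
    """Textbook equi-join estimate: |L| * |R| / max(ndv_L, ndv_R).

    Assumes join keys are uniformly distributed on both sides.
    """
    return math.ceil(left_rows * right_rows / max(left_ndv, right_ndv))


# Figures copied from the "Left side stats" and "Right side stats" in this report.
left_rows, left_nulls, left_ndv = 2150, 1, 1980
right_rows, right_nulls, right_ndv = 1470064309, 320713790, 3896196

# Assumption: isnotnull(brand) drops exactly the NULL rows on each side.
left_kept = left_rows - left_nulls
right_kept = right_rows - right_nulls

estimate = inner_join_rows(
    left_kept,
    right_kept,
    scale_ndv(left_ndv, left_rows, left_kept),
    scale_ndv(right_ndv, right_rows, right_kept),
)

print(f"rows kept on the right side: {right_kept}")
print(f"estimated inner join rows:   {estimate:.3g}")
```

Because the formula assumes uniformly distributed keys, a skewed brand column can make the true join output far larger than this estimate.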