[
https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Guram Savinov updated SPARK-46516:
----------------------------------
Summary: autoBroadcastJoinThreshold compared to project of a plan not a
relation size (was: autoBroadcastJoinThreshold compared to plan.statistics not
a table size)
> autoBroadcastJoinThreshold compared to project of a plan not a relation size
> ----------------------------------------------------------------------------
>
> Key: SPARK-46516
> URL: https://issues.apache.org/jira/browse/SPARK-46516
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Guram Savinov
> Priority: Major
>
> From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum
> size in bytes for a table that will be broadcasted to all worker nodes when
> performing a join.
> [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]
> In fact Spark compares plan.statistics.sizeInBytes for columns selected in
> join, not a table size.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]
> Join can select only a few columns and sizeInBytes will be lesser than
> autoBroadcastJoinThreshold, but broadcasted table can be huge and it is
> loaded entirely into drivers memory which can lead to OOM.
> spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not
> compared to broadcasted table size.
>
> Original task and test when autobroadcast compared to relation totalSize:
> https://issues.apache.org/jira/browse/SPARK-2393
> [https://github.com/apache/spark/pull/1238/files#diff-00485e6cae519f81adca5ceee66227c6eae35db709619d505468f8765175ac18R39]
>
> Task and PR where autoBroadcastJoinThreshold sterted to be compared to
> project of a plan instead of relations:
> https://issues.apache.org/jira/browse/SPARK-13329
> [https://github.com/apache/spark/pull/11210]
>
> Related topic on SO:
> [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]