[ 
https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-46516:
----------------------------------
    Description: 
>From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
>size in bytes for a table that will be broadcasted to all worker nodes when 
>performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact Spark compares plan.statistics.sizeInBytes  of a project (columns 
selected in join), not a relation size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

Join can select only a few columns and sizeInBytes will be lesser than 
autoBroadcastJoinThreshold, but broadcasted table can be huge and it is loaded 
entirely into drivers memory which can lead to OOM.

spark.sql.autoBroadcastJoinThreshold parameter compared to projection instead 
of broadcasted table size seems quite risky feature: there will be more 
broadcasted relations but more chances to get OOM on the driver too.

The solution is to disable spark.sql.autoBroadcastJoinThreshold and set hints 
on really small relations, but in that case autoBroadcastJoinThreshold seems 
useless.  It would be more usefull to have autoBroadcastJoinThreshold which 
campres to relations size and have predicted memory usage on the driver.

 

Original task and test when autobroadcast compared to relation totalSize:

https://issues.apache.org/jira/browse/SPARK-2393

[https://github.com/apache/spark/pull/1238/files#diff-00485e6cae519f81adca5ceee66227c6eae35db709619d505468f8765175ac18R39]

 

Task and PR where autoBroadcastJoinThreshold started to be compared to project 
of a plan instead of relations:

https://issues.apache.org/jira/browse/SPARK-13329

[https://github.com/apache/spark/pull/11210]

 

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]

  was:
>From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
>size in bytes for a table that will be broadcasted to all worker nodes when 
>performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact Spark compares plan.statistics.sizeInBytes  of a project (columns 
selected in join), not a relation size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

Join can select only a few columns and sizeInBytes will be lesser than 
autoBroadcastJoinThreshold, but broadcasted table can be huge and it is loaded 
entirely into drivers memory which can lead to OOM.

spark.sql.autoBroadcastJoinThreshold parameter compared to projection instead 
of broadcasted table size seems quite risky feature: there will be more 
broadcasted relations but more chances to get OOM on the driver too.

 

Original task and test when autobroadcast compared to relation totalSize:

https://issues.apache.org/jira/browse/SPARK-2393

[https://github.com/apache/spark/pull/1238/files#diff-00485e6cae519f81adca5ceee66227c6eae35db709619d505468f8765175ac18R39]

 

Task and PR where autoBroadcastJoinThreshold started to be compared to project 
of a plan instead of relations:

https://issues.apache.org/jira/browse/SPARK-13329

[https://github.com/apache/spark/pull/11210]

 

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]


> autoBroadcastJoinThreshold compared to project of a plan not a relation size
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-46516
>                 URL: https://issues.apache.org/jira/browse/SPARK-46516
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Guram Savinov
>            Priority: Major
>
> From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
> size in bytes for a table that will be broadcasted to all worker nodes when 
> performing a join.
> [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]
> In fact Spark compares plan.statistics.sizeInBytes  of a project (columns 
> selected in join), not a relation size.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]
> Join can select only a few columns and sizeInBytes will be lesser than 
> autoBroadcastJoinThreshold, but broadcasted table can be huge and it is 
> loaded entirely into drivers memory which can lead to OOM.
> spark.sql.autoBroadcastJoinThreshold parameter compared to projection instead 
> of broadcasted table size seems quite risky feature: there will be more 
> broadcasted relations but more chances to get OOM on the driver too.
> The solution is to disable spark.sql.autoBroadcastJoinThreshold and set hints 
> on really small relations, but in that case autoBroadcastJoinThreshold seems 
> useless.  It would be more usefull to have autoBroadcastJoinThreshold which 
> campres to relations size and have predicted memory usage on the driver.
>  
> Original task and test when autobroadcast compared to relation totalSize:
> https://issues.apache.org/jira/browse/SPARK-2393
> [https://github.com/apache/spark/pull/1238/files#diff-00485e6cae519f81adca5ceee66227c6eae35db709619d505468f8765175ac18R39]
>  
> Task and PR where autoBroadcastJoinThreshold started to be compared to 
> project of a plan instead of relations:
> https://issues.apache.org/jira/browse/SPARK-13329
> [https://github.com/apache/spark/pull/11210]
>  
> Related topic on SO: 
> [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to