[ 
https://issues.apache.org/jira/browse/SPARK-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-12358:
---------------------------------
    Labels: bulk-closed  (was: )

> Spark SQL query with lots of small tables under broadcast threshold leading 
> to java.lang.OutOfMemoryError
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12358
>                 URL: https://issues.apache.org/jira/browse/SPARK-12358
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Deenar Toraskar
>            Priority: Major
>              Labels: bulk-closed
>
> Hi,
> I have a Spark SQL query joining many small tables (five or more), all 
> below the broadcast threshold. Looking at the query plan, Spark broadcasts 
> all of these tables together without checking whether sufficient memory is 
> available. This leads to 
> Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java 
> heap space 
> errors, which kill the executors and cause the query to fail.
> I worked around the issue by reducing spark.sql.autoBroadcastJoinThreshold 
> so that the bigger tables in the query are no longer broadcast.
> A fix would be to ensure that, in addition to the per-table threshold 
> (spark.sql.autoBroadcastJoinThreshold), there is a cumulative broadcast 
> threshold per query (say spark.sql.autoBroadcastJoinThresholdCumulative), 
> so that only data up to that limit is broadcast, preventing executors from 
> running out of memory.
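> The workaround above can be sketched as follows (the 1 MB value here is 
> illustrative only; Spark's default threshold is 10485760 bytes, i.e. 10 MB):
>
> ```sql
> -- Lower the per-table broadcast threshold so the larger tables fall back
> -- to shuffle joins instead of all being broadcast at once.
> -- Value is in bytes; setting it to -1 disables broadcast joins entirely.
> SET spark.sql.autoBroadcastJoinThreshold=1048576;
> ```
>
> Note this is a blunt instrument: it is applied per table, so it also stops 
> broadcasting tables that would individually have fit in memory, which is 
> why a cumulative per-query limit would be preferable.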



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
