[
https://issues.apache.org/jira/browse/SPARK-54354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hongze Zhang updated SPARK-54354:
---------------------------------
Summary: Driver / Executor hangs when there's not enough JVM heap memory
for broadcast hashed relation (was: Driver / Executor hangs when there's
enough JVM heap memory for broadcast hashed relation)
> Driver / Executor hangs when there's not enough JVM heap memory for broadcast
> hashed relation
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-54354
> URL: https://issues.apache.org/jira/browse/SPARK-54354
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.1
> Reporter: Hongze Zhang
> Priority: Major
>
> This is a bug that happens specifically when very large UnsafeHashedRelations
> are created for broadcast hash joins.
>
> The rationale of the bug is as follows:
>
> This code:
> [https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L142-L150]
> creates a unified memory manager with Int.Max as the heap memory size for
> the new hashed relation. This practice essentially creates a condition where
> the actual JVM heap is significantly smaller than the heap size the unified
> memory manager believes it has.
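> For reference, the linked lines have roughly the following shape (paraphrased,
> not verbatim; the exact constants and signatures should be checked against the
> commit, and the UnifiedMemoryManager constructor is Spark-internal):
> {code:scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.internal.config.MEMORY_OFFHEAP_ENABLED
> import org.apache.spark.memory.{TaskMemoryManager, UnifiedMemoryManager}
>
> // When no TaskMemoryManager is supplied by the caller, a standalone one is
> // created whose advertised heap size (Int.Max, per the description above)
> // bears no relation to the JVM heap that is actually available.
> val mm = Option(taskMemoryManager).getOrElse {
>   new TaskMemoryManager(
>     new UnifiedMemoryManager(
>       new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
>       Int.MaxValue,       // maxHeapMemory: far larger than the real heap
>       Int.MaxValue / 2,   // onHeapStorageRegionSize
>       1),                 // numCores
>     0)                    // taskAttemptId
> }
> {code}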
> Combined with the recursive retry policy for this specific case:
> [https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L402]
> the oversized heap limit means the recursive retries keep being granted
> memory that the JVM cannot actually back, and they only stop once the
> advertised limit is exhausted, so the program will likely hang for a very
> long time. In a typical local test, it could hang for as long as 2 hours.
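> To see why the oversized limit turns an allocation failure into a long hang
> rather than a fast OOM, here is a minimal, self-contained sketch of the retry
> loop (hypothetical names and numbers; the only behavior carried over from the
> linked TaskMemoryManager code is that a failed grant stays accounted and the
> allocation is retried recursively):
> {code:scala}
> object RetryHangSketch {
>   // What the memory manager believes it has; the real JVM heap is assumed
>   // already filled by the large hashed relation being built.
>   val advertisedLimit: Long = Int.MaxValue   // ~2 GiB of "virtual" heap
>   val realHeapFull: Boolean = true           // every real allocation now fails
>
>   var accounted: Long = 0L                   // memory the manager thinks is in use
>
>   // Shape of the recursive retry: a grant whose real allocation fails stays
>   // in the accounting, so the loop ends only once the *advertised* limit is
>   // exhausted, which can take an enormous number of iterations.
>   @annotation.tailrec
>   def allocatePage(size: Long, attempts: Long = 0): Long = {
>     if (accounted + size > advertisedLimit) attempts   // manager finally out
>     else if (!realHeapFull) attempts                   // real allocation worked
>     else {
>       accounted += size                                // failed grant stays reserved
>       allocatePage(size, attempts + 1)                 // recursive retry
>     }
>   }
>
>   def main(args: Array[String]): Unit = {
>     // 64 KiB grants against a ~2 GiB advertised limit: ~32k futile retries,
>     // each of which in real Spark can also trigger spill attempts.
>     println(s"retries before giving up: ${allocatePage(64 * 1024)}")
>   }
> }
> {code}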