[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Franck Tago updated SPARK-44759: -------------------------------- Description: The generate node used to flatten array generally produces an amount of output rows that is significantly higher than the input rows to the node. the number of output rows generated is even drastically higher when flattening a nested array . When we combine more that 1 generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory for multiple reasons. 1- As you can see from the attachment , the rows created in the nested loop are saved in writer buffer. In my case because the rows were big , I hit an Out Of Memory Exception error .2_ The generated WholeStageCodeGen result in a nested loop that for each row , will explode the parent array and then explode the inner array. This is prone to OutOfmerry errors Please view the attached Spark Gui and Spark Dag In my case the wholestagecodegen includes 2 explode nodes. Because the array elements are large , we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whoe stage code gen node . Doing so leads to potential memory issues. !image-2023-08-10-02-18-24-758.png! !image-2023-08-10-09-19-23-973.png! !image-2023-08-10-09-21-32-371.png! WSCG method for first Generate node !image-2023-08-10-09-22-29-949.png! WSCG for second Generate node As you can see the execution of generate_doConsume_1 and generate_doConsume_0 triggers a nested loop. !image-2023-08-10-09-24-02-755.png! was: The generate node used to flatten array generally produces an amount of output rows that is significantly higher than the input rows to the node. the number of output rows generated is even drastically higher when flattening a nested array . When we combine more that 1 generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory for multiple reasons. 1- As you can see from the attachment , the rows created in the nested loop are saved in writer buffer. In my case because the rows were big , I hit an Out Of Memory Exception error .2_ The generated WholeStageCodeGen result in a nested loop that for each row , will explode the parent array and then explode the inner array. This is prone to OutOfmerry errors Please view the attached Spark Gui and Spark Dag In my case the wholestagecodegen includes 2 explode nodes. Because the array elements are large , we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whoe stage code gen node . Doing so leads to potential memory issues. !image-2023-08-10-02-18-24-758.png! > Do not combine multiple Generate nodes in the same WholeStageCodeGen because > it can easily cause OOM failures > -------------------------------------------------------------------------------------------------------------- > > Key: SPARK-44759 > URL: https://issues.apache.org/jira/browse/SPARK-44759 > Project: Spark > Issue Type: Bug > Components: Optimizer > Affects Versions: 3.3.0, 3.3.1 > Reporter: Franck Tago > Priority: Major > Attachments: wholestagecodegen_wc1_debug_wholecodegen_passed > > > The generate node used to flatten array generally produces an amount of > output rows that is significantly higher than the input rows to the node. > the number of output rows generated is even drastically higher when > flattening a nested array . > When we combine more that 1 generate node in the same WholeStageCodeGen > node, we run a high risk of running out of memory for multiple reasons. > 1- As you can see from the attachment , the rows created in the nested loop > are saved in writer buffer. In my case because the rows were big , I hit an > Out Of Memory Exception error .2_ The generated WholeStageCodeGen result in a > nested loop that for each row , will explode the parent array and then > explode the inner array. This is prone to OutOfmerry errors > > Please view the attached Spark Gui and Spark Dag > In my case the wholestagecodegen includes 2 explode nodes. > Because the array elements are large , we end up with an Out Of Memory error. > > I recommend that we do not merge multiple explode nodes in the same whoe > stage code gen node . Doing so leads to potential memory issues. > > !image-2023-08-10-02-18-24-758.png! > > !image-2023-08-10-09-19-23-973.png! > > !image-2023-08-10-09-21-32-371.png! > > WSCG method for first Generate node > !image-2023-08-10-09-22-29-949.png! > > WSCG for second Generate node > As you can see the execution of generate_doConsume_1 and generate_doConsume_0 > triggers a nested loop. > !image-2023-08-10-09-24-02-755.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org