mcdull-zhang opened a new pull request, #36178: URL: https://github.com/apache/spark/pull/36178
### What changes were proposed in this pull request?

`WholeStageCodegenExec` is wrapped in a `BufferedRowIterator`, which uses a `LinkedList` to hold the output of `WholeStageCodegenExec`. When the parent of `SortMergeJoin` does not support codegen, `SortMergeJoin` has to append its output to this `LinkedList`.

`SortMergeJoin` processes one record from `streamedPlan` at a time. If every record in `bufferedPlan` matches that record, all of those matches are saved in the `LinkedList` before the parent consumes any of them, which can result in an OOM.

This situation is very common in our internal use, so this PR adds a configuration to the generated code: once there are enough items in the `LinkedList`, `SortMergeJoin` stops and lets the parent consume them first.

### Why are the changes needed?

Improves stability by avoiding OOMs that disrupt users' jobs.

### Does this PR introduce _any_ user-facing change?

Yes. It adds a configuration, `spark.sql.sortMergeJoinExec.codegen.maxRecordPerCycle`, to ensure that the length of the `LinkedList` does not exceed the configured value.

### How was this patch tested?

All existing tests pass.
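
To make the buffering behaviour concrete, here is a minimal, self-contained sketch in plain Java. It is not the code from this PR and does not use Spark classes: `RowBuffer` is a simplified stand-in for `BufferedRowIterator`, and `CappedJoin`, `run`, and `peakSize` are hypothetical names introduced only for illustration. It assumes a single streamed record that matches every buffered record, and shows how returning from `processNext()` once the list reaches a cap keeps the list short while still producing every row.

```java
import java.util.LinkedList;

final class BoundedJoinBufferSketch {

    /** Simplified stand-in for BufferedRowIterator: the producer appends rows to a LinkedList. */
    abstract static class RowBuffer {
        protected final LinkedList<String> currentRows = new LinkedList<>();
        int peakSize = 0; // largest list size observed, for illustration only

        boolean hasNext() {
            // Ask the producer for more rows only when the buffer has been drained.
            if (currentRows.isEmpty()) {
                processNext();
            }
            return !currentRows.isEmpty();
        }

        String next() {
            return currentRows.remove();
        }

        protected void append(String row) {
            currentRows.add(row);
            peakSize = Math.max(peakSize, currentRows.size());
        }

        protected abstract void processNext();
    }

    /** Hypothetical join: one streamed record that matches every buffered record. */
    static final class CappedJoin extends RowBuffer {
        private final int bufferedMatches;
        private final int maxRecordPerCycle; // plays the role of the proposed configuration
        private int nextMatch = 0;           // position to resume from on the next cycle

        CappedJoin(int bufferedMatches, int maxRecordPerCycle) {
            this.bufferedMatches = bufferedMatches;
            this.maxRecordPerCycle = maxRecordPerCycle;
        }

        @Override
        protected void processNext() {
            while (nextMatch < bufferedMatches) {
                append("streamed-0|buffered-" + nextMatch);
                nextMatch++;
                // Enough rows are buffered: stop and let the parent consume them first.
                if (currentRows.size() >= maxRecordPerCycle) {
                    return;
                }
            }
        }
    }

    /** Drains the join like a non-codegen parent would; returns {rows consumed, peak list size}. */
    static int[] run(int bufferedMatches, int maxRecordPerCycle) {
        CappedJoin join = new CappedJoin(bufferedMatches, maxRecordPerCycle);
        int consumed = 0;
        while (join.hasNext()) {
            join.next();
            consumed++;
        }
        return new int[] {consumed, join.peakSize};
    }

    public static void main(String[] args) {
        int[] capped = run(1000, 100);
        int[] uncapped = run(1000, Integer.MAX_VALUE);
        System.out.println("capped:   rows=" + capped[0] + " peak=" + capped[1]);
        System.out.println("uncapped: rows=" + uncapped[0] + " peak=" + uncapped[1]);
    }
}
```

In this toy model both runs emit all 1000 rows, but the capped run never holds more than 100 rows in the list, whereas the uncapped run holds all 1000 at once. The key design point is that the producer must remember where it stopped (`nextMatch` here) so that the next cycle resumes instead of restarting the match loop.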
