mcdull-zhang opened a new pull request, #36178:
URL: https://github.com/apache/spark/pull/36178

   ### What changes were proposed in this pull request?
   The code generated by WholeStageCodegenExec runs as a subclass of
   BufferedRowIterator.
   
   BufferedRowIterator uses a LinkedList to hold the output rows of
   WholeStageCodegenExec.
   
   When the parent of SortMergeJoin does not support codegen, SortMergeJoin
   has to append its output rows to this LinkedList.
   
   SortMergeJoin processes one record from streamedPlan at a time. If every
   record in bufferedPlan matches that record, all of the joined rows are
   appended to the LinkedList before the parent can consume any of them,
   which can result in an OOM.
   
   This situation is very common in our internal workloads, so this PR adds a
   configurable limit to the generated code. Once the LinkedList holds enough
   rows, SortMergeJoin stops and lets the parent consume them first.
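   
   The following is a minimal, self-contained sketch of that idea, not the
   actual patch and not Spark's generated code. The class name, the string
   rows, and the `run` helper are hypothetical; only the LinkedList buffer and
   the notion of a per-cycle limit come from the description above.

```java
import java.util.LinkedList;

// Hypothetical model of a producer that fills a LinkedList buffer and yields
// to its consumer once the buffer reaches a configured size.
class BoundedJoinBufferSketch {
    // Stand-in for the output buffer held by the row iterator.
    private final LinkedList<String> currentRows = new LinkedList<>();
    private final int streamedRows;      // rows on the streamed side
    private final int bufferedMatches;   // buffered-side matches per streamed row
    private final int maxRecordPerCycle; // assumed cap on buffered output rows
    private int streamedIdx = 0;         // resume position on the streamed side
    private int bufferedIdx = 0;         // resume position on the buffered side
    private int peak = 0;                // largest buffer size observed

    BoundedJoinBufferSketch(int streamedRows, int bufferedMatches, int maxRecordPerCycle) {
        this.streamedRows = streamedRows;
        this.bufferedMatches = bufferedMatches;
        this.maxRecordPerCycle = maxRecordPerCycle;
    }

    // Produces joined rows, resuming where the previous call stopped.
    private void processNext() {
        while (streamedIdx < streamedRows) {
            while (bufferedIdx < bufferedMatches) {
                currentRows.add("s" + streamedIdx + "-b" + bufferedIdx);
                bufferedIdx++;
                peak = Math.max(peak, currentRows.size());
                if (currentRows.size() >= maxRecordPerCycle) {
                    return; // cap reached: let the parent consume first
                }
            }
            // All matches for this streamed row are emitted; move to the next one.
            bufferedIdx = 0;
            streamedIdx++;
            if (!currentRows.isEmpty()) {
                return; // yield once per streamed row, as in the uncapped case
            }
        }
    }

    boolean hasNext() {
        if (currentRows.isEmpty()) {
            processNext();
        }
        return !currentRows.isEmpty();
    }

    String next() {
        return currentRows.remove();
    }

    // Drains the iterator and returns {rows consumed, peak buffer size}.
    static int[] run(int streamedRows, int bufferedMatches, int maxRecordPerCycle) {
        BoundedJoinBufferSketch it =
            new BoundedJoinBufferSketch(streamedRows, bufferedMatches, maxRecordPerCycle);
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return new int[] {count, it.peak};
    }
}
```

   In this sketch, `run(3, 1000, Integer.MAX_VALUE)` models the current
   behavior, where the buffer grows to the full number of matches per
   streamed row, while `run(3, 1000, 100)` yields the same rows with the
   buffer never holding more than 100 of them. The producer has to remember
   its position on both sides so that it can resume after the parent drains
   the buffer.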
   
   ### Why are the changes needed?
   Improves stability by preventing these OOM errors from affecting users.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. This PR adds the configuration
   spark.sql.sortMergeJoinExec.codegen.maxRecordPerCycle, which ensures that
   the LinkedList length does not exceed the configured value.
   
   
   ### How was this patch tested?
   Passes all existing tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

