[PR] [SPARK-55098][PYTHON] Vectorized UDFs with output batch control fail with memory leak [spark]

via GitHub Tue, 20 Jan 2026 01:15:19 -0800


zhengruifeng opened a new pull request, #53867:
URL: https://github.com/apache/spark/pull/53867


   ### What changes were proposed in this pull request?
   Fix a memory leak that occurs when the output stream is stopped before EOS (end of stream).
   
   
   ### Why are the changes needed?
   Bug fix. For example, the following query
   
   ```
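   import pyarrow as pa
   
   # Assumes an active SparkSession named `spark`.
   # Output batch control settings that trigger the issue.
   spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "-1")
   spark.conf.set("spark.sql.execution.arrow.maxBytesPerOutputBatch", "3")
   
   
   def get_size(iterator):
       # Emit one single-row batch holding the row count of each non-empty input batch.
       for batch in iterator:
           if batch.num_rows > 0:
               yield pa.RecordBatch.from_arrays([pa.array([batch.num_rows])], names=['size'])
   
   # limit(1) lets the consumer stop before the UDF output reaches EOS.
   spark.range(10).mapInArrow(get_size, "size long").limit(1).collect()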
   ```
   
   fails with
   ```
   SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 12) (10.68.161.10 executor 0): org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by query. Memory leaked: (12)
   Allocator(stdin reader for /databricks/python/bin/python) 0/12/12/9223372036854775807 (res/actual/peak/limit)
   ```
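
The failure mode can be illustrated without Spark. The following is a minimal pure-Python sketch, not the actual patch: `TrackingAllocator`, `read_batches`, and `take_first` are hypothetical names, and the 12-byte buffer size is chosen only to mirror the number in the error above. It shows how a reader that holds a buffer while a batch is in flight leaves that buffer allocated if the consumer stops before EOS, unless the reader releases it when it is closed.

```python
class TrackingAllocator:
    """Hypothetical stand-in for a buffer allocator that counts outstanding bytes."""

    def __init__(self):
        self.allocated = 0

    def allocate(self, nbytes):
        self.allocated += nbytes
        return nbytes

    def release(self, nbytes):
        self.allocated -= nbytes


def read_batches(allocator, sizes, release_on_close):
    """Yield one 'batch' at a time; its buffer is held until the next batch is requested."""
    held = 0
    try:
        for size in sizes:
            held = allocator.allocate(size)
            yield size
            allocator.release(held)
            held = 0
    finally:
        # Runs on normal exhaustion and also when the consumer calls close() early.
        if release_on_close and held:
            allocator.release(held)


def take_first(allocator, release_on_close):
    """Consume a single batch and stop, similar in spirit to limit(1)."""
    reader = read_batches(allocator, [12, 12, 12], release_on_close)
    next(reader)    # the consumer reads one batch and stops before EOS
    reader.close()  # raises GeneratorExit inside the generator at the yield
    return allocator.allocated


leaky = take_first(TrackingAllocator(), release_on_close=False)
fixed = take_first(TrackingAllocator(), release_on_close=True)
print(leaky, fixed)  # 12 0
```

Without the cleanup in `finally`, the buffer for the in-flight batch is still counted as allocated when the task finishes, which is the kind of condition the allocator check above reports.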
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, bug-fix
   
   
   ### How was this patch tested?
   added tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   


