andygrove opened a new issue, #2162:
URL: https://github.com/apache/datafusion-comet/issues/2162

   ### Describe the bug
   
   Native code fetches batches from the JVM using `CometBatchIterator`, which is 
called from `ScanExec`.
   
   We have seen memory corruption unless `ScanExec` takes a deep copy of the 
arrays received from `CometBatchIterator`. On further analysis, it is now clear 
that the JVM does not retain ownership of the arrays once they are exported to 
native code. This means that the underlying Arrow buffers are released back to a 
pool and can be overwritten while native code is still referencing them.
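
To make the failure mode concrete, here is a minimal, self-contained simulation of the pattern described above. It is not Comet code: `Pool`, `consume`, and `BufferReuseSketch` are hypothetical names, and plain on-heap `int[]` arrays stand in for Arrow's off-heap buffers. The values are borrowed from the logs in this issue purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

final class BufferReuseSketch {
    /** Hypothetical stand-in for an allocator that recycles released buffers. */
    static final class Pool {
        private final ArrayDeque<int[]> free = new ArrayDeque<>();

        int[] allocate(int len) {
            int[] recycled = free.poll();
            // Reuse a released buffer when one of the right size is available.
            return (recycled != null && recycled.length == len) ? recycled : new int[len];
        }

        void release(int[] buffer) {
            free.push(buffer);
        }
    }

    /** Returns what the consumer observes after the producer releases its buffer. */
    static int[] consume(boolean deepCopy) {
        Pool pool = new Pool();

        // Producer fills a buffer and "exports" it to the consumer.
        int[] exported = pool.allocate(2);
        exported[0] = 6;
        exported[1] = 7;

        // Consumer either aliases the producer's buffer or takes its own copy.
        int[] consumerView = deepCopy ? Arrays.copyOf(exported, exported.length) : exported;

        // Producer closes its vector: the buffer goes back to the pool...
        pool.release(exported);

        // ...and the next allocation reuses and overwrites the same memory.
        int[] next = pool.allocate(2);
        next[0] = -1342011280;
        next[1] = 30839;

        return consumerView;
    }

    public static void main(String[] args) {
        System.out.println("aliased:   " + Arrays.toString(consume(false)));
        System.out.println("deep copy: " + Arrays.toString(consume(true)));
    }
}
```

In this sketch the aliased view ends up holding the overwritten values, while the deep copy still holds `[6, 7]`. That matches why copying in `ScanExec` hides the problem: the copy is independent of the producer's buffer lifetime.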
   
   I have been able to confirm with debug logging that the JVM closes 
`CometVector` instances after exporting them, while native code is still 
processing the data, leading to corruption:
   
   Native code (thread 1154171) gets a batch:
   
   ```
   [1154171] native got batch from jvm: RecordBatch { schema: Schema { fields: 
[Field { name: "col_0", data_type: Int32, nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} }, Field { name: "col_1", data_type: 
Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], 
metadata: {} }, columns: [PrimitiveArray<Int32>
   [
     6,
     7,
   ], PrimitiveArray<Int32>
   [
     8,
     9,
   ]], row_count: 2 }
   ```
   
   JVM closes the vectors:
   
   ```
   [Executor task launch worker for task 2.0 in stage 15.0 (TID 41)] 
CometVector.close() [6, 7]
   [Executor task launch worker for task 2.0 in stage 15.0 (TID 41)] 
CometVector.close() [8, 9]
   ```
   
   Native code (still thread 1154171) continues processing, but the buffers have 
been freed or overwritten:
   
   ```
   [1154171] writing shuffle batch: RecordBatch { schema: Schema { fields: 
[Field { name: "col_0", data_type: Int32, nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} }, Field { name: "col_1", data_type: 
Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], 
metadata: {} }, columns: [PrimitiveArray<Int32>
   [
     -1342011280,
     30839,
   ], PrimitiveArray<Int32>
   [
     -1342175264,
     30839,
   ]], row_count: 2 }
   ```
   
   
   
   ### Steps to reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

