David Li created ARROW-10670:
--------------------------------

             Summary: [Python] Make self_destruct and RecordBatchReader work better together
                 Key: ARROW-10670
                 URL: https://issues.apache.org/jira/browse/ARROW-10670
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: David Li


When reading record batches via IPC, Arrow generally constructs each batch as a 
single allocation, with each column in the batch composed of slices of that 
allocation. This doesn't play well with {{to_pandas(self_destruct=True)}}: even 
though Arrow will release its references to each column, those references are 
just to slices of the larger allocation, so no memory is actually freed until 
the end of the conversion, which defeats the point.

Reallocating the batches via {{pa.concat_arrays}} avoids this, but requires a 
copy. Additionally, it's unclear whether {{pa.concat_arrays}} is suitable for 
this purpose. It would be convenient if the record batch readers could (at 
least in some cases) provide suitably allocated batches, though this may be 
hard; in Flight, for example, the batches are ultimately backed by memory 
allocated by gRPC. If that's not possible, then we should at least either 
guarantee that {{concat_arrays}} truly returns a copy, or provide an explicit 
way to copy arrays.

This came up when trying to integrate {{self_destruct}} into PySpark (see 
SPARK-32953 / https://github.com/apache/spark/pull/29818).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
