[
https://issues.apache.org/jira/browse/ARROW-10670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Li reassigned ARROW-10670:
--------------------------------
Assignee: David Li
> [Python] Make self_destruct and RecordBatchReader work better together
> ----------------------------------------------------------------------
>
> Key: ARROW-10670
> URL: https://issues.apache.org/jira/browse/ARROW-10670
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: David Li
> Assignee: David Li
> Priority: Major
>
> When reading record batches via IPC, Arrow generally constructs each batch as
> a single allocation, with each column in the batch composed of slices of that
> allocation. This doesn't play well with {{to_pandas(self_destruct=True)}}:
> even though Arrow releases the reference to each column as it is converted,
> those references are just to slices of the larger allocation, so no memory is
> actually freed until the entire conversion finishes - defeating the point of
> self-destruct.
> Reallocating the batches via {{pa.concat_arrays}} avoids this, but requires a
> copy. Additionally, it's unclear whether {{pa.concat_arrays}} is suitable for
> this purpose. It would be convenient if the record batch readers could (at
> least in some cases) provide suitably allocated batches (this may be hard,
> e.g. in Flight, where the batches are ultimately backed by memory allocated by
> gRPC). If that's not possible, then at the least we should either guarantee
> that {{concat_arrays}} truly returns a copy, or provide an explicit way to
> copy arrays.
> This came up when trying to integrate {{self_destruct}} into PySpark (see
> SPARK-32953/https://github.com/apache/spark/pull/29818).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)