[ 
https://issues.apache.org/jira/browse/ARROW-10670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-10670:
--------------------------------

    Assignee: David Li

> [Python] Make self_destruct and RecordBatchReader work better together
> ----------------------------------------------------------------------
>
>                 Key: ARROW-10670
>                 URL: https://issues.apache.org/jira/browse/ARROW-10670
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: David Li
>            Assignee: David Li
>            Priority: Major
>
> When reading record batches via IPC, Arrow generally constructs each batch as 
> a single allocation, with each column in the batch composed of slices of that 
> allocation. This doesn't play well with {{to_pandas(self_destruct=True)}}: 
> even though Arrow releases its reference to each column, those references 
> are just to slices of a larger allocation, so no memory is actually freed 
> until the end of the conversion - defeating the point.
>
> Reallocating the batches via {{pa.concat_arrays}} avoids this but requires a 
> copy. Additionally, it's unclear whether {{pa.concat_arrays}} is suitable for 
> this purpose. It would be convenient if the record batch readers could (at 
> least in some cases) provide suitably allocated batches, though this may be 
> hard - in Flight, for example, the batches are ultimately backed by memory 
> allocated by gRPC. If that's not possible, we should at least either 
> guarantee that {{concat_arrays}} truly returns a copy or provide an explicit 
> way to copy arrays.
>
> This came up when trying to integrate {{self_destruct}} into PySpark (see 
> SPARK-32953 / https://github.com/apache/spark/pull/29818).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
