[ https://issues.apache.org/jira/browse/ARROW-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15771096#comment-15771096 ]
Wes McKinney commented on ARROW-432:
------------------------------------
PR: https://github.com/apache/arrow/pull/251. This was arduous, but worth it.
> [Python] Avoid unnecessary memory copy in to_pandas conversion by using
> low-level pandas internals APIs
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-432
> URL: https://issues.apache.org/jira/browse/ARROW-432
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>
> I'll take this one on.
> While we're efficiently constructing individual NumPy arrays for pandas, even
> in the zero-copy case pandas.DataFrame will perform an extra memory copy and
> consolidation step internally at the end.
> This is particular to the pandas 0.x/1.x memory layout, and will change in
> the future with pandas 2.0, but that's quite a ways off from wide use.
> We can avoid this overhead for now by:
> * computing the exact internal "block" structure of the DataFrame. Since we
> know the null counts of the Arrow data, we can determine up front whether
> type casts to accommodate nulls are necessary
> * pre-allocating empty column-major blocks
> * writing each column's data into its block slice
> * constructing the DataFrame from the blocks with zero copy
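
The first three steps in the list above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the code from the PR: the `(name, values, null_mask)` column tuples and the helpers `plan_blocks` and `fill_blocks` are hypothetical stand-ins for Arrow columns, only integer, boolean, and float columns are handled, and each block is assumed to be a 2-D array with one row per column so that every column occupies a contiguous slice.

```python
import numpy as np


def plan_blocks(columns):
    """Group column positions by the dtype each column needs in the output.

    `columns` is a list of hypothetical (name, values, null_mask) tuples.
    """
    plan = {}
    for pos, (_name, values, mask) in enumerate(columns):
        dtype = values.dtype
        null_count = int(mask.sum())
        # Integers and booleans cannot hold NaN, so the cast needed to
        # accommodate nulls is decided here, before anything is allocated.
        if null_count > 0:
            if dtype.kind in "iu":
                dtype = np.dtype("float64")
            elif dtype.kind == "b":
                dtype = np.dtype("object")
        plan.setdefault(dtype, []).append(pos)
    return plan


def fill_blocks(columns, n_rows):
    """Pre-allocate one 2-D block per dtype and write each column into it."""
    blocks = {}
    for dtype, positions in plan_blocks(columns).items():
        # One row per column: block[i, :] is a contiguous slice for column i.
        block = np.empty((len(positions), n_rows), dtype=dtype)
        for i, pos in enumerate(positions):
            _name, values, mask = columns[pos]
            block[i, :] = values  # any planned cast happens on this write
            if mask.any():
                block[i, mask] = None if dtype == object else np.nan
        blocks[dtype] = (positions, block)
    return blocks


columns = [
    ("a", np.array([1, 2, 3], dtype="int64"), np.array([False, False, False])),
    ("b", np.array([1, 2, 3], dtype="int64"), np.array([False, True, False])),
    ("c", np.array([0.5, 1.5, 2.5]), np.array([False, False, False])),
]
blocks = fill_blocks(columns, n_rows=3)
```

Here column "b" has one null, so it is planned into the float64 block alongside "c", while "a" stays in an int64 block. The last step, handing the blocks to pandas without a copy, depends on pandas internals that differ between versions and is left out of this sketch.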
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)