Wes McKinney created ARROW-432:
----------------------------------

             Summary: [Python] Avoid unnecessary memory copy in to_pandas 
conversion by using low-level pandas internals APIs
                 Key: ARROW-432
                 URL: https://issues.apache.org/jira/browse/ARROW-432
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Wes McKinney
            Assignee: Wes McKinney


I'll take this one on. 

While we're efficiently constructing individual NumPy arrays for pandas, even 
in the zero-copy case pandas.DataFrame will perform an extra memory copy and 
consolidation step internally at the end. 

This is particular to the pandas 0.x/1.x memory layout, and will change in the 
future with pandas 2.0, but that's quite a ways off from wide use. 

We can avoid this overhead for now by:

* computing the exact internal "block" structure of the DataFrame. Since we 
know the null counts of the Arrow data, we can determine up front whether 
type casts to accommodate nulls are necessary

* pre-allocating empty column-major blocks

* writing the converted data directly into the block slices

* constructing the DataFrame from the blocks with zero copy
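
To make the first step concrete, here is a rough sketch of the block planning 
in plain Python. It is only an illustration under stated assumptions, not the 
actual implementation: the names result_dtype, plan_blocks and block_shapes 
are hypothetical, the type names are simplified strings, and the cast rules 
(integers with nulls become float64, booleans with nulls become object) are 
assumed here for the example. The pre-allocation, the writes into block 
slices, and the DataFrame construction are not shown.

```python
# Hypothetical mapping from a simplified Arrow type name to the NumPy dtype
# name used when the column contains no nulls.
NO_NULLS_DTYPE = {
    "int32": "int32",
    "int64": "int64",
    "float64": "float64",
    "bool": "bool",
    "string": "object",
}


def result_dtype(arrow_type, null_count):
    """Pick the dtype a column needs once its null count is known."""
    if null_count > 0:
        if arrow_type.startswith("int"):
            return "float64"  # assumed rule: NaN stands in for nulls
        if arrow_type == "bool":
            return "object"  # assumed rule: None stands in for nulls
    return NO_NULLS_DTYPE[arrow_type]


def plan_blocks(columns):
    """Group column positions by result dtype, one group per block.

    `columns` is a list of (name, arrow_type, null_count) tuples. The result
    maps each dtype name to the positions of the columns stored in that block.
    """
    blocks = {}
    for position, (_name, arrow_type, null_count) in enumerate(columns):
        dtype = result_dtype(arrow_type, null_count)
        blocks.setdefault(dtype, []).append(position)
    return blocks


def block_shapes(blocks, num_rows):
    """Shape of each block to pre-allocate: (columns in block, rows)."""
    return {dtype: (len(placement), num_rows)
            for dtype, placement in blocks.items()}


columns = [
    ("a", "int64", 0),
    ("b", "int64", 3),    # has nulls, so it joins the float64 block
    ("c", "float64", 0),
    ("d", "bool", 0),
    ("e", "string", 1),
]
plan = plan_blocks(columns)
shapes = block_shapes(plan, num_rows=1000)
```

With the plan in hand, each block could be allocated once with the computed 
shape, and every column written into its own slice of the block, so no later 
cast or consolidation pass is needed.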



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
