Hi - I've been looking through the Arrow format specification for ways to
allow zero-copy creation of Pandas DataFrames (beyond `split_blocks`). Am I
right in thinking that if you created an Arrow file (let's say of `m` rows
and `n` columns of `float64`s for now) as a single RecordBatch, then the
Arrow file would have all the `float64` value buffers laid out end-to-end,
effectively forming one long `m * n` 1D `float64` array (equivalently, an
`n` x `m` 2D array in C-order, i.e. the `m` x `n` table in column-major
order)? If so, it seems like you could hand that 2D grid of values straight
to the Pandas BlockManager as a single block when exposing the Arrow file as
a DataFrame, so no consolidation would ever be triggered no matter what
operations you did on the DataFrame (unlike `split_blocks`).
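
To make the idea concrete, here's a rough sketch of the kind of thing I have
in mind (the `zero_copy_dataframe` helper below is purely illustrative, not
existing pyarrow API; it assumes a single RecordBatch of null-free `float64`
columns whose mapped value buffers really do sit end-to-end):

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.ipc

ITEMSIZE = np.dtype("float64").itemsize  # 8 bytes

def zero_copy_dataframe(path):
    """Map an Arrow IPC file and expose it to pandas as one 2D block.

    Only handles the narrow case above: one RecordBatch of null-free
    float64 columns whose value buffers happen to sit end-to-end in the
    mapped file (IPC buffer padding can break this, hence the check).
    """
    source = pa.memory_map(path, "r")      # buffers point into the mmap, no copy
    reader = pa.ipc.open_file(source)
    assert reader.num_record_batches == 1
    batch = reader.get_batch(0)

    m = batch.num_rows
    data_bufs = []
    for col in batch.columns:
        if col.null_count or not pa.types.is_float64(col.type):
            raise ValueError("only null-free float64 columns are handled here")
        data_bufs.append(col.buffers()[1])  # buffers() -> [validity, values]

    # The value buffers must be truly end-to-end for a single 2D view to work.
    for prev, nxt in zip(data_bufs, data_bufs[1:]):
        if prev.address + m * ITEMSIZE != nxt.address:
            raise ValueError("column buffers are not contiguous (padding?)")

    # One (n_cols, m) C-order view over the whole region -- the same layout a
    # pandas block uses internally.  `base=batch` keeps the mapping alive.
    region = pa.foreign_buffer(data_bufs[0].address,
                               m * ITEMSIZE * len(data_bufs), base=batch)
    values = np.frombuffer(region, dtype=np.float64).reshape(len(data_bufs), m)

    # values.T is an (m, n_cols) column-major view; copy=False asks pandas to
    # adopt it as a single block (the mmap is read-only, so writes would fail).
    return pd.DataFrame(values.T, columns=batch.schema.names, copy=False)
```

Presumably the real thing would live inside `to_pandas` rather than as a
standalone helper like this, but hopefully it shows the shape of the idea.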

Do you think this is a useful optimisation to pursue (I'm happy to work on
the implementation)? Or would it be too fragile or too rarely triggered to
be a useful addition to the codebase? Thanks -

Nick
