westonpace commented on issue #35394:
URL: https://github.com/apache/arrow/issues/35394#issuecomment-1540512056
The Arrow spec defines two things:
* Instructions on how to lay out data in buffers
* Instructions on how to write out all the metadata in the IPC format
When we go from IPC -> C++ we keep the buffers identical (this is what is
meant by "zero-copy"). However, the metadata is converted from flatbuffers to
C++ objects (we generally don't consider the metadata when we say "zero-copy").
For example, the flatbuffers "Schema" table (defined here
https://github.com/apache/arrow/blob/18c976048bc989cf9d2c31139b67f7cc8e143d66/format/Schema.fbs#L517)
becomes the Arrow-C++ `arrow::Schema` object (which holds, for example, its
fields in a `std::vector`). A `pyarrow.Schema` object then has (via Cython) a
`std::shared_ptr<arrow::Schema>`.
So there are a few options:
* If you just need the buffers you can easily get them with `pyarrow` (e.g.
`pa.array([1, 2, 3]).buffers()[1].to_pybytes()`). The contents of these
buffers are stable and defined by the Arrow spec.
* Serialize to the IPC format (e.g. `pa.ipc.RecordBatchStreamWriter`). The
contents are stable and defined by the Arrow IPC spec.
* Export through the C data interface. The contents are stable and defined by
the Arrow C data interface spec.