rustyconover opened a new issue, #49285:
URL: https://github.com/apache/arrow/issues/49285
### Describe the enhancement requested
When serializing many `RecordBatch` objects in a hot loop (e.g. streaming
IPC to a socket, writing to shared memory), `RecordBatch.serialize()` allocates
a new buffer on every call. This creates unnecessary allocation pressure when
the caller already knows the required size and could reuse a single buffer
across calls.
It would be useful if `serialize()` accepted an optional `buffer` parameter
so callers can provide a pre-allocated mutable buffer to serialize into
directly.
## Example usage
```python
import pyarrow as pa

batches = [...]  # many RecordBatches with the same schema

# Pre-allocate once
size = max(pa.ipc.get_record_batch_size(b) for b in batches)
buf = pa.allocate_buffer(size)

for batch in batches:
    result = batch.serialize(buffer=buf)
    send_over_network(result)  # result is a zero-copy slice of buf
```
## New behavior
- `batch.serialize(buffer=buf)` serializes directly into the provided buffer
  and returns a zero-copy slice of it with the exact serialized size.
- If the buffer is too small, a `ValueError` is raised with a message
  indicating the required vs. available size.
- If the buffer is not mutable, a `ValueError` is raised.
- When `buffer` is not provided, behavior is unchanged from today.
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]