alamb commented on issue #208: URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-917615261
> And the issue is that the data buffer points to the original larger array. Then, that larger array is ultimately turned into the FlightData, which is a waste.

Yes, that is the crux of the issue we found.

> Assuming that's all correct, is there a preference as to where a fix should be applied? i.e. whether at flight_data_from_arrow_batch, encoded_batch, or record_batch_to_bytes?

I am not sure, to be honest, as I am not familiar with the flight code. Perhaps @nevi-me or @jorgecarleitao, who have more experience in how IPC / Flight is supposed to work, might have thoughts on how to handle serializing bytes for an Array whose backing `Buffer` is much larger than the Array itself.

Another avenue we could explore is to review how the C++ implementation handles this case and/or ask about it on [email protected].

One way to reduce potential unintended side effects would be to make the optimization optional (an option on [`IpcWriteOptions`](https://docs.rs/arrow/5.3.0/arrow/ipc/writer/struct.IpcWriteOptions.html), perhaps) while we test it out more broadly, and then switch the default value in a later version.

> Naively I was thinking at the record_batch_to_bytes level - but I think that might impact IPC in general.

Yes. However, maybe that is OK, as it would be optimizing the serialization of Arrow Arrays in general. I am not sure what the expectations are here, though.

> Separately, I've been looking to see if there are any methods / helpers for recreating a RecordBatch out of the data / offsets / len of another RecordBatch.

`RecordBatch::slice` is what I know of for this purpose: https://docs.rs/arrow/5.3.0/arrow/record_batch/struct.RecordBatch.html#method.slice
