ianmcook commented on issue #40597: URL: https://github.com/apache/arrow/issues/40597#issuecomment-2509849854
@paleolimbot @kou after working on https://github.com/apache/arrow-experiments/pull/43 and apache/arrow-experiments#44, now I better understand your comments here.

I think a common scenario for Arrow-over-HTTP will be this:
- On the server, there is an ordered set of Arrow IPC stream files with the same schema.
- On the client, you want to create a lazy iterator of record batches for these files. You want to defer downloading record batches from the server until the user/application calls for record batches.

In this scenario, server support for range requests is needed. It would also help to have some metadata about the Arrow IPC stream files, including the sizes of the files and the offsets of the record batches in the files. This would allow the client to avoid using inefficient trial-and-error approaches.

I think I am still -1 on the idea to use the IPC _file_ format on the server to achieve this. For one, this is problematic:
> In the file format, there is no requirement that dictionary keys should be defined in a DictionaryBatch before they are used in a RecordBatch

But I am +1 on the idea to define a standard way to serialize IPC stream file metadata (file sizes and record batch offsets) to JSON to enable this.
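
To make the scenario concrete, here is a rough sketch of how a client might turn such metadata into deferred range requests. Everything in it is an assumption for illustration: no JSON layout has been standardized, so the `files`, `url`, `size`, and `batch_offsets` keys are hypothetical, as are the `plan_ranges` and `lazy_batches` helpers and the `fetch` callable. It returns raw bytes only and does not decode them; a real client would also need the schema message (and any dictionary batches) that precede the first record batch, and the last range of each file here would include any trailing end-of-stream marker.

```python
import json
from typing import Callable, Iterator

# Hypothetical manifest: one entry per IPC stream file, in order.
# "batch_offsets" are assumed to be byte offsets where each record batch starts.
MANIFEST_JSON = """
{
  "files": [
    {"url": "part-0.arrows", "size": 100, "batch_offsets": [40, 70]},
    {"url": "part-1.arrows", "size": 90, "batch_offsets": [40]}
  ]
}
"""


def plan_ranges(manifest: dict) -> Iterator[tuple[str, int, int]]:
    """Yield (url, first_byte, last_byte) for every record batch, in order."""
    for entry in manifest["files"]:
        offsets = entry["batch_offsets"]
        # A batch is assumed to run up to the next batch's offset,
        # or to the end of the file for the last batch.
        ends = offsets[1:] + [entry["size"]]
        for start, end in zip(offsets, ends):
            # HTTP byte ranges are inclusive on both ends.
            yield entry["url"], start, end - 1


def lazy_batches(
    manifest: dict, fetch: Callable[[str, str], bytes]
) -> Iterator[bytes]:
    """Yield the raw bytes of each batch, fetching only when the caller asks.

    `fetch(url, range_header)` is a caller-supplied function that performs
    the HTTP GET with the given Range header value and returns the body.
    """
    for url, start, end in plan_ranges(manifest):
        yield fetch(url, f"bytes={start}-{end}")


manifest = json.loads(MANIFEST_JSON)
```

Because `lazy_batches` is a generator, nothing is downloaded until the caller pulls the next item, and no trial-and-error probing of the files is needed since every range is known up front from the metadata.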
