ianmcook commented on issue #40597:
URL: https://github.com/apache/arrow/issues/40597#issuecomment-2509849854

   @paleolimbot @kou after working on 
https://github.com/apache/arrow-experiments/pull/43 and 
apache/arrow-experiments#44, now I better understand your comments here.
   
   I think a common scenario for Arrow-over-HTTP will be this:
   - On the server, there is an ordered set of Arrow IPC stream files with the 
same schema.
   - On the client, you want to create a lazy iterator of record batches for 
these files. You want to defer downloading record batches from the server until 
the user/application calls for record batches.
   
   In this scenario, the server must support HTTP range requests. It would also 
help to have some metadata about the Arrow IPC stream files, including the 
file sizes and the byte offsets of the record batches within each file. With 
that metadata, the client can request exact byte ranges instead of relying on 
inefficient trial and error.
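
A minimal sketch of the lazy pattern described above, assuming the client already knows the file size and the byte offset at which each record batch message starts. The names `batch_ranges`, `range_header`, and `lazy_batches` are hypothetical, and `fetch` stands in for whatever HTTP client call sends a request with the given `Range` header value. Decoding the returned bytes into record batches (which also requires the schema message at the start of the stream) is deliberately left out.

```python
from typing import Callable, Iterator, List, Tuple


def batch_ranges(offsets: List[int], file_size: int) -> List[Tuple[int, int]]:
    """Turn sorted start offsets into inclusive (first, last) byte ranges.

    Assumption: each batch runs up to the next offset, and the last one runs
    to the end of the file (so it may include any end-of-stream marker).
    """
    ends = offsets[1:] + [file_size]
    return [(start, end - 1) for start, end in zip(offsets, ends)]


def range_header(first: int, last: int) -> str:
    """Format an HTTP Range header value; both bounds are inclusive."""
    return f"bytes={first}-{last}"


def lazy_batches(
    offsets: List[int],
    file_size: int,
    fetch: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Yield the raw bytes of each batch, fetching only when one is requested."""
    for first, last in batch_ranges(offsets, file_size):
        yield fetch(range_header(first, last))
```

Because `lazy_batches` is a generator, no request is issued until the caller asks for the next item, which is the deferred-download behavior described above.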
   
   I think I am still -1 on the idea of using the IPC _file_ format on the 
server to achieve this. For one, this is problematic:
   > In the file format, there is no requirement that dictionary keys should be 
defined in a DictionaryBatch before they are used in a RecordBatch
   
   But I am +1 on the idea of defining a standard way to serialize IPC stream 
file metadata (file sizes and record batch offsets) to JSON to enable this.
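
To make the proposal concrete, here is one possible shape for such a JSON document, with a helper that turns it into an ordered list of range requests. This is only an illustration: no such standard exists yet, and the keys `files`, `url`, `size`, and `record_batch_offsets`, the example file names, and the functions `dump_metadata` and `plan_requests` are all hypothetical.

```python
import json
from typing import Dict, Iterator, List, Tuple


def dump_metadata(files: List[Dict]) -> str:
    """Serialize per-file metadata (hypothetical layout) to JSON."""
    return json.dumps({"files": files}, indent=2)


def plan_requests(metadata_json: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, Range header value) pairs in file order, then batch order.

    Assumption: offsets are sorted byte positions where each record batch
    starts, and the last batch in a file runs to the end of that file.
    """
    meta = json.loads(metadata_json)
    for f in meta["files"]:
        offsets = f["record_batch_offsets"]
        ends = offsets[1:] + [f["size"]]
        for start, end in zip(offsets, ends):
            # HTTP byte ranges are inclusive on both ends.
            yield f["url"], f"bytes={start}-{end - 1}"


# Hypothetical example: two stream files that share one schema.
example = dump_metadata([
    {"url": "part-0.arrows", "size": 1000, "record_batch_offsets": [200, 600]},
    {"url": "part-1.arrows", "size": 500, "record_batch_offsets": [200]},
])
```

With a document like this, a client can compute every request up front and issue each one only when the corresponding record batch is needed.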


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
