paleolimbot commented on issue #40597:
URL: https://github.com/apache/arrow/issues/40597#issuecomment-2510478910

   There are quite a lot of ways to write a scanner (and quite a lot of opinions 
about how they should be written), but in general they all benefit from having 
more information available sooner (unless they are given so much information 
that the act of downloading it or generating it is itself a slow operation). 
This information could be opt-in if that's a concern.
   
   Every version of this I know about would want to have the schema available 
from the JSON response, and would like to have enough information to make a 
plan for downloading. If there are more stream files than consumer IO threads 
(e.g., the "downloading from my laptop" scenario), the batch information is 
probably not needed. But for clients with a massive number of IO threads (e.g., 
the "spark cluster" scenario), the batch information would allow that scan to 
be distributed much more effectively.
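
   As a rough illustration of that trade-off, here is a minimal Python sketch of 
a download planner. Everything in it is hypothetical: the `Block` type, the 
`plan_tasks` name, and the shape of the per-URI batch information are 
assumptions for the sake of the example, not part of any existing API. With 
more streams than IO threads it hands out one task per stream; otherwise it 
splits streams into one byte-range task per batch when that information exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical per-batch entry: byte offset and total length in the stream."""
    offset: int
    length: int


def plan_tasks(streams, n_io_threads):
    """Return a list of (uri, start, end) download tasks.

    `streams` maps each URI to a list of Block entries (empty if the server
    sent no batch information). A task with start/end of None means
    "download the whole stream".
    """
    have_batch_info = any(streams.values())
    # Laptop scenario: more streams than threads, so whole-stream tasks
    # already keep every thread busy.
    if len(streams) > n_io_threads or not have_batch_info:
        return [(uri, None, None) for uri in streams]

    # Cluster scenario: more threads than streams, so split by batch.
    tasks = []
    for uri, blocks in streams.items():
        if not blocks:
            tasks.append((uri, None, None))
        for b in blocks:
            tasks.append((uri, b.offset, b.offset + b.length))
    return tasks
```

   The threshold used here (stream count versus thread count) is only one 
possible heuristic; a real scanner would probably also weigh batch sizes.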
   
   > Can we reuse any existing specification/mechanism for this?
   
   The file footer? But maybe as JSON and one per URI?
   
   
https://github.com/apache/arrow/blob/88c704eb6b6432fd69df9492baa1da2e3c3f5a31/format/File.fbs#L26-L37
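
   To make the "footer as JSON, one per URI" idea concrete, here is a small 
Python sketch. The block field names (`offset`, `metaDataLength`, 
`bodyLength`) follow the `Block` struct in File.fbs; everything else (the 
`footer_to_json` helper, the `uri` key, and the simplified schema layout) is 
an assumption made up for illustration, not a proposed or existing format.

```python
import json


def footer_to_json(uri, schema_fields, record_batches, dictionaries=()):
    """Serialize a hypothetical JSON footer for one stream URI.

    `schema_fields` is a list of (name, type_name) pairs; `record_batches`
    and `dictionaries` are lists of (offset, metadata_length, body_length).
    """
    def block(offset, metadata_length, body_length):
        # Field names mirror the Block struct in File.fbs.
        return {
            "offset": offset,
            "metaDataLength": metadata_length,
            "bodyLength": body_length,
        }

    return json.dumps({
        "uri": uri,  # hypothetical key: which stream this footer describes
        "schema": {"fields": [{"name": n, "type": t} for n, t in schema_fields]},
        "dictionaries": [block(*d) for d in dictionaries],
        "recordBatches": [block(*b) for b in record_batches],
    })


# Example: one stream with a single int32 column and one record batch.
print(footer_to_json(
    "https://example.org/part-0.arrows",
    [("x", "int32")],
    [(0, 256, 1024)],
))
```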

