martin-traverse commented on issue #24575:
URL: https://github.com/apache/arrow/issues/24575#issuecomment-1483293086

   Hello - just wondering if anyone is still thinking about this? We have a 
data platform where this would be extremely useful.
   
   Looking at the Footer structure, the block information is already in there 
as a list. Since this structure has to be read anyway, would it be sufficient 
to add, e.g., a recordOffset property to the Block, meaningful only for record 
batches? I would think a simple solution like this would allow paginated 
retrieval and processing of moderately large datasets, at least up to a few 
million batches. Compared to scanning through the batches one by one from cloud 
storage, it would be a big win.
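
   To make the idea concrete, here is a minimal sketch of how a reader might 
pick the blocks for one page of rows. It assumes each Block carries the 
proposed recordOffset (the index of the first row in that batch), which does 
not exist in the format today. The Python class, the field name record_offset 
and the function blocks_for_page are illustrative only, not an Arrow API.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    offset: int            # byte offset of the message in the file
    meta_data_length: int  # length of the message metadata
    body_length: int       # length of the message body
    record_offset: int     # HYPOTHETICAL: index of the first row in this batch


def blocks_for_page(blocks, first_row, row_count):
    """Return (blocks covering the page, rows to skip in the first block).

    The page is rows [first_row, first_row + row_count). Without the length
    of the last batch we cannot detect a page that starts past the end of
    the file, so the caller has to handle that case.
    """
    if not blocks or row_count <= 0 or first_row < 0:
        return [], 0
    starts = [b.record_offset for b in blocks]  # ascending by construction
    first = bisect_right(starts, first_row) - 1
    last = bisect_right(starts, first_row + row_count - 1) - 1
    return blocks[first:last + 1], first_row - starts[first]
```

   Only the footer is needed to answer the lookup, so a reader on cloud 
storage could then fetch just the byte ranges of the selected blocks.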
   
   In terms of adding to the language APIs, in our case we are already working 
at the FlatBuffers / batch level because we need non-blocking data streams, so 
just having it in the file format would be enough for us. My guess is that 
adding it to the language APIs would make it more generally useful though.
   
   For the pre-built index discussed above, I'd think this is only needed if 
(a) the number of batches is very large, (b) the arrangement of batches is 
very asymmetrical (e.g. lots of big batches followed by lots of small batches), 
and (c) the file is read a lot more often than it is written. Perhaps this 
index structure could be added later, with readers falling back in the 
meantime to either bisecting the list of blocks or generating an index at read 
time. I guess it depends on how much extra effort is needed.
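
   As a rough illustration of the read-time fallback, the sketch below builds 
a cumulative row index from per-batch row counts and then bisects it. How the 
row counts are obtained is left open (the proposed recordOffset would supply 
them directly). The function names build_row_index and locate_row_in_index 
are made up for this example.

```python
from bisect import bisect_right
from itertools import accumulate


def build_row_index(batch_lengths):
    """Starting row of each batch, plus the total row count as the last entry."""
    return list(accumulate(batch_lengths, initial=0))


def locate_row_in_index(index, row):
    """Return (batch number, row within that batch) for a global row number."""
    total_rows = index[-1]
    if not 0 <= row < total_rows:
        raise IndexError(f"row {row} out of range [0, {total_rows})")
    # bisect_right skips over empty batches that share the same starting row.
    batch = bisect_right(index, row) - 1
    return batch, row - index[batch]
```

   The lookup is O(log n) in the number of batches regardless of how uneven 
the batch sizes are, which is why a separate pre-built index may only pay off 
for very large files.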
   
   We have a very crude solution for now: we use a constant batch size and 
write that value to the custom metadata in the footer for datasets created by 
our platform. This is really all the functionality we need; the only problem 
is that it's not portable.
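
   For reference, the lookup under this constant-batch-size convention reduces 
to a single division. The sketch below assumes the footer's custom metadata 
has already been read into a dict of strings; the key name "batch_size" is a 
placeholder, not the key our platform or Arrow actually uses.

```python
BATCH_SIZE_KEY = "batch_size"  # placeholder key name, an assumption


def locate_row_fixed(custom_metadata, row):
    """Return (batch number, row within that batch) for a global row number.

    Valid only if every batch except possibly the last has exactly
    batch_size rows, which is the convention described above.
    """
    batch_size = int(custom_metadata[BATCH_SIZE_KEY])
    if batch_size <= 0 or row < 0:
        raise ValueError("batch size must be positive and row non-negative")
    return divmod(row, batch_size)
```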


