Jefffrey commented on PR #7962: URL: https://github.com/apache/arrow-datafusion/pull/7962#issuecomment-1789537022
> FWIW for consistency we might want to do something closer to what we do for parquet where:
>
> * We have an estimate of the size of the footer which we fetch
> * We read the actual footer size from the fetched data
> * We then fetch any extra data needed
> * Once decoded the footer provides information on the schema and where the data blocks are located
>
> This PR instead appears to read the first RecordBatch, whilst I _think_ this should work (provided the file contains data), the more standard approach might be to read the footer.
>
> Edit: I also filed [apache/arrow-rs#5021](https://github.com/apache/arrow-rs/issues/5021) which outlines some APIs we could add upstream that might help here

We could read the first chunk of the file's stream, similar to reading the last chunk of a Parquet file, and hope it contains all the data needed to decode the schema. I chose the current approach because we don't have control over how many bytes each await pulls from the stream, whereas when reading Parquet I believe we generally have more control over that.

This method shouldn't read any record batches; it simply reads the first flatbuffer message in the IPC file contents, which is expected to be a schema message: the specification states that the IPC streaming format must start with the schema message, and the IPC file format is simply an encapsulation of the IPC streaming format with some additional wrapping.
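
To make that layout concrete, here is a rough sketch (not the code in this PR) of decoding the schema from the leading bytes of an IPC file using the `arrow-ipc` crate. The function name `schema_from_ipc_prefix` is hypothetical, and it assumes the file's leading bytes are already buffered in memory, whereas the PR pulls them incrementally from an async stream:

```rust
use arrow_ipc::convert::fb_to_schema;
use arrow_ipc::root_as_message;
use arrow_schema::{ArrowError, Schema};

// The IPC file format starts with the "ARROW1" magic padded to 8 bytes,
// followed by the streaming format.
const ARROW_MAGIC: &[u8] = b"ARROW1\0\0";
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

fn schema_from_ipc_prefix(buf: &[u8]) -> Result<Schema, ArrowError> {
    // Skip the file-format magic if present, leaving the streaming format.
    let mut pos = if buf.starts_with(ARROW_MAGIC) {
        ARROW_MAGIC.len()
    } else {
        0
    };

    // An optional 0xFFFFFFFF continuation marker precedes the metadata length.
    if buf.len() >= pos + 4 && buf[pos..pos + 4] == CONTINUATION_MARKER {
        pos += 4;
    }

    // Read the little-endian length of the flatbuffer metadata.
    let len_bytes: [u8; 4] = buf
        .get(pos..pos + 4)
        .ok_or_else(|| ArrowError::ParseError("prefix too small".into()))?
        .try_into()
        .unwrap();
    let meta_len = u32::from_le_bytes(len_bytes) as usize;
    pos += 4;

    // The first flatbuffer message in a stream must be the schema message.
    let meta = buf
        .get(pos..pos + meta_len)
        .ok_or_else(|| ArrowError::ParseError("prefix too small".into()))?;
    let message = root_as_message(meta)
        .map_err(|e| ArrowError::ParseError(format!("invalid flatbuffer message: {e}")))?;
    let fb_schema = message
        .header_as_schema()
        .ok_or_else(|| ArrowError::ParseError("first IPC message is not a schema".into()))?;
    Ok(fb_to_schema(fb_schema))
}
```

When reading from a stream instead of a buffer, the same decoding applies; the extra work is accumulating bytes across awaits until the magic, length prefix, and `meta_len` bytes of metadata are all available.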
