Jefffrey commented on PR #7962: URL: https://github.com/apache/arrow-datafusion/pull/7962#issuecomment-1789537022
> FWIW for consistency we might want to do something closer to what we do for parquet where:
>
> * We have an estimate of the size of the footer which we fetch
> * We read the actual footer size from the fetched data
> * We then fetch any extra data needed
> * Once decoded the footer provides information on the schema and where the data blocks are located
>
> This PR instead appears to read the first RecordBatch, whilst I _think_ this should work (provided the file contains data), the more standard approach might be to read the footer.
>
> Edit: I also filed [apache/arrow-rs#5021](https://github.com/apache/arrow-rs/issues/5021) which outlines some APIs we could add upstream that might help here

We could read the first chunk of the file's stream, similar to reading the last chunk of a Parquet file, and hope it contains all the data needed to decode the schema. I chose the current approach because we don't have control over how many bytes each await pulls from the stream, whereas when reading Parquet I believe we generally have more control over that.

This method shouldn't read any record batches; it simply reads the first flatbuffer message in the IPC file contents, which is expected to be a schema message: the specification states that the IPC streaming format must start with the schema message, and the IPC file format is simply an encapsulation of the IPC streaming format with some additional wrapping.
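
To make that layout concrete, here is a rough sketch (not the code in this PR) of decoding the schema from the leading bytes of an IPC file using the `arrow-ipc` crate. The function name `schema_from_ipc_prefix` is hypothetical, and it assumes the file's leading bytes are already buffered in memory, whereas the PR pulls them incrementally from an async stream:

```rust
use arrow_ipc::convert::fb_to_schema;
use arrow_ipc::root_as_message;
use arrow_schema::{ArrowError, Schema};

// The IPC file format starts with the "ARROW1" magic padded to 8 bytes,
// followed by the streaming format.
const ARROW_MAGIC: &[u8] = b"ARROW1\0\0";
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

fn schema_from_ipc_prefix(buf: &[u8]) -> Result<Schema, ArrowError> {
    // Skip the file-format magic if present, leaving the streaming format.
    let mut pos = if buf.starts_with(ARROW_MAGIC) {
        ARROW_MAGIC.len()
    } else {
        0
    };

    // An optional 0xFFFFFFFF continuation marker precedes the metadata length.
    if buf.len() >= pos + 4 && buf[pos..pos + 4] == CONTINUATION_MARKER {
        pos += 4;
    }

    // Read the little-endian length of the flatbuffer metadata.
    let len_bytes: [u8; 4] = buf
        .get(pos..pos + 4)
        .ok_or_else(|| ArrowError::ParseError("prefix too small".into()))?
        .try_into()
        .unwrap();
    let meta_len = u32::from_le_bytes(len_bytes) as usize;
    pos += 4;

    // The first flatbuffer message in a stream must be the schema message.
    let meta = buf
        .get(pos..pos + meta_len)
        .ok_or_else(|| ArrowError::ParseError("prefix too small".into()))?;
    let message = root_as_message(meta)
        .map_err(|e| ArrowError::ParseError(format!("invalid flatbuffer message: {e}")))?;
    let fb_schema = message
        .header_as_schema()
        .ok_or_else(|| ArrowError::ParseError("first IPC message is not a schema".into()))?;
    Ok(fb_to_schema(fb_schema))
}
```

When reading from a stream instead of a buffer, the same decoding applies; the extra work is accumulating bytes across awaits until the magic, length prefix, and `meta_len` bytes of metadata are all available.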
