Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

via GitHub Wed, 01 Nov 2023 12:29:32 -0700


Jefffrey commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1379219105



##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,177 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];

Review Comment:
   > What do you think about moving this logic upstream into the arrow-rs
   > reader.
   
   Yes, this logic was essentially lifted from `StreamReader` and `FileReader`
   in `arrow-ipc`, adjusted to work with an async stream of bytes. We could
   move it into `arrow-ipc`, but keep in mind that although we receive a
   stream of bytes, those bytes are in the IPC file format, not the IPC
   streaming format. So an `AsyncStreamReader` might not quite fit our use
   case, whereas an `AsyncFileReader` could, though it might be limited when
   decoding the rest of the data if it doesn't read the footer.
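
For illustration, here is a rough sketch of the kind of prefix parsing involved. It is not the code in this PR. `schema_message_range` is a hypothetical helper name, and the layout it assumes (6-byte magic, 2 bytes of padding, an optional `0xFFFFFFFF` continuation marker, then a little-endian `i32` metadata length) is my reading of the IPC file format. Decoding the flatbuffer metadata itself is left out.

```rust
use std::ops::Range;

const ARROW_MAGIC: [u8; 6] = *b"ARROW1";
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

/// Hypothetical helper: given the bytes buffered so far from the start of an
/// IPC *file*, return the byte range holding the schema message metadata.
/// `Ok(None)` means more bytes must be pulled from the stream first.
fn schema_message_range(prefix: &[u8]) -> Result<Option<Range<usize>>, String> {
    // Magic (6 bytes) + padding (2 bytes) + first 4-byte word.
    if prefix.len() < 12 {
        return Ok(None);
    }
    if !prefix.starts_with(&ARROW_MAGIC) {
        return Err("missing ARROW1 magic at start of file".to_string());
    }

    let mut pos = 8;
    let mut word = [0u8; 4];
    word.copy_from_slice(&prefix[pos..pos + 4]);
    pos += 4;

    // Newer writers emit a continuation marker before the length;
    // older ones write the length directly.
    if word == CONTINUATION_MARKER {
        if prefix.len() < pos + 4 {
            return Ok(None);
        }
        word.copy_from_slice(&prefix[pos..pos + 4]);
        pos += 4;
    }

    let meta_len = i32::from_le_bytes(word);
    if meta_len < 0 {
        return Err("negative metadata length".to_string());
    }
    Ok(Some(pos..pos + meta_len as usize))
}
```

A caller would keep appending chunks from the stream to a buffer until this returns `Some(range)` and the buffer covers `range.end`, then stop reading. Nothing past the schema message is needed for inference.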
   
   > I think I would rather the approach described on
   > https://github.com/apache/arrow-rs/issues/5021, reading the footer is more
   > generally useful, providing information beyond just the schema
   
   That ticket seems appropriate for this. One caveat: the footer sits at the
   end of an IPC file, so we can't read it from a stream without reverting to
   the old approach of reading the entire stream just to decode the schema.
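
To make the footer point concrete, a minimal sketch under the assumption that an IPC file ends with the footer, a little-endian `i32` footer length, and the 6-byte `ARROW1` magic. `footer_range` is a hypothetical name, not an `arrow-ipc` API. Note that it needs both the total file length and the last 10 bytes, neither of which a forward-only stream can provide until it has been fully consumed.

```rust
use std::ops::Range;

const ARROW_MAGIC: [u8; 6] = *b"ARROW1";

/// Hypothetical helper: locate the footer given the total file length and
/// the final 10 bytes of the file (4-byte footer length + 6-byte magic).
fn footer_range(file_len: u64, trailer: &[u8; 10]) -> Result<Range<u64>, String> {
    if !trailer.ends_with(&ARROW_MAGIC) {
        return Err("missing ARROW1 magic at end of file".to_string());
    }

    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(&trailer[..4]);
    let footer_len = i32::from_le_bytes(len_bytes);
    if footer_len < 0 {
        return Err("negative footer length".to_string());
    }

    // The footer ends where the 10-byte trailer begins.
    let end = file_len
        .checked_sub(10)
        .ok_or_else(|| "file too short for trailer".to_string())?;
    let start = end
        .checked_sub(footer_len as u64)
        .ok_or_else(|| "footer length exceeds file size".to_string())?;
    Ok(start..end)
}
```

With a seekable reader or a store that supports range requests this is cheap, but with only a byte stream the trailer arrives last.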



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
