2010YOUY01 commented on code in PR #18457:
URL: https://github.com/apache/datafusion/pull/18457#discussion_r2509455733


##########
datafusion/sqllogictest/test_files/arrow_files.slt:
##########
@@ -128,3 +128,228 @@ physical_plan DataSourceExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/
 # Errors in partition filters should be reported
 query error Divide by zero error
 SELECT f0 FROM arrow_partitioned WHERE CASE WHEN true THEN 1 / 0 ELSE part END 
= 1;
+
+#############
+## Arrow IPC stream format support
+#############

Review Comment:
   Test coverage is great! I recommend to include two additional edge cases:
   1. Query from an empty Arrow IPC Stream file: we can generate a valid but 
empty IPC Stream test file, if you `select * from strema_file` it returns an 
empty set (if this is possible)
   2. Query from an corrupted stream file: the file ends with `.arrow` so we're 
inferring it as arrow file, but its internal is broken somewhere so the result 
is an error



##########
datafusion/datasource-arrow/src/file_format.rs:
##########
@@ -349,94 +376,122 @@ impl DataSink for ArrowFileSink {
     }
 }
 
+// Custom implementation of inferring schema. Should eventually be moved 
upstream to arrow-rs.
+// See <https://github.com/apache/arrow-rs/issues/5021>
+
 const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];
 const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];
 
-/// Custom implementation of inferring schema. Should eventually be moved 
upstream to arrow-rs.
-/// See <https://github.com/apache/arrow-rs/issues/5021>
-async fn infer_schema_from_file_stream(
+async fn infer_stream_schema(

Review Comment:
   Let's add a link to the format spec here.



##########
datafusion/datasource-arrow/src/source.rs:
##########
@@ -36,19 +51,18 @@ use futures::StreamExt;
 use itertools::Itertools;
 use object_store::{GetOptions, GetRange, GetResultPayload, ObjectStore};
 
-/// Arrow configuration struct that is given to DataSourceExec
-/// Does not hold anything special, since [`FileScanConfig`] is sufficient for 
arrow
+/// `FileSource` for Arrow IPC file format. Supports range-based parallel 
reading.
 #[derive(Clone)]
-pub struct ArrowSource {

Review Comment:
   Here there seems to be several public API changes like `ArrowSource` and 
`ArrowOpener`, it would be great to include them in the upgrade guide 
https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md
   Though I'm not sure if we can make it in `51.0.0`, if not this should be 
under `52.0.0` section



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to