2010YOUY01 commented on code in PR #18457:
URL: https://github.com/apache/datafusion/pull/18457#discussion_r2509455733
##########
datafusion/sqllogictest/test_files/arrow_files.slt:
##########
@@ -128,3 +128,228 @@ physical_plan DataSourceExec: file_groups={1 group:
[[WORKSPACE_ROOT/datafusion/
# Errors in partition filters should be reported
query error Divide by zero error
SELECT f0 FROM arrow_partitioned WHERE CASE WHEN true THEN 1 / 0 ELSE part END
= 1;
+
+#############
+## Arrow IPC stream format support
+#############
Review Comment:
Test coverage is great! I recommend to include two additional edge cases:
1. Query from an empty Arrow IPC Stream file: we can generate a valid but
empty IPC Stream test file, if you `select * from strema_file` it returns an
empty set (if this is possible)
2. Query from an corrupted stream file: the file ends with `.arrow` so we're
inferring it as arrow file, but its internal is broken somewhere so the result
is an error
##########
datafusion/datasource-arrow/src/file_format.rs:
##########
@@ -349,94 +376,122 @@ impl DataSink for ArrowFileSink {
}
}
+// Custom implementation of inferring schema. Should eventually be moved
upstream to arrow-rs.
+// See <https://github.com/apache/arrow-rs/issues/5021>
+
const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];
-/// Custom implementation of inferring schema. Should eventually be moved
upstream to arrow-rs.
-/// See <https://github.com/apache/arrow-rs/issues/5021>
-async fn infer_schema_from_file_stream(
+async fn infer_stream_schema(
Review Comment:
Let's add a link to the format spec here.
##########
datafusion/datasource-arrow/src/source.rs:
##########
@@ -36,19 +51,18 @@ use futures::StreamExt;
use itertools::Itertools;
use object_store::{GetOptions, GetRange, GetResultPayload, ObjectStore};
-/// Arrow configuration struct that is given to DataSourceExec
-/// Does not hold anything special, since [`FileScanConfig`] is sufficient for
arrow
+/// `FileSource` for Arrow IPC file format. Supports range-based parallel
reading.
#[derive(Clone)]
-pub struct ArrowSource {
Review Comment:
Here there seems to be several public API changes like `ArrowSource` and
`ArrowOpener`, it would be great to include them in the upgrade guide
https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md
Though I'm not sure if we can make it in `51.0.0`, if not this should be
under `52.0.0` section
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]