alamb commented on code in PR #19924:
URL: https://github.com/apache/datafusion/pull/19924#discussion_r2717395665
##########
datafusion/datasource-json/src/file_format.rs:
##########
@@ -132,7 +132,23 @@ impl Debug for JsonFormatFactory {
}
}
-/// New line delimited JSON `FileFormat` implementation.
+/// JSON `FileFormat` implementation supporting both line-delimited and array formats.
+///
+/// # Supported Formats
+///
+/// ## Line-Delimited JSON (default)
+/// ```text
+/// {"key1": 1, "key2": "val"}
Review Comment:
Just to confirm, is this a change in default behavior? I don't think so, but
I wanted to double-check.
##########
datafusion/datasource-json/src/file_format.rs:
##########
@@ -166,6 +182,49 @@ impl JsonFormat {
self.options.compression = file_compression_type.into();
self
}
+
+ /// Set whether to expect JSON array format instead of line-delimited format.
+ ///
+ /// When `true`, expects input like: `[{"a": 1}, {"a": 2}]`
+ /// When `false` (default), expects input like:
+ /// ```text
+ /// {"a": 1}
+ /// {"a": 2}
+ /// ```
+ pub fn with_format_array(mut self, format_array: bool) -> Self {
+ self.options.format_array = format_array;
+ self
+ }
+}
+
+/// Infer schema from a JSON array format file.
+///
+/// This function reads JSON data in array format `[{...}, {...}]` and infers
+/// the Arrow schema from the contained objects.
+fn infer_json_schema_from_json_array<R: Read>(
+ reader: &mut R,
+ max_records: usize,
+) -> std::result::Result<Schema, ArrowError> {
+ let mut content = String::new();
+ reader.read_to_string(&mut content).map_err(|e| {
+ ArrowError::JsonError(format!("Failed to read JSON content: {e}"))
+ })?;
+
+ // Parse as JSON array using serde_json
+ let values: Vec<serde_json::Value> = serde_json::from_str(&content)
Review Comment:
This is likely to be super slow -- it parses the entire JSON file (and then
throws the parsed results away). If there is some way to avoid parsing the
whole file, that would probably be better (maybe as a follow-on PR).
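One way to avoid the full parse the comment describes would be a shallow scan that slices out only the first `max_records` top-level objects (tracking brace depth and string/escape state) and stops there, leaving the rest of the file untouched. A minimal stdlib-only sketch of that idea (the function name and shape are illustrative, not part of the PR):

```rust
/// Hypothetical sketch: slice out up to `max_records` top-level objects
/// from a JSON array without parsing the whole file. Tracks brace depth
/// plus string/escape state so braces inside string values are ignored.
fn split_top_level_objects(input: &str, max_records: usize) -> Vec<&str> {
    let mut objects = Vec::new();
    let mut depth = 0usize;
    let mut start: Option<usize> = None;
    let mut in_string = false;
    let mut escaped = false;
    for (i, c) in input.char_indices() {
        if in_string {
            // Inside a string literal: only track escapes and the closing quote.
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            '"' => in_string = true,
            '{' => {
                if depth == 0 {
                    start = Some(i);
                }
                depth += 1;
            }
            '}' => {
                if depth > 0 {
                    depth -= 1;
                    if depth == 0 {
                        if let Some(s) = start.take() {
                            objects.push(&input[s..=i]);
                            if objects.len() >= max_records {
                                // Stop early: the rest of the file is never scanned.
                                break;
                            }
                        }
                    }
                }
            }
            _ => {}
        }
    }
    objects
}

fn main() {
    let data = r#"[{"a": 1, "s": "x}y"}, {"a": 2}, {"a": 3}]"#;
    let objs = split_top_level_objects(data, 2);
    assert_eq!(objs, vec![r#"{"a": 1, "s": "x}y"}"#, r#"{"a": 2}"#]);
}
```

The resulting slices could then be fed to the existing line-delimited schema inference, so only the sampled prefix is ever parsed.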
##########
datafusion/core/src/datasource/file_format/json.rs:
##########
@@ -391,4 +392,276 @@ mod tests {
assert_eq!(metadata.len(), 0);
Ok(())
}
+
+ #[tokio::test]
+ async fn test_json_array_format() -> Result<()> {
+ let session = SessionContext::new();
+ let ctx = session.state();
+ let store = Arc::new(LocalFileSystem::new()) as _;
+
+ // Create a temporary file with JSON array format
+ let tmp_dir = tempfile::TempDir::new()?;
+ let path = format!("{}/array.json", tmp_dir.path().to_string_lossy());
+ std::fs::write(
+ &path,
+ r#"[
+ {"a": 1, "b": 2.0, "c": true},
+ {"a": 2, "b": 3.5, "c": false},
+ {"a": 3, "b": 4.0, "c": true}
+ ]"#,
+ )?;
+
Review Comment:
I think this standard preamble could be reduced so there are fewer test
lines (and thus it is easier to verify what is being tested).
For example, it looks like you could make a function like
```rust
let file_schema = create_json_with_format("{..}", format);
```
I bet the tests would be less than half the size
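The file-setup half of such a helper can be sketched with the stdlib alone; the DataFusion plumbing (building `JsonFormat` with the array option and calling `infer_schema`) would live in the real test module. The helper name and signature here are illustrative, not from the PR:

```rust
use std::path::PathBuf;

/// Hypothetical test helper along the lines the review suggests: write the
/// given JSON `content` to a temp file and return its path, so each test
/// body is just the content plus the assertions.
fn create_json_file(name: &str, content: &str) -> std::io::Result<PathBuf> {
    let path = std::env::temp_dir().join(name);
    std::fs::write(&path, content)?;
    Ok(path)
}

fn main() -> std::io::Result<()> {
    let path = create_json_file("array.json", r#"[{"a": 1}, {"a": 2}]"#)?;
    let read_back = std::fs::read_to_string(&path)?;
    assert!(read_back.starts_with('['));
    Ok(())
}
```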
##########
datafusion/common/src/config.rs:
##########
@@ -3065,6 +3065,22 @@ config_namespace! {
/// If not specified, the default level for the compression algorithm is used.
pub compression_level: Option<u32>, default = None
pub schema_infer_max_rec: Option<usize>, default = None
+ /// The format of JSON input files.
+ ///
+ /// When `false` (default), expects newline-delimited JSON (NDJSON):
+ /// ```text
+ /// {"key1": 1, "key2": "val"}
+ /// {"key1": 2, "key2": "vals"}
+ /// ```
+ ///
+ /// When `true`, expects JSON array format:
+ /// ```text
+ /// [
+ /// {"key1": 1, "key2": "val"},
+ /// {"key1": 2, "key2": "vals"}
+ /// ]
+ /// ```
+ pub format_array: bool, default = false
Review Comment:
I think `format_array` will be hard to discover / find, and we should call
this parameter something more standard.
I looked at what other systems do, and there is no consistency.
I reviewed Spark's docs, and they seem to use `multiLine=true` for what you
have labelled `format_array`:
https://spark.apache.org/docs/latest/sql-data-sources-json.html
DuckDB seems to call it `format=newline_delimited`:
https://duckdb.org/docs/stable/data/json/loading_json#parameters
Postgres seems to have two separate functions, `row_to_json` and
`array_to_json`:
https://www.postgresql.org/docs/9.5/functions-json.html
Of the options, I think I prefer the DuckDB-style `newline_delimited`,
though maybe the Spark `multiLine` would be more widely understood.
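A DuckDB-style spelling would suggest an enum-valued option rather than a bare bool. A stdlib-only sketch of what parsing such an option could look like (the type name and accepted spellings are illustrative, not what the PR implements):

```rust
use std::str::FromStr;

/// Hypothetical enum-valued option for the JSON file layout, accepting the
/// DuckDB-style spellings discussed above.
#[derive(Debug, PartialEq)]
enum JsonLayout {
    NewlineDelimited,
    Array,
}

impl FromStr for JsonLayout {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "newline_delimited" => Ok(JsonLayout::NewlineDelimited),
            "array" => Ok(JsonLayout::Array),
            other => Err(format!("unknown json layout: {other}")),
        }
    }
}

fn main() {
    assert_eq!(
        "newline_delimited".parse::<JsonLayout>().unwrap(),
        JsonLayout::NewlineDelimited
    );
    assert_eq!("ARRAY".parse::<JsonLayout>().unwrap(), JsonLayout::Array);
    assert!("multiline".parse::<JsonLayout>().is_err());
}
```

An enum also leaves room for future layouts without another boolean flag, which is one argument for the DuckDB style over Spark's `multiLine` bool.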
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]