Jefffrey commented on code in PR #6643:
URL: https://github.com/apache/arrow-rs/pull/6643#discussion_r1905610652


##########
arrow-json/src/lib.rs:
##########
@@ -74,6 +74,23 @@ pub use self::writer::{ArrayWriter, LineDelimitedWriter, 
Writer, WriterBuilder};
 use half::f16;
 use serde_json::{Number, Value};
 
+/// Specifies what is considered valid JSON when parsing StructArrays.
+///
+/// If a struct with fields `("a", Int32)` and `("b", Utf8)`, it could be 
represented as
+/// a JSON object (`{"a": 1, "b": "c"}`) or a JSON list (`[1, "c"]`).  This 
enum controls
+/// which form(s) the Reader will accept.
+///
+/// For objects, the order of the key does not matter.
+/// For lists, the entries must be the same number and in the same order as 
the struct fields.

Review Comment:
   Current doc here seems specific to reading/decoding; should also mention 
write since same enum is used as write option.
   
   Could also put extra information such as justification for ListOnly (i.e. 
condensed version assuming schema is known separately, as you've explained in 
PR)



##########
arrow-json/src/writer/mod.rs:
##########
@@ -247,12 +248,43 @@ impl WriterBuilder {
     /// {"foo":null,"bar":null}
     /// ```
     ///
-    /// Default is to skip nulls (set to `false`).
+    /// Default is to skip nulls (set to `false`). If `struct_mode == 
ListOnly`,
+    /// nulls will be written explicitly regardless of this setting.
     pub fn with_explicit_nulls(mut self, explicit_nulls: bool) -> Self {
         self.0.explicit_nulls = explicit_nulls;
         self
     }
 
+    /// Returns if this writer is configured to write structs as JSON Objects 
or Arrays.
+    pub fn struct_mode(&self) -> StructMode {
+        self.0.struct_mode
+    }
+
+    /// Set whether to write structs as JSON Objects or Lists.
+    ///
+    /// For example, a writer (with [`LineDelimited`] format) writing the 
schema
+    /// `[("a", Int32), ("m": Struct<b: Boolean, c: Utf8>)] would write with
+    /// `StructMode::ObjectOnly`:
+    ///
+    /// ```json
+    /// {"a": 1, "m": {"b": true, "c": "cat"}}
+    /// ```
+    ///
+    /// With `StructMode::ListOnly`:
+    ///
+    /// ```json
+    /// [1, [true, "cat"]]
+    /// ```
+    ///
+    /// Map columns are not affected by this option.
+    ///
+    /// Default is to use `ObjectOnly`. If this is set to `ListOnly`, nulls 
will
+    /// be written explicitly regardless of the `explicit_nulls` setting.

Review Comment:
   Same here for writer docstring, which is similar to the read version



##########
arrow-json/src/writer/encoder.rs:
##########
@@ -185,11 +189,16 @@ fn is_some_and<T>(opt: Option<T>, f: impl FnOnce(T) -> 
bool) -> bool {
 
 impl Encoder for StructArrayEncoder<'_> {
     fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
-        out.push(b'{');
+        match self.struct_mode {
+            StructMode::ObjectOnly => out.push(b'{'),
+            StructMode::ListOnly => out.push(b'['),
+        }
         let mut is_first = true;
+        // Nulls can only be dropped in explicit mode
+        let drop_nulls = (self.struct_mode == StructMode::ObjectOnly) && 
!self.explicit_nulls;
         for field_encoder in &mut self.encoders {
             let is_null = is_some_and(field_encoder.nulls.as_ref(), |n| 
n.is_null(idx));
-            if is_null && !self.explicit_nulls {
+            if is_null && drop_nulls {

Review Comment:
   ```suggestion
               if drop_nulls && is_null {
   ```
   
   I wonder if this makes any difference, as `drop_nulls` should hopefully 
short-circuit; but maybe compiler can already figure this out :sweat_smile: 



##########
arrow-json/src/reader/mod.rs:
##########
@@ -2343,4 +2374,219 @@ mod tests {
             .unwrap()
         );
     }
+
+    #[test]
+    fn test_struct_decoding_list_length() {

Review Comment:
   Is it worth adding a test for decoding to wrong datatype?



##########
arrow-json/src/reader/mod.rs:
##########
@@ -195,6 +197,7 @@ impl ReaderBuilder {
             coerce_primitive: false,
             strict_mode: false,
             is_field: false,
+            struct_mode: StructMode::ObjectOnly,

Review Comment:
   ```suggestion
               struct_mode: Default::default(),
   ```
   
   Considering we already have default defined for it



##########
arrow-json/src/reader/mod.rs:
##########
@@ -235,6 +238,7 @@ impl ReaderBuilder {
             coerce_primitive: false,
             strict_mode: false,
             is_field: true,
+            struct_mode: StructMode::ObjectOnly,

Review Comment:
   ```suggestion
               struct_mode: Default::default(),
   ```



##########
arrow-json/src/reader/mod.rs:
##########
@@ -262,6 +266,26 @@ impl ReaderBuilder {
         }
     }
 
+    /// Set the [`StructMode`] for the reader, which determines whether
+    /// structs can be represented by JSON objects, lists, or either.
+    ///
+    /// For example, if the RecordBatch Schema is
+    /// `[("a", Int32), ("r", Struct([("b", Boolean), ("c", Utf8)]))]`
+    /// then [`StructMode::ObjectOnly`] would read rows of the form
+    /// `{"a": 1, "r": {"b": true, "c": "cat"}}`
+    /// while ['StructMode::ListOnly'] would read rows of the form
+    /// `[1, [true, "cat"]]`
+    ///
+    /// JSON data of this form is returned by
+    /// 
[Presto](https://prestodb.io/docs/current/develop/client-protocol.html#important-queryresults-attributes)
+    /// and 
[Trino](https://trino.io/docs/current/develop/client-protocol.html#important-queryresults-attributes).

Review Comment:
   I feel this documentation is better consolidated within `StructMode` 
docstring, to avoid duplication. e.g. can just simplify this to something like:
   
   ```rust
   /// [`StructMode`] defines how structs can be decoded from JSON; for more 
details refer to the enum documentation.
   ```



##########
arrow-json/src/reader/struct_array.rs:
##########
@@ -70,43 +74,81 @@ impl ArrayDecoder for StructArrayDecoder {
             .is_nullable
             .then(|| BooleanBufferBuilder::new(pos.len()));
 
-        for (row, p) in pos.iter().enumerate() {
-            let end_idx = match (tape.get(*p), nulls.as_mut()) {
-                (TapeElement::StartObject(end_idx), None) => end_idx,
-                (TapeElement::StartObject(end_idx), Some(nulls)) => {
-                    nulls.append(true);
-                    end_idx
-                }
-                (TapeElement::Null, Some(nulls)) => {
-                    nulls.append(false);
-                    continue;
-                }
-                _ => return Err(tape.error(*p, "{")),
-            };
-
-            let mut cur_idx = *p + 1;
-            while cur_idx < end_idx {
-                // Read field name
-                let field_name = match tape.get(cur_idx) {
-                    TapeElement::String(s) => tape.get_string(s),
-                    _ => return Err(tape.error(cur_idx, "field name")),
+        if self.struct_mode == StructMode::ObjectOnly {

Review Comment:
   Probably should be a match, just in case `StructMode` is expanded with more 
variants in future



##########
arrow-json/src/reader/struct_array.rs:
##########
@@ -70,43 +74,81 @@ impl ArrayDecoder for StructArrayDecoder {
             .is_nullable
             .then(|| BooleanBufferBuilder::new(pos.len()));
 
-        for (row, p) in pos.iter().enumerate() {
-            let end_idx = match (tape.get(*p), nulls.as_mut()) {
-                (TapeElement::StartObject(end_idx), None) => end_idx,
-                (TapeElement::StartObject(end_idx), Some(nulls)) => {
-                    nulls.append(true);
-                    end_idx
-                }
-                (TapeElement::Null, Some(nulls)) => {
-                    nulls.append(false);
-                    continue;
-                }
-                _ => return Err(tape.error(*p, "{")),
-            };
-
-            let mut cur_idx = *p + 1;
-            while cur_idx < end_idx {
-                // Read field name
-                let field_name = match tape.get(cur_idx) {
-                    TapeElement::String(s) => tape.get_string(s),
-                    _ => return Err(tape.error(cur_idx, "field name")),
+        if self.struct_mode == StructMode::ObjectOnly {
+            for (row, p) in pos.iter().enumerate() {
+                let end_idx = match (tape.get(*p), nulls.as_mut()) {
+                    (TapeElement::StartObject(end_idx), None) => end_idx,
+                    (TapeElement::StartObject(end_idx), Some(nulls)) => {
+                        nulls.append(true);
+                        end_idx
+                    }
+                    (TapeElement::Null, Some(nulls)) => {
+                        nulls.append(false);
+                        continue;
+                    }
+                    (_, _) => return Err(tape.error(*p, "{")),
                 };
 
-                // Update child pos if match found
-                match fields.iter().position(|x| x.name() == field_name) {
-                    Some(field_idx) => child_pos[field_idx][row] = cur_idx + 1,
-                    None => {
-                        if self.strict_mode {
-                            return Err(ArrowError::JsonError(format!(
-                                "column '{}' missing from schema",
-                                field_name
-                            )));
+                let mut cur_idx = *p + 1;
+                while cur_idx < end_idx {
+                    // Read field name
+                    let field_name = match tape.get(cur_idx) {
+                        TapeElement::String(s) => tape.get_string(s),
+                        _ => return Err(tape.error(cur_idx, "field name")),
+                    };
+
+                    // Update child pos if match found
+                    match fields.iter().position(|x| x.name() == field_name) {
+                        Some(field_idx) => child_pos[field_idx][row] = cur_idx 
+ 1,
+                        None => {
+                            if self.strict_mode {
+                                return Err(ArrowError::JsonError(format!(
+                                    "column '{}' missing from schema",
+                                    field_name
+                                )));
+                            }
                         }
                     }
+                    // Advance to next field
+                    cur_idx = tape.next(cur_idx + 1, "field value")?;
                 }
+            }
+        } else {
+            for (row, p) in pos.iter().enumerate() {

Review Comment:
   I notice for the `ListOnly` branch, `self.strict_mode` isn't being 
considered unlike above; is this noteworthy in a similar way to how 
`explicit_nulls` only applies for `ObjectOnly`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to