houqp commented on a change in pull request #9412:
URL: https://github.com/apache/arrow/pull/9412#discussion_r575753005



##########
File path: rust/arrow/test/data/mixed_arrays.json
##########
@@ -1,4 +1,4 @@
-{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":4.1}
+{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":["4.1"]}
 {"a":-10, "b":[2.0, 1.3, -6.1], "c":null, "d":null}
-{"a":2, "b":[2.0, null, -6.1], "c":[false, null], "d":"text"}
-{"a":3, "b":4, "c": true, "d":[1, false, "array", 2.4]}
+{"a":2, "b":[2.0, null, -6.1], "c":[false, null], "d":["text"]}
+{"a":3, "b":[], "c": [], "d":["array"]}

Review comment:
       I have updated the PR to implement the old behavior, so the inference rules are consistent with the rules used in the actual value conversion code.
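   To make that consistency point concrete, here is a minimal, self-contained sketch (not the PR's actual code) of what sharing one unification rule between schema inference and value conversion could look like. `Inferred`, `unify`, and `infer_one` are hypothetical names introduced only for this illustration, and the sketch uses `serde_json` rather than the arrow crate's internals:

```rust
use serde_json::Value;

// Toy type lattice for this discussion only; not the arrow crate's real
// DataType. All names here are hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum Inferred {
    Null,
    Boolean,
    Float,
    Utf8,
    List(Box<Inferred>),
}

// One unification rule shared by schema inference and value conversion,
// so the two stages can never disagree about what a field becomes.
fn unify(a: Inferred, b: Inferred) -> Inferred {
    use Inferred::*;
    match (a, b) {
        (Null, t) | (t, Null) => t,
        (List(x), List(y)) => List(Box::new(unify(*x, *y))),
        (x, y) if x == y => x,
        // Incompatible combinations fall back to strings in this sketch.
        _ => Utf8,
    }
}

// Infer a type for a single JSON value using the same rule.
fn infer_one(v: &Value) -> Inferred {
    match v {
        Value::Null => Inferred::Null,
        Value::Bool(_) => Inferred::Boolean,
        Value::Number(_) => Inferred::Float,
        Value::String(_) => Inferred::Utf8,
        Value::Array(items) => Inferred::List(Box::new(
            items.iter().map(infer_one).fold(Inferred::Null, unify),
        )),
        Value::Object(_) => Inferred::Utf8, // objects are out of scope here
    }
}

fn main() {
    // Mirrors field "b" in mixed_arrays.json after this change.
    let rows = vec![
        serde_json::json!([2.0, 1.3, -6.1]),
        serde_json::json!(null),
        serde_json::json!([]),
    ];
    let ty = rows.iter().map(infer_one).fold(Inferred::Null, unify);
    println!("field b inferred as {:?}", ty); // List(Float)
}
```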
   
   After thinking about this more, I believe Spark's behavior would be the better design in the long run, for two reasons:
   
   * It preserves more information from the original JSON data. Once we convert a scalar into an array of scalars in Arrow, we lose the fact that some of the original values were plain scalars. It is better to let users decide how to handle incompatible types in their query plans; for example, one might want to simply discard all rows with scalar values as a data cleaning step.
   * Converting fields with incompatible types into strings is unavoidable for more complex cases anyway. For example, when a field contains scalar, list, and object values, there is no option other than reading the value as a JSON string (sketched below).
   
   Perhaps we can refactor the code to match Spark's behavior after we add more JSON parsing kernels.
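   As a rough sketch of that Spark-style fallback (hypothetical helper name, using `serde_json`, not the arrow crate's API): values of an incompatibly-typed field are kept as their raw JSON text, so downstream queries can still tell scalars from lists and decide how to clean them up:

```rust
use serde_json::Value;

// Hypothetical helper, not part of the arrow crate: keep each value of an
// incompatibly-typed field as its raw JSON text so no information is lost.
fn read_incompatible_as_string(v: &Value) -> Option<String> {
    match v {
        Value::Null => None,
        // `to_string()` re-serializes the value, so a caller can still tell
        // a plain scalar apart from a list when cleaning the data later.
        other => Some(other.to_string()),
    }
}

fn main() {
    // Mirrors the original field "d": a float, null, a string, and a mixed list.
    let rows = vec![
        serde_json::json!(4.1),
        serde_json::json!(null),
        serde_json::json!("text"),
        serde_json::json!([1, false, "array", 2.4]),
    ];
    for row in &rows {
        println!("{:?}", read_incompatible_as_string(row));
    }
}
```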




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

