houqp commented on a change in pull request #9412:
URL: https://github.com/apache/arrow/pull/9412#discussion_r575753005
##########
File path: rust/arrow/test/data/mixed_arrays.json
##########
@@ -1,4 +1,4 @@
-{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":4.1}
+{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":["4.1"]}
 {"a":-10, "b":[2.0, 1.3, -6.1], "c":null, "d":null}
-{"a":2, "b":[2.0, null, -6.1], "c":[false, null], "d":"text"}
-{"a":3, "b":4, "c": true, "d":[1, false, "array", 2.4]}
+{"a":2, "b":[2.0, null, -6.1], "c":[false, null], "d":["text"]}
+{"a":3, "b":[], "c": [], "d":["array"]}

Review comment:

I have updated the PR to implement the old behavior, so the inference rules are consistent with the rules used in the actual value conversion code.

After thinking about this more, I believe Spark's behavior would be the better design in the long run, for two reasons:

* It preserves more information from the original JSON data. Once we convert a scalar into an array of scalars in Arrow, we lose the fact that some of the original values were plain scalars. It is better to let users decide how to handle incompatible types in their query plans; for example, one might simply discard all rows with scalar values as a data-cleaning step.
* Converting fields with incompatible types into strings is unavoidable in more complex cases anyway. For example, when a field contains scalar, list, and object values, there is no option other than reading the value as a JSON string (see the sketch below).

Perhaps we can refactor the code to match Spark's behavior after we add more JSON parsing kernels.
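Below is a minimal Rust sketch of the Spark-style string fallback described in the second bullet. It uses `serde_json` directly rather than the arrow JSON reader, and the helper name `read_field_as_json_strings` is made up for illustration; the point is only that mixed scalar/list/object values can always be kept losslessly as JSON text.

```rust
// Sketch only: not the arrow crate's API. Demonstrates falling back to the
// raw JSON text of a field when its values have incompatible types.
use serde_json::Value;

/// Read field `name` from each JSON line, keeping each value as its JSON
/// string representation so scalars, lists and objects can coexist.
fn read_field_as_json_strings(lines: &[&str], name: &str) -> Vec<Option<String>> {
    lines
        .iter()
        .map(|line| {
            let v: Value = serde_json::from_str(line).expect("valid JSON line");
            match v.get(name) {
                None | Some(Value::Null) => None,
                // `to_string()` re-serializes the value, so 4.1, "text",
                // [1, false] and {"nested": true} all become JSON text.
                Some(other) => Some(other.to_string()),
            }
        })
        .collect()
}

fn main() {
    let lines = [
        r#"{"d": 4.1}"#,
        r#"{"d": "text"}"#,
        r#"{"d": [1, false, "array", 2.4]}"#,
        r#"{"d": {"nested": true}}"#,
    ];
    // Prints Some("4.1"), Some("\"text\""), Some("[1,false,\"array\",2.4]"),
    // Some("{\"nested\":true}"); the caller can re-parse or discard rows later.
    for value in read_field_as_json_strings(&lines, "d") {
        println!("{:?}", value);
    }
}
```

A scalar-versus-list conflict could keep richer types (as the current coercion to a list of strings does), but once an object shows up in the mix, the string fallback is the only representation that does not drop information.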