jorgecarleitao commented on a change in pull request #9412:
URL: https://github.com/apache/arrow/pull/9412#discussion_r575753025
##########
File path: rust/arrow/test/data/mixed_arrays.json
##########
@@ -1,4 +1,4 @@
-{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":4.1}
+{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":["4.1"]}
{"a":-10, "b":[2.0, 1.3, -6.1], "c":null, "d":null}
-{"a":2, "b":[2.0, null, -6.1], "c":[false, null], "d":"text"}
-{"a":3, "b":4, "c": true, "d":[1, false, "array", 2.4]}
+{"a":2, "b":[2.0, null, -6.1], "c":[false, null], "d":["text"]}
+{"a":3, "b":[], "c": [], "d":["array"]}
Review comment:
I think that Spark's behavior is correct. Our JSON reader and schema
inferrer are, in my opinion, very broken for list arrays. For example, this input
```
{"a": "hello"}
{"a": "world"}
{"a": "this"}
{"a": null}
{"a": "hello"}
{"a": "world"}
{"a": "this"}
{"a": null}
{"a": "hello"}
{"a": "world"}
{"a": "this"}
{"a": null}
{"a": ["a"]}
```
yields a `ListArray<String>` in which each scalar is interpreted as a
one-element list. I can't see how this is reasonable.
@houqp I have code locally that handles arbitrarily nested lists, but it
requires that we do not do this kind of crazy stuff, i.e. it works as
expected if no casting like the above is allowed.
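
To make the rule concrete, here is a minimal sketch of strict inference that refuses to promote scalars to one-element lists. It is not the code mentioned above and not the arrow crate's API: the `Json` and `Inferred` enums and the `infer_value`, `merge`, and `infer_column` functions are hypothetical names invented for illustration, and only strings, nulls, and lists are modelled.

```rust
// Hypothetical, simplified model of JSON values (not the arrow crate's types).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Str(String),
    List(Vec<Json>),
}

// Hypothetical inferred type for a single column.
#[derive(Debug, Clone, PartialEq)]
enum Inferred {
    Unknown, // only nulls (or empty lists) seen so far
    Utf8,
    List(Box<Inferred>),
}

// Infer the type of one value, recursing into nested lists.
fn infer_value(v: &Json) -> Result<Inferred, String> {
    match v {
        Json::Null => Ok(Inferred::Unknown),
        Json::Str(_) => Ok(Inferred::Utf8),
        Json::List(items) => {
            let mut inner = Inferred::Unknown;
            for item in items {
                inner = merge(inner, infer_value(item)?)?;
            }
            Ok(Inferred::List(Box::new(inner)))
        }
    }
}

// Merge two inferred types. A scalar and a list never merge:
// that is the "no casting" rule, so the mix is reported as an error.
fn merge(a: Inferred, b: Inferred) -> Result<Inferred, String> {
    match (a, b) {
        (Inferred::Unknown, t) | (t, Inferred::Unknown) => Ok(t),
        (Inferred::Utf8, Inferred::Utf8) => Ok(Inferred::Utf8),
        (Inferred::List(x), Inferred::List(y)) => {
            Ok(Inferred::List(Box::new(merge(*x, *y)?)))
        }
        (a, b) => Err(format!("incompatible types: {:?} vs {:?}", a, b)),
    }
}

// Infer the type of a whole column by folding over its values.
fn infer_column(values: &[Json]) -> Result<Inferred, String> {
    let mut acc = Inferred::Unknown;
    for v in values {
        acc = merge(acc, infer_value(v)?)?;
    }
    Ok(acc)
}
```

Under this sketch, a column containing `"hello"`, `null`, and `["a"]` is an inference error rather than a `ListArray<String>`, while a column made only of (possibly nested) lists and nulls infers a list type of the matching depth.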