antoniobadia commented on issue #46040:
URL: https://github.com/apache/arrow/issues/46040#issuecomment-3176227414
Arrow cannot parse JSON when the conversion to columns fails. That is, it
expects every path in the JSON to be associated with a single data type. It
fails in all of the following cases (examples produced with pyarrow):
    "name": {
        "firstName": "Duckota",
        "lastName": "Fanning"
    },
    ...
    "name": "Jim Jones",
    #pyarrow.lib.ArrowInvalid: JSON parse error: Column(/name) changed from object to string in row 3
    "dimensions": {
        "height": "six foot",
        "weight": "165 lbs"
    },
    ...
    "dimensions": {
        "height": 6.2,
        "weight": 185
    },
    #pyarrow.lib.ArrowInvalid: JSON parse error: Column(/dimensions/height) changed from string to number in row 3
    {"a":1,"b":"foo"}
    {"a":2,"b":"bar"}
    {"a":3, "c":{"d":4, "e":5}}
    #pyarrow.lib.ArrowInvalid: JSON parse error: Column(/d/[]) changed from number to string in row 3
    {"a":11, "d":[12, "h"]}
    {"a":6, "d":[7, 8, 9]}
    #pyarrow.lib.ArrowInvalid: JSON parse error: Column(/d/[]) changed from number to string in row 4
    {"a":10, "d":["f", "g"]}
This may be a deliberate limitation, given the difficulty of transforming such
data into columnar storage. I'm currently working on a project that handles
these cases by automatically using UNIONs. Please let me know if this is
already being done or addressed (or if you are interested in joining the
project).
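
To make the "one data type per path" expectation concrete, here is a minimal sketch that scans line-delimited JSON and reports every path holding more than one kind of value. These are the paths that would need a UNION. It uses only the Python standard library and does not call Arrow. The helper names (json_kind, collect_kinds, find_conflicts) are hypothetical and chosen for illustration, and the path notation (/d/[]) only imitates the one in the error messages above. The sample rows are the three "d" rows from the examples, in an order assumed for the illustration.

```python
import json
from collections import defaultdict


def json_kind(value):
    """Name a JSON value's kind, using the terms from the error messages."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def collect_kinds(value, path, seen):
    """Record the kind found at `path`, then recurse into any children."""
    kind = json_kind(value)
    if kind != "null":  # assumption: a null is compatible with any type
        seen[path].add(kind)
    if kind == "object":
        for key, child in value.items():
            collect_kinds(child, f"{path}/{key}", seen)
    elif kind == "array":
        for item in value:
            collect_kinds(item, f"{path}/[]", seen)


def find_conflicts(lines):
    """Return {path: sorted kinds} for each path holding more than one kind."""
    seen = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)  # assumption: every row is a JSON object
        for key, child in row.items():
            collect_kinds(child, f"/{key}", seen)
    return {p: sorted(kinds) for p, kinds in seen.items() if len(kinds) > 1}


rows = [
    '{"a":6, "d":[7, 8, 9]}',
    '{"a":10, "d":["f", "g"]}',
    '{"a":11, "d":[12, "h"]}',
]
print(find_conflicts(rows))  # {'/d/[]': ['number', 'string']}
```

A path that shows up in this report is one where a single column type is not enough, so it is a candidate for a union of the listed kinds.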