antoniobadia commented on issue #46040: URL: https://github.com/apache/arrow/issues/46040#issuecomment-3176227414
Arrow's JSON reader fails when the value at a given path does not have the same type in every row. That is, it expects every path in the JSON to be associated with a single data type. It fails on all of the following (examples produced with pyarrow):

"name": { "firstName": "Duckota", "lastName": "Fanning" },
...
"name": "Jim Jones",
#pyarrow.lib.ArrowInvalid: JSON parse error: Column(/name) changed from object to string in row 3

"dimensions": { "height": "six foot", "weight": "165 lbs" },
...
"dimensions": { "height": 6.2, "weight": 185 },
#pyarrow.lib.ArrowInvalid: JSON parse error: Column(/dimensions/height) changed from string to number in row 3

{"a":1,"b":"foo"}
{"a":2,"b":"bar"}
{"a":3, "c":{"d":4, "e":5}}
#pyarrow.lib.ArrowInvalid: JSON parse error: Column(/d/[]) changed from number to string in row 3
{"a":11, "d":[12, "h"]}
{"a":6, "d":[7, 8, 9]}
#pyarrow.lib.ArrowInvalid: JSON parse error: Column(/d/[]) changed from number to string in row 4
{"a":10, "d":["f", "g"]}

This may be a deliberate limitation, given the difficulty of mapping such data to columnar storage. I'm currently working on a project to take care of this by using UNIONs automatically in such cases. Please let me know if this is already being done/addressed (or if you are interested in joining the project).
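To make the "one data type per path" rule concrete, here is a minimal sketch in plain Python (standard library only) that scans newline-delimited JSON and reports the first path whose type changes. It is not Arrow's implementation. The helper names (`json_kind`, `walk`, `first_type_conflict`), the type labels, the 0-based row counter, and the treatment of nulls as compatible with any type are all assumptions, chosen so the message resembles the errors quoted above.

```python
import json


def json_kind(value):
    """Return a JSON type label for a decoded value (labels are illustrative)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def walk(value, path, seen, row):
    """Record the type seen at each path; return a message on the first conflict."""
    kind = json_kind(value)
    if kind == "null":
        return None  # assumption: a null never conflicts with another type
    previous = seen.setdefault(path, kind)
    if previous != kind:
        return f"Column({path}) changed from {previous} to {kind} in row {row}"
    if kind == "object":
        children = ((f"{path}/{key}", child) for key, child in value.items())
    elif kind == "array":
        children = ((f"{path}/[]", child) for child in value)
    else:
        children = ()
    for child_path, child in children:
        conflict = walk(child, child_path, seen, row)
        if conflict:
            return conflict
    return None


def first_type_conflict(lines):
    """Scan newline-delimited JSON rows (0-based) and return the first conflict, if any."""
    seen = {}
    for row, line in enumerate(lines):
        conflict = walk(json.loads(line), "", seen, row)
        if conflict:
            return conflict
    return None


rows = [
    '{"a":1,"b":"foo"}',
    '{"a":2,"b":"bar"}',
    '{"a":3, "c":{"d":4, "e":5}}',
    '{"a":11, "d":[12, "h"]}',
]
print(first_type_conflict(rows))
# prints: Column(/d/[]) changed from number to string in row 3
```

A reader that infers UNION types automatically would presumably act at the point where this sketch returns a message, widening the recorded type for that path instead of stopping.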