Rafferty97 commented on PR #9494:
URL: https://github.com/apache/arrow-rs/pull/9494#issuecomment-4794432498
Hi @alamb, thanks for having another look over the PR. I definitely
appreciate that it's a little on the heavier side, and the aversion to changing
the code's behaviour.
You're right that these changes would cause some schema inferences that
previously succeeded to return an error. In particular, it would now reject
cases where a given key appears with both scalar and array values, which is
what `mixed_arrays.json` specifically tests:
```
{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":4.1}
{"a":-10, "b":[2.0, 1.3, -6.1], "c":null, "d":null}
{"a":2, "b":[2.0, null, -6.1], "c":[false, null], "d":"text"}
{"a":3, "b":4, "c": true, "d":[1, false, "array", 2.4]}
```
Here, the keys `b`, `c` and `d` appear in some rows as scalar values, and as
other rows as array values. In the existing code, these keys get inferred as
array types, on the assumption that when the file is read into actual
`RecordBatches`, those scalar values will be coerced into single-element arrays.
However, the JSON reader doesn't actually do this. It appears that it used
to but that this functionality was removed. So, even though the schema
inference succeeds, the JSON file will fail to be read at query execution time.
So, my opinion is that it would actually be better for the inference itself to
fail, as this is more informative to the user.
With that being said, I'm open to preserving the existing behaviour for now
to get the PR over the line. If that's your preference, I'll re-implement that
functionality and reinstate those tests I removed.
I also agree with your earlier comment about preferring not to commit gziped
files in the repo itself. I'd be happy to change those tests that need them to
instead do the compression themselves rather than relying on an
already-compressed file existing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]