serramatutu opened a new pull request, #833: URL: https://github.com/apache/arrow-go/pull/833
### Rationale for this change I raised an issue at the Arrow Community meeting regarding how Arrow Go allows the user to construct invalid record batches by reading JSON with nulls in it even if the schema says it's not null. It is also possible to construct invalid batches by calling `AppendNull()` in the builders, and currently there is no way to validate that. I was instructed to look at how Arrow C++/PyArrow do it, and replicate it here. TL;DR: 1. Arrow C++ raises errors when reading data that does not conform to the schema 2. Arrow C++ never JSON-encodes values as `null` if the parent field is non-nullable 3. In Arrow C++, if the user manually constructs something invalid, `validate()` will catch it Here's a link to my full investigation, including the code so you can run it yourself: https://github.com/serramatutu/arrow-internal-nulls ### What changes are included in this PR? You can review commit by commit. To address Issue 1: - Make `RecordBuilder.UnmarshalOne()` return an error if a `null` is present for a non-nullable field, or if the value is not given. - Same for `StructBuilder.UnmarshalOne()`, so that it all works recursively. To address Issue 2: - Added a `nullable bool` parameter to `GetOneForMarshal()`. It is `true` in most cases, but `arrow.Record` and `arrow.Struct` might pass it as `false` in case the field we're serializing is non-nullable. This prevents it from writing `null` to the JSON output. I had to change a bunch of tests that implicitly depended on fields being nullable, even if they didn't declare the fields as such. I also included Issue 3 in this PR (but maybe it should be a separate PR, let me know?): - Added a recursive check (inspects deep structs) for `NullN() > 0` in case fields are not nullable in `simpleRecord.validate()`. While users can still manually append to inner buffers, things like `bldr.NewRecordBatch()` will run `validate()` and check for nullability conformance before returning the record batch. TODO: Regarding Issue 3, it looks like the CSV parser blindly calls `AppendNull()` even when the user gives it a schema that is not nullable. I was not sure if the expected behavior is to `AppendEmptyValue()`, return an error or something else. So some CSV tests are failing... ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes, this makes the JSON decoder less lenient and validation stricter. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
