serramatutu opened a new pull request, #833:
URL: https://github.com/apache/arrow-go/pull/833

   ### Rationale for this change
   
   I raised an issue at the Arrow Community meeting regarding how Arrow Go 
allows the user to construct invalid record batches by reading JSON with nulls 
in it even if the schema says it's not null. It is also possible to construct 
invalid batches by calling `AppendNull()` in the builders, and currently there 
is no way to validate that. I was instructed to look at how Arrow C++/PyArrow 
do it, and replicate it here.
   
   TL;DR: 
   1. Arrow C++ raises errors when reading data that does not conform to the 
schema
   2. Arrow C++ never JSON-encodes values as `null` if the parent field is 
non-nullable
   3. In Arrow C++, if the user manually constructs something invalid, 
`validate()` will catch it
   
   Here's a link to my full investigation, including the code so you can run it 
yourself: https://github.com/serramatutu/arrow-internal-nulls
   
   
   ### What changes are included in this PR?
   
   You can review commit by commit.
   
   To address Issue 1:
   - Make `RecordBuilder.UnmarshalOne()` return an error if a `null` is present 
for a non-nullable field, or if the value is not given.
   - Same for `StructBuilder.UnmarshalOne()`, so that it all works recursively.
   
   To address Issue 2:
   - Added a `nullable bool` parameter to `GetOneForMarshal()`. It is `true` in 
most cases, but `arrow.Record` and `arrow.Struct` might pass it as `false` in 
case the field we're serializing is non-nullable. This prevents it from writing 
`null` to the JSON output.
   
   I had to change a bunch of tests that implicitly depended on fields being 
nullable, even if they didn't declare the fields as such.
   
   I also included Issue 3 in this PR (but maybe it should be a separate PR, 
let me know?):
   - Added a recursive check (inspects deep structs) for `NullN() > 0` in case 
fields are not nullable in `simpleRecord.validate()`. While users can still 
manually append to inner buffers, things like `bldr.NewRecordBatch()` will run 
`validate()` and check for nullability conformance before returning the record 
batch.
   
   TODO: Regarding Issue 3, it looks like the CSV parser blindly calls 
`AppendNull()` even when the user gives it a schema that is not nullable. I was 
not sure if the expected behavior is to `AppendEmptyValue()`, return an error 
or something else. So some CSV tests are failing...
   
   ### Are these changes tested?
   
   Yes.
   
   
   ### Are there any user-facing changes?
   
   Yes, this makes the JSON decoder less lenient and validation stricter.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to