Just a personal thought, but I think option 3 is valid in a scenario where the column has been filtered and then changed to non null. I believe this enables some filtering cases to be zero-copy?
I could be confusing how child arrays could be referenced though.

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

Sent from Proton Mail for iOS.

-------- Original Message --------
On Friday, 01/30/26 at 06:19 Weston Pace <[email protected]> wrote:

I agree with Raphael that this is probably too late to change. There are many tools out there that produce Arrow data now and they are not all going to conform to definition 1. In fact, as Antoine points out, many tools do not even guarantee validity at all (a batch created with pyarrow may have a field marked non-nullable that has nulls). As a result, my personal stance has been to ignore the nullability flag on all external data and independently determine whether an array has or does not have nulls.

> the problem I have is that this is an undefined behavior, the accepted behavior can be (I don't think this should be the behavior) that there should be no requirement on the child nulls, and it can have nulls anywhere they want even if the parent does not have null there.

There is very little mention of the nullable flag in the spec at all. The only thing I see is:

> Whether the field is semantically nullable. While this has no bearing on the array’s physical layout,
> many systems distinguish nullable and non-nullable fields and we want to allow them to preserve
> this metadata to enable faithful schema round trips.

Since the spec explicitly states "this has no bearing on the array's physical layout" I think your accepted behavior could, in fact, be seen as valid, if not wise.

That being said, my view might be a little out there :). I am content if we want to consolidate on a definition. I think definition 3 is the most flexible and likely to be adopted.

On Thu, Jan 29, 2026 at 11:55 AM Raz Luvaton <[email protected]> wrote:

> > If something had been
> > standardised at the start that would be one thing, but retroactively
> > adding schema restrictions now is likely to break existing workflows,
> > and is therefore probably best avoided.
>
> the problem I have is that this is an undefined behavior, the accepted
> behavior can be (I don't think this should be the behavior) that there
> should be no requirement on the child nulls, and it can have nulls anywhere
> they want even if the parent does not have null there.
>
> On 2026/01/29 19:40:01 Raphael Taylor-Davies wrote:
> > For what it is worth arrow-rs takes the most permissive interpretation 3
> > - we only reject unambiguously malformed StructArray. For further
> > context I believe the instigator of this email thread is [1].
> >
> > I think the main question with taking one of the more strict
> > interpretations is what value is assigned to "masked" values when
> > parsing from some other format, such as JSON or parquet, that doesn't
> > encode them. Some people think it should be NULL, others arbitrary. For
> > example, when arrow-rs changed the parquet reader from using NULL to
> > arbitrary it was actually reported as a bug [2].
> >
> > My 2 cents is that this is a bit like the question around whether
> > StructArray can have fields with the same name. If something had been
> > standardised at the start that would be one thing, but retroactively
> > adding schema restrictions now is likely to break existing workflows,
> > and is therefore probably best avoided.
> >
> > Kind Regards,
> >
> > Raphael
> >
> > [1]: https://github.com/apache/arrow-rs/issues/9302
> > [2]: https://github.com/apache/arrow-rs/issues/7119
> >
> > On 29/01/2026 19:10, Raz Luvaton wrote:
> > > Currently there is ambiguity on what the validity buffer for non nullable
> > > field of a nullable struct can be.
> > >
> > > Let's take for example the following type:
> > > ```
> > > nullable StructArray with non nullable field Int32
> > > ```
> > > The struct validity is: valid, null, null, valid.
> > >
> > > Which of the following should it be:
> > > 1. The child array (the int32 array) FORBIDDEN from having nulls at all
> > > (i.e. in our example the validity buffer for the child must be valid,
> > > valid, valid, valid) as the field is marked as non nullable?
> > > 2. The child array REQUIRED to have nulls at the same positions of the
> > > struct nulls, i.e. the validity buffer for the child MUST be valid, null,
> > > null, valid in our example?
> > > 3. The child array MAY have nulls but it is FORBIDDEN to have nulls where
> > > the struct does not have nulls, i.e. it can't have null, null, valid, valid
> > > but it can have valid, null, valid, valid in our example.
> > >
> > > I would argue that 1 is the correct and expected requirement, as the field
> > > is marked as non nullable.
> > >
> > > The chosen behavior will be applicable for other nested types as well
> > >
> > > Thanks, Raz Luvaton
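
For anyone skimming the thread, the three interpretations from the original message can be sketched as plain predicates over validity bitmaps. This is only an illustration, not any library's API: it models validity as Python lists of booleans (True = valid), the function name `satisfies_interpretation` is made up for this sketch, and it assumes interpretation 2 means the child validity must match the struct validity exactly.

```python
from typing import Sequence


def satisfies_interpretation(
    parent: Sequence[bool], child: Sequence[bool], interpretation: int
) -> bool:
    """Check a child validity bitmap against one of the three readings.

    `parent` is the struct validity, `child` is the validity of the
    non-nullable field; True means valid, False means null.
    (Hypothetical helper for illustration only.)
    """
    if len(parent) != len(child):
        raise ValueError("parent and child must have the same length")

    if interpretation == 1:
        # 1: the child may not contain any nulls at all.
        return all(child)
    if interpretation == 2:
        # 2: the child must be null exactly where the struct is null
        # (assumed reading: the two bitmaps are identical).
        return list(child) == list(parent)
    if interpretation == 3:
        # 3: the child may be null only where the struct is also null.
        return all(c or not p for p, c in zip(parent, child))
    raise ValueError("interpretation must be 1, 2 or 3")


# Struct validity from the example: valid, null, null, valid.
struct_validity = [True, False, False, True]

candidates = {
    "all valid": [True, True, True, True],
    "same as struct": [True, False, False, True],
    "subset of struct nulls": [True, False, True, True],
    "null where struct is valid": [False, False, True, True],
}

for name, child_validity in candidates.items():
    accepted = [
        i for i in (1, 2, 3)
        if satisfies_interpretation(struct_validity, child_validity, i)
    ]
    print(f"{name}: accepted by interpretations {accepted}")
```

Under these assumptions, every bitmap accepted by 1 or 2 is also accepted by 3, which is one way to read the "most permissive" description of interpretation 3 above.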
