jacques-n commented on pull request #7290: URL: https://github.com/apache/arrow/pull/7290#issuecomment-648507985
> We can decide to stipulate that union types never have non-valid values at the Union cell level, only at the child cell level. But then a union value cannot be "made null" by changing the validity bitmap of the Union.

I believe a union can still express this with a child of null type, no? I think that is how we either modeled it or planned to model it on the Java side.

> All nested types including union are composed from well-formed child arrays which may have null values.

I'm in agreement on this. Decomposing would be complex.

> In the case of union, a null at the top level would indicate that the type of the child is not known. This seems algebraically consistent to me.

I think this is where the model breaks down, because of the weird situation where you actually need to evaluate two validity buffers to determine whether something is valid: the parent's and the child's. And an inconsistency between the two would be really weird. As such, I think it would be better to avoid the top-level validity buffer.

> FTR I'm OK with dropping the top-level validity bitmap from Union, especially if it helps us move forward

That would be my preference. It seems to ultimately reduce the risk of inconsistency and doesn't seem to have any functional loss (given the use of null type to indicate a non-alternatively-typed value).

I also think this works well in the most common case of union types, e.g. two files where one has fieldA with schemaA and another has fieldA with schemaB. Compositing those two doesn't require some kind of introspection and AND'ing of the individual children to build an additional validity buffer (or simply setting true for all and then having an inconsistency with the child array), and it allows a fast set of the type vector for each independent chunk.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
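For illustration only, here is a toy Python model of the layout argued for in the comment: a sparse union with a type vector and children, but no top-level validity bitmap. It is not Arrow's API. The class `SparseUnion`, the helper `composite`, and the convention that type id 0 is the null-typed child are all assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Hypothetical convention for this sketch: child 0 is the null-typed child.
NULL_TYPE_ID = 0


@dataclass
class SparseUnion:
    """Toy sparse union: one type id per slot, no top-level validity bitmap."""

    type_ids: List[int]
    # children[t][i] is slot i as seen through child t; None stands for null.
    children: List[List[Optional[Any]]]

    def is_valid(self, i: int) -> bool:
        # Only the selected child is consulted. There is no parent bitmap
        # that could disagree with the child.
        return self.children[self.type_ids[i]][i] is not None


def composite(chunk_a: List[Optional[Any]], chunk_b: List[Optional[Any]]) -> SparseUnion:
    """Combine fieldA from two chunks with different schemas into one union."""
    n_a, n_b = len(chunk_a), len(chunk_b)
    # The type vector is filled once per chunk, without inspecting any values.
    type_ids = [1] * n_a + [2] * n_b
    children = [
        [None] * (n_a + n_b),          # null-typed child: always null
        list(chunk_a) + [None] * n_b,  # child 1 keeps chunk A's own nulls
        [None] * n_a + list(chunk_b),  # child 2 keeps chunk B's own nulls
    ]
    return SparseUnion(type_ids, children)


union = composite([1, None, 3], ["x", None])
validity = [union.is_valid(i) for i in range(len(union.type_ids))]

# A value with no alternative type is expressed by selecting the null child.
untyped = SparseUnion([NULL_TYPE_ID, 1], [[None, None], [None, 7]])
```

In this model, nulls inside each chunk stay where they are in the child, so compositing never has to AND child validity into a parent buffer.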
