scovich commented on issue #7119: URL: https://github.com/apache/arrow-rs/issues/7119#issuecomment-2652344533
> I think what might help is adding a representation section, similar to we have for [ListArray](https://docs.rs/arrow-array/latest/arrow_array/array/struct.GenericListArray.html#representation) as added by @alamb in https://github.com/apache/arrow-rs/pull/7039. This could clearly show that null slots are arbitrary.

That does seem like a glaring gap, given how nice the `ListArray` doc is...

> the arrow_json reader avoids [#6510](https://github.com/apache/arrow-rs/issues/6510) by computing null masks for all children of nullable StructArray see [here](https://github.com/apache/arrow-rs/blob/main/arrow-json/src/reader/struct_array.rs#L44-L54). The parquet reader could do this, in fact I suggested this [#6510 (comment)](https://github.com/apache/arrow-rs/issues/6510#issuecomment-2394955461), but it is effectively wasted work

Could you elaborate why it's wasted work? Presumably the column should be read at least once (else why materialize it -- should have pruned the read schema). If whoever consumes a given leaf column anyway has to union the null masks (possibly multiple times, if more than one read), it seems not-worse to just have the writer do it once up front. Especially when the writer is in a better position to do that unioning efficiently as it assembles the columns into structs, vs. readers having to reverse engineer it every time they access each column.

Or is the main concern that storing null masks for non-nullable nested columns would consume too much memory?

> if anything [storing pre-unioned null masks] encourages incorrect assumptions

If the spec (or the implementation) doesn't _require_ valid null masks at every level, then I have to agree with you there. No point computing _any_ null masks unless they are trustworthy -- it just forces both reader and writer to pay the ~same cost because they don't trust each other's work.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
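
For illustration only, the trade-off discussed in the comment above (the writer unions null masks once, versus every reader combining them on each access) can be sketched as follows. This is a minimal, self-contained sketch and deliberately not the arrow-rs API: `Validity`, `StructColumn`, `push_down_nulls`, and `effective_validity` are hypothetical names, and a plain `Vec<bool>` stands in for a real validity bitmap. Note that "unioning null masks" is the same as AND-ing validity masks.

```rust
/// One flag per row; `true` means the slot is valid (non-null).
/// Hypothetical stand-in for a validity bitmap.
type Validity = Vec<bool>;

/// Hypothetical stand-in for a nullable struct column:
/// the parent's own mask plus one mask per child column.
struct StructColumn {
    validity: Validity,
    children: Vec<Validity>,
}

/// Writer-side option: AND the parent's mask into every child once,
/// so a child slot is valid only if the parent slot is valid too.
fn push_down_nulls(col: &mut StructColumn) {
    for child in &mut col.children {
        assert_eq!(child.len(), col.validity.len());
        for (c, p) in child.iter_mut().zip(&col.validity) {
            *c = *c && *p;
        }
    }
}

/// Reader-side option: if child masks are not trusted to include the
/// parent's nulls, each access has to recombine the two masks.
fn effective_validity(parent: &Validity, child: &Validity) -> Validity {
    assert_eq!(parent.len(), child.len());
    parent.iter().zip(child).map(|(p, c)| *p && *c).collect()
}
```

With masks pushed down by the writer, a reader can use `children[i]` directly; without that guarantee, it has to call something like `effective_validity` every time it touches a child column, which is the repeated cost the comment refers to.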