scovich commented on issue #7119:
URL: https://github.com/apache/arrow-rs/issues/7119#issuecomment-2652344533

   > I think what might help is adding a representation section, similar to what we have for [ListArray](https://docs.rs/arrow-array/latest/arrow_array/array/struct.GenericListArray.html#representation) as added by @alamb in https://github.com/apache/arrow-rs/pull/7039. This could clearly show that null slots are arbitrary.
   
   That does seem like a glaring gap, given how nice the `ListArray` doc is...
   
   > the arrow_json reader avoids [#6510](https://github.com/apache/arrow-rs/issues/6510) by computing null masks for all children of nullable StructArray see [here](https://github.com/apache/arrow-rs/blob/main/arrow-json/src/reader/struct_array.rs#L44-L54). The parquet reader could do this, in fact I suggested this [#6510 (comment)](https://github.com/apache/arrow-rs/issues/6510#issuecomment-2394955461), but it is effectively wasted work
   
   Could you elaborate why it's wasted work? Presumably the column will be 
read at least once (otherwise why materialize it -- the read schema should 
have pruned it). If whoever consumes a given leaf column has to union the null 
masks anyway (possibly multiple times, if the column is read more than once), 
it seems no worse to have the writer do it once up front. The writer is also in 
a better position to do that unioning efficiently, as it assembles the columns 
into structs, whereas readers have to reverse engineer it every time they 
access each column. 
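   
   To make the cost being discussed concrete, here is a minimal sketch of the unioning step in plain Rust. It does not use the arrow crates: the `bool` slices and the `union_validity` helper are hypothetical stand-ins for real null buffers, and the real readers may do this quite differently. "Unioning null masks" here means AND-ing validity bits, so a child slot counts as valid only if both the parent struct slot and the child slot are valid.

```rust
/// Combine a parent struct's validity with a child's validity.
/// `true` means valid, `false` means null. Hypothetical helper; the slices
/// stand in for Arrow null buffers.
fn union_validity(parent: &[bool], child: Option<&[bool]>) -> Vec<bool> {
    match child {
        // Child has its own mask: a slot is valid only if both are valid.
        Some(c) => {
            assert_eq!(parent.len(), c.len(), "masks must have equal length");
            parent.iter().zip(c).map(|(p, c)| *p && *c).collect()
        }
        // Child has no mask (non-nullable): it inherits the parent's nulls.
        None => parent.to_vec(),
    }
}

fn main() {
    let parent = vec![true, false, true, true];
    let child = vec![true, true, false, true];

    // Slot 1 is null via the parent, slot 2 is null via the child itself.
    let masked = union_validity(&parent, Some(child.as_slice()));
    println!("{:?}", masked); // [true, false, false, true]
}
```

   Whether this pass runs once when the struct is assembled or every time a leaf column is consumed is exactly the tradeoff in question; the per-slot work is the same either way.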
   
   Or is the main concern that storing null masks for non-nullable nested 
columns would consume too much memory? 
   
   > if anything [storing pre-unioned null masks] encourages incorrect assumptions
   
   If the spec (or the implementation) doesn't _require_ valid null masks at 
every level, then I have to agree with you there. No point computing _any_ null 
masks unless they are trustworthy -- it just forces both reader and writer to 
pay the ~same cost because they don't trust each other's work.

