timarmstrong commented on pull request #35483:
URL: https://github.com/apache/spark/pull/35483#issuecomment-1042349079


   Hi,
     I found out about this change because I work on some things that depend on the memory layout of `OffHeapColumnVector`s and assume that data for struct columns has the same layout as top-level columns (i.e. storage is allocated for NULL values).
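
   To make that assumption concrete, here is a rough sketch of the access pattern such code relies on (based on my reading of the current `WritableColumnVector`/`OffHeapColumnVector` API, so treat the exact constructor and method names as approximate): the child vector of a struct keeps a value slot for every row, including rows where the struct itself is NULL.

   ```java
   import org.apache.spark.sql.execution.vectorized.OffHeapColumnVector;
   import org.apache.spark.sql.types.DataTypes;
   import org.apache.spark.sql.types.StructType;

   public class StructLayoutAssumption {
     public static void main(String[] args) {
       // Sketch only: assumes the WritableColumnVector API behaves as in current Spark.
       StructType structType = new StructType().add("y", DataTypes.IntegerType);
       OffHeapColumnVector parent = new OffHeapColumnVector(3, structType);

       // Row 0: {y = 1}, row 1: NULL struct, row 2: {y = 3}.
       parent.getChild(0).putInt(0, 1);
       parent.putNull(1);                  // NULL at the struct level only
       parent.getChild(0).putInt(2, 3);

       // The assumed layout: child slot i always corresponds to row i, so row 2's
       // value is readable at index 2 even though the struct at row 1 is NULL.
       // Code that reads the off-heap buffers directly depends on this alignment.
       assert parent.getChild(0).getInt(2) == 3;
       parent.close();
     }
   }
   ```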
     
   I'm biased, but I have a few reasons I think this is not the right direction 
for Spark:
     * Other users and tools might be depending on the memory layout of `{On,Off}HeapColumnVector`, since it's possible to access the underlying memory directly; this change could break their code. I don't know whether we consider this part of the public API or not.
     * This adds memory and CPU overhead for a common case (non-NULL structs) to optimize for a different case (NULL structs). I don't have data to prove that NULL structs are queried less frequently than non-NULL structs, but it seems likely. It doesn't help with worst-case memory consumption either.
     * This makes the memory layout for top-level fields inconsistent with the memory layout for fields inside structs, which can complicate the rest of the engine (separate code paths are needed for top-level column vectors and nested column vectors).
     * Unnesting a struct field, e.g. projecting `SELECT x.y as x`, would require converting the value data because the memory layout is different (see the sketch after this list).
     * The current memory layout is more consistent with Arrow - https://arrow.apache.org/docs/format/Columnar.html#struct-layout - so changing it would lock us into data format conversions as more of the data ecosystem moves towards Arrow as the interchange format.
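
   On the unnesting point, here is a rough sketch (a hypothetical helper, not Spark code) of the kind of conversion `SELECT x.y as x` would need if child values are packed densely, i.e. with no slots for rows where the parent struct is NULL. With the current layout the child buffers can be reused directly and this copy is unnecessary.

   ```java
   // Hypothetical helper, not Spark code: expand a densely packed child column
   // (one value per non-NULL parent row) back into the top-level layout, which
   // allocates a slot for every row.
   static int[] expandPackedChild(int[] packedChildValues, boolean[] parentIsNull) {
     int numRows = parentIsNull.length;
     int[] out = new int[numRows];   // one slot per row, including NULL-struct rows
     int packedIdx = 0;
     for (int row = 0; row < numRows; row++) {
       if (!parentIsNull[row]) {
         out[row] = packedChildValues[packedIdx++];
       }
       // Rows where the struct is NULL keep an (undefined) slot, matching the
       // top-level layout where storage is allocated for NULL values.
     }
     return out;
   }
   ```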
   
   > In addition, this doesn't work well with the ongoing Parquet vectorized 
support for complex types, since when creating the child column vectors for a 
struct, we don't yet have the information of which slot is for null struct 
(Parquet packs values contiguously instead of having null slots).

   As someone who's fairly familiar with the Parquet spec and has worked on a couple of Parquet readers, this motivation seems incomplete. Data for top-level fields in Parquet is stored in exactly the same way as for nested fields. Why does the approach used to scatter top-level values into the output vector not work for structs? Conceptually it should be the same algorithm, except you are scattering values with `def_level >= field_def_level` instead of `def_level >= 1`. Something like:
   ```
   int curr_value_idx = 0;  // index into the densely packed Parquet value data
   for (int i = 0; i < num_values; i++) {
     // A value is NULL at this field if its definition level is below the
     // field's definition level.
     bool is_null = parquet_def_levels[i] < field_def_level;
     out.nulls[i] = is_null;
     if (!is_null) {
       // Scatter the next packed value into the slot for row i.
       out.values[i] = parquet_values[curr_value_idx];
       curr_value_idx++;
     }
   }
   ```
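
   For example, for an optional struct `x` containing an optional int `y`, `y`'s max definition level is 2: `def_level == 0` means `x` is NULL, `1` means `x` is present but `y` is NULL, and `2` means `y` has a value. The loop above handles that case unchanged with `field_def_level = 2`.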


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


