timarmstrong commented on pull request #35483:
URL: https://github.com/apache/spark/pull/35483#issuecomment-1042349079
Hi,
I found out about this change because I work on some things that depend
on the memory layout of `OffHeapColumnVector`s and assume that data for struct
columns has the same layout as top-level columns (i.e. storage is allocated for
NULL values).
I'm biased, but I have a few reasons I think this is not the right direction
for Spark:
* Other users and tools might be depending on the memory layout of
`{On,Off}HeapColumnVector` since it's possible to access the underlying memory
directly, i.e. this change could break user code. I don't know whether we
consider this part of the public API or not.
* This adds memory and CPU overhead for a common case (non-NULL structs) to
optimize for a different case (NULL structs). I don't have data to prove that
NULL structs are queried less frequently than non-NULL structs, but it seems
likely. It doesn't help with worst-case memory consumption either.
* This makes the memory layout for top-level fields inconsistent with the
memory layout for fields inside structs, which can complicate the rest of the
engine (it would need separate code paths for top-level column vectors and
nested column vectors).
* Unnesting a struct field, e.g. projecting `SELECT x.y AS x`, would
require converting the value data because the memory layout differs (see the
sketch after this list).
* The current memory layout is more consistent with Arrow -
https://arrow.apache.org/docs/format/Columnar.html#struct-layout - so as more
of the data ecosystem moves towards Arrow as the interchange format, this
change would lock us into doing data format conversions.
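To make the direct-memory-access and unnesting points concrete, here is a
minimal sketch against the current layout. The schema, values, and the
`LayoutSketch` class are made up for illustration, and it assumes the
test-facing `OffHeapColumnVector.valuesNativeAddress()` accessor is still
exposed; the point is only that the child's buffers keep a slot for rows where
the parent struct is NULL, so both raw-memory readers and unnesting can treat
the child like a top-level column:

```
import org.apache.spark.sql.execution.vectorized.OffHeapColumnVector;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.unsafe.Platform;

public class LayoutSketch {
  public static void main(String[] args) {
    // Hypothetical schema: a struct column x with a single int field y.
    StructType schema = new StructType()
        .add("x", new StructType().add("y", DataTypes.IntegerType));
    OffHeapColumnVector x = OffHeapColumnVector.allocateColumns(3, schema)[0];
    OffHeapColumnVector y = (OffHeapColumnVector) x.getChild(0);

    // Row 0: x = {y: 7}; row 1: x = NULL; row 2: x = {y: 9}.
    // With the current layout the child keeps a (NULL) slot even when the
    // parent struct is NULL, so child row i always lines up with parent row i.
    y.putInt(0, 7);
    x.putNull(1);
    y.putNull(1);
    y.putInt(2, 9);

    // External code reading the off-heap buffers directly relies on this:
    // the int for row i lives at valuesNativeAddress() + 4 * i regardless of
    // parent NULLs. (Assumes the test-facing valuesNativeAddress() accessor.)
    long base = y.valuesNativeAddress();
    System.out.println(Platform.getInt(null, base + 4 * 2L)); // prints 9

    // Unnesting `SELECT x.y AS x` can just hand out the child vector as a
    // top-level column without rewriting its buffers.
    ColumnVector unnested = x.getChild(0);
    for (int i = 0; i < 3; i++) {
      System.out.println(unnested.isNullAt(i) ? "NULL"
                                              : String.valueOf(unnested.getInt(i)));
    }

    x.close(); // frees the off-heap buffers, including the child's
  }
}
```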
> In addition, this doesn't work well with the ongoing Parquet vectorized
> support for complex types, since when creating the child column vectors for a
> struct, we don't yet have the information of which slot is for null struct
> (Parquet packs values contiguously instead of having null slots).
As someone who's fairly familiar with the Parquet spec and has worked on a
couple of Parquet readers, I find this motivation incomplete. Data for top-level
fields in Parquet is stored in exactly the same way as data for nested fields. Why does
the approach used to scatter top-level values to the output vector not work for
structs? Conceptually it should be the same algorithm except you are scattering
values with `def_level >= field_def_level` instead of `def_level >= 1`.
Something like:
```
// One definition level per output row (non-repeated field): scatter the
// contiguously packed Parquet values into the output vector, leaving NULL
// slots where the definition level says the field is undefined.
int curr_value_idx = 0;
for (int i = 0; i < num_rows; i++) {
  // The value is defined only when its def level reaches the field's level.
  bool is_null = parquet_def_levels[i] < field_def_level;
  out.nulls[i] = is_null;
  if (!is_null) {
    out.values[i] = parquet_values[curr_value_idx];
    curr_value_idx++;
  }
}
```
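To spell out where `field_def_level` comes from: it is the column's maximum
definition level, i.e. the number of optional/repeated fields in its path,
including the field itself. For a made-up schema with a nullable struct `x`
containing a nullable int `y`, the mapping would be:
```
message example {
  optional group x {
    optional int32 y;
  }
}

field_def_level for y = 2

def_level 0  ->  x is NULL              (no packed value; scatter a NULL)
def_level 1  ->  x non-NULL, y is NULL  (no packed value; scatter a NULL)
def_level 2  ->  y is present           (consume the next packed value)
```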