sunchao commented on pull request #35483:
URL: https://github.com/apache/spark/pull/35483#issuecomment-1042384416
> Other users and tools might be depending on the memory layout of
> {On,Off}HeapColumnVector since it's possible to access the underlying memory
> directly. I.e. this could cause breakage to user code. I don't know whether we
> consider this part of the public API or not.
As mentioned in the PR description, these are not public APIs, which is why the
change is acceptable.
> This adds memory and CPU overhead for a common case (non-NULL structs) to
> optimize for a different case (NULL structs). I don't have data to prove that
> NULL structs are queried less frequently than non-NULL structs but it seems
> likely. It doesn't help with worst-case memory consumption either.
Well, this is subjective, and there is no data point proving that NULL structs
are _not_ commonly used (I've seen quite a few of them before). Spark also
reuses these vectors, so the overhead is bounded by the number of projected
columns.
> This makes the memory layout for top-level fields inconsistent with the
> memory layout for fields inside structs, which can complicate the rest of the
> engine (need separate code for top-level column vectors and nested column
> vectors). Unnesting a struct field, e.g. projecting `SELECT x.y as x`, would
> require a conversion of the value data because the memory layout is different.
Hmm, I don't get this: what complication and what conversion? As you can see,
the PR itself is pretty simple and doesn't involve changes to the "rest of the
engine".
> The current memory layout is more consistent with Arrow -
> https://arrow.apache.org/docs/format/Columnar.html#struct-layout - so as more
> of the data ecosystem moves towards Arrow as the interchange format, we're
> locked into doing data format conversions.
For Arrow there is already a separate code path, `ArrowColumnVector`. All the
data ecosystems I know of (e.g., RAPIDS, Intel OAP, Iceberg) are already using
that, so I'm not really convinced this is an issue.
> As someone who's fairly familiar with the Parquet spec and has worked on a
> couple of Parquet readers, this motivation seems incomplete. Data for
> top-level fields in Parquet is stored in exactly the same way as nested
> fields. Why does the approach used to scatter top-level values to the output
> vector not work for structs? Conceptually it should be the same algorithm
> except you are scattering values with `def_level >= field_def_level` instead
> of `def_level >= 1`. Something like
The problem is that when creating the child column vectors for a struct, we
don't have the info about whether a slot in the vector is for a null struct or
not, since we are comparing against the child column's max definition level.
The null structs are created when we are assembling the child columns. Please
check out #34659.
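To make the discussion concrete, here is a hedged Python sketch (not Spark's actual code; the `assemble` helper, the level constants, and the example schema are made up for illustration) of how definition levels distinguish a NULL struct from a present struct with a NULL field, for a schema like `optional group x { optional int32 y }`, where def level 0 means `x` is NULL, 1 means `x` is present but `y` is NULL, and 2 means `y` is present:

```python
# Hedged sketch, assuming schema: optional group x { optional int32 y }.
# def level 0 -> x NULL; 1 -> x present, y NULL; 2 -> y present.
STRUCT_DEF_LEVEL = 1  # max definition level of the struct x itself
FIELD_DEF_LEVEL = 2   # max definition level of the leaf field y

def assemble(def_levels, values):
    """Scatter leaf values and derive null masks at both nesting levels.

    `values` holds only the non-NULL leaf values, in order, as Parquet
    stores them; slots below FIELD_DEF_LEVEL get no value.
    """
    struct_nulls, field_nulls, out = [], [], []
    vi = 0
    for d in def_levels:
        struct_nulls.append(d < STRUCT_DEF_LEVEL)  # NULL struct slot
        field_nulls.append(d < FIELD_DEF_LEVEL)    # NULL leaf slot
        if d >= FIELD_DEF_LEVEL:
            out.append(values[vi])
            vi += 1
        else:
            out.append(None)
    return struct_nulls, field_nulls, out
```

The point above is that the leaf decoder alone only compares against `FIELD_DEF_LEVEL`; telling a NULL struct apart from a present struct with a NULL field requires also comparing against the struct's own level, which is why the struct nulls are assembled afterwards from the children.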
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]