sunchao commented on pull request #35483:
URL: https://github.com/apache/spark/pull/35483#issuecomment-1042384416
> Other users and tools might be depending on the memory layout of
> {On,Off}HeapColumnVector since it's possible to access the underlying memory
> directly. I.e. this could cause breakage to user code. I don't know whether we
> consider this part of the public API or not.
As mentioned in the PR description, these are not public APIs, which is why the
change is acceptable.
> This adds memory and CPU overhead for a common case (non-NULL structs) to
> optimize for a different case (NULL structs). I don't have data to prove that
> NULL structs are queried less frequently than non-NULL structs but it seems
> likely. It doesn't help with worst-case memory consumption either.
Well, this is subjective, and there is no data point proving that NULL structs
are _not_ commonly used (I've seen quite a few of them before). Spark also
reuses these vectors, so the overhead is bounded by the number of projected
columns.
> This makes the memory layout for top-level fields inconsistent with the
> memory layout for fields inside structs, which can complicate the rest of the
> engine (need separate code for top-level column vectors and nested column
> vectors). Unnesting a struct field, e.g. projecting `SELECT x.y as x`, would
> require a conversion of the value data because the memory layout is different.
Hmm, I don't get this: what complication and what conversion? As you can see,
the PR itself is pretty simple and doesn't involve changes to the "rest of the
engine".
> The current memory layout is more consistent with Arrow -
> https://arrow.apache.org/docs/format/Columnar.html#struct-layout - so as more
> of the data ecosystem moves towards Arrow as the interchange format, we're
> locked into doing data format conversions.
For Arrow there is already a separate code path, `ArrowColumnVector`. All the
data ecosystems I know of (e.g., RAPIDS, Intel OAP, Iceberg) are already using
that, so I'm not really convinced this is an issue.
> As someone who's fairly familiar with the Parquet spec and has worked on a
> couple of Parquet readers, this motivation seems incomplete. Data for
> top-level fields in Parquet is stored in exactly the same way as nested
> fields. Why does the approach used to scatter top-level values to the output
> vector not work for structs? Conceptually it should be the same algorithm
> except you are scattering values with `def_level >= field_def_level` instead
> of `def_level >= 1`. Something like
The problem is that when creating the child column vectors for a struct, we
don't have the info about whether a slot in the vector is for a null struct or
not, since we are comparing against the child column's max definition level.
The null structs are created when we are assembling the child columns. Please
check out #34659.
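To make the discussion concrete, here is a hedged Python sketch (not Spark's actual code; the `assemble` helper, the level constants, and the example schema are made up for illustration) of how definition levels distinguish a NULL struct from a present struct with a NULL field, for a schema like `optional group x { optional int32 y }`, where def level 0 means `x` is NULL, 1 means `x` is present but `y` is NULL, and 2 means `y` is present:

```python
# Hedged sketch, assuming schema: optional group x { optional int32 y }.
# def level 0 -> x NULL; 1 -> x present, y NULL; 2 -> y present.
STRUCT_DEF_LEVEL = 1  # max definition level of the struct x itself
FIELD_DEF_LEVEL = 2   # max definition level of the leaf field y

def assemble(def_levels, values):
    """Scatter leaf values and derive null masks at both nesting levels.

    `values` holds only the non-NULL leaf values, in order, as Parquet
    stores them; slots below FIELD_DEF_LEVEL get no value.
    """
    struct_nulls, field_nulls, out = [], [], []
    vi = 0
    for d in def_levels:
        struct_nulls.append(d < STRUCT_DEF_LEVEL)  # NULL struct slot
        field_nulls.append(d < FIELD_DEF_LEVEL)    # NULL leaf slot
        if d >= FIELD_DEF_LEVEL:
            out.append(values[vi])
            vi += 1
        else:
            out.append(None)
    return struct_nulls, field_nulls, out
```

The point above is that the leaf decoder alone only compares against `FIELD_DEF_LEVEL`; telling a NULL struct apart from a present struct with a NULL field requires also comparing against the struct's own level, which is why the struct nulls are assembled afterwards from the children.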
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]