timarmstrong commented on pull request #35483: URL: https://github.com/apache/spark/pull/35483#issuecomment-1042670516
> The problem is when creating child column vectors for a struct, we don't have the info whether a slot in the vector is for a null struct or not, as we are comparing with the child column's max definition level. The null structs are created when we are assembling the child columns. Please check out https://github.com/apache/spark/pull/34659.

I spent about half an hour looking at that PR and I don't understand why the vectorized reader needs this WritableColumnVector change. I can see that the code in `VectorizedRleValuesReader` is written to use this change (it only increments `state.valueOffset` if the parent struct is non-null), but I don't see why the logic needs to work that way. This wasn't a problem in other vectorized Parquet readers I've worked with or looked at: both the Apache Impala and the Apache Arrow/parquet-cpp Parquet readers read values in a "spaced" manner for structs without problems - https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L1014

> Well, this is subjective though and there is no data point proving that NULL structs is not used commonly (at least I've seen quite some of these before). Spark also reuses these vectors so the overhead is bounded by the number of projected columns.

I don't see any evidence that making this different trade-off is better, though. Given how many CPUs this code will ultimately run on, it seems like we should have a stronger reason to think this change is a net positive. Is there some benchmark or workload where this is significantly better?

> Hm I don't get this: what complication and what conversion? as you can see the PR itself is pretty simple and it doesn't involve changes on the "rest of the engine".

I guess most of Spark doesn't use these column vectors, so maybe this is not too important. But if there is ever more effort to add vectorized operations in Spark, it would be simpler if vectorized operations could be applied to struct elements the same way as to top-level elements.
This maybe isn't a strong reason right now, but the current Spark layout seems more future-proof.

> For Arrow there is a separate code path already ArrowColumnVector. All the data eco-systems I know (e.g., RAPIDS, Intel OAP, Iceberg) are already using that. So, I'm not really convinced this is an issue.

I was thinking that people in the future will want conversions between different column vector types with minimal data copying where possible, e.g. if you want to scan from Parquet and call a UDF using Arrow. Producing one style of ColumnVector and then converting to another is inefficient. You could avoid the copy in principle when the memory layouts of the vectors are identical (pretty common for numeric types). This really isn't super-important - maybe nobody will ever want to do this - but it is just another reason not to change this.
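To make the "spaced" reading style mentioned above concrete, here is a minimal sketch (not Spark's actual classes; the vector type, field names, and level values are simplified assumptions for illustration). The point is that the child reader decides value-vs-null purely from the definition level and consumes one output slot per row, so it never needs to know whether the parent struct slot is null and offsets stay aligned with row positions.

```java
import java.util.Arrays;

public class SpacedReadSketch {
    // Hypothetical flat vector: a dense value array plus a null flag per slot.
    static class IntVector {
        final int[] values;
        final boolean[] isNull;
        IntVector(int n) { values = new int[n]; isNull = new boolean[n]; }
    }

    // defLevels[i] == maxDefLevel means the leaf value is defined; anything
    // lower (null leaf OR null ancestor struct) becomes a null slot. Either
    // way the output slot is consumed, so no parent-nullability bookkeeping
    // is needed while decoding the child column.
    static IntVector readSpaced(int[] defLevels, int maxDefLevel, int[] encodedValues) {
        IntVector out = new IntVector(defLevels.length);
        int valueIdx = 0; // index into the densely-encoded value stream
        for (int i = 0; i < defLevels.length; i++) {
            if (defLevels[i] == maxDefLevel) {
                out.values[i] = encodedValues[valueIdx++];
            } else {
                out.isNull[i] = true; // covers both null leaf and null parent struct
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // For struct<a: int> with both levels nullable:
        // defLevel 0 = null struct, 1 = struct present but a is null, 2 = a defined.
        int[] defLevels = {2, 0, 1, 2};
        int[] encoded = {10, 20}; // only defined values appear in the encoded stream
        IntVector v = readSpaced(defLevels, 2, encoded);
        System.out.println(Arrays.toString(v.values));
        System.out.println(Arrays.toString(v.isNull));
        // rows 1 and 2 come out null; rows 0 and 3 hold 10 and 20
    }
}
```

Under this scheme the child vector always has one slot per row, which is the layout the Impala and Arrow/parquet-cpp readers produce; the PR's alternative instead skips slots under null parents, which is what forces the conditional `valueOffset` increment.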
