timarmstrong commented on pull request #35483: URL: https://github.com/apache/spark/pull/35483#issuecomment-1042670516
> The problem is when creating child column vectors for a struct, we don't have the info whether a slot in the vector is for a null struct or not, as we are comparing with the child column's max definition level. The null structs are created when we are assembling the child columns. Please check out https://github.com/apache/spark/pull/34659.

I spent about half an hour looking at that PR and I don't understand why the vectorized reader needs this WritableColumnVector change. I can see that the code in `VectorizedRleValuesReader` is written to use this change (it only increments `state.valueOffset` if the parent struct is non-null), but I don't see why the logic needs to work that way. This wasn't a problem in other vectorized Parquet readers I've worked with or looked at: both the Apache Impala and the Apache Arrow/parquet-cpp Parquet readers read values in a "spaced" manner for structs without problems - https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L1014

> Well, this is subjective though and there is no data point proving that NULL structs is not used commonly (at least I've seen quite some of these before). Spark also reuses these vectors so the overhead is bounded by the number of projected columns.

I don't see any evidence that making this different trade-off is better, though. Given how many CPUs this code will ultimately run on, it seems like we should have a stronger reason to think this change is a net positive. Is there some benchmark or workload where this is significantly better?

> Hm I don't get this: what complication and what conversion? as you can see the PR itself is pretty simple and it doesn't involve changes on the "rest of the engine".

I guess most of Spark doesn't use these column vectors, so maybe this is not too important. But if there is ever more effort to add vectorized operations in Spark, it would be simpler if vectorized operations could be applied to struct elements the same way as to top-level elements.
This maybe isn't a strong reason right now, but the current Spark layout seems more future-proof.

> For Arrow there is a separate code path already ArrowColumnVector. All the data eco-systems I know (e.g., RAPIDS, Intel OAP, Iceberg) are already using that. So, I'm not really convinced this is an issue.

I was thinking that people in the future will want conversions between different column vector types with minimal data copying where possible, e.g. if you want to scan from Parquet and call a UDF using Arrow. Producing one style of ColumnVector and then converting to another is inefficient. You could avoid the copy in principle when the memory layouts of the vectors are identical (pretty common for numeric types). This really isn't super-important - maybe nobody will ever want to do this - but it is just another reason not to change this.
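To make the "spaced" reading style mentioned above concrete, here is a minimal sketch (not Spark's actual classes; the vector type, field names, and level values are simplified assumptions for illustration). The point is that the child reader decides value-vs-null purely from the definition level and consumes one output slot per row, so it never needs to know whether the parent struct slot is null and offsets stay aligned with row positions.

```java
import java.util.Arrays;

public class SpacedReadSketch {
    // Hypothetical flat vector: a dense value array plus a null flag per slot.
    static class IntVector {
        final int[] values;
        final boolean[] isNull;
        IntVector(int n) { values = new int[n]; isNull = new boolean[n]; }
    }

    // defLevels[i] == maxDefLevel means the leaf value is defined; anything
    // lower (null leaf OR null ancestor struct) becomes a null slot. Either
    // way the output slot is consumed, so no parent-nullability bookkeeping
    // is needed while decoding the child column.
    static IntVector readSpaced(int[] defLevels, int maxDefLevel, int[] encodedValues) {
        IntVector out = new IntVector(defLevels.length);
        int valueIdx = 0; // index into the densely-encoded value stream
        for (int i = 0; i < defLevels.length; i++) {
            if (defLevels[i] == maxDefLevel) {
                out.values[i] = encodedValues[valueIdx++];
            } else {
                out.isNull[i] = true; // covers both null leaf and null parent struct
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // For struct<a: int> with both levels nullable:
        // defLevel 0 = null struct, 1 = struct present but a is null, 2 = a defined.
        int[] defLevels = {2, 0, 1, 2};
        int[] encoded = {10, 20}; // only defined values appear in the encoded stream
        IntVector v = readSpaced(defLevels, 2, encoded);
        System.out.println(Arrays.toString(v.values));
        System.out.println(Arrays.toString(v.isNull));
        // rows 1 and 2 come out null; rows 0 and 3 hold 10 and 20
    }
}
```

Under this scheme the child vector always has one slot per row, which is the layout the Impala and Arrow/parquet-cpp readers produce; the PR's alternative instead skips slots under null parents, which is what forces the conditional `valueOffset` increment.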
