bersprockets opened a new pull request, #39117: URL: https://github.com/apache/spark/pull/39117
### What changes were proposed in this pull request? In `InterpretedUnsafeProjection`, use `UnsafeWriter.write`, rather than `UnsafeWriter.setNullAt`, to set null for interval fields. Also, in `InterpretedMutableProjection`, use `InternalRow.setInterval`, rather than `InternalRow.setNullAt`, to set null for interval fields. ### Why are the changes needed? This returns the wrong answer: ``` set spark.sql.codegen.wholeStage=false; set spark.sql.codegen.factoryMode=NO_CODEGEN; select first(col1), last(col2) from values (make_interval(0, 0, 0, 7, 0, 0, 0), make_interval(17, 0, 0, 2, 0, 0, 0)) as data(col1, col2); +---------------+---------------+ |first(col1) |last(col2) | +---------------+---------------+ |16 years 2 days|16 years 2 days| +---------------+---------------+ ``` In the above case, `TungstenAggregationIterator` uses `InterpretedUnsafeProjection` to create the aggregation buffer and to initialize all the fields to null. `InterpretedUnsafeProjection` incorrectly calls `UnsafeRowWriter#setNullAt`, rather than `unsafeRowWriter#write`, for the two calendar interval fields. As a result, the writer never allocates memory from the variable length region for the two intervals, and the pointers in the fixed region get left as zero. Later, when `InterpretedMutableProjection` attempts to update the first field, `UnsafeRow#setInterval` picks up the zero pointer and stores interval data on top of the null-tracking bit set. The call to UnsafeRow#setInterval for the second field also stomps the null-tracking bit set. Later updates to the null-tracking bit set (e.g., calls to `setNotNullAt`) further corrupt the interval data, turning `interval 7 years 2 days` into `interval 16 years 2 days`. Even after one fixes the above bug in `InterpretedUnsafeProjection` so that the buffer is created correctly, `InterpretedMutableProjection` has a similar bug to SPARK-41395, except this time for calendar interval data: ``` set spark.sql.codegen.wholeStage=false; set spark.sql.codegen.factoryMode=NO_CODEGEN; select first(col1), last(col2), max(col3) from values (null, null, 1), (make_interval(0, 0, 0, 7, 0, 0, 0), make_interval(17, 0, 0, 2, 0, 0, 0), 3) as data(col1, col2, col3); +---------------+---------------+---------+ |first(col1) |last(col2) |max(col3)| +---------------+---------------+---------+ |16 years 2 days|16 years 2 days|3 | +---------------+---------------+---------+ ``` These two bugs could get exercised during codegen fallback. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
