bersprockets opened a new pull request, #37513:
URL: https://github.com/apache/spark/pull/37513
### What changes were proposed in this pull request?
Add code to defensively check if the pre-allocated result array is big
enough to handle the next element in a date or timestamp sequence.
### Why are the changes needed?
`InternalSequenceBase.getSequenceLength` is a fast method for estimating the
size of the result array. It uses an estimated step size in micros, which is not
always accurate for the date/time/time-zone combination. As a result,
`getSequenceLength` occasionally overestimates the size of the result array and
occasionally underestimates it.
`getSequenceLength` sometimes overestimates the size of the result array
when the step size is in months (because `InternalSequenceBase` assumes 28 days
per month). This case is handled: `InternalSequenceBase` will slice the array,
if needed.
`getSequenceLength` sometimes underestimates the size of the result array
when the sequence crosses a DST "spring forward" without a corresponding "fall
backward". This case is not handled (thus, this PR).
For example:
```
select sequence(
timestamp'2022-03-13 00:00:00',
timestamp'2022-03-14 00:00:00',
interval 1 day) as x;
```
In the America/Los_Angeles time zone, this results in the following error:
```
java.lang.ArrayIndexOutOfBoundsException: 1
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77)
```
This happens because `InternalSequenceBase` calculates an estimated step
size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the
America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13
has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later
than the specified stop value, `getSequenceLength` assumes the stop value is
not included in the result. Therefore, `getSequenceLength` estimates an array
size of 1.
However, when actually creating the sequence, `InternalSequenceBase` does
not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13
00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we
overrun the end of the result array.
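The gap between the two step sizes can be reproduced outside Spark with plain `java.time`. This is only an illustrative sketch: the class and method names (`DstStepDemo`, `plusEstimatedStep`, `plusCalendarStep`) are hypothetical and are not part of `InternalSequenceBase`. It assumes that `plusHours` stands in for the fixed 24-hour estimate and `plusDays` for the calendar-aware step.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical demo class; not Spark code.
class DstStepDemo {
    static final ZoneId LA = ZoneId.of("America/Los_Angeles");

    // Start of the example sequence: midnight on the "spring forward" day.
    static ZonedDateTime start() {
        return ZonedDateTime.of(2022, 3, 13, 0, 0, 0, 0, LA);
    }

    // Stop value of the example sequence.
    static ZonedDateTime stop() {
        return ZonedDateTime.of(2022, 3, 14, 0, 0, 0, 0, LA);
    }

    // Fixed-duration step: advances the instant by exactly 24 hours.
    static ZonedDateTime plusEstimatedStep(ZonedDateTime t) {
        return t.plusHours(24);
    }

    // Calendar step: advances the local date by one day, keeping the local time.
    static ZonedDateTime plusCalendarStep(ZonedDateTime t) {
        return t.plusDays(1);
    }

    public static void main(String[] args) {
        ZonedDateTime byHours = plusEstimatedStep(start());
        ZonedDateTime byDay = plusCalendarStep(start());
        System.out.println("start + 24 hours = " + byHours.toLocalDateTime());
        System.out.println("start + 1 day    = " + byDay.toLocalDateTime());
        // The 24-hour step lands past the stop value; the 1-day step lands on it.
        System.out.println("24h step is after stop: " + byHours.isAfter(stop()));
        System.out.println("1 day step equals stop: " + byDay.isEqual(stop()));
    }
}
```

With the 24-hour step the stop value looks unreachable, while the 1-day step lands exactly on it, which matches the mismatch described above.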
The new unit test includes examples of problematic date sequences.
This PR adds code to handle the underestimation case: it checks if we're
about to overrun the array, and if so, gets a new array that's larger by 1.
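As a rough illustration of the grow-by-one idea (not the actual patch), the sketch below pre-allocates an array from a fixed 24-hour estimate, fills it using calendar-day steps, grows it by one element when the next write would overrun it, and trims it if the estimate was too large. Everything here is an assumption made for illustration: the class `SequenceSketch`, its methods, and the restriction to positive whole-day steps with `start` not after `stop`.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Arrays;

// Hypothetical sketch; names and structure are not taken from Spark.
class SequenceSketch {
    static final long MICROS_PER_DAY = 24L * 60 * 60 * 1_000_000L;

    // Microseconds since the epoch for the given date-time.
    static long toMicros(ZonedDateTime t) {
        return Math.addExact(Math.multiplyExact(t.toEpochSecond(), 1_000_000L),
                             t.getNano() / 1_000);
    }

    // Fast estimate: treats every day as exactly 24 hours.
    // Assumes stepDays > 0 and start <= stop.
    static int estimateLength(ZonedDateTime start, ZonedDateTime stop, int stepDays) {
        long span = toMicros(stop) - toMicros(start);
        return (int) (1 + span / (stepDays * MICROS_PER_DAY));
    }

    static long[] sequence(ZonedDateTime start, ZonedDateTime stop, int stepDays) {
        long[] result = new long[estimateLength(start, stop, stepDays)];
        int i = 0;
        ZonedDateTime t = start;
        while (!t.isAfter(stop)) {
            // Defensive check: the estimate was too small, so grow by one.
            if (i == result.length) {
                result = Arrays.copyOf(result, result.length + 1);
            }
            result[i] = toMicros(t);
            i++;
            // Calendar-aware step, computed from the start value.
            t = start.plusDays((long) i * stepDays);
        }
        // If the estimate was too large, slice the array down to the used part.
        return i == result.length ? result : Arrays.copyOf(result, i);
    }

    public static void main(String[] args) {
        ZoneId la = ZoneId.of("America/Los_Angeles");
        ZonedDateTime start = ZonedDateTime.of(2022, 3, 13, 0, 0, 0, 0, la);
        ZonedDateTime stop = ZonedDateTime.of(2022, 3, 14, 0, 0, 0, 0, la);
        System.out.println("estimated length: " + estimateLength(start, stop, 1));
        System.out.println("actual length:    " + sequence(start, stop, 1).length);
    }
}
```

For the example sequence, the estimate is one element short of what the fill loop produces, so the growth branch runs exactly once. Growing by one element at a time is cheap here only because the underestimate is expected to be small.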
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit test.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]