bersprockets opened a new pull request, #37513:
URL: https://github.com/apache/spark/pull/37513

   ### What changes were proposed in this pull request?
   
   Add code to defensively check if the pre-allocated result array is big 
enough to handle the next element in a date or timestamp sequence.
   
   ### Why are the changes needed?
   
   `InternalSequenceBase.getSequenceLength` is a fast method for estimating the 
size of the result array. It uses an estimated step size in microseconds, which 
is not always accurate for a given date, time, and time zone combination. As a 
result, `getSequenceLength` occasionally overestimates and occasionally 
underestimates the size of the result array.
   
   `getSequenceLength` sometimes overestimates the size of the result array 
when the step size is in months (because `InternalSequenceBase` assumes 28 days 
per month). This case is handled: `InternalSequenceBase` will slice the array, 
if needed.
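   
   The overestimation case can be illustrated with a minimal Java sketch. It is not Spark's code: the class `MonthStepEstimate`, its method names, and the simplified formula `1 + days / (28 * stepMonths)` are assumptions made for illustration, and it assumes `start <= stop` and a positive step.
   
   ```java
   import java.time.LocalDate;
   import java.time.temporal.ChronoUnit;
   import java.util.Arrays;
   
   // Hypothetical illustration, not Spark's InternalSequenceBase.
   final class MonthStepEstimate {
   
       // Simplified estimate (an assumption): treat every month as 28 days.
       // Real months are at least 28 days long, so this never underestimates.
       static int estimateLength(LocalDate start, LocalDate stop, int stepMonths) {
           long days = ChronoUnit.DAYS.between(start, stop);
           return (int) (1 + days / (28L * stepMonths));
       }
   
       // Fills a pre-allocated array using real calendar months, then slices
       // off the unused tail if the estimate was too large.
       static LocalDate[] monthlySequence(LocalDate start, LocalDate stop, int stepMonths) {
           LocalDate[] arr = new LocalDate[estimateLength(start, stop, stepMonths)];
           int i = 0;
           LocalDate d = start;
           while (!d.isAfter(stop)) {
               arr[i++] = d;
               // Always offset from the start to avoid drift from end-of-month clamping.
               d = start.plusMonths((long) i * stepMonths);
           }
           return i == arr.length ? arr : Arrays.copyOf(arr, i);
       }
   }
   ```
   
   Under these assumptions, a one-month step from 2022-01-01 to 2023-01-01 gives an estimate of 14 slots (365 / 28 = 13, plus 1), while the actual sequence has 13 elements, so the final slice drops one unused slot.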
   
   `getSequenceLength` sometimes underestimates the size of the result array 
when the sequence crosses a DST "spring forward" without a corresponding "fall 
backward". This case is not handled (thus, this PR).
   
   For example:
   ```
   select sequence(
     timestamp'2022-03-13 00:00:00',
     timestamp'2022-03-14 00:00:00',
     interval 1 day) as x;
   ```
   In the America/Los_Angeles time zone, this results in the following error:
   ```
   java.lang.ArrayIndexOutOfBoundsException: 1
        at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77)
   ```
   This happens because `InternalSequenceBase` calculates an estimated step 
size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the 
America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13 
has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later 
than the specified stop value, `getSequenceLength` assumes the stop value is 
not included in the result. Therefore, `getSequenceLength` estimates an array 
size of 1.
   
   However, when actually creating the sequence, `InternalSequenceBase` does 
not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13 
00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we 
overrun the end of the result array.
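   
   The difference between the two kinds of step can be reproduced outside Spark with plain `java.time`. This is a minimal sketch: the class `DstStepDemo` and its method names are hypothetical, and it only mirrors the arithmetic described above, not Spark's implementation.
   
   ```java
   import java.time.ZoneId;
   import java.time.ZonedDateTime;
   
   // Hypothetical illustration of the two step semantics.
   final class DstStepDemo {
       static final ZoneId LA = ZoneId.of("America/Los_Angeles");
   
       // Fixed 24-hour step: plusHours operates on the instant timeline,
       // so it adds exactly 24 elapsed hours.
       static ZonedDateTime plusFixed24Hours(ZonedDateTime start) {
           return start.plusHours(24);
       }
   
       // Calendar step: plusDays operates on the local timeline,
       // so the wall-clock time of day is preserved.
       static ZonedDateTime plusOneCalendarDay(ZonedDateTime start) {
           return start.plusDays(1);
       }
   }
   ```
   
   Starting from 2022-03-13 00:00:00 in `America/Los_Angeles`, `plusFixed24Hours` lands on 2022-03-14 01:00:00, which is after the stop value, while `plusOneCalendarDay` lands exactly on 2022-03-14 00:00:00. The length estimate sees the first result and the sequence generation sees the second.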
   
   The new unit test includes examples of problematic date sequences.
   
   This PR adds code to handle the underestimation case: it checks if we're 
about to overrun the array, and if so, gets a new array that's larger by 1.
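   
   The defensive check can be sketched roughly as follows. This is a simplified Java illustration over `long` values with a fixed positive step, not the Scala code in this PR; `GrowOnOverrun`, `ensureCapacity`, and `buildSequence` are hypothetical names, and arithmetic overflow is ignored.
   
   ```java
   import java.util.Arrays;
   
   // Hypothetical illustration of growing a pre-allocated array by one slot.
   final class GrowOnOverrun {
   
       // Returns arr unchanged if index i fits, otherwise a copy one slot larger.
       static long[] ensureCapacity(long[] arr, int i) {
           return i < arr.length ? arr : Arrays.copyOf(arr, arr.length + 1);
       }
   
       // Builds start, start + step, ... up to and including stop, writing into
       // an array whose length is only an estimate (possibly too small or too large).
       static long[] buildSequence(long start, long stop, long step, int estimatedLen) {
           if (step <= 0) {
               throw new IllegalArgumentException("step must be positive in this sketch");
           }
           long[] arr = new long[estimatedLen];
           int i = 0;
           for (long t = start; t <= stop; t += step) {
               arr = ensureCapacity(arr, i); // grow only when about to overrun
               arr[i++] = t;
           }
           // Slice off unused slots if the estimate was too large.
           return i == arr.length ? arr : Arrays.copyOf(arr, i);
       }
   }
   ```
   
   With an estimate that is one too small, for example `buildSequence(0, 10, 5, 2)`, the loop grows the array once instead of throwing `ArrayIndexOutOfBoundsException`. With an estimate that is too large, the final slice trims the result.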
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit test.
   

