MaxGekk opened a new pull request #28163: [SPARK-31359][SQL][3.0] Speed up 
timestamps rebasing
URL: https://github.com/apache/spark/pull/28163
 
 
   ### What changes were proposed in this pull request?
   In this PR, I propose to optimise the 
`DateTimeUtils.rebaseJulianToGregorianMicros()` and 
`rebaseGregorianToJulianMicros()` functions by using pre-calculated rebasing 
tables. This approach avoids expensive conversions via local timestamps. For 
example, in the time interval 0001-01-01 .. 2100-01-01 the 
`America/Los_Angeles` time zone has just a few time points at which the 
difference between the Proleptic Gregorian calendar and the hybrid calendar 
(Julian + Gregorian since 1582-10-15) changes:
   
   | i | local timestamp | Proleptic Greg. seconds | Hybrid (Julian+Greg) seconds | difference in minutes (Hybrid - Proleptic Greg.) |
   | -- | ------- |----|----| ---- |
   |0|0001-01-01 00:00|-62135568422|-62135740800|-2872|
   |1|0100-03-01 00:00|-59006333222|-59006419200|-1432|
   |...|...|...|...|...|
   |13|1582-10-15 00:00|-12219264422|-12219264000|7|
   |14|1883-11-18 12:00|-2717640000|-2717640000|0|
   
   The difference in microseconds between the Proleptic Gregorian and hybrid 
calendars is the same for any local timestamp in the time interval 
`[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the 
time interval `[Gregorian micros(i), Gregorian micros(i+1))`. So we can rebase 
input micros by following these steps:
   1. Look up the table and find the time interval into which the micros fall.
   2. Take the difference between the 2 calendars for this time interval.
   3. Add the difference to the input micros. The result is the rebased 
microseconds, which have the same local timestamp representation.
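
The three steps above can be sketched in Scala as follows. This is a minimal illustration, not the PR's actual code: the object name `RebaseStepsSketch` and the function signature are assumptions, and the search is linear, as the testing section says of `rebaseMicros`. The toy table holds only rows 0, 1, 13 and 14 of the `America/Los_Angeles` table above, in the Julian-to-Gregorian direction. The diff for row 13 is derived from the table as Proleptic minus Hybrid seconds. Rows 2..12 are elided in the PR and are not filled in, so results are meaningful only for inputs that fall into the intervals of rows 0, 13 and 14.

```scala
object RebaseStepsSketch {
  val MicrosPerSecond = 1000000L

  // Toy Julian-to-Gregorian table for America/Los_Angeles, in seconds.
  // Only rows 0, 1, 13 and 14 shown in the PR are present (assumption:
  // the elided rows 2..12 are simply left out of this sketch).
  val switchesSec: Array[Long] =
    Array(-62135740800L, -59006419200L, -12219264000L, -2717640000L)
  val diffsSec: Array[Long] = Array(172378L, 85978L, -422L, 0L)

  // Convert to microseconds once, so the lookup works on micros directly.
  val switches: Array[Long] = switchesSec.map(_ * MicrosPerSecond)
  val diffs: Array[Long] = diffsSec.map(_ * MicrosPerSecond)

  def rebaseMicros(switches: Array[Long], diffs: Array[Long], micros: Long): Long = {
    require(switches.nonEmpty && switches.length == diffs.length)
    // Step 1: linear search, from the most recent switch point backwards,
    // for the interval [switches(i), switches(i + 1)) containing `micros`.
    var i = switches.length - 1
    while (i > 0 && micros < switches(i)) i -= 1
    // Steps 2 and 3: take the diff of that interval and add it to the input.
    micros + diffs(i)
  }
}
```

For example, `RebaseStepsSketch.rebaseMicros(switches, diffs, -62135740800L * MicrosPerSecond)` yields `-62135568422L * MicrosPerSecond`, which matches row 0 of the table.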
   
   Here are details of the implementation:
   - Pre-calculated tables are stored in the JSON files 
`gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` 
in the resource folder of `sql/catalyst`. The diffs and switch time points are 
stored as seconds, for example:
   ```json
   [
     {
       "tz" : "America/Los_Angeles",
       "switches" : [ -62135740800, -59006419200, ... , -2717640000 ],
       "diffs" : [ 172378, 85978, ..., 0 ]
     }
   ]
   ```
     The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - 
`generate 'gregorian-julian-rebase-micros.json'` and `generate 
'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default. 
     The `switches` time points are ordered from the oldest to the most recent 
timestamp. This condition is checked by the test 
`validate rebase records in JSON files` in `RebaseDateTimeSuite`. The sizes of 
the `switches` and `diffs` arrays are also the same (this is checked by the 
same test).
   
   - The **_Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_** time 
zones weren't added to the JSON files, see 
[SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385)
   - The rebase info from the JSON files is placed into hash tables: 
`gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it 
is almost 2 times faster than Scala's immutable Map. I also tried 
`java.util.HashMap`, but it has a worse lookup time than `AnyRefMap` in our case.
   The hash maps store the switch time points and diffs with microsecond 
precision to avoid conversions from microseconds to seconds at runtime.
   
   - I moved the code related to days and microseconds rebasing to the separate 
object `RebaseDateTime` so as not to pollute `DateTimeUtils`. Tests related to 
date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason.
   
   - I placed rebasing via local timestamps into separate methods that take the 
zone id as the first parameter, assuming that the caller already has the zone 
id. This avoids unnecessarily retrieving the default time zone. The methods 
are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as 
the reference implementation.
   
   - Modified the `rebaseGregorianToJulianMicros()` and 
`rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the 
rebase tables first. If the hash maps don't contain rebasing info for the 
given time zone id, the methods fall back to the implementation via local 
timestamps. This makes it possible to support time zones specified as zone 
offsets like '-08:00'.
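
The lookup-then-fallback behaviour described in the last bullets can be sketched as below. This is only an illustration under stated assumptions, not the code from the PR: `RebaseLookupSketch`, `RebaseInfo`, `buildMap` and `viaLocalTimestamp` are hypothetical names, the records are passed in as plain tuples instead of being parsed from JSON, and the fallback uses `java.util.Calendar` with the `gregory` calendar type as a stand-in for the hybrid calendar. Only the Gregorian-to-Julian direction is shown, and years before 0001 are not handled.

```scala
import java.time.{Instant, ZoneId}
import java.time.temporal.ChronoField
import java.util.{Calendar, TimeZone}
import scala.collection.mutable.AnyRefMap

object RebaseLookupSketch {
  private val MicrosPerSecond = 1000000L
  private val MicrosPerMilli = 1000L

  // Hypothetical holder for one time zone's table, already in microseconds.
  final case class RebaseInfo(switches: Array[Long], diffs: Array[Long])

  // Converts (tz, switches in seconds, diffs in seconds) records to
  // microseconds once, at load time, so no unit conversion happens per lookup.
  def buildMap(
      records: Seq[(String, Array[Long], Array[Long])]): AnyRefMap[String, RebaseInfo] = {
    val map = AnyRefMap.empty[String, RebaseInfo]
    records.foreach { case (tz, switchesSec, diffsSec) =>
      map.update(tz, RebaseInfo(
        switchesSec.map(_ * MicrosPerSecond), diffsSec.map(_ * MicrosPerSecond)))
    }
    map
  }

  // Fallback path: Proleptic Gregorian micros -> hybrid calendar micros by
  // re-interpreting the same local timestamp in the hybrid calendar.
  // Assumes years >= 1 (the 0001-01-01 .. 2100-01-01 range discussed above).
  def viaLocalTimestamp(zoneId: ZoneId, micros: Long): Long = {
    val instant = Instant.ofEpochSecond(
      Math.floorDiv(micros, MicrosPerSecond),
      Math.floorMod(micros, MicrosPerSecond) * 1000L)
    val ldt = instant.atZone(zoneId).toLocalDateTime
    val cal = new Calendar.Builder()
      .setCalendarType("gregory") // java.util's Julian + Gregorian calendar
      .setDate(ldt.getYear, ldt.getMonthValue - 1, ldt.getDayOfMonth)
      .setTimeOfDay(ldt.getHour, ldt.getMinute, ldt.getSecond)
      .setTimeZone(TimeZone.getTimeZone(zoneId))
      .build()
    cal.getTimeInMillis * MicrosPerMilli + ldt.get(ChronoField.MICRO_OF_SECOND)
  }

  def rebaseGregorianToJulianMicros(
      map: AnyRefMap[String, RebaseInfo], zoneId: ZoneId, micros: Long): Long = {
    val info = map.getOrNull(zoneId.getId)
    if (info == null) {
      // No pre-calculated table (e.g. a zone offset id): use the slow path.
      viaLocalTimestamp(zoneId, micros)
    } else {
      // Table hit: linear search for the interval, then add its diff.
      var i = info.switches.length - 1
      while (i > 0 && micros < info.switches(i)) i -= 1
      micros + info.diffs(i)
    }
  }
}
```

With a map that has no entry for `-08:00`, a call such as `rebaseGregorianToJulianMicros(map, ZoneId.of("-08:00"), micros)` takes the fallback path, while a zone id present in the map is served from the table.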
   
   ### Why are the changes needed?
   To make timestamps rebasing faster:
   - Saving timestamps to parquet files is ~**x3.8 faster**.
   - Loading timestamps from parquet files is ~**x2.8 faster**.
   - Loading timestamps by the Vectorized reader is ~**x4.6 faster**.
   
   ### Does this PR introduce any user-facing change?
   No
   
   ### How was this patch tested?
   - Added the test `validate rebase records in JSON files` to 
`RebaseDateTimeSuite`. The test validates the 2 JSON files from the resource 
folder, `gregorian-julian-rebase-micros.json` and 
`julian-gregorian-rebase-micros.json`, and checks for each time zone record 
that
     - the number of switch points is equal to the number of diffs between 
calendars. If the numbers differ, this violates the assumption made 
in `RebaseDateTime.rebaseMicros`.
     - switch points are ordered from old to recent timestamps. This 
pre-condition is required for linear search in the `rebaseMicros` function.
   - Added the test `optimization of micros rebasing - Gregorian to Julian` to 
`RebaseDateTimeSuite`, which iterates over timestamps from 0001-01-01 to 
2100-01-01 with steps of 1 ± 0.5 months, and checks that the optimised function 
`RebaseDateTime.rebaseGregorianToJulianMicros()` returns the same result as the 
non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, 
America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong and Europe/Amsterdam 
time zones.
   - Added the test `optimization of micros rebasing - Julian to Gregorian` to 
`RebaseDateTimeSuite`, which performs checks similar to the test above but for 
rebasing from the hybrid calendar (Julian + Gregorian) to the Proleptic 
Gregorian calendar.
   - The tests for days rebasing are moved from `DateTimeUtilsSuite` to 
`RebaseDateTimeSuite` because the rebasing related code is moved from 
`DateTimeUtils` to the separate object `RebaseDateTime`.
   - Re-ran `DateTimeRebaseBenchmark` in the America/Los_Angeles time zone (it 
is set explicitly in PR #28127):
   
   | Item | Description |
   | ---- | ----|
   | Region | us-west-2 (Oregon) |
   | Instance | r3.xlarge |
   | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
   | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |
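
As an illustration of the two checks performed by the `validate rebase records in JSON files` test described in the first bullet of this section, here is a minimal sketch. The names `RebaseRecordCheckSketch`, `RebaseRecord` and `violations` are hypothetical, the record is assumed to be already parsed from JSON, and treating equal neighbouring switch points as a violation is an assumption rather than something stated in the PR.

```scala
object RebaseRecordCheckSketch {
  // Hypothetical in-memory form of one JSON record.
  final case class RebaseRecord(tz: String, switches: Array[Long], diffs: Array[Long])

  // Returns human-readable violations; an empty result means the record
  // satisfies both pre-conditions of the table lookup.
  def violations(r: RebaseRecord): Seq[String] = {
    // Check 1: one diff per switch point.
    val sizeErrors =
      if (r.switches.length == r.diffs.length) Seq.empty[String]
      else Seq(s"${r.tz}: ${r.switches.length} switches but ${r.diffs.length} diffs")
    // Check 2: switch points ordered from old to recent timestamps.
    val orderErrors = r.switches.indices.drop(1).collect {
      case i if r.switches(i - 1) >= r.switches(i) =>
        s"${r.tz}: switch point $i is not later than switch point ${i - 1}"
    }
    sizeErrors ++ orderErrors
  }
}
```

A record built from the first two entries of the JSON example, `RebaseRecord("America/Los_Angeles", Array(-62135740800L, -59006419200L), Array(172378L, 85978L))`, passes both checks.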
