cloud-fan opened a new pull request #28406:
URL: https://github.com/apache/spark/pull/28406
### What changes were proposed in this pull request?
Push the datetime rebase logic down to the lower level of the Parquet vectorized
reader, so that the resulting code is more vectorization-friendly.
### Why are the changes needed?
The Parquet vectorized reader is carefully implemented so that the JVM is more
likely to vectorize it. However, the newly added datetime rebase degrades
performance significantly because it breaks vectorization, even when the datetime
values don't need rebasing (which is very likely, as dates before 1582 are rare).
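
To illustrate the idea, here is a minimal sketch, not the code in this PR: one way to keep the common path vectorization-friendly is to scan a batch once for values that fall before the calendar switch, and take the per-value rebase path only when such a value exists. The class `RebaseBatchSketch` and its methods are hypothetical, and `placeholderRebase` is an arbitrary shift standing in for the real calendar conversion. `SWITCH_DAY` is 1582-10-15 expressed as days since 1970-01-01.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class RebaseBatchSketch {
    // 1582-10-15 (the Gregorian cutover) as days since 1970-01-01.
    static final int SWITCH_DAY = -141427;

    // Placeholder only (assumption): the real conversion is a calendar
    // computation, not a fixed 10-day shift.
    static int placeholderRebase(int day) {
        return day < SWITCH_DAY ? day + 10 : day;
    }

    // Per-value variant: every element goes through the rebase call,
    // even when no value in the batch needs it.
    static void readPerValue(ByteBuffer in, int[] out, int total) {
        for (int i = 0; i < total; i++) {
            out[i] = placeholderRebase(in.getInt(i * 4));
        }
    }

    // Batch variant: one cheap scan decides per batch, so the common
    // case (no old dates) stays a plain copy loop.
    static void readBatch(ByteBuffer in, int[] out, int total) {
        boolean needsRebase = false;
        for (int i = 0; i < total; i++) {
            needsRebase |= in.getInt(i * 4) < SWITCH_DAY;
        }
        if (needsRebase) {
            for (int i = 0; i < total; i++) {
                out[i] = placeholderRebase(in.getInt(i * 4));
            }
        } else {
            for (int i = 0; i < total; i++) {
                out[i] = in.getInt(i * 4);
            }
        }
    }

    // Helper: pack day values as little-endian 4-byte integers.
    static ByteBuffer encode(int... days) {
        ByteBuffer buf = ByteBuffer.allocate(days.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (int d : days) {
            buf.putInt(d);
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        int[] out = new int[3];
        readBatch(encode(0, 18000, -100), out, 3);    // no value before the cutover
        System.out.println(Arrays.toString(out));
        readBatch(encode(0, -200000, 18000), out, 3); // one value before the cutover
        System.out.println(Arrays.toString(out));
    }
}
```

Both variants produce the same output; the batch variant only changes which loop runs. Whether the JIT actually vectorizes either loop depends on the JVM and should be confirmed with a benchmark such as the one reported in this PR.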
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran part of the `DateTimeRebaseBenchmark` locally. The results:
Before this patch:
```
[info] Load dates from parquet:            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off               2677           2838         142        37.4          26.8       1.0X
[info] after 1582, vec on, rebase on                3828           4331         805        26.1          38.3       0.7X
[info] before 1582, vec on, rebase off              2903           2926          34        34.4          29.0       0.9X
[info] before 1582, vec on, rebase on               4163           4197          38        24.0          41.6       0.6X
[info] Load timestamps from parquet:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off               3537           3627         104        28.3          35.4       1.0X
[info] after 1900, vec on, rebase on                6891           7010         105        14.5          68.9       0.5X
[info] before 1900, vec on, rebase off              3692           3770          72        27.1          36.9       1.0X
[info] before 1900, vec on, rebase on               7588           7610          30        13.2          75.9       0.5X
```
After this patch:
```
[info] Load dates from parquet:            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off               2758           2944         197        36.3          27.6       1.0X
[info] after 1582, vec on, rebase on                2908           2966          51        34.4          29.1       0.9X
[info] before 1582, vec on, rebase off              2840           2878          37        35.2          28.4       1.0X
[info] before 1582, vec on, rebase on               3407           3433          24        29.4          34.1       0.8X
[info] Load timestamps from parquet:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off               3861           4003         139        25.9          38.6       1.0X
[info] after 1900, vec on, rebase on                4194           4283          77        23.8          41.9       0.9X
[info] before 1900, vec on, rebase off              3849           3937          79        26.0          38.5       1.0X
[info] before 1900, vec on, rebase on               7512           7546          55        13.3          75.1       0.5X
```
With rebase enabled, date reads are about 30% faster when the values don't need
rebasing and about 20% faster when they do.
With rebase enabled, timestamp reads are about 60% faster when the values don't
need rebasing, with no difference when they do.
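
The percentages above can be checked against the benchmark tables. The following is a small sketch that assumes the speedup is taken as the ratio of the Best Time(ms) values for the "vec on, rebase on" rows before and after the patch; the class and method names are hypothetical.

```java
class SpeedupSketch {
    // Ratio of best times: values above 1.0 mean the patched reader is faster.
    static double speedup(double beforeMs, double afterMs) {
        return beforeMs / afterMs;
    }

    public static void main(String[] args) {
        // Best Time(ms) from the "vec on, rebase on" rows, before vs. after the patch.
        System.out.printf("dates, after 1582:       %.2fx%n", speedup(3828, 2908));
        System.out.printf("dates, before 1582:      %.2fx%n", speedup(4163, 3407));
        System.out.printf("timestamps, after 1900:  %.2fx%n", speedup(6891, 4194));
        System.out.printf("timestamps, before 1900: %.2fx%n", speedup(7588, 7512));
    }
}
```

These ratios come out to roughly 1.32, 1.22, 1.64, and 1.01, which matches the rounded figures of 30%, 20%, 60%, and no difference.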