cloud-fan opened a new pull request #28406:
URL: https://github.com/apache/spark/pull/28406


   
   ### What changes were proposed in this pull request?
   Push the rebase logic down to the lower level of the Parquet vectorized reader, so that the resulting code is more vectorization-friendly.
   
   ### Why are the changes needed?
   The Parquet vectorized reader is carefully implemented to make it more likely that the JVM vectorizes it. However, the newly added datetime rebase degrades performance considerably because it breaks vectorization, even when the datetime values don't need rebasing (which is the common case, since dates before 1582 are rare).
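
To illustrate the idea, here is a minimal sketch and not the code in this PR. It assumes the reader can scan a decoded batch once to decide whether anything needs rebasing, then take a plain bulk-copy path when nothing does. The names `RebaseSketch`, `readDays`, and the `rebase` callback are hypothetical; the real Julian-to-Gregorian conversion is deliberately left to the caller.

```java
import java.time.LocalDate;
import java.util.function.IntUnaryOperator;

// Hypothetical illustration only; none of these names come from Spark.
final class RebaseSketch {
    // Epoch day of 1582-10-15, the first day of the Gregorian calendar.
    // Days on or after it are assumed to need no rebasing.
    static final int SWITCH_DAY = (int) LocalDate.of(1582, 10, 15).toEpochDay();

    // Copies a batch of day values from src to dst, rebasing only if needed.
    static void readDays(int[] src, int[] dst, int switchDay, IntUnaryOperator rebase) {
        // Pass 1: a simple scan that only answers "does any value need rebasing?".
        // The loop body has no calls and no early exit, which keeps it simple
        // enough that a JIT compiler may be able to vectorize it.
        boolean needsRebase = false;
        for (int i = 0; i < src.length; i++) {
            needsRebase |= src[i] < switchDay;
        }

        if (!needsRebase) {
            // Common case: nothing is old enough, so do a plain bulk copy.
            System.arraycopy(src, 0, dst, 0, src.length);
        } else {
            // Rare case: fall back to a per-value loop with the rebase call.
            for (int i = 0; i < src.length; i++) {
                int v = src[i];
                dst[i] = v < switchDay ? rebase.applyAsInt(v) : v;
            }
        }
    }
}
```

The design choice in this sketch is to pay for one extra cheap scan per batch so that the common case never touches the per-value rebase branch.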
   
   ### Does this PR introduce any user-facing change?
   No.
   
   ### How was this patch tested?
   Ran part of the `DateTimeRebaseBenchmark` locally. The results:
   Before this patch
   ```
   [info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] after 1582, vec on, rebase off                     2677           2838         142         37.4          26.8       1.0X
   [info] after 1582, vec on, rebase on                      3828           4331         805         26.1          38.3       0.7X
   [info] before 1582, vec on, rebase off                    2903           2926          34         34.4          29.0       0.9X
   [info] before 1582, vec on, rebase on                     4163           4197          38         24.0          41.6       0.6X
   
   [info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] after 1900, vec on, rebase off                     3537           3627         104         28.3          35.4       1.0X
   [info] after 1900, vec on, rebase on                      6891           7010         105         14.5          68.9       0.5X
   [info] before 1900, vec on, rebase off                    3692           3770          72         27.1          36.9       1.0X
   [info] before 1900, vec on, rebase on                     7588           7610          30         13.2          75.9       0.5X
   ```
   
   After this patch
   ```
   [info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] after 1582, vec on, rebase off                     2758           2944         197         36.3          27.6       1.0X
   [info] after 1582, vec on, rebase on                      2908           2966          51         34.4          29.1       0.9X
   [info] before 1582, vec on, rebase off                    2840           2878          37         35.2          28.4       1.0X
   [info] before 1582, vec on, rebase on                     3407           3433          24         29.4          34.1       0.8X
   
   [info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] after 1900, vec on, rebase off                     3861           4003         139         25.9          38.6       1.0X
   [info] after 1900, vec on, rebase on                      4194           4283          77         23.8          41.9       0.9X
   [info] before 1900, vec on, rebase off                    3849           3937          79         26.0          38.5       1.0X
   [info] before 1900, vec on, rebase on                     7512           7546          55         13.3          75.1       0.5X
   ```
   
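   With rebase on, the date type is about 30% faster when the values don't need rebasing (best time 3828 ms → 2908 ms) and about 20% faster when they do (4163 ms → 3407 ms).
   With rebase on, the timestamp type is about 60% faster when the values don't need rebasing (best time 6891 ms → 4194 ms), with no meaningful difference when they do (7588 ms → 7512 ms).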

