Github user davies commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190845510
@nongli There is no visible difference on the existing benchmarks
(ColumnarBatch and ParquetRead), since they don't use dictionary encoding.
After changing intStringScan to use dictionary encoding (a small number of
unique values), here are the results:
Before this patch:
```
Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Int and String Scan:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Reader                1248 / 1281          8.4         119.0       1.0X
SQL Parquet MR                    1962 / 2093          5.3         187.1       0.6X
SQL Parquet Vectorized             876 / 1018         12.0          83.5       1.4X
ParquetReader                       741 / 755         14.1          70.7       1.7X
```
After this patch:
```
Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Int and String Scan:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Reader                1247 / 1279          8.4         118.9       1.0X
SQL Parquet MR                    1809 / 1851          5.8         172.5       0.7X
SQL Parquet Vectorized             805 / 909          13.0          76.8       1.5X
ParquetReader                       742 / 756         14.1          70.7       1.7X
```
We can see a ~10% improvement for SQL Parquet Vectorized, but no difference
for ParquetReader; I don't know why. (I didn't include #11274.)
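For context, here is a minimal sketch (plain Python, not the actual vectorized reader code) of what dictionary encoding does and why it helps when a column has few unique values: each value is stored once in a dictionary, and the column itself becomes small integer ids that decode via cheap array lookups.

```python
# Illustration only: dictionary encoding for a low-cardinality column.
# Real Parquet readers work on encoded pages, but the principle is the same.

def dictionary_encode(values):
    """Return (dictionary, ids): each value stored once, column as int ids."""
    dictionary = []
    index = {}
    ids = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

def dictionary_decode(dictionary, ids):
    """Decoding is a per-row array lookup, cheap compared to re-parsing strings."""
    return [dictionary[i] for i in ids]

column = ["a", "b", "a", "a", "c", "b"]
dictionary, ids = dictionary_encode(column)
assert dictionary == ["a", "b", "c"]
assert ids == [0, 1, 0, 0, 2, 1]
assert dictionary_decode(dictionary, ids) == column
```

With only a handful of unique strings, the reader touches each distinct string once and then works with ids, which is what the modified intStringScan benchmark exercises.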