Github user davies commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190845510
@nongli There is no visible difference on the existing benchmarks
(ColumnarBatch and ParquetRead), since they don't use dictionary encoding.
After changing intStringScan to use dictionary encoding (a small number of
unique values), here are the results:
Before this patch:
```
Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Int and String Scan:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Reader                1248 / 1281          8.4         119.0       1.0X
SQL Parquet MR                    1962 / 2093          5.3         187.1       0.6X
SQL Parquet Vectorized             876 / 1018         12.0          83.5       1.4X
ParquetReader                       741 / 755         14.1          70.7       1.7X
```
After this patch:
```
Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Int and String Scan:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Reader                1247 / 1279          8.4         118.9       1.0X
SQL Parquet MR                    1809 / 1851          5.8         172.5       0.7X
SQL Parquet Vectorized             805 / 909          13.0          76.8       1.5X
ParquetReader                       742 / 756         14.1          70.7       1.7X
```
We can see a ~10% improvement for SQL Parquet Vectorized, but no difference
for ParquetReader; I don't know why. (I didn't include #11274.)
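For context, here is a minimal sketch (plain Python, not the actual vectorized reader code) of what dictionary encoding does and why it helps when a column has few unique values: each value is stored once in a dictionary, and the column itself becomes small integer ids that decode via cheap array lookups.

```python
# Illustration only: dictionary encoding for a low-cardinality column.
# Real Parquet readers work on encoded pages, but the principle is the same.

def dictionary_encode(values):
    """Return (dictionary, ids): each value stored once, column as int ids."""
    dictionary = []
    index = {}
    ids = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

def dictionary_decode(dictionary, ids):
    """Decoding is a per-row array lookup, cheap compared to re-parsing strings."""
    return [dictionary[i] for i in ids]

column = ["a", "b", "a", "a", "c", "b"]
dictionary, ids = dictionary_encode(column)
assert dictionary == ["a", "b", "c"]
assert ids == [0, 1, 0, 0, 2, 1]
assert dictionary_decode(dictionary, ids) == column
```

With only a handful of unique strings, the reader touches each distinct string once and then works with ids, which is what the modified intStringScan benchmark exercises.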