GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/21892
[SPARK-24945][SQL] Switching to uniVocity 2.7.2
## What changes were proposed in this pull request?
In this PR, I propose upgrading the uniVocity parser from **2.6.3** to
**2.7.2**. The new version includes a fix for the SPARK-24645 issue. Here is
the corresponding uniVocity bug report:
https://github.com/uniVocity/univocity-parsers/issues/250.
I removed the changes in `UnivocityParser` introduced by the commit:
https://github.com/apache/spark/commit/bd32b509a1728366494cba13f8f6612b7bd46ec0
but left the test from that commit in place.
## How was this patch tested?
I tested with `CSVSuite` and by running `CSVBenchmarks`. The difference
between 2.6.3 and 2.7.2 is 0.2% - 8%, except for the `count()` benchmark,
where performance degrades by a factor of **x3.8**.
Before changes:
```
Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                            33336 / 34122          0.0      666727.0       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                          90287 / 91713          0.0       90286.9       1.0X
Select 100 columns                           31826 / 36589          0.0       31826.4       2.8X
Select one column                            25738 / 25872          0.0       25737.9       3.5X
count()                                       6931 /  7269          0.1        6931.5      13.0X
```
After changes:
```
Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                            34191 / 34332          0.0      683826.7       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                          90446 / 91900          0.0       90446.1       1.0X
Select 100 columns                           34315 / 39895          0.0       34314.9       2.6X
Select one column                            27955 / 28125          0.0       27954.8       3.2X
count()                                      27713 / 27803          0.0       27712.8       3.3X
```
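For reviewers who want to check the slowdown figures, here is a small sketch that recomputes the per-benchmark ratios from the two tables above. The timings are copied from the benchmark output; the dictionary layout and the `slowdown` helper are illustrative names only and are not part of the PR.

```python
# Best/Avg times in ms, copied from the benchmark tables: (best, avg).
before = {  # uniVocity 2.6.3
    "One quoted string": (33336, 34122),
    "Select 1000 columns": (90287, 91713),
    "Select 100 columns": (31826, 36589),
    "Select one column": (25738, 25872),
    "count()": (6931, 7269),
}
after = {  # uniVocity 2.7.2
    "One quoted string": (34191, 34332),
    "Select 1000 columns": (90446, 91900),
    "Select 100 columns": (34315, 39895),
    "Select one column": (27955, 28125),
    "count()": (27713, 27803),
}


def slowdown(case):
    """Return (best_ratio, avg_ratio): time after the upgrade over time before."""
    (best_before, avg_before), (best_after, avg_after) = before[case], after[case]
    return best_after / best_before, avg_after / avg_before


if __name__ == "__main__":
    for case in before:
        best, avg = slowdown(case)
        print(f"{case:<20} best x{best:.3f}  avg x{avg:.3f}")
```

On the average-time column the `count()` ratio comes out at roughly 3.8, which matches the figure quoted in the description; on best time it is closer to 4.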
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 univocity-2_7_2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21892.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21892
----
commit 7b569ae1318316129d4b0d46969b02324b18b0aa
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-27T11:59:39Z
Bumping version of uniVocity parser up to 2.7.2
commit b116987d9a0adb887201177d41c1b94e6f5aeb63
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-27T13:25:11Z
Call uniVocity even the set of selected columns is empty
----