[GitHub] spark pull request #21892: [SPARK-24945][SQL] Switching to uniVocity 2.7.2

MaxGekk Fri, 27 Jul 2018 06:47:23 -0700

GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/21892


    [SPARK-24945][SQL] Switching to uniVocity 2.7.2

    ## What changes were proposed in this pull request?
    
    In this PR, I propose to upgrade the uniVocity parser from **2.6.3** to
    **2.7.2**. The newer version includes a fix for the SPARK-24645 issue. Here
    is the corresponding uniVocity bug report:
    https://github.com/uniVocity/univocity-parsers/issues/250.
    
    I removed the changes in `UnivocityParser` introduced by the commit
    https://github.com/apache/spark/commit/bd32b509a1728366494cba13f8f6612b7bd46ec0
    but kept the test from that commit.
    
    ## How was this patch tested?
    
    I tested with `CSVSuite` and by running `CSVBenchmarks`. The difference
    between 2.6.3 and 2.7.2 is 0.2% - 8%, except for the `count()` benchmark,
    where the performance degradation is **x3.8**.
    
    Before changes:
    ```
    Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    One quoted string                           33336 / 34122          0.0      666727.0       1.0X

    Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Select 1000 columns                         90287 / 91713          0.0       90286.9       1.0X
    Select 100 columns                          31826 / 36589          0.0       31826.4       2.8X
    Select one column                           25738 / 25872          0.0       25737.9       3.5X
    count()                                       6931 / 7269          0.1        6931.5      13.0X
    ```
    After changes:
    ```
    Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    One quoted string                           34191 / 34332          0.0      683826.7       1.0X

    Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Select 1000 columns                         90446 / 91900          0.0       90446.1       1.0X
    Select 100 columns                          34315 / 39895          0.0       34314.9       2.6X
    Select one column                           27955 / 28125          0.0       27954.8       3.2X
    count()                                     27713 / 27803          0.0       27712.8       3.3X
    ```
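
For readers who want to check the quoted percentages, here is a minimal sketch that recomputes the slowdown from the numbers in the two tables. It is not part of the patch; the helper names `slowdown_percent` and `ratio` are made up for illustration, and using the "Best" column for the percentages and the "Avg" column for the `count()` ratio is an assumption about how the figures in the description were derived.

```python
# Best times (ms) copied from the "Before changes" and "After changes" tables.
BEST_BEFORE = {
    "One quoted string": 33336,
    "Select 1000 columns": 90287,
    "Select 100 columns": 31826,
    "Select one column": 25738,
    "count()": 6931,
}
BEST_AFTER = {
    "One quoted string": 34191,
    "Select 1000 columns": 90446,
    "Select 100 columns": 34315,
    "Select one column": 27955,
    "count()": 27713,
}
# Average times (ms) for count() only, from the same tables.
COUNT_AVG_BEFORE = 7269
COUNT_AVG_AFTER = 27803


def slowdown_percent(old_ms, new_ms):
    """Relative increase in run time, in percent (hypothetical helper)."""
    return (new_ms - old_ms) / old_ms * 100.0


def ratio(old_ms, new_ms):
    """How many times slower the new run is (hypothetical helper)."""
    return new_ms / old_ms


if __name__ == "__main__":
    for name, old_ms in BEST_BEFORE.items():
        new_ms = BEST_AFTER[name]
        print(f"{name}: {slowdown_percent(old_ms, new_ms):+.1f}% "
              f"({ratio(old_ms, new_ms):.2f}x, best time)")
    print(f"count() by average time: "
          f"{ratio(COUNT_AVG_BEFORE, COUNT_AVG_AFTER):.2f}x")
```

Under these assumptions, the `count()` ratio by average time rounds to 3.8, which matches the figure in the description. By best time it is closer to 4.0, and the "Select one column" case comes out slightly above 8%.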

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 univocity-2_7_2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21892.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21892
    
----
commit 7b569ae1318316129d4b0d46969b02324b18b0aa
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-27T11:59:39Z

    Bumping version of uniVocity parser up to 2.7.2

commit b116987d9a0adb887201177d41c1b94e6f5aeb63
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-27T13:25:11Z

    Call uniVocity even the set of selected columns is empty

----

