[GitHub] spark issue #21892: [SPARK-24945][SQL] Switching to uniVocity 2.7.2

MaxGekk Wed, 01 Aug 2018 14:03:01 -0700

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21892
  
    @jbax It became much faster:
    ```
    Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    One quoted string                           33411 / 33510          0.0      668211.4       1.0X

    Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Select 1000 columns                         88028 / 89311          0.0       88028.1       1.0X
    Select 100 columns                          29010 / 32755          0.0       29010.1       3.0X
    Select one column                           22936 / 22953          0.0       22936.5       3.8X
    count()                                     22790 / 23143          0.0       22789.6       3.9X
    ```
    The `count()` benchmark is still slower because I reverted the optimization 
for an empty schema. Previously we didn't call `uniVocity`'s `parseLine` if the set 
of selected indexes was empty. In this PR, I call `parseLine` for an empty set since 
the bug (https://github.com/uniVocity/univocity-parsers/issues/250) has been 
fixed. It seems to perform similarly to the case when only one column is 
selected. So, the overhead per line is around 15.5 microseconds on my CPU.
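
As a rough cross-check of the table, the sketch below recomputes the `Relative` column from the best times and compares `count()` with `Select one column`. It assumes that `Relative` is the baseline best time divided by each case's best time, and that `Per Row(ns)` is the best time divided by the number of rows; the helper names are made up for illustration and are not part of Spark's benchmark code.

```python
# Numbers copied from the "Wide rows with 1000 columns" table:
# case name -> (best time in ms, per-row time in ns).
wide_rows = {
    "Select 1000 columns": (88028, 88028.1),
    "Select 100 columns": (29010, 29010.1),
    "Select one column": (22936, 22936.5),
    "count()": (22790, 22789.6),
}

# Assumption: the first case is the baseline for the Relative column.
baseline_ms = wide_rows["Select 1000 columns"][0]


def relative(best_ms):
    """Speedup over the baseline, rounded as in the table (hypothetical helper)."""
    return round(baseline_ms / best_ms, 1)


def implied_rows_millions(best_ms, per_row_ns):
    """Row count in millions implied by best time / per-row time (ms / ns = 1e6)."""
    return round(best_ms / per_row_ns, 2)


for name, (best_ms, per_row_ns) in wide_rows.items():
    print(name, relative(best_ms), implied_rows_millions(best_ms, per_row_ns))

# Per-row gap between selecting one column and count(), in nanoseconds.
gap_ns = wide_rows["Select one column"][1] - wide_rows["count()"][1]
print("gap per row (ns):", round(gap_ns, 1))
```

Under these assumptions, every case implies roughly one million rows, and the gap between `count()` and `Select one column` is small compared with the per-row time of either case, which matches the remark that the two perform similarly.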


