[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...

dongjoon-hyun Thu, 08 Nov 2018 10:00:24 -0800

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22965#discussion_r232001105
  
    --- Diff: sql/core/benchmarks/DataSourceReadBenchmark-results.txt ---
    @@ -2,268 +2,268 @@
     SQL Single Numeric Column Scan
     ================================================================================================
     
    -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
    +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
     Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
     SQL Single TINYINT Column Scan:          Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)   Relative
     ------------------------------------------------------------------------------------------------
    -SQL CSV                                     21508 / 22112          0.7         1367.5       1.0X
    -SQL Json                                      8705 / 8825          1.8          553.4       2.5X
    -SQL Parquet Vectorized                         157 /  186        100.0           10.0     136.7X
    -SQL Parquet MR                                1789 / 1794          8.8          113.8      12.0X
    -SQL ORC Vectorized                             156 /  166        100.9            9.9     138.0X
    -SQL ORC Vectorized with copy                   218 /  225         72.1           13.9      98.6X
    -SQL ORC MR                                    1448 / 1492         10.9           92.0      14.9X
    -
    -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
    +SQL CSV                                     26366 / 26562          0.6         1676.3       1.0X
    --- End diff --
    
    Hi, @HyukjinKwon, @MaxGekk, @cloud-fan, @peter-toth.
    
    This is not related to this PR. CSV shows a consistent performance regression (about 10%) throughout all benchmark cases. The other data sources show reasonable numbers for all types.
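
For reference, the slowdown for the single `SQL CSV` row visible in the diff above can be computed directly from the two best times. This is only a minimal sketch: the helper name `slowdown_percent` is made up for illustration, and the calculation covers just this one TINYINT row, so it says nothing about the other benchmark cases.

```python
# Best times (ms) for the "SQL CSV" row of the TINYINT scan, copied from the diff.
baseline_best_ms = 21508  # removed ("-") line, i.e. the old baseline
current_best_ms = 26366   # added ("+") line, i.e. the new run


def slowdown_percent(baseline: float, current: float) -> float:
    """Return how much slower `current` is than `baseline`, in percent."""
    return (current - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical helper, not part of Spark's benchmark tooling.
    print(f"{slowdown_percent(baseline_best_ms, current_best_ms):.1f}%")
```

The same helper can be applied to the average times, or to the CSV rows of the other cases in the results file.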
    
    The baseline was generated on Oct 11th. The following commits are the suspects.
    
    1. ee03f760b3 [SPARK-25955][TEST] Porting JSON tests for CSV functions
    1. 94de5609be [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks to use main method
    1. 3b4556745e [SPARK-25795][R][EXAMPLE] Fix CSV SparkR SQL Example
    1. 1e6c1d8bfb [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode
    1. c7eadb5e66 [SPARK-25660][SQL] Fix for the backward slash as CSV fields delimiter
    1. 39872af882 [SPARK-25684][SQL] Organize header related codes in CSV datasource
    1. 46fe40838a [SPARK-25669][SQL] Check CSV header only when it exists


