This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch branch-1.7
in repository https://gitbox.apache.org/repos/asf/orc.git
The following commit(s) were added to refs/heads/branch-1.7 by this push:
new 24a0ca340 ORC-1578: Fix `SparkBenchmark` on `sales` data according to SPARK-40918
24a0ca340 is described below
commit 24a0ca340cc22127b7fb0b2b1c14e4f32ae2b13e
Author: Dongjoon Hyun <[email protected]>
AuthorDate: Mon Jan 8 19:20:28 2024 -0800
ORC-1578: Fix `SparkBenchmark` on `sales` data according to SPARK-40918
### What changes were proposed in this pull request?
This PR aims to fix `SparkBenchmark` according to the requirement of SPARK-40918.
Note that this fixes the synthetic benchmark on `sales` data. The other real-life datasets (`github` and `taxi`) will be revisited later.
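The gist of the patch is a small addition to the benchmark's reader-option wiring. As a minimal, Spark-free sketch (using `java.util.Map.Entry` in place of the benchmark's `scala.Tuple2`, with a hypothetical `readerOptions` helper that is not part of the actual code), the change amounts to:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-alone sketch of the option wiring this patch adds; the real
// benchmark builds scala.Tuple2 options for Spark's DataFrameReader.
public class OrcOptionSketch {
  static List<Map.Entry<String, String>> readerOptions(String format) {
    List<Map.Entry<String, String>> options = new ArrayList<>();
    switch (format) {
      case "json":
        options.add(new SimpleEntry<>("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS"));
        break;
      case "orc":
        // SPARK-40918: Spark's native ORC reader now expects callers to
        // state explicitly whether columnar batches should be returned.
        options.add(new SimpleEntry<>("returning_batch", "true"));
        break;
      default:
        break;
    }
    return options;
  }

  public static void main(String[] args) {
    System.out.println(readerOptions("orc")); // prints [returning_batch=true]
  }
}
```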
### Why are the changes needed?
1. Generate `sales` data
```
$ java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -f orc -d sales -s 1000000
```
2. Run Spark Benchmark
```
$ java -jar spark/target/orc-benchmarks-spark-2.1.0-SNAPSHOT.jar spark data -d sales -f orc
# Run complete. Total time: 00:10:45
Benchmark                                  (compression)  (dataset)  (format)  Mode  Cnt        Score       Error  Units
SparkBenchmark.fullRead                               gz      sales       orc  avgt    5   686792.235 ±  4398.971  us/op
SparkBenchmark.fullRead:bytesPerRecord                gz      sales       orc  avgt    5        0.192                  #
SparkBenchmark.fullRead:ops                           gz      sales       orc  avgt    5       40.000                  #
SparkBenchmark.fullRead:perRecord                     gz      sales       orc  avgt    5        0.687 ±     0.004  us/op
SparkBenchmark.fullRead:records                       gz      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.fullRead                           snappy      sales       orc  avgt    5   286166.380 ± 19864.429  us/op
SparkBenchmark.fullRead:bytesPerRecord            snappy      sales       orc  avgt    5        0.201                  #
SparkBenchmark.fullRead:ops                       snappy      sales       orc  avgt    5       40.000                  #
SparkBenchmark.fullRead:perRecord                 snappy      sales       orc  avgt    5        0.286 ±     0.020  us/op
SparkBenchmark.fullRead:records                   snappy      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.fullRead                             zstd      sales       orc  avgt    5   384394.233 ± 10057.315  us/op
SparkBenchmark.fullRead:bytesPerRecord              zstd      sales       orc  avgt    5        0.192                  #
SparkBenchmark.fullRead:ops                         zstd      sales       orc  avgt    5       40.000                  #
SparkBenchmark.fullRead:perRecord                   zstd      sales       orc  avgt    5        0.384 ±     0.010  us/op
SparkBenchmark.fullRead:records                     zstd      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.partialRead                            gz      sales       orc  avgt    5    41683.914 ±  4046.077  us/op
SparkBenchmark.partialRead:bytesPerRecord             gz      sales       orc  avgt    5        0.192                  #
SparkBenchmark.partialRead:ops                        gz      sales       orc  avgt    5       40.000                  #
SparkBenchmark.partialRead:perRecord                  gz      sales       orc  avgt    5        0.042 ±     0.004  us/op
SparkBenchmark.partialRead:records                    gz      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.partialRead                        snappy      sales       orc  avgt    5    23981.054 ± 17874.229  us/op
SparkBenchmark.partialRead:bytesPerRecord         snappy      sales       orc  avgt    5        0.201                  #
SparkBenchmark.partialRead:ops                    snappy      sales       orc  avgt    5       40.000                  #
SparkBenchmark.partialRead:perRecord              snappy      sales       orc  avgt    5        0.024 ±     0.018  us/op
SparkBenchmark.partialRead:records                snappy      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.partialRead                          zstd      sales       orc  avgt    5    41433.277 ± 25110.021  us/op
SparkBenchmark.partialRead:bytesPerRecord           zstd      sales       orc  avgt    5        0.192                  #
SparkBenchmark.partialRead:ops                      zstd      sales       orc  avgt    5       40.000                  #
SparkBenchmark.partialRead:perRecord                zstd      sales       orc  avgt    5        0.041 ±     0.025  us/op
SparkBenchmark.partialRead:records                  zstd      sales       orc  avgt    5  5000000.000                  #
SparkBenchmark.pushDown                               gz      sales       orc  avgt    5    23760.997 ±   833.034  us/op
SparkBenchmark.pushDown:bytesPerRecord                gz      sales       orc  avgt    5       19.153                  #
SparkBenchmark.pushDown:ops                           gz      sales       orc  avgt    5       40.000                  #
SparkBenchmark.pushDown:perRecord                     gz      sales       orc  avgt    5        2.376 ±     0.083  us/op
SparkBenchmark.pushDown:records                       gz      sales       orc  avgt    5    50000.000                  #
SparkBenchmark.pushDown                           snappy      sales       orc  avgt    5    14062.508 ±  1793.691  us/op
SparkBenchmark.pushDown:bytesPerRecord            snappy      sales       orc  avgt    5       20.105                  #
SparkBenchmark.pushDown:ops                       snappy      sales       orc  avgt    5       40.000                  #
SparkBenchmark.pushDown:perRecord                 snappy      sales       orc  avgt    5        1.406 ±     0.179  us/op
SparkBenchmark.pushDown:records                   snappy      sales       orc  avgt    5    50000.000                  #
SparkBenchmark.pushDown                             zstd      sales       orc  avgt    5    15597.651 ±  1307.246  us/op
SparkBenchmark.pushDown:bytesPerRecord              zstd      sales       orc  avgt    5       19.213                  #
SparkBenchmark.pushDown:ops                         zstd      sales       orc  avgt    5       40.000                  #
SparkBenchmark.pushDown:perRecord                   zstd      sales       orc  avgt    5        1.560 ±     0.131  us/op
SparkBenchmark.pushDown:records                     zstd      sales       orc  avgt    5    50000.000                  #
```
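As a quick sanity check on the numbers above, the `perRecord` counter appears to be the per-invocation score divided by the records processed per measurement iteration (`records / Cnt`). A small stand-alone sketch (hypothetical `perRecord` helper, not part of the benchmark code):

```java
// Reproduces the perRecord column from the Score, records, and Cnt columns
// of the JMH output above, assuming perRecord = score / (records / cnt).
public class PerRecordCheck {
  static double perRecord(double scoreUsPerOp, long records, int cnt) {
    return scoreUsPerOp / ((double) records / cnt);
  }

  public static void main(String[] args) {
    // fullRead/gz: 686792.235 us/op, 5,000,000 records over 5 iterations
    System.out.printf("%.3f%n", perRecord(686792.235, 5_000_000L, 5)); // 0.687
    // pushDown/gz: 23760.997 us/op, 50,000 records over 5 iterations
    System.out.printf("%.3f%n", perRecord(23760.997, 50_000L, 5));     // 2.376
  }
}
```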
### How was this patch tested?
Pass the CIs.
Closes #1734 from dongjoon-hyun/ORC-1578.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit fbe49d71e0e66903508a52952be1cb0ca9d09ac1)
Signed-off-by: Dongjoon Hyun <[email protected]>
---
.../src/java/org/apache/orc/bench/spark/SparkBenchmark.java | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/java/bench/spark/src/java/org/apache/orc/bench/spark/SparkBenchmark.java b/java/bench/spark/src/java/org/apache/orc/bench/spark/SparkBenchmark.java
index c8883152e..febfb3660 100644
--- a/java/bench/spark/src/java/org/apache/orc/bench/spark/SparkBenchmark.java
+++ b/java/bench/spark/src/java/org/apache/orc/bench/spark/SparkBenchmark.java
@@ -121,6 +121,7 @@ public class SparkBenchmark implements OrcBenchmark {
public void setup() {
session = SparkSession.builder().appName("benchmark")
.config("spark.master", "local[4]")
+ .config("spark.log.level", "ERROR")
.config("spark.sql.orc.filterPushdown", true)
.config("spark.sql.orc.impl", "native")
.getOrCreate();
@@ -188,6 +189,9 @@ public class SparkBenchmark implements OrcBenchmark {
case "json":
options.add(new Tuple2<>("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS"));
break;
+ case "orc":
+ options.add(new Tuple2<>("returning_batch", "true")); // SPARK-40918
+ break;
default:
break;
}
@@ -217,6 +221,9 @@ public class SparkBenchmark implements OrcBenchmark {
case "json":
case "avro":
throw new IllegalArgumentException(source.format + " can't handle projection");
+ case "orc":
+ options.add(new Tuple2<>("returning_batch", "true")); // SPARK-40918
+ break;
default:
break;
}
@@ -288,6 +295,9 @@ public class SparkBenchmark implements OrcBenchmark {
case "json":
case "avro":
throw new IllegalArgumentException(source.format + " can't handle pushdown");
+ case "orc":
+ options.add(new Tuple2<>("returning_batch", "true")); // SPARK-40918
+ break;
default:
break;
}