GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/14660
[SPARK-17071][SQL] Fetch Parquet schema without another Spark job when there is only a single file to touch

## What changes were proposed in this pull request?

Spark always runs a separate job to figure out the schema ([ParquetFileFormat#L739-L778](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L739-L778)). However, this is needless overhead when only a single file has to be touched. I ran a benchmark with the code below:

```scala
test("Benchmark for Parquet schema reading") {
  withTempPath { path =>
    Seq((1, 2D, 3L, "4")).toDF("a", "b", "c", "d")
      .write.format("parquet").save(path.getAbsolutePath)

    val benchmark = new Benchmark("Parquet - read schema", 1)
    benchmark.addCase("Parquet - read schema", 10) { _ =>
      spark.read.format("parquet").load(path.getCanonicalPath).schema
    }
    benchmark.run()
  }
}
```

with the results as below:

- **Before**

  ```
  Parquet - read schema:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------
  Parquet - read schema                           47 /   49          0.0    46728419.0       1.0X
  ```

- **After**

  ```
  Parquet - read schema:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------
  Parquet - read schema                            2 /    3          0.0     1811673.0       1.0X
  ```

So this is roughly 20X faster (although schema inference is only a small fraction of total job run-time). For reference, ORC already does this on the driver side ([OrcFileOperator.scala#L74-L83](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L74-L83)).

## How was this patch tested?
Existing tests should cover this.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17071

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14660.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14660

----

commit 614abbc6b7a03ff0d3e505697c0bbfec3b330c2b
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-08-16T05:42:29Z

    Fetch Parquet schema within driver-side when there is single file to touch without another Spark job

commit e1214d50035441fb96551683cf38ae3e49f07b7d
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-08-16T05:46:12Z

    Fix modifier

----
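As a side note, the dispatch the PR describes can be sketched in a few lines: when exactly one Parquet file must be touched, read its footer directly on the driver; otherwise fall back to a distributed schema-merging job. This is a minimal illustration only, not Spark's actual code; the names `readFooterLocally` and `runSchemaJob` are hypothetical placeholders for the driver-side footer read and the Spark job respectively.

```scala
// Minimal sketch of the single-file fast path described in the PR.
// `readFooterLocally` and `runSchemaJob` are illustrative stand-ins,
// not real Spark APIs.
object SchemaFetchSketch {
  def fetchSchema(files: Seq[String],
                  readFooterLocally: String => String,
                  runSchemaJob: Seq[String] => String): String =
    files match {
      case Seq(single) => readFooterLocally(single) // driver-side, no Spark job
      case many        => runSchemaJob(many)        // distributed schema merge
    }
}
```

The benefit comes entirely from the single-file branch: it avoids scheduling a job (and its task-launch latency) just to read one footer, which is what the benchmark above measures.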