GitHub user clockfly opened a pull request: https://github.com/apache/spark/pull/14446
[SPARK-16841][SQL] Improve row-level metrics performance when reading a Parquet table

## What changes were proposed in this pull request?

When reading from a Parquet table, Spark updates row-level metrics such as recordsRead and bytesRead. The current implementation is not very efficient: updating these metrics may take about 20% of the read time. Benchmark used to measure this:

```
// Generate a Parquet table with nested columns
spark.range(100000000).select(struct($"id").as("nc")).write.parquet("/tmp/data4")

def time[R](block: => R): Long = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) / 1000000 + "ms")
  (t1 - t0) / 1000000
}

val x = ((0 until 20).toList.map(x => time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()))).sum / 20
```

## How was this patch tested?

Existing unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clockfly/spark improve_metrics_performance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14446.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14446

----

commit 1054b74f18193378942b7fde26df36e06bff765e
Author: Sean Zhong <seanzh...@databricks.com>
Date: 2016-08-01T23:35:30Z

    improve row level metrics performance
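The description above does not spell out how the per-row overhead is reduced. As a rough illustration only (not taken from the patch), one common way to cut this kind of overhead is to count records locally inside the reading iterator and flush the local count to the shared metric in batches instead of updating it on every row. The class name RowCountingIterator and the flushThreshold value below are hypothetical:

```
import java.util.concurrent.atomic.LongAdder

// Illustrative sketch: wraps a row iterator and updates the shared
// recordsRead counter once per batch rather than once per row.
class RowCountingIterator[T](underlying: Iterator[T], recordsRead: LongAdder) extends Iterator[T] {
  private val flushThreshold = 1000L   // illustrative batch size
  private var localCount = 0L

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && localCount > 0) flush()   // flush the remainder at end of input
    more
  }

  override def next(): T = {
    val row = underlying.next()
    localCount += 1
    if (localCount >= flushThreshold) flush()
    row
  }

  private def flush(): Unit = {
    recordsRead.add(localCount)   // one shared-metric update per batch
    localCount = 0
  }
}

// Usage example:
val recordsRead = new LongAdder
val rows = new RowCountingIterator(Iterator.range(0, 10000), recordsRead)
rows.foreach(_ => ())                // drain the iterator
println(recordsRead.sum())           // 10000, updated in batches of 1000
```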