[GitHub] spark pull request #22204: [SPARK-25196][SQL] Analyze column statistics in c...

maropu Thu, 23 Aug 2018 07:35:23 -0700

GitHub user maropu opened a pull request:

    https://github.com/apache/spark/pull/22204


    [SPARK-25196][SQL] Analyze column statistics in cached query

    ## What changes were proposed in this pull request?
    This pr proposed a new API to analyze column statistics in cached query. In 
common usecases, users read catalog table data, join/aggregate them, and then 
cache the result for following quries. But, the current optimization of the 
queries depends on non-existing or inaccurate column statistics of the cached 
data because we are only allowed to analyze column statistics in catalog tables 
via ANALYZE commands.
    
    To solve this issue, this pr added `analyzeColumnCacheQuery` in 
`CacheManager to analyze column statistics in already-cached query;
    ```
    scala> spark.range(1000).selectExpr("id % 33 AS c0", "rand() AS c1", "0 AS 
c2").write.saveAsTable("t")
    scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1, c2")
    scala> val cacheManager = spark.sharedState.cacheManager
    scala> def printColumnStats(data: org.apache.spark.sql.DataFrame) = {
         |   data.queryExecution.optimizedPlan.stats.attributeStats.foreach {
         |     case (k, v) => println(s"[$k]: $v")
         |   }
         | }
    scala> def df() = spark.table("t").groupBy("c0").agg(count("c1").as("v1"), 
sum("c2").as("v2"))
    
    // Prints column statistics in catalog table `t`
    scala> printColumnStats(spark.table("t"))
    [c0#7073L]: 
ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
    [c1#7074]: 
ColumnStat(Some(997),Some(5.958619423369615E-4),Some(0.9988009488973438),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@4ef69c53)))
    [c2#7075]: 
ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(4),Some(4),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@7cbaf548)))
    
    // Prints column statistics on query result `df`
    scala> printColumnStats(df())
    [c0#7073L]: 
ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
    
    // Prints column statistics on cached data of `df`
    scala> printColumnStats(df().cache)
    <No Column Statistics>
    
    // A new API described above
    scala> cacheManager.analyzeColumnCacheQuery(df(), "v1" :: "v2" :: Nil)
                                                                                
    
    // Then, prints again
    scala> printColumnStats(df())
    [v1#7101L]: 
ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
    [v2#7103L]: 
ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
    
    scala> cacheManager.analyzeColumnCacheQuery(df(), "c0" :: Nil)
    scala> printColumnStats(df())
    [v1#7101L]: 
ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
    [v2#7103L]: 
ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
    [c0#7073L]: 
ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@626bcfc8)))
    ```
    
    ## How was this patch tested?
    Added tests in `CachedTableSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-25196

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22204.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22204
    
----
commit fcc530000c71c3d623a559e499e1148efc54d0e6
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-08-22T11:59:29Z

    Fix

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #22204: [SPARK-25196][SQL] Analyze column statistics in c...

Reply via email to