GitHub user maropu opened a pull request:
https://github.com/apache/spark/pull/22204
[SPARK-25196][SQL] Analyze column statistics in cached query
## What changes were proposed in this pull request?
This PR proposes a new API to analyze column statistics in a cached query. In
common use cases, users read catalog table data, join/aggregate it, and then
cache the result for subsequent queries. However, the current optimization of
those queries relies on non-existent or inaccurate column statistics for the
cached data, because ANALYZE commands can only compute column statistics for
catalog tables.
To solve this issue, this PR adds `analyzeColumnCacheQuery` to
`CacheManager` to analyze column statistics in an already-cached query:
```
scala> spark.range(1000).selectExpr("id % 33 AS c0", "rand() AS c1", "0 AS c2").write.saveAsTable("t")
scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1, c2")
scala> val cacheManager = spark.sharedState.cacheManager
scala> def printColumnStats(data: org.apache.spark.sql.DataFrame) = {
| data.queryExecution.optimizedPlan.stats.attributeStats.foreach {
| case (k, v) => println(s"[$k]: $v")
| }
| }
scala> def df() = spark.table("t").groupBy("c0").agg(count("c1").as("v1"), sum("c2").as("v2"))
// Prints column statistics in catalog table `t`
scala> printColumnStats(spark.table("t"))
[c0#7073L]:
ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
[c1#7074]:
ColumnStat(Some(997),Some(5.958619423369615E-4),Some(0.9988009488973438),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@4ef69c53)))
[c2#7075]:
ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(4),Some(4),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@7cbaf548)))
// Prints column statistics on query result `df`
scala> printColumnStats(df())
[c0#7073L]:
ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
// Prints column statistics on cached data of `df`
scala> printColumnStats(df().cache)
<No Column Statistics>
// A new API described above
scala> cacheManager.analyzeColumnCacheQuery(df(), "v1" :: "v2" :: Nil)
// Then, print the statistics again
scala> printColumnStats(df())
[v1#7101L]:
ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
[v2#7103L]:
ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
scala> cacheManager.analyzeColumnCacheQuery(df(), "c0" :: Nil)
scala> printColumnStats(df())
[v1#7101L]:
ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
[v2#7103L]:
ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
[c0#7073L]:
ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@626bcfc8)))
```
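A hedged note on using these statistics: column statistics only influence plan selection when the cost-based optimizer is enabled, and the `Histogram` entries visible in the output above are only collected when histogram statistics are turned on. A minimal sketch of the relevant session configuration (both flags are off by default):

```scala
// Enable the cost-based optimizer so that column statistics
// (including those attached by analyzeColumnCacheQuery) drive planning.
spark.conf.set("spark.sql.cbo.enabled", "true")
// Additionally collect equi-height histograms during ANALYZE,
// as seen in the ColumnStat output above.
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")
```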
## How was this patch tested?
Added tests in `CachedTableSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/maropu/spark SPARK-25196
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22204.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22204
----
commit fcc530000c71c3d623a559e499e1148efc54d0e6
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-08-22T11:59:29Z
Fix
----