[GitHub] [spark] maropu opened a new pull request #24047: [SPARK-25196][SQL] Extends Analyze commands for cached tables

GitBox Sun, 10 Mar 2019 18:28:43 -0700

maropu opened a new pull request #24047: [SPARK-25196][SQL] Extends Analyze 
commands for cached tables 
URL: https://github.com/apache/spark/pull/24047
 
 
   ## What changes were proposed in this pull request?
   This PR adds a new API to `CacheManager` for analyzing cached data. In a
   common use case, users read catalog tables, join or aggregate them, and then
   cache the result for later reuse. Because ANALYZE commands can compute column
   statistics only for catalog tables, the optimizer currently has to rely on
   missing or inaccurate column statistics for cached data. It would therefore
   be useful to analyze cached data as follows:
   
   ```
   scala> sql("SET spark.sql.cbo.enabled=true")
   scala> sql("SET spark.sql.statistics.histogram.enabled=true")
   scala> spark.range(1000).selectExpr("id % 33 AS c0", "rand() AS c1", "0 AS c2").write.saveAsTable("t")
   scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1, c2")
   scala> val cacheManager = spark.sharedState.cacheManager
   scala> def printColumnStats(data: org.apache.spark.sql.DataFrame) = {
        |   data.queryExecution.optimizedPlan.stats.attributeStats.foreach {
        |     case (k, v) => println(s"[$k]: $v")
        |   }
        | }
   scala> def df() = spark.table("t").groupBy("c0").agg(count("c1").as("v1"), sum("c2").as("v2"))
   
   // Prints column statistics in catalog table `t`
   scala> printColumnStats(spark.table("t"))
   [c0#7073L]: ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
   [c1#7074]: ColumnStat(Some(997),Some(5.958619423369615E-4),Some(0.9988009488973438),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@4ef69c53)))
   [c2#7075]: ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(4),Some(4),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@7cbaf548)))
   
   // Prints column statistics on query result `df`
   scala> printColumnStats(df())
   [c0#7073L]: ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
   
   // Prints column statistics on cached data of `df`
   scala> printColumnStats(df().cache)
   <No Column Statistics>
   
   // Calls the new API described above
   scala> cacheManager.analyzeColumnCacheQuery(df(), "v1" :: "v2" :: Nil)
   
   // Then, prints again
   scala> printColumnStats(df())
   [v1#7101L]: ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
   [v2#7103L]: ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
   
   scala> cacheManager.analyzeColumnCacheQuery(df(), "c0" :: Nil)
   scala> printColumnStats(df())
   [v1#7101L]: ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
   [v2#7103L]: ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
   [c0#7073L]: ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@626bcfc8)))
   ```
   This PR is WIP; #22204 needs to be finished first, and then we will revisit this.
   
   ## How was this patch tested?
   Added tests in `CachedTableSuite` and `StatisticsCollectionSuite`.


