AngersZhuuuu opened a new pull request #31485:
URL: https://github.com/apache/spark/pull/31485


   
   ### What changes were proposed in this pull request?
   When a query is explained in cost mode, the treeString of a subquery plan 
does not show its statistics.
   
   How to reproduce:
   ```
   spark.sql("create table t1 using parquet as select id as a, id as b from range(1000)")
   spark.sql("create table t2 using parquet as select id as c, id as d from range(2000)")
   
   spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
   spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
   spark.sql("set spark.sql.cbo.enabled=true")
   
   spark.sql(
     """
       |WITH max_store_sales AS
       |  (SELECT max(csales) tpcds_cmax
       |  FROM (SELECT
       |    sum(b) csales
       |  FROM t1 WHERE a < 100 ) x),
       |best_ss_customer AS
       |  (SELECT
       |    c
       |  FROM t2
       |  WHERE d > (SELECT * FROM max_store_sales))
       |
       |SELECT c FROM best_ss_customer
       |""".stripMargin).explain("cost")
   ```
   Output before this PR:
   ```
   == Optimized Logical Plan ==
   Project [c#4263L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
   +- Filter (isnotnull(d#4264L) AND (d#4264L > scalar-subquery#4262 [])), Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
      :  +- Aggregate [max(csales#4260L) AS tpcds_cmax#4261L]
      :     +- Aggregate [sum(b#4266L) AS csales#4260L]
      :        +- Project [b#4266L]
      :           +- Filter ((a#4265L < 100) AND isnotnull(a#4265L))
      :              +- Relation default.t1[a#4265L,b#4266L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
      +- Relation default.t2[c#4263L,d#4264L] parquet, Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
   ```
   Another case is TPC-DS q23a.
   
   Output after this PR:
   ```
   == Optimized Logical Plan ==
   Project [c#4481L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
   +- Filter (isnotnull(d#4482L) AND (d#4482L > scalar-subquery#4480 [])), Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
      :  +- Aggregate [max(csales#4478L) AS tpcds_cmax#4479L], Statistics(sizeInBytes=16.0 B, rowCount=1)
      :     +- Aggregate [sum(b#4484L) AS csales#4478L], Statistics(sizeInBytes=16.0 B, rowCount=1)
      :        +- Project [b#4484L], Statistics(sizeInBytes=1616.0 B, rowCount=101)
      :           +- Filter (isnotnull(a#4483L) AND (a#4483L < 100)), Statistics(sizeInBytes=2.4 KiB, rowCount=101)
      :              +- Relation[a#4483L,b#4484L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
      +- Relation[c#4481L,d#4482L] parquet, Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
   ```
   
   ### Why are the changes needed?
   Without this change the cost-mode treeString is incomplete: subquery plan 
nodes are printed without their statistics.
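   
   To illustrate the kind of gap described here, the following is a toy 
model only. `Stats`, `Node`, and `ToyExplain` are hypothetical names and are 
not Spark's classes, and the sketch does not claim to mirror the actual 
patch. It assumes a tree printer that takes a `printStats` flag, and shows 
that the flag has to be forwarded to subquery plans as well as to ordinary 
children for every line to carry statistics.
   
   ```scala
   // Toy model: hypothetical types, NOT Spark's LogicalPlan or Statistics.
   final case class Stats(sizeInBytes: Long, rowCount: Long)

   final case class Node(
       name: String,
       stats: Stats,
       children: Seq[Node] = Nil,
       subqueries: Seq[Node] = Nil) // plans referenced from expressions

   object ToyExplain {
     // Renders one line per node, subquery plans first, then children.
     // `printStats` is passed down to both kinds of nested plan; dropping it
     // for `subqueries` would leave those lines without a statistics suffix.
     def treeString(node: Node, printStats: Boolean, depth: Int = 0): String = {
       val indent = "  " * depth
       val suffix =
         if (printStats) {
           s", Statistics(sizeInBytes=${node.stats.sizeInBytes} B, " +
             s"rowCount=${node.stats.rowCount})"
         } else {
           ""
         }
       val self = s"$indent${node.name}$suffix"
       val nested = (node.subqueries ++ node.children)
         .map(treeString(_, printStats, depth + 1))
       (self +: nested).mkString("\n")
     }
   }
   ```
   
   With a `Node` that has one entry in `subqueries` and one in `children`, 
`ToyExplain.treeString(plan, printStats = true)` yields three lines, each 
ending in a `Statistics(...)` suffix.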
   
   ### Does this PR introduce _any_ user-facing change?
   When a user runs explain in cost mode, the output now also includes the 
statistics of subquery plans.
   
   
   ### How was this patch tested?
   Working

