Hi,
I'm using Spark built from master today:
$ ./bin/spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_152
Branch master
Compiled by user jacek on 2018-01-04T05:44:05Z
Revision 7d045c5f00e2c7c67011830e2169a4e130c3ace8
Can anyone explain why some queries have stats available in the (unanalyzed)
logical plan while others don't (so that I had to use the analyzed logical
plan instead)? I can show the difference with the code below, but I don't
know why the difference exists.
scala> spark.range(1000).write.parquet("/tmp/p1000")
// The stats are available in logical plan (in logical "phase")
scala> spark.read.parquet("/tmp/p1000").queryExecution.logical.stats
res21: org.apache.spark.sql.catalyst.plans.logical.Statistics =
Statistics(sizeInBytes=6.9 KB, hints=none)
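
// For comparison, printing the unanalyzed plan should show the leaf node
// that provides the stats here. Just a diagnostic sketch -- I'd expect a
// LogicalRelation leaf, which overrides computeStats, but the exact tree
// depends on the build.
scala> println(spark.read.parquet("/tmp/p1000").queryExecution.logical.treeString)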
// The (unanalyzed) logical plan fails here, even though it worked fine above --> WHY?!
scala> val names = Seq((1, "one"), (2, "two")).toDF("id", "name")
scala> names.queryExecution.logical.stats
java.lang.UnsupportedOperationException
  at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:232)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27)
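My guess is that the unanalyzed plan contains a leaf node that does not
override computeStats (the stack trace points at the LeafNode default), and
that analysis replaces or removes it. A quick diagnostic sketch to compare
the two trees (numberedTreeString just pretty-prints the plan):
// Compare the unanalyzed and analyzed trees to spot the leaf node
// whose default computeStats throws.
scala> println(names.queryExecution.logical.numberedTreeString)
scala> println(names.queryExecution.analyzed.numberedTreeString)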
// analyzed logical plan works fine
scala> names.queryExecution.analyzed.stats
res23: org.apache.spark.sql.catalyst.plans.logical.Statistics =
Statistics(sizeInBytes=48.0 B, hints=none)
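
For what it's worth, the optimized plan should report stats for both queries
too, since it is derived from the analyzed plan. A workaround sketch, not an
explanation of the difference:
// Ask the optimized plan instead; it sits past analysis, so it
// should not hit the LeafNode.computeStats default.
scala> names.queryExecution.optimizedPlan.stats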
Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski