Hi,

I'm using a Spark build from master today.
$ ./bin/spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_152
Branch master
Compiled by user jacek on 2018-01-04T05:44:05Z
Revision 7d045c5f00e2c7c67011830e2169a4e130c3ace8

Can anyone explain why some queries have stats in the logical plan while
others don't (and I had to use the analyzed logical plan instead)? I can
demonstrate the difference with code, but I don't know why it is there.

spark.range(1000).write.parquet("/tmp/p1000")

// The stats are available in the logical plan (in the logical "phase")
scala> spark.read.parquet("/tmp/p1000").queryExecution.logical.stats
res21: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=6.9 KB, hints=none)

// The logical plan fails here, but it worked fine above --> WHY?!
val names = Seq((1, "one"), (2, "two")).toDF("id", "name")

scala> names.queryExecution.logical.stats
java.lang.UnsupportedOperationException
  at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:232)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27)

// The analyzed logical plan works fine
scala> names.queryExecution.analyzed.stats
res23: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=48.0 B, hints=none)

Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
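P.S. In case it helps with reproducing this, here is a minimal spark-shell
sketch I'd use to compare the two plans side by side. The describePlan
helper is mine, purely for illustration; it only relies on
numberedTreeString, collect and stats from Catalyst's TreeNode/LogicalPlan
API, so the interesting output is which leaf node types each plan contains.

import scala.util.Try
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}

// Hypothetical helper (not part of Spark): print a plan's tree, the types
// of its leaf nodes, and whether stats can be computed on it.
def describePlan(label: String, plan: LogicalPlan): Unit = {
  println(s"=== $label ===")
  println(plan.numberedTreeString)
  println("leaf nodes: " +
    plan.collect { case l: LeafNode => l.getClass.getSimpleName }.mkString(", "))
  println("stats: " + Try(plan.stats))
}

val names = Seq((1, "one"), (2, "two")).toDF("id", "name")

describePlan("logical (unanalyzed)", names.queryExecution.logical) // stats: Failure(...)
describePlan("analyzed", names.queryExecution.analyzed)            // stats: Success(...)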