Hi Wenchen,

I've just stumbled across this comment in
`LeafNode.computeStats` [1]:

> Leaf nodes that can survive analysis must define their own statistics.

And this one in the scaladoc of `AnalysisBarrier` [2]:

> This analysis barrier will be removed at the end of analysis stage.

That makes a lot of sense now, and it makes `QueryExecution.analyzed` crucial
(since a plan may still contain `AnalysisBarrier` nodes). Thanks again!

Regarding your comment:

> For this particular case, we can make it work by defining `computeStats`
in `AnalysisBarrier`. But it's also OK to just leave it as it is, as this
doesn't break any real use cases.

Don't you think that `AnalysisBarrier.computeStats` could simply dispatch
to `child.computeStats`? Even though "this doesn't break any real use cases",
it would avoid questions like mine, leave less to worry about, and make things
more comprehensible. Mind if I proposed a PR with the change (and another
for the typo in the scaladoc where it says: "The SQL Analyzer goes through
a whole query plan even most part of it is analyzed." [3])?
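To illustrate what I have in mind, here's a minimal, self-contained sketch of
the delegation pattern. Note that `Statistics`, `LogicalPlan`, `MockRelation`
and `MockBarrier` below are mock stand-ins I made up for this email, not
Spark's real classes:

```scala
// Mock of the pattern only; these are NOT Spark's actual types.
case class Statistics(sizeInBytes: BigInt)

trait LogicalPlan {
  // Mirrors the default in LeafNode.computeStats: nodes that do not
  // define their own statistics fail at runtime.
  def computeStats(): Statistics =
    throw new UnsupportedOperationException("computeStats is not implemented")
}

// A leaf that knows its size, like a resolved relation.
case class MockRelation(size: BigInt) extends LogicalPlan {
  override def computeStats(): Statistics = Statistics(size)
}

// The proposed change: the barrier just dispatches to its child, so
// stats work even before the analyzer removes the barrier.
case class MockBarrier(child: LogicalPlan) extends LogicalPlan {
  override def computeStats(): Statistics = child.computeStats()
}
```

With the override in place, `MockBarrier(MockRelation(48)).computeStats()`
returns the child's statistics instead of throwing
`UnsupportedOperationException` the way the barrier-wrapped plan in my shell
session below does.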

[1]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L231
[2]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L903
[3]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L895

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Thu, Jan 4, 2018 at 11:57 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> First of all, I think you know that `QueryExecution` is a developer API
> right? By definition `QueryExecution.logical` is the input plan, which can
> even be unresolved. Developers should be aware of it and do not apply
> operations that need the plan to be resolved. Obviously `LogicalPlan.stats`
> needs the plan to be resolved.
>
> For this particular case, we can make it work by defining `computeStats`
> in `AnalysisBarrier`. But it's also OK to just leave it as it is, as this
> doesn't break any real use cases.
>
> On Thu, Jan 4, 2018 at 4:36 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> I use Spark from the master today.
>>
>> $ ./bin/spark-shell --version
>> Welcome to
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/  '_/
>>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>>       /_/
>>
>> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_152
>> Branch master
>> Compiled by user jacek on 2018-01-04T05:44:05Z
>> Revision 7d045c5f00e2c7c67011830e2169a4e130c3ace8
>>
>> Can anyone explain why some queries have stats in logical plan while
>> others don't (and I had to use analyzed logical plan)?
>>
>> I can explain the difference using the code, but I don't know why there
>> is the difference.
>>
>> spark.range(1000).write.parquet("/tmp/p1000")
>> // The stats are available in logical plan (in logical "phase")
>> scala> spark.read.parquet("/tmp/p1000").queryExecution.logical.stats
>> res21: org.apache.spark.sql.catalyst.plans.logical.Statistics =
>> Statistics(sizeInBytes=6.9 KB, hints=none)
>>
>> // logical plan fails, but it worked fine above --> WHY?!
>> val names = Seq((1, "one"), (2, "two")).toDF("id", "name")
>> scala> names.queryExecution.logical.stats
>> java.lang.UnsupportedOperationException
>>   at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:232)
>>   at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55)
>>   at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27)
>>
>> // analyzed logical plan works fine
>> scala> names.queryExecution.analyzed.stats
>> res23: org.apache.spark.sql.catalyst.plans.logical.Statistics =
>> Statistics(sizeInBytes=48.0 B, hints=none)
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
>>
>
>
