Re: When and how does Spark use metastore statistics?

2023-12-26 Thread Bjørn Jørgensen
Tell me more about
spark.sql.cbo.strategy


On Tue, 12 Dec 2023 at 00:25, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Where exactly are you getting this information from?
>
> As far as I can tell, spark.sql.cbo.enabled has defaulted to false since
> it was introduced 7 years ago. It has never been enabled by default.
>
> And I cannot see mention of spark.sql.cbo.strategy anywhere at all in the
> code base.
>
> So again, where is this information coming from? Please link directly to
> your source.
>
>
>
> On Dec 11, 2023, at 5:45 PM, Mich Talebzadeh 
> wrote:
>
> You are right. By default CBO is not enabled. Whilst the CBO was the
> default optimizer in earlier versions of Spark, it has been replaced by
> the AQE in recent releases.
>
> spark.sql.cbo.strategy
>
> As I understand it, the spark.sql.cbo.strategy configuration property
> specifies the optimizer strategy used by Spark SQL to generate query
> execution plans. There are two main optimizer strategies available:
>
>    - CBO (Cost-Based Optimization): The default optimizer strategy, which
>    analyzes the query plan and estimates the execution costs associated with
>    each operation. It uses statistics to guide its decisions, selecting the
>    plan with the lowest estimated cost.
>
>    - CBO-Like (Cost-Based Optimization-Like): A simplified optimizer
>    strategy that mimics some of the CBO's logic, but without the ability to
>    estimate costs. This strategy is faster than CBO for simple queries, but
>    may not produce the most efficient plan for complex queries.
>
> The spark.sql.cbo.strategy property can be set to either CBO or CBO-Like.
> The default value is AUTO, which means that Spark will automatically
> choose the most appropriate strategy based on the complexity of the query
> and availability of statistics.
>
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas 
> wrote:
>
>>
>> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh 
>> wrote:
>>
>> By default, the CBO is enabled in Spark.
>>
>>
>> Note that this is not correct. AQE is enabled by default, but CBO isn’t.
>>
>
>


Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
Where exactly are you getting this information from?

As far as I can tell, spark.sql.cbo.enabled has defaulted to false since it was 
introduced 7 years ago. It has never been enabled by default.

And I cannot see mention of spark.sql.cbo.strategy anywhere at all in the code 
base.

So again, where is this information coming from? Please link directly to your 
source.



> On Dec 11, 2023, at 5:45 PM, Mich Talebzadeh  
> wrote:
> 
> You are right. By default CBO is not enabled. Whilst the CBO was the default 
> optimizer in earlier versions of Spark, it has been replaced by the AQE in 
> recent releases.
> 
> spark.sql.cbo.strategy
> 
> As I understand it, the spark.sql.cbo.strategy configuration property specifies 
> the optimizer strategy used by Spark SQL to generate query execution plans. 
> There are two main optimizer strategies available:
> CBO (Cost-Based Optimization): The default optimizer strategy, which analyzes 
> the query plan and estimates the execution costs associated with each 
> operation. It uses statistics to guide its decisions, selecting the plan with 
> the lowest estimated cost.
> 
> CBO-Like (Cost-Based Optimization-Like): A simplified optimizer strategy that 
> mimics some of the CBO's logic, but without the ability to estimate costs. 
> This strategy is faster than CBO for simple queries, but may not produce the 
> most efficient plan for complex queries.
> 
> The spark.sql.cbo.strategy property can be set to either CBO or CBO-Like. The 
> default value is AUTO, which means that Spark will automatically choose the 
> most appropriate strategy based on the complexity of the query and 
> availability of statistics.
> 
> 
> 
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas wrote:
>> 
>>> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh wrote:
>>> 
>>> By default, the CBO is enabled in Spark.
>> 
>> Note that this is not correct. AQE is enabled by default, but CBO isn’t.



Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Mich Talebzadeh
You are right. By default CBO is not enabled. Whilst the CBO was the
default optimizer in earlier versions of Spark, it has been replaced by the
AQE in recent releases.

spark.sql.cbo.strategy

As I understand it, the spark.sql.cbo.strategy configuration property
specifies the optimizer strategy used by Spark SQL to generate query
execution plans. There are two main optimizer strategies available:

   - CBO (Cost-Based Optimization): The default optimizer strategy, which
   analyzes the query plan and estimates the execution costs associated with
   each operation. It uses statistics to guide its decisions, selecting the
   plan with the lowest estimated cost.

   - CBO-Like (Cost-Based Optimization-Like): A simplified optimizer strategy
   that mimics some of the CBO's logic, but without the ability to estimate
   costs. This strategy is faster than CBO for simple queries, but may not
   produce the most efficient plan for complex queries.

The spark.sql.cbo.strategy property can be set to either CBO or CBO-Like.
The default value is AUTO, which means that Spark will automatically choose
the most appropriate strategy based on the complexity of the query and
availability of statistics.


Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas 
wrote:

>
> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh 
> wrote:
>
> By default, the CBO is enabled in Spark.
>
>
> Note that this is not correct. AQE is enabled by default, but CBO isn’t.
>


Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas

> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh  
> wrote:
> spark.sql.cbo.strategy: Set to AUTO to use the CBO as the default optimizer, 
> or NONE to disable it completely.
> 
Hmm, I’ve also never heard of this setting before and can’t seem to find it in 
the Spark docs or source code.

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas

> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh  
> wrote:
> 
> By default, the CBO is enabled in Spark.

Note that this is not correct. AQE is enabled by default, but CBO isn’t.

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Mich Talebzadeh
Some of these concepts have been around outside of Spark for years, like CBO
and RBO, but I concur that they have a place in Spark's docs.

Simply put, statistics provide insights into the characteristics of data,
such as distribution, skewness, and cardinalities, which help the optimizer
make informed decisions about data partitioning, aggregation strategies,
and join order.

In much the same way, Spark uses statistics to:

   - Partition Data Effectively: Spark partitions data into smaller chunks
   to distribute and parallelize computations across worker nodes. Accurate
   statistics enable the optimizer to choose the most appropriate partitioning
   strategy for each data set, considering factors like data distribution and
   skewness.
   - Optimize Join Operations: Spark employs statistics to determine the
   most efficient join order, considering the join factors and their
   respective cardinalities. This helps reduce the amount of data shuffled
   during joins, improving performance and minimizing data transfer overhead.
   - Choose Optimal Aggregation Strategies: When performing aggregations,
   Spark uses statistics to determine the most efficient aggregation algorithm
   based on the data distribution and the desired aggregation functions. This
   ensures that aggregations are performed efficiently without compromising
   accuracy.
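
To make the join point in the list above a bit more concrete, here is a
minimal Scala sketch (the tables sales and dim_country and the join column
country_id are hypothetical) of how gathered statistics feed the
broadcast-join decision:

  // Collect table-level statistics so the optimizer can estimate sizes.
  spark.sql("ANALYZE TABLE dim_country COMPUTE STATISTICS")
  spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

  val joined = spark.table("sales")
    .join(spark.table("dim_country"), Seq("country_id"))

  // If the estimated size of dim_country comes in below
  // spark.sql.autoBroadcastJoinThreshold (10 MB by default), the physical
  // plan should show a BroadcastHashJoin rather than a SortMergeJoin.
  joined.explain()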


With regard to the types of statistics:


   - Catalog Statistics: These are pre-computed statistics that are stored
   in the Spark SQL catalog and associated with table or dataset metadata.
   They are typically gathered using the ANALYZE TABLE statement or through
   data source-specific mechanisms.
   - Data Source Statistics: These statistics are computed by the data
   source itself, such as Parquet or Hive, and are associated with the
   internal format of the data. Spark can access and utilize these statistics
   when working with external data sources.
   - Runtime Statistics: These are statistics that are dynamically computed
   during query execution. Spark can gather runtime statistics for certain
   operations, such as aggregations or joins, to refine its optimization
   decisions based on the actual data encountered.
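
For the catalog statistics above, a minimal Scala sketch (the table events
and its columns are hypothetical) of how they are typically gathered and then
read back:

  // Gather table-level and per-column catalog statistics.
  spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")
  spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS event_date, user_id")

  // Table-level stats (sizeInBytes, rowCount) appear under "Statistics" in
  // the extended description; column stats can be inspected per column.
  spark.sql("DESCRIBE EXTENDED events").show(100, truncate = false)
  spark.sql("DESCRIBE EXTENDED events event_date").show(truncate = false)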

It is important to mention Cost-Based Optimization (CBO). CBO in Spark
analyzes the query plan and estimates the execution costs associated with
each operation. It uses statistics to guide its decisions, selecting the
plan with the lowest estimated cost. I do not know of any RDBMS that uses a
rule-based optimizer (RBO) anymore.

By default, the CBO is enabled in Spark. However, you can explicitly enable
or disable it using the following options:

   - spark.sql.cbo.enabled: Set to true to enable the CBO, or false to
   disable it.

   - spark.sql.cbo.strategy: Set to AUTO to use the CBO as the default
   optimizer, or NONE to disable it completely.
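
For the first of these, a minimal Scala sketch of turning it on for a session
(spark.sql.cbo.joinReorder.enabled is an additional documented flag I am
assuming here, not something mentioned above):

  // Enable the cost-based optimizer for the current session. It only has
  // something to work with once ANALYZE TABLE has populated statistics.
  spark.conf.set("spark.sql.cbo.enabled", "true")
  spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true") // optional statistics-based join reordering

  // Check the current value.
  println(spark.conf.get("spark.sql.cbo.enabled"))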

HTH
Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.







On Mon, 11 Dec 2023 at 02:36, Nicholas Chammas 
wrote:

> I’ve done some reading and have a slightly better understanding of
> statistics now.
>
> Every implementation of LeafNode.computeStats offers its own way to get
> statistics:
>
>    - LocalRelation estimates the size of the relation directly from the
>    row count.
>    - HiveTableRelation pulls those statistics from the catalog or metastore.
>    - DataSourceV2Relation

Re: When and how does Spark use metastore statistics?

2023-12-10 Thread Nicholas Chammas
I’ve done some reading and have a slightly better understanding of statistics 
now.

Every implementation of LeafNode.computeStats offers its own way to get
statistics:

   - LocalRelation estimates the size of the relation directly from the row count.
   - HiveTableRelation pulls those statistics from the catalog or metastore.
   - DataSourceV2Relation delegates the job of computing statistics to the
   underlying data source.
There are a lot of details I’m still fuzzy on, but I think that’s the gist of 
things.

Would it make sense to add a paragraph or two to the SQL performance tuning
page covering statistics at a high level? Something that briefly explains:

   - what statistics are and how Spark uses them to optimize plans
   - the various ways Spark computes or loads statistics (catalog, data source,
   runtime, etc.)
   - how to gather catalog statistics (i.e. pointer to ANALYZE TABLE)
   - how to check statistics on an object (i.e. DESCRIBE EXTENDED) and as part
   of an optimized plan (i.e. .explain(mode="cost"))
   - what the cost-based optimizer does and how to enable it
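
As a rough running example for the gathering and checking bullets, a minimal
Scala sketch (the table name events is made up):

  spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")          // gather catalog statistics
  spark.sql("DESCRIBE EXTENDED events").show(truncate = false)  // check them on the object

  // Size and row-count estimates as the optimizer sees them, per plan node
  // (the Scala equivalent of .explain(mode="cost") in PySpark).
  spark.table("events").explain("cost")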
Would this be a welcome addition to the project’s documentation? I’m happy to 
work on this.



> On Dec 5, 2023, at 12:12 PM, Nicholas Chammas  
> wrote:
> 
> I’m interested in improving some of the documentation relating to the table 
> and column statistics that get stored in the metastore, and how Spark uses 
> them.
> 
> But I’m not clear on a few things, so I’m writing to you with some questions.
> 
> 1. The documentation for spark.sql.autoBroadcastJoinThreshold 
>  implies 
> that it depends on table statistics to work, but it’s not clear. Is it 
> accurate to say that unless you have run ANALYZE on the tables participating 
> in a join, spark.sql.autoBroadcastJoinThreshold cannot impact the execution 
> plan?
> 
> 2. As a follow-on to the above question, the adaptive version of 
> autoBroadcastJoinThreshold, namely 
> spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because it 
> depends only on runtime statistics and not statistics in the metastore. Is 
> that correct? I am assuming that “runtime statistics” are gathered on the fly 
> by Spark, but I would like to mention this in the docs briefly somewhere.
> 
> 3. The documentation for spark.sql.inMemoryColumnarStorage.compressed 
>  mentions 
> “statistics”, but it’s not clear what kind of statistics we’re talking about. 
> Are those runtime statistics, metastore statistics (that depend on you 
> running ANALYZE), or something else?
> 
> 4. The documentation for ANALYZE TABLE 
>  
> states that the collected statistics help the optimizer "find a better query 
> execution plan”. I wish we could link to something from here with more 
> explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only 
> place where metastore statistics are explicitly referenced as impacting the 
> execution plan. Surely there must be other places, no? Would it be 
> appropriate to mention the cost-based optimizer framework 
>  somehow? It doesn’t 
> appear to have any public documentation outside of Jira.
> 
> Any pointers or information you can provide would be very helpful. Again, I 
> am interested in contributing some documentation improvements relating to 
> statistics, but there is a lot I’m not sure about.
> 
> Nick
> 



When and how does Spark use metastore statistics?

2023-12-05 Thread Nicholas Chammas
I’m interested in improving some of the documentation relating to the table and 
column statistics that get stored in the metastore, and how Spark uses them.

But I’m not clear on a few things, so I’m writing to you with some questions.

1. The documentation for spark.sql.autoBroadcastJoinThreshold 
 implies that 
it depends on table statistics to work, but it’s not clear. Is it accurate to 
say that unless you have run ANALYZE on the tables participating in a join, 
spark.sql.autoBroadcastJoinThreshold cannot impact the execution plan?

2. As a follow-on to the above question, the adaptive version of 
autoBroadcastJoinThreshold, namely 
spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because it 
depends only on runtime statistics and not statistics in the metastore. Is that 
correct? I am assuming that “runtime statistics” are gathered on the fly by 
Spark, but I would like to mention this in the docs briefly somewhere.
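
For reference, a minimal Scala sketch of the two settings in question (the
values shown are just the documented defaults):

  // Planning-time threshold, and the AQE variant that is applied using
  // runtime statistics. Both take a byte size; -1 disables broadcast joins.
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")           // 10 MB default
  spark.conf.set("spark.sql.adaptive.enabled", "true")                         // AQE, on by default in recent releases
  spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "10485760")  // defaults to the value above when unset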

3. The documentation for spark.sql.inMemoryColumnarStorage.compressed 
 mentions 
“statistics”, but it’s not clear what kind of statistics we’re talking about. 
Are those runtime statistics, metastore statistics (that depend on you running 
ANALYZE), or something else?

4. The documentation for ANALYZE TABLE 
 
states that the collected statistics help the optimizer "find a better query 
execution plan”. I wish we could link to something from here with more 
explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only place 
where metastore statistics are explicitly referenced as impacting the execution 
plan. Surely there must be other places, no? Would it be appropriate to mention 
the cost-based optimizer framework 
 somehow? It doesn’t appear 
to have any public documentation outside of Jira.

Any pointers or information you can provide would be very helpful. Again, I am 
interested in contributing some documentation improvements relating to 
statistics, but there is a lot I’m not sure about.

Nick