[
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754694#comment-17754694
]
Rakesh Raushan edited comment on SPARK-44817 at 8/16/23 1:08 PM:
-----------------------------------------------------------------
[~cloud_fan] [~gurwls223] [~maxgekk] [~dongjoon] What are your thoughts on
this?
If this looks promising, I can work on raising a PR for this.
was (Author: rakson):
[~cloud_fan] [~gurwls223] [~maxgekk] What are your thoughts on this?
If this looks promising, I can work on raising a PR for this.
> Incremental Stats Collection
> ----------------------------
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Rakesh Raushan
> Priority: Major
>
> Spark's cost-based optimizer (CBO) depends on table and column statistics.
> After every DML query, table and column stats are invalidated unless
> automatic stats updates are turned on. To keep stats up to date we need to
> run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive.
> It is not feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run
> itself. This way table and column stats stay fresh at all times and the
> CBO's benefits apply. Initially, we can update only table-level stats and
> gradually start updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE
> STATISTICS` for updating stats.
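
To make the proposal concrete, here is a minimal sketch of what an incremental
table-level update could look like. It is plain Python, not Spark code:
`TableStats` and `apply_write_delta` are hypothetical names invented for this
illustration, and it assumes each DML statement can report how many rows and
bytes it added or removed.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TableStats:
    """Hypothetical table-level statistics; this is not a Spark class."""
    row_count: int
    size_in_bytes: int


def apply_write_delta(stats: TableStats, rows_delta: int, bytes_delta: int) -> TableStats:
    """Fold the row/byte change reported by one DML statement into existing stats.

    Positive deltas model an INSERT, negative deltas model a DELETE.
    """
    new_rows = stats.row_count + rows_delta
    new_bytes = stats.size_in_bytes + bytes_delta
    if new_rows < 0 or new_bytes < 0:
        # The stored stats cannot be trusted any more; fall back to a full recompute.
        raise ValueError("delta would make statistics negative; recompute needed")
    return replace(stats, row_count=new_rows, size_in_bytes=new_bytes)


# Assumed starting point: stats from one full collection run.
stats = TableStats(row_count=1_000, size_in_bytes=64_000)
stats = apply_write_delta(stats, rows_delta=250, bytes_delta=16_000)  # INSERT of 250 rows
stats = apply_write_delta(stats, rows_delta=-50, bytes_delta=-3_200)  # DELETE of 50 rows
print(stats)  # TableStats(row_count=1200, size_in_bytes=76800)
```

Row count and size are additive, which is why table-level stats are the easier
first step. Column-level stats such as distinct counts are not additive in
this way, so they would likely need a different approach.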