xinrong-meng commented on code in PR #45374:
URL: https://github.com/apache/spark/pull/45374#discussion_r1513548381
##########
docs/sql-performance-tuning.md:
##########
@@ -157,6 +157,18 @@ SELECT /*+ REBALANCE(3, c) */ * FROM t;
 
 For more details please refer to the documentation of [Partitioning Hints](sql-ref-syntax-qry-select-hints.html#partitioning-hints).
 
+## Leveraging Statistics
+Apache Spark's ability to choose the best execution plan among many possible options is determined in part by its estimates of how many rows will be output by every node in the execution plan (read, filter, join, etc.). Those estimates in turn are based on statistics that are made available to Spark in one of several ways:
+
+- **Data source**: Statistics that Spark reads directly from the underlying data source, like the counts and min/max values in the metadata of Parquet files. These statistics are maintained by the underlying data source.
+- **Catalog**: Statistics that Spark reads from the catalog, like the Hive Metastore. These statistics are collected or updated whenever you run [`ANALYZE TABLE`](sql-ref-syntax-aux-analyze-table.html).
+- **Runtime**: Statistics that Spark computes itself at runtime as a job is running.

Review Comment:
   Thank you @nchammas, that's very helpful!
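As a side note for readers of this thread, here is a minimal Spark SQL sketch of the "Catalog" bullet above: collecting statistics with `ANALYZE TABLE` and then inspecting them. The table `t` and column `c` are hypothetical names reused from the diff's hint example, not part of the documented change.

```sql
-- Collect table-level statistics (row count, size in bytes) into the catalog.
-- `t` and `c` are placeholder names used only for illustration.
ANALYZE TABLE t COMPUTE STATISTICS;

-- Collect column-level statistics (min/max, distinct count, null count) for `c`.
ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c;

-- Inspect what was collected: table stats appear under "Statistics" in the
-- extended table description, column stats in the extended column description.
DESCRIBE EXTENDED t;
DESCRIBE EXTENDED t c;
```

With catalog statistics in place, the optimizer can size joins and pick broadcast candidates more accurately; the "Runtime" statistics in the third bullet are gathered automatically (e.g. by adaptive query execution) without any user action.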
