gatorsmile commented on a change in pull request #27216: [SPARK-28801][DOC]
Document SELECT statement in SQL Reference (Main page)
URL: https://github.com/apache/spark/pull/27216#discussion_r368264445
##########
File path: docs/sql-ref-syntax-qry-select.md
##########
@@ -18,8 +18,119 @@ license: |
See the License for the specific language governing permissions and
limitations under the License.
---
+Spark supports a `SELECT` statement and conforms to the ANSI SQL standard. Queries are
+used to retrieve result sets from one or more tables. The following section
+describes the overall query syntax and the sub-sections cover different constructs
+of a query along with examples.
-Spark SQL is a Apache Spark's module for working with structured data.
-This guide is a reference for Structured Query Language (SQL) for Apache
-Spark. This document describes the SQL constructs supported by Spark in detail
-along with usage examples when applicable.
+### Syntax
+{% highlight sql %}
+[WITH with_query [, ...]]
+SELECT [hints, ...] [ALL|DISTINCT] named_expression[, named_expression, ...]
+ FROM from_item [, from_item, ...]
+ [WHERE boolean_expression]
+ [GROUP BY expression [, ...] ]
+ [HAVING boolean_expression [, ...] ]
+ [ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ]
+ [SORT BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ]
+ [CLUSTER BY expression [, ...] ]
+ [DISTRIBUTE BY expression [, ...] ]
+ [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
+ [WINDOW named_window[, WINDOW named_window, ...]]
+ [LIMIT {ALL | expression}]
+{% endhighlight %}
+
+### Parameters
+<dl>
+ <dt><code><em>with_query</em></code></dt>
+ <dd>
+ Specifies the common table expressions (CTEs) before the main <code>SELECT</code> query block.
+ These table expressions are allowed to be referenced later in the main query. This is useful to abstract
+ out repeated subquery blocks in the main query and improves readability of the query.
+ </dd>
+ <dt><code><em>hints</em></code></dt>
+ <dd>
+ Hints can be specified to help the Spark optimizer make better planning decisions. Currently, Spark supports hints
+ that influence selection of join strategies and repartitioning of the data.
+ </dd>
+ <dt><code><em>ALL</em></code></dt>
+ <dd>
+ Selects all matching rows from the relation. This is the default.
+ </dd>
+ <dt><code><em>DISTINCT</em></code></dt>
+ <dd>
+ Selects all matching rows from the relation after removing duplicates from the results.
+ </dd>
+ <dt><code><em>named_expression</em></code></dt>
+ <dd>
+ An expression with an assigned name. In general, it denotes a column expression.<br><br>
+ <b>Syntax:</b>
+ <code>
+ expression [AS] [alias]
+ </code>
+ </dd>
+ <dt><code><em>from_item</em></code></dt>
+ <dd>
+ Specifies a source of input for the query. It can be one of the following.
+ <ol>
+ <li>Table relation</li>
+ <li>Join relation</li>
+ <li>Table valued function</li>
+ <li>Inlined table</li>
+ <li>Subquery</li>
+ </ol>
+ </dd>
+ <dt><code><em>WHERE</em></code></dt>
+ <dd>
+ Filters the result of the FROM clause based on the supplied predicates.
+ </dd>
+ <dt><code><em>GROUP BY</em></code></dt>
+ <dd>
+ Specifies the expressions that are used to group the rows. This is used in conjunction with aggregate functions
+ (MIN, MAX, COUNT, SUM, AVG) to group rows based on the grouping expressions.
+ </dd>
+ <dt><code><em>HAVING</em></code></dt>
+ <dd>
+ Specifies the predicates by which the rows produced by GROUP BY are filtered. The HAVING clause is used to
+ filter rows after the grouping is performed.
+ </dd>
+ <dt><code><em>ORDER BY</em></code></dt>
+ <dd>
+ Specifies an ordering of the rows of the complete result set of the query. The output rows are ordered
+ across the partitions. This parameter is mutually exclusive with <code>SORT BY</code>,
+ <code>CLUSTER BY</code> and <code>DISTRIBUTE BY</code>; it cannot be specified together with any of them.
+ </dd>
+ <dt><code><em>SORT BY</em></code></dt>
+ <dd>
+ Specifies an ordering by which the rows are ordered within each partition. This parameter is mutually
+ exclusive with <code>ORDER BY</code> and <code>CLUSTER BY</code>; it cannot be specified together with either of them.
+ </dd>
+ <dt><code><em>CLUSTER BY</em></code></dt>
+ <dd>
+ Specifies a set of expressions that is used to repartition and sort the rows. Using this clause has
+ the same effect as using <code>DISTRIBUTE BY</code> and <code>SORT BY</code> together.
+ </dd>
+ <dt><code><em>DISTRIBUTE BY</em></code></dt>
Review comment:
These three clauses are very special: they come from Hive. Could we have a
simple SELECT first and then a full SELECT?
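
For context, here is a rough sketch of how SORT BY, CLUSTER BY and DISTRIBUTE BY, as described in the diff above, differ from ORDER BY. It is a toy model in plain Python, not Spark code: the function names, the sample rows, the partition count, and the modulo-hash routing are all assumptions made for illustration, and Spark's actual partitioner is not modeled.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, str]            # hypothetical row shape: (id, label)
Partition = List[Row]


def distribute_by(rows: List[Row], key: Callable[[Row], int],
                  num_partitions: int) -> List[Partition]:
    """Route rows with equal keys to the same partition; no ordering is implied."""
    parts: List[Partition] = [[] for _ in range(num_partitions)]
    for row in rows:
        # Assumption: a simple hash-modulo router stands in for the real partitioner.
        parts[hash(key(row)) % num_partitions].append(row)
    return parts


def sort_by(partitions: List[Partition],
            key: Callable[[Row], int]) -> List[Partition]:
    """Sort rows inside each partition independently; partitions are not merged."""
    return [sorted(part, key=key) for part in partitions]


def cluster_by(rows: List[Row], key: Callable[[Row], int],
               num_partitions: int) -> List[Partition]:
    """Model of CLUSTER BY: DISTRIBUTE BY followed by SORT BY on the same key."""
    return sort_by(distribute_by(rows, key, num_partitions), key)


def order_by(partitions: List[Partition],
             key: Callable[[Row], int]) -> List[Row]:
    """Model of ORDER BY: one total ordering across all partitions."""
    return sorted((row for part in partitions for row in part), key=key)


def by_id(row: Row) -> int:
    return row[0]


# Hypothetical input that already lives in two partitions.
input_partitions: List[Partition] = [
    [(3, "c"), (1, "a"), (4, "d")],
    [(1, "b"), (2, "e"), (3, "f")],
]
all_rows: List[Row] = [row for part in input_partitions for row in part]

within_partitions = sort_by(input_partitions, by_id)   # sorted per partition only
total_order = order_by(input_partitions, by_id)        # sorted globally
clustered = cluster_by(all_rows, by_id, 2)              # equal ids co-located, then sorted
```

In this model, `sort_by` leaves id 1 in both partitions, while `cluster_by` puts every row with the same id in one partition before sorting, which is the distinction a simple SELECT page could skip and a full SELECT page could spell out.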