alamb commented on code in PR #10361: URL: https://github.com/apache/datafusion/pull/10361#discussion_r1589052854
########## docs/source/user-guide/sql/subqueries.md: ########## @@ -19,80 +19,392 @@ # Subqueries -DataFusion supports `EXISTS`, `NOT EXISTS`, `IN`, `NOT IN` and Scalar Subqueries. +Subqueries (also known as inner queries or nested queries) are queries within +a query. +Subqueries can be used in `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses. -The examples below are based on the following table. +The examples below are based on the following tables. ```sql -select * from x; +SELECT * FROM x; + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +```sql +SELECT * FROM y; + ++--------+--------+ +| number | string | ++--------+--------+ +| 1 | one | ++--------+--------+ +| 2 | two | ++--------+--------+ +| 3 | three | ++--------+--------+ +| 4 | four | ++--------+--------+ ``` -## EXISTS +## Subquery operators -The `EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces one or more matches -for that row. Only correlated subqueries are supported. +- [[ NOT ] EXISTS](#-not--exists) +- [[ NOT ] IN](#-not--in) + +### [ NOT ] EXISTS + +The `EXISTS` operator returns all rows where a +_[correlated subquery](#correlated-subqueries)_ produces one or more matches for +that row. `NOT EXISTS` returns all rows where a _correlated subquery_ produces +zero matches for that row. Only _correlated subqueries_ are supported. ```sql -select * from x y where exists (select * from x where x.column_1 = y.column_1); +[NOT] EXISTS (subquery) +``` + +### [ NOT ] IN + +The `IN` operator returns all rows where a given expression’s value can be found +in the results of a _[correlated subquery](#correlated-subqueries)_. +`NOT IN` returns all rows where a given expression’s value cannot be found in +the results of a subquery or list of values. + +```sql +expression [NOT] IN (subquery|list-literal) +``` + +#### Examples + +```sql +SELECT * FROM x WHERE column_1 IN (1,3); + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ -1 row in set. ``` -## NOT EXISTS +```sql +SELECT * FROM x WHERE column_1 NOT IN (1,3); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +## SELECT clause subqueries + +`SELECT` clause subqueries use values returned from the inner query as part +of the outer query's `SELECT` list. +The `SELECT` clause only supports [scalar subqueries](#scalar-subqueries) that +return a single value per execution of the inner query. +The returned value can be unique per row. + +```sql +SELECT [expression1[, expression2, ..., expressionN],] (<subquery>) +``` + +**Note**: `SELECT` clause subqueries can be used as an alternative to `JOIN` +operations. -The `NOT EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces zero matches -for that row. Only correlated subqueries are supported. +### Example ```sql -select * from x y where not exists (select * from x where x.column_1 = y.column_1); -0 rows in set. +SELECT + column_1, + ( + SELECT + first_value(string) + FROM + y + WHERE + number = x.column_1 + ) AS "numeric string" +FROM + x; + ++----------+----------------+ +| column_1 | numeric string | ++----------+----------------+ +| 1 | one | +| 2 | two | ++----------+----------------+ ``` -## IN +## FROM clause subqueries -The `IN` syntax can be used to find all rows in a relation where a given expression's value can be found in the -results of a correlated subquery. +`FROM` clause subqueries return a set of results that is then queried and +operated on by the outer query. ```sql -select * from x where column_1 in (select column_1 from x); +SELECT expression1[, expression2, ..., expressionN] FROM (<subquery>) +``` + +### Example + +The following query returns the average of maximum values per room. +The inner query returns the maximum value for each field from each room. +The outer query uses the results of the inner query and returns the average +maximum value for each field. + +```sql +SELECT + column_2 +FROM + ( + SELECT + * + FROM + x + WHERE + column_1 > 1 + ); + ++----------+ +| column_2 | ++----------+ +| 4 | ++----------+ +``` + +## WHERE clause subqueries + +`WHERE` clause subqueries compare an expression to the result of the subquery +and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `WHERE` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + expression1[, expression2, ..., expressionN] +FROM + <measurement> +WHERE + expression operator (<subquery>) +``` + +**Note:** `WHERE` clause subqueries can be used as an alternative to `JOIN` +operations. + +### Examples + +#### `WHERE` clause with scalar subquery + +The following query returns all rows with `column_2` values above the average +of all `number` values in `y`. + +```sql +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(number) + FROM + y + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` -## NOT IN +#### `WHERE` clause with non-scalar subquery -The `NOT IN` syntax can be used to find all rows in a relation where a given expression's value can not be found in the -results of a correlated subquery. +Non-scalar subqueries must use the `[NOT] IN` operator and can only return a +single column. The values in the returned column are evaluated as a list. + +The following query returns all rows with `column_2` values in table `x` that +are in the list of numbers with string lengths greater than three from table +`y`. ```sql -select * from x where column_1 not in (select column_1 from x); -0 rows in set. +SELECT + * +FROM + x +WHERE + column_2 IN ( + SELECT + number + FROM + y + WHERE + length(string) > 3 + ); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ ``` -## Scalar Subquery +### `WHERE` clause with correlated subquery -A scalar subquery can be used to produce a single value that can be used in many different contexts in a query. Here -is an example of a filter using a scalar subquery. Only correlated subqueries are supported. +The following query returns rows with `column_2` values from table `x` greater +than the average `string` value length from table `y`. +The subquery in the `WHERE` clause uses the `column_1` value from the outer +query to return the average `string` value length for that specific value. ```sql -select * from x y where column_1 < (select sum(column_2) from x where x.column_1 = y.column_1); +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(length(string)) + FROM + y + WHERE + number = x.column_1 + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` + +## HAVING clause subqueries + +`HAVING` clause subqueries compare an expression that uses aggregate values +returned by aggregate functions in the `SELECT` clause to the result of the +subquery and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `HAVING` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + aggregate_expression1[, aggregate_expression2, ..., aggregate_expressionN] +FROM + <measurement> +WHERE + <conditional_expression> +GROUP BY + column_expression1[, column_expression2, ..., column_expressionN] +HAVING + expression operator (<subquery>) +``` + +### Examples + +The following query calculates the averages of even and odd numbers in table `y` +and returns the averages that are equal to the maximum value of `column_1` +in table `x`. + +#### `HAVING` clause with a scalar subquery + +```sql +SELECT + AVG(number) AS avg, + (number % 2 = 0) AS even +FROM + y +GROUP BY + even +HAVING + avg = ( + SELECT + MAX(column_1) + FROM + x + ); + ++-------+--------+ +| avg | even | ++-------+--------+ +| 2 | false | ++-------+--------+ +``` + +#### `HAVING` clause with a non-scalar subquery + +Non-scalar subqueries must use the `[NOT] IN` operator and can only return a +single column. The values in the returned column are evaluated as a list. + +The following query calculates the averages of even and odd numbers in table `y` +and returns the averages that are in `column_1` of table `x`. + +```sql +SELECT + AVG(number) AS avg, + (number % 2 = 0) AS even +FROM + y +GROUP BY + even +HAVING + avg IN ( + SELECT + column_1 + FROM + x + ); + ++-------+--------+ +| avg | even | ++-------+--------+ +| 2 | false | ++-------+--------+ +``` + +## Subquery categories + +Subqueries can be categorized as one or more of the following based on the +behavior of the subquery: + +- [correlated](#correlated-subqueries) or + [non-correlated](#non-correlated-subqueries) +- [scalar](#scalar-subqueries) or [non-scalar](#non-scalar-subqueries) + +### Correlated subqueries + +In a **correlated** subquery, the inner query depends on the outer query, using +values from the outer query for its results. +Correlated subqueries can return a maximum of one row, so Review Comment: I don't think correlated subqueries always return a single row (though it may be required in DataFusion's current implementation). Maybe we could use some of the text from https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-correlated-subquery/ ########## docs/source/user-guide/sql/subqueries.md: ########## @@ -19,80 +19,392 @@ # Subqueries -DataFusion supports `EXISTS`, `NOT EXISTS`, `IN`, `NOT IN` and Scalar Subqueries. +Subqueries (also known as inner queries or nested queries) are queries within +a query. +Subqueries can be used in `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses. -The examples below are based on the following table. +The examples below are based on the following tables. ```sql -select * from x; +SELECT * FROM x; + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +```sql +SELECT * FROM y; + ++--------+--------+ +| number | string | ++--------+--------+ +| 1 | one | ++--------+--------+ +| 2 | two | ++--------+--------+ +| 3 | three | ++--------+--------+ +| 4 | four | ++--------+--------+ ``` -## EXISTS +## Subquery operators -The `EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces one or more matches -for that row. Only correlated subqueries are supported. +- [[ NOT ] EXISTS](#-not--exists) +- [[ NOT ] IN](#-not--in) + +### [ NOT ] EXISTS + +The `EXISTS` operator returns all rows where a +_[correlated subquery](#correlated-subqueries)_ produces one or more matches for +that row. `NOT EXISTS` returns all rows where a _correlated subquery_ produces +zero matches for that row. Only _correlated subqueries_ are supported. ```sql -select * from x y where exists (select * from x where x.column_1 = y.column_1); +[NOT] EXISTS (subquery) +``` + +### [ NOT ] IN + +The `IN` operator returns all rows where a given expression’s value can be found +in the results of a _[correlated subquery](#correlated-subqueries)_. +`NOT IN` returns all rows where a given expression’s value cannot be found in +the results of a subquery or list of values. + +```sql +expression [NOT] IN (subquery|list-literal) +``` + +#### Examples + +```sql +SELECT * FROM x WHERE column_1 IN (1,3); + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ -1 row in set. ``` -## NOT EXISTS +```sql +SELECT * FROM x WHERE column_1 NOT IN (1,3); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +## SELECT clause subqueries + +`SELECT` clause subqueries use values returned from the inner query as part +of the outer query's `SELECT` list. +The `SELECT` clause only supports [scalar subqueries](#scalar-subqueries) that +return a single value per execution of the inner query. +The returned value can be unique per row. + +```sql +SELECT [expression1[, expression2, ..., expressionN],] (<subquery>) +``` + +**Note**: `SELECT` clause subqueries can be used as an alternative to `JOIN` +operations. -The `NOT EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces zero matches -for that row. Only correlated subqueries are supported. +### Example ```sql -select * from x y where not exists (select * from x where x.column_1 = y.column_1); -0 rows in set. +SELECT + column_1, + ( + SELECT + first_value(string) + FROM + y + WHERE + number = x.column_1 + ) AS "numeric string" +FROM + x; + ++----------+----------------+ +| column_1 | numeric string | ++----------+----------------+ +| 1 | one | +| 2 | two | ++----------+----------------+ ``` -## IN +## FROM clause subqueries -The `IN` syntax can be used to find all rows in a relation where a given expression's value can be found in the -results of a correlated subquery. +`FROM` clause subqueries return a set of results that is then queried and +operated on by the outer query. ```sql -select * from x where column_1 in (select column_1 from x); +SELECT expression1[, expression2, ..., expressionN] FROM (<subquery>) +``` + +### Example + +The following query returns the average of maximum values per room. +The inner query returns the maximum value for each field from each room. +The outer query uses the results of the inner query and returns the average +maximum value for each field. + +```sql +SELECT + column_2 +FROM + ( + SELECT + * + FROM + x + WHERE + column_1 > 1 + ); + ++----------+ +| column_2 | ++----------+ +| 4 | ++----------+ +``` + +## WHERE clause subqueries + +`WHERE` clause subqueries compare an expression to the result of the subquery +and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `WHERE` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + expression1[, expression2, ..., expressionN] +FROM + <measurement> +WHERE + expression operator (<subquery>) +``` + +**Note:** `WHERE` clause subqueries can be used as an alternative to `JOIN` +operations. + +### Examples + +#### `WHERE` clause with scalar subquery + +The following query returns all rows with `column_2` values above the average +of all `number` values in `y`. + +```sql +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(number) + FROM + y + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` -## NOT IN +#### `WHERE` clause with non-scalar subquery -The `NOT IN` syntax can be used to find all rows in a relation where a given expression's value can not be found in the -results of a correlated subquery. +Non-scalar subqueries must use the `[NOT] IN` operator and can only return a Review Comment: I think they can use `[NOT] EXISTS` as well I believe ########## docs/source/user-guide/sql/subqueries.md: ########## @@ -19,80 +19,392 @@ # Subqueries -DataFusion supports `EXISTS`, `NOT EXISTS`, `IN`, `NOT IN` and Scalar Subqueries. +Subqueries (also known as inner queries or nested queries) are queries within +a query. +Subqueries can be used in `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses. -The examples below are based on the following table. +The examples below are based on the following tables. ```sql -select * from x; +SELECT * FROM x; + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +```sql +SELECT * FROM y; + ++--------+--------+ +| number | string | ++--------+--------+ +| 1 | one | ++--------+--------+ +| 2 | two | ++--------+--------+ +| 3 | three | ++--------+--------+ +| 4 | four | ++--------+--------+ ``` -## EXISTS +## Subquery operators -The `EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces one or more matches -for that row. Only correlated subqueries are supported. +- [[ NOT ] EXISTS](#-not--exists) +- [[ NOT ] IN](#-not--in) + +### [ NOT ] EXISTS + +The `EXISTS` operator returns all rows where a +_[correlated subquery](#correlated-subqueries)_ produces one or more matches for +that row. `NOT EXISTS` returns all rows where a _correlated subquery_ produces +zero matches for that row. Only _correlated subqueries_ are supported. ```sql -select * from x y where exists (select * from x where x.column_1 = y.column_1); +[NOT] EXISTS (subquery) +``` + +### [ NOT ] IN + +The `IN` operator returns all rows where a given expression’s value can be found +in the results of a _[correlated subquery](#correlated-subqueries)_. +`NOT IN` returns all rows where a given expression’s value cannot be found in +the results of a subquery or list of values. + +```sql +expression [NOT] IN (subquery|list-literal) +``` + +#### Examples + +```sql +SELECT * FROM x WHERE column_1 IN (1,3); + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ -1 row in set. ``` -## NOT EXISTS +```sql +SELECT * FROM x WHERE column_1 NOT IN (1,3); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +## SELECT clause subqueries + +`SELECT` clause subqueries use values returned from the inner query as part +of the outer query's `SELECT` list. +The `SELECT` clause only supports [scalar subqueries](#scalar-subqueries) that +return a single value per execution of the inner query. +The returned value can be unique per row. + +```sql +SELECT [expression1[, expression2, ..., expressionN],] (<subquery>) +``` + +**Note**: `SELECT` clause subqueries can be used as an alternative to `JOIN` +operations. -The `NOT EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces zero matches -for that row. Only correlated subqueries are supported. +### Example ```sql -select * from x y where not exists (select * from x where x.column_1 = y.column_1); -0 rows in set. +SELECT + column_1, + ( + SELECT + first_value(string) + FROM + y + WHERE + number = x.column_1 + ) AS "numeric string" +FROM + x; + ++----------+----------------+ +| column_1 | numeric string | ++----------+----------------+ +| 1 | one | +| 2 | two | ++----------+----------------+ ``` -## IN +## FROM clause subqueries -The `IN` syntax can be used to find all rows in a relation where a given expression's value can be found in the -results of a correlated subquery. +`FROM` clause subqueries return a set of results that is then queried and +operated on by the outer query. ```sql -select * from x where column_1 in (select column_1 from x); +SELECT expression1[, expression2, ..., expressionN] FROM (<subquery>) +``` + +### Example + +The following query returns the average of maximum values per room. +The inner query returns the maximum value for each field from each room. +The outer query uses the results of the inner query and returns the average +maximum value for each field. + +```sql +SELECT + column_2 +FROM + ( + SELECT + * + FROM + x + WHERE + column_1 > 1 + ); + ++----------+ +| column_2 | ++----------+ +| 4 | ++----------+ +``` + +## WHERE clause subqueries + +`WHERE` clause subqueries compare an expression to the result of the subquery +and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `WHERE` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + expression1[, expression2, ..., expressionN] +FROM + <measurement> +WHERE + expression operator (<subquery>) +``` + +**Note:** `WHERE` clause subqueries can be used as an alternative to `JOIN` +operations. + +### Examples + +#### `WHERE` clause with scalar subquery + +The following query returns all rows with `column_2` values above the average +of all `number` values in `y`. + +```sql +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(number) + FROM + y + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` -## NOT IN +#### `WHERE` clause with non-scalar subquery -The `NOT IN` syntax can be used to find all rows in a relation where a given expression's value can not be found in the -results of a correlated subquery. +Non-scalar subqueries must use the `[NOT] IN` operator and can only return a +single column. The values in the returned column are evaluated as a list. + +The following query returns all rows with `column_2` values in table `x` that +are in the list of numbers with string lengths greater than three from table +`y`. ```sql -select * from x where column_1 not in (select column_1 from x); -0 rows in set. +SELECT + * +FROM + x +WHERE + column_2 IN ( + SELECT + number + FROM + y + WHERE + length(string) > 3 + ); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ ``` -## Scalar Subquery +### `WHERE` clause with correlated subquery -A scalar subquery can be used to produce a single value that can be used in many different contexts in a query. Here -is an example of a filter using a scalar subquery. Only correlated subqueries are supported. +The following query returns rows with `column_2` values from table `x` greater +than the average `string` value length from table `y`. +The subquery in the `WHERE` clause uses the `column_1` value from the outer +query to return the average `string` value length for that specific value. ```sql -select * from x y where column_1 < (select sum(column_2) from x where x.column_1 = y.column_1); +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(length(string)) + FROM + y + WHERE + number = x.column_1 + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` + +## HAVING clause subqueries + +`HAVING` clause subqueries compare an expression that uses aggregate values +returned by aggregate functions in the `SELECT` clause to the result of the +subquery and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `HAVING` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + aggregate_expression1[, aggregate_expression2, ..., aggregate_expressionN] +FROM + <measurement> +WHERE + <conditional_expression> +GROUP BY + column_expression1[, column_expression2, ..., column_expressionN] +HAVING + expression operator (<subquery>) +``` + +### Examples + +The following query calculates the averages of even and odd numbers in table `y` +and returns the averages that are equal to the maximum value of `column_1` +in table `x`. + +#### `HAVING` clause with a scalar subquery + +```sql +SELECT + AVG(number) AS avg, + (number % 2 = 0) AS even +FROM + y +GROUP BY + even +HAVING + avg = ( + SELECT + MAX(column_1) + FROM + x + ); + ++-------+--------+ +| avg | even | ++-------+--------+ +| 2 | false | ++-------+--------+ +``` + +#### `HAVING` clause with a non-scalar subquery + +Non-scalar subqueries must use the `[NOT] IN` operator and can only return a +single column. The values in the returned column are evaluated as a list. + +The following query calculates the averages of even and odd numbers in table `y` +and returns the averages that are in `column_1` of table `x`. + +```sql +SELECT + AVG(number) AS avg, + (number % 2 = 0) AS even +FROM + y +GROUP BY + even +HAVING + avg IN ( + SELECT + column_1 + FROM + x + ); + ++-------+--------+ +| avg | even | ++-------+--------+ +| 2 | false | ++-------+--------+ +``` + +## Subquery categories + +Subqueries can be categorized as one or more of the following based on the +behavior of the subquery: + +- [correlated](#correlated-subqueries) or + [non-correlated](#non-correlated-subqueries) +- [scalar](#scalar-subqueries) or [non-scalar](#non-scalar-subqueries) + +### Correlated subqueries + +In a **correlated** subquery, the inner query depends on the outer query, using +values from the outer query for its results. +Correlated subqueries can return a maximum of one row, so Review Comment: I don't think correlated subqueries always return a single row (though it may be required in DataFusion's current implementation). Maybe we could use some of the text from https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-correlated-subquery/ ########## docs/source/user-guide/sql/subqueries.md: ########## @@ -19,80 +19,392 @@ # Subqueries -DataFusion supports `EXISTS`, `NOT EXISTS`, `IN`, `NOT IN` and Scalar Subqueries. +Subqueries (also known as inner queries or nested queries) are queries within +a query. +Subqueries can be used in `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses. -The examples below are based on the following table. +The examples below are based on the following tables. ```sql -select * from x; +SELECT * FROM x; + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +```sql +SELECT * FROM y; + ++--------+--------+ +| number | string | ++--------+--------+ +| 1 | one | ++--------+--------+ +| 2 | two | ++--------+--------+ +| 3 | three | ++--------+--------+ +| 4 | four | ++--------+--------+ ``` -## EXISTS +## Subquery operators -The `EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces one or more matches -for that row. Only correlated subqueries are supported. +- [[ NOT ] EXISTS](#-not--exists) +- [[ NOT ] IN](#-not--in) + +### [ NOT ] EXISTS + +The `EXISTS` operator returns all rows where a +_[correlated subquery](#correlated-subqueries)_ produces one or more matches for +that row. `NOT EXISTS` returns all rows where a _correlated subquery_ produces +zero matches for that row. Only _correlated subqueries_ are supported. ```sql -select * from x y where exists (select * from x where x.column_1 = y.column_1); +[NOT] EXISTS (subquery) +``` + +### [ NOT ] IN + +The `IN` operator returns all rows where a given expression’s value can be found Review Comment: There is a bunch of subtlety regarding what happens with NULLs (like if the expression has a NULL and the subquery also returns a NULL) with respect to `NOT IN` / `NOT EXISTS` which always confused me. It is probably better not to worry about documenting them ########## docs/source/user-guide/sql/subqueries.md: ########## @@ -19,80 +19,392 @@ # Subqueries -DataFusion supports `EXISTS`, `NOT EXISTS`, `IN`, `NOT IN` and Scalar Subqueries. +Subqueries (also known as inner queries or nested queries) are queries within +a query. +Subqueries can be used in `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses. -The examples below are based on the following table. +The examples below are based on the following tables. ```sql -select * from x; +SELECT * FROM x; + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +```sql +SELECT * FROM y; + ++--------+--------+ +| number | string | ++--------+--------+ +| 1 | one | ++--------+--------+ +| 2 | two | ++--------+--------+ +| 3 | three | ++--------+--------+ +| 4 | four | ++--------+--------+ ``` -## EXISTS +## Subquery operators -The `EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces one or more matches -for that row. Only correlated subqueries are supported. +- [[ NOT ] EXISTS](#-not--exists) +- [[ NOT ] IN](#-not--in) + +### [ NOT ] EXISTS + +The `EXISTS` operator returns all rows where a +_[correlated subquery](#correlated-subqueries)_ produces one or more matches for +that row. `NOT EXISTS` returns all rows where a _correlated subquery_ produces +zero matches for that row. Only _correlated subqueries_ are supported. ```sql -select * from x y where exists (select * from x where x.column_1 = y.column_1); +[NOT] EXISTS (subquery) +``` + +### [ NOT ] IN + +The `IN` operator returns all rows where a given expression’s value can be found +in the results of a _[correlated subquery](#correlated-subqueries)_. +`NOT IN` returns all rows where a given expression’s value cannot be found in +the results of a subquery or list of values. + +```sql +expression [NOT] IN (subquery|list-literal) +``` + +#### Examples + +```sql +SELECT * FROM x WHERE column_1 IN (1,3); + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ -1 row in set. ``` -## NOT EXISTS +```sql +SELECT * FROM x WHERE column_1 NOT IN (1,3); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +## SELECT clause subqueries + +`SELECT` clause subqueries use values returned from the inner query as part +of the outer query's `SELECT` list. +The `SELECT` clause only supports [scalar subqueries](#scalar-subqueries) that +return a single value per execution of the inner query. +The returned value can be unique per row. + +```sql +SELECT [expression1[, expression2, ..., expressionN],] (<subquery>) +``` + +**Note**: `SELECT` clause subqueries can be used as an alternative to `JOIN` +operations. -The `NOT EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces zero matches -for that row. Only correlated subqueries are supported. +### Example ```sql -select * from x y where not exists (select * from x where x.column_1 = y.column_1); -0 rows in set. +SELECT + column_1, + ( + SELECT + first_value(string) + FROM + y + WHERE + number = x.column_1 + ) AS "numeric string" +FROM + x; + ++----------+----------------+ +| column_1 | numeric string | ++----------+----------------+ +| 1 | one | +| 2 | two | ++----------+----------------+ ``` -## IN +## FROM clause subqueries -The `IN` syntax can be used to find all rows in a relation where a given expression's value can be found in the -results of a correlated subquery. +`FROM` clause subqueries return a set of results that is then queried and +operated on by the outer query. ```sql -select * from x where column_1 in (select column_1 from x); +SELECT expression1[, expression2, ..., expressionN] FROM (<subquery>) +``` + +### Example + +The following query returns the average of maximum values per room. +The inner query returns the maximum value for each field from each room. +The outer query uses the results of the inner query and returns the average +maximum value for each field. + +```sql +SELECT + column_2 +FROM + ( + SELECT + * + FROM + x + WHERE + column_1 > 1 + ); + ++----------+ +| column_2 | ++----------+ +| 4 | ++----------+ +``` + +## WHERE clause subqueries + +`WHERE` clause subqueries compare an expression to the result of the subquery +and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `WHERE` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + expression1[, expression2, ..., expressionN] +FROM + <measurement> +WHERE + expression operator (<subquery>) +``` + +**Note:** `WHERE` clause subqueries can be used as an alternative to `JOIN` +operations. + +### Examples + +#### `WHERE` clause with scalar subquery + +The following query returns all rows with `column_2` values above the average +of all `number` values in `y`. + +```sql +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(number) + FROM + y + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` -## NOT IN +#### `WHERE` clause with non-scalar subquery -The `NOT IN` syntax can be used to find all rows in a relation where a given expression's value can not be found in the -results of a correlated subquery. +Non-scalar subqueries must use the `[NOT] IN` operator and can only return a Review Comment: ```suggestion Non-scalar subqueries must use the `[NOT] IN` or `[NOT] EXISTS` operators and can only return a ``` ########## docs/source/user-guide/sql/subqueries.md: ########## @@ -19,80 +19,392 @@ # Subqueries -DataFusion supports `EXISTS`, `NOT EXISTS`, `IN`, `NOT IN` and Scalar Subqueries. +Subqueries (also known as inner queries or nested queries) are queries within +a query. +Subqueries can be used in `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses. -The examples below are based on the following table. +The examples below are based on the following tables. ```sql -select * from x; +SELECT * FROM x; + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +```sql +SELECT * FROM y; + ++--------+--------+ +| number | string | ++--------+--------+ +| 1 | one | ++--------+--------+ +| 2 | two | ++--------+--------+ +| 3 | three | ++--------+--------+ +| 4 | four | ++--------+--------+ ``` -## EXISTS +## Subquery operators -The `EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces one or more matches -for that row. Only correlated subqueries are supported. +- [[ NOT ] EXISTS](#-not--exists) +- [[ NOT ] IN](#-not--in) + +### [ NOT ] EXISTS + +The `EXISTS` operator returns all rows where a +_[correlated subquery](#correlated-subqueries)_ produces one or more matches for +that row. `NOT EXISTS` returns all rows where a _correlated subquery_ produces +zero matches for that row. Only _correlated subqueries_ are supported. ```sql -select * from x y where exists (select * from x where x.column_1 = y.column_1); +[NOT] EXISTS (subquery) +``` + +### [ NOT ] IN + +The `IN` operator returns all rows where a given expression’s value can be found +in the results of a _[correlated subquery](#correlated-subqueries)_. +`NOT IN` returns all rows where a given expression’s value cannot be found in +the results of a subquery or list of values. + +```sql +expression [NOT] IN (subquery|list-literal) +``` + +#### Examples + +```sql +SELECT * FROM x WHERE column_1 IN (1,3); + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ -1 row in set. ``` -## NOT EXISTS +```sql +SELECT * FROM x WHERE column_1 NOT IN (1,3); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +## SELECT clause subqueries + +`SELECT` clause subqueries use values returned from the inner query as part +of the outer query's `SELECT` list. +The `SELECT` clause only supports [scalar subqueries](#scalar-subqueries) that +return a single value per execution of the inner query. +The returned value can be unique per row. + +```sql +SELECT [expression1[, expression2, ..., expressionN],] (<subquery>) +``` + +**Note**: `SELECT` clause subqueries can be used as an alternative to `JOIN` +operations. -The `NOT EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces zero matches -for that row. Only correlated subqueries are supported. +### Example ```sql -select * from x y where not exists (select * from x where x.column_1 = y.column_1); -0 rows in set. +SELECT + column_1, + ( + SELECT + first_value(string) + FROM + y + WHERE + number = x.column_1 + ) AS "numeric string" +FROM + x; + ++----------+----------------+ +| column_1 | numeric string | ++----------+----------------+ +| 1 | one | +| 2 | two | ++----------+----------------+ ``` -## IN +## FROM clause subqueries -The `IN` syntax can be used to find all rows in a relation where a given expression's value can be found in the -results of a correlated subquery. +`FROM` clause subqueries return a set of results that is then queried and +operated on by the outer query. ```sql -select * from x where column_1 in (select column_1 from x); +SELECT expression1[, expression2, ..., expressionN] FROM (<subquery>) +``` + +### Example + +The following query returns the average of maximum values per room. +The inner query returns the maximum value for each field from each room. +The outer query uses the results of the inner query and returns the average +maximum value for each field. + +```sql +SELECT + column_2 +FROM + ( + SELECT + * + FROM + x + WHERE + column_1 > 1 + ); + ++----------+ +| column_2 | ++----------+ +| 4 | ++----------+ +``` + +## WHERE clause subqueries + +`WHERE` clause subqueries compare an expression to the result of the subquery +and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `WHERE` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + expression1[, expression2, ..., expressionN] +FROM + <measurement> +WHERE + expression operator (<subquery>) +``` + +**Note:** `WHERE` clause subqueries can be used as an alternative to `JOIN` +operations. + +### Examples + +#### `WHERE` clause with scalar subquery + +The following query returns all rows with `column_2` values above the average +of all `number` values in `y`. + +```sql +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(number) + FROM + y + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` -## NOT IN +#### `WHERE` clause with non-scalar subquery -The `NOT IN` syntax can be used to find all rows in a relation where a given expression's value can not be found in the -results of a correlated subquery. +Non-scalar subqueries must use the `[NOT] IN` operator and can only return a +single column. The values in the returned column are evaluated as a list. + +The following query returns all rows with `column_2` values in table `x` that +are in the list of numbers with string lengths greater than three from table +`y`. ```sql -select * from x where column_1 not in (select column_1 from x); -0 rows in set. +SELECT + * +FROM + x +WHERE + column_2 IN ( + SELECT + number + FROM + y + WHERE + length(string) > 3 + ); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ ``` -## Scalar Subquery +### `WHERE` clause with correlated subquery -A scalar subquery can be used to produce a single value that can be used in many different contexts in a query. Here -is an example of a filter using a scalar subquery. Only correlated subqueries are supported. +The following query returns rows with `column_2` values from table `x` greater +than the average `string` value length from table `y`. +The subquery in the `WHERE` clause uses the `column_1` value from the outer +query to return the average `string` value length for that specific value. ```sql -select * from x y where column_1 < (select sum(column_2) from x where x.column_1 = y.column_1); +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(length(string)) + FROM + y + WHERE + number = x.column_1 + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` + +## HAVING clause subqueries + +`HAVING` clause subqueries compare an expression that uses aggregate values +returned by aggregate functions in the `SELECT` clause to the result of the +subquery and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `HAVING` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + aggregate_expression1[, aggregate_expression2, ..., aggregate_expressionN] +FROM + <measurement> +WHERE + <conditional_expression> +GROUP BY + column_expression1[, column_expression2, ..., column_expressionN] +HAVING + expression operator (<subquery>) +``` + +### Examples + +The following query calculates the averages of even and odd numbers in table `y` +and returns the averages that are equal to the maximum value of `column_1` +in table `x`. + +#### `HAVING` clause with a scalar subquery + +```sql +SELECT + AVG(number) AS avg, + (number % 2 = 0) AS even +FROM + y +GROUP BY + even +HAVING + avg = ( + SELECT + MAX(column_1) + FROM + x + ); + ++-------+--------+ +| avg | even | ++-------+--------+ +| 2 | false | ++-------+--------+ +``` + +#### `HAVING` clause with a non-scalar subquery + +Non-scalar subqueries must use the `[NOT] IN` operator and can only return a Review Comment: ```suggestion Non-scalar subqueries must use the `[NOT] IN` or `[NOT] EXISTS` operators and can only return a ``` ########## docs/source/user-guide/sql/subqueries.md: ########## @@ -19,80 +19,392 @@ # Subqueries -DataFusion supports `EXISTS`, `NOT EXISTS`, `IN`, `NOT IN` and Scalar Subqueries. +Subqueries (also known as inner queries or nested queries) are queries within +a query. +Subqueries can be used in `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses. -The examples below are based on the following table. +The examples below are based on the following tables. ```sql -select * from x; +SELECT * FROM x; + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +```sql +SELECT * FROM y; + ++--------+--------+ +| number | string | ++--------+--------+ +| 1 | one | ++--------+--------+ +| 2 | two | ++--------+--------+ +| 3 | three | ++--------+--------+ +| 4 | four | ++--------+--------+ ``` -## EXISTS +## Subquery operators -The `EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces one or more matches -for that row. Only correlated subqueries are supported. +- [[ NOT ] EXISTS](#-not--exists) +- [[ NOT ] IN](#-not--in) + +### [ NOT ] EXISTS + +The `EXISTS` operator returns all rows where a +_[correlated subquery](#correlated-subqueries)_ produces one or more matches for +that row. `NOT EXISTS` returns all rows where a _correlated subquery_ produces +zero matches for that row. Only _correlated subqueries_ are supported. ```sql -select * from x y where exists (select * from x where x.column_1 = y.column_1); +[NOT] EXISTS (subquery) +``` + +### [ NOT ] IN + +The `IN` operator returns all rows where a given expression’s value can be found +in the results of a _[correlated subquery](#correlated-subqueries)_. +`NOT IN` returns all rows where a given expression’s value cannot be found in +the results of a subquery or list of values. + +```sql +expression [NOT] IN (subquery|list-literal) +``` + +#### Examples + +```sql +SELECT * FROM x WHERE column_1 IN (1,3); + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ -1 row in set. ``` -## NOT EXISTS +```sql +SELECT * FROM x WHERE column_1 NOT IN (1,3); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +## SELECT clause subqueries + +`SELECT` clause subqueries use values returned from the inner query as part +of the outer query's `SELECT` list. +The `SELECT` clause only supports [scalar subqueries](#scalar-subqueries) that +return a single value per execution of the inner query. +The returned value can be unique per row. + +```sql +SELECT [expression1[, expression2, ..., expressionN],] (<subquery>) +``` + +**Note**: `SELECT` clause subqueries can be used as an alternative to `JOIN` +operations. -The `NOT EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces zero matches -for that row. Only correlated subqueries are supported. +### Example ```sql -select * from x y where not exists (select * from x where x.column_1 = y.column_1); -0 rows in set. +SELECT + column_1, + ( + SELECT + first_value(string) + FROM + y + WHERE + number = x.column_1 + ) AS "numeric string" +FROM + x; + ++----------+----------------+ +| column_1 | numeric string | ++----------+----------------+ +| 1 | one | +| 2 | two | ++----------+----------------+ ``` -## IN +## FROM clause subqueries -The `IN` syntax can be used to find all rows in a relation where a given expression's value can be found in the -results of a correlated subquery. +`FROM` clause subqueries return a set of results that is then queried and +operated on by the outer query. ```sql -select * from x where column_1 in (select column_1 from x); +SELECT expression1[, expression2, ..., expressionN] FROM (<subquery>) +``` + +### Example + +The following query returns the average of maximum values per room. +The inner query returns the maximum value for each field from each room. +The outer query uses the results of the inner query and returns the average +maximum value for each field. + +```sql +SELECT + column_2 +FROM + ( + SELECT + * + FROM + x + WHERE + column_1 > 1 + ); + ++----------+ +| column_2 | ++----------+ +| 4 | ++----------+ +``` + +## WHERE clause subqueries + +`WHERE` clause subqueries compare an expression to the result of the subquery +and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. Review Comment: Technically it is rows that do not evaluate to `_true` (aka rows that evaluate to `NULL` are also filtered) ```suggestion Rows that evaluate to _false_ or `_NULL_` are filtered from results. ``` ########## docs/source/user-guide/sql/subqueries.md: ########## @@ -19,80 +19,392 @@ # Subqueries -DataFusion supports `EXISTS`, `NOT EXISTS`, `IN`, `NOT IN` and Scalar Subqueries. +Subqueries (also known as inner queries or nested queries) are queries within +a query. +Subqueries can be used in `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses. -The examples below are based on the following table. +The examples below are based on the following tables. ```sql -select * from x; +SELECT * FROM x; + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +```sql +SELECT * FROM y; + ++--------+--------+ +| number | string | ++--------+--------+ +| 1 | one | ++--------+--------+ +| 2 | two | ++--------+--------+ +| 3 | three | ++--------+--------+ +| 4 | four | ++--------+--------+ ``` -## EXISTS +## Subquery operators -The `EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces one or more matches -for that row. Only correlated subqueries are supported. +- [[ NOT ] EXISTS](#-not--exists) +- [[ NOT ] IN](#-not--in) + +### [ NOT ] EXISTS + +The `EXISTS` operator returns all rows where a +_[correlated subquery](#correlated-subqueries)_ produces one or more matches for +that row. `NOT EXISTS` returns all rows where a _correlated subquery_ produces +zero matches for that row. Only _correlated subqueries_ are supported. ```sql -select * from x y where exists (select * from x where x.column_1 = y.column_1); +[NOT] EXISTS (subquery) +``` + +### [ NOT ] IN + +The `IN` operator returns all rows where a given expression’s value can be found +in the results of a _[correlated subquery](#correlated-subqueries)_. +`NOT IN` returns all rows where a given expression’s value cannot be found in +the results of a subquery or list of values. + +```sql +expression [NOT] IN (subquery|list-literal) +``` + +#### Examples + +```sql +SELECT * FROM x WHERE column_1 IN (1,3); + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ -1 row in set. ``` -## NOT EXISTS +```sql +SELECT * FROM x WHERE column_1 NOT IN (1,3); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +## SELECT clause subqueries + +`SELECT` clause subqueries use values returned from the inner query as part +of the outer query's `SELECT` list. +The `SELECT` clause only supports [scalar subqueries](#scalar-subqueries) that +return a single value per execution of the inner query. +The returned value can be unique per row. + +```sql +SELECT [expression1[, expression2, ..., expressionN],] (<subquery>) +``` + +**Note**: `SELECT` clause subqueries can be used as an alternative to `JOIN` +operations. -The `NOT EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces zero matches -for that row. Only correlated subqueries are supported. +### Example ```sql -select * from x y where not exists (select * from x where x.column_1 = y.column_1); -0 rows in set. +SELECT + column_1, + ( + SELECT + first_value(string) + FROM + y + WHERE + number = x.column_1 + ) AS "numeric string" +FROM + x; + ++----------+----------------+ +| column_1 | numeric string | ++----------+----------------+ +| 1 | one | +| 2 | two | ++----------+----------------+ ``` -## IN +## FROM clause subqueries -The `IN` syntax can be used to find all rows in a relation where a given expression's value can be found in the -results of a correlated subquery. +`FROM` clause subqueries return a set of results that is then queried and +operated on by the outer query. ```sql -select * from x where column_1 in (select column_1 from x); +SELECT expression1[, expression2, ..., expressionN] FROM (<subquery>) +``` + +### Example + +The following query returns the average of maximum values per room. +The inner query returns the maximum value for each field from each room. +The outer query uses the results of the inner query and returns the average +maximum value for each field. + +```sql +SELECT + column_2 +FROM + ( + SELECT + * + FROM + x + WHERE + column_1 > 1 + ); + ++----------+ +| column_2 | ++----------+ +| 4 | ++----------+ +``` + +## WHERE clause subqueries + +`WHERE` clause subqueries compare an expression to the result of the subquery +and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `WHERE` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + expression1[, expression2, ..., expressionN] +FROM + <measurement> +WHERE + expression operator (<subquery>) +``` + +**Note:** `WHERE` clause subqueries can be used as an alternative to `JOIN` +operations. + +### Examples + +#### `WHERE` clause with scalar subquery + +The following query returns all rows with `column_2` values above the average +of all `number` values in `y`. + +```sql +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(number) + FROM + y + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` -## NOT IN +#### `WHERE` clause with non-scalar subquery -The `NOT IN` syntax can be used to find all rows in a relation where a given expression's value can not be found in the -results of a correlated subquery. +Non-scalar subqueries must use the `[NOT] IN` operator and can only return a +single column. The values in the returned column are evaluated as a list. + +The following query returns all rows with `column_2` values in table `x` that +are in the list of numbers with string lengths greater than three from table +`y`. ```sql -select * from x where column_1 not in (select column_1 from x); -0 rows in set. +SELECT + * +FROM + x +WHERE + column_2 IN ( + SELECT + number + FROM + y + WHERE + length(string) > 3 + ); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ ``` -## Scalar Subquery +### `WHERE` clause with correlated subquery -A scalar subquery can be used to produce a single value that can be used in many different contexts in a query. Here -is an example of a filter using a scalar subquery. Only correlated subqueries are supported. +The following query returns rows with `column_2` values from table `x` greater +than the average `string` value length from table `y`. +The subquery in the `WHERE` clause uses the `column_1` value from the outer +query to return the average `string` value length for that specific value. ```sql -select * from x y where column_1 < (select sum(column_2) from x where x.column_1 = y.column_1); +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(length(string)) + FROM + y + WHERE + number = x.column_1 + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` + +## HAVING clause subqueries + +`HAVING` clause subqueries compare an expression that uses aggregate values +returned by aggregate functions in the `SELECT` clause to the result of the +subquery and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `HAVING` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + aggregate_expression1[, aggregate_expression2, ..., aggregate_expressionN] +FROM + <measurement> +WHERE + <conditional_expression> +GROUP BY + column_expression1[, column_expression2, ..., column_expressionN] +HAVING + expression operator (<subquery>) +``` + +### Examples + +The following query calculates the averages of even and odd numbers in table `y` +and returns the averages that are equal to the maximum value of `column_1` +in table `x`. + +#### `HAVING` clause with a scalar subquery + +```sql +SELECT + AVG(number) AS avg, + (number % 2 = 0) AS even +FROM + y +GROUP BY + even +HAVING + avg = ( + SELECT + MAX(column_1) + FROM + x + ); + ++-------+--------+ +| avg | even | ++-------+--------+ +| 2 | false | ++-------+--------+ +``` + +#### `HAVING` clause with a non-scalar subquery + +Non-scalar subqueries must use the `[NOT] IN` operator and can only return a +single column. The values in the returned column are evaluated as a list. + +The following query calculates the averages of even and odd numbers in table `y` +and returns the averages that are in `column_1` of table `x`. + +```sql +SELECT + AVG(number) AS avg, + (number % 2 = 0) AS even +FROM + y +GROUP BY + even +HAVING + avg IN ( + SELECT + column_1 + FROM + x + ); + ++-------+--------+ +| avg | even | ++-------+--------+ +| 2 | false | ++-------+--------+ +``` + +## Subquery categories + +Subqueries can be categorized as one or more of the following based on the +behavior of the subquery: + +- [correlated](#correlated-subqueries) or + [non-correlated](#non-correlated-subqueries) +- [scalar](#scalar-subqueries) or [non-scalar](#non-scalar-subqueries) + +### Correlated subqueries + +In a **correlated** subquery, the inner query depends on the outer query, using +values from the outer query for its results. +Correlated subqueries can return a maximum of one row, so +[aggregations](/user-guide/sql/aggregate_functions.html) may be required in the +inner query. + +**Note:** Because correlated subqueries depend on the outer query and typically +must execute for each row returned by the outer query, correlated subqueries are +**less performant** than non-correlated subqueries. Review Comment: Internally, DataFusion actually rewrites correlated subqueries to joins (it doesn't actually re-execte the subquery for each row in the outer query). Maybe this could say something like ```suggestion **Note:** DataFusion internally rewrites correlated subqueries into JOINs to improve performance. In general correlated subqueries are **less performant** than non-correlated subqueries. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
