[jira] [Comment Edited] (CALCITE-6939) Add support for Lateral Column Alias

Mihai Budiu (Jira) Mon, 07 Apr 2025 14:58:43 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-6939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941722#comment-17941722
 ]


Mihai Budiu edited comment on CALCITE-6939 at 4/7/25 9:41 PM:
--------------------------------------------------------------

I don't know why finding documentation these days on the internet is so 
difficult...
Here are some links I could find:
 * A short blog article about this topic: 
[https://medium.com/inthepipeline/analytical-sql-tips-series-lateral-column-alias-9d28eecdc8df]
 which claims support in DuckDB, Snowflake, Databricks, and Redshift.
 * Databricks documentation: 
[https://www.databricks.com/blog/introducing-support-lateral-column-alias]
 * DuckDB mentions this feature 
[https://duckdb.org/docs/stable/sql/dialect/friendly_sql.html], but does not 
seem to document how name lookup is actually performed in their pdf 
documentation [https://duckdb.org/duckdb-docs.pdf]
 * Snowflake: a blog post about it: 
[https://sqlkover.com/cool-stuff-in-snowflake-part-4-aliasing-all-the-things/,] 
but no real documentation about the semantics
 * Redshift: 
[https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/],
 [https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html]

RedShift and Databricks have a similar lookup rule; here is a quote from the 
RedShift documentation:
{noformat}
The alias is recognized right after it is defined in the target list. 
You can use an alias in other expressions defined after it in the same target 
list. 
The following example illustrates this.

select clicks / impressions as probability, round(100 * probability, 1) as 
percentage from raw_data;

The benefit of the lateral alias reference is you don't need to repeat the 
aliased
 expression when building more complex expressions in the same target list. 
When Amazon Redshift parses this type of reference, it just inlines the 
previously
 defined aliases. If there is a column with the same name defined in the FROM 
clause
 as the previously aliased expression, the column in the FROM clause takes 
priority.
 For example, in the above query if there is a column named 'probability' in 
table 
raw_data, the 'probability' in the second expression in the target list refers 
to 
that column instead of the alias name 'probability'. 
{noformat}
Clearly, this is a breaking change, because it makes some otherwise illegal 
queries become legal. It is also technically just syntactic sugar, since it is 
not absolutely necessary. But if you have a user who needs to port a 10KLOC SQL 
program from Spark SQL to a Calcite-based tool, they will appreciate this 
feature, even if the Spark SQL design is not the best one. That's why it this 
feature should be behind a conformance flag.

I am thinking to make the conformance flag an enum with 3 values: Unsupported 
(standard behavior), LeftToRight (my previous proposal), AnyOrder (similar to 
measures). The two databases that document this choose to lookup in the same 
SELECT last, and that sounds to me like the right choice.

We could also add new syntax to clarify scoping, either on definition, like 
MEASURE, or on lookup, but the downside is that these don't look like anything 
out there, so the porting effort may be diminished, but not eliminated, using 
some keyword, e.g., NEW:
 * SELECT NEW x+1 to indicate that the lookup of x should be done in the local 
select statement,
 * SELECT expr AS NEW x to indicate that x should be visible in the same 
statement


was (Author: JIRAUSER295926):
I don't know why finding documentation these days on the internet is so 
difficult...
Here are some links I could find:
 * A short blog article about this topic: 
[https://medium.com/inthepipeline/analytical-sql-tips-series-lateral-column-alias-9d28eecdc8df]
 which claims support in DuckDB, Snowflake, Databricks, and Redshift.
 * Databricks documentation: 
[https://www.databricks.com/blog/introducing-support-lateral-column-alias]
 * DuckDB mentions this feature 
[https://duckdb.org/docs/stable/sql/dialect/friendly_sql.html], but does not 
seem to document how name lookup is actually performed in their pdf 
documentation [https://duckdb.org/duckdb-docs.pdf]
 * Snowflake: a blog post about it: 
[https://sqlkover.com/cool-stuff-in-snowflake-part-4-aliasing-all-the-things/,] 
but no real documentation about the semantics
 * Redshift: 
[https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/],
 [https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html]

RedShift and Databricks have a similar lookup rule; here is a quote from the 
RedShift documentation:
{noformat}
The alias is recognized right after it is defined in the target list. You can 
use an alias in other expressions defined after it in the same target list. The 
following example illustrates this.

select clicks / impressions as probability, round(100 * probability, 1) as 
percentage from raw_data;

The benefit of the lateral alias reference is you don't need to repeat the 
aliased expression when building more complex expressions in the same target 
list. When Amazon Redshift parses this type of reference, it just inlines the 
previously defined aliases. If there is a column with the same name defined in 
the FROM clause as the previously aliased expression, the column in the FROM 
clause takes priority. For example, in the above query if there is a column 
named 'probability' in table raw_data, the 'probability' in the second 
expression in the target list refers to that column instead of the alias name 
'probability'. 
{noformat}
Clearly, this is a breaking change, because it makes some otherwise illegal 
queries become legal. It is also technically just syntactic sugar, since it is 
not absolutely necessary. But if you have a user who needs to port a 10KLOC SQL 
program from Spark SQL to a Calcite-based tool, they will appreciate this 
feature, even if the Spark SQL design is not the best one. That's why it this 
feature should be behind a conformance flag.

I am thinking to make the conformance flag an enum with 3 values: Unsupported 
(standard behavior), LeftToRight (my previous proposal), AnyOrder (similar to 
measures). The two databases that document this choose to lookup in the same 
SELECT last, and that sounds to me like the right choice.

We could also add new syntax to clarify scoping, either on definition, like 
MEASURE, or on lookup, but the downside is that these don't look like anything 
out there, so the porting effort may be diminished, but not eliminated, using 
some keyword, e.g., NEW:
 * SELECT NEW x+1 to indicate that the lookup of x should be done in the local 
select statement,
 * SELECT expr AS NEW x to indicate that x should be visible in the same 
statement

> Add support for Lateral Column Alias
> ------------------------------------
>
>                 Key: CALCITE-6939
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6939
>             Project: Calcite
>          Issue Type: Wish
>          Components: core
>    Affects Versions: 1.39.0
>            Reporter: Mihai Budiu
>            Priority: Minor
>              Labels: pull-request-available
>
> Several SQL dialects allow a SELECT to refer to aliases introduced previously 
> in the same select.
> Examples are: Spark, BigQuery, Snowflake. Other names for this feature are 
> "column aliasing with immediate reference", or "column alias visibility in 
> the same SELECT statement".
> This feature was already proposed and implemented by [CALCITE-5248], but that 
> issue and PR contains multiple other unrelated features. 
> I am also proposing adding a conformance flag `isSelectAlias` which enables 
> this feature. This is strictly a validator feature.
> The semantics has to be clearly defined; we do not want to change the 
> semantics of any existing query. My proposal is to allow a name to be 
> resolved in the current SELECT statement only if all the other name lookup 
> methods have failed. 
> I also think it's a good idea to introduce new aliases from left to right, so 
> a expression can only refer to aliases introduced to the left of it. This 
> removes a few unpleasant corner cases too.
> Another difficult problem is the handling of nondeterministic expressions. 
> When you refer to a prior expression by an alias, can/is the expression 
> re-evaluated? Unfortunately I believe that the current Calcite IR makes it 
> difficult to represent the program in such a way that re-evaluation is 
> avoided. (But this is a problem with many other Calcite rewrite rules as 
> well.) This feature is still useful when applied to deterministic expressions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (CALCITE-6939) Add support for Lateral Column Alias

Reply via email to