[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query

2024-06-11 Thread Jonathan Boarman (Jira)


[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854182#comment-17854182 ]

Jonathan Boarman commented on SPARK-24497:
--

There are a lot of folks wondering when this will get merged.  I see from Peter 
that the issue relates to getting reviewers.  How do we get reviewers to review 
that PR?

> ANSI SQL: Recursive query
> -
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
> represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
>     id INTEGER PRIMARY KEY,  -- department ID
>     parent_department INTEGER REFERENCES department, -- upper department ID
>     name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |         |
> --      |         +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
>     -- non-recursive term
>     SELECT * FROM department WHERE name = 'A'
>     UNION ALL
>     -- recursive term
>     SELECT d.*
>     FROM department AS d
>     JOIN subdepartment AS sd
>         ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
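> For reference, with the sample data above, the recursion starts from A (id 1) 
> and repeatedly joins in direct children, so the query should return the 
> subtree rooted at A:
> {noformat}
>  id | parent_department | name
> ----+-------------------+------
>   1 |                 0 | A
>   2 |                 1 | B
>   3 |                 2 | C
>   4 |                 2 | D
>   6 |                 4 | F
> {noformat}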
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  





[jira] [Commented] (SPARK-31561) Add QUALIFY Clause

2024-06-11 Thread Jonathan Boarman (Jira)


[ https://issues.apache.org/jira/browse/SPARK-31561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854181#comment-17854181 ]

Jonathan Boarman commented on SPARK-31561:
--

How do we revive this PR and add reviewers to get this promoted?

> Add QUALIFY Clause
> --
>
> Key: SPARK-31561
> URL: https://issues.apache.org/jira/browse/SPARK-31561
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> In a SELECT statement, the QUALIFY clause filters the results of window 
> functions.
> QUALIFY does with window functions what HAVING does with aggregate functions 
> and GROUP BY clauses.
> In the execution order of a query, QUALIFY is therefore evaluated after 
> window functions are computed.
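> A minimal sketch of the idea (the table and column names here are 
> hypothetical, and alias-reference rules in QUALIFY vary by platform):
> {code:sql}
> -- Keep only the most recent order per customer; without QUALIFY this
> -- window filter would require wrapping the query in a subquery.
> SELECT customer_id, order_id, order_date,
>        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
> FROM orders
> QUALIFY rn = 1;
> {code}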
> Examples:
> https://docs.snowflake.com/en/sql-reference/constructs/qualify.html#examples
> More details:
> https://docs.snowflake.com/en/sql-reference/constructs/qualify.html
> https://docs.teradata.com/reader/2_MC9vCtAJRlKle2Rpb0mA/19NnI91neorAi7LX6SJXBw






[jira] [Commented] (SPARK-39247) Support returning a table or set of rows in CREATE FUNCTION

2024-06-08 Thread Jonathan Boarman (Jira)


[ https://issues.apache.org/jira/browse/SPARK-39247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853410#comment-17853410 ]

Jonathan Boarman commented on SPARK-39247:
--

Being able to create a *UDTF* (user-defined table function), also known as a 
tabular UDF, table UDF, UDF table function, or TVF (table-valued function), is 
incredibly useful and very powerful, especially in combination with LATERAL 
invocations (see the sketch after the links below).

Other platforms, such as SQL Server, have offered this feature for more than 20 
years, which means it has been around long enough to develop deep and frequent 
use cases in more mature data teams.

 
{panel:title=Documentation from a few major data platforms covering their 
implementations}
*_Databricks -- Create a SQL table function_*
https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html#create-a-sql-table-function

*_Snowflake -- Creating Tabular SQL UDFs (UDTFs)_*
https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-tabular-functions

*_BigQuery -- Creating table-valued functions (TVF)_*
https://cloud.google.com/bigquery/docs/table-functions

*_Postgres -- SQL Functions Returning TABLE_*
https://www.postgresql.org/docs/current/xfunc-sql.html#XFUNC-SQL-FUNCTIONS-RETURNING-TABLE

*_SAP HANA -- Table user-defined functions (TUDF)_*
https://help.sap.com/docs/SAP_HANA_PLATFORM/de2486ee947e43e684d39702027f8a94/2fc6d7beebd14c579457092e91519082.html?q=create%20function

*_IBM DB2 -- CREATE FUNCTION (user-defined external table)_*
https://www.ibm.com/docs/en/db2/11.5?topic=statements-create-function-external-table

*_Oracle -- Pipelined Table Functions_*
https://docs.oracle.com/en/database/oracle/oracle-database/19/lnpls/PIPELINED-clause.html

*_MSSQL -- Inline and Multi-statement table-valued functions (TVF/MSTVF)_*
https://learn.microsoft.com/en-us/sql/relational-databases/user-defined-functions/create-user-defined-functions-database-engine?view=sql-server-ver16#inline-table-valued-function-tvf
{panel}
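
To make the LATERAL pairing concrete, here is a minimal sketch loosely 
following the Databricks-style SQL table function syntax linked above; the 
{{customer_orders}} function and the {{customers}}/{{orders}} tables are 
hypothetical:
{code:sql}
-- Hypothetical table function: the latest max_rows orders for one customer.
CREATE FUNCTION customer_orders(cust_id INT, max_rows INT)
  RETURNS TABLE(order_id INT, total DECIMAL(10,2))
  RETURN SELECT order_id, total
         FROM orders
         WHERE orders.customer_id = cust_id
         ORDER BY order_date DESC
         LIMIT max_rows;

-- Invoked once per customer row via LATERAL, as described above.
SELECT c.name, o.order_id, o.total
FROM customers AS c,
     LATERAL customer_orders(c.id, 3) AS o;
{code}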
 

> Support returning a table or set of rows in CREATE FUNCTION
> ---
>
> Key: SPARK-39247
> URL: https://issues.apache.org/jira/browse/SPARK-39247
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Erik Jansen
>Priority: Major
>
> The Databricks version supports creating a function which returns a result set:
> _"Creates a SQL scalar or table function that takes a set of arguments and 
> returns a scalar value or a set of rows."_
> Such functions are very useful, as the user can pass a list of parameters 
> which the function can use to build a dynamic result set (a query 
> parameterized by the passed values). 
> See the example from Databricks:
> {code:sql}
> -- Produce all weekdays between two dates
> CREATE FUNCTION weekdays(start DATE, end DATE)
>   RETURNS TABLE(day_of_week STRING, day DATE)
>   RETURN SELECT extract(DAYOFWEEK_ISO FROM day), day
>     FROM (SELECT sequence(weekdays.start, weekdays.end)) AS T(days)
>     LATERAL VIEW explode(days) AS day
>     WHERE extract(DAYOFWEEK_ISO FROM day) BETWEEN 1 AND 5;
> {code}
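> For illustration, such a function could then be queried like any other table 
> source (a hedged sketch; the date literals are arbitrary):
> {code:sql}
> SELECT * FROM weekdays(DATE'2022-01-01', DATE'2022-01-14');
> {code}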






[jira] [Commented] (SPARK-33164) SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)

2024-05-23 Thread Jonathan Boarman (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849126#comment-17849126 ]

Jonathan Boarman commented on SPARK-33164:
--

There are significant benefits provided by the *{{EXCEPT}}* feature offered by 
most large data platforms, including Databricks, Snowflake, BigQuery, DuckDB, 
etc.  The list of vendors that support *{{EXCEPT}}* (or, increasingly, 
*{{EXCLUDE}}*, to avoid conflicts with the set operator) is pretty long and 
growing.  As such, migrating projects from those platforms to a pure Spark SQL 
environment is extremely costly.

Further, the "risks" associated with `{*}{{SELECT *}}{*}` do not apply to all 
scenarios – very importantly, with CTEs these risks are not applicable since 
the constraints on column selection are generally made in the first CTE.

For example, any subsequent CTEs in a chain of CTEs inherits the field 
selection of the first CTEs.  On platforms that lack this feature, we have a 
different risk caused be crazy levels of duplication if we are forced to 
enumerate fields in each and every CTE.  This is particularly problematic when 
joining two CTEs that share a field, such as an `{*}{{ID}}{*}` column.  In that 
situation, the most efficient and risk-free approach is to `{*}{{SELECT * 
EXCEPT(right.id)}}{*}` from the join of its two dependent CTEs.
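
A minimal sketch of that pattern (the CTE and table names are hypothetical, and 
the exact {{EXCEPT}}/{{EXCLUDE}} spelling and column-qualification rules vary 
by platform):
{code:sql}
WITH orders AS (
    SELECT id, customer_id, total FROM raw_orders
),
customers AS (
    SELECT id, name FROM raw_customers
)
-- Drop the duplicated join key instead of re-enumerating every column:
SELECT * EXCEPT (customers.id)
FROM orders
JOIN customers ON orders.customer_id = customers.id;
{code}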

Any perceived judgment aside, this is a highly relied-upon feature in 
enterprise environments that depend on these quality-of-life innovations.  
Clearly such improvements are providing value in those environments, and Spark 
SQL should not be any different in supporting its users who have come to rely 
on such innovations.

> SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to 
> DataSet.dropColumn(someColumn)
> 
>
> Key: SPARK-33164
> URL: https://issues.apache.org/jira/browse/SPARK-33164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Arnaud Nauwynck
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> I would like to have the extended SQL syntax "SELECT * EXCEPT someColumn FROM 
> .." to be able to select all columns except some in a SELECT clause.
> It would be similar to the SQL syntax of some databases, like Google BigQuery 
> or PostgreSQL.
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
> Google the question "select * EXCEPT one column", and you will see that many 
> developers have the same problem.
> Example posts: 
> https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/
> https://www.thetopsites.net/article/53001825.shtml
> There are several typical examples where it is very helpful:
> use-case 1:
>  You add a "count(*) countCol" column and then filter on it, for example with 
> "having countCol = 1", and then you want to select all columns EXCEPT this 
> dummy column, which is always "1":
> {noformat}
>   select * (EXCEPT countCol)
>   from (
>     select count(*) countCol, *
>     from MyTable
>     where ...
>     group by ... having countCol = 1
>   )
> {noformat}
>
> use-case 2:
>  The same with an analytical function, e.g. "RANK() OVER (PARTITION BY ...) 
> rankCol ... WHERE rankCol = 1".
>  For example, to get the latest row before a given time in a time-series 
> table.
>  These are the "time-travel" queries addressed by frameworks like "DeltaLake".
> {noformat}
>  CREATE TABLE t_updates (update_time timestamp, id string, col1 type1, col2 type2, ... col42)
>
>  pastTime = ..
>
>  SELECT * (EXCEPT rankCol)
>  FROM (
>    SELECT *,
>           RANK() OVER (PARTITION BY id ORDER BY update_time DESC) rankCol
>    FROM t_updates
>    WHERE update_time < pastTime
>  ) WHERE rankCol = 1
> {noformat}
>  
> use-case 3:
>  Copy some data from table "t" to a corresponding table "t_snapshot", and back 
> to "t":
> {noformat}
>    CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ...
>
>    /* create corresponding table: (snap_id string, col1 type1, col2 type2, col3 type3, ... col42 type42) */
>    CREATE TABLE t_snapshot
>    AS SELECT '' as snap_id, * FROM t WHERE 1=2
>
>    /* insert data from t to some snapshot */
>    INSERT INTO t_snapshot
>    SELECT 'snap1' as snap_id, * from t
>
>    /* select some data from the snapshot table (without the snap_id column) */
>    SELECT * (EXCEPT snap_id) FROM t_snapshot where snap_id='snap1'
> {noformat}
>
>
> *Q2.* What problem is this proposal NOT designed to solve?
> It is only SQL syntactic sugar. 
> It does not change SQL execution plan or an