[
https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781959#comment-17781959
]
Abhinav Kumar commented on SPARK-33164:
---------------------------------------
I see value in some of the use cases [~arnaud.nauwynck] mentions. But "SELECT *"
has well-documented risks and can lead to maintainability issues.
Should we still be trying to implement this?
> SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to
> DataSet.dropColumn(someColumn)
> ------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-33164
> URL: https://issues.apache.org/jira/browse/SPARK-33164
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
> Reporter: Arnaud Nauwynck
> Priority: Minor
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely
> no jargon.
> I would like to have the extended SQL syntax "SELECT * EXCEPT someColumn FROM
> .."
> to be able to select all columns except some in a SELECT clause.
> It would be similar to the SQL syntax of other databases, such as Google BigQuery
> or PostgreSQL.
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
> Google the question "select * EXCEPT one column", and you will see that many
> developers have the same problem.
> example posts:
> https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/
> https://www.thetopsites.net/article/53001825.shtml
> There are several typical examples where it is very helpful:
> use-case 1:
> you add a "count(*) countCol" column and then filter on it, using for
> example "having countCol = 1"
> ... and then you want to select all columns EXCEPT this dummy column, whose
> value is always "1"
> {noformat}
> select * (EXCEPT countCol)
> from (
>     select count(*) countCol, *
>     from MyTable
>     where ...
>     group by ...
>     having countCol = 1
> )
> {noformat}
>
> use-case 2:
> the same with an analytic function, "RANK() OVER (PARTITION BY ...) rankCol ...
> WHERE rankCol = 1".
> For example, to get the latest row before a given time in a time-series
> table.
> These are the "time-travel" queries addressed by frameworks like Delta Lake.
> {noformat}
> CREATE TABLE t_updates (update_time timestamp, id string, col1 type1, col2
> type2, ... col42 type42)
>
> pastTime=..
>
> SELECT * (EXCEPT rankCol)
> FROM (
>     SELECT *,
>            RANK() OVER (PARTITION BY id ORDER BY update_time DESC) rankCol
>     FROM t_updates
>     WHERE update_time < pastTime
> ) WHERE rankCol = 1
> {noformat}
>
> use-case 3:
> copy some data from table "t" to a corresponding table "t_snapshot", and back
> to "t"
> {noformat}
> CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ...
>
> /* create the corresponding table: (snap_id string, col1 type1, col2 type2,
>    col3 type3, ... col42 type42) */
> CREATE TABLE t_snapshot
> AS SELECT '' AS snap_id, * FROM t WHERE 1=2
>
> /* insert data from t into a snapshot */
> INSERT INTO t_snapshot
> SELECT 'snap1' AS snap_id, * FROM t
>
> /* select data from the snapshot table (without the snap_id column) */
> SELECT * (EXCEPT snap_id) FROM t_snapshot WHERE snap_id = 'snap1'
>
> {noformat}
>
>
> *Q2.* What problem is this proposal NOT designed to solve?
> It is only SQL syntactic sugar.
> It does not change the SQL execution plan or anything complex.
> *Q3.* How is it done today, and what are the limits of current practice?
>
> Today, you can either use the Dataset API, with .drop(someColumn),
> or you need to hard-code all columns manually in your SQL. Your code is
> therefore NOT generic (unless you use a SQL meta-code generator).
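As an illustration only (this is not Spark code; {{select_except}} is a hypothetical helper), the intended semantics of "SELECT * (EXCEPT col)" can be sketched in plain Python over rows represented as dicts:

```python
def select_except(rows, excluded):
    """Keep every column except those named in `excluded` --
    the same effect as the proposed SELECT * (EXCEPT col) syntax."""
    excluded = set(excluded)
    return [{col: val for col, val in row.items() if col not in excluded}
            for row in rows]

rows = [
    {"id": "a", "col1": 1, "countCol": 1},
    {"id": "b", "col1": 2, "countCol": 1},
]
print(select_except(rows, ["countCol"]))
# -> [{'id': 'a', 'col1': 1}, {'id': 'b', 'col1': 2}]
```

This mirrors what the Dataset API's column-dropping call does today, but expressed as pure SQL it currently requires listing every remaining column by hand.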
> *Q4.* What is new in your approach and why do you think it will be successful?
> It is NOT new... it is already a proven solution in Dataset.drop(),
> PostgreSQL, and BigQuery.
>
> *Q5.* Who cares? If you are successful, what difference will it make?
> It simplifies the lives of developers, DBAs, data analysts, and end users.
> It simplifies the development of SQL code, in a more generic way, for many tasks.
> *Q6.* What are the risks?
> There is VERY limited risk to Spark SQL, because the feature already exists in
> the Dataset API.
> It is an extension of the SQL syntax, so the main risk is confusing some IDE
> SQL editors with the new syntax.
> *Q7.* How long will it take?
> No idea. I guess someone experienced with the Spark SQL internals could do it
> relatively "quickly".
> It is a kind of syntactic sugar: add a rule to the ANTLR grammar, then
> transform it into the Dataset API.
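The sugar-expansion step can be illustrated in plain Python ({{expand_star_except}} is a hypothetical helper, not Spark's actual analyzer): after parsing, the star with an EXCEPT list would be resolved against the input schema into an explicit, order-preserving column list.

```python
def expand_star_except(schema, except_cols):
    """Expand 'SELECT * (EXCEPT ...)' into an explicit column list,
    as an analyzer rule might do after parsing (illustrative only)."""
    missing = set(except_cols) - set(schema)
    if missing:
        # unknown EXCEPT columns should be an analysis error, not silently ignored
        raise ValueError(f"unknown column(s): {sorted(missing)}")
    dropped = set(except_cols)
    return [col for col in schema if col not in dropped]

schema = ["update_time", "id", "col1", "col2", "rankCol"]
print(expand_star_except(schema, ["rankCol"]))
# -> ['update_time', 'id', 'col1', 'col2']
```

Raising on unknown columns is an assumption here; an implementation would have to decide whether EXCEPT of a non-existent column is an error (as in BigQuery) or a no-op (as Dataset.drop() behaves today).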
> *Q8.* What are the mid-term and final “exams” to check for success?
> The 3 standard use-cases given in question Q1.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]