[
https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781959#comment-17781959
]
Abhinav Kumar commented on SPARK-33164:
---------------------------------------
I see value in some of the use cases [~arnaud.nauwynck] mentions. But "SELECT *"
has well-documented risks and can lead to maintainability issues.
Should we still be trying to implement this?
> SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to
> DataSet.dropColumn(someColumn)
> ------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-33164
> URL: https://issues.apache.org/jira/browse/SPARK-33164
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
> Reporter: Arnaud Nauwynck
> Priority: Minor
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely
> no jargon.
> I would like to have the extended SQL syntax "SELECT * EXCEPT someColumn FROM
> .."
> to be able to select all columns except some in a SELECT clause.
> It would be similar to the SQL syntax of other databases, such as Google BigQuery
> or PostgreSQL.
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
> Google the question "select * EXCEPT one column", and you will see that many
> developers have the same problem.
> example posts:
> https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/
> https://www.thetopsites.net/article/53001825.shtml
> There are several typical examples where it is very helpful:
> use-case 1:
> you add a "count(*) countCol" column and then filter on it, using for
> example "having countCol = 1"
> ... and then you want to select all columns EXCEPT this dummy column, whose
> value is always "1"
> {noformat}
> select * (EXCEPT countCol)
> from (
>     select count(*) countCol, *
>     from MyTable
>     where ...
>     group by ...
>     having countCol = 1
> )
> {noformat}
>
> use-case 2:
> the same with an analytic function, "RANK() OVER (PARTITION BY ...) rankCol ...
> WHERE rankCol = 1".
> For example, to get the latest row before a given time in a time-series
> table.
> These are the "time-travel" queries addressed by frameworks like Delta Lake.
> {noformat}
> CREATE TABLE t_updates (update_time timestamp, id string, col1 type1, col2
> type2, ... col42 type42)
>
> pastTime=..
>
> SELECT * (EXCEPT rankCol)
> FROM (
>     SELECT *,
>            RANK() OVER (PARTITION BY id ORDER BY update_time DESC) rankCol
>     FROM t_updates
>     WHERE update_time < pastTime
> ) WHERE rankCol = 1
> {noformat}
>
> use-case 3:
> copy some data from table "t" to a corresponding table "t_snapshot", and back
> to "t"
> {noformat}
> CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ...
>
> /* create the corresponding table: (snap_id string, col1 type1, col2 type2,
>    col3 type3, ... col42 type42) */
> CREATE TABLE t_snapshot
> AS SELECT '' AS snap_id, * FROM t WHERE 1=2
>
> /* insert data from t into a snapshot */
> INSERT INTO t_snapshot
> SELECT 'snap1' AS snap_id, * FROM t
>
> /* select data from the snapshot table (without the snap_id column) */
> SELECT * (EXCEPT snap_id) FROM t_snapshot WHERE snap_id = 'snap1'
>
> {noformat}
>
>
> *Q2.* What problem is this proposal NOT designed to solve?
> It is only SQL syntactic sugar.
> It does not change the SQL execution plan or anything complex.
> *Q3.* How is it done today, and what are the limits of current practice?
>
> Today, you can either use the Dataset API, with .drop(someColumn),
> or you need to hard-code all columns manually in your SQL. Your code is
> therefore NOT generic (unless you use a SQL meta-code generator).
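As an illustration only (this is not Spark code; {{select_except}} is a hypothetical helper), the intended semantics of "SELECT * (EXCEPT col)" can be sketched in plain Python over rows represented as dicts:

```python
def select_except(rows, excluded):
    """Keep every column except those named in `excluded` --
    the same effect as the proposed SELECT * (EXCEPT col) syntax."""
    excluded = set(excluded)
    return [{col: val for col, val in row.items() if col not in excluded}
            for row in rows]

rows = [
    {"id": "a", "col1": 1, "countCol": 1},
    {"id": "b", "col1": 2, "countCol": 1},
]
print(select_except(rows, ["countCol"]))
# -> [{'id': 'a', 'col1': 1}, {'id': 'b', 'col1': 2}]
```

This mirrors what the Dataset API's column-dropping call does today, but expressed as pure SQL it currently requires listing every remaining column by hand.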
> *Q4.* What is new in your approach and why do you think it will be successful?
> It is NOT new... it is already a proven solution in Dataset.drop(),
> PostgreSQL, and BigQuery.
>
> *Q5.* Who cares? If you are successful, what difference will it make?
> It simplifies the lives of developers, DBAs, data analysts, and end users.
> It simplifies the development of SQL code, in a more generic way, for many tasks.
> *Q6.* What are the risks?
> There is VERY limited risk to Spark SQL, because the feature already exists in
> the Dataset API.
> It is an extension of the SQL syntax, so the main risk is confusing some IDE
> SQL editors with the new syntax.
> *Q7.* How long will it take?
> No idea. I guess someone experienced with the Spark SQL internals could do it
> relatively "quickly".
> It is a kind of syntactic sugar: add a rule to the ANTLR grammar, then
> transform it into the Dataset API.
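The sugar-expansion step can be illustrated in plain Python ({{expand_star_except}} is a hypothetical helper, not Spark's actual analyzer): after parsing, the star with an EXCEPT list would be resolved against the input schema into an explicit, order-preserving column list.

```python
def expand_star_except(schema, except_cols):
    """Expand 'SELECT * (EXCEPT ...)' into an explicit column list,
    as an analyzer rule might do after parsing (illustrative only)."""
    missing = set(except_cols) - set(schema)
    if missing:
        # unknown EXCEPT columns should be an analysis error, not silently ignored
        raise ValueError(f"unknown column(s): {sorted(missing)}")
    dropped = set(except_cols)
    return [col for col in schema if col not in dropped]

schema = ["update_time", "id", "col1", "col2", "rankCol"]
print(expand_star_except(schema, ["rankCol"]))
# -> ['update_time', 'id', 'col1', 'col2']
```

Raising on unknown columns is an assumption here; an implementation would have to decide whether EXCEPT of a non-existent column is an error (as in BigQuery) or a no-op (as Dataset.drop() behaves today).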
> *Q8.* What are the mid-term and final “exams” to check for success?
> The 3 standard use-cases given in question Q1.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]