dtenedor commented on code in PR #37256:
URL: https://github.com/apache/spark/pull/37256#discussion_r930162598


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2919,6 +2919,17 @@ object SQLConf {
       .stringConf
       .createWithDefault("csv,json,orc,parquet")
 
+  val ADD_DEFAULT_COLUMN_EXISTING_TABLE_BANNED_PROVIDERS =
+    buildConf("spark.sql.defaultColumn.addColumnExistingTableBannedProviders")
+      .internal()
+      .doc("List of table providers wherein SQL commands are NOT permitted to assign DEFAULT " +
+        "values to new columns in existing tables, such as when using the ALTER TABLE ... " +
+        "ADD COLUMNS command in SQL. Comma-separated list, whitespace ignored, case-insensitive.")

Review Comment:
   These are good points. It is not ideal to duplicate the table provider strings across two SQLConf entries.
   
   @gengliangwang and @amaliujia what do you think about simply reusing the `DEFAULT_COLUMN_ALLOWED_PROVIDERS` SQLConf and adding a simple rule to mark a table provider as "does not support `ALTER TABLE ADD COLUMN` commands"? For example, we could allow an asterisk after the table provider name to indicate this, so `"csv,json,orc,parquet,myv2datasource*"` would be the SQLConf value for a new V2 data source.
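
   To make the asterisk idea concrete, here is a rough sketch of how a single conf value could encode both lists. All names here (`DefaultColumnProviders`, `ProviderSupport`, `parse`) are illustrative, not Spark internals:
   
   ```scala
   // Hypothetical sketch: parse a provider list where a trailing asterisk marks
   // providers that allow DEFAULT values with every command EXCEPT
   // ALTER TABLE ... ADD COLUMNS.
   object DefaultColumnProviders {
     final case class ProviderSupport(name: String, allowsAddColumn: Boolean)
   
     // Comma-separated, whitespace ignored, case-insensitive, per the doc string.
     def parse(conf: String): Map[String, ProviderSupport] =
       conf.split(",").map(_.trim.toLowerCase).filter(_.nonEmpty).map { entry =>
         val banned = entry.endsWith("*")
         val name = if (banned) entry.dropRight(1) else entry
         name -> ProviderSupport(name, allowsAddColumn = !banned)
       }.toMap
   }
   ```
   
   With this, `parse("csv,json,orc,parquet,myv2datasource*")` would yield `allowsAddColumn = true` for the four file formats and `false` for `myv2datasource`, so one conf covers both questions.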



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2919,6 +2919,17 @@ object SQLConf {
       .stringConf
       .createWithDefault("csv,json,orc,parquet")
 
+  val ADD_DEFAULT_COLUMN_EXISTING_TABLE_BANNED_PROVIDERS =

Review Comment:
   Good question: we want to differentiate between which table providers allow column DEFAULT values with all commands, vs. which ones support all commands *except* `ALTER TABLE ADD COLUMN`. The reason is that the latter command requires the corresponding data source to implement specific support for filling in the existence default values during future scans, while all other commands simply use the `ResolveDefaultColumns` analysis rule in a generic way.
   
   We've implemented this data source scan support for the CSV, JSON, ORC, and Parquet file formats, but not for any V2 data sources yet. If someone wants to extend a V2 data source to support column DEFAULT values, they'll have to block `ALTER TABLE ADD COLUMN` commands at first. This PR provides a way to do that.
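
   As a rough illustration of the "existence default" mechanism mentioned above (all names here are hypothetical, not Spark internals): rows written before a column was added carry no stored value for it, so the reader must substitute the default recorded at ADD COLUMN time:
   
   ```scala
   // Hypothetical sketch: an "existence default" is the value a scan substitutes
   // for rows that were written before an ALTER TABLE ... ADD COLUMNS command.
   final case class ColumnSpec(name: String, existenceDefault: Option[Any])
   
   object ExistenceDefaults {
     // Return one value per schema column, filling in the existence default
     // (or null) for any field absent from the stored row.
     def readRow(stored: Map[String, Any], schema: Seq[ColumnSpec]): Seq[Any] =
       schema.map(col => stored.getOrElse(col.name, col.existenceDefault.orNull))
   }
   ```
   
   This is the per-row work a data source's scan path must do, which is why providers without it need `ALTER TABLE ADD COLUMN` blocked even though the generic `ResolveDefaultColumns` rule covers the other commands.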



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

