[GitHub] [spark] ryan-johnson-databricks opened a new pull request, #40300: [SPARK-42683] Automatically rename conflicting metadata columns

via GitHub Mon, 06 Mar 2023 08:06:43 -0800


ryan-johnson-databricks opened a new pull request, #40300:
URL: https://github.com/apache/spark/pull/40300


   ### What changes were proposed in this pull request?
   
   Today, if a data source already has an output column called `_metadata`, 
queries cannot access the file-source metadata column that normally carries 
that name. We can address this conflict with two changes to metadata column 
handling:
   
   1. Automatically rename any metadata column whose name conflicts with an 
output column.
   2. Add a way to reliably find metadata columns, even if they were renamed.
   
   In this PR, the name is made unique by prepending underscores to the 
original name until it no longer conflicts. This keeps the resulting query 
plan easy to debug, because a reader can still tell at a glance which column 
it is. It also gives users a way to access the column manually, by prefixing 
the column name in the query with a predictable number of underscores.
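   
   The renaming rule described above can be sketched in a few lines of plain 
Scala. This is only an illustration, not the code in this PR: the object 
`MetadataColumnNaming` and the function `uniqueName` are hypothetical names, 
and the sketch compares names case-sensitively, which may differ from how 
Spark resolves column names.
   
   ```scala
   // Hypothetical helper for illustration only; not Spark's implementation.
   object MetadataColumnNaming {
     // Prepend underscores to `name` until it no longer matches any output column.
     @annotation.tailrec
     def uniqueName(name: String, outputColumns: Set[String]): String =
       if (outputColumns.contains(name)) uniqueName("_" + name, outputColumns)
       else name
   }
   ```
   
   Under these assumptions, a table with output columns `x`, `y` and 
`_metadata` would yield `__metadata`, and a table without a conflict would 
keep the original name.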
   
   In addition, we define new dataframe methods `metadataColumn` and 
`withMetadataColumn`, which mirror the existing methods `col` and `withColumn`, 
but which only work for metadata columns.
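   
   One way to picture how a lookup by the original name can survive a rename 
is a toy model in which each metadata column remembers its logical name next 
to the name it carries in the plan. The types and functions below 
(`MetadataAttr`, `MetadataLookup.resolve`, `MetadataLookup.metadataColumn`) 
are assumptions made up for this sketch; they are not Spark classes and do 
not describe how the PR implements the lookup.
   
   ```scala
   // Toy model for illustration only; these are not Spark classes.
   final case class MetadataAttr(logicalName: String, physicalName: String)

   object MetadataLookup {
     // Give each metadata column a non-conflicting physical name by prepending
     // underscores, and keep the original (logical) name alongside it.
     def resolve(metadataNames: Seq[String], outputColumns: Set[String]): Seq[MetadataAttr] =
       metadataNames.map { logical =>
         val physical =
           Iterator.iterate(logical)("_" + _).dropWhile(outputColumns.contains).next()
         MetadataAttr(logical, physical)
       }

     // Find a metadata column by its logical name, whether or not it was renamed.
     def metadataColumn(attrs: Seq[MetadataAttr], logicalName: String): Option[String] =
       attrs.find(_.logicalName == logicalName).map(_.physicalName)
   }
   ```
   
   In this model, asking for `_metadata` returns the renamed column when a 
conflict exists, while ordinary name resolution would still pick the table's 
own column.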
   
   ### Why are the changes needed?
   
   Today, it's too easy to lose access to metadata columns if the user's table 
happens to contain a column with a conflicting name. This sharp edge limits 
the utility of metadata columns in general, because the feature doesn't work 
reliably for all table schemas.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Suppose we have the following table definition:
   ```sql
   CREATE TABLE has_metadata_conflict(x INTEGER, y INTEGER, _metadata VARCHAR)
   ```
   Then this query returns the table's string column, and the file-source 
metadata column is completely inaccessible:
   ```sql
   SELECT _metadata FROM has_metadata_conflict
   ```
   The metadata column is also not available through the dataframe API, and the 
example below would return the table's string column:
   ```scala
   df.col("_metadata")
   ```
   
   With the change, the original query still returns a string, but the 
file-source metadata column can still be found and accessed by invoking 
`Dataset.withMetadataColumn` or `Dataset.metadataColumn`:
   ```scala
   df.withMetadataColumn("_metadata")
   ```
   
   The renamed metadata column can also be selected manually (as `__metadata` 
in this case), if the user prefers to rewrite the query:
   ```sql
   SELECT __metadata FROM has_metadata_conflict
   ```
   
   ### How was this patch tested?
   
   New unit tests added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org