ryan-johnson-databricks opened a new pull request, #40300: URL: https://github.com/apache/spark/pull/40300
### What changes were proposed in this pull request?

Today, if a data source already has an output column called `_metadata`, queries cannot access the file-source metadata column that normally carries that name. We can address this conflict with two changes to metadata column handling:

1. Automatically rename any metadata column whose name conflicts with an output column.
2. Add a way to reliably find metadata columns, even if they were renamed.

In this PR, the name is made unique by prepending underscores to the original name until it no longer conflicts. This improves debuggability of the resulting query plan, because a human can still quickly tell which column it is. It also gives users a potential surface for accessing the column manually, by adjusting the column name in the query to add a predictable number of underscores.

In addition, we define new DataFrame methods `metadataColumn` and `withMetadataColumn`, which mirror the existing methods `col` and `withColumn`, but which only work for metadata columns.

### Why are the changes needed?

Today, it is too easy to lose access to metadata columns if the user's table happens to contain a column with the conflicting name. This sharp edge limits the utility of metadata columns in general, because the feature doesn't work reliably for all table schemas.

### Does this PR introduce _any_ user-facing change?
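To make the renaming rule concrete, here is a minimal sketch of "prepend underscores until the name no longer conflicts". It is only an illustration: `MetadataRename` and `uniqueName` are hypothetical names, not the helper this PR adds, and the sketch compares names as exact strings, ignoring Spark's case-sensitivity settings.

```scala
// Illustrative sketch only; not the implementation in this PR.
object MetadataRename {
  /**
   * Prepends underscores to `name` until it no longer collides with any
   * name in `outputNames` (assumption: exact, case-sensitive comparison).
   */
  @scala.annotation.tailrec
  def uniqueName(name: String, outputNames: Set[String]): String =
    if (outputNames.contains(name)) uniqueName("_" + name, outputNames)
    else name
}
```

For a table with output columns `x`, `y`, and `_metadata`, `MetadataRename.uniqueName("_metadata", Set("x", "y", "_metadata"))` yields `__metadata`, which is the predictable name a user could type by hand.
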
Suppose we have the following table definition:

```sql
CREATE TABLE has_metadata_conflict(x INTEGER, y INTEGER, _metadata VARCHAR)
```

Today, this query returns the table's string column, and the file-source metadata column is completely inaccessible:

```sql
SELECT _metadata FROM has_metadata_conflict
```

The metadata column is also not available through the DataFrame API, and the example below would return the table's string column:

```scala
df.select("_metadata")
```

With this change, the original query still returns a string, but the file-source metadata column can still be found and accessed by invoking `Dataset.withMetadataColumn` or `Dataset.metadataColumn`:

```scala
df.withMetadataColumn("_metadata")
```

The renamed metadata column can also be selected manually (as `__metadata` in this case), if the user prefers to rewrite the query:

```sql
SELECT __metadata FROM has_metadata_conflict
```

### How was this patch tested?

New unit tests added.