Yaohua628 opened a new pull request #34575:
URL: https://github.com/apache/spark/pull/34575


   ### What changes were proposed in this pull request?
   This PR proposes a new interface in Spark SQL that allows users to query the 
metadata of the input files for all file formats. Spark SQL exposes this 
metadata as **built-in hidden columns**, meaning **users only see them when 
they explicitly reference them**. Currently, this PR supports the following 
metadata columns inside a metadata struct named `_metadata`:
   
   | Name | Type | Description | Example |
   | ------------- | ------------- | ------------- | ------------- |
   | _metadata.file_path | String | The absolute file path of the input file. | file:/tmp/spark-7f600b30-b3ec-43a8-8cd2-686491654f9b/f0.csv |
   | _metadata.file_name | String | The name of the input file along with the extension. | f0.csv |
   | _metadata.file_size | Long | The length of the input file, in bytes. | 628 |
   | _metadata.file_modification_time | Long | The modification time of the file, in milliseconds. | 1632701945157 |
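
   To make the meaning of the four fields concrete, here is a minimal sketch that reads the same attributes for a local file using only `java.nio`. This is not how Spark computes them; `FileMetadata` and `LocalFileMetadata` are hypothetical names introduced for illustration, and the exact URI form of the path may differ from the one Spark reports.

   ```scala
   import java.nio.file.{Files, Path}

   // Hypothetical container mirroring the four proposed `_metadata` fields.
   final case class FileMetadata(
       filePath: String,           // absolute path of the file, as a URI string
       fileName: String,           // file name including the extension
       fileSize: Long,             // length in bytes
       fileModificationTime: Long) // last-modified time in milliseconds

   object LocalFileMetadata {
     // Reads the attributes from the local file system (assumption: the file
     // exists and is readable; Spark itself goes through its own file listing).
     def of(path: Path): FileMetadata = {
       val abs = path.toAbsolutePath
       FileMetadata(
         filePath = abs.toUri.toString,
         fileName = abs.getFileName.toString,
         fileSize = Files.size(abs),
         fileModificationTime = Files.getLastModifiedTime(abs).toMillis)
     }
   }
   ```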
   
   This proposed hidden file metadata interface has the following behaviors:
   - **Hidden**: metadata columns do not show up when selecting only data 
columns or when selecting all columns (`SELECT *`). They are returned only when 
explicitly referenced.
   - **Does not overwrite the data schema**: if a metadata column name collides 
with a data column name, the data column is returned instead of the metadata 
column. Metadata columns can never override user data.
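
   The **Hidden** behavior can be sketched as follows, assuming a Spark build that includes this PR. The object name `HiddenMetadataSketch`, the helper names, the two-column schema, and the input path argument are assumptions made for illustration, not part of the patch.

   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}

   object HiddenMetadataSketch {
     // Loads CSV files with a hypothetical two-column data schema.
     def load(spark: SparkSession, path: String): DataFrame =
       spark.read.format("csv").schema("name STRING, age INT").load(path)

     // Columns returned by selecting everything: data columns only,
     // because `_metadata` is hidden unless referenced explicitly.
     def starColumns(df: DataFrame): Seq[String] =
       df.select("*").columns.toSeq

     // Columns returned when a metadata field is referenced explicitly.
     def explicitColumns(df: DataFrame): Seq[String] =
       df.select("name", "_metadata.file_name").columns.toSeq
   }
   ```

   With the behavior described above, `starColumns` should list only `name` and `age`, while `explicitColumns` should also include `file_name`.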
   
   ### Why are the changes needed?
   To improve Spark SQL observability for all file formats.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Users can now select the `_metadata` fields explicitly, for example:
   ```
   spark.read.format("csv")
        .schema(schema)
        .load("file:/tmp/*")
        .select("name", "age",
                "_metadata.file_path", "_metadata.file_name",
                "_metadata.file_size", "_metadata.file_modification_time")
   ```
   Example return:
   | name  | age | file_path | file_name | file_size | file_modification_time |
   | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
   | Debbie | 18 | file:/tmp/f0.csv | f0.csv | 12 | 710112965421 |
   | Frank | 24 | file:/tmp/f1.csv | f1.csv | 11 | 787959365553 |
   
   ### How was this patch tested?
   Added a new test suite: `FileMetadataColumnsSuite`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


