Yaohua628 opened a new pull request #34575:
URL: https://github.com/apache/spark/pull/34575
### What changes were proposed in this pull request?
This PR proposes a new interface in Spark SQL that allows users to query the
metadata of the input files for all file formats. Spark SQL will expose them as
**built-in hidden columns**, meaning **users can only see them when they
explicitly reference them**. Currently, this PR proposes to support the
following metadata columns inside a metadata struct named `_metadata`:

| Name | Type | Description | Example |
| ------------- | ------------- | ------------- | ------------- |
| `_metadata.file_path` | String | The absolute file path of the input file. | file:/tmp/spark-7f600b30-b3ec-43a8-8cd2-686491654f9b/f0.csv |
| `_metadata.file_name` | String | The name of the input file, including the extension. | f0.csv |
| `_metadata.file_size` | Long | The length of the input file, in bytes. | 628 |
| `_metadata.file_modification_time` | Long | The modification time of the file, in milliseconds. | 1632701945157 |

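As a small illustration of how a `file_modification_time` value can be read, the sketch below converts the example value from the table into a UTC timestamp. It assumes the value is milliseconds since the Unix epoch, which the example value suggests but this description does not state outright; `ModificationTimeExample` and `toIsoInstant` are hypothetical names used only for illustration.

```scala
import java.time.Instant

// Hypothetical helper, not part of this PR.
object ModificationTimeExample {
  // Assumption: the Long is milliseconds since the Unix epoch (UTC).
  def toIsoInstant(epochMillis: Long): String =
    Instant.ofEpochMilli(epochMillis).toString

  def main(args: Array[String]): Unit = {
    // Example value taken from the table above.
    println(toIsoInstant(1632701945157L)) // prints 2021-09-27T00:19:05.157Z
  }
}
```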
This proposed hidden file metadata interface has the following behaviors:
- **Hidden**: metadata columns are hidden. They do not show up when selecting
only data columns or when selecting all columns (`SELECT *`). In other words,
they are not returned unless explicitly referenced.
- **Does not overwrite the data schema**: if a metadata column name collides
with a data column name, the data column is returned instead of the metadata
column. In other words, metadata columns cannot overwrite user data in any case.
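The two behaviors above can be sketched as a toy resolution model. This is only an illustration of the stated rules, not the analyzer code in this PR; `HiddenColumnModel`, `Relation`, `expandStar`, and `resolve` are hypothetical names.

```scala
// Toy model of the rules described above; NOT Spark's actual implementation.
object HiddenColumnModel {
  // A relation with visible data columns plus hidden metadata columns.
  final case class Relation(dataColumns: Seq[String], metadataColumns: Seq[String]) {
    // SELECT * expands to data columns only; hidden columns never appear.
    def expandStar: Seq[String] = dataColumns

    // Resolve an explicitly referenced name. Data columns are checked first,
    // so a metadata column can never shadow user data.
    def resolve(name: String): Option[String] =
      if (dataColumns.contains(name)) Some(s"data:$name")
      else if (metadataColumns.contains(name)) Some(s"metadata:$name")
      else None
  }

  def main(args: Array[String]): Unit = {
    val plain = Relation(Seq("name", "age"), Seq("_metadata"))
    println(plain.expandStar)           // prints List(name, age)
    println(plain.resolve("_metadata")) // prints Some(metadata:_metadata)

    // Name collision: the file itself contains a column called _metadata.
    val colliding = Relation(Seq("name", "_metadata"), Seq("_metadata"))
    println(colliding.resolve("_metadata")) // prints Some(data:_metadata)
  }
}
```

In this model the precedence is encoded simply by the order of the checks in `resolve`, which is one straightforward way to guarantee that user data always wins.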
### Why are the changes needed?
To improve Spark SQL observability for all file formats.
### Does this PR introduce _any_ user-facing change?
Yes.
```
// `schema` is the user-supplied schema of the data columns (name, age).
spark.read.format("csv")
  .schema(schema)
  .load("file:/tmp/*")
  .select("name", "age",
    "_metadata.file_path", "_metadata.file_name",
    "_metadata.file_size", "_metadata.file_modification_time")
```
Example return:

| name | age | file_path | file_name | file_size | file_modification_time |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| Debbie | 18 | file:/tmp/f0.csv | f0.csv | 12 | 710112965421 |
| Frank | 24 | file:/tmp/f1.csv | f1.csv | 11 | 787959365553 |

### How was this patch tested?
Added a new test suite: `FileMetadataColumnsSuite`.