jkylling commented on issue #20135:
URL: https://github.com/apache/datafusion/issues/20135#issuecomment-3843241290
If we go with metadata columns we need to answer questions like:
- Should virtual columns be part of `select * from table`?
- Should we expose a single column with a system struct, like `select
__metadata.row_number from table`, or flat columns, like `select __row_number,
__file_name from table`?
- Should virtual columns show up when you do `describe table` or `select *
from information_schema.columns`?
- What happens if there is a name conflict between a metadata column and an
ordinary table column?
- What should the name of the metadata columns be?
Instead of answering yes or no to the questions, could we support both
alternatives, and leave it up to the `TableProvider` to make the decision?
Within DataFusion we might have an opinionated default answer for
`ListingTable`, but other engines would be free to include metadata column(s)
in `select * from table` (or make it configurable), and to decide whether they
want flat metadata columns or to bundle them all in a struct.
Maybe we could achieve this with a simple extension type on the schema
fields? A `TableProvider` could return a schema like
```
Field::new("user_id", DataType::Int64, false),
Field::new("amount", DataType::Float64, false),
Field::new("file_location", DataType::Utf8, false)
    .with_extension_type(FileLocation),
Field::new("row_index", DataType::Int64, false)
    .with_extension_type(Hidden)
    .with_extension_type(RowNumber),
Field::new("__metadata", large_metadata_struct, false)
    .with_extension_type(Hidden),
```
Here the columns with the `Hidden` extension type would be omitted from
`select * from table`. The `FileLocation` extension type would be picked up by
the `FileSource` to add the file location, and the `RowNumber` extension type
would be picked up by the Arrow Parquet reader. The `large_metadata_struct`
could be a large struct with metadata columns (file size, last modified time,
row group id, etc.) which are resolved within the `TableProvider::scan` method
and the sources it uses.
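
To make the `Hidden` behaviour concrete, here is a rough, std-only sketch of
how wildcard expansion could skip hidden columns while still letting them be
selected by name. It does not use the real Arrow or DataFusion APIs: `Field`
below is a simplified stand-in, and `with_marker`, `expand_wildcard`,
`resolve` and `example_schema` are hypothetical names made up for
illustration. Modelling the markers as a set of string tags is also an
assumption, not how extension types are actually stored.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for an Arrow `Field`: a name plus marker tags.
/// The tags play the role of the extension types in the schema above.
#[derive(Debug, Clone)]
struct Field {
    name: String,
    markers: BTreeSet<&'static str>,
}

impl Field {
    fn new(name: &str) -> Self {
        Field { name: name.to_string(), markers: BTreeSet::new() }
    }

    /// Hypothetical builder, loosely mirroring `with_extension_type`.
    fn with_marker(mut self, marker: &'static str) -> Self {
        self.markers.insert(marker);
        self
    }

    fn is_hidden(&self) -> bool {
        self.markers.contains("Hidden")
    }
}

/// Columns produced by `select *`: every field not marked `Hidden`.
fn expand_wildcard(schema: &[Field]) -> Vec<&str> {
    schema
        .iter()
        .filter(|f| !f.is_hidden())
        .map(|f| f.name.as_str())
        .collect()
}

/// Explicit column references resolve against the full schema,
/// so hidden columns remain selectable by name.
fn resolve<'a>(schema: &'a [Field], name: &str) -> Option<&'a Field> {
    schema.iter().find(|f| f.name == name)
}

/// The example schema from above, reduced to names and markers.
fn example_schema() -> Vec<Field> {
    vec![
        Field::new("user_id"),
        Field::new("amount"),
        Field::new("file_location").with_marker("FileLocation"),
        Field::new("row_index").with_marker("Hidden").with_marker("RowNumber"),
        Field::new("__metadata").with_marker("Hidden"),
    ]
}

fn main() {
    let schema = example_schema();
    println!("select *          -> {:?}", expand_wildcard(&schema));
    println!("select row_index  -> {:?}", resolve(&schema, "row_index").map(|f| &f.name));
}
```

With this split, a `TableProvider` that wants metadata columns included in
`select *` would simply not attach the `Hidden` marker, so the planner itself
would not need a hard-coded policy.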