jkylling commented on issue #20135:
URL: https://github.com/apache/datafusion/issues/20135#issuecomment-3843241290

   If we go with metadata columns we need to answer questions like:
   - Should virtual columns be part of `select * from table`?
   - Should we expose a single column with a system struct, like `select 
__metadata.row_number from table`, or flat columns, like `select __row_number, 
__file_name from table`?
   - Should virtual columns show up when you do `describe table` or `select * 
from information_schema.columns`?
   - What happens if there is a name conflict between a metadata column and an 
ordinary table column?
   - What should the name of the metadata columns be?
   
   Instead of answering yes or no to the questions, could we support both 
alternatives, and leave it up to the `TableProvider` to make the decision? 
Within DataFusion we might have an opinionated default answer for 
`ListingTable`, but other engines would be free to include metadata column(s) 
in `select * from table` (or make it configurable), or decide if they want flat 
metadata columns or bundle them all in a struct.
   
   Maybe we could achieve this with a simple extension type on the schema 
fields? A `TableProvider` could return a schema like
   ```
   Field::new("user_id", DataType::Int64, false),
   Field::new("amount", DataType::Float64, false),
   Field::new("file_location", DataType::Utf8, false).with_extension_type(FileLocation),
   Field::new("row_index", DataType::Int64, false).with_extension_type(Hidden).with_extension_type(RowNumber),
   Field::new("__metadata", large_metadata_struct, false).with_extension_type(Hidden),
   ```
   Here the columns with the `Hidden` extension type would be omitted from 
`select * from table`, the `FileLocation` extension type would be picked up by 
the `FileSource` to add the file location, the `RowNumber` extension type would 
be picked up by the Arrow Parquet reader, and the `large_metadata_struct` could 
be a large struct with metadata columns (file size, last modified time, row 
group id, etc.) which are resolved within the `TableProvider::scan` method and 
the sources it uses.
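
To make the intended behaviour concrete, here is a minimal sketch of how the markers might drive wildcard expansion and source selection. It uses only the Rust standard library: `Marker`, `Column`, `expand_wildcard`, and `columns_with` are hypothetical stand-ins invented for illustration, not Arrow or DataFusion APIs, and the real design would attach this information to `Field`s rather than to a separate type.

```rust
use std::collections::HashSet;

/// Hypothetical markers standing in for the extension types named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Marker {
    Hidden,
    FileLocation,
    RowNumber,
}

/// Simplified stand-in for a schema field: a name plus its markers.
/// (Assumption: a field may carry more than one marker, as `row_index` does.)
#[derive(Debug, Clone)]
struct Column {
    name: &'static str,
    markers: HashSet<Marker>,
}

fn column(name: &'static str, markers: &[Marker]) -> Column {
    Column {
        name,
        markers: markers.iter().copied().collect(),
    }
}

/// The example schema from the comment, expressed with the stand-in types.
fn example_schema() -> Vec<Column> {
    vec![
        column("user_id", &[]),
        column("amount", &[]),
        column("file_location", &[Marker::FileLocation]),
        column("row_index", &[Marker::Hidden, Marker::RowNumber]),
        column("__metadata", &[Marker::Hidden]),
    ]
}

/// Columns that `select * from table` would expand to: everything not `Hidden`.
fn expand_wildcard(schema: &[Column]) -> Vec<&'static str> {
    schema
        .iter()
        .filter(|c| !c.markers.contains(&Marker::Hidden))
        .map(|c| c.name)
        .collect()
}

/// Columns a given component (file source, Parquet reader, ...) would populate,
/// selected by the marker it knows how to handle.
fn columns_with(schema: &[Column], marker: Marker) -> Vec<&'static str> {
    schema
        .iter()
        .filter(|c| c.markers.contains(&marker))
        .map(|c| c.name)
        .collect()
}
```

With this schema, `expand_wildcard` keeps `user_id`, `amount`, and `file_location`, while `row_index` and `__metadata` stay reachable only by explicit name. A provider that wants metadata in `select *` would simply not mark those columns as hidden.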
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

