mbutrovich commented on PR #22026:
URL: https://github.com/apache/datafusion/pull/22026#issuecomment-4389588406

   > I hate to bring this up but do you think think this approach is compatible 
with all of the proposed UX versions of this? I don't know that the UDF version 
is blessed necessarily, I want to make sure we don't back ourselves into a 
corner. I think the answer is yes but just want to confirm.
   
   Nah I think it's a good thing to sort out now from an API stability 
standpoint. It's a necessary discussion even if we're trying to carve out a 
small portion of the work to be a more bite-size PR now. I think the right lens 
is API stability: what does this PR add to the public surface that we'd have to 
live with (or break) if a later UX decision wants something different?
   
   The answer is a pretty small surface:
   
   * `TableSchema::with_virtual_columns(Vec<FieldRef>)` and `virtual_columns()` 
getter
   * `TableSchema::schema_without_virtual_columns()`
   * `ParquetVirtualColumn` enum in `datasource-parquet`
   
   The shape that's actually load-bearing across all of these is one sentence: 
**a virtual column is a leaf `FieldRef` carrying an Arrow extension type, held 
on `TableSchema` in its own bucket**.
   
   Everything else in the PR is either internal to the opener (schema 
augmentation during rewrite, projection-mask stripping, the pushdown defensive 
check) or a convention we can change without breaking callers (the `[file, 
partition, virtual]` ordering in `table_schema()`, which is just a projection 
away from any other layout).
   
   Mapping that against the UX directions in #20135:
   
   * UDF rewrite (`input_file_name()` shape, now applied to a reader-produced 
value): planner rewrites the UDF to a `Column` ref, TableProvider puts a 
`RowNumber`-tagged field in `virtual_columns`. Same seam as #20071.
   * `_metadata` struct column (Spark / Databricks): TableProvider exposes the 
struct at the SQL surface, projection pushdown reduces `_metadata.row_index` to 
a flat leaf, opener produces the leaf, a projection above the scan rebuilds the 
struct.
   * Relation-scoped hidden columns (`a.row_number`): TableProvider tags a 
hidden column with `RowNumber` and adds it to `virtual_columns` when projected. 
Hiding policy sits above.
   * @jkylling's extension-type tagging (`Hidden` + `RowNumber`): most direct 
fit, extension types are the mechanism already.
   
   All four land on the same seam. None of them would force us to break the 
`TableSchema` methods or the enum.
   
   The one case where we'd feel locked in is if a later design wants to model a 
"virtual column" that isn't backed by an arrow-rs extension type (a purely 
DataFusion-synthesized value, say). That would want a different seam. I'd argue 
that's a different concept anyway and shouldn't share the `virtual_columns` 
bucket, but worth noting.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to