djanderson opened a new issue, #18059: URL: https://github.com/apache/datafusion/issues/18059
### Is your feature request related to a problem or challenge?

In various parts of the codebase, it's clear that a `projection: Option<&Vec<usize>>` value of `None` is intended to mean essentially `select *`, i.e. no explicit subset of the table's columns. One important place this shows up is the projection parameter of the [TableProvider::scan()](https://docs.rs/datafusion/latest/datafusion/datasource/trait.TableProvider.html#tymethod.scan) function.

However, the high-level interfaces for calling this seem to _never pass `None` for that parameter_.

## Concrete use-case

Imagine I have a custom table provider, `VersionedTableProvider`, which abstracts over a version-enabled object store bucket to allow the user to write queries like `select * from t where version = 'v1';`.

A user then pushes a new table `t` with the following data. This is stored in a custom index as `v1`.

```
+----+----+----+
| id | a  | b  |
+----+----+----+
| 0  | 10 | 30 |
| 1  | 20 | 40 |
+----+----+----+
```

Then, they push another version, stored as `v2`:

```
+----+----+----+
| id | a  | c  |
+----+----+----+
| 0  | 10 | 50 |
| 1  | 20 | 60 |
+----+----+----+
```

Calling `::schema()` on this `VersionedTableProvider` gives me a combined schema like `["id", "a", "b", "c", "version"]`.

If the user explicitly requests a column that is not in any version of the table, DataFusion will throw an error while building the logical plan: `SELECT f FROM t;` -> `FieldNotFound: no field f in t`.

However, what about the following 2 cases (no explicit `version` filter implies _latest_)?

1. `SELECT * FROM t;`
2. `SELECT a, b, c FROM t;`

I have been trying to handle both in `TableProvider::scan()`, but they need to be handled differently.
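To make the combined schema concrete, here is a minimal std-only sketch of how it could be assembled from the two versions above. `combined_schema` is a hypothetical helper, and columns are modeled as plain strings rather than Arrow fields, so this only illustrates the idea and is not necessarily how `VersionedTableProvider` builds its schema.

```rust
/// Hypothetical helper (not part of DataFusion): union of all version
/// schemas in first-seen order, followed by a synthetic `version` column.
fn combined_schema(versions: &[Vec<&str>]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for schema in versions {
        for col in schema {
            // Only add a column the first time any version mentions it.
            if !out.iter().any(|c| c.as_str() == *col) {
                out.push(col.to_string());
            }
        }
    }
    // Assumption: the version selector is exposed as an extra column.
    out.push("version".to_string());
    out
}
```

Under these assumptions, calling it with `v1 = ["id", "a", "b"]` and `v2 = ["id", "a", "c"]` yields `["id", "a", "b", "c", "version"]`, so an explicit `SELECT a, b, c` would correspond to projection indices `[1, 2, 3]` into that combined schema.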
Let's say the beginning of my custom `scan()` impl looks like this:

```rust
let versioned_table = match parse_version_from_filters(filters) {
    Some(version) => self.versioned_table(&version)?,
    None => self.latest_versioned_table(),
};
// This may differ from self.schema(); it's the schema of the actual versioned parquet file.
let file_schema = versioned_table.schema();
let projected_schema = project_schema(&file_schema, &file_projection)?;
...
```

where [project_schema](https://docs.rs/datafusion-common/latest/src/datafusion_common/utils/mod.rs.html#74-83) handles a `None` projection correctly.

## What I'm expecting should happen

Case 1 should be a valid query. Since I have the `projection` and `filter` expr list, I can confirm the version the user wants or default to latest. Then I should be able to simply project `None` onto the `file_schema`, and the user gets what they expect.

Case 2 should be an error. Since the user has explicitly requested a column `b` that is not in the requested version of the table (`v2`), I have enough information to return a very specific diagnostic: `field b not found in t version v2`.

This should be trivial because `SELECT *` should seemingly, as documented in various places, flow down a `None` projection to `scan()`.

## What's actually happening

When using the dataframe API or `ctx.sql`, I seem to always get a `Some` projection. Earlier in the call chain, for example looking at the LogicalPlan right after `sql_to_statement` on `SELECT * FROM t;`, I see

    Projection {
        expr: vec![... every col in the table schema],
        input: TableScan { ..., projection: **None**, ... },
        schema: ...,
    }

Interestingly, the input `TableScan` logical plan correctly stores the `None` value of projection. But for some reason that's not what gets passed to `scan()`.
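The handling I expect for the two cases can be sketched as follows. This is a std-only illustration under stated assumptions: `resolve_projection` is a hypothetical helper, schemas are modeled as slices of column names instead of Arrow schemas, the table name `t` is hard-coded in the message, and the synthetic `version` column is not special-cased.

```rust
/// Hypothetical helper (not a DataFusion API): turn the projection that
/// `scan()` received into column names of one version's file schema.
/// `projection` indexes into `table_schema`, the combined schema.
fn resolve_projection(
    table_schema: &[&str],
    file_schema: &[&str],
    projection: Option<&Vec<usize>>,
    version: &str,
) -> Result<Vec<String>, String> {
    match projection {
        // No explicit column list (`SELECT *`): return whatever the file has.
        None => Ok(file_schema.iter().map(|c| c.to_string()).collect()),
        // Explicit column list: every requested column must exist in the file.
        Some(indices) => {
            let mut cols = Vec::with_capacity(indices.len());
            for &i in indices {
                let name = table_schema
                    .get(i)
                    .ok_or_else(|| format!("projection index {i} out of bounds"))?;
                if !file_schema.contains(name) {
                    return Err(format!("field {name} not found in t version {version}"));
                }
                cols.push(name.to_string());
            }
            Ok(cols)
        }
    }
}
```

With the `v2` file schema `["id", "a", "c"]`, a `None` projection resolves to all three file columns, while `Some(&vec![1, 2, 3])` (columns `a`, `b`, `c` of the combined schema) produces the specific diagnostic for `b`. If `SELECT *` arrives as `Some` of every index instead of `None`, it falls into the error branch too, which is exactly the ambiguity described here.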
### Describe the solution you'd like

`SELECT *` should map to a projection of `None` in `TableProvider::scan()` so that, when there's a mismatch between the full table schema and a specific file schema, I can determine whether or not the user requested an invalid column in an explicit projection list.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
