mbutrovich commented on PR #22026:
URL: https://github.com/apache/datafusion/pull/22026#issuecomment-4390069611

   > Thanks @mbutrovich I'll do a coder review later today/tomorrow , but 
another high level question is on the schema column name collision between user 
columns and virtual columns? how it can affect already existing user queries?
   
   Current DataFusion behavior would delegate that to figure out in #20135. 
Using partition columns as an example: DataFusion does not validate 
partition-vs-file-column collisions in the core code path. 
ListingTable::try_new (catalog-listing/src/table.rs:216) just pushes partition 
fields onto the file schema via SchemaBuilder::push, which arrow documents as 
"without checking for collision” (arrow-schema/src/schema.rs:48), and finish 
doesn't dedupe. Same in our TableSchema::build_table_schema.
   
   The only place collision is handled is the SQL-layer ListingTableFactory 
(listing_table_factory.rs:133), which for `CREATE EXTERNAL TABLE ... 
PARTITIONED BY (c1)` explicitly excludes c1 from the file schema before 
constructing options. It solves the problem by construction for the SQL entry 
point but doesn't guard the data structure itself. A caller constructing 
ListingTable/TableSchema programmatically with colliding names gets silent 
duplicates.
   
   So existing convention is: upstream avoids bad names, core structs don't 
validate.
   
   That said, I am not opposed to enforcing the contract if we want to make it 
explicit now that `with_virtual_columns` validates. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to