jorgecarleitao commented on pull request #8839:
URL: https://github.com/apache/arrow/pull/8839#issuecomment-739532745


   Thanks a lot for looking at this. All excellent points. I now see that this 
is tricky :)
   
   Thinking about what you wrote: if the logical plan names the columns 
`t1.a, t2.a`, wouldn't the column names become `a, a` on the `RecordBatch`? 
I.e., there would be a discrepancy between the schema provided by 
`df.schema()` and the `RecordBatch::schema()` of the batches returned by 
`collect()`, no?
   
   I think that this will happen even if we pass `DFSchema` to the physical 
plan (1.) or use indexes (3.), since any map `qualified name -> unqualified 
name` is lossy: it drops the qualifier, which can then never be recovered from 
the `RecordBatch`'s schema.
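
   To make the lossiness concrete, here is a minimal std-only sketch. The helpers `unqualify` and `all_unique` are hypothetical names for illustration and are not part of Arrow or DataFusion; the sketch also assumes a qualified name is simply `qualifier.column`.

```rust
use std::collections::HashSet;

/// Hypothetical helper: drop the qualifier from a `qualifier.column` name.
/// Names without a `.` are returned unchanged.
fn unqualify(name: &str) -> &str {
    match name.rsplit_once('.') {
        Some((_qualifier, column)) => column,
        None => name,
    }
}

/// Hypothetical helper: true when no name occurs twice in `names`.
fn all_unique(names: &[&str]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false on a duplicate, which stops `all` early.
    names.iter().all(|n| seen.insert(*n))
}
```

   Applied to `["t1.a", "t2.a", "c"]`, `unqualify` produces `["a", "a", "c"]`: the logical names are unique, the unqualified ones are not, and nothing in the result says which `a` came from which table.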
   
   This IMO leaves us with 2., which is what I would try: change the physical 
planner to alias/rewrite column names with the qualifier when the physical plan 
is created. This will cause the resulting `RecordBatch`'s schema to have 
columns named `t1.a` and `t2.a`, thereby guaranteeing the invariant that the 
output schema of the physical execution matches the schema of the logical plan.
   
   I.e., the invariant that `SELECT t1.a, t2.a, c ...` yields a schema whose 
columns are named `["t1.a", "t2.a", "c"]` is preserved. 
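
   A rough sketch of what option 2 could look like, using a hypothetical `Column` struct rather than DataFusion's actual types: the physical planner derives each output field name from the qualifier plus the column name, so the names match the logical schema.

```rust
/// Hypothetical stand-in for a logical column reference
/// (not DataFusion's real type).
struct Column {
    /// Table qualifier, e.g. `t1`; `None` for an unqualified column.
    qualifier: Option<String>,
    name: String,
}

/// Name the physical output field after the qualified logical name.
fn physical_name(col: &Column) -> String {
    match &col.qualifier {
        Some(q) => format!("{}.{}", q, col.name),
        None => col.name.clone(),
    }
}

/// Field names of the physical output schema, in projection order.
fn physical_schema_names(cols: &[Column]) -> Vec<String> {
    cols.iter().map(physical_name).collect()
}
```

   With this naming rule, the projection `t1.a, t2.a, c` produces the field names `t1.a`, `t2.a` and `c`.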
   
   Note that we already do this when performing coercion: we preserve the 
logical schema name by injecting cast ops during physical (and not logical) 
planning, so that if the user wrote `SELECT sqrt(f32) ...`, the resulting name 
on the `RecordBatch::schema()` is `sqrt(f32)`, even if the physical operation 
performed was `sqrt(CAST(f32 as Float64))`.
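
   The coercion case can be illustrated with a toy expression tree. This is only a sketch under assumptions: the `Expr` enum, `display_name`, `coerce` and `plan` are hypothetical, and the toy `coerce` always casts the argument of `sqrt` to `Float64` instead of inspecting types as a real planner would.

```rust
/// Toy expression tree (hypothetical, not DataFusion's `Expr`).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column(String),
    Sqrt(Box<Expr>),
    /// Expression to cast and the name of the target type.
    Cast(Box<Expr>, String),
}

/// Render an expression the way it would appear as a schema field name.
fn display_name(e: &Expr) -> String {
    match e {
        Expr::Column(name) => name.clone(),
        Expr::Sqrt(inner) => format!("sqrt({})", display_name(inner)),
        Expr::Cast(inner, ty) => format!("CAST({} as {})", display_name(inner), ty),
    }
}

/// Toy coercion: wrap the argument of `sqrt` in a cast to Float64.
fn coerce(e: &Expr) -> Expr {
    match e {
        Expr::Sqrt(inner) => Expr::Sqrt(Box::new(Expr::Cast(
            Box::new(coerce(inner)),
            "Float64".to_string(),
        ))),
        other => other.clone(),
    }
}

/// "Physical planning": the output name comes from the logical expression,
/// while the expression that is executed is the coerced one.
fn plan(logical: &Expr) -> (String, Expr) {
    (display_name(logical), coerce(logical))
}
```

   For `sqrt(f32)`, `plan` returns the output name `sqrt(f32)` together with an executed expression that renders as `sqrt(CAST(f32 as Float64))`, which is the same separation between name and operation described above.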

