jorgecarleitao commented on pull request #8839: URL: https://github.com/apache/arrow/pull/8839#issuecomment-739532745
Thanks a lot for looking at this. All excellent points. I now see that this is tricky :)

Thinking about what you wrote: if we plan the logical plan as `t1.a, t2.a`, wouldn't the column names become `a, a` on the `RecordBatch`? That is, there would be a discrepancy between the schema provided by `df.schema()` and the `RecordBatch::schema()` of the batches returned by `collect()`, no?

I think that this will happen even if we pass `DFSchema` to the physical plan (1.) or use indexes (3.), because any map `qualified name -> unqualified` is lossy (it drops the qualifier), and the qualifier is therefore never recoverable from the `RecordBatch`'s schema.

This IMO leaves us with option 2., which is what I would try: change the physical planner to alias/rewrite column names with the qualifier when the physical plan is created. This will cause the resulting `RecordBatch`'s schema to have columns named `t1.a` and `t2.a`, thereby guaranteeing the invariant that the output schema of the physical execution matches the schema of the logical plan. In other words, the invariant that `SELECT t1.a, t2.a, c ...` yields a schema whose columns are named `["t1.a", "t2.a", "c"]` is preserved.

Note that we already do this when performing coercion: we preserve the logical schema name by injecting cast ops during physical (and not logical) planning, so that if the user wrote `SELECT sqrt(f32) ...`, the resulting name on the `RecordBatch::schema()` is `sqrt(f32)`, even if the physical operation performed was `sqrt(CAST(f32 as Float64))`.
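To make the naming argument concrete, here is a minimal std-only sketch. It does not use the DataFusion or Arrow APIs: `QualifiedColumn`, `physical_names_unqualified`, `physical_names_aliased` and `has_duplicates` are hypothetical names invented for illustration. It only shows why dropping the qualifier cannot be undone, and why aliasing with the qualifier keeps the physical names equal to the logical ones.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a logical column: an optional table
/// qualifier plus a field name. Not a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
struct QualifiedColumn {
    qualifier: Option<String>,
    name: String,
}

impl QualifiedColumn {
    fn new(qualifier: Option<&str>, name: &str) -> Self {
        Self {
            qualifier: qualifier.map(str::to_string),
            name: name.to_string(),
        }
    }

    /// Name as it would appear in the logical schema, e.g. "t1.a" or "c".
    fn qualified_name(&self) -> String {
        match &self.qualifier {
            Some(q) => format!("{}.{}", q, self.name),
            None => self.name.clone(),
        }
    }
}

/// Physical names if the qualifier is simply dropped (the lossy map).
fn physical_names_unqualified(cols: &[QualifiedColumn]) -> Vec<String> {
    cols.iter().map(|c| c.name.clone()).collect()
}

/// Physical names if each column is aliased with its qualified name,
/// which is the approach assumed for option 2.
fn physical_names_aliased(cols: &[QualifiedColumn]) -> Vec<String> {
    cols.iter().map(|c| c.qualified_name()).collect()
}

/// True if any name occurs more than once.
fn has_duplicates(names: &[String]) -> bool {
    let mut seen = HashSet::new();
    names.iter().any(|n| !seen.insert(n.as_str()))
}
```

For the columns `t1.a`, `t2.a` and `c`, the unqualified variant produces `a, a, c` (ambiguous, and different from the logical schema), while the aliased variant produces `t1.a, t2.a, c`, which matches the logical names one to one.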
