AndreaBozzo commented on issue #16054:
URL: https://github.com/apache/datafusion/issues/16054#issuecomment-3824269294
I worked on this in PR #19916 (now closed) and want to share the findings,
since the issue is more nuanced than it appears.
### The root problem: `SchemaDisplay` serves two conflicting purposes
The CLI column header for `select (1+2)*3` comes from this chain:
```
Expr::qualified_name()
→ Expr::schema_name()
→ SchemaDisplay(self).to_string()
→ used as Field name in DFSchema
→ propagated to physical plan schema
→ used by Arrow pretty printer as column header
```
`SchemaDisplay` for `BinaryExpr` currently has **no parenthesization at
all**:
```rust
Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
write!(f, "{} {op} {}", SchemaDisplay(left), SchemaDisplay(right))
}
```
This is why `(1+2)*3` displays as `Int64(1) + Int64(2) * Int64(3)` — it's
misleading.
However, `SchemaDisplay` output is also used as a **field-matching key by
optimizers** (e.g. `common_subexpr_eliminate`, `push_down_filter`,
`decorrelate`). These optimizers compare `schema_name()` strings to match
fields across plan nodes. Changing the format of `SchemaDisplay` changes these
keys, which breaks field resolution.
I hit this problem in #19916: adding parentheses to `SchemaDisplay` broke
optimizer tests, so I had to revert it. But without changing `SchemaDisplay`,
the CLI column headers can't be fixed — because that's where they come from.
### What's NOT broken
- **`Display` for `BinaryExpr`** already has correct precedence-based
parenthesization on main. `(1+2)*3` displays as `(Int64(1) + Int64(2)) *
Int64(3)`. This is used in `EXPLAIN` plans and debug logging.
- **`SqlDisplay`** has no parenthesization, but adding it (with precedence
logic) would be straightforward.
### What a proper fix would need
The core issue is that `SchemaDisplay` is **dual-purpose**: it's both a
user-facing column name and an internal optimizer field-matching key. These two
concerns need different formatting, but they're currently coupled through
`Expr::schema_name()`.
A proper fix would likely need to decouple these — for example by
introducing a separate `display_name()` for user-facing output while keeping
`schema_name()` stable for optimizer internals. This is a bigger architectural
change than just "add parentheses."
Wanted to document this here so future contributors don't hit the same wall.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]