AlonSpivack commented on issue #15161:
URL: https://github.com/apache/datafusion/issues/15161#issuecomment-3941065101
Thanks @neilconway for putting together such a clear comparison of the two
approaches.
After reviewing both PRs, I'm in favor of **Approach 2 (PR #20456)**. Here's
my reasoning:
The key advantage of Approach 2 is that it catches errors at **planning
time** rather than deferring them to runtime. Under Approach 1, `text_col < 5`
silently inserts a `CAST(text_col AS Int64)` that will crash at runtime if the
column contains non-numeric data. The user gets no indication of the problem
until a specific row triggers the cast failure, which could happen
intermittently depending on the data. Approach 2 rejects this at planning time
with a clear type mismatch error, which is the correct behavior.
The `IN` list handling is also a significant differentiator. Under Approach
1, `int_col IN ('5', '10')` still routes through `type_union_coercion`,
resulting in lexicographic comparison - the exact class of bug we're trying to
fix. Approach 2 handles this correctly by converting the literals at planning
time.
Regarding @adriangb's concern about stricter coercion impacting user
experience: I think silently returning wrong results is a far worse UX than a
clear planning error with an actionable message. Users who genuinely need
string comparison can always use an explicit `CAST`.
---
**Looking further ahead**, I think it's worth acknowledging what @neilconway
already noted: Approach 2 is fundamentally an approximation of Postgres's
`unknown` type semantics. It works well for the common cases (literal
comparisons, `IN` lists, `BETWEEN`), but it won't cover non-literal string
expressions like `int_col = '2025' || '-' || '01'`, and it requires the
analyzer to handle each expression type individually.
The "right" long-term solution is probably introducing an `unknown` (or
`unresolved`) type that string literals receive initially, which then gets
resolved to a concrete type based on context - similar to how Postgres handles
this. That would solve the problem at its root and cover all expression forms
uniformly, not just the ones we explicitly handle in the analyzer.
I fully understand this is a large undertaking - it touches the type system,
the parser/planner, the analyzer, coercion rules, built-in function signatures,
and likely hundreds of tests. It's not something to block the current fix on.
But if the community agrees this is the right long-term direction, I'd be
happy to take part in driving it forward - from the initial design/RFC through
implementation. I'm already deep in this area of the codebase due to hitting
this bug in production, and I'd like to turn that experience into a meaningful
contribution.
So my suggestion would be:
1. Merge PR #20456 as the immediate fix (it solves the critical correctness
issue now)
2. Open a separate issue/RFC to discuss the `unknown` type as a long-term
project
What do you think, @neilconway @alamb @adriangb?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]