Re: [I] Incorrect cast of integer columns to utf8 when comparing with utf8 constant [datafusion]

via GitHub Sun, 22 Feb 2026 06:33:34 -0800


AlonSpivack commented on issue #15161:
URL: https://github.com/apache/datafusion/issues/15161#issuecomment-3941065101


   Thanks @neilconway for putting together such a clear comparison of the two 
approaches.
   
   After reviewing both PRs, I'm in favor of **Approach 2 (PR #20456)**. Here's 
my reasoning:
   
   The key advantage of Approach 2 is that it catches errors at **planning 
time** rather than deferring them to runtime. Under Approach 1, `text_col < 5` 
silently inserts a `CAST(text_col AS Int64)` that will crash at runtime if the 
column contains non-numeric data. The user gets no indication of the problem 
until a specific row triggers the cast failure, which could happen 
intermittently depending on the data. Approach 2 rejects this at planning time 
with a clear type mismatch error, which is the correct behavior.
   
   The `IN` list handling is also a significant differentiator. Under Approach 
1, `int_col IN ('5', '10')` still routes through `type_union_coercion`, 
resulting in lexicographic comparison - the exact class of bug we're trying to 
fix. Approach 2 handles this correctly by converting the literals at planning 
time.
   
   Regarding @adriangb's concern about stricter coercion impacting user 
experience: I think silently returning wrong results is a far worse UX than a 
clear planning error with an actionable message. Users who genuinely need 
string comparison can always use an explicit `CAST`.
   
   ---
   
   **Looking further ahead**, I think it's worth acknowledging what @neilconway 
already noted: Approach 2 is fundamentally an approximation of Postgres's 
`unknown` type semantics. It works well for the common cases (literal 
comparisons, `IN` lists, `BETWEEN`), but it won't cover non-literal string 
expressions like `int_col = '2025' || '-' || '01'`, and it requires the 
analyzer to handle each expression type individually.
   
   The "right" long-term solution is probably introducing an `unknown` (or 
`unresolved`) type that string literals receive initially, which then gets 
resolved to a concrete type based on context - similar to how Postgres handles 
this. That would solve the problem at its root and cover all expression forms 
uniformly, not just the ones we explicitly handle in the analyzer.
   
   I fully understand this is a large undertaking - it touches the type system, 
the parser/planner, the analyzer, coercion rules, built-in function signatures, 
and likely hundreds of tests. It's not something to block the current fix on.
   
   But if the community agrees this is the right long-term direction, I'd be 
happy to take part in driving it forward - from the initial design/RFC through 
implementation. I'm already deep in this area of the codebase due to hitting 
this bug in production, and I'd like to turn that experience into a meaningful 
contribution.
   
   So my suggestion would be:
   1. Merge PR #20456 as the immediate fix (it solves the critical correctness 
issue now)
   2. Open a separate issue/RFC to discuss the `unknown` type as a long-term 
project
   
   What do you think, @neilconway @alamb @adriangb?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Incorrect cast of integer columns to utf8 when comparing with utf8 constant [datafusion]

Reply via email to