AlonSpivack commented on issue #15161: URL: https://github.com/apache/datafusion/issues/15161#issuecomment-3904440723
<html><head></head><body><p>I'm hitting this bug in production on <strong>DataFusion v52</strong>, and it's causing <strong>silently incorrect query results</strong> across multiple scenarios.</p> <p>I want to add important context beyond what's already discussed here - this bug is wider than originally reported.</p> <hr> <h3>1. PR #15110 (merged) only fixed a very narrow subset of the problem</h3> <p>The <code>cast_literal_to_type_with_op</code> function that was merged only handles:</p> <ul> <li><strong>Operators</strong>: <code>=</code> and <code>!=</code> only</li> <li><strong>Types</strong>: integer types only (<code>Int8</code>–<code>Int64</code>, <code>UInt8</code>–<code>UInt64</code>)</li> </ul> <p>This means <strong><code><</code>, <code>></code>, <code><=</code>, <code>>=</code> are still broken for integers</strong>, and <strong>all operators are broken for float types</strong> (<code>Float32</code>, <code>Float64</code>). PR #15482 attempted to fix the broader problem at the type coercion level but was closed as stale in June 2025.</p> <hr> <h3>2. Float columns are also affected - not just integers</h3> <p>The original issue only mentions integer columns, but float columns have the same problem - and in some cases it's even worse:</p> <pre><code class="language-sql">CREATE TABLE t AS SELECT * FROM (VALUES (1.0), (5.0), (5.5), (325.0), (499.0)) AS t(num_float); -- WRONG RESULTS: returns 499.0, 325.0 because '499.0' < '5' lexicographically SELECT * FROM t WHERE num_float < '5'; -- WRONG RESULTS: returns 0 rows! CAST(5.0 AS Utf8) produces '5.0' which != '5' SELECT * FROM t WHERE num_float = '5'; -- Might accidentally "work" because CAST(5.0 AS Utf8) = '5.0', but it's still doing string comparison SELECT * FROM t WHERE num_float = '5.0'; -- WRONG: same lexicographic issue SELECT * FROM t WHERE num_float > '5'; </code></pre> <p>The <code>num_float = '5'</code> case is especially dangerous - it <strong>silently returns zero rows</strong> instead of matching the value <code>5.0</code>, because the float gets cast to the string <code>"5.0"</code> which doesn't equal <code>"5"</code>. No error, no warning, just wrong data.</p> <hr> <h3>3. The integer <code><</code>/<code>></code> cases are equally broken</h3> <pre><code class="language-sql">CREATE TABLE t2 AS SELECT * FROM (VALUES (1), (5), (325), (499), (1000)) AS t2(num_int); -- WRONG RESULTS: returns 499, 325 etc. because '499' < '5' in string ordering SELECT * FROM t2 WHERE num_int < '5'; -- This one works (fixed by PR #15110), but only for = and != SELECT * FROM t2 WHERE num_int = '5'; </code></pre> <hr> <h3>4. What Postgres does (the correct behavior)</h3> <p>Postgres treats the string literal as "unknown" type and casts it to the column type - not the other way around:</p> <pre><code>postgres=# SELECT * FROM t WHERE num_float < '5'; -- Returns all values less than 5.0 (correct numeric comparison) postgres=# SELECT * FROM t WHERE num_float = '5'; -- Matches the value 5.0 (correctly casts '5' to float) postgres=# SELECT * FROM t WHERE num_float < 'hello'; ERROR: invalid input syntax for type double precision: "hello" </code></pre> <hr> <h3>5. Summary of broken scenarios in v52</h3> Scenario | Fixed? | What happens -- | -- | -- int_col = '5' | ✅ (PR #15110) | Correct int_col != '5' | ✅ (PR #15110) | Correct int_col < '5' | ❌ | Wrong results int_col > '5' | ❌ | Wrong results int_col <= '5' | ❌ | Wrong results int_col >= '5' | ❌ | Wrong results float_col = '5' | ❌ | Wrong results float_col = '5.0' | ✅ | Accidentally works via string match float_col != '5' | ❌ | Wrong results float_col < '5' | ❌ | Wrong results float_col > '5' | ❌ | Wrong results <hr> <h3>6. My thoughts on the fix</h3> <p>I think the ideal solution is what @scsmithr suggested - fix this at the <strong>type coercion level</strong>: when comparing a numeric column (int or float) with a string literal, <strong>cast the string literal to the column's numeric type</strong>, not the other way around. If the cast fails (e.g., <code>num_col < 'hello'</code>), throw a clear planning error like <code>Cannot coerce 'hello' to type 'Float64'</code>.</p> <p>At the very least, even if the full coercion fix takes time, I think DataFusion should <strong>throw an error/exception immediately</strong> when it encounters a string value on one side of any comparison operator (<code>=</code>, <code>!=</code>, <code><</code>, <code>></code>, <code><=</code>, <code>>=</code>) with a numeric column on the other side, with a message explaining that implicit string-to-column casting is not supported and that the types are incompatible. This would prevent the current situation where users silently get wrong results without any indication.</p> <p>If a user actually wants string comparison behavior, they can explicitly cast in the SQL:</p> <pre><code class="language-sql">-- Explicit cast: user is intentionally asking for string comparison SELECT * FROM t WHERE CAST(num_float AS TEXT) < '5'; </code></pre> <p>This way, the default behavior is safe (either correct numeric comparison or a clear error), and users who truly want string comparison can opt into it explicitly.</p> <hr> ### 7. Status question Is anyone currently working on a revised approach after PR #15482 was closed? In that PR, @alamb [[noted](https://github.com/apache/datafusion/pull/15482#discussion_r2026999963)](https://github.com/apache/datafusion/pull/15482#discussion_r2026999963) that the fix shouldn't check for `Expr::Literal` inside the type coercion rewriter, but instead should change the **base type coercion rules** themselves (which work on `DataType`, not on expression kind). He also pointed out that checking for literals won't handle string expressions like `date_col = '2025'||'-'||'02'`. Separately in PR #15110, @alamb [[agreed with @scsmithr](https://github.com/apache/datafusion/pull/15110#issuecomment-2715747690)](https://github.com/apache/datafusion/pull/15110#issuecomment-2715747690) that changing the coercion rules is a better plan than trying to fix it up afterwards in the optimizer. Has there been any progress on that direction? This is a **correctness bug** that silently returns wrong data. In my production system I was getting incorrect results for a long time before I tracked it down to this coercion issue. I'd love to help move this forward if there's a way I can contribute. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
