Re: [I] Incorrect cast of integer columns to utf8 when comparing with utf8 constant [datafusion]

via GitHub Sun, 15 Feb 2026 05:23:59 -0800


AlonSpivack commented on issue #15161:
URL: https://github.com/apache/datafusion/issues/15161#issuecomment-3904440723


   <html><head></head><body><p>I'm hitting this bug in production on 
<strong>DataFusion v52</strong>, and it's causing <strong>silently incorrect 
query results</strong> across multiple scenarios.</p>
   <p>I want to add important context beyond what's already discussed here - 
this bug is wider than originally reported.</p>
   <hr>
   <h3>1. PR #15110 (merged) only fixed a very narrow subset of the problem</h3>
   <p>The <code>cast_literal_to_type_with_op</code> function that was merged 
only handles:</p>
   <ul>
   <li><strong>Operators</strong>: <code>=</code> and <code>!=</code> only</li>
   <li><strong>Types</strong>: integer types only 
(<code>Int8</code>–<code>Int64</code>, 
<code>UInt8</code>–<code>UInt64</code>)</li>
   </ul>
   <p>This means <strong><code>&lt;</code>, <code>&gt;</code>, 
<code>&lt;=</code>, <code>&gt;=</code> are still broken for integers</strong>, 
and <strong>all operators are broken for float types</strong> 
(<code>Float32</code>, <code>Float64</code>). PR #15482 attempted to fix the 
broader problem at the type coercion level but was closed as stale in June 
2025.</p>
   <hr>
   <h3>2. Float columns are also affected - not just integers</h3>
   <p>The original issue only mentions integer columns, but float columns have 
the same problem - and in some cases it's even worse:</p>
   <pre><code class="language-sql">CREATE TABLE t AS SELECT * FROM (VALUES 
(1.0), (5.0), (5.5), (325.0), (499.0)) AS t(num_float);
   
   -- WRONG RESULTS: returns 499.0, 325.0 because '499.0' &lt; '5' 
lexicographically
   SELECT * FROM t WHERE num_float &lt; '5';
   
   -- WRONG RESULTS: returns 0 rows! CAST(5.0 AS Utf8) produces '5.0' which != 
'5'
   SELECT * FROM t WHERE num_float = '5';
   
   -- Might accidentally "work" because CAST(5.0 AS Utf8) = '5.0', but it's 
still doing string comparison
   SELECT * FROM t WHERE num_float = '5.0';
   
   -- WRONG: same lexicographic issue
   SELECT * FROM t WHERE num_float &gt; '5';
   </code></pre>
   <p>The <code>num_float = '5'</code> case is especially dangerous - it 
<strong>silently returns zero rows</strong> instead of matching the value 
<code>5.0</code>, because the float gets cast to the string <code>"5.0"</code> 
which doesn't equal <code>"5"</code>. No error, no warning, just wrong data.</p>
   <hr>
   <h3>3. The integer <code>&lt;</code>/<code>&gt;</code> cases are equally 
broken</h3>
   <pre><code class="language-sql">CREATE TABLE t2 AS SELECT * FROM (VALUES 
(1), (5), (325), (499), (1000)) AS t2(num_int);
   
   -- WRONG RESULTS: returns 499, 325 etc. because '499' &lt; '5' in string 
ordering  
   SELECT * FROM t2 WHERE num_int &lt; '5';
   
   -- This one works (fixed by PR #15110), but only for = and !=
   SELECT * FROM t2 WHERE num_int = '5';
   </code></pre>
   <hr>
   <h3>4. What Postgres does (the correct behavior)</h3>
   <p>Postgres treats the string literal as "unknown" type and casts it to the 
column type - not the other way around:</p>
   <pre><code>postgres=# SELECT * FROM t WHERE num_float &lt; '5';
   -- Returns all values less than 5.0 (correct numeric comparison)
   
   postgres=# SELECT * FROM t WHERE num_float = '5';
   -- Matches the value 5.0 (correctly casts '5' to float)
   
   postgres=# SELECT * FROM t WHERE num_float &lt; 'hello';
   ERROR:  invalid input syntax for type double precision: "hello"
   </code></pre>
   <hr>
   <h3>5. Summary of broken scenarios in v52</h3>
   
   Scenario | Fixed? | What happens
   -- | -- | --
   int_col = '5' | ✅ (PR #15110) | Correct
   int_col != '5' | ✅ (PR #15110) | Correct
   int_col < '5' | ❌ | Wrong results
   int_col > '5' | ❌ | Wrong results
   int_col <= '5' | ❌ | Wrong results
   int_col >= '5' | ❌ | Wrong results
   float_col = '5' | ❌ | Wrong results
   float_col = '5.0' | ✅ | Accidentally works via string match
   float_col != '5' | ❌ | Wrong results
   float_col < '5' | ❌ | Wrong results
   float_col > '5' | ❌ | Wrong results
   
   
   <hr>
   <h3>6. My thoughts on the fix</h3>
   <p>I think the ideal solution is what @scsmithr suggested - fix this at the 
<strong>type coercion level</strong>: when comparing a numeric column (int or 
float) with a string literal, <strong>cast the string literal to the column's 
numeric type</strong>, not the other way around. If the cast fails (e.g., 
<code>num_col &lt; 'hello'</code>), throw a clear planning error like 
<code>Cannot coerce 'hello' to type 'Float64'</code>.</p>
   <p>At the very least, even if the full coercion fix takes time, I think 
DataFusion should <strong>throw an error/exception immediately</strong> when it 
encounters a string value on one side of any comparison operator 
(<code>=</code>, <code>!=</code>, <code>&lt;</code>, <code>&gt;</code>, 
<code>&lt;=</code>, <code>&gt;=</code>) with a numeric column on the other 
side, with a message explaining that implicit string-to-column casting is not 
supported and that the types are incompatible. This would prevent the current 
situation where users silently get wrong results without any indication.</p>
   <p>If a user actually wants string comparison behavior, they can explicitly 
cast in the SQL:</p>
   <pre><code class="language-sql">-- Explicit cast: user is intentionally 
asking for string comparison
   SELECT * FROM t WHERE CAST(num_float AS TEXT) &lt; '5';
   </code></pre>
   <p>This way, the default behavior is safe (either correct numeric comparison 
or a clear error), and users who truly want string comparison can opt into it 
explicitly.</p>
   <hr>
   
   ### 7. Status question
   
   Is anyone currently working on a revised approach after PR #15482 was 
closed? 
   
   In that PR, @alamb 
[[noted](https://github.com/apache/datafusion/pull/15482#discussion_r2026999963)](https://github.com/apache/datafusion/pull/15482#discussion_r2026999963)
 that the fix shouldn't check for `Expr::Literal` inside the type coercion 
rewriter, but instead should change the **base type coercion rules** themselves 
(which work on `DataType`, not on expression kind). He also pointed out that 
checking for literals won't handle string expressions like `date_col = 
'2025'||'-'||'02'`. Separately in PR #15110, @alamb [[agreed with 
@scsmithr](https://github.com/apache/datafusion/pull/15110#issuecomment-2715747690)](https://github.com/apache/datafusion/pull/15110#issuecomment-2715747690)
 that changing the coercion rules is a better plan than trying to fix it up 
afterwards in the optimizer.
   
   Has there been any progress on that direction? This is a **correctness bug** 
that silently returns wrong data. In my production system I was getting 
incorrect results for a long time before I tracked it down to this coercion 
issue. I'd love to help move this forward if there's a way I can contribute.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Incorrect cast of integer columns to utf8 when comparing with utf8 constant [datafusion]

Reply via email to