[ 
https://issues.apache.org/jira/browse/DERBY-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482938
 ] 

A B commented on DERBY-2256:
----------------------------

It looks like the various incorrect results reported in this Jira are all 
caused by one of two different but related problems.

Problem I) Compile-time comparisons.

The first problem exists in the preprocessing code of InListOperatorNode, where 
an IN list comprised exclusively of constant literals undergoes three potential 
transformations:

  0. First, if the list has a single value in it, it will be replaced by
     an equality predicate, and that's it.  So an IN list such as
     "... i in (3)" will become "i = 3".
  1. Else, the list will be sorted in ascending order.
  2. Then, in the special case where the minimum and maximum values are
     the same, the IN list will be replaced with a simple equality
     predicate.  So an IN list such as " ... i in (2, 2, 2)" will actually
     turn into an equality of the form "i = 2".

The problem here is that the code to sort the list, and also the code to find 
the min/max values, always does comparisons using the type of the left operand, 
as can be seen from the following comment in the code:

    /* When sorting or choosing min/max in the list, if types
     * are not an exact match, we use the left operand's type
     * as the "judge", assuming that they are compatible, as
     * also the case with DB2.
     */

But it turns out that this is *not* always the correct thing to do--and despite 
what the comment says, this does not appear to be what DB2 does, either (all of 
the queries referenced in this Jira issue return the correct results on DB2 v8).

To see why this is wrong, let's assume we have a table T1 with an integer 
column "I" whose rows are simply:

    I
    ----
    1
    2
    3
    4
    5

Now let's take the following query:

  select * from t1 where i in (4.23, 4);

In this case "i" is the left operand, so the left operand's type is INTEGER.  
Derby will then bubble-sort the IN list values using INTEGER as the "judge" 
type for comparisons.  When doing integer comparisons with decimal values, 
Derby will compare with a _truncated_ version of the decimal(s), which means 
that we will effectively end up with something like:

  if (trunc(4.23) > 4)
    <swap "4.23" with "4">

which becomes:

  if (4 > 4)
    <swap "4.23" with "4>

Since 4 is not greater than 4, the "if" condition will return false and thus we 
will not swap the values; we'll just leave them as they are and consider them 
"sorted"--which is wrong.

If we keep going, we will then check to see if the mininum value is equal to 
the maximum value, and if so we'll rewrite the IN list as an equality 
predicate.  That said, the code assumes that the values have been correctly 
sorted, in which case the minimum value should simply be the first value in the 
list.  So Derby will at this point think that the minimum value is "4.23" 
(which is wrong).

So in pseudo-code we'll have:

  if (minVal == maxVal)
    <transform to equality>

which becomes

  if (trunc(4.23) == 4)
    <transform to equality>

which becomes

  if (4 == 4)
    <transform to equality>

As a result, we transform the IN list into a simple equality of the form "i = 
<minValue>".  But because of the incorrect sort we still think that our minimum 
value is "4.23", so the query that we ultimately end up executing is:

  select * from t1 where i = 4.23

Thus the query returns no rows when it should have returned one.  Notice that 
if the IN list values are reversed, i.e.:

  select * from t1 where i in (4, 4.23);

then all of the same (incorrect) logic will apply, except that the minimum 
value will be "4" instead of "4.23" (because "4" appears first in the list).  
Thus the query gets rewritten to:

  select * from t1 where i = 4

which returns the expected row (somewhat by accident).  So this explains the 
behavior posted by Yip in an earlier comment.

Problem II) Execution-time comparisons.

Even if the incorrect logic described above is fixed, we will still have cases 
where certain queries return the wrong results.  The reason is that the 
execution-time logic for IN lists has the same problem as the compilation-time 
logic: namely, Derby does all of the IN-list comparisons using the left 
operand's type, which is wrong.  The relevant code can be found in the 
DataType.in() method, where we start with the following comment:

    /* Do a binary search if the list is ordered until the 
     * range of values to search is 3 or less.
     *
     * NOTE: We've ensured that the IN list and the left all have
     * the same precedence at compile time.  If we don't enforce 
     * the same precendence then we could get the wrong result
     * when doing a binary search.
     */

Ironically enough, this comment makes reference to the problem at hand here: if 
we don't enforce the correct precedence when comparing values, we can get wrong 
results.  The problem is that the "NOTE" in this comment is WRONG--despite what 
it says, we actually did *NOT* ensure that the IN list and the left operand had 
the same precedence at compile time.  Even if we assume that the preprocessing 
logic is correct, it doesn't actually *change* the types of any of the IN-list 
values--so when we get to execution, the values *can* in fact have different 
type precedences.  Because of this broken assumption, the rest of the code in 
DataType.in() fails in certain cases.

As an example we can use the same T1 mentioned above, but this time let our 
query be the following:

  select * from t1 where i in (2.8, 4.23)

Notice that the IN-list values are already in sorted order and that, when 
truncated, "2.8" does NOT equal "4.23"--so the incorrect preprocessing 
described above will (accidentally) do the correct thing.  But now when we get 
to execution, we're going to search the IN list values for each of the values 
in T1(i).  Since the left operand is INTEGER, all comparisons will be 
integer-based comparisons, meaning that they will use truncated versions of the 
decimal values.  So we'll see something like:

  Search for "1":

    (1 == trunc(2.8))  --> (1 == 2) --> false (no match)
    (1 == trunc(4.23)) --> (1 == 4) --> false (no match)

  Search for "2":

    (2 == trunc(2.8))  --> (2 == 2) --> true (MATCH -- which is WRONG)

  Search for "3":

    (3 == trunc(2.8))  --> (3 == 2) --> false (no match)
    (3 == trunc(4.23)) --> (3 == 4) --> false (no match)

  Search for "4":

    (4 == trunc(2.8))  --> (4 == 2) --> false (no match)
    (4 == trunc(4.23)) --> (4 == 4) --> true (MATCH -- which is WRONG)

  Search for "5":

    (5 == trunc(2.8))  --> (5 == 2) --> false (no match)
    (5 == trunc(4.23)) --> (5 == 4) --> false (no match)

Thus the query will return two rows when in fact it should return NONE.

One final note here: when this issue was first created, the above query 
actually only returned 1 row, not two.  That was because Derby used to create 
an additional BETWEEN predicate for the IN-list, and that predicate eliminated 
the row for "i == 2" (because 2 is not between 2.8 and 4.23).  But the changes 
for DERBY-47 replaced the BETWEEN optimization with a multi-probe approach, 
thus further exposing (but not causing) the execution-time bug described here.  
This is further backed up by the fact that even before DERBY-47 the following 
query returned two rows, as well:

ij> select * from t1 where i in (cast (2.8 as decimal(4, 2)), 4.23);
I
-----------
2
4

2 rows selected  

The reason is that the explict CAST disabled the BETWEEN optimization, thereby 
leading to the same situation as just outlined.

> Wrong Results: Use of decimal values in an IN-list with INTEGER left operand 
> can lead to extra rows.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: DERBY-2256
>                 URL: https://issues.apache.org/jira/browse/DERBY-2256
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.0.2.0, 10.0.2.1, 10.0.2.2, 10.1.1.0, 10.1.2.1, 
> 10.1.3.1, 10.1.3.2, 10.1.4.0, 10.2.1.6, 10.2.2.0, 10.2.2.1, 10.2.3.0, 10.3.0.0
>            Reporter: A B
>         Assigned To: A B
>
> While trying out some code changes for DERBY-47 I was running a few test 
> cases and happened to notice that there are a couple of cases in which Derby 
> behaves incorrectly (that or my understanding of what should be happening 
> here is way off).
> First and most simply: the following query should return zero rows (unless 
> I'm missing something?), but it returns one:
> ij> create table t1 (i int);
> 0 rows inserted/updated/deleted
> ij> insert into t1 values 1, 2, 3, 4, 5;
> 5 rows inserted/updated/deleted
> -- Correct returns zero rows.
> ij> select * from t1 where i in (4.23);
> I
> -----------
> 0 rows selected
> -- But this one returns 1 row...
> ij> select * from t1 where i in (2.8, 4.23);
> I
> -----------
> 4
> 1 row selected
> Secondly, if the IN-list contains a non-constant value then Derby can 
> incorrectly return rows that do not match the IN predicate.  I think this is 
> because some internal casting is happening when it shouldn't?
> ij> create table t1 (i int);
> 0 rows inserted/updated/deleted
> ij> insert into t1 values 1, 2, 3, 4, 5;
> 5 rows inserted/updated/deleted
> -- Following values clause returns "2.80", as expected.
> ij> values cast (2.8 as decimal(4, 2));
> 1
> -------
> 2.80
> 1 row selected
> -- But if we use it in an IN-list it gets cast to "2" and thus returns a 
> match.
> -- We get 2 rows when we should get NONE.
> ij> select * from t1 where i in (cast (2.8 as decimal(4, 2)), 4.23);
> I
> -----------
> 2
> 4
> 2 rows selected
> I confirmed that we see these results on trunk, 10.2, 10.1, and even as far 
> back as svn #201660 for 10.0.  I also ran the above statements on DB2 v8 as a 
> sanity check to confirm that NO results were returned there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to