Caizhi Weng created FLINK-27627:
-----------------------------------

             Summary: Incorrect result when order by (string, double) pair with 
NaN values
                 Key: FLINK-27627
                 URL: https://issues.apache.org/jira/browse/FLINK-27627
             Project: Flink
          Issue Type: Bug
          Components: Table SQL / Runtime
    Affects Versions: 1.15.0
            Reporter: Caizhi Weng


Use these test data and SQL to reproduce this exception.

gao.csv:
{code}
1.0,2.0,aaaaaaaaaaaaaaa
0.0,0.0,aaaaaaaaaaaaaaa
1.0,1.0,aaaaaaaaaaaaaaa
0.0,0.0,aaaaaaaaaaaaaaa
1.0,0.0,aaaaaaaaaaaaaaa
0.0,0.0,aaaaaaaaaaaaaaa
-1.0,0.0,aaaaaaaaaaaaaaa
1.0,-1.0,aaaaaaaaaaaaaaa
1.0,-2.0,aaaaaaaaaaaaaaa
{code}

Flink SQL:
{code}
Flink SQL> create table T ( a double, b double, c string ) WITH ( 'connector' = 
'filesystem', 'path' = '/tmp/gao.csv', 'format' = 'csv' );
[INFO] Execute statement succeed.

Flink SQL> create table S ( a string, b double ) WITH ( 'connector' = 
'filesystem', 'path' = '/tmp/gao2.csv', 'format' = 'csv' );
[INFO] Execute statement succeed.

Flink SQL> insert into S select c, a / b from T;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: 8c98f5bb99c2dcd28f13def916e2178a


Flink SQL> select * from S order by a, b;
+-----------------+-----------+
|               a |         b |
+-----------------+-----------+
| aaaaaaaaaaaaaaa |       0.5 |
| aaaaaaaaaaaaaaa |       NaN |
| aaaaaaaaaaaaaaa |       1.0 |
| aaaaaaaaaaaaaaa |       NaN |
| aaaaaaaaaaaaaaa |  Infinity |
| aaaaaaaaaaaaaaa |       NaN |
| aaaaaaaaaaaaaaa | -Infinity |
| aaaaaaaaaaaaaaa |      -1.0 |
| aaaaaaaaaaaaaaa |      -0.5 |
+-----------------+-----------+
9 rows in set

Flink SQL> select * from S order by b;
+-----------------+-----------+
|               a |         b |
+-----------------+-----------+
| aaaaaaaaaaaaaaa | -Infinity |
| aaaaaaaaaaaaaaa |      -1.0 |
| aaaaaaaaaaaaaaa |      -0.5 |
| aaaaaaaaaaaaaaa |       0.5 |
| aaaaaaaaaaaaaaa |       1.0 |
| aaaaaaaaaaaaaaa |  Infinity |
| aaaaaaaaaaaaaaa |       NaN |
| aaaaaaaaaaaaaaa |       NaN |
| aaaaaaaaaaaaaaa |       NaN |
+-----------------+-----------+
9 rows in set
{code}

As is shown above, when order by a (string, double) pair the result is 
incorrect, while order by a double column separately yields the correct result.

This is because {{BinaryIndexedSortable}} uses two comparators, the normalized 
key comparator which directly compares memory segments, and the record 
comparator which compares actual column values. If the length of sort keys are 
not determined (for example if the sort keys contain strings) the normalized 
key comparator cannot fully determine the order and it will fall back to the 
record comparator.

As we can see in {{GenerateUtils#generateCompare}}, record comparator compares 
double values directly with {{<}} and {{>}}. However for {{Double.NaN}}, every 
binary comparator except {{!=}} will return false, which causes this issue.

Note that we cannot simply change {{GenerateUtils#generateCompare}}. This is 
because comparing {{NaN}} in SQL should also return false except for {{<>}}. It 
is the sorting operator that requires a specific order. That is to say, the 
current implementation of {{GenerateUtils#generateCompare}} is correct for 
comparing, but not for sorting. Maybe we should generate a special comparator 
for all sorting operators?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to