Caizhi Weng created FLINK-27627:
-----------------------------------
Summary: Incorrect result when order by (string, double) pair with
NaN values
Key: FLINK-27627
URL: https://issues.apache.org/jira/browse/FLINK-27627
Project: Flink
Issue Type: Bug
Components: Table SQL / Runtime
Affects Versions: 1.15.0
Reporter: Caizhi Weng
Use these test data and SQL to reproduce this exception.
gao.csv:
{code}
1.0,2.0,aaaaaaaaaaaaaaa
0.0,0.0,aaaaaaaaaaaaaaa
1.0,1.0,aaaaaaaaaaaaaaa
0.0,0.0,aaaaaaaaaaaaaaa
1.0,0.0,aaaaaaaaaaaaaaa
0.0,0.0,aaaaaaaaaaaaaaa
-1.0,0.0,aaaaaaaaaaaaaaa
1.0,-1.0,aaaaaaaaaaaaaaa
1.0,-2.0,aaaaaaaaaaaaaaa
{code}
Flink SQL:
{code}
Flink SQL> create table T ( a double, b double, c string ) WITH ( 'connector' =
'filesystem', 'path' = '/tmp/gao.csv', 'format' = 'csv' );
[INFO] Execute statement succeed.
Flink SQL> create table S ( a string, b double ) WITH ( 'connector' =
'filesystem', 'path' = '/tmp/gao2.csv', 'format' = 'csv' );
[INFO] Execute statement succeed.
Flink SQL> insert into S select c, a / b from T;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: 8c98f5bb99c2dcd28f13def916e2178a
Flink SQL> select * from S order by a, b;
+-----------------+-----------+
| a | b |
+-----------------+-----------+
| aaaaaaaaaaaaaaa | 0.5 |
| aaaaaaaaaaaaaaa | NaN |
| aaaaaaaaaaaaaaa | 1.0 |
| aaaaaaaaaaaaaaa | NaN |
| aaaaaaaaaaaaaaa | Infinity |
| aaaaaaaaaaaaaaa | NaN |
| aaaaaaaaaaaaaaa | -Infinity |
| aaaaaaaaaaaaaaa | -1.0 |
| aaaaaaaaaaaaaaa | -0.5 |
+-----------------+-----------+
9 rows in set
Flink SQL> select * from S order by b;
+-----------------+-----------+
| a | b |
+-----------------+-----------+
| aaaaaaaaaaaaaaa | -Infinity |
| aaaaaaaaaaaaaaa | -1.0 |
| aaaaaaaaaaaaaaa | -0.5 |
| aaaaaaaaaaaaaaa | 0.5 |
| aaaaaaaaaaaaaaa | 1.0 |
| aaaaaaaaaaaaaaa | Infinity |
| aaaaaaaaaaaaaaa | NaN |
| aaaaaaaaaaaaaaa | NaN |
| aaaaaaaaaaaaaaa | NaN |
+-----------------+-----------+
9 rows in set
{code}
As is shown above, when order by a (string, double) pair the result is
incorrect, while order by a double column separately yields the correct result.
This is because {{BinaryIndexedSortable}} uses two comparators, the normalized
key comparator which directly compares memory segments, and the record
comparator which compares actual column values. If the length of sort keys are
not determined (for example if the sort keys contain strings) the normalized
key comparator cannot fully determine the order and it will fall back to the
record comparator.
As we can see in {{GenerateUtils#generateCompare}}, record comparator compares
double values directly with {{<}} and {{>}}. However for {{Double.NaN}}, every
binary comparator except {{!=}} will return false, which causes this issue.
Note that we cannot simply change {{GenerateUtils#generateCompare}}. This is
because comparing {{NaN}} in SQL should also return false except for {{<>}}. It
is the sorting operator that requires a specific order. That is to say, the
current implementation of {{GenerateUtils#generateCompare}} is correct for
comparing, but not for sorting. Maybe we should generate a special comparator
for all sorting operators?
--
This message was sent by Atlassian Jira
(v8.20.7#820007)