Joe McDonnell created IMPALA-13944:
--------------------------------------
Summary: Top-N should have a mode to handle duplicates deterministically
Key: IMPALA-13944
URL: https://issues.apache.org/jira/browse/IMPALA-13944
Project: IMPALA
Issue Type: Task
Components: Frontend
Affects Versions: Impala 5.0.0
Reporter: Joe McDonnell
Top-N is not deterministic when the ORDER BY columns contain duplicate values:
{noformat}
> select id from functional.alltypes order by int_col limit 5;
+------+
| id |
+------+
| 1880 |
| 1890 |
| 1870 |
| 1840 |
| 1850 |
+------+
Fetched 5 row(s) in 0.12s
> select id from functional.alltypes order by int_col limit 5;
+-----+
| id |
+-----+
| 970 |
| 980 |
| 960 |
| 930 |
| 940 |
+-----+
Fetched 5 row(s) in 0.12s
{noformat}
This is expected, because the order of rows that tie on the ORDER BY columns
is unspecified. However, the non-determinism can create problems when a query
contains multiple identical Top-Ns whose results are expected to match. It
also causes problems for tuple caching.
The Top-N can be made deterministic by ordering over additional columns until
ties are only possible between rows that are completely identical. A mode that
automatically adds the remaining columns to the ordering would avoid the need
for customers to do this themselves.
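As a rough illustration of the idea (a Python sketch, not Impala's implementation; the top_n helper, its deterministic flag, and the sample rows are all hypothetical), appending every remaining column as a tie-breaker makes the result independent of the order in which rows arrive:

```python
import heapq
import random

def top_n(rows, n, key_cols, deterministic=False):
    """Return the first n rows ordered by key_cols.

    rows are tuples; key_cols are the column indexes in the ORDER BY clause.
    With deterministic=True, all remaining columns are appended as
    tie-breakers, so ties can only occur between fully identical rows.
    """
    def sort_key(row):
        key = tuple(row[c] for c in key_cols)
        if deterministic:
            rest = [c for c in range(len(row)) if c not in key_cols]
            key += tuple(row[c] for c in rest)
        return key
    return heapq.nsmallest(n, rows, key=sort_key)

# Hypothetical (id, int_col) rows with many duplicate int_col values.
rows = [(i, i % 3) for i in range(30)]
shuffled = rows[:]
random.Random(0).shuffle(shuffled)  # simulate a different arrival order

# Ordering by int_col only: ties are resolved by arrival order.
plain_a = top_n(rows, 5, key_cols=[1])
plain_b = top_n(shuffled, 5, key_cols=[1])

# Ordering by int_col plus all remaining columns: arrival order is irrelevant.
det_a = top_n(rows, 5, key_cols=[1], deterministic=True)
det_b = top_n(shuffled, 5, key_cols=[1], deterministic=True)
```

Here det_a and det_b are both [(0, 0), (3, 0), (6, 0), (9, 0), (12, 0)], whereas plain_a and plain_b follow the arrival order among tied rows and so can differ.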
Adding the additional columns would have a very small impact on performance
when there are few duplicates, since the extra columns only need to be
compared when the original ordering columns tie, but it would add a
performance penalty when there are many duplicates.
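The comparison-cost argument can be sketched with a small counting experiment. This is an assumption-laden toy in Python, not a measurement of Impala: the count_tiebreaks helper and the sample data are hypothetical, and it only counts how often a comparison has to fall through to the extra columns.

```python
from functools import cmp_to_key

def count_tiebreaks(rows, key_col):
    """Sort rows by key_col, breaking ties on the whole row, and return how
    many comparisons had to fall through to the tie-break columns."""
    fallthroughs = 0

    def compare(a, b):
        nonlocal fallthroughs
        if a[key_col] != b[key_col]:
            return -1 if a[key_col] < b[key_col] else 1
        fallthroughs += 1  # ORDER BY key tied: extra columns are consulted
        return (a > b) - (a < b)

    sorted(rows, key=cmp_to_key(compare))
    return fallthroughs

# Hypothetical (id, int_col) rows.
few_dups = [(i, i) for i in range(1000)]   # int_col is unique
many_dups = [(i, 0) for i in range(1000)]  # int_col is the same everywhere

few_count = count_tiebreaks(few_dups, key_col=1)
many_count = count_tiebreaks(many_dups, key_col=1)
```

With unique ordering values the tie-break columns are never consulted (few_count is 0), while with all-duplicate values every comparison consults them. Any cost of carrying the extra columns through the sort is not modeled here.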