[jira] [Created] (IMPALA-13944) Top-N should have a mode to handle duplicates deterministically

Joe McDonnell (Jira) Tue, 08 Apr 2025 12:47:33 -0700

Joe McDonnell created IMPALA-13944:
--------------------------------------

             Summary: Top-N should have a mode to handle duplicates deterministically
                 Key: IMPALA-13944
                 URL: https://issues.apache.org/jira/browse/IMPALA-13944
             Project: IMPALA
          Issue Type: Task
          Components: Frontend
    Affects Versions: Impala 5.0.0
            Reporter: Joe McDonnell



Top-N is not deterministic when the ordering column contains duplicate values:
{noformat}
> select id from functional.alltypes order by int_col limit 5;
+------+
| id   |
+------+
| 1880 |
| 1890 |
| 1870 |
| 1840 |
| 1850 |
+------+
Fetched 5 row(s) in 0.12s

> select id from functional.alltypes order by int_col limit 5;
+-----+
| id  |
+-----+
| 970 |
| 980 |
| 960 |
| 930 |
| 940 |
+-----+
Fetched 5 row(s) in 0.12s{noformat}
This is expected, because rows that tie on the ORDER BY expressions can be 
returned in any order. However, the non-determinism can create problems if a 
query has multiple identical Top-Ns whose results are expected to match. It 
also causes problems for tuple caching.

The Top-N can be made deterministic by ordering over additional columns until 
the only rows that still tie are identical in every column. A mode that adds 
the remaining columns automatically would avoid the need for customers to do 
this themselves.
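
As a rough illustration of the manual workaround, the query above could add a 
tie-breaking column. This sketch assumes that id is unique in 
functional.alltypes, which this issue does not state. If it were not unique, 
more columns would have to be appended in the same way.
{noformat}
-- Assumption: id is unique, so (int_col, id) orders the rows without ties.
select id from functional.alltypes order by int_col, id limit 5;
{noformat}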

Adding the additional columns should have a very small impact on performance 
when there are few duplicates, but it would add a clear performance penalty 
when there are many.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
