okumin created HIVE-27480:
-----------------------------
Summary: OFFSET without ORDER BY generates wrong results
Key: HIVE-27480
URL: https://issues.apache.org/jira/browse/HIVE-27480
Project: Hive
Issue Type: Bug
Components: Query Planning
Affects Versions: 4.0.0-alpha-2, 3.1.3, 2.4.0
Reporter: okumin
Assignee: okumin
Without ORDER BY, Hive doesn't add an additional single reducer, and it pushes
OFFSET and LIMIT to the original vertex. It can generate incorrect results
because OFFSET should be counted globally unlike LIMIT.
We would make the following changes.
- To fix the incorrect behavior
- To add a new `hive.strict.checks.*` to prevent this usage. Mostly, OFFSET
without ORDER BY is meaningless
We can reproduce the issue in the following steps.
*Prepare test data*
The following SQLs generate a test table with multiple files.
{code:sql}
$ beeline -e "
drop table test;
create table test(id int);
insert into test values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
insert into test values (11), (12), (13), (14), (15), (16), (17), (18), (19),
(20);
insert into test values (21), (22), (23), (24), (25), (26), (27), (28), (29),
(30);
"
{code}
*Reproduce the issue*
No rows are returned with `limit 10 offset 10`.
{code:sql}
$ beeline -e "select * from test limit 10, 10" --hiveconf
hive.fetch.task.conversion=none --hiveconf tez.grouping.min-size=1 --hiveconf
tez.grouping.max-size=1
...
+----------+
| test.id |
+----------+
+----------+
No rows selected (13.595 seconds)
{code}
*Expected behavior*
It should return any 10 rows like this.
{code:sql}
$ beeline -e "select * from test limit 10, 10"
...
+----------+
| test.id |
+----------+
| 11 |
| 12 |
| 13 |
| 14 |
| 15 |
| 16 |
| 17 |
| 18 |
| 19 |
| 20 |
+----------+
10 rows selected (0.175 seconds)
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)