[
https://issues.apache.org/jira/browse/PHOENIX-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138435#comment-15138435
]
Rajeshbabu Chintaguntla commented on PHOENIX-2665:
--------------------------------------------------
[~jamestaylor]
Here is the schema and index; I also uploaded some random data:
{noformat}
CREATE TABLE IF NOT EXISTS test (ID INTEGER PRIMARY KEY,unsig_id UNSIGNED_INT,
big_id BIGINT,unsig_long_id UNSIGNED_LONG)
{noformat}
{noformat}
create index idx on test(unsig_id);
{noformat}
The explain plan is this.
{noformat}
0: jdbc:phoenix:localhost> explain select unsig_id,id from test group by id,
unsig_id;
+------------------------------------------+
| PLAN |
+------------------------------------------+
| CLIENT 2-CHUNK PARALLEL 1-WAY FULL SCAN OVER IDX |
| SERVER FILTER BY FIRST KEY ONLY |
| SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["ID", "UNSIG_ID"] |
| CLIENT MERGE SORT |
+------------------------------------------+
4 rows selected (0.041 seconds)
{noformat}
The problem is this:
-------------------------
After creating the iterators in BaseResultIterators, we fetch the first set of
rows from the server. The HBase client scanner keeps track of the last fetched
row so that, if something like a split happens, it sets the last fetched row as
the start row for the scan and recreates the scanners. Since the scan ranges are
no longer proper, we throw StaleRegionBoundaryException and then try to create
two parallel scans from the boundaries [last_fetched_row, actual_scan_stop_row),
but we need to create scanners for the boundaries
[actual_scan_start_row, actual_scan_stop_row). The last_fetched_row may not be
a proper row key in the index for aggregate queries.
The solution is to keep a copy of the scan so that we use the proper start and
stop rows to prepare the parallel scans.
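The idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Phoenix patch: `SimpleScan`, `splitAt`, and the one-byte row keys are hypothetical stand-ins for the real scan object and region boundaries. It only shows that a copy taken before the scanner moves the start row still yields [actual_scan_start_row, actual_scan_stop_row) when the parallel scans are rebuilt.

```java
import java.util.ArrayList;
import java.util.List;

class ScanRetrySketch {

    /** Hypothetical stand-in for a scan: a mutable [startRow, stopRow) pair. */
    static final class SimpleScan {
        private byte[] startRow;
        private final byte[] stopRow;

        SimpleScan(byte[] startRow, byte[] stopRow) {
            this.startRow = startRow.clone();
            this.stopRow = stopRow.clone();
        }

        byte[] getStartRow() { return startRow.clone(); }

        byte[] getStopRow() { return stopRow.clone(); }

        // Models the scanner moving the start row to the last fetched row.
        void setStartRow(byte[] row) { this.startRow = row.clone(); }

        // Independent copy, so later changes to this scan do not affect it.
        SimpleScan copy() { return new SimpleScan(startRow, stopRow); }
    }

    /** Splits a scan at an assumed new region boundary into two parallel scans. */
    static List<SimpleScan> splitAt(SimpleScan scan, byte[] boundary) {
        List<SimpleScan> parts = new ArrayList<>();
        parts.add(new SimpleScan(scan.getStartRow(), boundary));
        parts.add(new SimpleScan(boundary, scan.getStopRow()));
        return parts;
    }

    public static void main(String[] args) {
        SimpleScan scan = new SimpleScan(new byte[] {0x10}, new byte[] {0x50});

        // Take the copy before any rows are fetched.
        SimpleScan original = scan.copy();

        // The scanner records the last fetched row as the new start row.
        scan.setStartRow(new byte[] {0x30});

        // On a stale region boundary, rebuild the scans from the copy,
        // not from the scan whose start row has moved.
        List<SimpleScan> retry = splitAt(original, new byte[] {0x40});
        for (SimpleScan s : retry) {
            System.out.printf("[%02x, %02x)%n", s.getStartRow()[0], s.getStopRow()[0]);
        }
    }
}
```

In this sketch the retry covers [10, 40) and [40, 50), whereas splitting the mutated `scan` would have started at 30.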
> index split while running group by query is returning duplicate results
> -----------------------------------------------------------------------
>
> Key: PHOENIX-2665
> URL: https://issues.apache.org/jira/browse/PHOENIX-2665
> Project: Phoenix
> Issue Type: Bug
> Reporter: Rajeshbabu Chintaguntla
> Assignee: Rajeshbabu Chintaguntla
> Priority: Blocker
> Fix For: 4.7.0
>
>
> When an index splits while a group by query is running, the query returns
> duplicate results.
> Instead of returning 500,000 records, it returns 729,500 records.
> {noformat}
> +------------------------------------------+------------------------------------------+
> | 4999                                     | 499999                                   |
> +------------------------------------------+------------------------------------------+
> 500,000 rows selected (11.996 seconds)
> {noformat}
> {noformat}
> +------------------------------------------+------------------------------------------+
> | 4999                                     | 499999                                   |
> +------------------------------------------+------------------------------------------+
> 729,500 rows selected (15.291 seconds)
> {noformat}