[jira] [Commented] (PHOENIX-3438) Phoenix should be able to optimize LIMIT query with IN clause

James Taylor (JIRA) Thu, 10 Nov 2016 07:39:11 -0800

    [ 
https://issues.apache.org/jira/browse/PHOENIX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654360#comment-15654360
 ]


James Taylor commented on PHOENIX-3438:
---------------------------------------

The row order has to be preserved across the entire set of rows for the 
aggregation to be done in-place by a single scan. Phoenix won't compile a query 
into multiple scans except for its parallelization partitioned along row key 
based on guideposts (i.e. If there are enough score+container_id rows to span 
multiple guideposts) or for union (where you're essentially telling it which 
scans to run separately). The trade off will be multiple RPCs plus the dedup 
you're stuck doing versus a single scan over potentially more data. If Phoenix 
did optimize it into multiple scans the way you envisioned, the execution plan 
would pretty much look the same as the union query (minus the deduping).

> Phoenix should be able to optimize LIMIT query with IN clause
> -------------------------------------------------------------
>
>                 Key: PHOENIX-3438
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3438
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.8.0
>            Reporter: Joel Palmert
>            Assignee: James Taylor
>            Priority: Critical
>             Fix For: 4.8.3
>
>
> This is taken from a real production issue we're having but has been 
> simplified here. Steps to repro:
> CREATE TABLE IF NOT EXISTS TEST.TEST (
>     ORGANIZATION_ID CHAR(15) NOT NULL,
>     SCORE INTEGER NOT NULL,
>     ENTITY_ID CHAR(15)
>     CONSTRAINT TEST_PK PRIMARY KEY (
>         ORGANIZATION_ID,
>         SCORE DESC
>     )
> ) VERSIONS=1, MULTI_TENANT=TRUE, REPLICATION_SCOPE=1, TTL=43200
> EXPLAIN
> SELECT entity_id, MAX(score) max_score
> FROM test.test
> WHERE organization_id IN ('org1','org3')
> GROUP BY entity_id
> ORDER BY max_score DESC
> LIMIT 1
> CLIENT 1-CHUNK PARALLEL 1-WAY SKIP SCAN ON 2 KEYS OVER TEST.TEST ['org1       
>     '] - ['org2           ']
>     SERVER AGGREGATE INTO DISTINCT ROWS BY [ENTITY_ID]
> CLIENT MERGE SORT
> CLIENT TOP 1 ROW SORTED BY [MAX(SCORE) DESC]
> Notice that Phoenix gets and sends all the rows for the two organizations to 
> the client. Since the ORDER BY is in PK order I would expect it to just 
> 1. Get the first row for each organization (in index order) rather than 
> scanning all the rows.
> 2. Merge them (only 2 rows in this case)
> 3. Limit 1
> Consider the following query and plan for what I would expect to see:
> EXPLAIN
> SELECT entity_id, MAX(score) max_score
> FROM(
>       SELECT entity_id, score
>       FROM(
>               SELECT entity_id, score
>               FROM test.test
>               WHERE organization_id = 'org1'
>               ORDER BY score DESC
>               LIMIT 1)
>       UNION ALL
>       SELECT entity_id, score
>               FROM(
>               SELECT entity_id, score
>               FROM test.test
>               WHERE organization_id = 'org2'
>               ORDER BY score DESC
>               LIMIT 1))
> GROUP BY entity_id
> ORDER BY max_score DESC
> LIMIT 1
> UNION ALL OVER 2 QUERIES
>     CLIENT 1-CHUNK SERIAL 1-WAY RANGE SCAN OVER TEST.TEST ['org1           ']
>         SERVER 1 ROW LIMIT
>     CLIENT 1 ROW LIMIT
>     CLIENT 1-CHUNK SERIAL 1-WAY RANGE SCAN OVER TEST.TEST ['org2           ']
>         SERVER 1 ROW LIMIT
>     CLIENT 1 ROW LIMIT
> CLIENT SORTED BY [ENTITY_ID]
> CLIENT AGGREGATE INTO DISTINCT ROWS BY [ENTITY_ID]
> CLIENT TOP 1 ROW SORTED BY [MAX(SCORE) DESC]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (PHOENIX-3438) Phoenix should be able to optimize LIMIT query with IN clause

Reply via email to