[ 
https://issues.apache.org/jira/browse/PHOENIX-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388161#comment-16388161
 ] 

James Taylor commented on PHOENIX-4336:
---------------------------------------

The only difference in processing of these two queries is that:
- the first query will do a point get for each local index row on the server to 
find the data columns to return back to the client.
- the second query would initiate client-side batched point lookups (using 
SkipScanFilter) to retrieve the data rows based on the rows it gets back from 
the inner index query.

Any thoughts, [~rajeshbabu] or [~sergey.soldatov]?

> SELECT + ORDER BY performance much worse than self-join
> -------------------------------------------------------
>
>                 Key: PHOENIX-4336
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4336
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Marcin Januszkiewicz
>            Priority: Major
>
> We have a wide table with 100M records created with the following DDL:
> {code:sql}
> CREATE TABLE traces (
>   rowkey VARCHAR PRIMARY KEY,
>   time VARCHAR,
>   number VARCHAR,
>   +40 more columns)
> CREATE LOCAL INDEX ix_0 ON traces (UPPER(number)) INCLUDE (time, + some other 
> columns used for filtering)
> {code}
> We want to select into a large (~30M records) subset of this data with the 
> query:
> {code:sql}
> SELECT *all columns*
>   FROM traces
>   WHERE (UPPER(number) LIKE 'PO %')
>   ORDER BY time DESC, ROWKEY
>   LIMIT 101;
> {code}
> {noformat}
> +-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> |                                                                PLAN         
>                                                         | EST_BYTES_READ  | 
> EST_ROWS_READ  |  EST_INFO_TS   |
> +-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> | CLIENT 234-CHUNK 39850892 ROWS 73610048115 BYTES PARALLEL 234-WAY RANGE 
> SCAN OVER               TRACES       [1,'PO '] - [1,'PO!']  | 73610048115     
> | 39850892       | 1509102519122  |
> |     SERVER TOP 101 ROWS SORTED BY [cf."time" DESC, "ROWKEY"]                
>                                                         | 73610048115     | 
> 39850892       | 1509102519122  |
> | CLIENT MERGE SORT                                                           
>                                                         | 73610048115     | 
> 39850892       | 1509102519122  |
> | CLIENT LIMIT 101                                                            
>                                                         | 73610048115     | 
> 39850892       | 1509102519122  |
> +-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> {noformat}
> This times out after 15 minutes and puts a huge load on our cluster.
> We have an alternate way of selecting this data:
> {code:sql}
> SELECT t.rowkey, *all columns*
> FROM TRACES t
> JOIN (
>   SELECT rowkey
>   FROM TRACES
>   WHERE (UPPER(number) LIKE 'PO %')
>   ORDER BY time DESC, ROWKEY
>   LIMIT 101
> ) ix
> ON t.ROWKEY = ix.ROWKEY
> order by t.ROWKEY;
> {code}
> {noformat}
> +---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
> |                                                                    PLAN     
>                                                                 | 
> EST_BYTES_READ  | EST_ROWS_READ  |  EST_INFO |
> +---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
> | CLIENT 1500-CHUNK 97154640 ROWS 462422573830 BYTES PARALLEL 5-WAY FULL SCAN 
> OVER               TRACES                                       | 73610048115 
>     | 39850892       | 150910251 |
> | CLIENT MERGE SORT                                                           
>                                                                 | 73610048115 
>     | 39850892       | 150910251 |
> |     PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE)                                
>                                                                 | 73610048115 
>     | 39850892       | 150910251 |
> |         CLIENT 234-CHUNK 39850892 ROWS 73610048115 BYTES PARALLEL 234-WAY 
> RANGE SCAN OVER               TRACES       [1,'PO '] - [1,'PO!']  | 
> 73610048115     | 39850892       | 150910251 |
> |             SERVER TOP 101 ROWS SORTED BY [cf."time" DESC, "ROWKEY"]        
>                                                                 | 73610048115 
>     | 39850892       | 150910251 |
> |         CLIENT MERGE SORT                                                   
>                                                                 | 73610048115 
>     | 39850892       | 150910251 |
> |         CLIENT LIMIT 101                                                    
>                                                                 | 73610048115 
>     | 39850892       | 150910251 |
> |     DYNAMIC SERVER FILTER BY T.ROWKEY IN (IX.ROWKEY)                        
>                                                                 | 73610048115 
>     | 39850892       | 150910251 |
> +---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
> {noformat}
> Which completes in just under a minute.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to