Marcin Januszkiewicz created PHOENIX-4336:
---------------------------------------------
Summary: SELECT + ORDER BY performance much worse than self-join
Key: PHOENIX-4336
URL: https://issues.apache.org/jira/browse/PHOENIX-4336
Project: Phoenix
Issue Type: Bug
Reporter: Marcin Januszkiewicz
We have a wide table with 100M records created with the following DDL:
{code:sql}
CREATE TABLE traces (
rowkey VARCHAR PRIMARY KEY,
time VARCHAR,
number VARCHAR,
+40 more columns)
CREATE LOCAL INDEX ix_0 ON traces (UPPER(number)) INCLUDE (time, + some other
columns used for filtering)
{code}
We want to select into a large (~30M records) subset of this data with the
query:
{code:sql}
SELECT *all columns*
FROM traces
WHERE (UPPER(number) LIKE 'PO %')
ORDER BY time DESC, ROWKEY
LIMIT 101;
{code}
{noformat}
+-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
| PLAN
| EST_BYTES_READ |
EST_ROWS_READ | EST_INFO_TS |
+-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
| CLIENT 234-CHUNK 39850892 ROWS 73610048115 BYTES PARALLEL 234-WAY RANGE SCAN
OVER TRACES [1,'PO '] - [1,'PO!'] | 73610048115 |
39850892 | 1509102519122 |
| SERVER TOP 101 ROWS SORTED BY [cf."time" DESC, "ROWKEY"]
| 73610048115 |
39850892 | 1509102519122 |
| CLIENT MERGE SORT
| 73610048115 |
39850892 | 1509102519122 |
| CLIENT LIMIT 101
| 73610048115 |
39850892 | 1509102519122 |
+-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
{noformat}
This times out after 15 minutes and puts a huge load on our cluster.
We have an alternate way of selecting this data:
{code:sql}
SELECT t.rowkey, *all columns*
FROM TRACES t
JOIN (
SELECT rowkey
FROM TRACES
WHERE (UPPER(number) LIKE 'PO %')
ORDER BY time DESC, ROWKEY
LIMIT 101
) ix
ON t.ROWKEY = ix.ROWKEY
order by t.ROWKEY;
{code}
{noformat}
+---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
| PLAN
| EST_BYTES_READ
| EST_ROWS_READ | EST_INFO |
+---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
| CLIENT 1500-CHUNK 97154640 ROWS 462422573830 BYTES PARALLEL 5-WAY FULL SCAN
OVER TRACES | 73610048115
| 39850892 | 150910251 |
| CLIENT MERGE SORT
| 73610048115
| 39850892 | 150910251 |
| PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE)
| 73610048115
| 39850892 | 150910251 |
| CLIENT 234-CHUNK 39850892 ROWS 73610048115 BYTES PARALLEL 234-WAY
RANGE SCAN OVER TRACES [1,'PO '] - [1,'PO!'] | 73610048115
| 39850892 | 150910251 |
| SERVER TOP 101 ROWS SORTED BY [cf."time" DESC, "ROWKEY"]
| 73610048115
| 39850892 | 150910251 |
| CLIENT MERGE SORT
| 73610048115
| 39850892 | 150910251 |
| CLIENT LIMIT 101
| 73610048115
| 39850892 | 150910251 |
| DYNAMIC SERVER FILTER BY T.ROWKEY IN (IX.ROWKEY)
| 73610048115
| 39850892 | 150910251 |
+---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
{noformat}
Which completes in just under a minute.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)