[
https://issues.apache.org/jira/browse/KUDU-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241691#comment-15241691
]
Dan Burkert commented on KUDU-1363:
-----------------------------------
I thought about this a little more after commenting yesterday evening. It's
not necessarily true that IN list predicates are inefficient, only that making
them efficient is going to be a little bit tricky. As an example, consider the
following schema:
{code:SQL}
CREATE TABLE machine_metrics (
  host STRING, metric STRING, time TIMESTAMP, value DOUBLE,
  PRIMARY KEY (host, metric, time)
);
{code}
So we have a pretty ordinary time series schema, with the somewhat unusual
characteristic of sorting first by the host and metric instead of timestamp.
With a table like this we may want to have a query that retrieves a few metrics
across a few different hosts for a single day, such as:
{code:SQL}
SELECT * FROM machine_metrics
WHERE host IN ('host-001', 'host-235')
AND metric IN ('load-avg-1min', 'load-avg-5min')
AND time >= 2016-04-01T00:00:00
AND time < 2016-04-02T00:00:00;
{code}
In the most naive approach, this scan could be satisfied by doing a full table
scan and simply applying the predicates to each record as it is scanned. But
since the predicates are specified on primary key columns, Kudu could be a
little bit smarter and convert the full table scan into four individual scanners,
each of which scans just the rows that match the predicates. The scanners
would have the following primary key bounds:
{code:SQL}
PK >= ('host-001', 'load-avg-1min', 2016-04-01T00:00:00)
  AND PK < ('host-001', 'load-avg-1min', 2016-04-02T00:00:00)
PK >= ('host-001', 'load-avg-5min', 2016-04-01T00:00:00)
  AND PK < ('host-001', 'load-avg-5min', 2016-04-02T00:00:00)
PK >= ('host-235', 'load-avg-1min', 2016-04-01T00:00:00)
  AND PK < ('host-235', 'load-avg-1min', 2016-04-02T00:00:00)
PK >= ('host-235', 'load-avg-5min', 2016-04-01T00:00:00)
  AND PK < ('host-235', 'load-avg-5min', 2016-04-02T00:00:00)
{code}
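The expansion from IN lists to per-combination key bounds can be sketched in a few lines of Java. This is only an illustration of the idea, not Kudu code: the InListKeyBounds class, the KeyRange type, and the expand method are hypothetical names, key values are modelled as plain strings, and the range predicate is assumed to apply to the last primary key column. The lower bound is inclusive and the upper bound exclusive, to match the query's time >= ... AND time < ... predicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InListKeyBounds {

    /** One scan range over the composite key: lower is inclusive, upper is exclusive. */
    static final class KeyRange {
        final List<String> lower;
        final List<String> upper;

        KeyRange(List<String> lower, List<String> upper) {
            this.lower = lower;
            this.upper = upper;
        }

        @Override
        public String toString() {
            return "PK >= " + lower + " AND PK < " + upper;
        }
    }

    /**
     * Expands IN lists on the leading key columns, plus a half-open range on
     * the last key column, into one KeyRange per combination of IN-list values.
     */
    static List<KeyRange> expand(List<List<String>> inLists,
                                 String rangeLower, String rangeUpper) {
        List<List<String>> prefixes = new ArrayList<>();
        prefixes.add(new ArrayList<String>());
        // Build the cartesian product column by column, preserving list order.
        for (List<String> values : inLists) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : prefixes) {
                for (String value : values) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(value);
                    next.add(extended);
                }
            }
            prefixes = next;
        }
        // Append the range endpoints to each key prefix to form the bounds.
        List<KeyRange> ranges = new ArrayList<>();
        for (List<String> prefix : prefixes) {
            List<String> lower = new ArrayList<>(prefix);
            lower.add(rangeLower);
            List<String> upper = new ArrayList<>(prefix);
            upper.add(rangeUpper);
            ranges.add(new KeyRange(lower, upper));
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<KeyRange> ranges = expand(
            Arrays.asList(
                Arrays.asList("host-001", "host-235"),
                Arrays.asList("load-avg-1min", "load-avg-5min")),
            "2016-04-01T00:00:00",
            "2016-04-02T00:00:00");
        for (KeyRange range : ranges) {
            System.out.println(range);
        }
    }
}
```

The number of ranges is the product of the IN-list sizes (2 x 2 = 4 here). If each IN list is sorted, the ranges come out in primary key order.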
Today Kudu is smart enough to push equality and range predicates into a single
primary key bound (see the optimization guide linked above for examples), but
only a single primary key bound is supported, not multiple. As a bonus, I
think adding this level of optimization would negate the need for a multi-get
API.
> Add Multiple column range predicates for the same column in a single scan
> -------------------------------------------------------------------------
>
> Key: KUDU-1363
> URL: https://issues.apache.org/jira/browse/KUDU-1363
> Project: Kudu
> Issue Type: New Feature
> Reporter: Chris George
>
> Currently adding multiple column range predicates for the same column does
> essentially an AND between the two predicates which will cause no results to
> be returned.
> This would greatly increase performance where I can complete in one scan what
> would otherwise take two.
> As an example using the Java API:
> ColumnRangePredicate columnRangePredicateColumnNameA = new ColumnRangePredicate(
>     new ColumnSchema.ColumnSchemaBuilder("column_name", Type.STRING).build());
> columnRangePredicateColumnNameA.setLowerBound("A");
> columnRangePredicateColumnNameA.setUpperBound("A");
> ColumnRangePredicate columnRangePredicateColumnNameB = new ColumnRangePredicate(
>     new ColumnSchema.ColumnSchemaBuilder("column_name", Type.STRING).build());
> columnRangePredicateColumnNameB.setLowerBound("B");
> columnRangePredicateColumnNameB.setUpperBound("B");
> which would be equivalent to:
> select * from some_table where column_name = 'A' or column_name = 'B'