[
https://issues.apache.org/jira/browse/PHOENIX-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254943#comment-14254943
]
James Taylor commented on PHOENIX-914:
--------------------------------------
bq. [~jleech] If Phoenix is re-writing every cell of a row for every update
Phoenix does not do this. Phoenix updates only the KVs of the columns being
updated, plus the empty KV that accompanies every row.
bq. [~jleech] Conceptually, the timestamp of the version of a row at a
particular point in time is the max of all of the cell timestamps
Agreed. Even easier, we can use the timestamp of our empty KV as the definitive
timestamp.
[~jleech] - Phoenix already handles flashback queries. We use a connection
property to define the upper bound of the time range (and not the Oracle
syntax). See
http://phoenix.apache.org/faq.html#Can_phoenix_work_on_tables_with_arbitrary_timestamp_as_flexible_as_HBase_API.
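For illustration, here's a minimal JDBC sketch of that flashback approach
(assuming a hypothetical jdbc:phoenix:localhost URL and the server_metrics
table from the example further down; CURRENT_SCN_ATTRIB is the PhoenixRuntime
constant for the CurrentSCN property):
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Properties;
import org.apache.phoenix.util.PhoenixRuntime;

public class FlashbackExample {
    public static void main(String[] args) throws SQLException {
        Properties props = new Properties();
        // Query the table as of one hour ago; CurrentSCN is the upper bound
        // (in epoch millis) of the scan time range for this connection.
        long asOf = System.currentTimeMillis() - 60 * 60 * 1000L;
        props.setProperty(PhoenixRuntime.CURRENT_SCN_ATTRIB, Long.toString(asOf));
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost", props);
             ResultSet rs = conn.createStatement().executeQuery(
                 "SELECT host, io_wait FROM server_metrics")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        }
    }
}
{code}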
Also, take a look at some other interesting/pending work in PHOENIX-590 to
support querying over multiple versions of the same row.
bq. [~jeffreyz] Pseudo column approach has to smartly, implicitly derive scan
time range from where clause expression as a super set time range bound.
The easiest way to achieve this is to treat the pseudo column as a special PK
column. The code in WhereOptimizer already has all the logic to combine ranges
across AND/ORs. We'd probably need a bit of logic to check if this pseudo
column is referenced (we can track that in the visitor of the WhereOptimizer).
Based on what the WhereOptimizer determines, we'd either extract the pseudo
column expressions (in which case the optimization would kick in), or we'd
evaluate this at filter time (which would still be correct, but would be
slower). This is the same conceptually as when we can extract a row key column
and turn it into a start/stop row on the scan, or not (in which case it's
evaluated as a filter). This happens at compile time, so it would work well, I
believe.
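Conceptually, extracting the pseudo column would amount to something like this
on the underlying HBase Scan (a hypothetical sketch, not the actual compiler
code):
{code}
import java.io.IOException;
import org.apache.hadoop.hbase.client.Scan;

public class TimeRangeSketch {
    // What extracting ROW_TIMESTAMP > CURRENT_DATE() - 7 would boil down to:
    // a time range on the Scan, so the StoreScanner can skip HFiles whose
    // timestamp ranges lie entirely outside [lower, Long.MAX_VALUE). If the
    // expression can't be extracted, it is evaluated as a filter instead.
    public static Scan lastSevenDays() throws IOException {
        long lower = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
        Scan scan = new Scan();
        scan.setTimeRange(lower, Long.MAX_VALUE);
        return scan;
    }
}
{code}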
As far as the two approaches that have been discussed for referencing the row
timestamp - 1) a pseudo column versus 2) a read-only row key column that is
transparently set to the row timestamp - maybe both could be supported. The
pseudo column would be a reserved column name; if we allow it to be declared
in the CREATE TABLE call, then it can optionally be used in the row key as well.
It seems to me that in any time series use case, there will be a column in the
row key that maps to time. Either that, or perhaps a sequence would be used.
Sometimes, as [~jeffreyz] points out, you don't want the row key timestamp to
be controlled by HBase, as it may be slightly different (and represent the time
when the data was rolled up, for example). As a side note, could this case be
accommodated by declaring another, different column outside the row key to
capture this information? But if we still think it's better to support both a
row key column and a pseudo column (assume ROW_TIMESTAMP is the pseudo
column), here's an example of what a query might look like:
{code}
CREATE TABLE server_metrics (
    host VARCHAR, created_date DATE,
    cpu_metric DOUBLE, io_wait BIGINT,
    CONSTRAINT pk PRIMARY KEY (host, created_date));

SELECT avg(io_wait) FROM server_metrics
WHERE host LIKE 'SF-%'
AND created_date > CURRENT_DATE() - 7
AND ROW_TIMESTAMP > CURRENT_DATE() - 7;
{code}
The expression filtering created_date would need to be duplicated for the
pseudo column (which is somewhat unfortunate). If created_date is not
filtered, then the skip scan optimization wouldn't kick in (which would have a
big negative impact on perf). Would this be the situation for your use case,
[~jeffreyz]? Or would you filter created_date, but on a different date range?
Using the alternative timestamp row key column approach, it could look
something like this:
{code}
CREATE TABLE server_metrics (
    host VARCHAR, -- no need to declare the pseudo column, as its type is known
    cpu_metric DOUBLE, io_wait BIGINT,
    CONSTRAINT pk PRIMARY KEY (host, ROW_TIMESTAMP)); -- add pseudo column to PK

SELECT avg(io_wait) FROM server_metrics
WHERE host LIKE 'SF-%'
AND ROW_TIMESTAMP > CURRENT_DATE() - 7;
{code}
In this case, only one filter is needed. Both the new Cell timestamp
optimization and the skip scan optimization would kick in for this query.
What do you think, [~jeffreyz]? Are you still interested in pursuing this JIRA?
[~iwritecode] - you'd get a big perf gain with this optimization for your event
use case.
> Native HBase timestamp support to optimize date range queries in Phoenix
> -------------------------------------------------------------------------
>
> Key: PHOENIX-914
> URL: https://issues.apache.org/jira/browse/PHOENIX-914
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 4.0.0
> Reporter: Vladimir Rodionov
> Assignee: Jeffrey Zhong
> Attachments: PHOENIX-914.patch
>
>
> For many applications, one of the columns of a table can be (and must be)
> naturally mapped to the HBase timestamp. What this gives us is the
> StoreScanner optimization where HFiles with timestamps out of the range of
> a Scan operator will be skipped. Let us say that we have time-series data
> (EVENTS) and custom compaction, where we create a series of HFiles with
> contiguous, non-overlapping timestamp ranges.
> CREATE TABLE IF NOT EXISTS ODS.EVENTS (
>     METRICID VARCHAR NOT NULL,
>     METRICNAME VARCHAR,
>     SERVICENAME VARCHAR NOT NULL,
>     ORIGIN VARCHAR NOT NULL,
>     APPID VARCHAR,
>     IPID VARCHAR,
>     NVALUE DOUBLE,
>     TIME TIMESTAMP NOT NULL /*+ TIMESTAMP */,
>     DATA VARCHAR,
>     SVALUE VARCHAR,
>     CONSTRAINT PK PRIMARY KEY (METRICID, SERVICENAME, ORIGIN, APPID, IPID, TIME)
> ) SALT_BUCKETS=40, IMMUTABLE_ROWS=true, VERSIONS=1, DATA_BLOCK_ENCODING='NONE';
> Note the TIME TIMESTAMP NOT NULL /*+ TIMESTAMP */ declaration - this is the
> hint telling Phoenix that the column TIME must be mapped to the HBase
> timestamp.
> The query: select all events of type 'X' for the last 7 days:
> SELECT * FROM EVENTS WHERE METRICID = 'X' AND TIME < NOW() AND TIME > NOW() -
> 7*24*3600000; (this may not be correct SQL syntax, of course)
> These types of queries will be efficiently optimized if:
> 1. Phoenix maps the TIME column to the HBase timestamp
> 2. Phoenix is smart enough to map a WHERE clause on the TIME attribute to
> the Scan time range
> Although this:
> Properties props = new Properties();
> props.setProperty(PhoenixRuntime.CURRENT_SCN_ATTRIB, Long.toString(ts));
> Connection conn = DriverManager.getConnection(myUrl, props);
> conn.createStatement().execute("UPSERT INTO myTable VALUES ('a')");
> conn.commit();
> will work in my case, it may not be efficient from a performance point of
> view, because a new Connection object and a new Statement are created for
> every INSERT/UPSERT; besides this, we still need optimization 2 (see above).