[jira] [Commented] (PHOENIX-4552) Allow ROW_TIMESTAMP declaration for non PK column

James Taylor (JIRA) Tue, 23 Jan 2018 17:21:26 -0800

    [ 
https://issues.apache.org/jira/browse/PHOENIX-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336703#comment-16336703
 ]


James Taylor commented on PHOENIX-4552:
---------------------------------------

If you handle schema drift by either setting columns that are no longer 
relevant to null in the current data load or have something in the PK that 
changes when the schema changes, you should be fine to make your schema 
mutable. You can just not declare a ROW_TIMESTAMP column until this JIRA is 
implemented as it's just an optimization.

Before I forget, here's an outline of the work that'd be required to implement 
this JIRA:
- Remove check for ROW_TIMESTAMP only appearing in primary key
- Change WhereOptimizer to treat ROW_TIMESTAMP as being the last PK column when 
it's not in the PK. This will cause all the right stuff to happen when the 
ROW_TIMESTAMP column is used in AND and OR expressions as well as >, <, = 
expressions.
- When the ScanRanges is built at the end of 
WhereOptimizer.pushKeyExpressionsToScan(), ensure that the scan start/stop is 
not impacted when the ROW_TIMESTAMP column is not in the PK.
- Depending on how smart we want to try to be, we'd likely want to either 
disable the optimization if we have multiple values for the ROW_TIMESTAMP 
column in our query (i.e. the column was used in an OR or IN clause) or have a 
post filter on the ROW_TIMESTAMP column.
- Double check the upsert code path would function correctly.

We'd want to document the behavior this would have - for example columns set 
*before* the ROW_TIMESTAMP value being filtered against would be filtered out 
for a ROW_TIMESTAMP_COL > ? expression (i.e. the same "row" could have 
multiple, different timestamps). This would be a bit weird/hard to understand 
for use cases in which different columns are set at different times.

Also, we have restrictions that disallows indexes on tables with ROW_TIMESTAMP 
which would remain in place. We might be able to relax them some if we end up 
supporting IMMUTABLE indexes (i.e. mutable tables that indexes only columns 
that are write-once/append-only).



> Allow ROW_TIMESTAMP declaration for non PK column
> -------------------------------------------------
>
>                 Key: PHOENIX-4552
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4552
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Stephen Petschulat
>            Priority: Minor
>
> By declaring a ROW_TIMESTAMP constraint on a Phoenix table, it does two 
> things 1) expose the hbase native timestamp as this column and 2) prepend 
> your primary key with this timestamp as well.
> It would be useful to have a similar feature that only exposes the hbase 
> native timestamp. This would allow explicit setting of the timestamp when 
> upserting data while allowing multiple hbase versions. It is possible to then 
> query for that specific key and version(s).
> Potential approach:
> {code:sql}
> CREATE TABLE COMMENTS (
>    COMMENT_ID INT NOT NULL,
>    REVISION_NUM BIGINT NOT NULL ROW_TIMESTAMP,    // NEW use of keyword
>    COMMENT_BODY TEXT
>    CONSTRAINT PK PRIMARY KEY(COMMENT_ID))
> UPSERT INTO COMMENTS (123, 1, 'edit 1 comment')
> UPSERT INTO COMMENTS (123, 2, 'edit 2 of comment')
> UPSERT INTO COMMENTS (123, 3, 'edit 3 of comment')
> {code}
>  
> Current behavior of ROW_TIMESTAMP would create a new primary for each upsert, 
> so querying by primary key is no longer straightforward when you don't know 
> the version number at query time. 
> {code:sql}
> SELECT * FROM COMMENTS WHERE COMMENT_ID = 123  // => returns most recent 
> version 'edit 3 of comment'
> SELECT * FROM COMMENTS WHERE COMMENT_ID = 123 AND REVISION_NUM = 1   // => 
> returns explicit version 'edit 1 comment'
> {code}
>  
> It can also be useful to return multiple versions (related: PHOENIX-590)
> {code:sql}
> SELECT * FROM COMMENTS WHERE COMMENT_ID = 123 AND REVISION_NUM < 3   // => 
> returns 2 rows
> {code}
>  
> Or just the highest version less than or equal to a particular version 
> (allowing snapshot queries):
> {code:sql}
> // set CurrentSCN=2 on connection
> SELECT * FROM COMMENTS WHERE COMMENT_ID = 123 // => returns 'edit 2 of 
> comment'
> {code}
> CurrentSCN already allows this type of snapshot query but not against an 
> explicitly set timestamp with multiple versions. The primary key injection 
> prevents this. The above query would behave similar to:
> {code:java}
> scan 'COMMENTS', {TIMERANGE => [0, <maxversionid+1>]}
> {code}
>  This returns the highest versioned value for each key that is less than a 
> specified maximum version number.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (PHOENIX-4552) Allow ROW_TIMESTAMP declaration for non PK column

Reply via email to