[ https://issues.apache.org/jira/browse/PHOENIX-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257565#comment-14257565 ]
James Taylor edited comment on PHOENIX-514 at 12/23/14 11:26 PM:
-----------------------------------------------------------------
One meta point is that rather than having branching logic everywhere an index is created/used, this new mechanism should replace the old one (as it's more general and a superset of the existing functionality). Here's a more detailed list of the high-level changes necessary to support this.
- Change the grammar to allow an expression instead of just a column_name in col_def_name_with_sort_order.
- Add a value_expression VARCHAR column to the SYSTEM.CATALOG table. This new string value persists the expression.toString() value of the functional index.
- In MetaDataEndpointImpl, when we construct a PColumn for an index, we can have the PColumnImpl value default to a column reference in the data table when value_expression is null, so that we don't need a script to update existing index definitions.
- In MetaDataClient.createIndex(), we'll want to use the value_expression.getDataType() information to form the table metadata. We'll also need to collect a parallel list of value_expression strings to persist in the system table. Note that you can only "stringify" an Expression, not a ParseNode, so you'll need to compile the ParseNode. Also, it's important to note that the PK columns should be put at the end of a functional index. We'll want to make sure that for the current usage (i.e. where just a column name is used as the indexed column), that PK column is not repeated in the index table definition. For example, the following should produce an index table definition with an "UPPER(K2) VARCHAR, K1 VARCHAR, K2 VARCHAR" primary key constraint, as we need to maintain our 1:1 correspondence between index rows and data rows (remember, with functional indexes, there may be more than one data row with the same UPPER(K2) value):
{code}
CREATE TABLE t(k1 VARCHAR, k2 VARCHAR, CONSTRAINT pk PRIMARY KEY (k1, k2));
CREATE INDEX i ON t(UPPER(K2));
{code}
This is as opposed to a non-functional index, which should still produce an index table definition with a "K2 VARCHAR, K1 VARCHAR" primary key constraint:
{code}
CREATE TABLE t(k1 VARCHAR, k2 VARCHAR, CONSTRAINT pk PRIMARY KEY (k1, k2));
CREATE INDEX i ON t(K2);
{code}
- The initial index population occurs in two places: in PostIndexDDLCompiler for immutable and global indexes, and in MetaDataClient.buildIndex():727 for local indexes. For local indexes, both the initial population and the incremental maintenance go through the IndexMaintainer, which we'll cover next. For PostIndexDDLCompiler, we generate an UPSERT SELECT statement that selects from the data table and inserts into the index table. Currently it's more or less a straight copy from the data column to the index column. You'll need to insert the value_expression in place of the data column in the SELECT part of the UPSERT SELECT statement (a sketch follows).
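Here's a minimal sketch of that substitution, using a hypothetical IndexedColumn holder in place of Phoenix's actual PColumn metadata (the real PostIndexDDLCompiler assembles its statement differently):
{code}
import java.util.Arrays;
import java.util.List;

public class UpsertSelectSketch {
    /** Illustrative pairing of an index column with its optional expression. */
    static class IndexedColumn {
        final String columnName;       // name of the column in the index table
        final String valueExpression;  // persisted expression string, or null
        IndexedColumn(String columnName, String valueExpression) {
            this.columnName = columnName;
            this.valueExpression = valueExpression;
        }
    }

    static String buildPopulationStatement(String indexTable, String dataTable,
                                           List<IndexedColumn> columns) {
        StringBuilder select = new StringBuilder();
        for (IndexedColumn col : columns) {
            if (select.length() > 0) {
                select.append(", ");
            }
            // Substitute the functional expression (e.g. UPPER(K2)) when one
            // was persisted; otherwise copy the data column straight across.
            select.append(col.valueExpression != null ? col.valueExpression
                                                      : col.columnName);
        }
        return "UPSERT INTO " + indexTable + " SELECT " + select
                + " FROM " + dataTable;
    }

    public static void main(String[] args) {
        // For CREATE INDEX i ON t(UPPER(K2)) with PK (k1, k2), this prints:
        // UPSERT INTO I SELECT UPPER(K2), K1, K2 FROM T
        System.out.println(buildPopulationStatement("I", "T", Arrays.asList(
                new IndexedColumn("UPPER(K2)", "UPPER(K2)"),
                new IndexedColumn("K1", null),
                new IndexedColumn("K2", null))));
    }
}
{code}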
- For index maintenance, this occurs in IndexMaintainer for global and local indexes. For immutable indexes, this is done on the client side in two places: in DeleteCompiler, where we "translate" the DELETE statement against the data table into one DELETE per index table and run them all, and in MutationState.addRowMutations():227, where we again go through the IndexMaintainer to generate the index row Puts given a data row's Puts. So there are really two places that need to change: IndexMaintainer and DeleteCompiler.
- For DeleteCompiler, everything should just work the same, as we have the same columns in the row key as we did before (just not in the same positions). For example, given the same index as described above, with K1 and UPPER(K2) being indexed, we'd have the original K2 value at the end of the index row key, so a DELETE WHERE K2='Joe' would translate fine (albeit a bit slowly, since we'd end up doing a full table scan). Under certain circumstances, it may be more efficient to handle these cases by running a SELECT on the client side against the data table (instead of a DELETE) that includes all the indexed columns, and then issuing a DELETE on the client side for the data row and each translated index row (going through the IndexMaintainer to get this). A separate JIRA can be filed for this, because it's orthogonal to the functional index work (it may improve perf for the non-functional index case as well). ~~For DeleteCompiler, we end up with a TableRef[] and a List<QueryPlan> which are parallel to each other and contain the data table plus the immutable index tables, along with their respective QueryPlans. The idea here is that a DELETE with its WHERE clause is just like a SELECT. So when you do the index usage part that translates a SELECT query to use the index table and its functional index, you'll have done what you need for DELETE to work as well. This works well now to allow us to delete rows from a data table with immutable indexes and still maintain the index tables. However, in the case of a functional index, there will be more cases in which we cannot translate the DELETE statement, which will lead to more occurrences of DELETE throwing at DeleteCompiler.compile():388. For example, the following would throw, because we wouldn't be able to generate the corresponding DELETE for the index table:
{code}
CREATE TABLE t(k VARCHAR PRIMARY KEY);
CREATE INDEX i ON t(UPPER(K));
DELETE FROM t WHERE k='Joe';
{code}
However, the following would be fine:
{code}
DELETE FROM t WHERE UPPER(k)='Joe';
{code}
In theory, we could handle these cases by running a SELECT on the client side against the data table (instead of a DELETE) that includes all the indexed columns, and then issuing a DELETE on the client side for the data row and each translated index row (going through the IndexMaintainer to get this). I think this work can be done in a separate JIRA.~~
- For IndexMaintainer, you need to replace the Set<ColumnReference> indexedColumns member variable with a Set<Expression> member variable that stores the compiled value_expression for any functional index. We currently only store either the PK column position in the data row or the data column name for non-PK columns. The idea of the IndexMaintainer is that it's the "compiled" form that captures the minimum amount of information necessary to translate a data row to its index row. Since Expression is serializable in a pretty compact way, we can just serialize/deserialize these in the readFields and write methods. Then you'd change buildRowKey():400 to evaluate the functional expression. We'll need to have ValueGetter implement Tuple as well, as that's what you'll pass through the evaluate method, which supplies the "row state" during evaluation. You'll also need to modify buildDataRowKey() to not use the value expression (as that's the evaluated value), but rather the PK fields from the data row, which will also be in the index row. A rough sketch of the evaluation step follows.
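Here's that evaluation step in schematic form, assuming the ValueGetter-backed row state is presented as a Tuple; the class and member names are illustrative, not the actual IndexMaintainer code:
{code}
import java.io.ByteArrayOutputStream;
import java.util.Set;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.phoenix.expression.Expression;
import org.apache.phoenix.schema.tuple.Tuple;

// Illustrative sketch only, not the actual IndexMaintainer code: evaluate the
// compiled value_expressions against the current row state to form the
// leading (expression) part of the index row key.
public class FunctionalIndexKeySketch {
    // Replaces the old Set<ColumnReference> indexedColumns member. A real
    // implementation would keep these in index column order (e.g. a
    // LinkedHashSet or List) rather than an arbitrarily ordered Set.
    private final Set<Expression> indexedExpressions;

    public FunctionalIndexKeySketch(Set<Expression> indexedExpressions) {
        this.indexedExpressions = indexedExpressions;
    }

    // rowState would be the ValueGetter-backed Tuple described above.
    public byte[] buildIndexKeyPrefix(Tuple rowState) {
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        ImmutableBytesWritable ptr = new ImmutableBytesWritable();
        for (Expression expression : indexedExpressions) {
            // evaluate() points ptr at this row's value for the expression;
            // it returns false when no value could be produced.
            if (expression.evaluate(rowState, ptr)) {
                key.write(ptr.get(), ptr.getOffset(), ptr.getLength());
            }
            // The real code must also emit separators for variable-length
            // types and honor each column's sort order; omitted for brevity.
        }
        return key.toByteArray();
    }
}
{code}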
- For the *index usage*, you'll need to update QueryOptimizer. We currently take the original query and rewrite it only once (not once per index), replacing column references with the names they'd have if they occurred in an index. If we can't find a column reference, then we assume the index cannot be used; the exception is local indexes, which can be joined back to the data table in an efficient manner (since the rows are co-located on the same region server). For these, we track which columns cannot be resolved, push this to the server side, and look up the data row for a given index row to retrieve the column values. We'll no longer be able to rewrite the original query once. Instead, we'll need to work at the Expression level instead of at the ParseNode level, since only Expression nodes have equality implemented (plus, in the process of compilation, we normalize the Expression). The QueryPlan from the original query has most of the Expressions that need to be translated (the GROUP BY, ORDER BY, and SELECT expressions). The WHERE clause is not available, though; that's something we'd want to capture and put in the QueryPlan so you can access it. That'd be a change to QueryCompiler: pass the pre-optimized WHERE expression through in the construction of the ScanPlan/AggregatePlan.
- Create a new ExpressionVisitor that walks an Expression and its children. At each point, you'd want to check whether the Expression is contained in a Map<Expression,ColumnExpression> where the key is your value_expression and the value is the column reference to your index column. If found, the column expression replaces the Expression (see the sketch after this list).
- Use this new visitor to walk the Expression tree for all expressions in the original SELECT statement. At the end, you'd have a "rewritten" statement which could then be compiled. Note that we'd need a slightly different entry point to QueryCompiler that compiles based on this rewritten statement. A first cut, I suppose, could call toString() on the rewritten statement and then call QueryCompiler on the string. That'll cause the parsing to get redone, so this is suboptimal. Better would be to have compile work on Expressions rather than ParseNodes. Some of that work is already covered by PHOENIX-1491.
- Once you have the QueryPlan for each index, the code in QueryOptimizer should still do the correct thing in choosing which one is "best". So the main change is the way in which you'd get the compiled QueryPlan for each index.
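To make the visitor idea concrete, here's a minimal sketch of the substitution step, relying on Expression equality as described above; the class name and the rebuild() hook are assumptions for illustration, since the real implementation would go through Phoenix's ExpressionVisitor hierarchy:
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.phoenix.expression.ColumnExpression;
import org.apache.phoenix.expression.Expression;

// Illustrative sketch only: substitute an index column reference for any
// subtree that equals a persisted value_expression. rebuild() is left
// abstract because reconstructing a node with new children depends on the
// concrete Expression type.
public abstract class IndexExpressionRewriter {
    // Key: compiled value_expression; value: its index column reference.
    private final Map<Expression, ColumnExpression> indexedExpressions;

    protected IndexExpressionRewriter(Map<Expression, ColumnExpression> map) {
        this.indexedExpressions = map;
    }

    public Expression rewrite(Expression node) {
        // Expression equality (unlike ParseNode) lets us match normalized
        // subtrees directly, e.g. UPPER(K2) -> the "UPPER(K2)" index column.
        ColumnExpression replacement = indexedExpressions.get(node);
        if (replacement != null) {
            return replacement;
        }
        // No match at this node: rewrite the children, rebuilding if any changed.
        List<Expression> newChildren = new ArrayList<Expression>();
        boolean changed = false;
        for (Expression child : node.getChildren()) {
            Expression newChild = rewrite(child);
            changed |= (newChild != child);
            newChildren.add(newChild);
        }
        return changed ? rebuild(node, newChildren) : node;
    }

    // Hypothetical hook: return a copy of node with the given children.
    protected abstract Expression rebuild(Expression node, List<Expression> newChildren);
}
{code}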
A fair chunk of work to get this complete. Hopefully there's enough detail here to guide the implementation. Let me know if you have questions/comments.
> Support functional indexes
> --------------------------
>
>                 Key: PHOENIX-514
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-514
>             Project: Phoenix
>          Issue Type: Task
>            Reporter: James Taylor
>            Assignee: Thomas D'Silva
>              Labels: enhancement
>
> Instead of only defining the set of columns from the data table that make up
> an index, you should be able to use expressions. For example:
> CREATE INDEX upper_last_name_idx ON person (UPPER(last_name))
> Then in queries that use UPPER(last_name), we can replace them with column
> references to the index table.