[ 
https://issues.apache.org/jira/browse/PHOENIX-29?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907300#comment-13907300
 ] 

James Taylor commented on PHOENIX-29:
-------------------------------------

The empty KV serves a bunch of purposes:
- prevents us from having to project everything into the scan for cases like 
this. If there are multiple column families, this is a big perf win, because we 
won't open the other stores during filtering. There are other queries where 
this is necessary, in particular queries that use the IS NULL operator.
- prevents a row from "disappearing" if you set all KV column values to null. 
For example, say you have a table like this: CREATE TABLE t (a VARCHAR PRIMARY 
KEY, b VARCHAR); If you have a row 'abc' with a value of b='xyz', and then set 
b to null, we issue an HBase Delete for that KeyValue. If we didn't have an 
empty key value, the row would be gone.
- allows schemas that contain only row key columns. For example, you can have 
CREATE TABLE t (a VARCHAR PRIMARY KEY); The only KeyValue in this case is our 
empty key value. This would also be the case for any secondary index that 
doesn't include any covered columns.
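
To make the last two points concrete, here is a rough toy model (plain Java, not 
Phoenix or HBase code). It assumes a row exists only while it holds at least one 
KeyValue, and it uses "_0" as a placeholder qualifier for the empty key value; 
the class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of a table: row key -> (column qualifier -> value).
// Assumption: a row exists only while it holds at least one KeyValue.
final class EmptyKeyValueSketch {
    // Placeholder qualifier for the empty key value (an assumption for this sketch).
    static final String EMPTY_KV = "_0";

    final Map<String, Map<String, String>> rows = new TreeMap<>();
    final boolean writeEmptyKv;

    EmptyKeyValueSketch(boolean writeEmptyKv) {
        this.writeEmptyKv = writeEmptyKv;
    }

    // Upsert one non-key column; a null value is modeled as a Delete of that KeyValue.
    void upsert(String rowKey, String qualifier, String value) {
        Map<String, String> row = rows.computeIfAbsent(rowKey, k -> new TreeMap<>());
        if (writeEmptyKv) {
            row.put(EMPTY_KV, "");
        }
        if (value == null) {
            row.remove(qualifier);
        } else {
            row.put(qualifier, value);
        }
        if (row.isEmpty()) {
            rows.remove(rowKey); // no KeyValues left, so the row is gone
        }
    }

    // Upsert into a table that has only row key columns.
    void upsertKeyOnly(String rowKey) {
        if (writeEmptyKv) {
            rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(EMPTY_KV, "");
        }
        // Without the empty key value there is nothing to store for this row.
    }

    boolean rowExists(String rowKey) {
        return rows.containsKey(rowKey);
    }

    public static void main(String[] args) {
        for (boolean withEmptyKv : new boolean[] {true, false}) {
            EmptyKeyValueSketch t = new EmptyKeyValueSketch(withEmptyKv);
            t.upsert("abc", "b", "xyz");
            t.upsert("abc", "b", null); // set b to null
            System.out.println("empty KV written: " + withEmptyKv
                    + ", row 'abc' exists: " + t.rowExists("abc"));
        }
    }
}
```

With the empty key value written, the row survives setting b to null and a 
key-only row can be stored at all; without it, both rows vanish in this model.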

For some of these, we could take other approaches. For example, we could store 
an empty byte array as the value when you set a column to null. Or we could 
disallow schemas that don't include at least one key value column and let the 
user manage adding an empty value for each row in a kind of dummy key value 
column that they'd create. We could also detect when a query needs everything 
projected (often it doesn't). So it's possible we could drop this empty key 
value column, or make it optional.


> Add custom filter to more efficiently navigate KeyValues in row
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-29
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-29
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>         Attachments: PHOENIX-29.patch, PHOENIX-29_V2.patch
>
>
> Currently HBase is 50% faster at selecting the first KV in a row than in 
> selecting any other column. The reason is that when you project a column into 
> a Scan, HBase uses its ExplicitColumnTracker which does a reseek to the 
> column. The only case where this is not necessary is when the column is the 
> first one.
> In most cases (unless you have thousands of versions), it'd be more efficient 
> to just do a NEXT instead of a reseek (especially if your KV is the next 
> one). We can provide our own custom filter through which we pass two lists:
> 1) all KVs referenced in the select expressions. These are the only ones that 
> need to be returned back to the client which is another advantage we'd get 
> writing this custom filter.
> 2) all KVs referenced in the WHERE clause.
> The filter could sort the KVs using the standard KeyValue.COMPARATOR and 
> merge between them and the incoming KVs, using NEXT instead of a reseek. We 
> could potentially use a reseek if the number of columns in the table is 
> beyond a certain threshold.
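
A rough sketch of the merge described in the issue, in plain Java rather than as 
an actual HBase Filter. Qualifiers are modeled as strings, with their natural 
ordering standing in for KeyValue.COMPARATOR; the class, method, and field names 
are hypothetical. Each loop iteration plays the role of a NEXT, and no reseek is 
modeled.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Toy model of the proposed filter: walk a row's KeyValues in sorted order
// and merge them against the sorted set of referenced columns.
final class NextMergeSketch {
    static final class Result {
        final List<String> toClient = new ArrayList<>(); // KVs referenced in the select expressions
        final List<String> forWhere = new ArrayList<>(); // KVs referenced in the WHERE clause
    }

    static Result scanRow(List<String> sortedRow,
                          List<String> selectCols,
                          List<String> whereCols) {
        // Merge both lists into one sorted set of wanted columns.
        TreeSet<String> wanted = new TreeSet<>(selectCols);
        wanted.addAll(whereCols);

        Result result = new Result();
        Iterator<String> it = wanted.iterator();
        String target = it.hasNext() ? it.next() : null;

        for (String qualifier : sortedRow) { // each iteration models a NEXT
            // Wanted columns that sort before the current KV are absent from this row.
            while (target != null && target.compareTo(qualifier) < 0) {
                target = it.hasNext() ? it.next() : null;
            }
            if (target == null) {
                break; // past the last wanted column, done with this row
            }
            if (target.equals(qualifier)) {
                if (selectCols.contains(qualifier)) {
                    result.toClient.add(qualifier);
                }
                if (whereCols.contains(qualifier)) {
                    result.forWhere.add(qualifier);
                }
            }
            // Otherwise the current KV sorts before the target: skip it with NEXT.
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = scanRow(List.of("a", "b", "c", "d", "e"), List.of("b"), List.of("d"));
        System.out.println("returned to client: " + r.toClient);
        System.out.println("used for WHERE:     " + r.forWhere);
    }
}
```

Only the select-list KVs end up in toClient, which models the advantage noted in 
point 1 of not sending WHERE-only columns back to the client.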



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
