[ 
https://issues.apache.org/jira/browse/PHOENIX-29?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906593#comment-13906593
 ] 

James Taylor commented on PHOENIX-29:
-------------------------------------

There is one case we have to be careful of. We want to make sure to have a test 
case for a query like this:

select a, b from t where c = 5

when both a and b are absent. In this case, c would be projected, and our 
initial filter would let the row through with its c KV. Your filter would then 
remove the c KV, since it's not in the SELECT, which would leave the row with 
no KVs to return. However, every row contains an empty KV, and it should also 
be included in your columns to keep. That's this bit in ParallelIterators:

        if (projector.isProjectEmptyKeyValue()) {

So if this is true, make sure your filter has this empty KV in its column list 
and you should be ok. Note the optimization we do by using a 
FirstKeyOnlyFilter, as we've seen this to be faster. Even in this case, we'd 
want to make sure to add this empty cf:cq to your list in your new filter.
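
To make the failure mode concrete, here is a minimal plain-Java sketch of the column list logic. It does not use the HBase or Phoenix APIs: ProjectionSketch, columnsToKeep, project and the "_0" placeholder for the empty KV's qualifier are all hypothetical names made up for illustration, and a row is modeled as just a list of column names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ProjectionSketch {
    // Placeholder for the empty KV's column; an assumption for this sketch only.
    static final String EMPTY_KV = "_0";

    // Columns the new filter should keep: the SELECT columns, plus the
    // empty KV when the projector says it must be projected.
    static Set<String> columnsToKeep(List<String> selectColumns,
                                     boolean projectEmptyKeyValue) {
        Set<String> keep = new LinkedHashSet<>(selectColumns);
        if (projectEmptyKeyValue) {
            keep.add(EMPTY_KV);
        }
        return keep;
    }

    // Drops every column of the row that is not in the keep set.
    static List<String> project(List<String> rowColumns, Set<String> keep) {
        List<String> out = new ArrayList<>();
        for (String column : rowColumns) {
            if (keep.contains(column)) {
                out.add(column);
            }
        }
        return out;
    }
}
```

For the query above, a row where a and b are absent holds only the empty KV and c. Projecting it with columnsToKeep([a, b], false) leaves nothing, while columnsToKeep([a, b], true) leaves the empty KV, so the row still has something to return.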

There may be a special case as well for what we call a "mapped" VIEW, which 
maps a Phoenix table directly onto an existing HBase table where you *don't* 
want us to add this empty key value. In that case, I have to think a bit more. 
We might not want to add your filter in this case. It's probably safest to add 
it only if projector.isProjectEmptyKeyValue() is true.


> Add custom filter to more efficiently navigate KeyValues in row
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-29
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-29
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>         Attachments: PHOENIX-29.patch, PHOENIX-29_V2.patch
>
>
> Currently HBase is 50% faster at selecting the first KV in a row than in 
> selecting any other column. The reason is that when you project a column into 
> a Scan, HBase uses its ExplicitColumnTracker which does a reseek to the 
> column. The only case where this is not necessary is when the column is the 
> first one.
> In most cases (unless you have thousands of versions), it'd be more efficient 
> to just do a NEXT instead of a reseek (especially if your KV is the next 
> one). We can provide our own custom filter through which we pass two lists:
> 1) all KVs referenced in the select expressions. These are the only ones that 
> need to be returned to the client, which is another advantage we'd get from 
> writing this custom filter.
> 2) all KVs referenced in the WHERE clause.
> The filter could sort the KVs using the standard KeyValue.COMPARATOR and 
> merge between them and the incoming KVs, using NEXT instead of a reseek. We 
> could potentially use a reseek if the number of columns in the table is 
> beyond a certain threshold.
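
The merge described in the issue could look roughly like the sketch below. It is a plain-Java simulation, not an HBase Filter: ColumnMergeSketch and merge are hypothetical names, columns are plain strings compared with String ordering rather than KeyValue.COMPARATOR, versions and timestamps are ignored, and the reseek-above-a-threshold fallback is left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

final class ColumnMergeSketch {
    // Walks the row's columns (already sorted) alongside the sorted set of
    // wanted columns. Each incoming column is visited once, which stands in
    // for doing a NEXT per KV instead of a reseek to each wanted column.
    static List<String> merge(List<String> incomingSorted, TreeSet<String> wanted) {
        List<String> included = new ArrayList<>();
        Iterator<String> wantedIt = wanted.iterator();
        String target = wantedIt.hasNext() ? wantedIt.next() : null;

        for (String column : incomingSorted) {
            // Skip wanted columns that this row does not contain.
            while (target != null && target.compareTo(column) < 0) {
                target = wantedIt.hasNext() ? wantedIt.next() : null;
            }
            if (target == null) {
                break; // nothing left to find; a real filter could move to the next row
            }
            if (target.equals(column)) {
                included.add(column); // wanted column found: include it
            }
            // Otherwise the column sorts before the target: just move on (NEXT).
        }
        return included;
    }
}
```

With incoming columns [_0, a, c, d] and wanted columns {_0, b, c}, this returns [_0, c] and stops at d because no wanted column remains.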



