[
https://issues.apache.org/jira/browse/PHOENIX-29?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907204#comment-13907204
]
James Taylor commented on PHOENIX-29:
-------------------------------------
Case #3 is an improvement only for mapped VIEWs. We can do this in a follow-up
check-in if you want. With a mapped VIEW, we do not have an empty key value in
each row. When a user creates a view directly against an HBase table, it's
read-only, and we do not insert this empty key value as we do when a table is
created. So at query time, we cannot rely on it being there. The reason we add
this empty key value is for cases like this one:
{code}
select a, b from t where c = 5
{code}
Since we don't have an empty key value, we have to project everything, since
otherwise we'd miss rows where a and b are null. This is already somewhat more
expensive than with regular tables: imagine the case where there are five
column families, and we'd open a store for each of them. For the table case,
by contrast, we know we have the empty key value, so we only need to project
the single column family that contains our empty key value. By implementing
case #3, we'd prune what gets returned to the client for mapped views (just
like we do with your patch for tables). But the trick is that we'd need to
dynamically insert an empty key value in each row before your filter runs.
Then the same thing will happen with your filter: there'd be no a or b KV, c
would get removed, and because there's the empty key value, we'd still return
a row to the client (which is what we need to have happen).
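
To make the above concrete, here is a minimal sketch in plain Java (no HBase or Phoenix classes) of how a synthesized empty key value keeps a row alive after pruning. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the "_0" qualifier is only a stand-in for the empty key value, and a row is modeled as a simple qualifier-to-value map rather than real KeyValues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class EmptyKeyValueSketch {
    // Hypothetical stand-in qualifier for the empty key value.
    static final String EMPTY_KV = "_0";

    /**
     * Models one row passing through the filter for a query like
     * "select a, b from t where c = 5".
     * Returns the qualifiers sent back to the client; an empty list
     * means no row reaches the client.
     */
    static List<String> filterRow(Map<String, String> row, Set<String> selectCols,
                                  String whereCol, String whereValue,
                                  boolean synthesizeEmptyKv) {
        // Evaluate the WHERE clause against the full row first.
        if (!whereValue.equals(row.get(whereCol))) {
            return new ArrayList<>();
        }
        // Work on a sorted copy so the caller's row is not modified.
        Map<String, String> cells = new TreeMap<>(row);
        if (synthesizeEmptyKv && !cells.containsKey(EMPTY_KV)) {
            // Mapped view case: insert the empty key value before pruning.
            cells.put(EMPTY_KV, "");
        }
        // Prune: keep only the select columns and the empty key value.
        List<String> kept = new ArrayList<>();
        for (String qualifier : cells.keySet()) {
            if (qualifier.equals(EMPTY_KV) || selectCols.contains(qualifier)) {
                kept.add(qualifier);
            }
        }
        return kept;
    }
}
```

For a row that holds only c = 5, pruning without the synthesized key value leaves nothing to return, so the row would be lost even though it matches. With it, the empty key value survives and the client sees a row in which a and b are null.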
> Add custom filter to more efficiently navigate KeyValues in row
> ---------------------------------------------------------------
>
> Key: PHOENIX-29
> URL: https://issues.apache.org/jira/browse/PHOENIX-29
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
> Attachments: PHOENIX-29.patch, PHOENIX-29_V2.patch
>
>
> Currently HBase is 50% faster at selecting the first KV in a row than in
> selecting any other column. The reason is that when you project a column into
> a Scan, HBase uses its ExplicitColumnTracker, which does a reseek to the
> column. The only case where this is not necessary is when the column is the
> first one.
> In most cases (unless you have thousands of versions), it'd be more efficient
> to just do a NEXT instead of a reseek (especially if your KV is the next
> one). We can provide our own custom filter through which we pass two lists:
> 1) all KVs referenced in the select expressions. These are the only ones that
> need to be returned to the client, which is another advantage we'd get
> by writing this custom filter.
> 2) all KVs referenced in the WHERE clause.
> The filter could sort the KVs using the standard KeyValue.COMPARATOR and
> merge between them and the incoming KVs, using NEXT instead of a reseek. We
> could potentially use a reseek if the number of columns in the table is
> beyond a certain threshold.
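
The merge described in the issue can be sketched roughly as follows. This is plain Java, not the actual patch: the class and method names are hypothetical, qualifiers are modeled as strings compared lexicographically instead of KeyValues ordered by KeyValue.COMPARATOR, and each column is assumed to have a single version.

```java
import java.util.ArrayList;
import java.util.List;

public class NextMergeSketch {
    /**
     * Walks one row's qualifiers (already sorted, as a scan would deliver
     * them) and collects those present in the sorted wanted list. Each loop
     * iteration models a single NEXT step; no reseek is ever issued.
     */
    static List<String> mergeRow(List<String> incomingSorted, List<String> wantedSorted) {
        List<String> matched = new ArrayList<>();
        int w = 0; // index of the next wanted qualifier still to be found
        for (String incoming : incomingSorted) {
            // Wanted qualifiers that sort before the current cell are absent from this row.
            while (w < wantedSorted.size() && wantedSorted.get(w).compareTo(incoming) < 0) {
                w++;
            }
            if (w == wantedSorted.size()) {
                break; // nothing left to find; a real filter would move to the next row
            }
            if (wantedSorted.get(w).equals(incoming)) {
                matched.add(incoming); // include this cell
                w++;
            }
            // Otherwise the cell is not wanted: skip it by taking the next step.
        }
        return matched;
    }
}
```

Because both lists are sorted, one pass visits each cell at most once. The threshold mentioned in the issue would be the point where skipping many unwanted cells one by one costs more than a reseek; this sketch does not model that switch.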