[ 
https://issues.apache.org/jira/browse/PHOENIX-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742917#comment-14742917
 ] 

Lars Hofhansl commented on PHOENIX-1940:
----------------------------------------

In my test above I should note that all data fit into the block cache. If data 
is read from disk I'd expect the smaller set to win hands down.

[~stack], the patch is already done and committed in PHOENIX-2237 (that one 
halves the amount of time spent in compare).

Looking for further improvements. The core of the issue is that (a) HBase 
might not return values for all columns and (b) the CQ does not directly imply 
its ordinal position, so we need to search for it. HBase's Result object does 
the same search, BTW. It might be worth applying there what I did for 
KeyValueUtil.getColumnLatest, although there we may get multiple versions, and 
hence we'd have to do the inexact match.

I do not have a patch for the positional logic yet. Need to think about it a 
bit. What we need to achieve is that from the CQ we can directly get the 
ordinal position. That would be possible if it were an int. We'd still need to 
decode the int (perhaps from a string), but that's faster than all the compares 
implied by a binary search.
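
To make the decoding step concrete, here is a rough sketch of turning a 
qualifier's bytes into an ordinal. The class and method names are hypothetical 
and nothing here is existing Phoenix or HBase code. It assumes the CQ is either 
a non-negative decimal number in ASCII or a fixed 4-byte big-endian int, and it 
does not guard against overflow.

```java
/** Hypothetical helper: decodes an ordinal position from column qualifier bytes. */
final class QualifierOrdinal {
    private QualifierOrdinal() {}

    /** Assumes the qualifier is a non-negative decimal number written in ASCII digits. */
    static int fromAsciiDecimal(byte[] cq, int offset, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("empty qualifier");
        }
        int value = 0;
        for (int i = offset; i < offset + length; i++) {
            int digit = cq[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new IllegalArgumentException("not a decimal qualifier");
            }
            value = value * 10 + digit; // no overflow check in this sketch
        }
        return value;
    }

    /** Assumes the qualifier is a 4-byte big-endian int starting at offset. */
    static int fromFixedWidth(byte[] cq, int offset) {
        return ((cq[offset] & 0xFF) << 24)
             | ((cq[offset + 1] & 0xFF) << 16)
             | ((cq[offset + 2] & 0xFF) << 8)
             |  (cq[offset + 3] & 0xFF);
    }
}
```

Either variant is one pass over a handful of bytes with no byte[] compares 
against other qualifiers, which is the point of the idea.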

The 2nd part is that we need to have the columns in a data structure that 
allows ordinal access. An array would be best, but it would need placeholders 
for the columns that weren't returned. Even a HashMap would be an improvement, 
since a hash of an Integer is its value and int compares are cheap. I'll corner 
[~giacomotaylor] this week, and discuss :)
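
A minimal sketch of the array-with-placeholders variant, to show the shape of 
it. PositionalRow is a made-up name, and the cell type is left generic so the 
example stands alone instead of depending on HBase's Cell. It assumes the 
caller supplies a function that maps a cell to its ordinal (decoded from the 
CQ) and that, within a column, the newest version comes first in the list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical holder giving ordinal access to the cells returned for one row. */
final class PositionalRow<C> {
    // One slot per projected column; null is the placeholder for "not returned".
    private final Object[] slots;

    PositionalRow(int columnCount) {
        this.slots = new Object[columnCount];
    }

    /** Resets the slots and places each returned cell at its ordinal position. */
    void load(List<C> returned, ToIntFunction<C> ordinalOf) {
        Arrays.fill(slots, null);
        for (C cell : returned) {
            int ordinal = ordinalOf.applyAsInt(cell);
            // Keep only the first cell seen per column (assumed to be the latest
            // version) and ignore ordinals outside the projected range.
            if (ordinal >= 0 && ordinal < slots.length && slots[ordinal] == null) {
                slots[ordinal] = cell;
            }
        }
    }

    /** Returns the cell at the given ordinal, or null if that column was not returned. */
    @SuppressWarnings("unchecked")
    C get(int ordinal) {
        return (C) slots[ordinal];
    }
}
```

Filling the array is one pass over the returned cells per row, and each lookup 
after that is a plain array index. The instance can be reused across rows since 
load() clears the slots first.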


> Push expected List<Cell> ordinal position in KeyValueColumnExpression
> ---------------------------------------------------------------------
>
>                 Key: PHOENIX-1940
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1940
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>
> Looks like quite a bit of time is spent in the binary search done to get the 
> latest Cell value when we're evaluating expressions on the server side (up to 
> 60% is spent in KeyValueUtil.getColumnLatest()). Since we know the set of 
> column qualifiers being projected into the scan, we could push the expected 
> position (assuming all columns have values). If the Cell is not in that 
> position, we could fall back to a binary search.
> Further enhancements could be to: allow a not null constraint on KeyValue 
> columns and either a) require all non null values to be provided on an 
> UPSERT, or b) do a check and put to enforce it (for transactional tables this 
> could be enforced).
> Additionally, the table could declare that dynamic columns are not allowed. 
> If both of the above are true, then we'd be able to guarantee positional 
> access into the List<Cell> that we get back from an HBase Scanner.
> One further enhancement would be to collect a set of all ColumnExpression 
> instances on the server side for all expressions sent over. Then, we'd bind 
> them once, outside of the general expression evaluation of all expressions in 
> a statement for a given row. An example of where this would save time would 
> be in evaluating the following TPCH-Q1 aggregate query:
> {code}
> SELECT
>     l_returnflag,
>     l_linestatus,
>     sum(l_quantity) as sum_qty,
>     sum(l_extendedprice) as sum_base_price,
>     sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>     avg(l_quantity) as avg_qty,
>     avg(l_extendedprice) as avg_price,
>     avg(l_discount) as avg_disc,
>     count(*) as count_order
> FROM
>     lineitem
> WHERE
>     l_shipdate <= date '1998-12-01' - interval '90' day
> GROUP BY
>     l_returnflag,
>     l_linestatus
> ORDER BY
>     l_returnflag,
>     l_linestatus;
> {code}
> During aggregation, the KeyValueColumnExpression for l_extendedprice would be 
> evaluated four times currently, once per occurrence in different SELECT 
> expressions. This enhancement would cut that down to once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
