[jira] [Commented] (PHOENIX-1940) Push expected List<Cell> ordinal position in KeyValueColumnExpression

Lars Hofhansl (JIRA) Sat, 12 Sep 2015 18:09:07 -0700

    [ https://issues.apache.org/jira/browse/PHOENIX-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742284#comment-14742284 ]


Lars Hofhansl commented on PHOENIX-1940:
----------------------------------------

As a test, I shortened all the column qualifiers (CQs) like so:
{code}
CREATE TABLE LINEITEM1 (
    OK  INTEGER NOT NULL,
    PK  INTEGER,
    SK  INTEGER,
    LN  INTEGER NOT NULL,
    QTY DECIMAL(15,2),
    EP  DECIMAL(15,2),
    DSC DECIMAL(15,2),
    TAX DECIMAL(15,2),
    RF  CHAR(1),
    LS  CHAR(1),
    SD  DATE,
    CD  DATE,
    RD  DATE,
    SI  CHAR(25),
    SM  CHAR(10),
    CO  VARCHAR(44),
    CONSTRAINT pk PRIMARY KEY (OK, LN));
{code}
The result is 1.22GB in size (as opposed to 1.77GB with the longer names).

To my surprise, the query is _not_ noticeably faster! Most of the time is still 
spent in SearchComparator.compare, followed by SQM.match, followed by 
FastDiffDeltaEncode.decodeNext.

The good news is that SearchComparator.compare now accounts for about 30% of 
the CPU time (it was around 60% before), but there's still a lot of work to do 
(in both Phoenix and HBase).


> Push expected List<Cell> ordinal position in KeyValueColumnExpression
> ---------------------------------------------------------------------
>
>                 Key: PHOENIX-1940
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1940
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>
> Looks like quite a bit of time is spent in the binary search done to get the 
> latest Cell value when we're evaluating expressions on the server side (up to 
> 60% is spent in KeyValueUtil.getColumnLatest()). Since we know the set of 
> column qualifiers being projected into the scan, we could push the expected 
> position (assuming all columns have values). If the Cell is not in that 
> position, we could fall back to a binary search.
> Further enhancements could be to: allow a not null constraint on KeyValue 
> columns and either a) require all non null values to be provided on an 
> UPSERT, or b) do a check and put to enforce it (for transactional tables this 
> could be enforced).
> Additionally, the table could declare that dynamic columns are not allowed. 
> If both of the above are true, then we'd be able to guarantee positional 
> access into the List<Cell> that we get back from an HBase Scanner.
> One further enhancement would be to collect a set of all ColumnExpression 
> instances on the server side for all expressions sent over. Then, we'd bind 
> them once, outside of the general expression evaluation of all expressions in 
> a statement for a given row. An example of where this would save time would 
> be in evaluating the following TPCH-Q1 aggregate query:
> {code}
> SELECT
>     l_returnflag,
>     l_linestatus,
>     sum(l_quantity) as sum_qty,
>     sum(l_extendedprice) as sum_base_price,
>     sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>     avg(l_quantity) as avg_qty,
>     avg(l_extendedprice) as avg_price,
>     avg(l_discount) as avg_disc,
>     count(*) as count_order
> FROM
>     lineitem
> WHERE
>     l_shipdate <= date '1998-12-01' - interval '90' day
> GROUP BY
>     l_returnflag,
>     l_linestatus
> ORDER BY
>     l_returnflag,
>     l_linestatus;
> {code}
> During aggregation, the KeyValueColumnExpression for l_extendedprice would be 
> evaluated four times currently, once per occurrence in different SELECT 
> expressions. This enhancement would cut that down to once.
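
The "expected position, else binary search" idea from the description could be sketched roughly as below. This is only an illustration under stated assumptions, not Phoenix or HBase code: SimpleCell, PositionalLookup and find are hypothetical names, and qualifiers are compared as Java strings rather than byte arrays to keep the example short.

```java
import java.util.List;

// Hypothetical stand-in for an HBase Cell: only a column qualifier and a value.
final class SimpleCell {
    final String qualifier;
    final String value;

    SimpleCell(String qualifier, String value) {
        this.qualifier = qualifier;
        this.value = value;
    }
}

final class PositionalLookup {
    /**
     * Returns the cell for the given qualifier, or null if the row has none.
     * Assumes cells are sorted by qualifier, as a scanner would return them.
     */
    static SimpleCell find(List<SimpleCell> cells, String qualifier, int expectedPos) {
        // Fast path: the cell sits at the position the projection predicted.
        if (expectedPos >= 0 && expectedPos < cells.size()
                && cells.get(expectedPos).qualifier.equals(qualifier)) {
            return cells.get(expectedPos);
        }
        // Slow path: some column was absent (or extra), so binary search instead.
        int lo = 0;
        int hi = cells.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = cells.get(mid).qualifier.compareTo(qualifier);
            if (cmp == 0) {
                return cells.get(mid);
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null;
    }
}
```

With a fully populated row (DSC, EP, QTY), find(cells, "EP", 1) returns on the fast path after a single comparison. If DSC has no value for that row, EP shifts to position 0, the position check misses, and the binary search still finds it.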

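The "bind once per row" enhancement from the description might look something like the following. Again this is a sketch under assumptions rather than how Phoenix evaluates expressions: Expr and BindOnceEvaluator are hypothetical names, a row is modelled as a plain Map, every referenced column is assumed to have a value, and the lookups counter exists only to make the saving visible.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleFunction;

// Hypothetical expression: the columns it reads plus a function over bound values.
final class Expr {
    final List<String> columns;
    final ToDoubleFunction<Map<String, Double>> fn;

    Expr(List<String> columns, ToDoubleFunction<Map<String, Double>> fn) {
        this.columns = columns;
        this.fn = fn;
    }
}

final class BindOnceEvaluator {
    private final List<Expr> exprs;
    private final Set<String> distinctColumns = new LinkedHashSet<>();
    int lookups = 0; // number of row lookups performed, for illustration only

    BindOnceEvaluator(List<Expr> exprs) {
        this.exprs = exprs;
        // Collect each referenced column once, however many expressions use it.
        for (Expr e : exprs) {
            distinctColumns.addAll(e.columns);
        }
    }

    double[] evaluate(Map<String, Double> row) {
        // Bind phase: one lookup per distinct column for this row.
        Map<String, Double> bound = new HashMap<>();
        for (String column : distinctColumns) {
            bound.put(column, row.get(column));
            lookups++;
        }
        // Evaluate phase: expressions read only the already-bound values.
        double[] out = new double[exprs.size()];
        for (int i = 0; i < exprs.size(); i++) {
            out[i] = exprs.get(i).fn.applyAsDouble(bound);
        }
        return out;
    }
}
```

For three expressions over EP, EP and DSC, and EP, DSC and TAX, the row is consulted three times (once per distinct column) instead of six times (once per reference).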


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
