[
https://issues.apache.org/jira/browse/PHOENIX-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742284#comment-14742284
]
Lars Hofhansl commented on PHOENIX-1940:
----------------------------------------
As a test I shortened all the column qualifiers (CQs) like so:
{code}
CREATE TABLE LINEITEM1 ( OK INTEGER not null,
PK INTEGER ,
SK INTEGER ,
LN INTEGER not null,
QTY DECIMAL(15,2) ,
EP DECIMAL(15,2) ,
DSC DECIMAL(15,2) ,
TAX DECIMAL(15,2) ,
RF CHAR(1) ,
LS CHAR(1) ,
SD DATE ,
CD DATE ,
RD DATE ,
SI CHAR(25) ,
SM CHAR(10) ,
CO VARCHAR(44),
constraint pk primary key (ok, ln));
{code}
The result is 1.22GB in size (as opposed to 1.77GB with the longer names).
To my surprise the query is _not_ noticeably faster! Most of the time is still
spent in SearchComparator.compare, followed by SQM.match, followed by
FastDiffDeltaEncode.decodeNext.
The good news is that SearchComparator.compare now accounts for about 30% of
the CPU time (was around 60% before), but there's still a lot of work to do
(in both Phoenix and HBase).
> Push expected List<Cell> ordinal position in KeyValueColumnExpression
> ---------------------------------------------------------------------
>
> Key: PHOENIX-1940
> URL: https://issues.apache.org/jira/browse/PHOENIX-1940
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
>
> Looks like quite a bit of time is spent in the binary search done to get the
> latest Cell value when we're evaluating expressions on the server side (up to
> 60% is spent in KeyValueUtil.getColumnLatest()). Since we know the set of
> column qualifiers being projected into the scan, we could push the expected
> position (assuming all columns have values). If the Cell is not in that
> position, we could fall back to a binary search.
> Further enhancements could be to: allow a not null constraint on KeyValue
> columns and either a) require values for all not null columns to be provided
> on an UPSERT, or b) do a check and put to enforce it (for transactional
> tables this could be enforced).
> Additionally, the table could declare that dynamic columns are not allowed.
> If both of the above are true, then we'd be able to guarantee positional
> access into the List<Cell> that we get back from an HBase Scanner.
> One further enhancement would be to collect a set of all ColumnExpression
> instances on the server side for all expressions sent over. Then, we'd bind
> them once, outside of the general expression evaluation of all expressions in
> a statement for a given row. An example of where this would save time would
> be in evaluating the following TPCH-Q1 aggregate query:
> {code}
> SELECT
> l_returnflag,
> l_linestatus,
> sum(l_quantity) as sum_qty,
> sum(l_extendedprice) as sum_base_price,
> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
> avg(l_quantity) as avg_qty,
> avg(l_extendedprice) as avg_price,
> avg(l_discount) as avg_disc,
> count(*) as count_order
> FROM
> lineitem
> WHERE
> l_shipdate <= date '1998-12-01' - interval '90' day
> GROUP BY
> l_returnflag,
> l_linestatus
> ORDER BY
> l_returnflag,
> l_linestatus;
> {code}
> During aggregation, the KeyValueColumnExpression for l_extendedprice would be
> evaluated four times currently, once per occurrence in different SELECT
> expressions. This enhancement would cut that down to once.
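
For illustration only, here is a rough Java sketch of the two ideas in the description: try the expected ordinal position first and fall back to a binary search when the cell is not there, and bind each column once per row so repeated references do not repeat the lookup. It is not Phoenix or HBase code. SimpleCell, find, RowBinding and the counters are hypothetical names made up for this sketch, and it assumes the cells of a row arrive sorted by qualifier in unsigned byte order.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PositionalCellLookup {

    // Hypothetical stand-in for an HBase Cell: just a qualifier and a value.
    static final class SimpleCell {
        final byte[] qualifier;
        final String value;

        SimpleCell(String qualifier, String value) {
            this.qualifier = qualifier.getBytes(StandardCharsets.UTF_8);
            this.value = value;
        }
    }

    // Counts how often the slow path runs; for illustration only.
    static int binarySearches = 0;

    // Unsigned lexicographic byte comparison (assumed sort order of the row).
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    // Try the expected position first; fall back to binary search on a miss.
    // Returns null if the row has no cell for the qualifier.
    static SimpleCell find(List<SimpleCell> row, byte[] qualifier, int expectedPos) {
        if (expectedPos >= 0 && expectedPos < row.size()
                && Arrays.equals(row.get(expectedPos).qualifier, qualifier)) {
            return row.get(expectedPos); // fast path: one comparison
        }
        binarySearches++;
        int lo = 0, hi = row.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compare(row.get(mid).qualifier, qualifier);
            if (cmp == 0) return row.get(mid);
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return null;
    }

    // Per-row binding: each column is resolved at most once per row,
    // no matter how many expressions reference it.
    static final class RowBinding {
        private final List<SimpleCell> row;
        private final Map<String, SimpleCell> bound = new HashMap<>();
        int lookups = 0;

        RowBinding(List<SimpleCell> row) {
            this.row = row;
        }

        SimpleCell get(String name, int expectedPos) {
            if (!bound.containsKey(name)) {
                lookups++;
                bound.put(name, find(row, name.getBytes(StandardCharsets.UTF_8), expectedPos));
            }
            return bound.get(name);
        }
    }

    public static void main(String[] args) {
        List<SimpleCell> row = Arrays.asList(
                new SimpleCell("DSC", "0.05"), new SimpleCell("EP", "100.00"),
                new SimpleCell("QTY", "17"), new SimpleCell("TAX", "0.02"));
        RowBinding binding = new RowBinding(row);
        // Four references to EP, as in the four uses of l_extendedprice above.
        for (int i = 0; i < 4; i++) {
            binding.get("EP", 1);
        }
        System.out.println("lookups=" + binding.lookups
                + " binarySearches=" + binarySearches);
    }
}
```

When every projected column has a value, the fast path costs a single qualifier comparison per column. A row with a missing column shifts the positions after it, so those lookups take the binary search path, which is the fallback the description proposes.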
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)