[
https://issues.apache.org/jira/browse/PHOENIX-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520502#comment-14520502
]
Lars Hofhansl edited comment on PHOENIX-1940 at 4/29/15 11:22 PM:
------------------------------------------------------------------
And imported with this: {{psql.py -t LINEITEM -d '|' localhost lineitem.csv}}
after ungzipping and renaming the data file to csv.
was (Author: lhofhansl):
And imported with this: {{psql.py -t LINEITEM -d '|' phoenix-1:2181
lineitem.csv}} after ungzipping and renaming the data file to csv.
> Push expected List<Cell> ordinal position in KeyValueColumnExpression
> ---------------------------------------------------------------------
>
> Key: PHOENIX-1940
> URL: https://issues.apache.org/jira/browse/PHOENIX-1940
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
>
> Looks like quite a bit of time is spent in the binary search done to get the
> latest Cell value when we're evaluating expressions on the server side (up to
> 60% is spent in KeyValueUtil.getColumnLatest()). Since we know the set of
> column qualifiers being projected into the scan, we could push the expected
> position (assuming all columns have values). If the Cell is not in that
> position, we could fall back to a binary search.
> Further enhancements could be to allow a NOT NULL constraint on KeyValue
> columns and either a) require values for all NOT NULL columns to be provided
> on an UPSERT, or b) do a check-and-put to enforce it (for transactional tables
> this could be enforced).
> Additionally, the table could declare that dynamic columns are not allowed.
> If both of the above are true, then we'd be able to guarantee positional
> access into the List<Cell> that we get back from an HBase Scanner.
> One further enhancement would be to collect a set of all ColumnExpression
> instances on the server side for all expressions sent over. Then, we'd bind
> them once, outside of the general expression evaluation of all expressions in
> a statement for a given row. An example of where this would save time would
> be in evaluating the following TPCH-Q1 aggregate query:
> {code}
> SELECT
> l_returnflag,
> l_linestatus,
> sum(l_quantity) as sum_qty,
> sum(l_extendedprice) as sum_base_price,
> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
> avg(l_quantity) as avg_qty,
> avg(l_extendedprice) as avg_price,
> avg(l_discount) as avg_disc,
> count(*) as count_order
> FROM
> lineitem
> WHERE
> l_shipdate <= date '1998-12-01' - interval '90' day
> GROUP BY
> l_returnflag,
> l_linestatus
> ORDER BY
> l_returnflag,
> l_linestatus;
> {code}
> During aggregation, the KeyValueColumnExpression for l_extendedprice would be
> evaluated four times currently, once per occurrence in different SELECT
> expressions. This enhancement would cut that down to once.
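The "try the expected position first, fall back to a binary search" idea from the description can be sketched roughly as below. This is only an illustration under simplifying assumptions, not Phoenix code: SimpleCell and PositionalCellLookup are hypothetical names, qualifiers are plain Strings rather than byte arrays, and the cell list is assumed to be sorted by qualifier, as it would be for a single column family coming back from a scanner.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an HBase Cell: only qualifier and value are modelled.
final class SimpleCell {
    final String qualifier;
    final String value;

    SimpleCell(String qualifier, String value) {
        this.qualifier = qualifier;
        this.value = value;
    }
}

// Hypothetical helper, not part of Phoenix or HBase.
final class PositionalCellLookup {
    private static final Comparator<SimpleCell> BY_QUALIFIER =
            Comparator.comparing((SimpleCell c) -> c.qualifier);

    // Returns the cell for the given qualifier, or null if the row has no such cell.
    // The cells list must be sorted by qualifier.
    static SimpleCell find(List<SimpleCell> cells, String qualifier, int expectedPosition) {
        // Fast path: if every projected column has a value, the cell sits at its
        // expected ordinal position and a single comparison confirms it.
        if (expectedPosition >= 0 && expectedPosition < cells.size()) {
            SimpleCell candidate = cells.get(expectedPosition);
            if (candidate.qualifier.equals(qualifier)) {
                return candidate;
            }
        }
        // Slow path: some column was missing (or extra), so positions have shifted.
        int idx = Collections.binarySearch(cells, new SimpleCell(qualifier, null), BY_QUALIFIER);
        return idx >= 0 ? cells.get(idx) : null;
    }
}
```

With a fully populated row (qualifiers A, B, C), find(cells, "B", 1) is answered by the fast path. If B has no value in a row, a lookup of C at expected position 2 misses and is answered by the binary search instead.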