Dayue Gao created KYLIN-2501:
--------------------------------
Summary: Stream Aggregate GTRecords at Query Server
Key: KYLIN-2501
URL: https://issues.apache.org/jira/browse/KYLIN-2501
Project: Kylin
Issue Type: Improvement
Components: Query Engine, Storage - HBase
Affects Versions: v1.6.0
Reporter: Dayue Gao
Assignee: Dayue Gao
*Problem*
When query server needs to handle millions of records from storage,
CubeTupleConverter could become performance bottleneck.
An experiment shows that converting 5 millions records takes ~11s, which
accounts for 50% of the total query time.
*Motivation*
Records returned from each storage partition is guaranteed to be ordered.
Therefore we could reduce the number of records passed to CubeTupleConverter by
# merge sorted records from all partitions, similar to what we have done in
KYLIN-1787
# use a [stream
aggregate|https://blogs.msdn.microsoft.com/craigfr/2006/09/13/stream-aggregate/]
algorithm on merged stream to aggregate those records with the same key
*Proposal*
# Add a new physical operator GTStreamAggregateScanner which implements the
stream aggregate algorithm
# Refine SortedIteratorMergerWithLimit that was used to merge sort records from
different partitions. The previous implementation has performance issues
(KYLIN-2483) due to expensive record clone
# Leverage GTStreamAggregateScanner to aggregate records on merged stream
*Scope*
Stream aggregate has some good properties such as low memory usage and
streamable ordered outputs, making it better than hash/sort based alternatives
when input is already sorted. So I bet the new GTStreamAggregateScanner
operator can also be used to accelerating cubing and coprocessor in certain
cases. I will leave it as future works.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)