[
https://issues.apache.org/jira/browse/PHOENIX-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640677#comment-14640677
]
Ankit Singhal commented on PHOENIX-2126:
----------------------------------------
[~samarthjain],
As also mentioned by James,
Minheap will also help in MergeSortTopNResultIterator as the number of
comparison will reduce to O(logK) from O(K)(except first iteration in which
there will still be O(K)) as you will update heap with the next element from
the iterator from which you have removed the element. so overall performance
for MergeSortTopNResultIterator will be O(limit*logK) where K is the number of
salt buckets or guide post chunks.
But Anyways I have not optimized this and falling back to the original
implementation of minIterator whenever MergeSortTopResultIterator is used as
this will not yield much benefit as MergeSortTopResultIterator is used only
when you have orderBy on key which itself is not the high cardinality problem.
High cardinalty problem generally becomes when you have orderBy on aggregate
value (sum(k)), in which MergeSortRowKeyResultIterator(not
MergeSortTopNResultIterator) is used and you need to sort and aggregate all the
values returned from servers and so in this case the implementation is actually
changed to minheap and multi-threaded implementation.
> Improving performance of merge sort by multi-threaded and minheap
> implementation
> --------------------------------------------------------------------------------
>
> Key: PHOENIX-2126
> URL: https://issues.apache.org/jira/browse/PHOENIX-2126
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 4.1.0, 4.2.0
> Reporter: Ankit Singhal
> Attachments: PHOENIX-2126_v1.0.patch
>
>
> {code}
> CREATE TABLE IF NOT EXISTS test (
> dim1 INTEGER NOT NULL,
> A.B INTEGER,
> A.M DECIMAL,
> CONSTRAINT PK PRIMARY KEY
> (dim1))
> SALT_BUCKETS =256,DEFAULT_COLUMN_FAMILY='A';
> {code}
> *Query to benchmark:-*
> {code}
> select dim1,sum(b),sum(m) from test where Datemth>=201505 and Datemth<=201505
> and dim1 IS NOT NULL group by dim1 order by sum(m) desc nulls last limit 10;
> {code}
> *current scenario:-*
> *CASE 1: * consider the case when dim1 is high cardinality attribute (10K+)
> and table have salt bucket set to 256, we will get 256 iterators from above
> query at the client and MergeSortRowKeyResultIterator has to merge these 256
> iterators with single thread. So let's say each iterator has 10k tuples
> returned, then merge sort needs to merge 2.5M tuples which will be costly if
> it is done with single thread and the query spend most of its time on client
> *CASE 2: * consider the case when dim1 is high cardinality attribute (10K+)
> and table have salt bucket set to 1, we will get 1 iterator from above query
> at the client and MergeSortRowKeyResultIterator doesn't need to merge
> anything. Here, it is fine with single thread.
> *CASE 3: * consider the case when dim1 is low cardinality attribute (10-100)
> and table have salt bucket set to 256, we will get 256 iterator from above
> query at the client and MergeSortRowKeyResultIterator has to merge these 256
> iterators with single thread. here the single thread is also fine as he has
> to merge only 2560 tuples.
> *Solution for case1 problem is:-*
> Optimized the implementation of merging 'n'-sorted iterators(having 'm'
> tuples) by using "min heap" which optimizes the time complexity from
> O(n2m) to O(nmLogn) (as heapify takes (Logn) time).
> And, By using multiple-threads('t') to process group of iterators which
> further optimized the complexity to
> T(nm)=T(nm)/t+T(t)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)