[
https://issues.apache.org/jira/browse/PHOENIX-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639738#comment-14639738
]
Samarth Jain commented on PHOENIX-2126:
---------------------------------------
[~ankit.singhal] - this is interesting. Since the sorting here is entirely
compute bound, I would recommend to limit the number of threads to the number
of cores available on the machine running the phoenix client. I am also curious
to know whether the implementation of priority queue you are using will change
depending on the query has a desc or asc on it? Would it make sense to use min
or max heap depending on the sort order needed? Would be good to test this -
both unit test and manual testing on a large enough data set.
Also, looks like that when you are merging all the iterators, you are
pre-building the sorted list. This could be problematic if the limit is high
enough to make the client go OOM. [~jamestaylor] may have some opinion here too.
{code}
+ public PeekingResultIterator
mergeIterators(List<PeekingResultIterator> iterators) throws SQLException
+ {
+
+ PriorityQueue < IteratorContainer > heap=new PriorityQueue <
IteratorContainer >();
+
+ for(int i=0;i < iterators.size();++i)
+ {
+ IteratorContainer iteratorContainer = new
IteratorContainer(iterators.get(i));
+ if (iteratorContainer.isNull())
+ continue;
+ heap.add(iteratorContainer);
+ }
+ LinkedList<Tuple> sortedList=new LinkedList<Tuple>();
+ while(!heap.isEmpty())
+ {
+ IteratorContainer iteratorContainer=heap.poll();
+ sortedList.add(iteratorContainer.next());
+ if (iteratorContainer.isNull())
+ continue;
+ heap.add(iteratorContainer);
+ }
+ return new MergePeekIterator(sortedList);
+ }
{code}
> Improving performance of merge sort by multi-threaded and minheap
> implementation
> --------------------------------------------------------------------------------
>
> Key: PHOENIX-2126
> URL: https://issues.apache.org/jira/browse/PHOENIX-2126
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 4.1.0, 4.2.0
> Reporter: Ankit Singhal
> Attachments: PHOENIX-2126_v1.0.patch
>
>
> {code}
> CREATE TABLE IF NOT EXISTS test (
> dim1 INTEGER NOT NULL,
> A.B INTEGER,
> A.M DECIMAL,
> CONSTRAINT PK PRIMARY KEY
> (dim1))
> SALT_BUCKETS =256,DEFAULT_COLUMN_FAMILY='A';
> {code}
> *Query to benchmark:-*
> {code}
> select dim1,sum(b),sum(m) from test where Datemth>=201505 and Datemth<=201505
> and dim1 IS NOT NULL group by dim1 order by sum(m) desc nulls last limit 10;
> {code}
> *current scenario:-*
> *CASE 1: * consider the case when dim1 is high cardinality attribute (10K+)
> and table have salt bucket set to 256, we will get 256 iterators from above
> query at the client and MergeSortRowKeyResultIterator has to merge these 256
> iterators with single thread. So let's say each iterator has 10k tuples
> returned, then merge sort needs to merge 2.5M tuples which will be costly if
> it is done with single thread and the query spend most of its time on client
> *CASE 2: * consider the case when dim1 is high cardinality attribute (10K+)
> and table have salt bucket set to 1, we will get 1 iterator from above query
> at the client and MergeSortRowKeyResultIterator doesn't need to merge
> anything. Here, it is fine with single thread.
> *CASE 3: * consider the case when dim1 is low cardinality attribute (10-100)
> and table have salt bucket set to 256, we will get 256 iterator from above
> query at the client and MergeSortRowKeyResultIterator has to merge these 256
> iterators with single thread. here the single thread is also fine as he has
> to merge only 2560 tuples.
> *Solution for case1 problem is:-*
> Optimized the implementation of merging 'n'-sorted iterators(having 'm'
> tuples) by using "min heap" which optimizes the time complexity from
> O(n2m) to O(nmLogn) (as heapify takes (Logn) time).
> And, By using multiple-threads('t') to process group of iterators which
> further optimized the complexity to
> T(nm)=T(nm)/t+T(t)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)