[jira] [Commented] (PHOENIX-1779) Parallelize fetching of next batch of records for scans corresponding to queries with no order by

Samarth Jain (JIRA) Wed, 15 Apr 2015 12:08:52 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496713#comment-14496713
 ]


Samarth Jain commented on PHOENIX-1779:
---------------------------------------

Perf numbers with the latest patch. 

Query: select * from a table with a million rows and 16 salt buckets
Scanner cache size: 100

With patch 
Average time ~ 1800 ms

Without patch
Average time ~ 13300 ms

Perf gain ~ 7.5x

3-way union all over tables with a million rows and 16 salt buckets each:
select * from tableA union all select * from tableB union all select * from tableC

With patch
Average time ~ 11000 ms

Without patch
Average time ~ 35000 ms

Perf gain ~ 3x

There is more scope for improvement with union all queries. With this patch we 
only parallelize fetching of next batches within each sub-select. In the above 
example we fetch batches for 16 scanners in parallel. We could do better and 
parallelize fetching of batches for all 48 scanners (3 sub-selects x 16 salt 
buckets). This should get us closer to the perf gain we see with regular 
single select queries.
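
To make the idea concrete, here is a minimal sketch of fetching the next batch 
for every live scanner in parallel. It is not the code in the patch: 
BatchScanner, fetchAll and inMemory are hypothetical names made up for 
illustration, and a plain ExecutorService stands in for whatever thread pool 
Phoenix actually uses. Passing the 48 scanners of a 3-way union all as one 
flat list would correspond to the further improvement suggested above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch only: these are not Phoenix classes or the patch's code.
class ParallelBatchFetchSketch {

    /** Stand-in for one scanner (e.g. one salt bucket); returns up to cacheSize rows. */
    interface BatchScanner {
        List<String> nextBatch(int cacheSize) throws Exception;
    }

    /** Drains all scanners, fetching the next batch of every live scanner in parallel per round. */
    static List<String> fetchAll(List<BatchScanner> scanners, int cacheSize, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> rows = new ArrayList<>();
            List<BatchScanner> live = new ArrayList<>(scanners);
            while (!live.isEmpty()) {
                // Submit one fetch per live scanner so the fetches overlap.
                List<Future<List<String>>> pending = new ArrayList<>();
                for (BatchScanner s : live) {
                    pending.add(pool.submit(() -> s.nextBatch(cacheSize)));
                }
                List<BatchScanner> stillLive = new ArrayList<>();
                for (int i = 0; i < live.size(); i++) {
                    List<String> batch = pending.get(i).get();
                    rows.addAll(batch);
                    // A short (or empty) batch means this scanner is exhausted.
                    if (batch.size() == cacheSize) {
                        stillLive.add(live.get(i));
                    }
                }
                live = stillLive;
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    /** In-memory scanner used only for the demo; yields rows prefix0, prefix1, ... */
    static BatchScanner inMemory(String prefix, int rowCount) {
        int[] next = {0}; // read position; only one fetch per scanner runs at a time
        return cacheSize -> {
            List<String> batch = new ArrayList<>();
            while (batch.size() < cacheSize && next[0] < rowCount) {
                batch.add(prefix + next[0]++);
            }
            return batch;
        };
    }

    public static void main(String[] args) throws Exception {
        List<BatchScanner> scanners = new ArrayList<>();
        for (int i = 0; i < 16; i++) {
            scanners.add(inMemory("s" + i + "-", 250));
        }
        System.out.println(fetchAll(scanners, 100, 16).size() + " rows fetched");
    }
}
```

Since these queries have no order by, the batches can simply be appended as 
they complete, with no merge sort on the client. The sketch waits for a whole 
round before starting the next one, which keeps it short; a real implementation 
would more likely resubmit each scanner as soon as its batch has been consumed.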


> Parallelize fetching of next batch of records for scans corresponding to 
> queries with no order by 
> --------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-1779
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1779
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Samarth Jain
>            Assignee: Samarth Jain
>         Attachments: PHOENIX-1779.patch, PHOENIX-1779_v2.patch, wip.patch, 
> wip3.patch, wipwithsplits.patch
>
>
> Today in Phoenix we parallelize only the first execution of scans, i.e. we 
> load only the first batch of records, up to the scan's cache size, in 
> parallel. Loading of subsequent batches of records in scanners is 
> essentially serial. This could be improved, especially for queries that do 
> not need any kind of merge sort on the client, including those with no 
> order by clause. This could also potentially improve the performance of 
> UPSERT SELECT statements that load data from one table and insert into 
> another. One such use case is creating immutable indexes for tables that 
> already have data. It could also potentially improve the performance of our 
> MapReduce solution for bulk loading data by improving the speed of the 
> loading/mapping phase. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
