[ https://issues.apache.org/jira/browse/PHOENIX-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496713#comment-14496713 ]
Samarth Jain commented on PHOENIX-1779:
---------------------------------------
Perf numbers with the latest patch.
select * from table with million rows and 16 salt buckets
scanner cache size of 100
With patch
Average time ~ 1800 ms
Without patch
Average time ~ 13300 ms
Perf gain ~ 7.5x
3-way union all for tables with million rows and 16 salt buckets
select * from tableA union all select * from tableB union all select * from
tableC
With Patch
Average time ~ 11000 ms
Without patch
Average time ~ 35000 ms
Perf gain ~ 3x
There is more scope for improvement with Union All queries. With this patch we
are only parallelizing the fetching of next batches within each sub-select. In
the above example we are fetching batches for 16 scanners in parallel. We could
do better and parallelize fetching of batches for all 48 scanners. This should
get us closer to the perf gain that we are getting with regular single select
queries.
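
To make the idea concrete, here is a minimal sketch in plain Java of fetching
the next batch for every open scanner in parallel. It is not the code from the
patch and does not use Phoenix or HBase APIs: the class name, the method names
and the use of Iterator<String> as a stand-in for a scanner are all assumptions
made for illustration. It works in rounds (submit one "next batch" task per open
scanner, then collect), which is simpler than a fully pipelined design.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names come from Phoenix or the patch.
class ParallelBatchFetchSketch {

    // Pulls up to batchSize rows from one scanner (stand-in for one scanner round trip).
    static List<String> nextBatch(Iterator<String> scanner, int batchSize) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && scanner.hasNext()) {
            batch.add(scanner.next());
        }
        return batch;
    }

    // Drains all scanners, fetching each round's batches in parallel.
    static List<String> fetchAll(List<Iterator<String>> scanners, int batchSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> rows = new ArrayList<>();
        try {
            List<Iterator<String>> open = new ArrayList<>(scanners);
            while (!open.isEmpty()) {
                // One task per open scanner; each scanner is touched by one task at a time.
                List<Future<List<String>>> pending = new ArrayList<>();
                for (Iterator<String> s : open) {
                    pending.add(pool.submit(() -> nextBatch(s, batchSize)));
                }
                List<Iterator<String>> stillOpen = new ArrayList<>();
                for (int i = 0; i < open.size(); i++) {
                    List<String> batch = pending.get(i).get();
                    rows.addAll(batch);
                    // A short batch means the scanner is exhausted.
                    if (batch.size() == batchSize) {
                        stillOpen.add(open.get(i));
                    }
                }
                open = stillOpen;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("batch fetch failed", e);
        } finally {
            pool.shutdown();
        }
        return rows;
    }
}
```

In this sketch, parallelizing across all sub-selects of a Union All amounts to
passing the scanners of every sub-select in one list (48 in the example above)
instead of calling fetchAll once per sub-select with 16. Rows come back grouped
by round, which is only acceptable because no client-side ordering is required.
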
> Parallelize fetching of next batch of records for scans corresponding to
> queries with no order by
> --------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-1779
> URL: https://issues.apache.org/jira/browse/PHOENIX-1779
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Samarth Jain
> Assignee: Samarth Jain
> Attachments: PHOENIX-1779.patch, PHOENIX-1779_v2.patch, wip.patch,
> wip3.patch, wipwithsplits.patch
>
>
> Today in Phoenix we parallelize only the first execution of scans, i.e. we
> load only the first batch of records, up to the scan's cache size, in
> parallel. Loading of subsequent batches of records in scanners is essentially
> serial. This could be improved, especially for queries that do not need any
> kind of merge sort on the client, such as those with no order by clause.
> This could also potentially improve the performance of UPSERT SELECT
> statements that load data from one table and insert into another. One such
> use case is creating immutable indexes for tables that already have data.
> It could also potentially improve the performance of our MapReduce solution
> for bulk loading data by improving the speed of the loading/mapping phase.