[
https://issues.apache.org/jira/browse/PHOENIX-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Samarth Jain updated PHOENIX-1779:
----------------------------------
Attachment: wip3.patch
Previous patches had a bug: the performance gains came only from avoiding the
merge sort, and the loading of batches wasn't actually being parallelized as it
should have been.
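
To illustrate what "parallelizing loading batches" means here, below is a
minimal sketch, not the code from the attached patch. It assumes each scanner
can be modeled as a plain iterator, and every name in it (ParallelBatchSketch,
nextBatch, readAll) is hypothetical rather than a Phoenix or HBase API. In each
round, the next batch of every still-open scanner is requested concurrently,
and the rows are appended without any client-side merge sort.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; none of these names come from Phoenix or HBase.
class ParallelBatchSketch {

    // Stand-in for one scanner call: returns the next batch of up to
    // cacheSize rows, or a shorter (possibly empty) list once exhausted.
    static List<Integer> nextBatch(Iterator<Integer> source, int cacheSize) {
        List<Integer> batch = new ArrayList<>();
        while (batch.size() < cacheSize && source.hasNext()) {
            batch.add(source.next());
        }
        return batch;
    }

    // Reads every scanner to exhaustion. In every round (not just the first),
    // the next batch of each still-open scanner is fetched concurrently.
    static List<Integer> readAll(List<Iterator<Integer>> scanners,
                                 int cacheSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Integer> rows = new ArrayList<>();
        try {
            List<Iterator<Integer>> open = new ArrayList<>(scanners);
            while (!open.isEmpty()) {
                // Submit one fetch per open scanner before waiting on any.
                List<Future<List<Integer>>> pending = new ArrayList<>();
                for (Iterator<Integer> scanner : open) {
                    Callable<List<Integer>> task = () -> nextBatch(scanner, cacheSize);
                    pending.add(pool.submit(task));
                }
                // Collect results; a short batch means that scanner is done.
                List<Iterator<Integer>> stillOpen = new ArrayList<>();
                for (int i = 0; i < open.size(); i++) {
                    List<Integer> batch = pending.get(i).get();
                    rows.addAll(batch);
                    if (batch.size() == cacheSize) {
                        stillOpen.add(open.get(i));
                    }
                }
                open = stillOpen;
            }
        } finally {
            pool.shutdown();
        }
        return rows;
    }
}
```

Because no ordering across scanners is preserved, a scheme like this only fits
queries that do not need a merge sort on the client, such as those with no
order by.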
With the bug fixed, the performance gains look pretty impressive. For a 10
million row table spread over 2 regions on the same region server, with 249
guideposts, the numbers are as follows:
Reading out all the records from the table with select * from T:
Scanner caching - 100 (also HBase's default)
With patch - 22841 ms
Without patch - 135282 ms
Gain - 6x
Scanner caching - 500
With patch - 22030 ms
Without patch - 99075 ms
Gain - 4.5x
Scanner caching - 1000
With patch - 20899 ms
Without patch - 98899 ms
Gain - 4.5 - 5x
Scanner caching - 2000
With patch - 31000 ms
Without patch - 88904 ms
Gain - 2.5 - 3x
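
For reference, the gains above are simply the ratio of the time without the
patch to the time with it. A small sketch of that arithmetic, using only the
timings listed above (the class and method names are made up for
illustration):

```java
// Hypothetical helper; it only reproduces the arithmetic behind the gains.
class SpeedupSketch {

    // Speedup = elapsed time without the patch / elapsed time with the patch.
    static double speedup(long withoutPatchMs, long withPatchMs) {
        return (double) withoutPatchMs / withPatchMs;
    }

    public static void main(String[] args) {
        // Each row: scanner caching, with-patch ms, without-patch ms.
        long[][] runs = {
            {100, 22841, 135282},
            {500, 22030, 99075},
            {1000, 20899, 98899},
            {2000, 31000, 88904},
        };
        for (long[] run : runs) {
            System.out.printf("scanner caching %d: %.2fx%n",
                    run[0], speedup(run[2], run[1]));
        }
    }
}
```

The ratios come out at roughly 5.9, 4.5, 4.7 and 2.9, in line with the rounded
gains quoted above.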
> Parallelize fetching of next batch of records for scans corresponding to
> queries with no order by
> --------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-1779
> URL: https://issues.apache.org/jira/browse/PHOENIX-1779
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Samarth Jain
> Assignee: Samarth Jain
> Attachments: wip.patch, wip3.patch, wipwithsplits.patch
>
>
> Today in Phoenix we parallelize only the first execution of scans, i.e. we
> load only the first batch of records, up to the scan's cache size, in
> parallel. Loading of subsequent batches of records in scanners is essentially
> serial. This could be improved, especially for queries that do not need any
> kind of merge sort on the client, including the ones with no order by clause.
> This could also potentially improve the performance of UPSERT SELECT
> statements that load data from one table and insert into another. One such
> use case is creating immutable indexes for tables that already have data.
> It could also potentially improve the performance of our MapReduce solution
> for bulk loading data by improving the speed of the loading/mapping phase.