[ https://issues.apache.org/jira/browse/TAJO-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyunsik Choi updated TAJO-584:
------------------------------
Attachment: TAJO-584.patch
> Improve distributed merge sort
> ------------------------------
>
> Key: TAJO-584
> URL: https://issues.apache.org/jira/browse/TAJO-584
> Project: Tajo
> Issue Type: Improvement
> Components: distributed query plan, physical operator
> Reporter: Hyunsik Choi
> Assignee: Hyunsik Choi
> Fix For: 0.8-incubating
>
> Attachments: TAJO-584.patch
>
>
> In Tajo, the sort operator is similar to merge sort, but it works in a
> distributed manner. The first sort phase sorts each fragment on its local
> machine, the intermediate data are shuffled by range partition, and then
> the second sort phase on each node sorts the range-partitioned data.
> However, the second sort phase reads all shuffled data via one scanner,
> which degrades performance. This patch improves the second sort phase to
> directly merge all already-sorted intermediate data. It significantly reduces
> the response time of sort queries.
> I ran a simple benchmark with the following query on the TPC-H 100GB
> data set:
> {code:sql}
> select l_orderkey from lineitem order by l_orderkey;
> {code}
> The lineitem table occupies 75GB. The query response time is reduced
> from 480 to 260 secs. This patch exploits the design of TAJO-36, so it
> requires TAJO-36.
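
The two sort phases described above can be illustrated with a small sketch. This is not Tajo's code: the function names, the in-memory lists standing in for fragments, and the boundary values are all hypothetical, and Python's heapq.merge is used as a stand-in for a k-way merge over already-sorted runs.

```python
import heapq
from bisect import bisect_right


def local_sort(fragment):
    """First phase: sort one fragment on its local machine."""
    return sorted(fragment)


def range_partition(sorted_run, boundaries):
    """Split one sorted run into len(boundaries) + 1 sub-runs by key range.

    Each sub-run stays sorted because the input run is scanned in order.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key in sorted_run:
        parts[bisect_right(boundaries, key)].append(key)
    return parts


def second_phase_resort(runs):
    """Read all shuffled data as one stream and sort it again."""
    return sorted(key for run in runs for key in run)


def second_phase_merge(runs):
    """Merge the already-sorted runs directly (k-way merge)."""
    return list(heapq.merge(*runs))


# Hypothetical input: three fragments and two range boundaries (three partitions).
fragments = [[9, 1, 14, 6], [3, 12, 7, 2], [11, 5, 8, 4]]
boundaries = [5, 10]

# Phase 1 plus shuffle: every fragment contributes one sorted run per partition.
runs_per_fragment = [range_partition(local_sort(f), boundaries) for f in fragments]
shuffled = [
    [runs[p] for runs in runs_per_fragment]
    for p in range(len(boundaries) + 1)
]

# Phase 2: merge the runs of each partition; concatenating the partitions in
# range order gives the globally sorted result.
result = [key for runs in shuffled for key in second_phase_merge(runs)]
```

Both second-phase functions return the same output. The merge variant only compares the heads of the incoming runs, so it avoids repeating sorting work that the first phase already did.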
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)