Hyunsik Choi created TAJO-584:
---------------------------------
Summary: Improve distributed merge sort
Key: TAJO-584
URL: https://issues.apache.org/jira/browse/TAJO-584
Project: Tajo
Issue Type: Improvement
Components: distributed query plan, physical operator
Reporter: Hyunsik Choi
Assignee: Hyunsik Choi
Fix For: 0.8-incubating
In Tajo, the sort operator is similar to merge sort, but it works in a
distributed manner. The first sort phase sorts each fragment on its local
machine, the intermediate data are shuffled by range partitioning, and then the
second sort phase on each node sorts the range-partitioned data.
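The two phases described above can be sketched roughly as follows. This is only an illustration in Python, not Tajo code: the function names, the in-memory lists standing in for fragments, and the boundary values are all hypothetical.

```python
import bisect
from typing import List


def local_sort(fragment: List[int]) -> List[int]:
    """First phase: sort one fragment on the machine that holds it."""
    return sorted(fragment)


def range_partition(sorted_run: List[int], boundaries: List[int]) -> List[List[int]]:
    """Split a sorted run into len(boundaries) + 1 ranges.

    Partition i receives keys k with boundaries[i-1] <= k < boundaries[i];
    each output part stays sorted because the input run is sorted.
    """
    parts: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for key in sorted_run:
        parts[bisect.bisect_right(boundaries, key)].append(key)
    return parts


# Hypothetical fragments and range boundaries (assumptions for illustration).
fragments = [[9, 1, 4], [3, 10, 2], [7, 5, 8, 6]]
boundaries = [4, 8]

# Phase 1: sort locally, then split every sorted run by key range.
partitioned = [range_partition(local_sort(f), boundaries) for f in fragments]

# Shuffle: node p receives partition p of every run.
node_inputs = [[parts[p] for parts in partitioned] for p in range(len(boundaries) + 1)]

# Phase 2: each node orders the data in its own key range.
node_outputs = [sorted(k for run in runs for k in run) for runs in node_inputs]

# Ranges do not overlap, so concatenating node outputs gives the total order.
result = [k for out in node_outputs for k in out]
```

Because the ranges are disjoint and ordered, no final global merge is needed; the per-node outputs only have to be read in range order.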
However, the second sort phase reads all shuffled data via one scanner, which
degrades performance. This patch improves the second sort phase so that it
directly merges all of the already-sorted intermediate data. It significantly
reduces the response time of sort queries.
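As a minimal sketch of the idea (not the patch itself), merging already-sorted runs can be expressed as a k-way merge. The example below uses Python's standard `heapq.merge`; the function names and the sample runs are hypothetical and stand in for the intermediate data one node receives.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_sorted_runs(runs: Iterable[Iterable[int]]) -> Iterator[int]:
    """Lazily k-way merge runs that are each already sorted ascending."""
    # heapq.merge keeps one head element per run in a heap, so each output
    # row costs O(log k) comparisons for k runs and no run is re-sorted.
    return heapq.merge(*runs)


def resort_single_stream(runs: Iterable[Iterable[int]]) -> List[int]:
    """Baseline: read every run through one stream, then sort from scratch."""
    return sorted(row for run in runs for row in run)


# Hypothetical sorted runs received by one node for its key range.
runs = [[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]
merged = list(merge_sorted_runs(runs))
baseline = resort_single_stream(runs)
```

Both functions produce the same ordering; the merge simply avoids repeating work that the first sort phase has already done.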
I carried out a simple benchmark with the following query on the TPC-H 100GB
data set:
{code:sql}
select l_orderkey from lineitem order by l_orderkey;
{code}
The lineitem table occupies 75GB. The query response time is reduced from 480
to 260 secs, a reduction of about 46%. This patch exploits the design of
TAJO-36, so it requires TAJO-36.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)