Hanumath Rao Maduri created DRILL-6115:
------------------------------------------

             Summary: SingleMergeExchange is not scaling up when many minor 
fragments are allocated for a query.
                 Key: DRILL-6115
                 URL: https://issues.apache.org/jira/browse/DRILL-6115
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Relational Operators
    Affects Versions: 1.12.0
            Reporter: Hanumath Rao Maduri
            Assignee: Hanumath Rao Maduri
         Attachments: Enhancing Drill to multiplex ordered merge exchanges.docx

A SingleMergeExchange is created when a global order is required in the output. 
For example, the following query produces a SingleMergeExchange in its plan.
{code:java}
0: jdbc:drill:zk=local> explain plan for select L_LINENUMBER from 
dfs.`/drill/tables/lineitem` order by L_LINENUMBER;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 Project(L_LINENUMBER=[$0])
00-02 SingleMergeExchange(sort0=[0])
01-01 SelectionVectorRemover
01-02 Sort(sort0=[$0], dir0=[ASC])
01-03 HashToRandomExchange(dist0=[[$0]])
02-01 Scan(table=[[dfs, /drill/tables/lineitem]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec [tableName=maprfs:///drill/tables/lineitem, 
condition=null], columns=[`L_LINENUMBER`], maxwidth=15]])
{code}

On a 10-node cluster, if the table is huge, Drill can spawn many minor 
fragments, all of which are merged on a single node by one merge receiver. This 
creates a lot of memory pressure on the receiver node and also an execution 
bottleneck. To address this issue, the merge receiver should be a multiphase 
merge receiver.

Ideally, for a large cluster one could introduce tree merges so that merging can 
be done in parallel. But as a first step I think it is better to use the existing 
infrastructure for multiplexing operators to generate an OrderedMux, so that all 
the minor fragments belonging to one Drillbit are merged locally and the merged 
data is then sent across to the receiver operator.
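
The two-phase idea described above can be sketched as follows. This is only an 
illustration and not Drill code: the class and method names are hypothetical, 
and plain sorted lists of integers stand in for the sorted record batches 
produced by each minor fragment. Each node first merges its own fragments, and 
the final receiver then merges one stream per node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names exist in Drill.
public class TwoLevelMergeSketch {

  /** Read position within one sorted input stream. */
  private static final class Cursor {
    final List<Integer> data;
    int pos;

    Cursor(List<Integer> data) {
      this.data = data;
    }

    int head() {
      return data.get(pos);
    }
  }

  /** K-way merge of already-sorted inputs, using a min-heap keyed on each input's current head. */
  public static List<Integer> kWayMerge(List<List<Integer>> sortedInputs) {
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a.head(), b.head()));
    for (List<Integer> input : sortedInputs) {
      if (!input.isEmpty()) {
        heap.add(new Cursor(input));
      }
    }
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      Cursor c = heap.poll();
      out.add(c.head());
      c.pos++;
      if (c.pos < c.data.size()) {
        heap.add(c); // re-insert with its new head value
      }
    }
    return out;
  }

  /**
   * Phase 1: merge the sorted fragments of each node locally (the role the
   * proposed OrderedMux would play). Phase 2: merge one stream per node at
   * the receiver.
   */
  public static List<Integer> twoLevelMerge(List<List<List<Integer>>> fragmentsByNode) {
    List<List<Integer>> perNode = new ArrayList<>();
    for (List<List<Integer>> localFragments : fragmentsByNode) {
      perNode.add(kWayMerge(localFragments));
    }
    return kWayMerge(perNode);
  }
}
```

In this sketch the phase 1 merges run one after another; in a cluster they 
would run on different nodes at the same time, which is where the parallelism 
comes from.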

For example, consider a 10-node cluster in which each node processes 14 minor 
fragments.

The current version of the code merges all 140 minor fragments at a single 
receiver. 
The proposed version merges in two levels: first a 14-way merge within each 
Drillbit, with the Drillbits working in parallel, 
and then a merge of the resulting 10 streams at the receiver node.
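
The fan-in numbers in this example can be written out as a small calculation. 
This is a sketch with hypothetical names, assuming every node runs the same 
number of minor fragments; it only counts input streams per merger and does not 
model Drill's actual memory use.

```java
// Hypothetical helper for the arithmetic above; not part of Drill.
public class FanInSketch {

  /** Streams merged by the single receiver when every minor fragment sends to it directly. */
  public static int singleLevelFanIn(int nodes, int fragmentsPerNode) {
    return nodes * fragmentsPerNode;
  }

  /**
   * Largest number of streams any one merger handles when each node first
   * merges its own fragments and the receiver then merges one stream per node.
   */
  public static int twoLevelMaxFanIn(int nodes, int fragmentsPerNode) {
    return Math.max(nodes, fragmentsPerNode);
  }
}
```

With 10 nodes and 14 fragments per node, `singleLevelFanIn(10, 14)` is 140 and 
`twoLevelMaxFanIn(10, 14)` is 14, while the final receiver itself sees only 10 
streams.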





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
