[
https://issues.apache.org/jira/browse/DRILL-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
weijie.tong updated DRILL-5975:
-------------------------------
Description:
h1. Motivation
Currently the resource utilization ratio of a Drill cluster is not good: most of
the cluster's resources are wasted, and we cannot afford many concurrent
queries. Once the system accepts more queries, even while the CPU load is not
high, queries that were originally very fast become slower and slower.
The reason is that Drill does not supply a scheduler. It just assumes all the
nodes have enough computation resources. Once a query comes in, it schedules
the related fragments to random nodes without caring about each node's load, so
some nodes suffer extra CPU context switches to satisfy the incoming query. The
deeper cause is that the runtime minor fragments form a runtime tree whose
nodes are spread across different Drillbits. The runtime tree is a memory
pipeline, that is, all the nodes stay alive for the whole lifecycle of a query,
sending data to their upper nodes successively, even though some nodes could
finish quickly and quit immediately. What's more, the runtime tree is
constructed before actual execution, so the scheduling target in Drill becomes
the whole set of runtime tree nodes.
h1. Design
It would be hard to schedule the runtime tree nodes as a whole, so I try to
solve this by breaking up the cascade of runtime nodes. The graph below
describes the initial design.
!https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png!
[graph link|https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png]
Every Drillbit instance will have a RecordBatchManager which accepts all the
RecordBatches written by the senders of the different local MinorFragments.
The RecordBatchManager will hold the RecordBatches in memory first, then spill
them to disk storage. Once the first RecordBatch of a query's MinorFragment
sender arrives, it will notify the FragmentScheduler. The FragmentScheduler is
instantiated by the Foreman and holds the whole PlanFragment execution graph.
It will allocate a new corresponding FragmentExecutor to consume the generated
RecordBatches. The allocated FragmentExecutor will then notify the
corresponding FragmentManager to indicate that it is ready to receive the
data. The FragmentManager will then send out the RecordBatches one by one to
the corresponding FragmentExecutor's receiver, throttling the data stream the
way the current Sender does.
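To make the hand-off concrete, below is a minimal Java sketch of the event
flow. Only the class names (FragmentManager, FragmentScheduler,
FragmentExecutor) come from the description above; every field, method
signature, and the single-process wiring are illustrative assumptions, not
existing Drill APIs, and the memory-to-disk spilling and RPC layers are
omitted.
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

class RecordBatch {}  // stand-in for org.apache.drill.exec.record.RecordBatch

/** Per-sender buffer that the Drillbit-wide RecordBatchManager would own. */
class FragmentManager {
  private final Queue<RecordBatch> buffered = new ArrayDeque<>(); // memory tier; disk spilling omitted
  private final FragmentScheduler scheduler;
  private FragmentExecutor receiver; // set once a consumer declares readiness

  FragmentManager(FragmentScheduler scheduler) { this.scheduler = scheduler; }

  /** Called by a local MinorFragment's sender for each batch it produces. */
  synchronized void accept(RecordBatch batch) {
    boolean firstBatch = buffered.isEmpty() && receiver == null;
    buffered.add(batch);
    if (firstBatch) {
      scheduler.onFirstBatch(this); // first batch of this sender: notify the Foreman's scheduler
    } else if (receiver != null) {
      drain();
    }
  }

  /** The allocated FragmentExecutor calls back once it is ready to receive. */
  synchronized void onReceiverReady(FragmentExecutor executor) {
    this.receiver = executor;
    drain();
  }

  /** Batches leave one by one, which is what throttles the data stream. */
  private void drain() {
    RecordBatch batch;
    while ((batch = buffered.poll()) != null) {
      receiver.receive(batch);
    }
  }
}

/** Foreman-side scheduler holding the whole PlanFragment execution graph. */
class FragmentScheduler {
  void onFirstBatch(FragmentManager source) {
    FragmentExecutor executor = new FragmentExecutor(); // placement policy plugs in here
    executor.start(source);
  }
}

/** Consumer fragment: announces readiness, then processes incoming batches. */
class FragmentExecutor {
  void start(FragmentManager source) { source.onReceiverReady(this); }
  void receive(RecordBatch batch) { /* the operator tree consumes the batch here */ }
}
{code}
The point of this shape is that the producer's lifecycle ends at accept():
once its batches are buffered, the sender can quit without waiting on the
consumer.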
What we gain from this design is:
a. A computation leaf node does not have to wait for the consumer's speed
before ending its life and releasing its resources.
b. The data-sending logic is isolated from the computation nodes and shared by
different FragmentManagers.
c. We can schedule the MajorFragments according to each Drillbit's actual
resource capacity at runtime.
d. Drill's pipelined data processing characteristic is retained.
h1. Plan
This will be a large PR, so I plan to divide it into some smaller ones:
a. implement the RecordBatchManager.
b. implement a simple random FragmentScheduler and the whole event flow (see
the sketch after this list).
c. implement a more refined FragmentScheduler, which may reference the Sparrow
project.
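For step b, here is a sketch of what the simple random placement could look
like. The class, the method name, and the plain-string endpoints are all
hypothetical; a real version would pick among Drill's registered Drillbit
endpoints, and step c would replace the uniform choice with one weighted by
actual node load.
{code:java}
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Placeholder placement policy: pick a Drillbit uniformly at random, ignoring load. */
class RandomFragmentScheduler {
  String pickDrillbit(List<String> activeDrillbits) {
    return activeDrillbits.get(ThreadLocalRandom.current().nextInt(activeDrillbits.size()));
  }
}
{code}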
> Resource utilization
> --------------------
>
> Key: DRILL-5975
> URL: https://issues.apache.org/jira/browse/DRILL-5975
> Project: Apache Drill
> Issue Type: New Feature
> Affects Versions: 2.0.0
> Reporter: weijie.tong
> Assignee: weijie.tong
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)