weijie.tong created DRILL-5975:
----------------------------------
Summary: Resource utilization
Key: DRILL-5975
URL: https://issues.apache.org/jira/browse/DRILL-5975
Project: Apache Drill
Issue Type: New Feature
Affects Versions: 2.0.0
Reporter: weijie.tong
Assignee: weijie.tong
h1. Motivation
Now the resource utilization radio of Drill's cluster is not too good. Most of
the cluster resource is wasted. We can not afford too much concurrent queries.
Once the system accepted more queries with a not high cpu load, the query which
originally is very quick will become slower and slower.
The reason is Drill does not supply a scheduler . It just assume all the nodes
have enough calculation resource. Once a query comes, it will schedule the
related fragments to random nodes not caring about the node's load. Some nodes
will suffer more cpu context switch to satisfy the coming query. The profound
causes to this is that the runtime minor fragments construct a runtime tree
whose nodes spread different drillbits. The runtime tree is a memory pipeline
that is all the nodes will stay alone the whole lifecycle of a query by sending
out data to upper nodes successively, even though some node could run quickly
and quit immediately.What's more the runtime tree is constructed before actual
running. The schedule target to Drill will become the whole runtime tree nodes.
h1. Design
It will be hard to schedule the runtime tree nodes as a whole. So I try to
solve this by breaking the runtime cascade nodes. The graph below describes the
initial design.
!https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png!
Every Drillbit instance will have a RecordBatchManager which will accept all
the RecordBatchs written by the senders of local different MinorFragments. The
RecordBatchManager will hold the RecordBatchs in memory firstly then disk
storage . Once the first RecordBatch of one MinorFragment sender of one query
occur , it will notice the FragmentScheduler. The FragmentScheduler is
instanced by the Foreman.It holds the whole PlanFragment execution graph.It
will allocate a new FragmentExecutor to run the generated RecordBatch. The
allocated FragmentExecutor will then notify the corresponding FragmentManager
to indicate that I am ready to receive the data. Then the FragmentManger will
send out the RecordBatch one by one to the corresponding FragmentExecutor's
receiver like what the current Sender does by throttling the data stream.
h1. Plan
This will be a large PR ,so I plan to divide it into some small ones.
a. to implement the RecordManager.
b. to implement a simple random FragmentScheduler and the whole event flow.
c. to implement a primitive FragmentScheduler which may reference the Sparrow
project.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)