[
https://issues.apache.org/jira/browse/DRILL-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266114#comment-16266114
]
Paul Rogers commented on DRILL-5975:
------------------------------------
Thanks [~weijie] for the description of the workload. I wonder if yours is a
special case? You mentioned the use of Druid, which, as I recall, does its own
aggregation across the multi-dimensional records stored in each of its data
chunks. That is unlike the usual Drill use case in which Drill does the work of
reading from disk, decoding data, and performing aggregation.
It may be worth exploring the differences that Druid introduces.
First, what is the architecture between Drill and Druid? Does Drill send
queries to Druid brokers for distribution, or does Drill connect directly to
historical or real-time nodes?
Second, does Druid run on the same cluster as Drill? If so, how do you ensure
that each Drillbit reads only from the historical or real-time process on the
same host (for data locality)?
Or is Druid on a separate cluster? How, then, do you coordinate between, say,
120 Drill nodes and however many Druid nodes there are?
Third, how many queries can Druid effectively run per node? When running a
Drill query, what is the CPU load on the Druid cluster?
Fourth, left to its own devices, Drill will try to run 16 cores * 70% ≈ 11
Druid readers per major fragment per node. Given your knowledge of Drill
internals, you may have adjusted that number. If left unchanged, each Drill
query will fire off 120 nodes * 11 readers/node = 1320 Druid queries.
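To make that arithmetic explicit, here is a back-of-the-envelope sketch. The
16-core, 70%, and 120-node figures come from the numbers above and Drill's
default CPU load factor; treat them as placeholders and plug in your actual
settings:
{code:java}
// Rough fan-out estimate for a scan-heavy Drill query against Druid.
// All inputs are illustrative; adjust coresPerNode, cpuLoadFactor, and
// drillNodes to match the real cluster and planner settings.
public class DruidFanOutEstimate {
  public static void main(String[] args) {
    int coresPerNode = 16;        // cores on each Drillbit host
    double cpuLoadFactor = 0.70;  // Drill's default per-node width factor
    int drillNodes = 120;         // Drillbits in the cluster

    int readersPerNode = (int) (coresPerNode * cpuLoadFactor); // ~11
    int totalDruidQueries = drillNodes * readersPerNode;       // ~1320

    System.out.printf("readers/node = %d, concurrent Druid queries = %d%n",
        readersPerNode, totalDruidQueries);
  }
}
{code}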
The above suggests that this particular configuration may be limited by Druid,
not by Drill itself.
As mentioned earlier, most (non-Druid) Drill clusters are CPU-bound because
they do the grunt work of reading data, decoding, filtering, and so on. In the Druid
use case, Druid does all that work; Drill may just be sitting around waiting
for the work to be done by Druid.
This is easy to test: just look at the CPU usage on the Druid cluster while
running Drill queries.
If Druid turns out to be the bottleneck, you may want to determine the best
throughput for the Druid nodes: how many queries, with what distribution, make
the best use of the Druid hardware?
Then, work out the size of the Drill cluster needed to drive Druid fully. It
may be that you simply have too large a Drill cluster for the available Druid
throughput, so that most Drill tasks just sit waiting for data. A smaller Drill
cluster might be utilized much more fully.
Said yet another way, you have two distributed systems, each trying to spread
work: Drill and Druid. They may be working at cross purposes. Certainly their
sizing needs to be coordinated for an even workload.
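To make the sizing argument concrete, here is a hedged sketch. The Druid
capacity figures below are pure placeholders; they would come from the
throughput measurement suggested above, not from anything measured here:
{code:java}
// Hypothetical sizing sketch: match Drill's scan fan-out to what the Druid
// cluster can actually absorb. Replace the placeholder capacities with
// measured values before drawing any conclusions.
public class DrillDruidSizing {
  public static void main(String[] args) {
    int druidNodes = 30;             // placeholder: Druid historical nodes
    int queriesPerDruidNode = 8;     // placeholder: measured per-node sweet spot
    int readersPerDrillNode = 11;    // from the fan-out estimate above

    int druidCapacity = druidNodes * queriesPerDruidNode;        // 240
    int usefulDrillNodes = druidCapacity / readersPerDrillNode;  // ~21

    System.out.printf("Druid can absorb ~%d concurrent queries; "
        + "~%d Drill nodes would be enough to saturate it%n",
        druidCapacity, usefulDrillNodes);
  }
}
{code}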
On the other hand, if neither Drill nor Druid is near 100% CPU, then there is
an even more complex interaction happening between the two systems. As hinted
above, perhaps Drill and Druid should live on the same cluster, with Drill
taking the place of the Druid broker: Drill would ship queries directly to the
historical or real-time process on the same host as the Drillbit.
Under this revised architecture, the goal is to drive each Drill/Druid node to
full CPU usage.
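A minimal sketch of that co-location idea, assuming each Drill scan can be
pointed at the Druid process on its own host; the class and method names here
are invented for illustration and are not part of Drill or any existing Druid
plugin:
{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: route each Druid sub-query through the Drillbit that
// runs on the same host as the Druid historical/real-time process holding
// the data. All names here are invented for illustration.
public class LocalDruidRouting {

  /** Map hostname -> Drillbit identifier (here, host plus the user port). */
  static Map<String, String> drillbitsByHost(List<String> drillbitHosts) {
    Map<String, String> byHost = new HashMap<>();
    for (String host : drillbitHosts) {
      byHost.put(host, host + ":31010");
    }
    return byHost;
  }

  /** Prefer the Drillbit co-located with the Druid process serving a segment. */
  static String assignReader(String druidSegmentHost, Map<String, String> drillbits) {
    // Fall back to an arbitrary Drillbit when no co-located one exists.
    return drillbits.getOrDefault(druidSegmentHost,
        drillbits.values().iterator().next());
  }

  public static void main(String[] args) {
    Map<String, String> drillbits =
        drillbitsByHost(Arrays.asList("node1", "node2", "node3"));
    System.out.println(assignReader("node2", drillbits)); // node2:31010
  }
}
{code}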
Perhaps the scheduler design is intended to address this use case. If so, could
you add a bit more explanation of how that will work?
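For concreteness, here is how I read the proposed event flow from the
description quoted below, expressed as a very rough sketch. Every type and
method name in it is invented for illustration; it is my interpretation, not
the actual design:
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Very rough sketch of the flow described in the issue: a per-Drillbit buffer
// accepts batches from local senders, notifies a scheduler when the first
// batch of a fragment arrives, and the assigned executor then drains the
// buffered batches. All names are invented for illustration.
public class ScheduledFlowSketch {

  interface Scheduler {
    // Called when the first batch for a (query, fragment) pair shows up.
    Executor allocateExecutor(String queryId, int fragmentId);
  }

  interface Executor {
    void receive(byte[] batch); // stand-in for a RecordBatch
  }

  /** Loosely plays the role of the per-Drillbit RecordBatchManager. */
  static class BatchBuffer {
    private final Queue<byte[]> batches = new ArrayDeque<>();
    private final Scheduler scheduler;
    private Executor executor;

    BatchBuffer(Scheduler scheduler) {
      this.scheduler = scheduler;
    }

    void accept(String queryId, int fragmentId, byte[] batch) {
      batches.add(batch);       // buffer in memory (spill-to-disk elided)
      if (executor == null) {   // first batch: ask the scheduler for an executor
        executor = scheduler.allocateExecutor(queryId, fragmentId);
      }
      drain();
    }

    private void drain() {
      // Real code would throttle on receiver credit rather than drain eagerly.
      while (!batches.isEmpty()) {
        executor.receive(batches.poll());
      }
    }
  }

  public static void main(String[] args) {
    Scheduler scheduler = (queryId, fragmentId) ->
        batch -> System.out.println("executor got " + batch.length + " bytes");
    BatchBuffer buffer = new BatchBuffer(scheduler);
    buffer.accept("q1", 0, new byte[] {1, 2, 3});
  }
}
{code}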
> Resource utilization
> --------------------
>
> Key: DRILL-5975
> URL: https://issues.apache.org/jira/browse/DRILL-5975
> Project: Apache Drill
> Issue Type: New Feature
> Affects Versions: 2.0.0
> Reporter: weijie.tong
> Assignee: weijie.tong
>
> h1. Motivation
> Right now the resource utilization ratio of a Drill cluster is not very good.
> Most of the cluster's resources are wasted. We cannot afford many concurrent
> queries. Once the system accepts more queries, even at a fairly low CPU load,
> a query that was originally very fast becomes slower and slower.
> The reason is that Drill does not provide a scheduler. It simply assumes all
> nodes have enough compute resources. When a query comes in, it schedules the
> related fragments onto random nodes without considering each node's load, so
> some nodes suffer extra CPU context switches to satisfy the incoming query.
> The deeper cause is that the runtime minor fragments form a runtime tree
> whose nodes are spread across different Drillbits. The runtime tree is an
> in-memory pipeline: all of its nodes stay alive for the whole lifecycle of a
> query, successively sending data up to the nodes above them, even though some
> nodes could finish quickly and exit immediately. What's more, the runtime
> tree is constructed before actual execution, so the scheduling target for
> Drill becomes the entire set of runtime tree nodes.
> h1. Design
> It will be hard to schedule the runtime tree nodes as a whole. So I try to
> solve this by breaking the runtime cascade nodes. The graph below describes
> the initial design.
> !https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png!
> [graph
> link|https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png]
> Every Drillbit instance will have a RecordBatchManager which accepts all the
> RecordBatches written by the senders of the local MinorFragments. The
> RecordBatchManager holds the RecordBatches in memory first and then in disk
> storage. Once the first RecordBatch from a MinorFragment sender of a query
> arrives, it notifies the FragmentScheduler. The FragmentScheduler is
> instantiated by the Foreman; it holds the whole PlanFragment execution graph.
> It allocates a new corresponding FragmentExecutor to process the generated
> RecordBatch. The allocated FragmentExecutor then notifies the corresponding
> FragmentManager to indicate that it is ready to receive the data. The
> FragmentManager then sends the RecordBatches one by one to the corresponding
> FragmentExecutor's receiver, throttling the data stream just as the current
> Sender does.
> What we can gain from this design is:
> a. A leaf computation node does not have to wait on its consumer's speed
> before ending its life and releasing its resources.
> b. The data-sending logic will be isolated from the computation nodes and
> shared by different FragmentManagers.
> c. We can schedule the MajorFragments according to each Drillbit's actual
> resource capacity at runtime.
> d. Drill's pipelined data processing characteristic is also retained.
> h1. Plan
> This will be a large PR, so I plan to divide it into several small ones.
> a. to implement the RecordBatchManager.
> b. to implement a simple random FragmentScheduler and the whole event flow.
> c. to implement a primitive FragmentScheduler, which may draw on the Sparrow
> project.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)