Hanifi and I had a great conversation late last week about how Drill
currently provides parallelization. Hanifi suggested we move to a model
whereby there is a fixed threadpool for all Drill work and we treat all
operator and/or fragment operations as tasks that can be scheduled within
that pool. This would serve the following purposes:

1. Reduce the number of threads that Drill creates
2. Decrease wasteful context switching (especially in high concurrency
scenarios)
3. Provide more predictable SLAs for Drill infrastructure tasks such as
heartbeats/RPC, cancellations, planning and queue management, etc. (a key
hot-button for Vicki :)

For reference, this is already the threading model we use for the RPC
threads and is a fairly standard asynchronous programming model. When
Hanifi and I met, we brainstormed on what types of changes might need to be
done and ultimately thought that in order to do this, we'd realistically
want to move iterator trees from a pull model to a push model within a
node.

After spending more time thinking about this idea, I had the following
thoughts:

- We could probably accomplish the same behavior staying with a pull model
and using IterOutcome.NOT_YET as the return value.
- In order for this to work effectively, all code would need to be
non-blocking (including reading from disk, writing to sockets, waiting for
ZooKeeper responses, etc.)
- Task length (or coarseness) would need to be quantized appropriately.
While operating at the RootExec.next() level might be attractive, it is too
coarse to get reasonable sharing and we'd need to figure out ways to have
time-based exit within operators (see the sketch after this list).
- With this approach, one of the biggest challenges would be reworking all
the operators to be able to unwind the stack to exit execution (to yield
their thread).
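
To make the time-based exit concrete, here is a rough sketch (my own
simplified types, not Drill's actual RecordBatch/IterOutcome API) of an
operator that returns NOT_YET once its time slice is used up, unwinding the
stack so the fragment task can give its thread back to the shared pool:

// Minimal sketch, assuming a simplified pull-model operator interface
// (not Drill's real RecordBatch/IterOutcome types).
enum IterOutcome { OK, NONE, NOT_YET }

interface Operator {
  IterOutcome next();
}

class TimeSlicedScan implements Operator {
  private static final long SLICE_NANOS = 2_000_000L; // ~2 ms per slice (arbitrary)
  private final java.util.Iterator<int[]> source;     // stand-in for a real reader

  TimeSlicedScan(java.util.Iterator<int[]> source) {
    this.source = source;
  }

  @Override
  public IterOutcome next() {
    long start = System.nanoTime();
    while (source.hasNext()) {
      consume(source.next());
      // Time-based exit: return NOT_YET so the caller can unwind and the
      // scheduler can resubmit this fragment to the shared thread pool.
      if (System.nanoTime() - start > SLICE_NANOS) {
        return IterOutcome.NOT_YET;
      }
    }
    return IterOutcome.NONE;
  }

  private void consume(int[] batch) { /* process one record batch */ }
}

A scheduler sitting above RootExec would treat NOT_YET as "resubmit this
fragment later" rather than spinning on it.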

Given those challenges, I think there may be another, simpler solution that
could cover items 2 & 3 above without taking on all the issues raised by
the proposal that Hanifi suggested. At its core, I see the biggest issue as
the unwinding/rewinding that would be required to move between threads.
This is very similar to how we needed to unwind in the case of memory
allocation before we supported realloc, and it causes substantial extra
code complexity. As such, I suggest we use a pause approach that uses
something similar to a semaphore for the number of active threads we allow.
This could be done using the existing shouldContinue() mechanism, where we
suspend or reacquire thread use as we pass through this method. We'd also
create some alternative shouldContinue methods such as shouldContinue(Lock
toLock) and shouldContinue(Queue queueToTakeFrom), etc. so that
shouldContinue would naturally wrap blocking calls with the right logic.
This would be a fairly simple set of changes, and we could see how well it
improves issues 2 & 3 above.
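
As a rough sketch of what those wrappers might look like (the class name,
signatures, and the Semaphore-based throttle are my assumptions, not
existing Drill code): each fragment holds one "active thread" permit while
it is doing work, and the wrappers release that permit around blocking
calls so a blocked fragment doesn't count against the cap.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.Lock;

// Sketch only: names and signatures are assumptions, not existing Drill code.
// A fragment acquires one permit when it starts running and passes through
// these methods at its natural pause points and around blocking calls.
class FragmentThrottle {
  private final Semaphore activeThreads;

  FragmentThrottle(int maxActiveThreads) {
    // Fair, so a fragment that yields goes to the back of the line.
    this.activeThreads = new Semaphore(maxActiveThreads, true);
  }

  /** Yield point: briefly give up our permit so other fragments can run. */
  boolean shouldContinue() throws InterruptedException {
    activeThreads.release();
    activeThreads.acquire();
    return !Thread.currentThread().isInterrupted();
  }

  /** Wrap a blocking lock acquisition so it doesn't hold an active-thread permit. */
  boolean shouldContinue(Lock toLock) throws InterruptedException {
    activeThreads.release();
    try {
      toLock.lockInterruptibly();
    } finally {
      activeThreads.acquireUninterruptibly();
    }
    return !Thread.currentThread().isInterrupted();
  }

  /** Wrap a blocking queue take the same way, handing back the dequeued element. */
  <T> T shouldContinue(BlockingQueue<T> queueToTakeFrom) throws InterruptedException {
    activeThreads.release();
    try {
      return queueToTakeFrom.take();
    } finally {
      activeThreads.acquireUninterruptibly();
    }
  }
}

The fair semaphore is the design choice here: a fragment that yields goes
to the back of the line, which is roughly the sharing behavior we'd want.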

On top of this, I think we still need to implement automatic scaling of
cluster parallelization. Even rudimentary monitoring of cluster load,
paired with a reduction of max_width_per_node, would substantially improve
the behavior of the cluster under heavy concurrent loads. (And note, I
think this is required no matter what we implement above.)
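
As a strawman for what "rudimentary" could mean (the thresholds, sampling
period, and the halving policy below are illustrative assumptions, not a
worked-out design): a background task samples the load average and scales a
local per-node width value down under pressure and back up as load drops.

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Strawman only: thresholds, period, and the halving policy are illustrative.
class WidthScaler {
  private final AtomicInteger maxWidthPerNode;
  private final int configuredWidth;
  private final OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

  WidthScaler(int configuredWidth) {
    this.configuredWidth = configuredWidth;
    this.maxWidthPerNode = new AtomicInteger(configuredWidth);
  }

  void start() {
    Executors.newSingleThreadScheduledExecutor()
        .scheduleAtFixedRate(this::adjust, 5, 5, TimeUnit.SECONDS);
  }

  private void adjust() {
    double load = os.getSystemLoadAverage();
    if (load < 0) {
      return; // load average not available on this platform
    }
    double loadPerCore = load / os.getAvailableProcessors();
    if (loadPerCore > 1.5) {
      // Overloaded: halve the width we hand out to newly planned queries.
      maxWidthPerNode.updateAndGet(w -> Math.max(1, w / 2));
    } else if (loadPerCore < 0.75) {
      // Load has dropped: creep back toward the configured width.
      maxWidthPerNode.updateAndGet(w -> Math.min(configuredWidth, w + 1));
    }
  }

  /** Value the planner would consult in place of the static max_width_per_node. */
  int currentMaxWidth() {
    return maxWidthPerNode.get();
  }
}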

Thoughts?
Jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio
