Hanifi and I had a great conversation late last week about how Drill currently provides parallelization. Hanifi suggested we move to a model whereby there is a fixed threadpool for all Drill work and we treat all operator and/or fragment operations as tasks that can be scheduled within that pool. This would serve the following purposes:
1. Reduce the number of threads that Drill creates.
2. Decrease wasteful context switching (especially in high-concurrency scenarios).
3. Provide more predictable SLAs for Drill infrastructure tasks such as heartbeats/RPC, cancellations, planning, queue management, etc. (a key hot-button for Vicki :)

For reference, this is already the threading model we use for the RPC threads, and it is a fairly standard asynchronous programming model.

When Hanifi and I met, we brainstormed on what types of changes might be needed and ultimately thought that, in order to do this, we'd realistically want to move iterator trees from a pull model to a push model within a node. After spending more time thinking about this idea, I had the following thoughts:

- We could probably accomplish the same behavior by staying with a pull model and using IterOutcome.NOT_YET to return.
- In order for this to work effectively, all code would need to be non-blocking (including reading from disk, writing to sockets, waiting for ZooKeeper responses, etc.).
- Task length (or coarseness) would need to be quantized appropriately. While operating at the RootExec.next() level might be attractive, it is too coarse to get reasonable sharing, and we'd need to figure out ways to have time-based exit within operators.
- With this approach, one of the biggest challenges would be reworking all the operators to be able to unwind the stack to exit execution (i.e., to yield their thread).

Given those challenges, I think there may be another, simpler solution that could cover items 2 and 3 above without dealing with all the issues we would face under Hanifi's proposal. At its core, I see the biggest issue as the unwinding/rewinding that would be required to move between threads. This is very similar to how we needed to unwind in the case of memory allocation before we supported realloc, and it causes substantial extra code complexity.
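To make the pull-model alternative concrete, here is a minimal sketch of how a scheduler could re-queue a fragment whenever next() reports NOT_YET, instead of letting the thread block. Note this is illustrative only: IterOutcome and the task interface here are simplified stand-ins for Drill's real classes, and the round-robin loop is a toy, not a proposed implementation.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: fragments as cooperative tasks on a shared queue.
// A NOT_YET result yields the thread by re-queueing the task rather than
// blocking, so other fragments can make progress in the meantime.
public class NotYetScheduler {
  enum IterOutcome { OK, NOT_YET, NONE }  // NONE = stream finished

  interface FragmentTask {
    IterOutcome next();  // pull one batch; must never block
  }

  // Runs tasks round-robin on the calling thread until all are finished.
  static int runAll(Queue<FragmentTask> ready) {
    int batches = 0;
    while (!ready.isEmpty()) {
      FragmentTask t = ready.poll();
      IterOutcome out = t.next();
      if (out == IterOutcome.OK) { batches++; ready.add(t); }
      else if (out == IterOutcome.NOT_YET) { ready.add(t); }  // yield, retry later
      // NONE: task is complete, drop it
    }
    return batches;
  }

  // Demo: a fragment whose data "arrives" only on the third poll.
  static int demo() {
    Queue<FragmentTask> q = new ArrayDeque<>();
    q.add(new FragmentTask() {
      int polls = 0;
      public IterOutcome next() {
        polls++;
        if (polls < 3) return IterOutcome.NOT_YET;
        if (polls == 3) return IterOutcome.OK;
        return IterOutcome.NONE;
      }
    });
    return runAll(q);
  }

  public static void main(String[] args) {
    System.out.println("batches=" + demo());  // batches=1
  }
}
```

The catch, as noted above, is the last bullet: a real operator would have to unwind its stack to return NOT_YET from deep inside its execution, which is where most of the complexity lives.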
As such, I suggest we use a pause approach based on something similar to a semaphore for the number of active threads we allow. This could be done using the existing shouldContinue() mechanism, where we suspend or reacquire thread use as we pass through this method. We'd also create some alternative shouldContinue methods, such as shouldContinue(Lock toLock) and shouldContinue(Queue queueToTakeFrom), so that shouldContinue would naturally wrap blocking calls with the right logic. This would be a fairly simple set of changes, and we could see how well it improves issues 2 and 3 above.

On top of this, I think we still need to implement automatic parallelization scaling of the cluster. Even rudimentary monitoring of cluster load and parallel reduction of max_width_per_node would substantially improve the behavior of the cluster under heavy concurrent loads. (And note, I think this is required no matter what we implement above.)

Thoughts?

Jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio
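P.S. A rough sketch of what the shouldContinue(...) variants above could look like, assuming a global Semaphore sized to the desired active-thread count. The method names follow the suggestion in the email, but the bodies are illustrative only, not Drill's actual implementation (the queue variant returns the taken item, since Java can't dispatch on that overload cleanly otherwise):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the pause approach: a semaphore caps how many
// fragment threads may run at once, and the shouldContinue(...) variants
// release the permit around blocking calls so a waiting thread never
// occupies one of the limited execution slots.
public class ThrottledContinue {
  private final Semaphore activeSlots;
  private volatile boolean cancelled = false;

  public ThrottledContinue(int maxActiveThreads) {
    this.activeSlots = new Semaphore(maxActiveThreads);
  }

  // Called by a fragment before it begins doing work.
  public boolean start() throws InterruptedException {
    activeSlots.acquire();
    return !cancelled;
  }

  public void finish() { activeSlots.release(); }

  public void cancel() { cancelled = true; }

  // Plain checkpoint: cheap cancellation check at loop boundaries.
  public boolean shouldContinue() { return !cancelled; }

  // Wraps a blocking lock acquisition: give up the execution slot while
  // waiting, then take a slot back before resuming work.
  public boolean shouldContinue(Lock toLock) throws InterruptedException {
    activeSlots.release();
    toLock.lock();
    activeSlots.acquire();
    return !cancelled;
  }

  // Wraps a blocking queue take the same way.
  public <T> T shouldContinueTake(BlockingQueue<T> queue) throws InterruptedException {
    activeSlots.release();
    T item = queue.take();
    activeSlots.acquire();
    return cancelled ? null : item;
  }

  public static void main(String[] args) throws InterruptedException {
    ThrottledContinue tc = new ThrottledContinue(2);
    BlockingQueue<String> batches = new ArrayBlockingQueue<>(1);
    batches.add("batch-1");
    tc.start();                 // take an execution slot
    Lock l = new ReentrantLock();
    tc.shouldContinue(l);       // would pause here if slots were exhausted
    l.unlock();
    String b = tc.shouldContinueTake(batches);
    tc.finish();
    System.out.println(b);      // batch-1
  }
}
```

A real version would also need to interrupt threads parked inside the blocking call on cancellation, since a thread waiting in queue.take() never reaches the cancelled check; that is exactly the kind of detail the existing shouldContinue() plumbing would have to grow.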
