Thanks for the reference to the paper, Tim. Definitely worth a read. And yes, I think this topic is important enough that we need a thorough discussion, so the more inputs we can get, the better.
Parth

On Thu, Oct 9, 2014 at 12:37 AM, Timothy Chen <[email protected]> wrote:
> Hi Parth,
>
> Thanks for providing an update; it is really great to see more design discussions on the list!
>
> The pipeline chains definitely make a lot of sense, and I still remember offline discussions around this in the past.
>
> The global memory efficiency seems like a scheduling problem, as delaying a chain only helps if there are other chains in flight, whether at a chain level or at a query level.
>
> I don't have much to add yet, but I'd love to see how we can start simple on this.
>
> One relevant paper that I just started to read is from the recent OSDI '14; I'll chime in more once I've grasped it.
>
> https://www.usenix.org/conference/osdi14/technical-sessions/presentation/boutin
>
> Tim
>
>
> On Wed, Oct 8, 2014 at 11:03 PM, Parth Chandra <[email protected]> wrote:
> > Hi everyone,
> >
> > Aman, Jinfeng and I had an initial offline discussion about memory planning in Drill. I'm summarizing the discussion below and hoping to initiate a discussion around some of these ideas.
> >
> >
> > Order of Execution
> > -------------------------
> >
> > Assertion: for memory management, order of execution is a fundamental issue.
> >
> > One of the problems with memory usage in the execution engine is that all operators start up simultaneously and begin allocating memory, even though the downstream operators may not be ready to consume their output.
> >
> > For example, in the plan below:
> >
> >           Agg
> >            |
> >           HJ2
> >          /   \
> >        HJ1   Scan3
> >       /   \
> >   Scan1   Scan2
> >
> > the scan operators all begin reading data simultaneously. Here the hash joins are blocking operators, and the output of Scan3 cannot be consumed until HJ1 is ready to emit its results. If, say, Scan2 is on the build side of the hash table for HJ1, then HJ1 will not emit any records until Scan2 completes. If Scan3 starts immediately, it consumes memory that could be utilized by Scan2. If, instead, we delay the start of Scan3 until HJ1 is ready to emit records, we can utilize memory more efficiently.
> >
> > To address this, we can think of the query plan in terms of pipeline chains, where a pipeline chain is a chain of operators terminated by a blocking operator.
> >
> > In the example, there would be three pipeline chains:
> >
> >   PC1 : Scan1-HJ1-HJ2-Agg
> >   PC2 : Scan2
> >   PC3 : Scan3
> >
> > Now we can see that PC1 and PC2 can start right away, but PC3 can be delayed until PC2 has completed and PC1 has reached HJ2.
> >
> > One thing we need to consider is that multiple major fragments can be part of a single pipeline chain. All of these major fragments can begin execution once the pipeline chain is ready to begin execution.
> >
> > We need to think this one through. There are probably many details to be hashed out, but one thing is certain: the planner has to provide more information to the execution engine about the ordering of the pipeline chains. In other words, implementing this needs work on both the planner and the execution engine. We also need to work out how the idea of a pipeline chain will be reconciled with the major/minor fragments that are currently the units of execution.
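> >
> > To make the chain extraction concrete, here is a minimal sketch. The OpNode and PipelineChain classes and the clean probe/build split are illustrative assumptions only, not the planner's actual classes; a real implementation would work against the physical plan and its fragment boundaries.
> >
> >     import java.util.ArrayList;
> >     import java.util.List;
> >
> >     // Hypothetical plan node, purely for illustration. 'probe' is the
> >     // streaming input; 'build' is the blocking input (e.g. the build
> >     // side of a hash join). Leaves (scans) have neither.
> >     class OpNode {
> >         final String name;
> >         final OpNode probe;
> >         final OpNode build;
> >
> >         OpNode(String name, OpNode probe, OpNode build) {
> >             this.name = name;
> >             this.probe = probe;
> >             this.build = build;
> >         }
> >     }
> >
> >     // A pipeline chain plus the chains that must complete before it starts.
> >     class PipelineChain {
> >         final List<String> ops = new ArrayList<>();
> >         final List<PipelineChain> startAfter = new ArrayList<>();
> >     }
> >
> >     public class ChainBuilder {
> >
> >         static List<PipelineChain> buildChains(OpNode root) {
> >             List<PipelineChain> all = new ArrayList<>();
> >             walk(root, new PipelineChain(), all);
> >             return all;
> >         }
> >
> >         // Extends 'chain' down the probe spine; every build input starts a
> >         // new chain. Returns the build chains created in this subtree: an
> >         // upper join's build chain is deferred until those complete, which
> >         // approximates "PC1 has reached HJ2" in the example above.
> >         private static List<PipelineChain> walk(OpNode op, PipelineChain chain,
> >                                                 List<PipelineChain> all) {
> >             chain.ops.add(0, op.name);          // prepend: we descend from the root
> >             List<PipelineChain> builds = new ArrayList<>();
> >             if (op.probe != null) {
> >                 builds.addAll(walk(op.probe, chain, all));
> >             } else {
> >                 all.add(chain);                 // reached a leaf; the chain is complete
> >             }
> >             if (op.build != null) {
> >                 PipelineChain bc = new PipelineChain();
> >                 bc.startAfter.addAll(builds);   // defer behind deeper build chains
> >                 builds.addAll(walk(op.build, bc, all));
> >                 builds.add(bc);
> >             }
> >             return builds;
> >         }
> >
> >         public static void main(String[] args) {
> >             OpNode scan1 = new OpNode("Scan1", null, null);
> >             OpNode scan2 = new OpNode("Scan2", null, null);
> >             OpNode scan3 = new OpNode("Scan3", null, null);
> >             OpNode hj1 = new OpNode("HJ1", scan1, scan2);
> >             OpNode hj2 = new OpNode("HJ2", hj1, scan3);
> >             OpNode agg = new OpNode("Agg", hj2, null);
> >
> >             // Prints PC1 (Scan1-HJ1-HJ2-Agg) and PC2 (Scan2) as immediately
> >             // startable, and PC3 (Scan3) as deferred.
> >             for (PipelineChain c : buildChains(agg)) {
> >                 System.out.println(String.join("-", c.ops)
> >                     + (c.startAfter.isEmpty() ? "  (start immediately)" : "  (deferred)"));
> >             }
> >         }
> >     }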
> >
> > Fragment memory limit
> > ----------------------------
> >
> > We have implemented a simple method to limit the use of memory by a single fragment in 0.6 (it is disabled by default). This prevents a single fragment from hogging too much memory while other fragments are starved. However, the current implementation has some drawbacks:
> >
> > i) The fragment memory allocators have no idea how much memory is really needed by the fragment. The fragment limit is therefore determined by dividing the available memory *equally* among the fragments. This is not a fair method; a better choice would be to allocate fragment limits based on the relative needs of the executing fragments.
> >
> > ii) The idea of limiting memory use by a fragment is a little too narrow, since the purpose is to allow many queries, not fragments, to run together. The current mechanism favours queries that have many fragments over queries with fewer fragments but equivalent memory needs.
> >
> > To address this, we need to assign memory limits per query instead of per fragment. We also have some estimates at the query level of the amount of memory that the query may need. We should probably change the memory limit implementation to use this estimate and assign limits proportionately, with the limit calculated per query and assigned to all fragments of the same query. It might be even better if we could estimate the memory requirement per fragment and use that as the limit.
> >
> > Again, some work needs to be done to figure out what data is available to the allocators and what can be calculated or estimated at the planning stage to allow the allocators to distribute memory fairly.
> >
> >
> > All critiques and criticisms are welcome; we're hoping to get a good discussion going around this.
> >
> >
> > Parth
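> >
> > P.S. To make the proportional, per-query idea concrete, here is a minimal sketch. The interface, the estimate source, and the numbers are all assumptions for illustration, not the actual allocator API.
> >
> >     import java.util.HashMap;
> >     import java.util.Map;
> >
> >     // Illustrative only: the query names and estimates are hypothetical,
> >     // not Drill's actual allocator or planner interface.
> >     public class QueryMemoryLimits {
> >
> >         // Splits the available memory across queries in proportion to the
> >         // planner's estimate of each query's need, instead of dividing it
> >         // equally per fragment. Each query's share would then be assigned
> >         // to all fragments of that query.
> >         static Map<String, Long> assignLimits(long availableBytes,
> >                                               Map<String, Long> estimatedNeed) {
> >             long totalEstimate = estimatedNeed.values().stream()
> >                     .mapToLong(Long::longValue).sum();
> >             Map<String, Long> limits = new HashMap<>();
> >             for (Map.Entry<String, Long> e : estimatedNeed.entrySet()) {
> >                 // Fall back to today's equal split when no estimates exist;
> >                 // the double keeps the proportional product from overflowing.
> >                 long share = (totalEstimate == 0)
> >                         ? availableBytes / estimatedNeed.size()
> >                         : (long) (availableBytes * ((double) e.getValue() / totalEstimate));
> >                 limits.put(e.getKey(), share);
> >             }
> >             return limits;
> >         }
> >
> >         public static void main(String[] args) {
> >             Map<String, Long> estimates = new HashMap<>();
> >             estimates.put("q1", 8L << 30);   // heavy join query, ~8 GB estimated
> >             estimates.put("q2", 2L << 30);   // light scan/filter, ~2 GB estimated
> >
> >             // With 16 GB available, q1 gets ~12.8 GB and q2 gets ~3.2 GB,
> >             // rather than an equal (or fragment-count-weighted) split.
> >             assignLimits(16L << 30, estimates).forEach((q, limit) ->
> >                     System.out.println(q + " -> " + (limit >> 20) + " MB"));
> >         }
> >     }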
