Hi Julian,

Thanks for starting the discussion.
I'm trying to make sense of the interactions among all these parts ('memory use' / 'boundaries selection' / 'degree of parallelism'). You were defining an optimal degree of parallelism that might be computed after assigning operators to stages. But in some systems the decision on boundaries (aka break points) might be influenced by the parallelism degree too: given a tree of operators, you might want to move the boundaries in the plan to adjust the parallelism of those operators to an optimal value. In addition, some systems let users specify that parallelism manually, e.g. using hints, which fixes the parallelism of some (or all) operators beforehand and may in turn move boundaries as well.

Thus, in the metadata provider for 'memory use', I think it makes sense to provide methods to obtain memory consumption _for a given operator_ and _for a given subtree_. That would ease the optimization process: you could find the optimal degree of parallelism when the boundaries have already been fixed in the plan, or you could target the bigger problem of finding the optimal boundaries together with the degree of parallelism for the whole plan.

What do you think?

Thanks,
Jesús

On 2/23/15, 10:26 PM, "Julian Hyde" <[email protected]> wrote:

>I am currently helping the Hive team compute the memory usage of
>operators (more precisely, a set of operators that live in the same
>process) using a Calcite metadata provider.
>
>Part of this task is to create an "Average tuple size" metadata
>provider based on an "Average column size" metadata provider. This is
>fairly uncontroversial. (I'll create a jira case when
>https://issues.apache.org/ is back up, and we can discuss details such
>as how to compute the average size of columns computed using built-in
>functions, user-defined functions, or aggregate functions.)
>
>Also we want to create a "CumulativeMemoryUseWithinProcess" metadata
>provider. That would start at 0 for a table scan or exchange consumer,
>but increase for each operator building a hash table or using sort
>buffers until we reach the edge of the process.
>
>But we realized that we also need to determine the degree of
>parallelism, because we need to multiply AverageTupleSize not by
>RowCount but by RowCountWithinPartition. (If the data set has 100M
>rows of 100 bytes each and is bucketed 5 ways, then each process will
>need memory for 20M rows, i.e. 20M rows * 100 bytes/row = 2GB.)
>
>Now, if we are already at the stage of planning where we have
>determined the degree of parallelism, then we could expose this as a
>ParallelismDegree metadata provider.
>
>But maybe we're putting the cart before the horse? Maybe we should
>compute the degree of parallelism AFTER we have assigned operators to
>processes. We actually want to choose the parallelism degree such that
>all necessary data fits in memory. There might be several operators,
>all building hash tables or using sort buffers. The natural break
>point (at least in Tez) is where the data needs to be re-partitioned,
>i.e. an Exchange operator.
>
>So maybe we should instead compute "CumulativeMemoryUseWithinStage".
>(I'm coining the term "stage" to mean all operators within like
>processes, summing over all buckets.) Let's suppose that, in the above
>data set, we have two operators, each of which needs to build a hash
>table, and we have 2GB of memory available to each process.
>CumulativeMemoryUseWithinStage is 100M rows * 100 bytes/row * 2
>hash tables = 20GB. So the parallelism degree should be 20GB / 2GB =
>10.
>
>10 is a better choice of parallelism degree than 5. We lose a little
>(more nodes used, and more overhead combining the 10 partitions into
>1) but gain a lot (we save the cost of sending the data over the
>network).
>
>Thoughts on this? How do other projects determine the degree of
>parallelism?
>
>Julian
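For concreteness, the arithmetic in Julian's example can be sketched in code. This is a minimal sketch under the assumptions stated in the email (decimal GB, one hash table per buffering operator); the class and method names are illustrative only, not part of any Calcite API:

```java
// Hypothetical sketch, not real Calcite API. Computes the parallelism
// degree from cumulative stage memory and a per-process memory budget.
public class ParallelismSketch {

    /** Memory needed by the whole stage, summed over all buckets:
     *  rowCount * avgRowSize * number of memory-hungry operators
     *  (operators building hash tables or sort buffers). */
    static long cumulativeMemoryWithinStage(long rowCount, long avgRowBytes,
                                            int bufferingOperators) {
        return rowCount * avgRowBytes * bufferingOperators;
    }

    /** Smallest parallelism degree such that each process's share of the
     *  stage fits within its memory budget (ceiling division). */
    static int parallelismDegree(long stageMemoryBytes, long memoryPerProcessBytes) {
        return (int) ((stageMemoryBytes + memoryPerProcessBytes - 1)
            / memoryPerProcessBytes);
    }

    public static void main(String[] args) {
        // Julian's example: 100M rows, 100 bytes/row, 2 hash tables = 20GB.
        long stage = cumulativeMemoryWithinStage(100_000_000L, 100L, 2);
        // 2GB available per process -> degree 10.
        int degree = parallelismDegree(stage, 2_000_000_000L);
        System.out.println(degree); // prints 10
    }
}
```

Note the ceiling division: if the stage needs even one byte more than an exact multiple of the budget, the degree rounds up, so the data always fits.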

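The per-operator / per-subtree split suggested in the reply could look something like the following. This is a hypothetical sketch: `Op` and `MemoryUseProvider` are made-up names standing in for an operator tree and a metadata provider, not the actual Calcite interfaces:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical operator-tree node, not real Calcite API.
class Op {
    final long ownMemoryBytes;   // e.g. hash table or sort buffer; 0 for a scan
    final List<Op> inputs;

    Op(long ownMemoryBytes, Op... inputs) {
        this.ownMemoryBytes = ownMemoryBytes;
        this.inputs = Arrays.asList(inputs);
    }
}

// Hypothetical memory-use provider exposing both granularities, so a
// planner can cost a single operator or a whole candidate stage.
class MemoryUseProvider {
    /** Memory consumed by a single operator on its own. */
    long operatorMemory(Op op) {
        return op.ownMemoryBytes;
    }

    /** Memory consumed by an operator and its entire subtree; a planner
     *  would stop this recursion at a stage boundary (an Exchange). */
    long subtreeMemory(Op op) {
        long total = operatorMemory(op);
        for (Op input : op.inputs) {
            total += subtreeMemory(input);
        }
        return total;
    }
}
```

With both methods available, the optimizer can either cost fixed stages (sum a subtree between two boundaries) or explore boundary placement and parallelism jointly, as suggested above.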