I am currently helping the Hive team compute the memory usage of operators (more precisely, of a set of operators that live in the same process) using a Calcite metadata provider.
Part of this task is to create an “Average tuple size” metadata provider based on an “Average column size” metadata provider. This is fairly uncontroversial. (I’ll create a JIRA case when https://issues.apache.org/ is back up, and we can discuss details such as how to compute the average size of columns computed using built-in functions, user-defined functions, or aggregate functions.)

Also we want to create a “CumulativeMemoryUseWithinProcess” metadata provider. That would start at 0 for a table scan or exchange consumer, but increase for each operator building a hash table or using sort buffers, until we reach the edge of the process.

But we realized that we also need to determine the degree of parallelism, because we need to multiply AverageTupleSize not by RowCount but by RowCountWithinPartition. (If the data set has 100M rows of 100 bytes each and is bucketed 5 ways, then each process will need memory for 20M rows, i.e. 20M rows * 100 bytes/row = 2GB.)

Now, if we are at a stage of planning where the degree of parallelism has already been determined, we could expose it as a ParallelismDegree metadata provider. But maybe we’re putting the cart before the horse. Maybe we should compute the degree of parallelism AFTER we have assigned operators to processes.

We actually want to choose the parallelism degree such that all necessary data fits in memory. There might be several operators, all building hash tables or using sort buffers. The natural break point (at least in Tez) is where the data needs to be re-partitioned, i.e. at an Exchange operator. So maybe we should instead compute “CumulativeMemoryUseWithinStage”. (I’m coining the term “stage” to mean all operators within like processes, summing over all buckets.)

Let’s suppose that, in the above data set, we have two operators, each of which needs to build a hash table, and we have 2GB of memory available to each process. CumulativeMemoryUseWithinStage is 100M rows * 100 bytes/row * 2 hash tables = 20GB. So the parallelism degree should be 20GB / 2GB = 10.

10 is a better choice of parallelism degree than 5. We lose a little (more nodes used, and more overhead combining the 10 partitions into 1) but gain a lot (we save the cost of sending the data over the network).

Thoughts on this? How do other projects determine the degree of parallelism?

Julian
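P.S. To make some of this concrete, here are rough Java sketches. None of this is real Calcite API: all of the names below (averageTupleSize, averageColumnSize, cumulativeMemoryWithinStage, parallelismDegree) are made up for illustration. First, the average tuple size of an operator would just be the sum of the average sizes of the columns it emits:

  import org.apache.calcite.rel.RelNode;

  class MemoryMetadataSketch {
    /** Hypothetical handler: average tuple size (in bytes) is the sum
     * of the average sizes of the columns the operator emits. */
    static double averageTupleSize(RelNode rel) {
      double total = 0d;
      for (int i = 0; i < rel.getRowType().getFieldCount(); i++) {
        total += averageColumnSize(rel, i);
      }
      return total;
    }

    /** Stub standing in for the “Average column size” provider. */
    static double averageColumnSize(RelNode rel, int column) {
      return 8d; // pretend every column averages 8 bytes
    }
  }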
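Second, CumulativeMemoryUseWithinStage would walk down from an operator, stopping at the stage boundary (a table scan or an exchange consumer), and charging memory only to operators that build hash tables or use sort buffers. Continuing the same sketch class, with a few more imports, and assuming a RelMetadataQuery supplies row counts:

  import org.apache.calcite.rel.core.Aggregate;
  import org.apache.calcite.rel.core.Exchange;
  import org.apache.calcite.rel.core.Join;
  import org.apache.calcite.rel.core.Sort;
  import org.apache.calcite.rel.core.TableScan;
  import org.apache.calcite.rel.metadata.RelMetadataQuery;

    /** Memory (in bytes) used by rel and everything beneath it in the
     * same stage, summed over all buckets. A scan reads from disk and
     * an exchange reads from the network, so neither carries memory
     * up from below. */
    static double cumulativeMemoryWithinStage(RelNode rel,
        RelMetadataQuery mq) {
      if (rel instanceof TableScan || rel instanceof Exchange) {
        return 0d;
      }
      double memory = 0d;
      if (rel instanceof Join || rel instanceof Sort
          || rel instanceof Aggregate) {
        // Simplification: charge rowCount * tupleSize for the operator;
        // a real handler would charge a hash join for its build input only.
        memory = mq.getRowCount(rel) * averageTupleSize(rel);
      }
      for (RelNode input : rel.getInputs()) {
        memory += cumulativeMemoryWithinStage(input, mq);
      }
      return memory;
    }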
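Then the parallelism degree falls out of dividing the stage’s memory by the per-process budget:

    /** Smallest parallelism degree that lets the stage fit in memory.
     * For the example above, ceil(20GB / 2GB) = 10. */
    static int parallelismDegree(RelNode stageRoot, RelMetadataQuery mq,
        double memoryPerProcess) {
      return (int) Math.ceil(
          cumulativeMemoryWithinStage(stageRoot, mq) / memoryPerProcess);
    }

(A real implementation would presumably also cap the degree at the size of the cluster; this is just the memory-driven lower bound.)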
