I am currently helping the Hive team compute the memory usage of
operators (more precisely, a set of operators that live in the same
process) using a Calcite metadata provider.

Part of this task is to create an “Average tuple size” metadata
provider based on an “Average column size” metadata provider. This is
fairly uncontroversial. (I’ll create a JIRA case when
https://issues.apache.org/ is back up, and we can discuss details such
as how to compute the average size of columns computed using built-in
functions, user-defined functions, or aggregate functions.)
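To make the relationship concrete, here is a minimal sketch of how an
average tuple size could be derived from per-column averages. The
function names and the per-row overhead handling are my assumptions,
not an existing Calcite API:

```python
def average_tuple_size(avg_column_sizes):
    # Hypothetical sketch: the average tuple size is just the sum of the
    # per-column average sizes. A real provider might also add per-row
    # overhead (null bitmaps, object headers), ignored here.
    return sum(avg_column_sizes)

# e.g. a (BIGINT, INT, VARCHAR averaging 38 bytes) row -> 50 bytes
print(average_tuple_size([8, 4, 38]))
```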

Also we want to create a “CumulativeMemoryUseWithinProcess” metadata
provider. That would start at 0 for a table scan or exchange consumer,
but increase for each operator building a hash-table or using sort
buffers until we reach the edge of the process.
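As a sketch of the accumulation rule (operator names and the
`memory_use` map are illustrative, not Hive or Calcite API): walking
the operators of one process leaf-to-root, a table scan or exchange
consumer resets the tally to zero, and each memory-consuming operator
adds its buffers:

```python
def cumulative_memory_within_process(operators, memory_use):
    # operators: list of operator names from leaf to root within one process.
    # memory_use: bytes of hash-table / sort-buffer memory per operator
    # (operators absent from the map, e.g. Filter, contribute 0).
    total = 0
    for op in operators:
        if op in ("TableScan", "ExchangeConsumer"):
            total = 0  # a new process boundary starts the count over
        else:
            total += memory_use.get(op, 0)
    return total

plan = ["TableScan", "Filter", "HashJoinBuild", "Sort"]
mem = {"HashJoinBuild": 2_000_000_000, "Sort": 500_000_000}
print(cumulative_memory_within_process(plan, mem))
```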

But we realized that we also need to determine the degree of
parallelism, because we need to multiply AverageTupleSize not by
RowCount but by RowCountWithinPartition. (If the data set has 100M
rows, each 100 bytes, and is bucketed 5 ways, then each process will
need memory for 20M rows, i.e. 20M rows * 100 bytes/row = 2GB.)
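The arithmetic in the parenthetical, as a tiny check (the function name
`memory_per_process` is mine, and it assumes rows are distributed
evenly across buckets):

```python
def memory_per_process(row_count, avg_tuple_bytes, parallelism):
    # RowCountWithinPartition * AverageTupleSize, assuming an even
    # spread of rows over the buckets.
    rows_per_partition = row_count / parallelism
    return rows_per_partition * avg_tuple_bytes

# 100M rows of 100 bytes, bucketed 5 ways -> 2GB per process
print(memory_per_process(100_000_000, 100, 5))
```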

Now, if we are already at the stage of planning where we have already
determined the degree of parallelism, then we could expose this as a
ParallelismDegree metadata provider.

But maybe we’re putting the cart before the horse? Maybe we should
compute the degree of parallelism AFTER we have assigned operators to
processes. We actually want to choose the parallelism degree such that
we can fit all necessary data in memory. There might be several
operators, all building hash-tables or using sort buffers. The natural
break point (at least in Tez) is where the data needs to be
re-partitioned, i.e. an Exchange operator.

So maybe we should instead compute “CumulativeMemoryUseWithinStage”.
(I'm coining the term “stage” to mean all operators that run within
like processes, summed over all buckets.) Let’s suppose that, in the
above
data set, we have two operators, each of which needs to build a hash
table, and we have 2GB memory available to each process.
CumulativeMemoryUseWithinStage is 100M rows * 100 bytes/row * 2
hash-tables = 20GB. So, the parallelism degree should be 20GB / 2GB =
10.
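The same computation as a sketch, assuming my own function name and
that the right answer is the ceiling of stage memory over per-process
memory:

```python
import math

def parallelism_for_stage(row_count, avg_tuple_bytes,
                          memory_consumers, mem_per_process):
    # CumulativeMemoryUseWithinStage: every memory-consuming operator
    # (hash-table build, sort) holds a copy of the stage's data.
    stage_bytes = row_count * avg_tuple_bytes * memory_consumers
    # Round up so that each process fits within its memory budget.
    return math.ceil(stage_bytes / mem_per_process)

# 100M rows * 100 bytes/row * 2 hash-tables = 20GB; 20GB / 2GB = 10
print(parallelism_for_stage(100_000_000, 100, 2, 2_000_000_000))
```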

10 is a better choice of parallelism degree than 5. We lose a little
(more nodes used, and more overhead combining the 10 partitions into
1) but gain a lot (we save the cost of sending the data over the
network).

Thoughts on this? How do other projects determine the degree of parallelism?

Julian
