Yes, absolutely, that's the only sane way to approach this.

Calcite's metadata provider model does make it possible to use an estimate of 
parallelism at one stage of planning, then use the real number at a later stage.

In my next email I propose some new metadata providers. Hopefully they can be 
fitted into Drill's planning process. Please also advise whether the terminology 
(processes, splits, stages, memory, size) seems intuitive, even though it is not 
tied to a particular execution engine.

Julian


> On Feb 24, 2015, at 10:23 AM, Jacques Nadeau <[email protected]> wrote:
> 
> A little food for thought given our work on Drill...
> 
> We punted on trying to optimize all of this at once.  Very little of the
> core plan exploration process is influenced directly by the degree of
> parallelism.  We do distinguish between parallelized and non-parallelized
> operators, and use a few "hints" during planning to make decisions where
> we need to.  Afterwards, we do a secondary pass where we determine
> parallelism and memory consumption.  Then, if our first plan turns out
> not to fit into available cluster memory, we replan with more
> conservative memory settings.  While we may lose some ideal plans, this
> makes the whole process substantially easier to reason about.
> 
> We also did this because, in some of our initial experiments, including
> a number of these things in the planning process caused planning time to
> get out of hand.
> 
> 
> On Mon, Feb 23, 2015 at 1:26 PM, Julian Hyde <[email protected]> wrote:
> 
>> I am currently helping the Hive team compute the memory usage of
>> operators (more precisely, a set of operators that live in the same
>> process) using a Calcite metadata provider.
>> 
>> Part of this task is to create an “Average tuple size” metadata
>> provider based on an “Average column size” metadata provider. This is
>> fairly uncontroversial. (I’ll create a JIRA case when
>> https://issues.apache.org/ is back up, and we can discuss details such
>> as how to compute the average size of columns computed using built-in
>> functions, user-defined functions, or aggregate functions.)
>> 
>> Also we want to create a “CumulativeMemoryUseWithinProcess” metadata
>> provider. That would start at 0 for a table scan or exchange consumer,
>> but increase for each operator building a hash-table or using sort
>> buffers until we reach the edge of the process.
>> 
>> But we realized that we also need to determine the degree of
>> parallelism, because we need to multiply AverageTupleSize not by
>> RowCount but by RowCountWithinPartition. (If the data set has 100M
>> rows of 100 bytes each and is bucketed 5 ways, then each process will
>> need memory for 20M rows, i.e. 20M rows * 100 bytes/row = 2GB.)
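A quick sketch of that arithmetic (the class name and constants are
illustrative, taken directly from the figures in the example above):

```java
// Hypothetical worked example of the per-process memory estimate:
// 100M rows * 100 bytes/row, bucketed 5 ways.
public class PerProcessMemory {
    public static void main(String[] args) {
        long rowCount = 100_000_000L;    // total rows in the data set
        long avgTupleSize = 100L;        // average tuple size in bytes
        int buckets = 5;                 // degree of parallelism

        long rowsPerProcess = rowCount / buckets;             // 20M rows
        long bytesPerProcess = rowsPerProcess * avgTupleSize; // 2GB

        System.out.println(rowsPerProcess + " rows, " + bytesPerProcess + " bytes");
    }
}
```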
>> 
>> Now, if we are already at the stage of planning where we have already
>> determined the degree of parallelism, then we could expose this as a
>> ParallelismDegree metadata provider.
>> 
>> But maybe we’re putting the cart before the horse? Maybe we should
>> compute the degree of parallelism AFTER we have assigned operators to
>> processes. We actually want to choose the parallelism degree such that
>> we can fit all necessary data in memory. There might be several
>> operators, all building hash-tables or using sort buffers. The natural
>> break point (at least in Tez) is where the data needs to be
>> re-partitioned, i.e. an Exchange operator.
>> 
>> So maybe we should instead compute “CumulativeMemoryUseWithinStage”.
>> (I'm coining the term “stage” to mean all operators within like
>> processes, summing over all buckets.) Let’s suppose that, in the above
>> data set, we have two operators, each of which needs to build a hash
>> table, and we have 2GB memory available to each process.
>> CumulativeMemoryUseWithinStage is 100M rows * 100 bytes/row * 2
>> hash-tables = 20GB. So, the parallelism degree should be 20GB / 2GB =
>> 10.
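The stage-based calculation can be sketched the same way (class and
variable names are illustrative; the figures come from the example above):

```java
// Hypothetical sketch: derive the parallelism degree from
// CumulativeMemoryUseWithinStage and the per-process memory budget.
public class StageParallelism {
    public static void main(String[] args) {
        long rowCount = 100_000_000L;           // total rows in the data set
        long avgTupleSize = 100L;               // average tuple size in bytes
        int hashTables = 2;                     // operators building hash tables
        long memoryPerProcess = 2_000_000_000L; // 2GB available per process

        // CumulativeMemoryUseWithinStage = 100M rows * 100 bytes * 2 = 20GB
        long stageMemory = rowCount * avgTupleSize * hashTables;

        // Round up so every process fits in its budget: 20GB / 2GB = 10
        long parallelism = (stageMemory + memoryPerProcess - 1) / memoryPerProcess;

        System.out.println("parallelism = " + parallelism);
    }
}
```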
>> 
>> 10 is a better choice of parallelism degree than 5. We lose a little
>> (more nodes used, and more overhead combining the 10 partitions into
>> 1) but gain a lot (we save the cost of sending the data over the
>> network).
>> 
>> Thoughts on this? How do other projects determine the degree of
>> parallelism?
>> 
>> Julian
>> 
