On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
>> Have you tested the effect of this patch on planner memory consumption
>> with multi-way joins between tables with many partitions?  If you
>> haven't, you probably should.  (Testing runtime would be good, too.)
>> Does it grow linearly?  Quadratically?  Exponentially?  Minor leaks
>> don't matter, but if we're generating too much garbage we'll have to
>> make sure it gets cleaned up soon enough to prevent runaway memory
>> usage.
>
> I tried to check memory usage with various combinations of number of
> partitions and number of relations being joined.  For a higher number of
> relations being joined, like 10 with 100 partitions, the OOM killer
> kicked in during the planning phase.  I suspect that
> adjust_partitionrel_attrs() (renamed to adjust_join_appendrel_attrs()
> to be in sync with adjust_appendrel_attrs()) is the culprit.  It copies
> expression trees every time two children are joined, and the number of
> legal joins grows exponentially.  I am still investigating this.
I think the root of this problem is that the existing paths share a lot
more substructure than the ones created by the new code.  Without a
partition-wise join, the incremental memory usage for a joinrel isn't any
different whether the underlying rel is partitioned or not.  If it's
partitioned, we'll be pointing to an AppendPath; if not, we'll be pointing
to some kind of Scan.  But the join itself creates exactly the same amount
of new stuff regardless of what's underneath it.

With partition-wise join, that ceases to be true.  Every joinrel - and the
number of those grows exponentially in the number of baserels, IIUC -
needs its own list of paths for every member rel.  So if a
non-partition-wise join created X paths, and there are K partitions, a
partition-wise join creates X * K paths.  That's a lot.

Although we might be able to save some memory by tightening things up here
and there - for example, right now the planner isn't very smart about
recycling paths that are evicted by add_path(), and there's probably other
wastage as well - I suspect that what this shows is that the basic design
of this patch is not going to be viable.

Intuitively, it's often going to be the case that we want the "same plan"
for every partition-set.  That is, if we have A JOIN B ON A.x = B.x JOIN C
ON A.y = C.y, and if A, B, and C are all compatibly partitioned, then the
result should be an Append plan with 100 join plans under it, and all 100
of those plans should be basically mirror images of each other.  Of
course, that's not really right in general: for example, it could be that
A1 is big and A2 is small while B1 is small and B2 is big, so that the
right plan for (A1 JOIN B1) and for (A2 JOIN B2) are totally different
from each other.  But in many practical cases we'll want to end up with a
plan of precisely the same shape for all children, and the current design
ignores this, expending both memory and CPU time to compute
essentially-equivalent paths across all children.
One way of attacking this problem is to gang together partitions which are
equivalent for planning purposes, as discussed in the paper "Join
Optimization Techniques for Partitioned Tables" by Herodotou, Borisov, and
Babu.  However, it's not exactly clear how to do this: we could gang
together partitions that have the same index definitions, but the sizes of
the heaps, the sizes of their indexes, and the row counts will vary from
one partition to the next, and any of those things could cause the plan
choice to be different for one partition vs. the next.  We could try to
come up with heuristics for when those things are likely to be true.  For
example, suppose we compute the set of partitions such that all joined
relations have matching index definitions on all tables; then, we take the
biggest table in the set and consider all tables more than half that size
as part of one gang.  The biggest table becomes the leader and we compute
partition-wise paths for just that partition; the other members of the
gang will eventually get a plan that is of the same shape, but we don't
actually create that plan until after scan/join planning is concluded.

Another idea is to try to reduce peak memory usage by performing planning
separately for each partition-set.  For example, suppose we decide to do a
partition-wise join of A, B, and C.  Initially, this gets represented as a
PartitionJoinPath tree, like this:

PartitionJoinPath
-> AppendPath for A
-> PartitionJoinPath
   -> AppendPath for B
   -> AppendPath for C

Because we haven't created individual join paths for the members, this
doesn't use much memory.  Somehow, we come up with a cost for the
PartitionJoinPath; it probably won't be entirely accurate.  Once scan/join
planning is concluded, if our final path contains a PartitionJoinPath, we
go back and loop over the partitions.
For each partition, we switch to a new memory context, perform planning,
copy the best path and its substructure back to the parent context, and
then reset the context.  In that way, peak memory usage only grows by
about a factor of 2 rather than a factor equal to the partition count,
because we don't need to keep every possibly-useful path for every
partition all at the same time, but rather every possibly-useful path for
a single partition.

Maybe there are other ideas, but I have a feeling that any way you slice
it, this is going to be a lot of work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers