Hi,

On 16.10.2014 03:26, Jeff Davis wrote:
> On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote:
>> What about removing the callback per se and just keeping the
>> argument, as it were. That is, a MemoryContext has a field of type
>> "size_t *" that is normally NULL, but if set to non-NULL, then we
>> increment the pointed-to value for pallocs and decrement for
>> pfrees.
>
> I like that idea; patch attached. Two differences:
>
>   * At block level, not chunk level.
>   * I use a uint64, because a size_t is only guaranteed to hold one
>     allocation, and we need to sum up many allocations.
>
> When unused, it does still appear to have a little overhead in
> Robert's test on his power machine. It seems to be between a 0.5-1.0%
> regression. There's not much extra code on that path, just a branch,
> pointer dereference, and an addition; so I don't think it will get
> much cheaper than it is.
>
> This regression seems harder to reproduce on my workstation, or
> perhaps it's just noisier.
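For reference, the quoted proposal boils down to something like the
sketch below (my illustration only, not the actual patch - the field
name matches the snippet quoted later, but the helper functions are
hypothetical):

    /*
     * Sketch of the discussed mechanism: a context carries an optional
     * pointer to an externally-owned counter, and the block-level
     * alloc/free paths update it only when it's set.
     */
    typedef struct MemoryContextData
    {
        /* ... existing fields ... */
        uint64     *track_mem;      /* NULL => no accounting */
    } MemoryContextData;

    typedef MemoryContextData *MemoryContext;

    static inline void
    account_block_alloc(MemoryContext context, Size blksize)
    {
        if (context->track_mem != NULL)     /* the only cost when unused */
            *context->track_mem += blksize;
    }

    static inline void
    account_block_free(MemoryContext context, Size blksize)
    {
        if (context->track_mem != NULL)
            *context->track_mem -= blksize;
    }

The branch + pointer dereference + addition is exactly the path Jeff
describes above.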
I assume it's the "reindex pgbench_accounts_pkey" test, correct?

I did the same test on my workstation (with an i7-4770R CPU), and ISTM
there's a measurable difference, although in surprising directions. For
binaries compiled with gcc 4.7.3 and clang 3.3 and 3.5, the results
look like this (5 runs for each combination):

gcc 4.7.3 / master
------------------
LOG: internal sort ended, ... CPU 1.24s/4.32u sec elapsed 9.78 sec
LOG: internal sort ended, ... CPU 1.29s/4.31u sec elapsed 9.80 sec
LOG: internal sort ended, ... CPU 1.26s/4.31u sec elapsed 9.80 sec
LOG: internal sort ended, ... CPU 1.26s/4.37u sec elapsed 9.78 sec
LOG: internal sort ended, ... CPU 1.33s/4.29u sec elapsed 9.78 sec

gcc 4.7.3 / patched
-------------------
LOG: internal sort ended, ... CPU 1.35s/4.27u sec elapsed 9.78 sec
LOG: internal sort ended, ... CPU 1.33s/4.30u sec elapsed 9.74 sec
LOG: internal sort ended, ... CPU 1.34s/4.27u sec elapsed 9.75 sec
LOG: internal sort ended, ... CPU 1.27s/4.26u sec elapsed 9.78 sec
LOG: internal sort ended, ... CPU 1.35s/4.26u sec elapsed 9.78 sec

clang 3.3 / master
------------------
LOG: internal sort ended, ... CPU 1.32s/4.61u sec elapsed 10.00 sec
LOG: internal sort ended, ... CPU 1.27s/4.48u sec elapsed 9.95 sec
LOG: internal sort ended, ... CPU 1.35s/4.48u sec elapsed 9.99 sec
LOG: internal sort ended, ... CPU 1.32s/4.49u sec elapsed 9.97 sec
LOG: internal sort ended, ... CPU 1.32s/4.47u sec elapsed 10.04 sec

clang 3.3 / patched
-------------------
LOG: internal sort ended, ... CPU 1.35s/4.59u sec elapsed 10.13 sec
LOG: internal sort ended, ... CPU 1.31s/4.61u sec elapsed 10.06 sec
LOG: internal sort ended, ... CPU 1.28s/4.63u sec elapsed 10.10 sec
LOG: internal sort ended, ... CPU 1.27s/4.58u sec elapsed 10.01 sec
LOG: internal sort ended, ... CPU 1.29s/4.60u sec elapsed 10.05 sec

clang 3.5 / master
------------------
LOG: internal sort ended, ... CPU 1.30s/4.46u sec elapsed 9.96 sec
LOG: internal sort ended, ... CPU 1.30s/4.49u sec elapsed 9.96 sec
LOG: internal sort ended, ... CPU 1.30s/4.53u sec elapsed 9.95 sec
LOG: internal sort ended, ... CPU 1.25s/4.51u sec elapsed 9.95 sec
LOG: internal sort ended, ... CPU 1.30s/4.50u sec elapsed 9.95 sec

clang 3.5 / patched
-------------------
LOG: internal sort ended, ... CPU 1.32s/4.59u sec elapsed 9.97 sec
LOG: internal sort ended, ... CPU 1.27s/4.49u sec elapsed 9.91 sec
LOG: internal sort ended, ... CPU 1.29s/4.48u sec elapsed 9.88 sec
LOG: internal sort ended, ... CPU 1.31s/4.49u sec elapsed 9.93 sec
LOG: internal sort ended, ... CPU 1.23s/4.56u sec elapsed 9.90 sec

So while on clang 3.3 I really see a ~1% slowdown, with the other two
compilers it seems a tad faster (though that might easily be noise).
It'd be interesting to know what compiler/version Robert used ...

>> One thought in either case is that we don't have to touch the API
>> for MemoryContextCreate: rather, the feature can be enabled in a
>> separate call after making the context.
>
> That seems fairly awkward to me because the pointer needs to be
> inherited from the parent context when not specified. I left the
> extra API call in.

I don't see a big difference between adding a new "create" API method
and adding a separate "enable tracking" method. What does seem awkward
to me, however, is using the 'uint64 *' directly as a parameter,
instead of an 'enable_tracking' flag as in the previous versions. Is
there a reason for that?
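To spell out the two styles I'm comparing (hypothetical signatures
modeled on AllocSetContextCreate, not verbatim from any patch version;
obviously the two variants couldn't share a name in real code):

    /* flag-based (earlier versions): the context owns its counter */
    MemoryContext
    AllocSetContextCreateTracked(MemoryContext parent, const char *name,
                                 Size minContextSize, Size initBlockSize,
                                 Size maxBlockSize, bool enable_tracking);

    /* pointer-based (v6): the caller owns and supplies the counter */
    MemoryContext
    AllocSetContextCreateTracked(MemoryContext parent, const char *name,
                                 Size minContextSize, Size initBlockSize,
                                 Size maxBlockSize, uint64 *track_mem);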
The pointer-based variant does allow supplying a pointer to a plain
uint64 variable, i.e. something like this:

    {
        uint64          mem_used = 0;
        MemoryContext   context
            = MemoryContextCreateTracked(..., &mem_used);

        ...

        if (mem_used > work_mem * 1024L)
        {
            ... do something ...
        }
    }

Thanks to exposing the counter like this, it's possible to read the
current value directly (without the API methods available in the
previous versions). It also makes it possible to "detach" the
accounting from the parent (i.e. not count the memory against the
parent). But is this intended/sane?

> The inheritance is awkward anyway, though. If you create a tracked
> context as a child of an already-tracked context, allocations in the
> newer one won't count against the original. I don't see a way around
> that without introducing even more performance problems.

I see this as the main drawback of the current design :-(

The question is whether we can accept such a limitation, and enforce it
somehow. For example, if you request a new context with tracking and
find out the parent context already has tracking enabled, we could
error out. That would effectively make it an "internal only" feature,
because user code (e.g. custom aggregates in extensions) couldn't rely
on getting a non-tracking context, making it impossible to use the
tracking there. Maybe that's an acceptable limitation? If we can get
into such a conflicting situation without any external code, it's
pretty much DOA. I can't think of such an example, but I'm not
convinced there isn't one ...

Regarding the "performance problems" - ISTM it's important to
differentiate between the impact on performance when no memory
accounting is enabled (but the code is there), and the impact with
accounting enabled. In other words:

   (a) free when not used (or as close to 'free' as possible)
   (b) cheap when used (as cheap as possible)

I think the current patch achieves (a) - it's difficult to beat a
single (a != NULL) check. We haven't seen any benchmarks for (b), at
least not with the current patch. But even if we introduce some
additional overhead (and it's impossible not to), we get something in
return - the question is whether the cost is adequate. Also, if we
don't provide this accounting, the overhead of the local
implementations may easily be higher.

> If this patch is unacceptable, my only remaining idea is to add the
> memory only for the current context with no inheritance (thereby
> eliminating the branch, also). That will require HashAgg to traverse
> all the child contexts to check whether the memory limit has been
> exceeded. As Tomas pointed out, that could be a lot of work in the
> case of array_agg with many groups.

I can't really imagine dropping the inheritance ... A hierarchy of
memory contexts with accounting that does not support hierarchies seems
rather strange IMNSHO, making it either useless or at least very
expensive in the interesting cases (array_agg with a context per group
being a prime example).

ISTM the reason for the issues with inheritance (two memory contexts
with tracking enabled) is the single uint64 counter. In one of my
previous messages (from 23/8), I proposed to use this struct to build a
"parallel" accounting hierarchy (tweaked to match the v6 patch):

    typedef struct MemoryAccountingData
    {
        uint64  track_mem;
        struct MemoryAccountingData *parent;
    } MemoryAccountingData;

Each context has either a private instance (if tracking was enabled),
or points to the nearest ancestor context with tracking enabled (if
there is one).

In the 'tracking disabled' case, this should have exactly the same
overhead as the v6 patch (still a single (pointer != NULL) check). In
the 'tracking enabled' case, it'll be more expensive, because the
changes need to propagate up the tree. But that happens only if there
are multiple contexts with tracking enabled - with a single one, it
should be exactly the same as the v6 patch.
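A rough sketch of the update path I have in mind (the walk-up loop is
my assumption about how the parallel hierarchy would be maintained, not
code from any patch):

    /*
     * On block alloc/free, walk the accounting chain so that every
     * tracking-enabled ancestor sees the change. With a single tracked
     * context the loop body executes once, matching the v6 patch; the
     * extra cost appears only with multiple tracked contexts.
     */
    static void
    accounting_add(MemoryAccountingData *accounting, int64 delta)
    {
        MemoryAccountingData *acc;

        for (acc = accounting; acc != NULL; acc = acc->parent)
            acc->track_mem += delta;    /* delta is negative on free */
    }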
Haven't tried measuring the impact, though, because we don't have
actual code using it. The measurements in the 23/8 message say that
with palloc_bench, the overhead was ~3% (compared to 'tracking
disabled'). Also, all that benchmark does is palloc, so it wouldn't
really be +3% in practice.

A few minor comments regarding the patch:

(a) I see we're not tracking self/total values any more. I don't think
    we really need those two numbers - the total is enough. Correct?

(b) Both AllocSetReset/AllocSetDelete update the accounting info for
    each block separately:

        if (context->track_mem != NULL)
            *context->track_mem -= block->endptr - ((char *) block);

    Wouldn't it be better to accumulate the total into a local variable
    and then update track_mem just once? This might be more important
    when using the MemoryAccountingData structure and multiple
    tracking-enabled contexts, though.

(c) Most of the changes in AllocSetContextCreateTracked are only
    renames 'context => set', which can be easily eliminated by using

        AllocSet context = (AllocSet) MemoryContextCreate(...

    and casting to MemoryContext at the one place where it's needed.

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers