Re: [HACKERS] Memory Accounting v11

2015-07-17 Thread Jeff Davis
On Fri, 2015-07-17 at 15:52 +1200, David Rowley wrote:

 Should we mark the patch as returned with feedback in the commitfest
 app then?

I believe the memory accounting patch has been rejected. Instead, the
work will be done in the HashAgg patch.

Thank you for the review!

Regards,
Jeff Davis







Re: [HACKERS] Memory Accounting v11

2015-07-16 Thread David Rowley
On 16 July 2015 at 05:07, Jeff Davis pg...@j-davis.com wrote:

 On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:

 
  I think a heuristic would be more suited here and ignoring memory
  consumption for internal types means that we are not making the memory
  accounting useful for a set of use cases.
 
 OK, I will drop this patch and proceed with the HashAgg patch, with a
 heuristic for internal types.


Should we mark the patch as returned with feedback in the commitfest app
then?

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Memory Accounting v11

2015-07-15 Thread Jeff Davis
On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:
 tuplesort.c does its own accounting, and TBH that seems like the right
 thing to do here, too.  The difficulty is, I think, that some
 transition functions use an internal data type for the transition
 state, which might not be a single palloc'd chunk.  But since we can't
 spill those aggregates to disk *anyway*, that doesn't really matter.

So would it be acceptable to just ignore the memory consumed by
internal, or come up with some heuristic?

Regards,
Jeff Davis






Re: [HACKERS] Memory Accounting v11

2015-07-15 Thread Atri Sharma
On Wed, Jul 15, 2015 at 12:57 PM, Jeff Davis pg...@j-davis.com wrote:

 On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:
  tuplesort.c does its own accounting, and TBH that seems like the right
  thing to do here, too.  The difficulty is, I think, that some
  transition functions use an internal data type for the transition
  state, which might not be a single palloc'd chunk.  But since we can't
  spill those aggregates to disk *anyway*, that doesn't really matter.

 So would it be acceptable to just ignore the memory consumed by
 internal, or come up with some heuristic?

 Regards,
 Jeff Davis


I think a heuristic would be more suited here and ignoring memory
consumption for internal types means that we are not making the memory
accounting useful for a set of use cases.



-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Memory Accounting v11

2015-07-15 Thread Jeff Davis
On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:

 
 I think a heuristic would be more suited here and ignoring memory
 consumption for internal types means that we are not making the memory
 accounting useful for a set of use cases.
 
OK, I will drop this patch and proceed with the HashAgg patch, with a
heuristic for internal types.

Regards,
Jeff Davis







Re: [HACKERS] Memory Accounting v11

2015-07-15 Thread Robert Haas
On Tue, Jul 14, 2015 at 9:14 PM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 Firstly, do we really have good benchmarks and measurements? I really doubt
 that. We do have some numbers for REINDEX, where we observed 0.5-1%
 regression on noisy results from a Power machine (and we've been unable to
 reproduce that on x86). I don't think that's a representative benchmark, and
 I'd like to see more thorough measurements. And I agreed to do this, once
 Jeff comes up with a new version of the patch.

 Secondly, the question is whether the performance is impacted more by the
 additional instructions, or by other things - say, random padding, as was
 explained by Andrew Gierth in [1].

 I don't know whether that's happening in this patch, but if it is, it seems
 rather strange to use this against this patch and not the others (because
 there surely will be other patches causing similar issues).

It matters, at least to me, an awful lot where we're adding lines of
code.  If you want to add modest amounts of additional code to CREATE
TABLE or CHECKPOINT or something like that, I really don't care,
because that stuff doesn't execute frequently enough to really matter
to performance even if we add a significant chunk of overhead, and
we're doing other expensive stuff at the same time, like fsync.  The
same can't be said about functions like LWLockAcquire and
AllocSetAlloc that routinely show up at the top of CPU profiles.

I agree that there is room to question whether the benchmarks I did
are sufficient reason to think that the abstract concern that putting
more code into a function might make it slower translates into a
measurable drop in performance in practice.  But I think when it comes
to these very hot code paths, extreme conservatism is warranted.  We
can agree to disagree about that.

 tuplesort.c does its own accounting, and TBH that seems like the right
 thing to do here, too.  The difficulty is, I think, that some
 transition functions use an internal data type for the transition
 state, which might not be a single palloc'd chunk.  But since we can't
 spill those aggregates to disk *anyway*, that doesn't really matter.
 If the transition is a varlena or a fixed-length type, we can know how
 much space it's consuming without hooking into the memory context
 framework.

 I respectfully disagree. Our current inability to dump/load the state has
 little to do with how we measure consumed memory, IMHO.

 It's true that we do have two kinds of aggregates, depending on the nature
 of the aggregate state:

 (a) fixed-size state (data types passed by value, variable length types
 that do not grow once allocated, ...)

 (b) continuously growing state (as in string_agg/array_agg)

 Jeff's HashAgg patch already fixes (a) and can fix (b) once we get a
 solution for dump/load of the aggregate states - which we need to implement
 anyway for parallel aggregate.

 I know there was a proposal to force all aggregates to use regular data
 types as aggregate states, but I can't see how that could work without a
 significant performance penalty. For example array_agg() is using internal
 to pass ArrayBuildState - how do you turn that to a regular data type
 without effectively serializing/deserializing the whole array on every
 transition?

That is a good point.  Tom suggested that his new expanded-format
stuff might be able to be adapted to the purpose, but I have no idea
how possible that really is.

 And even if we come up with a solution for array_agg, do we really believe
 it's possible to do for all custom aggregates? Maybe I'm missing something
 but I doubt that. ISTM designing ephemeral data structure allows tweaks that
 are impossible otherwise.

 What might be possible is extending the aggregate API with another custom
 function returning size of the aggregate state. So when defining an
 aggregate using 'internal' for aggregate state, you'd specify transfunc,
 finalfunc and sizefunc. That seems doable, I guess.

And infunc and outfunc.  If we don't use the expanded-format stuff for
aggregates with a type-internal transition state, we won't be able to
use input and output functions to serialize and deserialize them,
either.

 I find the memory accounting as a way more elegant solution, though.

I think we're just going to have to agree to disagree on that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Memory Accounting v11

2015-07-15 Thread Robert Haas
On Wed, Jul 15, 2015 at 3:27 AM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:
 tuplesort.c does its own accounting, and TBH that seems like the right
 thing to do here, too.  The difficulty is, I think, that some
 transition functions use an internal data type for the transition
 state, which might not be a single palloc'd chunk.  But since we can't
 spill those aggregates to disk *anyway*, that doesn't really matter.

 So would it be acceptable to just ignore the memory consumed by
 internal, or come up with some heuristic?

Either one would be OK with me.  I'd probably ignore that for the
first version of the patch and just let the hash table grow without
bound in that case.  Then in a later patch I'd introduce additional
infrastructure to deal with that part of the problem.  But if
something else works out well while coding it up, I'd be OK with that,
too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Memory Accounting v11

2015-07-15 Thread Tomas Vondra

Hi,

On 07/15/2015 09:21 PM, Robert Haas wrote:

On Tue, Jul 14, 2015 at 9:14 PM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:

Firstly, do we really have good benchmarks and measurements? I really doubt
that. We do have some numbers for REINDEX, where we observed 0.5-1%
regression on noisy results from a Power machine (and we've been unable to
reproduce that on x86). I don't think that's a representative benchmark, and
I'd like to see more thorough measurements. And I agreed to do this, once
Jeff comes up with a new version of the patch.

Secondly, the question is whether the performance is impacted more by the
additional instructions, or by other things - say, random padding, as was
explained by Andrew Gierth in [1].

I don't know whether that's happening in this patch, but if it is,
it seems rather strange to use this against this patch and not the
others (because there surely will be other patches causing similar
issues).


It matters, at least to me, an awful lot where we're adding lines of
code. If you want to add modest amounts of additional code to CREATE
TABLE or CHECKPOINT or something like that, I really don't care,
because that stuff doesn't execute frequently enough to really
matter to performance even if we add a significant chunk of overhead,
and we're doing other expensive stuff at the same time, like fsync.
The same can't be said about functions like LWLockAcquire and
AllocSetAlloc that routinely show up at the top of CPU profiles.

I agree that there is room to question whether the benchmarks I did
are sufficient reason to think that the abstract concern that putting
more code into a function might make it slower translates into a
measurable drop in performance in practice. But I think when it comes
to these very hot code paths, extreme conservatism is warranted. We
can agree to disagree about that.


No, that is not what I tried to say. I certainly agree that we need to 
pay attention when adding stuff to hot paths - there's no disagreement 
about this.


The problem with random padding is that adding code somewhere may 
introduce padding that affects other pieces of code. That is essentially 
the point of Andrew's explanation that I linked in my previous message.


And the question is - are the differences we've measured really due to 
code added to the hot path, or something introduced by random padding 
due to some other changes (possibly in code that was not even executed 
during the test)?


I don't know how significant an impact this may have in this case, or how 
serious this is in general, but we're talking about 0.5-1% difference on 
a noisy benchmark. And if such cases of random padding really are a 
problem in practice, we've certainly missed plenty of other patches with 
the same impact.


Because effectively what Jeff's last patch did was add a single int64 
counter to the MemoryContextData structure, and increment it for each 
allocated block (not chunk). I can't really imagine a solution making it 
cheaper, because there are very few faster operations. Even opt-in 
won't make this much faster, because you'll have to check a flag and 
you'll need two fields (flag + counter).


Of course, this assumes a local counter (i.e. not updating counters in the 
parent contexts), but I claim that's OK. I've been previously pushing 
for eagerly updating all the parent contexts, so that finding out the 
allocated memory is fast even when there are many child contexts - a 
prime example was array_agg() before I fixed it. But I changed my mind 
on this and now say screw them. I claim that aggregates using a 
separate memory context for each group are a lost cause - they already 
suffer a significant overhead, and should be fixed just like we fixed 
array_agg().


That was effectively the outcome of pgcon discussions - to use the 
simple int64 counter, do the accounting for all contexts, and update 
only the local counter. For cases with many child contexts finding out 
the amount of allocated memory won't be cheap, but well - there's not 
much we can do about that.
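
Just to illustrate the trade-off (a sketch, not code from the patch; 
mem_allocated is the v11 field name, and firstchild/nextchild are the 
existing MemoryContextData links), with only local counters a query has 
to walk all the children:

static int64
MemoryContextMemAllocatedRecurse(MemoryContext context)
{
	MemoryContext child;
	int64		total = context->mem_allocated;

	/* with only local counters, a query has to visit every child */
	for (child = context->firstchild; child != NULL; child = child->nextchild)
		total += MemoryContextMemAllocatedRecurse(child);

	return total;
}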



I know there was a proposal to force all aggregates to use regular
data types as aggregate stats, but I can't see how that could work
without a significant performance penalty. For example array_agg()
is using internal to pass ArrayBuildState - how do you turn that to
a regular data type without effectively serializing/deserializing
the whole array on every transition?


That is a good point. Tom suggested that his new expanded-format
stuff might be able to be adapted to the purpose, but I have no idea
 how possible that really is.


Me neither, and maybe there's a good solution for that, making my 
concerns unfounded.



What might be possible is extending the aggregate API with another
custom function returning size of the aggregate state. So when
defining an aggregate using 'internal' for aggregate state, you'd
specify transfunc, finalfunc and sizefunc. That seems doable, I
guess.


And infunc and outfunc.  If we don't use 

Re: [HACKERS] Memory Accounting v11

2015-07-15 Thread Tomas Vondra

Hi,

On 07/15/2015 07:07 PM, Jeff Davis wrote:

On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:



I think a heuristic would be more suited here and ignoring memory
consumption for internal types means that we are not making the
memory accounting useful for a set of use cases.


OK, I will drop this patch and proceed with the HashAgg patch, with
a heuristic for internal types.


Could someone briefly explain what heuristics we are talking about?


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Memory Accounting v11

2015-07-14 Thread Robert Haas
On Sat, Jul 11, 2015 at 2:28 AM, Jeff Davis pg...@j-davis.com wrote:
 After talking with a few people at PGCon, small noisy differences in CPU
 timings can appear for almost any tweak to the code, and aren't
 necessarily cause for major concern.

I agree with that in general, but the concern is a lot bigger when the
function is something that is called everywhere and accounts for a
measurable percentage of our total CPU usage on almost any workload.
If memory allocation got slower because, say, you added some code to
regexp.c and it caused AllocSetAlloc to split a cache line where it
hadn't previously, I wouldn't be worried about that; the next patch,
like as not, will buy the cost back again.  But here you really are
adding code to a hot path.

tuplesort.c does its own accounting, and TBH that seems like the right
thing to do here, too.  The difficulty is, I think, that some
transition functions use an internal data type for the transition
state, which might not be a single palloc'd chunk.  But since we can't
spill those aggregates to disk *anyway*, that doesn't really matter.
If the transition is a varlena or a fixed-length type, we can know how
much space it's consuming without hooking into the memory context
framework.
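
For illustration only, measuring a transition value directly might look 
something like the sketch below. The function and parameter names are 
assumptions (loosely following nodeAgg.c's per-agg fields), not code from 
any patch in this thread:

static Size
transition_value_space(Datum transValue, bool transValueIsNull,
					   bool transtypeByVal, int16 transtypeLen)
{
	if (transValueIsNull || transtypeByVal)
		return 0;				/* stored inline; nothing extra to count */

	if (transtypeLen == -1)
		return VARSIZE_ANY(DatumGetPointer(transValue));	/* varlena */

	return (Size) transtypeLen;	/* fixed-length, passed by reference */
}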

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Memory Accounting v11

2015-07-14 Thread Tomas Vondra

Hi,

On 07/14/2015 10:19 PM, Robert Haas wrote:

On Sat, Jul 11, 2015 at 2:28 AM, Jeff Davis pg...@j-davis.com wrote:

After talking with a few people at PGCon, small noisy differences
in CPU timings can appear for almost any tweak to the code, and
aren't necessarily cause for major concern.


I agree with that in general, but the concern is a lot bigger when the
function is something that is called everywhere and accounts for a
measurable percentage of our total CPU usage on almost any workload.
If memory allocation got slower because, say, you added some code to
regexp.c and it caused AllocSetAlloc to split a cache line where it
hadn't previously, I wouldn't be worried about that; the next patch,
like as not, will buy the cost back again.  But here you really are
adding code to a hot path.


I don't think Jeff was suggesting that we should ignore changes measurably 
affecting performance - I'm one of those he discussed this patch with at 
pgcon, and I can assure you impact on performance was one of the main 
topics of the discussion.


Firstly, do we really have good benchmarks and measurements? I really 
doubt that. We do have some numbers for REINDEX, where we observed 
0.5-1% regression on noisy results from a Power machine (and we've been 
unable to reproduce that on x86). I don't think that's a representative 
benchmark, and I'd like to see more thorough measurements. And I agreed 
to do this, once Jeff comes up with a new version of the patch.


Secondly, the question is whether the performance is impacted more by 
the additional instructions, or by other things - say, random padding, 
as was explained by Andrew Gierth in [1].


I don't know whether that's happening in this patch, but if it is, it 
seems rather strange to use this against this patch and not the others 
(because there surely will be other patches causing similar issues).


[1] 
http://www.postgresql.org/message-id/87vbitb2zp@news-spur.riddles.org.uk



tuplesort.c does its own accounting, and TBH that seems like the right
thing to do here, too.  The difficulty is, I think, that some
transition functions use an internal data type for the transition
state, which might not be a single palloc'd chunk.  But since we can't
spill those aggregates to disk *anyway*, that doesn't really matter.
If the transition is a varlena or a fixed-length type, we can know how
much space it's consuming without hooking into the memory context
framework.


I respectfully disagree. Our current inability to dump/load the state 
has little to do with how we measure consumed memory, IMHO.


It's true that we do have two kinds of aggregates, depending on the 
nature of the aggregate state:


(a) fixed-size state (data types passed by value, variable length types
that do not grow once allocated, ...)

(b) continuously growing state (as in string_agg/array_agg)

Jeff's HashAgg patch already fixes (a) and can fix (b) once we get a 
solution for dump/load of the aggregate states - which we need to 
implement anyway for parallel aggregate.


I know there was a proposal to force all aggregates to use regular data 
types as aggregate states, but I can't see how that could work without a 
significant performance penalty. For example array_agg() is using 
internal to pass ArrayBuildState - how do you turn that to a regular 
data type without effectively serializing/deserializing the whole array 
on every transition?


And even if we come up with a solution for array_agg, do we really 
believe it's possible to do for all custom aggregates? Maybe I'm missing 
something but I doubt that. ISTM designing ephemeral data structure 
allows tweaks that are impossible otherwise.


What might be possible is extending the aggregate API with another 
custom function returning size of the aggregate state. So when defining 
an aggregate using 'internal' for aggregate state, you'd specify 
transfunc, finalfunc and sizefunc. That seems doable, I guess.
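
As a sketch of that idea: a hypothetical size function for array_agg(), 
whose transition state is an ArrayBuildState passed as 'internal'. 
Nothing like this exists in core; the function name and the notion of 
registering it alongside transfunc/finalfunc are illustration only:

Datum
array_agg_state_size(PG_FUNCTION_ARGS)
{
	ArrayBuildState *astate = (ArrayBuildState *) PG_GETARG_POINTER(0);
	int64		size;

	if (astate == NULL)
		PG_RETURN_INT64(0);

	size = sizeof(ArrayBuildState);
	/* the arrays of element Datums and null flags */
	size += astate->alen * (sizeof(Datum) + sizeof(bool));
	/* by-reference element values themselves are not walked in this sketch */

	PG_RETURN_INT64(size);
}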


I find the memory accounting as a way more elegant solution, though.

kind regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Memory Accounting v11

2015-07-11 Thread Jeff Davis
On Thu, 2015-07-09 at 14:41 +1200, David Rowley wrote:

 Are you going to implement this? Or are you happy that the single-level
 context tracking is the right thing?
 I'm going to mark the patch as waiting on author for now.

Attached a version of the patch that does multi-level tracking, v12. It
does this in a simpler way, just like an earlier version of the patch:
it simply traverses up the chain and adds to each parent in a
total_allocated field.

The idea you mentioned is a possible optimization of this idea, but it
only makes sense if I'm able to detect a difference between v11
(single-level) and v12 (multi-level). I tried Robert's test[1] again and
I didn't see a difference on my workstation (technically, v12 came out
the fastest, which means I'm just seeing noise anyway), so I can't
evaluate whether your idea will improve things.

After talking with a few people at PGCon, small noisy differences in CPU
timings can appear for almost any tweak to the code, and aren't
necessarily cause for major concern.

Regards,
Jeff Davis

[1] pgbench -i -s 300, then do the following 3 times each for master,
v11, and v12, and take the median of logged traces:

start server; set trace_sort=on; reindex index pgbench_accounts_pkey;
stop server

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 242,247  typedef struct AllocChunkData
--- 242,249 
  #define AllocChunkGetPointer(chk)	\
  	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
  
+ static void update_alloc(MemoryContext context, int64 size);
+ 
  /*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
***
*** 500,505  AllocSetContextCreate(MemoryContext parent,
--- 502,510 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
+ 
+ 		update_alloc((MemoryContext) set, blksize);
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 595,602 
  		else
  		{
  			/* Normal case, release the block */
+ 			update_alloc(context, -(block->endptr - ((char*) block)));
+ 
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 635,640  AllocSetDelete(MemoryContext context)
--- 642,654 
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
+ 
+ 		/*
+ 		 * Parent is already unlinked from this context, so we can't use
+ 		 * update_alloc(). The caller should have already subtracted from the
+ 		 * parents' total_allocated.
+ 		 */
+ 
  		free(block);
  		block = next;
  	}
***
*** 672,677  AllocSetAlloc(MemoryContext context, Size size)
--- 686,694 
  		block = (AllocBlock) malloc(blksize);
  		if (block == NULL)
  			return NULL;
+ 
+ 		update_alloc(context, blksize);
+ 
  		block->aset = set;
  		block->freeptr = block->endptr = ((char *) block) + blksize;
  
***
*** 861,866  AllocSetAlloc(MemoryContext context, Size size)
--- 878,885 
  		if (block == NULL)
  			return NULL;
  
+ 		update_alloc(context, blksize);
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***
*** 964,969  AllocSetFree(MemoryContext context, void *pointer)
--- 983,991 
  			set->blocks = block->next;
  		else
  			prevblock->next = block->next;
+ 
+ 		update_alloc(context, -(block->endptr - ((char*) block)));
+ 
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 1077,1082  AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1099,1105 
  		AllocBlock	prevblock = NULL;
  		Size		chksize;
  		Size		blksize;
+ 		Size		oldblksize;
  
  		while (block != NULL)
  		{
***
*** 1094,1102  AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1117,1130 
  		/* Do the realloc */
  		chksize = MAXALIGN(size);
  		blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ 		oldblksize = block->endptr - ((char *)block);
+ 
  		block = (AllocBlock) realloc(block, blksize);
  		if (block == NULL)
  			return NULL;
+ 
+ 		update_alloc(context, blksize - oldblksize);
+ 
  		block->freeptr = block->endptr = ((char *) block) + blksize;
  
  		/* Update pointers since block has likely been moved */
***
*** 1361,1363  AllocSetCheck(MemoryContext context)
--- 1389,1407 
  }
  
  #endif   /* MEMORY_CONTEXT_CHECKING */
+ 
+ /*
+  * Update self_allocated and total_allocated for the context. Size can be
+  * positive or negative.
+  */
+ void
+ update_alloc(MemoryContext context, int64 size)
+ {
+ 	MemoryContext	parent;
+ 
+ 	context->self_allocated += size;
+ 
+ 	/* update total_allocated for self and all parents */
+ 	for (parent = context; parent != NULL; 

Re: [HACKERS] Memory Accounting v11

2015-07-08 Thread David Rowley
On 7 July 2015 at 20:23, David Rowley david.row...@2ndquadrant.com wrote:

 On 7 July 2015 at 18:59, Jeff Davis pg...@j-davis.com wrote:



  One other thing that I don't remember seeing discussed would be to
  just store a List in each context which would store all of the
  mem_allocated's that need to be updated during each allocation and
  deallocation of memory. The list would be setup once when the context
  is created. When a child context is setup we'd just loop up the parent
  context chain and list_concat() all of the parent's lists onto the
  child's lists. Would this method add too much overhead when a context
  is deleted? Has this been explored already?
 
 That's a good idea, as it would be faster than recursing. Also, the
 number of parents can't change, so we can just make an array, and that
 would be quite fast to update. Unless I'm missing something, this sounds
 like a nice solution. It would require more space in MemoryContextData,
 but that might not be a problem.


 Oh an array is even better, but why more space? won't it just be a pointer
 to an array? It can be NULL if there's nothing to track. Sounds like it
 would be the same amount of space, at least on a 64 bit machine.


I'd imagine that the first element of the array could just store the length
of it. The type might be slightly wider for what you need for an array
length, but this'll save having an extra field in the struct for it.

Are you going to implement this? Or are you happy that the single-level
context tracking is the right thing?
I'm going to mark the patch as waiting on author for now.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Memory Accounting v11

2015-07-07 Thread Jeff Davis
On Tue, 2015-07-07 at 16:49 +1200, David Rowley wrote:

 of performance decrease anywhere. I'm just getting too much variation
 in the test results to get any sort of idea.
 
That was my experience as well. Thank you for taking a look.

 My main question here is: How sure are you that none of your intended
 use cases will need to call MemoryContextMemAllocated with recurse =
 true? I'd imagine if you ever have to
 use MemoryContextMemAllocated(ctx, true) then we'll all be sitting
 around with our stopwatches out to see if the call incurs too much of
 a performance penalty.

I don't expect HashAgg to be using many memory contexts. It did with
array_agg, but that's been changed, and was a bad pattern anyway (one
context per group is quite wasteful).

If it turns out to be slow, we can call it less frequently (e.g.
extrapolate growth rate to determine when to check).
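
Something like the following sketch, just to illustrate; everything named 
aggstate->... and hashagg_spill_to_disk() is hypothetical, and only 
MemoryContextMemAllocated() comes from the patch being discussed:

	if (++aggstate->groups_since_check >= aggstate->check_interval)
	{
		int64		mem = MemoryContextMemAllocated(aggstate->hashcontext, true);
		int64		limit = work_mem * 1024L;

		if (mem >= limit)
			hashagg_spill_to_disk(aggstate);	/* hypothetical */
		else
		{
			/* extrapolate per-group growth; re-check when we could hit the limit */
			int64		per_group = Max(mem / Max(aggstate->numGroups, 1), 1);

			aggstate->check_interval = (limit - mem) / per_group;
		}
		aggstate->groups_since_check = 0;
	}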

 
 Shouldn't you be setting oldblksize after the if? I'd imagine the
 realloc() failure is rare, but it just seems better to do it just
 before it's required and only when required.

I don't think that's safe. Realloc can return an entirely new pointer,
so the endptr would be pointing outside of the allocation. I'd have to
hang on to the original pointer before the realloc, so I'd need an extra
variable anyway.


 
 Here there's a double assignment of child.

Will fix.

 
 I might be mistaken here, but can't you just set context->mem_allocated
 = 0; after that loop? 
 Or maybe it would be an improvement to only do the decrement
 if MEMORY_CONTEXT_CHECKING is defined, then Assert() that
 mem_allocated == 0 after the loop...
 
OK. Why not just always Assert that?
 
 One other thing that I don't remember seeing discussed would be to
 just store a List in each context which would store all of the
 mem_allocated's that need to be updated during each allocation and
 deallocation of memory. The list would be setup once when the context
 is created. When a child context is setup we'd just loop up the parent
 context chain and list_concat() all of the parent's lists onto the
 child's lists. Would this method add too much overhead when a context
 is deleted? Has this been explored already?
 
That's a good idea, as it would be faster than recursing. Also, the
number of parents can't change, so we can just make an array, and that
would be quite fast to update. Unless I'm missing something, this sounds
like a nice solution. It would require more space in MemoryContextData,
but that might not be a problem.
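
A sketch of what that might look like (all names invented for 
illustration): each context keeps an array of pointers to the counters it 
must update - its own and those of its ancestors - so an allocation bumps 
them without recursing:

typedef struct MemoryAccountingTargets
{
	int			ntargets;
	int64	   *targets[FLEXIBLE_ARRAY_MEMBER];	/* &mem_allocated of self and ancestors */
} MemoryAccountingTargets;

static inline void
accounting_add(MemoryAccountingTargets *acct, int64 delta)
{
	int			i;

	for (i = 0; i < acct->ntargets; i++)
		*acct->targets[i] += delta;
}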

Regards,
Jeff Davis







Re: [HACKERS] Memory Accounting v11

2015-07-07 Thread David Rowley
On 7 July 2015 at 18:59, Jeff Davis pg...@j-davis.com wrote:

 On Tue, 2015-07-07 at 16:49 +1200, David Rowley wrote:
  I might be mistaken here, but can't you just set context->mem_allocated
  = 0; after that loop?
  Or maybe it would be an improvement to only do the decrement
  if MEMORY_CONTEXT_CHECKING is defined, then Assert() that
  mem_allocated == 0 after the loop...
 
 OK. Why not just always Assert that?
 


Well, I thought the per loop decrement of the mem_allocated was wasteful,
as shouldn't it always get to zero after the final loop anyway?
If so, for efficiency it would be better to zero it after the loop, but if
we do that then we can't assert for zero.


  One other thing that I don't remember seeing discussed would be to
  just store a List in each context which would store all of the
  mem_allocated's that need to be updated during each allocation and
  deallocation of memory. The list would be setup once when the context
  is created. When a child context is setup we'd just loop up the parent
  context chain and list_concat() all of the parent's lists onto the
  child's lists. Would this method add too much overhead when a context
  is deleted? Has this been explored already?
 
 That's a good idea, as it would be faster than recursing. Also, the
 number of parents can't change, so we can just make an array, and that
 would be quite fast to update. Unless I'm missing something, this sounds
 like a nice solution. It would require more space in MemoryContextData,
 but that might not be a problem.


Oh an array is even better, but why more space? won't it just be a pointer
to an array? It can be NULL if there's nothing to track. Sounds like it
would be the same amount of space, at least on a 64 bit machine.

One thing to keep in mind is that I think if a parent is tracking memory,
then a child will also need to track memory too, as when the child context
is deleted, we'll need to decrement the parent's mem_allocated by that of
all its child contexts.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Memory Accounting v11

2015-07-07 Thread Andres Freund
On 2015-07-07 16:49:58 +1200, David Rowley wrote:
 I've been looking at this patch and trying to reproduce the reported
 slowdown by using Tomas' function to try to exercise palloc() with minimal
 overhead of other code:
 
 https://github.com/tvondra/palloc_bench

That's not necessarily meaningful. Increased cache footprint (both data
and branch prediction) is often only noticeable with additional code
running pushing stuff out.





Re: [HACKERS] Memory Accounting v11

2015-07-06 Thread David Rowley
On 15 June 2015 at 07:43, Jeff Davis pg...@j-davis.com wrote:

 This patch tracks memory usage (at the block level) for all memory
 contexts. Individual palloc()s aren't tracked; only the new blocks
 allocated to the memory context with malloc().

 It also adds a new function, MemoryContextMemAllocated() which can
 either retrieve the total allocated for the context, or it can recurse
 and sum up the allocations for all subcontexts as well.

 This is intended to be used by HashAgg in an upcoming patch that will
 bound the memory and spill to disk.

 Previous discussion here:

 http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop

 Previous concerns:

 * There was a slowdown reported of around 1-3% (depending on the exact
 version of the patch) on an IBM power machine when doing an index
 rebuild. The results were fairly noisy for me, but it seemed to hold up.
 See http://www.postgresql.org/message-id/CA
 +Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajp3cro9db...@mail.gmail.com
 * Adding a struct field to MemoryContextData may be bad for the CPU
 caching behavior, and may be the cause of the above slowdown.


I've been looking at this patch and trying to reproduce the reported
slowdown by using Tomas' function to try to exercise palloc() with minimal
overhead of other code:

https://github.com/tvondra/palloc_bench

I also wanted to attempt to determine if the slowdown was due to the
mem_allocated maintenance or if it was down to the struct size increasing.
I decided to test this simply by commenting out all of
the context->mem_allocated += / -= and just keeping the mem_allocated field in
the struct, but I've been really struggling to see any sort of performance
decrease anywhere. I'm just getting too much variation in the test results
to get any sort of idea.

I've attached a spreadsheet of the results I saw. It would be really good
if Robert was able to test with the IBM PPC just with the extra field in
the struct so we can see if the 1-3% lies there, or if it's in overhead of
keeping mem_allocated up-to-date.

My main question here is: How sure are you that none of your intended use
cases will need to call MemoryContextMemAllocated with recurse = true? I'd
imagine if you ever have to use MemoryContextMemAllocated(ctx, true) then
we'll all be sitting around with our stopwatches out to see if the call
incurs too much of a performance penalty.

Other small issues:


+ oldblksize = block->endptr - ((char *)block);
+
  block = (AllocBlock) realloc(block, blksize);
  if (block == NULL)
  return NULL;
+
+ context->mem_allocated += blksize - oldblksize;
+

Shouldn't you be setting oldblksize after the if? I'd imagine the
realloc() failure is rare, but it just seems better to do it just before
it's required and only when required.

+ if (recurse)
+ {
+ MemoryContext child = context->firstchild;
+ for (child = context->firstchild;
+  child != NULL;

Here there's a double assignment of child.

***
*** 632,637  AllocSetDelete(MemoryContext context)
--- 637,644 
  {
  AllocBlock next = block->next;

+ context->mem_allocated -= block->endptr - ((char *) block);
+

I might be mistaken here, but can't you just set context->mem_allocated = 0;
after that loop?
Or maybe it would be an improvement to only do the decrement
if MEMORY_CONTEXT_CHECKING is defined, then Assert() that mem_allocated ==
0 after the loop...


One other thing that I don't remember seeing discussed would be to just
store a List in each context which would store all of the mem_allocated's
that need to be updated during each allocation and deallocation of memory.
The list would be setup once when the context is created. When a child
context is setup we'd just loop up the parent context chain and
list_concat() all of the parent's lists onto the child's lists. Would this
method add too much overhead when a context is deleted? Has this been
explored already?

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


mem_accounting_benchmark.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Memory Accounting v11

2015-07-02 Thread Simon Riggs
On 14 June 2015 at 23:51, Tomas Vondra tomas.von...@2ndquadrant.com wrote:


 The current state, where HashAgg just blows up the memory, is just not
 reasonable, and we need to track the memory to fix that problem.


 Meh. HashAgg could track its memory usage without loading the entire
 system with a penalty.


 +1 to a solution like that, although I don't think that's doable without
 digging the info from memory contexts somehow.


Jeff is right, we desperately need a solution and this is the place to
start.

Tom's concern remains valid: we must not load the entire system with a
penalty.


The only questions I have are:

* If the memory allocations adapt to the usage pattern, then we expect to
see few memory chunk allocations. Why are we expecting the entire system
to experience a penalty?

* If we do not manage our resources, how are we certain this does not
induce a penalty? Not tracking memory could be worse than tracking it.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Memory Accounting v11

2015-07-02 Thread CK Tan
On 14 June 2015 at 23:51, Tomas Vondra tomas.von...@2ndquadrant.com wrote:


 The current state, where HashAgg just blows up the memory, is just not
 reasonable, and we need to track the memory to fix that problem.


 Meh. HashAgg could track its memory usage without loading the entire
 system with a penalty.


 +1 to a solution like that, although I don't think that's doable without
 digging the info from memory contexts somehow.


I am sorry to ask questions unrelated to the subject, but how is tracking
memory going to fix the HashAgg blow up problem? Is there a plan to make
HashAgg not blow up (i.e. spill the hash table)?

Thanks,
-cktan



On Thu, Jul 2, 2015 at 4:19 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On 14 June 2015 at 23:51, Tomas Vondra tomas.von...@2ndquadrant.com
 wrote:


 The current state, where HashAgg just blows up the memory, is just not
 reasonable, and we need to track the memory to fix that problem.


 Meh. HashAgg could track its memory usage without loading the entire
 system with a penalty.


 +1 to a solution like that, although I don't think that's doable without
 digging the info from memory contexts somehow.


 Jeff is right, we desperately need a solution and this is the place to
 start.

 Tom's concern remains valid: we must not load the entire system with a
 penalty.


 The only questions I have are:

 * If the memory allocations adapt to the usage pattern, then we expect to
 see few memory chunk allocations. Why are we expecting the entire system
 to experience a penalty?

 * If we do not manage our resources, how are we certain this does not
 induce a penalty? Not tracking memory could be worse than tracking it.

 --
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Memory Accounting v11

2015-07-02 Thread Qingqing Zhou
On Thu, Jul 2, 2015 at 11:37 AM, CK Tan ck...@vitessedata.com wrote:

 I am sorry to ask questions unrelated to the subject, but how is tracking
 memory going to fix the HashAgg blow up problem? Is there a plan to make
 HashAgg not blow up (i.e. spill the hash table)?


Your question is probably answered here:
http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop

Regards,
Qingqing




Re: [HACKERS] Memory Accounting v11

2015-06-14 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 This patch tracks memory usage (at the block level) for all memory
 contexts. Individual palloc()s aren't tracked; only the new blocks
 allocated to the memory context with malloc().
 ...
 My general answer to the performance concerns is that they aren't a
 reason to block this patch, unless someone has a suggestion about how to
 fix them. Adding one field to a struct and a few arithmetic operations
 for each malloc() or realloc() seems reasonable to me.

Maybe, but this here is a pretty weak argument:

 The current state, where HashAgg just blows up the memory, is just not
 reasonable, and we need to track the memory to fix that problem.

Meh.  HashAgg could track its memory usage without loading the entire
system with a penalty.  Moreover, this is about fourth or fifth down the
list of the implementation problems with spilling hash aggregation to
disk.  It would be good to see credible solutions for the bigger issues
before we buy into adding overhead for a mechanism with no uses in core.

 Others have also mentioned that we might want to use this mechanism to
 track memory for other operators, like Sort or HashJoin, which might be
 simpler and more accurate.

That would imply redefining the meaning of work_mem for those operations;
it would suddenly include a lot more overhead space than it used to,
causing them to spill to disk much more quickly than before, unless one
changes the work_mem setting to compensate.  Not sure that people would
like such a change.

An even bigger problem is that it would pretty much break the logic around
LACKMEM() in tuplesort.c, which supposes that spilling individual tuples
to disk is a reliable and linear way to decrease the accounted-for memory.
Individual pfree's would typically not decrease the accounted total in
this implementation.  Depending on what you have in mind for the
spill-to-disk behavior, this issue could be a fail for HashAgg as well,
which is another reason not to commit this patch in advance of seeing the
use-case.
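
For reference, tuplesort.c's accounting is roughly the following 
(simplified from the real macros in src/backend/utils/sort/tuplesort.c); 
the point is that each tuple's chunk is charged on the way in and credited 
when it is written out, which is what makes LACKMEM() a reliable signal:

#define USEMEM(state,amt)	((state)->availMem -= (amt))
#define FREEMEM(state,amt)	((state)->availMem += (amt))
#define LACKMEM(state)		((state)->availMem < 0)

	/* when a tuple is copied into the sort: */
	USEMEM(state, GetMemoryChunkSpace(tuple));
	/* ... and when it is dumped out to a tape: */
	FREEMEM(state, GetMemoryChunkSpace(tuple));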

regards, tom lane




Re: [HACKERS] Memory Accounting v11

2015-06-14 Thread Tomas Vondra

Hi,

On 06/14/15 22:21, Tom Lane wrote:

Jeff Davis pg...@j-davis.com writes:

This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().
...
My general answer to the performance concerns is that they aren't a
reason to block this patch, unless someone has a suggestion about how to
fix them. Adding one field to a struct and a few arithmetic operations
for each malloc() or realloc() seems reasonable to me.


Maybe, but this here is a pretty weak argument:


The current state, where HashAgg just blows up the memory, is just not
reasonable, and we need to track the memory to fix that problem.


Meh. HashAgg could track its memory usage without loading the entire
system with a penalty.


+1 to a solution like that, although I don't think that's doable without 
digging the info from memory contexts somehow.


 Moreover, this is about fourth or fifth down the

list of the implementation problems with spilling hash aggregation to
disk.  It would be good to see credible solutions for the bigger issues
before we buy into adding overhead for a mechanism with no uses in core.


I don't think so. If we don't have an acceptable solution for tracking 
memory in hashagg, there's not really much point in doing any of the 
following steps. That's why Jeff is tackling this problem first.


Also, Jeff already posted a PoC of the memory-bounded hashagg, although 
it only worked for aggregates with fixed-size state, and not that great 
for cases like array_agg where the state grows. Maybe the improvements 
to aggregate functions proposed by David Rowley last week might help in 
those cases, as that'd allow spilling those states to disk.


So while the approach proposed by Jeff may turn out not to be the right 
one, I think memory accounting needs to be solved first (even if it's 
not committed until the whole feature is ready).



Others have also mentioned that we might want to use this mechanism
to track memory for other operators, like Sort or HashJoin, which
might be simpler and more accurate.


That would imply redefining the meaning of work_mem for those
operations; it would suddenly include a lot more overhead space than
it used to, causing them to spill to disk much more quickly than
before, unless one changes the work_mem setting to compensate. Not
sure that people would like such a change.


Don't we tweak the work_mem meaning over time anyway? Maybe not to such 
extent, but for example the hashjoin 9.5 improvements certainly change 
this by packing the tuples more densely, sizing the buckets differently 
etc. It's true the changes are in the other direction (i.e. batching 
less frequently) though.


OTOH this would make the accounting more accurate (e.g. eliminating 
differences between cases with different amounts of overhead because of 
different tuple width, for example). Wouldn't that be a good thing, even 
if people need to tweak the work_mem? Isn't that kinda acceptable when 
upgrading to the next major version?



An even bigger problem is that it would pretty much break the logic
around LACKMEM() in tuplesort.c, which supposes that spilling
individual tuples to disk is a reliable and linear way to decrease
the accounted-for memory. Individual pfree's would typically not
decrease the accounted total in this implementation. Depending on
what you have in mind for the spill-to-disk behavior, this issue
could be a fail for HashAgg as well, which is another reason not to
commit this patch in advance of seeing the use-case.


That's true, but I think the plan was always to wait for the whole 
feature, and only then commit all the pieces.



--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Memory Accounting v11

2015-06-14 Thread Tomas Vondra

Hi,

On 06/14/15 21:43, Jeff Davis wrote:

This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().


I see it's adding the new field as int64 - wouldn't a Size be more 
appropriate, considering that's what we use in mctx.h and aset.c?



It also adds a new function, MemoryContextMemAllocated() which can
either retrieve the total allocated for the context, or it can
recurse and sum up the allocations for all subcontexts as well.

This is intended to be used by HashAgg in an upcoming patch that will
bound the memory and spill to disk.

Previous discussion here:

http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop

Previous concerns:

* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See http://www.postgresql.org/message-id/CA
+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajp3cro9db...@mail.gmail.com
* Adding a struct field to MemoryContextData may be bad for the CPU
caching behavior, and may be the cause of the above slowdown.



* Previous versions of the patch updated the parent contexts'
allocations as well, which avoided the need to recurse when querying
the total allocation. That seemed to penalize cases that didn't need
to track the allocation. We discussed trying to opt-in to this
behavior, but it seemed more awkward than helpful. Now, the patch
only updates the allocation of the current context, and querying
means recursing through the child contexts.


I don't think the opt-in idea itself was awkward, it was rather about 
the particular APIs that we came up with, especially when combined with 
the 'context inheritance' thing.


I still think the opt-in approach and updating accounting for the parent 
contexts was the best one, because it (a) minimizes impact in cases that 
don't use the accounting, and (b) makes finding the current amount of 
memory cheap ...



* There was a concern that, if MemoryContextMemAllocated needs to
recurse to the child contexts, it will be too slow for HashAgg of
array_agg, which creates a child context per group. That was solved with
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1


I wouldn't say this was solved - we have fixed one particular example 
of such aggregate implementation, because it was causing OOM issues with 
many groups, but there may be other custom aggregates using the same 
pattern.


Granted, built-in aggregates are probably more critical than aggregates 
provided by extensions, but I wouldn't dare to mark this solved ...



My general answer to the performance concerns is that they aren't a
reason to block this patch, unless someone has a suggestion about how
to fix them. Adding one field to a struct and a few arithmetic
operations for each malloc() or realloc() seems reasonable to me.


I'm not buying this, sorry. While I agree that we should not expect the 
memory accounting to be entirely free, we should be very careful about 
the overhead especially if we're dropping the opt-in and thus imposing 
the overhead on everyone.


But the "performance concerns are not a reason to block this patch" 
approach seems wrong. With any other patch a 3% regression would be 
considered a serious issue IMNSHO.



The current state, where HashAgg just blows up the memory, is just
not reasonable, and we need to track the memory to fix that problem.
Others have also mentioned that we might want to use this mechanism
to track memory for other operators, like Sort or HashJoin, which
might be simpler and more accurate.


Dropping the existing memory accounting implementations and keeping just 
this new solution would be nice, but only if we agree the performance 
impact is acceptable. We already have an accounting solution for each of 
those places, so I don't think the unification alone outweighs the regression.


regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Memory Accounting v11

2015-06-14 Thread Jeff Davis
On Sun, 2015-06-14 at 16:21 -0400, Tom Lane wrote:
 Meh.  HashAgg could track its memory usage without loading the entire
 system with a penalty.

When I tried doing it outside of the MemoryContexts, it seemed to get
awkward quickly. I'm open to suggestion if you can point me in the right
direction. Maybe I can peek at the sizes of chunks holding state values
and group keys, and combine that with the hash table size-estimating
code?
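
Roughly something like the sketch below (names follow nodeAgg.c only 
loosely; treat the details as assumptions, and the per-group term would of 
course be summed over all groups):

	Size		used;

	/* estimated hash table and per-group overhead */
	used = numGroups * hash_agg_entry_size(numAggs);

	/* plus the actual chunk space of each by-reference transition value */
	if (!peragg->transtypeByVal && !pergroup->transValueIsNull)
		used += GetMemoryChunkSpace(DatumGetPointer(pergroup->transValue));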

   Moreover, this is about fourth or fifth down the
 list of the implementation problems with spilling hash aggregation to
 disk.  It would be good to see credible solutions for the bigger issues
 before we buy into adding overhead for a mechanism with no uses in core.

I had several iterations of a full implementation of the spill-to-disk
HashAgg patch[1]. Tomas Vondra has some constructive review comments,
but all of them seemed solvable. What do you see as a major unsolved
issue?

If I recall, you were concerned about things like array_agg, where an
individual state could get larger than work_mem. That's a valid concern,
but it's not the problem I was trying to solve.

Regards,
Jeff Davis

[1]
http://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop


