On 2014-11-17 21:03:07 +0100, Tomas Vondra wrote:
> On 17.11.2014 19:46, Andres Freund wrote:
> > On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote:
> >> On 17.11.2014 18:04, Andres Freund wrote:
> >>> Hi,
> >>>
> >>> On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
> >>>> *** a/src/include/nodes/memnodes.h
> >>>> --- b/src/include/nodes/memnodes.h
> >>>> ***************
> >>>> *** 60,65 **** typedef struct MemoryContextData
> >>>> --- 60,66 ----
> >>>> MemoryContext nextchild; /* next child of same parent */
> >>>> char *name; /* context name (just
> >>>> for debugging) */
> >>>> bool isReset; /* T = no space alloced
> >>>> since last reset */
> >>>> + uint64 mem_allocated; /* track memory allocated for
> >>>> this context */
> >>>> #ifdef USE_ASSERT_CHECKING
> >>>> bool allowInCritSection; /* allow palloc in
> >>>> critical section */
> >>>> #endif
> >>>
> >>> That's quite possibly one culprit for the slowdown. Right now one
> >>> AllocSetContext struct fits precisely into three cachelines. After
> >>> your change it doesn't anymore.
> >>
> >> I'm no PPC64 expert, but I thought the cache lines are 128 B on that
> >> platform, since at least Power6?
> >
> > Hm, might be true.
> >
> >> Also, if I'm counting right, the MemoryContextData structure is 56B
> >> without the 'mem_allocated' field (and without the allowInCritSection),
> >> and 64B with it (at that particular place). sizeof() seems to confirm
> >> that. (But I'm on x86, so maybe the alignment on PPC64 is different?).
> >
> > The MemoryContextData struct is embedded into AllocSetContext.
>
> Oh, right. That makes is slightly more complicated, though, because
> AllocSetContext adds 6 x 8B fields plus an in-line array of freelist
> pointers. Which is 11x8 bytes. So in total 56+56+88=200B, without the
> additional field. There might be some difference because of alignment,
> but I still don't see how that one additional field might impact cachelines?
It's actually 196 bytes:
struct AllocSetContext {
MemoryContextData header; /* 0 56 */
AllocBlock blocks; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
AllocChunk freelist[11]; /* 64 88 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
Size initBlockSize; /* 152 8 */
Size maxBlockSize; /* 160 8 */
Size nextBlockSize; /* 168 8 */
Size allocChunkLimit; /* 176 8 */
AllocBlock keeper; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
/* size: 192, cachelines: 3, members: 8 */
};
And thus one additional field tipps it over the edge.
"pahole" is a very useful utility.
> But if we separated the freelist, that might actually make it faster, at
> least for calls that don't touch the freelist at all, no? Because most
> of the palloc() calls will be handled from the current block.
I seriously doubt it. The additional indirection + additional branches
are likely to make it worse.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers