Hi,

On 08/27/2010 10:46 PM, Robert Haas wrote:
What other subsystems are you imagining servicing with a dynamic
allocator?  If there were a big demand for this functionality, we
probably would have been forced to implement it already, but that's
not the case.  We've already discussed the fact that there are massive
problems with using it for something like shared_buffers, which is by
far the largest consumer of shared memory.

Understood. I certainly plan to look into that for a better understanding of the problems those pose for dynamically allocated memory.

I think it would be great if we could bring some more flexibility to
our memory management.  There are really two layers of problems here.

Full ACK.

One is resizing the segment itself, and one is resizing structures
within the segment.  As far as I can tell, there is no portable API
that can be used to resize the shm itself.  For so long as that
remains the case, I am of the opinion that any meaningful resizing of
the objects within the shm is basically unworkable.  So we need to
solve that problem first.

Why should resizing of the objects within the shmem be unworkable? Don't my patches prove the exact opposite? Being able to resize "objects" within the shm requires some kind of underlying dynamic allocation. And I'd rather be in control of that allocator than have to deal with two dozen different implementations on different OSes and their libraries.

There are a couple of possible solutions, which have been discussed
here in the past.

I currently don't have much interest in dynamic resizing. Being able to resize the overall amount of shared memory on the fly would be nice, sure. But the total amount of RAM in a server changes rather infrequently. Being able to use what's available more efficiently is what I'm interested in. That doesn't need any kind of additional or different OS level support. It's just a matter of making better use of what's available - within Postgres itself.

Next, we have to think about how we're going to resize data structures
within this expandable shm.

Okay, that's where I'm getting interested.

Many of these structures are not things
that we can easily move without bringing the system to a halt.  For
example, it's difficult to see how you could change the base address
of shared buffers without ceasing all system activity, at which point
there's not really much advantage over just forcing a restart.
Similarly with LWLocks or the ProcArray.

I guess that's what Bruce wanted to point out by saying our data structures are mostly "contiguous". I.e. not dynamic lists or hash tables, but plain, simple arrays.

Maybe that's a subjective impression, but I seem to hear complaints about their fixed size and inflexibility quite often. Try to imagine the flexibility that dynamic lists could give us.

And if you can't move them,
then how will you grow them if (as will likely be the case) there's
something immediately following them in memory?  One possible solution
is to divide up these data structures into "slabs".  For example, we
might imagine allocating shared_buffers in 1GB chunks.

Why 1GB, and why add yet another layer of dynamic allocation within that? The buffers are (by default) 8K, so allocate in chunks of 8K, or a tiny bit more to cover the book-keeping stuff.

To make this
work, we'd need to change the memory layout so that each chunk would
include all of the miscellaneous stuff that we need to do bookkeeping
for that chunk, such as the LWLocks and buffer descriptors.  That
doesn't seem completely impossible, but there would be some
performance penalty, because you could no longer index into shared
buffers from a single base offset.

AFAICT we currently have four fixed-size blocks to manage shared buffers: the buffer blocks themselves, the buffer descriptors, the strategy status (for the freelist) and the buffer lookup table.

It's not obvious to me why these data structures should perform better than a dynamically allocated layout. One could rather argue that combining (some of) the bookkeeping stuff with the data itself would lead to better locality and thus perform better.
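Roughly what I have in mind, as a minimal sketch (the struct layout and shmem_dynalloc() are made up for illustration; this is not how buf_internals.h looks today):

    /* one self-contained buffer chunk: bookkeeping and data side by side */
    typedef struct BufferChunk
    {
        BufferTag   tag;            /* which relation/block this page holds */
        uint16      flags;          /* dirty, valid, ... */
        uint16      usage_count;
        slock_t     hdr_lock;       /* protects the fields above */
        LWLockId    content_lock;   /* shared/exclusive access to the page */
        char        data[BLCKSZ];   /* the 8K page itself */
    } BufferChunk;

    /* allocated one buffer at a time, instead of carving up several
     * separate fixed-size arrays at postmaster start */
    BufferChunk *chunk = (BufferChunk *) shmem_dynalloc(sizeof(BufferChunk));

The descriptor and the page it describes end up right next to each other, which is where I'd expect the locality benefit to come from.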

Instead, you'd need to determine
which chunk contains the buffer you want, look up the base address for
that chunk, and then index into the chunk.  Maybe that overhead
wouldn't be significant (or maybe it would); at any rate, it's not
completely free.  There's also the problem of handling the partial
chunk at the end, especially if that happens to be the only chunk.

This sounds way too complicated, yes. Use 8K chunks and most of the problems vanish.

I think the problems for other arrays are similar, or more severe.  I
can't see, for example, how you could resize the ProcArray using this
approach.

Try not to think in terms of resizing, but of dynamic allocation. I.e. being able to resize the ProcArray (and thus to alter max_connections on the fly) would take a lot more work.

Just using the unoccupied space of the ProcArray for other subsystems that need it more urgently would be much easier. Again, you'd want to allocate a single PGPROC at a time.

(And yes, the benefits aren't as significant as for shared_buffers, simply because PGPROC doesn't occupy that much memory).
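In code, that would be as simple as the following (shmem_dynalloc/shmem_dynfree again being placeholder names for the allocator's API):

    /* at backend start: grab a PGPROC entry on demand */
    PGPROC *proc = (PGPROC *) shmem_dynalloc(sizeof(PGPROC));

    /* at backend exit: hand the memory back, so other subsystems
     * can use it while the slot would otherwise sit unoccupied */
    shmem_dynfree(proc);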

If you want to deallocate a chunk of shared buffers, it's
not impossible to imagine an algorithm for relocating any dirty
buffers in the segment to be deallocated into the remaining available
space, and then chucking the ones that are not dirty.

Please use the dynamic allocator for that. Don't duplicate that again. Those allocators are designed for efficiently allocating small chunks, down to a few bytes.

It might not be
real cheap, but that's not the same thing as not possible.  On the
other hand, changing the backend ID of a process in flight seems
intractable.  Maybe it's not.  Or maybe there is some other approach
to resizing these data structures that can work, but it's not real
clear to me what it is.

Changing to a dynamically allocated memory model certainly requires some thought and lots of work. Yes. It's not for free.

So basically my feeling is that reworking our memory allocation in
general, while possibly worthwhile, is a whole lot of work.

Exactly.

If we
focus on getting imessages done in the most direct fashion possible,
it seems like the sort of things that could get done in six months to
a year.

Well, it works for Postgres-R as it is, so imessages already exists without a single additional month of work. And I don't intend to change it back to something that doesn't use a dynamic allocator. I already ran into too many problems that way, see below.

If we take the approach of reworking our whole approach to
memory allocation first, I think it will take several years.  Assuming
the problems discussed above aren't totally intractable, I'd be in
favor of solving them, because I think we can get some collateral
benefits out of it that would be nice to have.  However, it's
definitely a much larger project.

Agreed.

If the allocations are
per-backend and can be made on the fly, that problem goes away.

That might hold true for imessages, which simply lose importance once the (recipient) backend vanishes. But for other shared memory stuff, that would rather complicate shared memory access.

As long as we keep the shared memory area used for imessages/dynamic
allocation separate from, and independent of, the main shm, we can
still gain many of the same advantages - in particular, not PANICing
if a remap fails, and being able to resize the thing on the fly.

Separate sub-system allocators, separate code, separate bugs, lots more work. Please, no. KISS.

However, I believe that the implementation will be more complex if the
area is not per-backend.  Resizing is almost certainly a necessity in
this case, for the reasons discussed above

I disagree and see the main benefit in making better use of the available resources. Resizing loses much of its importance once you can dynamically adjust the boundaries between the subsystems' use of the single, huge, fixed-size shmem chunk allocated at start.

and that will have to be
done by having all backends unmap and remap the area in a coordinated
fashion,

That's assuming resizing capability.

so it will be more disruptive than unmapping and remapping a
message queue for a single backend, where you only need to worry about
the readers and writers for that particular queue.

And that's assuming a separate allocation method for the imessages sub-system.

Also, you now have
to worry about fragmentation: a simple ring buffer is great if you're
processing messages on a FIFO basis, but when you have multiple
streams of messages with different destinations, it's probably not a
great solution.

Exactly, that's where dynamic allocation shows its real advantages. No silly ring buffers required.

This goes back to my points further up: what else do you think this
could be used for?  I'm much less optimistic about this being reusable
than you are, and I'd like to hear some concrete examples of other use
cases.

Sure, and well understood. I'll take a try at converting shared_buffers.

Well, it's certainly nice, if you can make it work.  I haven't really
thought about all the cases, though.  The main advantages of LWLocks
is that you can take them in either shared or exclusive mode

As mentioned, all accesses to the message queue are writes (both enqueue and dequeue modify it), so shared mode is just unneeded overhead.

and that
you can hold them for more than a handful of instructions.

Neither of the two operations needs more than a handful of instructions, so that's plain overhead as well.

If we're
trying to design a really *simple* system for message passing, LWLocks
might be just right.  Take the lock, read or write the message,
release the lock.

That's exactly how easy it is *with* the dynamic allocator: take the (even simpler) spinlock, enqueue (or dequeue) the message, release the lock again.

No locking is required for writing or reading the message itself. Memory management is handled by independent (and fully multi-process safe) alloc and free routines, which get called *before* writing the message and *after* reading it.

Mixing memory allocation into queue management is a lot more complicated to design and understand, and less efficient.
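To make that concrete, the two sides look roughly like this (simplified sketch; the type and function names are placeholders, not the exact Postgres-R code):

    /* --- sender backend --- */
    IMessage *msg = (IMessage *) shmem_dynalloc(sizeof(IMessage) + payload_size);
    msg->type = msg_type;
    msg->size = payload_size;
    memcpy((char *) msg + sizeof(IMessage), payload, payload_size);

    SpinLockAcquire(&queue->lock);
    queue_append(queue, msg);          /* just links a pointer */
    SpinLockRelease(&queue->lock);
    signal_backend(recipient);         /* wake up the recipient */

    /* --- recipient backend --- */
    SpinLockAcquire(&queue->lock);
    IMessage *msg = queue_remove_head(queue);
    SpinLockRelease(&queue->lock);

    if (msg != NULL)
    {
        process_message(msg);          /* read without holding any lock */
        shmem_dynfree(msg);
    }

The spinlock only ever covers the pointer manipulation; the (potentially large) message body is written and read without holding it.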

But it seems like that's not really the case we're
trying to optimize for, so this may be a dead-end.

You probably need this, but 8KB seems like a pretty small chunk size.

For node-internal messaging, I probably agree. Would need benchmarking, as
it's a compromise between latency and overhead, IMO.

I've chosen 8KB so these messages (together with some GCS and other
transport headers) presumably fit into Ethernet jumbo frames. I'd argue that
you'd want even smaller chunk sizes for 1500 byte MTUs, because I don't
expect the GCS to do a better job at fragmenting than we can do in the
upper layer (i.e. without copying data and w/o additional latency when
reassembling the packet). But again, maybe that should be benchmarked
first.

Yeah, probably.  I think designing something that works efficiently
over a network is a somewhat different problem than designing
something that works on an individual node, and we probably shouldn't
let the designs influence each other too much.

There's no padding or sophisticated allocation needed.  You
just need a pointer to the last byte read (P1), the last byte allowed
to be read (P2), and the last byte allocated (P3).  Writers take a
spinlock, advance P3, release the spinlock, write the message, take
the spinlock, advance P2, release the spinlock, and signal the reader.

That would block parallel writers (i.e. only one process can write to the
queue at any time).
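Spelled out, the writer side of that proposal looks about like this (sketch only; P1/P2/P3 and the ring buffer struct as described above):

    /* writer: reserve space, copy the message, then publish it */
    Size    start;

    SpinLockAcquire(&rb->lock);
    start = rb->P3;                    /* last byte allocated so far */
    rb->P3 += msg_size;                /* reserve msg_size bytes */
    SpinLockRelease(&rb->lock);

    memcpy(rb->base + start, msg, msg_size);    /* copy without the lock */

    SpinLockAcquire(&rb->lock);
    rb->P2 = start + msg_size;         /* unsafe with concurrent writers: an
                                        * earlier reservation might not be
                                        * fully written yet */
    SpinLockRelease(&rb->lock);

The last step is the problem: P2 may only move forward once all reservations before it have been completely written.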

I feel like there's probably some variant of this idea that works
around that problem.  The problem is that when a worker finishes
writing a message, he needs to know whether to advance P2 only over
his own message or also over some subsequent message that has been
fully written in the meantime.  I don't know exactly how to solve that
problem off the top of my head, but it seems like it might be
possible.

Readers take the spinlock, read P1 and P2, release the spinlock, read
the data, take the spinlock, advance P1, and release the spinlock.

It would require copying data in case a process only needs to forward the
message. That's a quick pointer dequeue and enqueue exercise ATM.
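For comparison, forwarding with the dynamic allocator is just (same placeholder names as in the sketch above):

    /* forwarder: move the message to another queue, payload untouched */
    SpinLockAcquire(&in_queue->lock);
    IMessage *msg = queue_remove_head(in_queue);
    SpinLockRelease(&in_queue->lock);

    SpinLockAcquire(&out_queue->lock);
    queue_append(out_queue, msg);      /* same chunk, different queue */
    SpinLockRelease(&out_queue->lock);

With a per-backend ring buffer, the payload would instead have to be copied out of one buffer and into the other.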

If we need to do that, that's a compelling argument for having a
single messaging area rather than one per backend.  But I'm not sure I
see why we would need that sort of capability.  Why wouldn't you just
arrange for the sender to deliver the message directly to the final
recipient?

You might still want to fragment chunks of data to avoid problems if,
say, two writers are streaming data to a single reader.  In that case,
if the messages were too large compared to the amount of buffer space
available, you might get poor utilization, or even starvation.  But I
would think you wouldn't need to worry about that until the message
size got fairly high.

Some of the writers in Postgres-R allocate the chunk for the message in
shared memory way before they send the message. I.e. during a write
operation of a transaction that needs to be replicated, the backend
allocates space for a message at the start of the operation, but only fills
it with change set data during processing. That can possibly take quite a
while.

So, they know in advance how large the message will be but not what
the contents will be?  What are they doing?

I think unicast messaging is really useful and I really want it, but
the requirement that it be done through dynamic shared memory
allocations feels very uncomfortable to me (as you've no doubt
gathered).

Well, I on the other hand am utterly uncomfortable with having a separate
solution for memory allocation per sub-system (and memory allocation definitely
is an inherent problem for lots of our subsystems). Given the ubiquity of dynamic
memory allocators, I don't really understand your discomfort.

Well, the fact that something is commonly used doesn't mean it's right
for us.  Tabula rasa, we might design the whole system differently,
but changing it now is not to be undertaken lightly.  Hopefully the
above comments shed some light on my concerns.  In short, (1) I don't
want to preallocate a big chunk of memory we might not use, (2) I fear
reducing the overall robustness of the system, and (3) I'm uncertain
what other systems would be able to leverage a dynamic allocator of the
sort you propose.


