On 11/20/2013 09:58 PM, Robert Haas wrote:
> On Wed, Nov 20, 2013 at 8:32 AM, Heikki Linnakangas
> <hlinnakan...@vmware.com> wrote:
>> How many allocations? What size will they typically have, minimum and
>> maximum?

> The facility is intended to be general, so the answer could vary
> widely by application.  The testing that I have done so far suggests
> that for message-passing, relatively small queue sizes (a few kB,
> perhaps 1 MB at the outside) should be sufficient.  However,
> applications such as parallel sort could require vast amounts of
> shared memory.  Consider a machine with 1TB of memory performing a
> 512GB internal sort.  You're going to need 512GB of shared memory for
> that.

Hmm. Those two use cases are quite different. For message-passing, you want a lot of small queues, but for parallel sort, you want one huge allocation. I wonder if we shouldn't even try a one-size-fits-all solution.

For message-passing, there isn't much need to even use dynamic shared memory. You could just assign one fixed-sized, single-reader multiple-writer queue for each backend.
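
To make that concrete, such a queue could be little more than a fixed ring buffer allocated per backend in the main shared memory segment at postmaster start. A rough sketch, with purely illustrative names, sizes and locking:

/*
 * Illustrative only: one fixed-size, single-reader multiple-writer queue
 * per backend, living in the main shared memory segment.  Names and sizes
 * are made up; in reality the lock would be an slock_t or LWLock and the
 * array would be sized from MaxBackends.
 */
#include <stdint.h>

#define PER_BACKEND_QUEUE_SIZE  (64 * 1024)   /* fixed at postmaster start */

typedef struct BackendMessageQueue
{
    int         writer_lock;    /* stand-in for slock_t; serializes writers */
    uint64_t    bytes_written;  /* advanced by writers, under the lock */
    uint64_t    bytes_read;     /* advanced only by the owning backend */
    char        data[PER_BACKEND_QUEUE_SIZE];  /* ring buffer */
} BackendMessageQueue;

typedef struct MessageQueueArray
{
    int                  nqueues;    /* == MaxBackends */
    BackendMessageQueue  queues[1];  /* really [MaxBackends] */
} MessageQueueArray;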

For parallel sort, you'll want to utilize all the available memory and all CPUs for one huge sort. So all you really need is a single huge shared memory segment. If one process is already using that 512GB segment to do a sort, you do *not* want to allocate a second 512GB segment. You'll want to wait for the first operation to finish first. Or maybe you'll want to have 3-4 somewhat smaller segments in use at the same time, but not more than that.

>> * As discussed in the "Something fishy happening on frogmouth" thread, I
>> don't like the fact that the dynamic shared memory segments will be
>> permanently leaked if you kill -9 postmaster and destroy the data directory.

> Your test elicited different behavior for the dsm code vs. the main
> shared memory segment because it involved running a new postmaster
> with a different data directory but the same port number on the same
> machine, and expecting that that new - and completely unrelated -
> postmaster would clean up the resources left behind by the old,
> now-destroyed cluster.  I tend to view that as a defect in your test
> case more than anything else, but as I suggested previously, we could
> potentially change the code to use something like 1000000 + (port *
> 100) with a forward search for the control segment identifier, instead
> of using a state file, mimicking the behavior of the main shared
> memory segment.  I'm not sure we ever reached consensus on whether
> that was overall better than what we have now.
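
(For concreteness, I read that as something like the sketch below: start from a port-derived base key and bump it until shmget() hands us a fresh segment. Error handling and reclaiming of orphaned segments are omitted, and the function name is made up.)

#include <errno.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Sketch of the "base key + forward search" idea, mimicking the way the
 * main shared memory segment picks its key.  A real implementation would
 * also probe existing segments and reclaim orphaned ones rather than just
 * skipping over them.
 */
static int
create_control_segment(int port, size_t size, key_t *key_used)
{
    key_t   key = 1000000 + (port * 100);
    int     shmid;

    for (;;)
    {
        shmid = shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);
        if (shmid >= 0)
            break;              /* fresh segment at this key */
        if (errno != EEXIST)
            return -1;          /* genuine failure, e.g. ENOSPC */
        key++;                  /* key in use, keep searching forward */
    }

    *key_used = key;
    return shmid;
}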

I really think we need to do something about it. To use your earlier example of parallel sort, it's not acceptable to permanently leak a 512 GB segment on a system with 1 TB of RAM.

One idea is to create the shared memory object with shm_open(), and wait until all the worker processes that need it have attached to it. Then, shm_unlink() it, before using it for anything. That way the segment will be automatically released once all the processes close() it, or die. In particular, kill -9 will release it. (This is a variant of my earlier idea to create a small number of anonymous shared memory file descriptors in postmaster startup with shm_open(), and pass them down to child processes with fork().)

I think you could use that approach with SysV shared memory as well, by destroying the segment with shmctl(IPC_RMID) immediately after all processes have attached to it.
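
Roughly like this, for the POSIX case (just a sketch; the segment name is made up and the "wait for workers" step is hand-waved, it would go through the main shared memory segment):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sketch of the unlink-after-attach idea.  The kernel keeps the memory
 * alive for as long as some process has the segment open or mapped; once
 * the last one unmaps or dies -- kill -9 included -- it is reclaimed.
 */
void *
create_and_unlink_segment(const char *name, size_t size)
{
    int     fd;
    void   *addr;

    /* Leader creates and sizes the segment. */
    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t) size) < 0)
        return NULL;
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        return NULL;

    /*
     * ... wait here until every worker that needs the segment has done
     * its own shm_open() + mmap() on the same name ...
     */

    /* Now remove the name; nothing is leaked no matter how we die from here on. */
    shm_unlink(name);

    return addr;
}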

- Heikki

