Hi,

Alvaro Herrera wrote:
Yeah.  For what I need, the launcher just needs to know when a worker
has finished and how many workers there are.

Oh, so it's not all that much less communication. My replication manager also needs to know when a worker dies. You said you are using a signal from the launcher to the postmaster to request a worker to be forked. How do you do the other part, where the postmaster needs to tell the launcher which worker terminated?

For Postgres-R, I'm currently questioning if I shouldn't merge the replication manager process with the postmaster. Of course, that would violate the "postmaster does not touch shared memory" constraint.

I suggest you don't.  Reliability of the postmaster is very important.

Yes, so? As long as I can't restart the replication manager, but the operation of the whole DBMS relies on it, I have to take the postmaster down as soon as it detects a crashed replication manager.

So I still argue that reliability gets better than the status quo if I merge these two processes (because there is less code for communication between the two).

Of course, the other way to gain reliability would be to make the replication manager restartable. But restarting the replication manager means recovering data from other nodes in the cluster, thus a lot of network traffic. Needless to say, this is quite an expensive operation.

That's why I'm questioning whether that's the behavior we want. Isn't it better to force the administrators to look into the issue and probably replace a broken node, instead of having one node run amok by requesting recovery over and over again, possibly forcing crashes of other nodes, too, because of the additional load from recovery?

But it would make some things a lot easier:

 * What if the launcher/manager dies (but you potentially still have
   active workers)?

   Maybe, for autovacuum you can simply restart the launcher and that
   one detects workers from shmem.

   With replication, I certainly have to take down the postmaster as
   well, as we are certainly out of sync and can't simply restart the
   replication manager. So in that case, no postmaster can run without a
   replication manager and vice versa. Why not make it one single
   process, then?

Well, the point of the postmaster is that it can notice when one process
dies and take appropriate action.  When a backend dies, the postmaster
closes all others.  But if the postmaster crashes due to a bug in the
manager (due to both being integrated in a single process), how do you
close the backends?  There's no one to do it.

That's a point.

But again, as long as the replication manager can't be restarted, you gain nothing by closing backends on a crashed node.

In my case, the launcher is not critical.  It can die and the postmaster
should just start a new one without much noise.  A worker is critical
because it's connected to tables; it's as critical as a regular backend.
So if a worker dies, the postmaster must take everyone down and cause a
restart.  This is pretty easy to do.
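
In code, that policy boils down to something like the following sketch of the postmaster's SIGCHLD handling (all names here are placeholders, not the real symbols; the actual logic lives in postmaster.c's reaper() and HandleChildCrash()):

#include <sys/types.h>
#include <sys/wait.h>

/* Placeholders for illustration only, not the real postmaster symbols. */
extern pid_t LauncherPID;               /* launcher pid, 0 if not running */
extern void  RequestCrashRestart(void); /* stands in for HandleChildCrash() */

static void
reaper_sketch(void)
{
    pid_t pid;
    int   exitstatus;

    while ((pid = waitpid(-1, &exitstatus, WNOHANG)) > 0)
    {
        if (pid == LauncherPID)
        {
            /* Launcher is not critical: forget it, ServerLoop will
             * simply fork a new one without much noise. */
            LauncherPID = 0;
            continue;
        }

        /* A worker (or regular backend) was connected to shared memory,
         * so an abnormal exit means: take everyone down and restart. */
        if (WIFSIGNALED(exitstatus) ||
            (WIFEXITED(exitstatus) && WEXITSTATUS(exitstatus) != 0))
            RequestCrashRestart();
    }
}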

Yeah, that's the main difference, and I see why your approach makes perfect sense for the autovacuum case.

In contrast, the replication manager is critical (to one node), and a restart is expensive (for the whole cluster).

 * Startup races: depending on how you start workers, the launcher/
   manager may get a "database is starting up" error when requesting
   the postmaster to fork backends.
   That probably also applies to autovacuum, as those workers shouldn't
   work concurrently with the startup process. But maybe there are other
   means of ensuring that no autovacuum gets triggered during startup?

Oh, this is very easy as well.  In my case the launcher just sets a
database OID to be processed in shared memory, and then calls
SendPostmasterSignal with a particular value.  The postmaster must only
check this signal within ServerLoop, which means it won't act on it
(i.e., won't start a worker) until the startup process has finished.
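
A minimal sketch of that request path, with a placeholder shared-memory struct and a placeholder signal reason (the actual names in the autovacuum code may well differ):

#include "postgres.h"
#include "storage/pmsignal.h"

/* Placeholder shared-memory struct; the real launcher keeps more state. */
typedef struct
{
    Oid av_startingDB;          /* database the launcher wants processed */
} AutoVacShmemSketch;

extern AutoVacShmemSketch *AutoVacShmem;

static void
request_worker_sketch(Oid dbid)
{
    /* Publish the database OID for the postmaster to pick up... */
    AutoVacShmem->av_startingDB = dbid;

    /* ...then poke the postmaster.  It only checks these flags from
     * within ServerLoop, so nothing happens until startup is done. */
    SendPostmasterSignal(PMSIGNAL_START_WORKER);    /* placeholder reason */
}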

It seems like your launcher is perfectly fine with requesting workers and not getting them. The replication manager currently isn't. Maybe I should make it more fault tolerant in that regard...

I guess your problem is that the manager's task is quite a lot more
involved than my launcher's.  But in that case, it's even more important
to have them separate.

More involved with what? It does not touch shared memory; it mainly keeps track of the backends' states (by getting notices from the postmaster) and does all the necessary forwarding of messages between the communication system and the backends. Its main loop is similar to the postmaster's, mainly consisting of a select().
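
Schematically, that loop is no more than the following skeleton (the helper functions are made-up names, just to show the shape):

#include <sys/select.h>

/* Made-up helpers; the real manager has more state and error handling. */
extern int  add_gcs_and_backend_fds(fd_set *set);   /* returns highest fd */
extern int  message_from_gcs(fd_set *set);
extern int  message_from_backend(fd_set *set);
extern void forward_to_backend(void);
extern void forward_to_gcs(void);

static void
manager_main_loop(void)
{
    for (;;)
    {
        fd_set rfds;
        int    maxfd;

        FD_ZERO(&rfds);
        maxfd = add_gcs_and_backend_fds(&rfds);

        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
            continue;                   /* interrupted by a signal, retry */

        if (message_from_gcs(&rfds))
            forward_to_backend();       /* remote change sets coming in */

        if (message_from_backend(&rfds))
            forward_to_gcs();           /* local change sets going out */

        /* ...plus handling the notices from the postmaster about
         * worker startup and termination. */
    }
}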

I don't understand why the manager talks to the postmaster.  If it
doesn't, well, then no concurrency issue goes away, because the remote
backends will be talking to *somebody* anyway; be it the postmaster, or
the manager.

As with your launcher, I only send one message: the worker request. But the other way around, from the postmaster to the replication manager, there are also some messages: a "database is ready" message and a "worker terminated" message. Thinking about handling the restarting cycle, I would need to add a "database is restarting" message, which has to be followed by another "database is ready" message.

For sure, the replication manager needs to keep running during a restarting cycle. And it needs to know the database's state, so as to be able to decide if it can request workers or not.
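
Summed up, the message set between postmaster and replication manager would look roughly like this (names provisional):

typedef enum
{
    RMGR_MSG_WORKER_REQUEST,        /* manager -> postmaster: fork a worker  */
    RMGR_MSG_DATABASE_READY,        /* postmaster -> manager: ready to serve */
    RMGR_MSG_WORKER_TERMINATED,     /* postmaster -> manager: a worker died  */
    RMGR_MSG_DATABASE_RESTARTING    /* postmaster -> manager: crash restart,
                                     * followed by another DATABASE_READY    */
} RmgrMessageType;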

(Maybe your problem is that the manager is not correctly designed.  We
can talk about checking that code.  I happen to know the Postmaster
process handling code because of my previous work with Autovacuum and
because of Mammoth Replicator.)

Thanks for the offer, I'll get back to that.

I think you're underestimating the postmaster's task.

Maybe, but it certainly loses importance within a cluster, since it controls only part of the whole database system.

Ok.  I have one ready, and it works very well.  It only ever starts one
worker -- I have constrained it that way just to keep the current behavior
of a single autovacuum process running at any time.  My plan is to get
it submitted for review, and then start working on having it consider
multiple workers and introduce more scheduling smarts.

Sounds like a good plan.

Thank you for your input. You made me rethink some issues and pointed me to some open questions.

Regards

Markus
