On Fri, Jan 24, 2014 at 12:49 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> But this code is riddled with places where you track a catalog xmin
>> and a data xmin separately.  The only point of doing it that way is to
>> support a division that hasn't been made yet.
> If you think it will make stuff more manageable I can try separating all
> lines dealing with catalog_xmin into another patch as long as data_xmin
> doesn't have to be renamed.
> That said, I don't really think it's a big problem that the division
> hasn't been made, essentially the meaning is different, even if we don't
> take advantage of it yet. data_xmin is there for streaming replication
> where we need to prevent all removal, catalog_xmin is there for
> changeset extraction.

I spent some more time studying the 0001 and 0002 patches this
afternoon, with a side dish of 0004.  I'm leaning toward thinking we
should go ahead and make that division.  I'm also wondering about
whether we've got the right naming here.  AFAICT, it's not the case
that we're going to use the "catalog xmin" for catalogs and the "data
xmin" for non-catalogs.  Rather, the "catalog xmin" is going to always
be included in globalxmin calculations, so IOW it should just be
called "xmin".  The "data xmin" is going to be included only for
non-catalog tables.  I guess "data" is a reasonable antonym for
catalog, but I'm slightly tempted to propose
RecentGlobalNonCatalogXmin and similar.  Maybe that's too ugly to
live, but I can see someone failing to guess what the exact
distinction is between "xmin" and "data xmin", and I bet they'd be a
lot less likely to misguess if we wrote "non catalog".

It's interesting (though not immediately relevant) to speculate about
how we might extend this to fine-grained xmin tracking more generally.
 The design sketch that comes to mind (and I think parts of this have
been proposed before) is to have a means by which backends can promise
not to lock any more tables except under a new snapshot.  At the read
committed isolation level, or in any single-statement transaction,
backends can so promise whenever (a) all tables mentioned in the query
have been locked and (b) all functions to be invoked during the query
via the fmgr interface promise (e.g. via function labeling) that they
won't directly or indirectly do such a thing.  If they break their
promise, we detect it and ereport(ERROR).  Backends that have made
such a guarantee can be ignored for global-xmin calculations that
don't involve the tables they have locked.  One idea is to keep a hash
table keyed by <dboid, reloid> with some limited number of entries in
shared memory; it caches the table-specific xmin, a usage counter, and
a flag indicating whether the cached xmin might be stale.  In order to
promise not to lock any new tables, backends must make or update
entries for all the tables they already have locked in this hash
table; if there aren't enough entries, they're not allowed to promise.
 Thus, backends wishing to prune can use the cached xmin value if it's
present (optionally updating it if it's stale) and the minimum of the
xmins of the backends that haven't made a promise if it isn't.  This
is a bit hairy though; access to the shared hash table had better be
*really* fast, and we'd better not need to recompute the cached value
too often.

Anyway, whether we end up pursuing something like that or not, I think
I'm persuaded that this particular optimization won't really be a
problem for hypothetical future work in this area; and also that it's
a good idea to do this much now specifically for logical decoding.

>> I have zero confidence that it's OK to treat fsync() as an operation
>> that won't fail.  Linux documents EIO as a plausible error return, for
>> example.  (And really, how could it not?)
> But quite fundamentally having a the most basic persistency building
> block fail is something you can't really handle safely. Note that
> issue_xlog_fsync() has always (and I wager, will always) treated that as
> a PANIC.
> I don't recall many complaints about that for WAL. All of the ones I
> found in a quick search were like "oh, the fs invalidated my fd because
> of corruption". And few.

Well, you have a point.  And certainly this version looks much better
than the previous version in terms of the likelihood of PANIC
occurring in practice.  But I wonder if we couldn't cut it down even
further without too much effort.  Suppose we define a slot to exist
if, and only if, the state file exists.  A directory without a state
file is subject to arbitrary removal.  Then we can proceed as follows:

- mkdir() the directory.
- open() state.tmp
- write() state.tmp
- close() state.tmp
- fsync() parent directory, directory and state.tmp
- rename() state.tmp to state
- fsync() state

If any step except the last one fails, no problem.  The next startup
can nuke the leftovers; any future attempt to create a slot with the
name can ignore an EEXIST failure from mkdir() and procedure just as
above.  Only a failure of the very last fsync is a PANIC.   In some
ways I think this'd be simpler than what you've got now, because we'd
eliminate the dance with renaming the directory as well as the state
file; only the state file matters.

To drop a slot, just unlink the state file and fsync() the directory.
If the unlink fails, it's just an error.  If the fsync() fails, it's a
PANIC.  Once the state file is gone, removing everything else is only
an ERROR, and you don't even need to fsync() it again.

To update a slot, open, write, close, and fsync state.tmp, then rename
it to state and fsync() again.  None of these steps need PANIC; hold
off on updating the values in memory until they're all done.  If any
step fails, the attempt to update the slot fails, but either memory
and disk are still consistent, or the disk has an xmin newer than
memory, but still legal.  On restart, when restoring slots, fsync()
each state file, dying horribly if we can't, and remove any
directories that don't contain one.

As compared with what you have here, this eliminates the risk of PANIC
entirely for slot updates, which is good because those will be quite
frequent.  For creating or dropping a slot, it doesn't quite eliminate
the risk entirely but only one fsync() call per create or drop is at
risk.  We still risk startup time failures, but that's unavoidable
anyway if the data we need can't be read; the chances of blowing up a
running system are very low.

Looking over patch 0002, I see that there's code to allow a walsender
to create or drop a physical replication slot.  Also, if we've
acquired a replication slot, there's code to update it, and code to
make sure we disconnect from it, but there's no code to acquire it.  I
think maybe the hunk in StartReplication() is supposed to be calling
ReplicationSlotAcquire() instead of ReplicationSlotRelease(), which
(ahem) makes one wonder how thoroughly this code has been tested.
There's also no provision for walsender (or pg_receivexlog?) to send
the new SLOT option to walreceiver, which seems somewhat necessary.
I'm tempted to suggest also adding something to src/bin/scripts to
create and drop slots, though I suppose we could just recommend psql
-c 'CREATE_REPLICATION_SLOT SLOT zippy PHYSICAL' 'replication=true'.

(BTW, isn't that kind of a strange syntax, with the word SLOT
appearing twice?  I think we could drop the second one.)

It also occurred to me that we need documentation for all of this; I
see that's in patch 0010, but I haven't reviewed it in detail yet.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to