Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

Andres Freund Sat, 16 Jun 2012 04:44:21 -0700

Hi Robert,

On Friday, June 15, 2012 10:03:38 PM Robert Haas wrote:
> On Thu, Jun 14, 2012 at 4:13 PM, Andres Freund <and...@2ndquadrant.com>
> wrote:
> > I don't plan to throw in loads of conflict resolution smarts. The aim is
> > to get to the place where all the infrastructure is there so that a MM
> > solution can be built by basically plugging in a conflict resolution
> > mechanism. Maybe providing a very simple one.
> > I think without in-core support its really, really hard to build a
> > sensible MM implementation. Which doesn't mean it has to live entirely
> > in core.
> 
> Of course, several people have already done it, perhaps most notably
> Bucardo.
Bucardo certainly is nice, but it's not usable for many things purely from an
overhead perspective.


> Anyway, it would be good to get opinions from more people here.  I am
> sure I am not the only person with an opinion on the appropriateness
> of trying to build a multi-master replication solution in core or,
> indeed, the only person with an opinion on any of these other issues.
> It is not good for those other opinions to be saved for a later date.
Agreed.

> > Hm. Yes, you could do that. But I have to say I don't really see a point.
> > Maybe the fact that I do envision multimaster systems at some point is
> > clouding my judgement though as its far less easy in that case.
> Why?  I don't think that particularly changes anything.
Because it makes conflict detection very hard. I also don't think it's a
feature worth supporting. What's the use-case of updating records you cannot
properly identify?

> > It also complicates the wal format as you now need to specify whether you
> > transport a full or a primary-key only tuple...
> Why?  If the schemas are in sync, the target knows what the PK is
> perfectly well.  If not, you're probably in trouble anyway.
True. There already was the wish (from Kevin) of having the option of
transporting full before/after images anyway, so the WAL format might want to
be able to represent that.
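
As a rough illustration of a change record that can carry either a
primary-key-only or a full "before" image, here is a minimal sketch. All names
(OldImage, ChangeRecord, make_update) are hypothetical and are not taken from
the patch or from the actual WAL format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class OldImage(Enum):
    """What the record carries about the row's previous state (hypothetical)."""
    NONE = 0       # e.g. an insert: there is no previous row
    KEY_ONLY = 1   # only the primary-key columns of the old row
    FULL = 2       # the complete old row


@dataclass
class ChangeRecord:
    table: str
    old_kind: OldImage
    old: Optional[Dict[str, object]]
    new: Optional[Dict[str, object]]


def make_update(table: str, key_columns: List[str],
                before: Dict[str, object], after: Dict[str, object],
                full_before: bool = False) -> ChangeRecord:
    """Build an update record, keeping the whole old row only on request."""
    if full_before:
        return ChangeRecord(table, OldImage.FULL, dict(before), dict(after))
    key = {col: before[col] for col in key_columns}
    return ChangeRecord(table, OldImage.KEY_ONLY, key, dict(after))
```

The explicit old_kind flag is the point: the consumer does not have to guess
from the column count whether it received a key or a full tuple.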

> > I think though that we do not want to enforce that mode of operation for
> > tightly coupled instances. For those I was thinking of using command
> > triggers to synchronize the catalogs.
> > One of the big screwups of the current replication solutions is exactly
> > that you cannot sensibly do DDL which is not a big problem if you have a
> > huge system with loads of different databases and very knowledgeable
> > people et al. but at the beginning it really sucks. I have no problem
> > with making one of the nodes the "schema master" in that case.
> > Also I would like to avoid the overhead of the proxy instance for
> > use-cases where you really want one node replicated as fully as possible
> > with the slight exception of being able to have summing tables,
> > different indexes et al.
> In my view, a logical replication solution is precisely one in which
> the catalogs don't need to be in sync.  If the catalogs have to be in
> sync, it's not logical replication.  ISTM that what you're talking
> about is sort of a hybrid between physical replication (pages) and
> logical replication (tuples) - you want to ship around raw binary
> tuple data, but not entire pages.
Ok, that's a valid point. Simon argued at the cluster summit that everything
that's not physical is logical. Which has some appeal because it seems hard to
agree on what exactly logical rep is. So definition by exclusion kind of makes
sense ;)

I think what you categorized as "hybrid logical/physical" rep solves an
important use-case that's very hard to solve at the moment. Before my
2ndQuadrant days I had several clients who had huge problems using the
trigger-based solutions because their overhead simply was too big a burden on
the master. They couldn't use SR either because every consuming database kept
loads of local data.
I think such scenarios are getting more and more common.

> The problem with that is it's going to be tough to make robust.  Users could
> easily end up with answers that are total nonsense, or probably even crash
> the server.
Why?

> To step back and talk about DDL more generally, you've mentioned a few
> times the idea of using an SR instance that has been filtered down to
> just the system catalogs as a means of generating logical change
> records.  However, as things stand today, there's no reason to suppose
> that replicating anything less than the entire cluster is sufficient.
> For example, you can't translate enum labels to strings without access
> to the pg_enum catalog, which would be there, because enums are
> built-in types.  But someone could supply a similar user-defined type
> that uses a user-defined table to do those lookups, and now you've got
> a problem.  I think this is a contractual problem, not a technical
> one.  From the point of view of logical replication, it would be nice
> if type output functions were basically guaranteed to look at nothing
> but the datum they get passed as an argument, or at the very least
> nothing other than the system catalogs, but there is no such
> guarantee.  And, without such a guarantee, I don't believe that we can
> create a high-performance, robust, in-core replication solution.
I don't think that's a valid argument. Any such solution existing today fails
to work properly with dump/restore and such because it implies dependencies
that those tools do not know about. The "internal" tables will possibly be
restored later than the tables using them. So your data format *has* to deal
with loading/outputting data without such tables anyway.

Some of that can be ameliorated using extensions + configuration tables, but
even then you have to be *very* careful and plan your backup/restore
procedures much more carefully than you otherwise would.

> Now, the nice thing about being the people who make PostgreSQL happen
> is we get to decide what the C code that people load into the server
> is required to guarantee; we can change the rules.  Before, types were
> allowed to do X, but now they're not.  Unfortunately, in this case, I
> don't really find that an acceptable solution.  First, it might break
> code that has worked with PostgreSQL for many years; but worse, it
> won't break it in any obvious way, but rather only if you're using
> logical replication, which will doubtless cause people to attribute
> the failure to logical replication rather than to their own code.
As I said above, anybody using code like that has to be aware of the problem
anyway. Should there really be real cases of that, marking configuration tables
in the catalog as to-be-shared would be a relatively uncomplicated solution.

> Even if they do understand that we imposed a rule-change from on high,
> there's no really good workaround: an enum type is a good example of
> something that you *can't* implement without a side-table.
Enums are a good example of an intrusive feature which breaks features we have
come to rely on (transactional DDL) and which could not be implemented outside
of core pg.

And yes, you obviously can implement it without needing a side-table for
output: just as a string which is checked during input.
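
A minimal sketch of that idea, assuming a hypothetical LabelType class (this is
not how PostgreSQL's enum type is implemented): the set of allowed labels is
only consulted on input, and the stored datum is the label itself, so output
needs no lookup at all.

```python
class LabelType:
    """Enum-like type whose stored value is the label string (hypothetical)."""

    def __init__(self, labels):
        # The allowed labels are needed for validation on input only.
        self.labels = frozenset(labels)

    def input(self, text: str) -> str:
        """Validate the label; the returned datum is the label itself."""
        if text not in self.labels:
            raise ValueError(f"invalid label: {text!r}")
        return text

    @staticmethod
    def output(datum: str) -> str:
        """Output looks at nothing but the datum it is passed."""
        return datum


mood = LabelType(["sad", "ok", "happy"])
stored = mood.input("ok")
text = LabelType.output(stored)
```

The price is that every row stores the full label rather than a small
identifier, and the type gets plain string ordering instead of a declared
label order.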

> Second, it flies in the face of our often-stated desire to make the server
> extensible.
While I generally see that desire as something worthwhile, I don't see it being
violated in this case. The amount of extensibility you're removing here is
minimal in my opinion, and solutions for it aren't that hard. Doing so (aka
marking those tables as some form of secondary system tables) would make that
code actually reliable.

Sorry, I remain highly unconvinced of the above argumentation.

> Also, even given the existence of such a restriction, you
> still need to run any output function that relies on catalogs with
> catalog contents that match what existed at the time that WAL was
> generated, and under the correct snapshot, which is not trivial.
> These things are problems even for other things that we might need to
> do while examining the WAL stream, but they're particularly acute for
> any application that wants to run type-output functions to generate
> something that can be sent to a server which doesn't necessarily
> having matching catalog contents.
I don't think it's that hard. And as you say, we need to solve it anyway.

> But it strikes me that these things, really, are only a problem for a
> minority of data types.  For text, or int4, or float8, or even
> timestamptz, we don't need *any catalog contents at all* to
> reconstruct the tuple data.  Knowing the correct type alignment and
> which C function to call is entirely sufficient.  So maybe instead of
> trying to cobble together a set of catalog contents that we can use
> for decoding any tuple whatsoever, we should instead divide the world
> into well-behaved types and poorly-behaved types. Well-behaved types
> are those that can be interpreted without the catalogs, provided that
> you know what type it is.  Poorly-behaved types (records, enums) are
> those where you can't.  For well-behaved types, we only need a small
> amount of additional information in WAL to identify which types we're
> trying to decode (not the type OID, which might fail in the presence
> of nasty catalog hacks, but something more universal, like a UUID that
> means "this is text", or something that identifies the C entrypoint).
This would essentially double the size of the WAL even for rows containing
only simple types, and would mean we would additionally run quite a bit of
non-trivial code in relatively critical parts of the code. Both are, imnsho,
unacceptable.
You could reduce the space overhead by adding that information only the
first time after a table has changed (and then regularly after a checkpoint or
so), but doing so seems to be introducing too much complexity.
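
To make the bookkeeping this would need concrete, here is a minimal sketch. It
assumes a hypothetical DescriptorTracker that remembers, per table, which
schema version it last described; none of these names exist in PostgreSQL or
in the patch.

```python
class DescriptorTracker:
    """Emit a type descriptor only when the reader cannot already know it."""

    def __init__(self):
        # table name -> schema version described since the last checkpoint
        self._described = {}

    def records_for(self, table, schema_version, row):
        """Return the records to log for one row change."""
        out = []
        if self._described.get(table) != schema_version:
            # First change since a checkpoint, or the table definition changed.
            out.append(("describe", table, schema_version))
            self._described[table] = schema_version
        out.append(("row", table, row))
        return out

    def checkpoint(self):
        """After a checkpoint, every table must be described again."""
        self._described.clear()
```

Even in this toy form the writer has to carry state across records and reset
it at every checkpoint, which is the kind of extra complexity in question.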


> And then maybe we handle poorly-behaved types by pushing some of the
> work into the foreground task that's generating the WAL: in the worst
> case, the process logs a record before each insert/update/delete
> containing the text representation of any values that are going to be
> hard to decode.  In some cases (e.g. records all of whose constituent
> fields are well-behaved types) we could instead log enough additional
> information about the type to permit blind decoding.
I think this is prohibitively expensive from a development, runtime, space and
maintenance standpoint.
For databases using things where decoding is rather expensive (e.g. postgis)
you wouldn't really improve much over the old trigger-based solutions. It's a
return to "log everything twice".

Sorry if I seem pigheaded here, but I fail to see why all that would buy us
anything but loads of complexity while losing many potential advantages.

Greetings,

Andres
-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
