Re: [HACKERS] [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

Andres Freund Thu, 11 Oct 2012 04:43:16 -0700

On Thursday, October 11, 2012 09:15:47 AM Heikki Linnakangas wrote:
> On 22.09.2012 20:00, Andres Freund wrote:
> > [[basic-schema]]
> > .Architecture Schema
> > ["ditaa"]
> > ------------------------------------------------------------------------------
> >
> >          Traditional Stuff
> >   +---------+---------+---------+---------+----+
> >   | Backend | Backend | Backend | Autovac | ...|
> >   +----+----+---+-----+----+----+----+----+-+--+
> >        +------+ | +--------+         |      |
> >      +-+      | | | +----------------+      |
> >      |        v v v v                       |
> >      |     +------------+                   |
> >      |     | WAL writer |<------------------+
> >      |     +------------+
> >      v       v v v v v       +-------------------+
> >  +--------+ +---------+   +->| Startup/Recovery  |
> >  |{s}     | |{s}      |   |  +-------------------+
> >  |Catalog | |   WAL   |---+->| SR/Hot Standby    |
> >  |        | |         |   |  +-------------------+
> >  +--------+ +---------+   +->| Point in Time     |
> >      ^          |            +-------------------+
> >   ---|----------|--------------------------------
> >      |       New Stuff
> >  +---+          |
> >  |              v            Running separately
> >  | +----------------+  +=-------------------------+
> >  | | Walsender  |   |  |                          |
> >  | |            v   |  |    +-------------------+ |
> >  | +-------------+  |  | +->| Logical Rep.      | |
> >  | |     WAL     |  |  | |  +-------------------+ |
> >  +-|  decoding   |  |  | +->| Multimaster       | |
> >  | +------+------/  |  | |  +-------------------+ |
> >  | |            |   |  | +->| Slony             | |
> >  | |            v   |  | |  +-------------------+ |
> >  | +-------------+  |  | +->| Auditing          | |
> >  | |     TX      |  |  | |  +-------------------+ |
> >  +-| reassembly  |  |  | +->| Mysql/...         | |
> >  | +-------------/  |  | |  +-------------------+ |
> >  | |            |   |  | +->| Custom Solutions  | |
> >  | |            v   |  | |  +-------------------+ |
> >  | +-------------+  |  | +->| Debugging         | |
> >  | |   Output    |  |  | |  +-------------------+ |
> >  +-|   Plugin    |--|--|-+->| Data Recovery     | |
> >    +-------------/  |  |    +-------------------+ |
> >    +----------------+  +--------------------------|
> > ------------------------------------------------------------------------------
> 
> This diagram triggers a pet-peeve of mine: What do all the boxes and
> lines mean? An architecture diagram should always include a key. I find
> that when I am drawing a diagram myself, adding the key clarifies my own
> thinking too.
Hm. Ok.


> This looks like a data-flow diagram, where the arrows indicate the data
> flows between components, and the boxes seem to represent processes. But
> in that case, I think the arrows pointing from the plugins in walsender
> to Catalog are backwards. The catalog information flows from the Catalog
> to walsender, walsender does not write to the catalogs.
The reason I drew it that way is that it actively needs to go back to the
catalog and query it, which is somewhat different from the rest, which could
basically be seen as a unidirectional pipeline.
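
To illustrate why that lookup is needed at all: a WAL record roughly identifies
a relation and carries raw tuple data, so the decoding side has to ask the
catalog for names and column layout before it can emit anything meaningful. A
toy sketch in Python (not PostgreSQL code; the record layout, CATALOG and
decode() are all invented for illustration):

```python
# Toy model, not PostgreSQL code: every name and layout here is hypothetical.

# Stand-in for the system catalog: relation id -> (table name, column names).
CATALOG = {16384: ("accounts", ["id", "balance"])}


def decode(wal_records, catalog):
    """Turn raw WAL-like records into named rows by consulting the catalog."""
    for relid, values in wal_records:
        table, columns = catalog[relid]  # the read that goes back to the catalog
        yield table, dict(zip(columns, values))


# Two raw records for the same relation; values carry no column names.
wal = [(16384, (1, 100)), (16384, (2, 250))]
changes = list(decode(wal, CATALOG))
```

The data itself only flows one way (WAL to output); the catalog is merely read
along the way.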

> Zooming out to look at the big picture, I think the elephant in the room
> with this whole effort is how it fares against trigger-based
> replication. You list a number of disadvantages that trigger-based
> solutions have, compared to the proposed logical replication. Let's take
> a closer look at them:

> > * essentially duplicates the amount of writes (or even more!)
> True.
By now I think it's essentially unfixable.

> > * synchronous replication hard or impossible to implement
> I don't see any explanation it could be implemented in the proposed
> logical replication either.
It's basically the same as it is for synchronous streaming replication. At the
place where SyncRepWaitForLSN() is done you instead/also wait for the decoding
to reach that LSN (it's in the WAL, so everything is decodable) and for the
other side to have confirmed reception of those changes. I think this should be
doable with only minor code modifications.
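
A minimal sketch of that wait condition, in Python rather than C and with all
names hypothetical (ReplicationState and commit_may_return() are not part of
the patch or of PostgreSQL):

```python
from dataclasses import dataclass


@dataclass
class ReplicationState:
    """Hypothetical progress counters, expressed as plain integer LSNs."""
    decoded_lsn: int = 0    # decoding has processed the WAL up to here
    confirmed_lsn: int = 0  # the other side confirmed reception up to here


def commit_may_return(state, commit_lsn):
    """True once the commit at commit_lsn is both decoded and confirmed."""
    return state.decoded_lsn >= commit_lsn and state.confirmed_lsn >= commit_lsn


state = ReplicationState(decoded_lsn=120, confirmed_lsn=90)
blocked = commit_may_return(state, 100)   # remote side has not confirmed yet
state.confirmed_lsn = 120
released = commit_may_return(state, 100)  # both conditions now hold
```

A committing backend would sleep until the condition becomes true instead of
polling it; the sketch only shows what is being waited for.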

The existing support for all that is basically the reason we want to reuse the 
walsender framework. (will start a thread about that soon)

> > * noticeable CPU overhead
> > 
> >   * trigger functions
> >   * text conversion of data
> 
> Well, I'm pretty sure we could find some micro-optimizations for these
> if we put in the effort.
Any improvements there are a good idea independent of this proposal, but I
don't see how we can fundamentally improve on the status quo.

> And the proposed code isn't exactly free, either.
If you don't have frequent DDL it's really not all that expensive. In the
version without DDL support I didn't manage to saturate the ApplyCache with
either parallel COPY in individual transactions (repeated 100MB files) or with
pgbench.
Also, it's basically doing work that the trigger/queue-based solutions have to
do as well, just that they do it via far less optimized SQL statements.

DDL support doesn't really change much, as the overhead for transactions without
DDL and without concurrently running DDL should be fairly minor (the submitted
version is *not* finalized there, it builds a new snapshot instead of
copying/referencing the old one).

> > * complex parts implemented in several solutions
> Not sure what this means, but the proposed code is quite complex too.
It is, agreed.

What I mean is that significantly complex logic is buried in the encoding,
queuing and decoding/ordering logic of every trigger-based replication
solution. That's not a good thing.

> > * not in core
> 
> IMHO that's a good thing, and I'd hope this new logical replication to
> live outside core as well, as much as possible.
I don't agree there, but I would like to keep that a separate discussion.

For now I/we only want to submit the changes that technically need in-core 
support to work sensibly (this, background workers, some walsender 
integration). The goal of working nearly completely without special in-core 
support held the existing solutions back quite a bit imo.


> But whether or not something is in core is just a political decision, not a
> reason to implement something new.
Isn't it both? There are things you simply cannot do unless you're inside core.

Politically I think the external status of all those logical replication 
projects grew to be an adoption barrier. I don't even want to think about how 
many bad home-grown logical replication solutions I have seen out there that 
implement everything from the get-go.

> If the only meaningful advantage is reducing the amount of WAL written,
> I can't help thinking that we should just try to address that in the
> existing solutions, even if it seems "easy to solve at a first glance,
> but a solution not using a normal transactional table for its log/queue
> has to solve a lot of problems", as the document says.
You're welcome to make suggestions, but everything I could think of that didn't
fall short of reality ended up basically duplicating the amount of writes &
fsyncs, even if not going through the WAL.

You need to be crash safe/restartable (=> writes, fsyncs) and you need to
reduce the writes (in memory, => !writes). There is only one authoritative
point where you can rely on a commit to have been successful, and that's when
the commit record has been written to the WAL. You can't send out the data to
be committed before that's written, because that could result in spuriously
committed transactions on the remote side, and you can't easily do it
afterwards because you can crash after the commit.
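
Reading the WAL itself sidesteps that, because the commit record is right there
in the stream. As a toy model in Python (not the patch's ApplyCache; the record
shapes and committed_transactions() are invented for illustration), a reader
can buffer changes per transaction and release them only when it sees the
commit record:

```python
from collections import defaultdict


def committed_transactions(wal):
    """Yield (xid, changes) only once the commit record for xid is seen."""
    pending = defaultdict(list)  # xid -> changes buffered so far
    for kind, xid, payload in wal:
        if kind == "change":
            pending[xid].append(payload)
        elif kind == "commit":
            yield xid, pending.pop(xid, [])
        elif kind == "abort":
            pending.pop(xid, None)  # never sent, so nothing to undo remotely


# Hypothetical WAL contents: (record kind, transaction id, payload).
wal = [
    ("change", 1, "INSERT a"),
    ("change", 2, "INSERT b"),
    ("commit", 2, None),
    ("change", 1, "UPDATE a"),
    ("abort", 1, None),
    ("change", 3, "INSERT c"),  # no commit record yet, so it is held back
]
sent = list(committed_transactions(wal))
```

Only transaction 2 is sent out. After a crash the reader can simply start
reading the WAL again, since nothing uncommitted ever left the node.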

> Sorry to be a naysayer, but I'm pretty scared of all the new code and
> complexity these patches bring into core.
Understandable. I tried to keep the introduction of complexity in existing code 
paths relatively minor and I think I mostly succeeded there but it still needs 
to be maintained.

> PS. I'd love to see a basic Slony plugin for this, for example, to see
> how much extra code on top of the posted patches you need to write in a
> plugin like that to make it functional. I'm worried that it's a lot..
I think before it's possible to do something like that, a few more design
decisions need to be made. Mostly the walsender(ish) integration needs to be
done.

After that I can imagine writing a demo plugin that outputs changes in a
Slony-compatible format, but I would like to see some Slony/Londiste person
cooperating on receiving/applying those.

What complications are you imagining?

Greetings,

Andres
-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


