On Tuesday, October 16, 2012 12:13:14 AM Christopher Browne wrote:
> On Mon, Oct 15, 2012 at 4:51 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> > On Monday, October 15, 2012 10:08:28 PM Christopher Browne wrote:
> >> On Mon, Oct 15, 2012 at 3:18 PM, Peter Geoghegan <pe...@2ndquadrant.com> wrote:
> >> > On 15 October 2012 19:19, Bruce Momjian <br...@momjian.us> wrote:
> >> >> I think Robert is right that if Slony can't use the API, it is
> >> >> unlikely any other replication system could use it.
> >> > 
> >> > I don't accept that. Clearly there is a circular dependency, and
> >> > someone has to go first - why should the Slony guys invest in adopting
> >> > this technology if it is going to necessitate using a forked Postgres
> >> > with an uncertain future? That would be (with respect to the Slony
> >> > guys) a commercial risk that is fairly heavily concentrated with
> >> > Afilias.
> >> 
> >> Yep, there's something a bit too circular there.
> >> 
> >> I'd also not be keen on reimplementing the "Slony integration" over
> >> and over if it turns out that the API churns for a while before
> >> stabilizing.  That shouldn't be misread as "I expect horrible amounts
> >> of churn", just that *any* churn comes at a cost.  And if anything
> >> unfortunate happens, that can easily multiply into a multiplicity of
> >> painfulness(es?).
> > 
> > Well, as a crosscheck, could you list your requirements?
> > 
> > Do you need anything more than outputting data in a format compatible to
> > what's stored in sl_log_*? You wouldn't have sl_actionseq; everything
> > else should be there (well, you would need to do lookups to get the
> > tableid, but that's not really much of a problem). The results would be
> > ordered in complete transactions, in commit order.
> 
> Hmm.  We need to have log data that's in a compatible ordering.
> 
> We use sl_actionseq, and can mix data from multiple transactions
> together; if what you're providing is, instead, in order based on
> transaction commit order followed by some sequencing within each
> transaction, then that should be acceptable.

Inside the transaction it's sequenced by the order in which the XLogInsert calls were 
made, which is the order in which the client sent the commands. That sounds like it 
should be compatible.

> The stylized query on sl_log_* looks like...
> 
> select log_origin, log_txid, log_tableid,
> log_actionseq, log_tablenspname,
> log_tablerelname, log_cmdtype,
> log_cmdupdncols, log_cmdargs
> from %s.sl_log_%d
> where log_origin = %d
> 
> How about I "quibble" about each of these:
> 
> a) log_origin - this indicates the node from which the data
> originates.  Presumably, this is implicit in a "chunk" of data that is
> coming in.

I think the output plugin can just reuse whatever method Slony uses today to 
determine the current node's id.

> b) log_txid - indicating the transaction ID.  I presume you've got
> this available.  It's less important with the WAL-based scheme in that
> we'd probably not be using it as a basis for querying as is the case
> today with Slony.

It's directly available. The plugin will have to call txid_out, but that's 
obviously no problem.

> c) log_tableid - indicating the ID of the table.  Are you capturing an
> OID equivalent to this?  Or what?

You get the TupleDesc of the table.

> d) log_actionseq - indicating relative sequences of updates.  You
> don't have this, but if you're capturing commit ordering, we don't
> need it.

Good.

> e) log_tablenspname, log_tablerelname - some small amount of magic
> needful to get this.  Or perhaps you are already capturing it?

The relevant backend functions are available, so it's no problem 
(e.g. RelationGetNamespace(change->new_tuple->table)).
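For illustration, resolving the schema and relation name in the plugin could look 
roughly like this (only a sketch: it assumes the plugin is handed an open Relation, 
and the function name and the idea of appending to a StringInfo are made up):

/* Sketch only: needs utils/rel.h, utils/lsyscache.h and lib/stringinfo.h. */
#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"

static void
emit_table_identity(Relation rel, StringInfo out)
{
    char   *nspname = get_namespace_name(RelationGetNamespace(rel));

    /* roughly log_tablenspname and log_tablerelname */
    appendStringInfo(out, "%s.%s", nspname, RelationGetRelationName(rel));
    pfree(nspname);
}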

> f) log_cmdtype - I/U/D/T - indicating the action
> (insert/update/delete/truncate).  Hopefully you have something like
> this?

Yes:
enum ApplyCacheChangeType
{
        APPLY_CACHE_CHANGE_INSERT,
        APPLY_CACHE_CHANGE_UPDATE,
        APPLY_CACHE_CHANGE_DELETE,
        ...
};
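
For illustration, mapping those onto Slony's one-letter log_cmdtype values would 
just be a switch (a sketch only; the function itself is made up):

/* Sketch: map the change type onto Slony's log_cmdtype letters.
 * The enum values are the ones quoted above; the rest is illustrative. */
static char
change_type_to_cmdtype(enum ApplyCacheChangeType type)
{
    switch (type)
    {
        case APPLY_CACHE_CHANGE_INSERT:
            return 'I';
        case APPLY_CACHE_CHANGE_UPDATE:
            return 'U';
        case APPLY_CACHE_CHANGE_DELETE:
            return 'D';
        default:
            elog(ERROR, "unexpected change type %d", (int) type);
            return 0;           /* keep the compiler quiet */
    }
}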

> g) log_cmdupdncols - for UPDATE action, the number of updated columns.
>  Probably not mandatory; this was a new 2.1 thing...

Hm. No, we don't have that. But then, the output plugin is plain C code, so 
it shouldn't be too hard to compute if needed.
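
To sketch what that could look like, counting the changed columns given both tuple 
versions and the table's TupleDesc might be done roughly like this (names made up; 
written against a recent tree, where TupleDescAttr() exists; needs 
access/htup_details.h and utils/datum.h):

static int
count_updated_columns(TupleDesc desc, HeapTuple oldtup, HeapTuple newtup)
{
    Datum  *oldvals = palloc(desc->natts * sizeof(Datum));
    Datum  *newvals = palloc(desc->natts * sizeof(Datum));
    bool   *oldnulls = palloc(desc->natts * sizeof(bool));
    bool   *newnulls = palloc(desc->natts * sizeof(bool));
    int     changed = 0;
    int     i;

    heap_deform_tuple(oldtup, desc, oldvals, oldnulls);
    heap_deform_tuple(newtup, desc, newvals, newnulls);

    for (i = 0; i < desc->natts; i++)
    {
        Form_pg_attribute att = TupleDescAttr(desc, i);

        if (att->attisdropped)
            continue;
        if (oldnulls[i] != newnulls[i])
            changed++;
        else if (!oldnulls[i] &&
                 !datumIsEqual(oldvals[i], newvals[i],
                               att->attbyval, att->attlen))
            changed++;
    }
    return changed;
}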

> h) log_cmdargs - the actual data needed to do the I/U/D.  The form of
> this matters a fair bit.  Before Slony 2.1, this was a portion of a
> SQL statement, omitting the operation (provided in log_cmdtype) and
> the table name (in log_tablerelname et al).  In Slony 2.1, this
> changes to be a text[] array that essentially consists of pairs of
> [column name, column value] values.

The existing C code that generates this should be copy & pasteable into an output 
plugin with a relatively small amount of changes.
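
To sketch what emitting the [column name, column value] pairs from the new tuple 
could look like (the function name and the tab-separated output format are purely 
illustrative; it assumes the HeapTuple and TupleDesc the plugin gets for each change):

/* Sketch: roughly what Slony 2.1 stores in log_cmdargs.
 * Needs access/htup_details.h, lib/stringinfo.h and utils/lsyscache.h. */
static void
emit_cmdargs(StringInfo out, TupleDesc desc, HeapTuple tuple)
{
    int     i;

    for (i = 0; i < desc->natts; i++)
    {
        Form_pg_attribute att = TupleDescAttr(desc, i);
        bool    isnull;
        Datum   val;

        if (att->attisdropped)
            continue;

        val = heap_getattr(tuple, i + 1, desc, &isnull);
        appendStringInfo(out, "%s\t", NameStr(att->attname));

        if (isnull)
            appendStringInfoString(out, "NULL\t");
        else
        {
            Oid     outfunc;
            bool    isvarlena;
            char   *str;

            getTypeOutputInfo(att->atttypid, &outfunc, &isvarlena);
            str = OidOutputFunctionCall(outfunc, val);
            appendStringInfo(out, "%s\t", str);
            pfree(str);
        }
    }
}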

> I see one place, very notable in Slony 2.2, that would also be more
> complicated, which is the handling of DDL.
> 
> In 2.1 and earlier, we handled DDL as "events", essentially out of
> band.  This isn't actually correct; it could mix very badly if you had
> replication activity mixing with DDL requests.  (More detail than you
> want is in a bug on this...
> <http://www.slony.info/bugzilla/show_bug.cgi?id=137>).
> 
> In Slony 2.2, we added a third "log table" where DDL gets captured.
> sl_log_script has much the same schema as sl_log_{1,2}; it needs to
> get "mixed in" in compatible order.  What I imagine would pointedly
> complicate life is if a single transaction contained both DDL and
> "regular replicable activity."  Slony 2.2 mixes this in using XID +
> log_actionseq; how this would play out with your log capture mechanism
> isn't completely clear to me.  That's the place where I'd expect the
> very messiest interaction.

Hm. I expect some complications here as well. But then, unless you do something 
special, changes to those tables (e.g. sl_log_script) will be streamed out as 
well, so you could just insert the events into their respective tables on the 
receiving side and deal with them there.

> > I guess the other tables would stay as they are as they contain the
> > "added value" of slony?
> 
> A fair bit of Slony is about the "event infrastructure," and some of
> that ceases to be as needful.  The configuration bits probably
> continue to remain interesting.

Quite a bit of the event infrastructure seems to deal with configuration 
changes and such; all of that is probably going to stay...

> The parts that seem notably mysterious to me at the moment are:
> 
> a) How do we pull result sets (e.g. - sl_log_* data)?

The details of this are in a bit of flux right now, but I hope we will nail them 
down soon. You open a replication connection to the primary ('replication=1 
dbname=...') and issue

START_LOGICAL_REPLICATION slony $slot_id 0/DEADBEEF

with 0/DEADBEEF being the location at which you stopped receiving changes last time. 
That will start streaming out changes via the COPY protocol. The content of 
what's streamed out is entirely up to you.

The first time you start replication you need to do:

INIT_LOGICAL_REPLICATION

which will return a $slot_id, a snapshot compatible with SET TRANSACTION SNAPSHOT, 
and the initial starting LSN.

The 'slony' in START_LOGICAL_REPLICATION above denotes which output plugin is 
to be used.
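
Put together, a consumer could look roughly like this with plain libpq (only a 
sketch: the slot id, LSN and connection parameters are placeholders, and whether 
the stream ends up as COPY OUT or COPY BOTH is one of the details still in flux):

/* Sketch of a consumer: open a replication connection, start logical
 * streaming, and read the resulting COPY data. Error handling is minimal. */
#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
    PGconn     *conn;
    PGresult   *res;
    char       *buf;
    int         len;

    conn = PQconnectdb("replication=1 dbname=mydb host=primary");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* 'slony' selects the output plugin; slot id and LSN come from a
     * previous INIT_LOGICAL_REPLICATION / the last confirmed position. */
    res = PQexec(conn, "START_LOGICAL_REPLICATION slony 42 0/DEADBEEF");
    if (PQresultStatus(res) != PGRES_COPY_OUT &&
        PQresultStatus(res) != PGRES_COPY_BOTH)
    {
        fprintf(stderr, "could not start streaming: %s", PQerrorMessage(conn));
        return 1;
    }
    PQclear(res);

    /* Each chunk is whatever the output plugin chose to emit. */
    while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
    {
        fwrite(buf, 1, len, stdout);
        PQfreemem(buf);
    }

    PQfinish(conn);
    return 0;
}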

> b) How is the command data represented?

Command data is the old/new tuple? That's up to the output plugin. You get a 
HeapTuple with the old/new tuple and compatible TupleDescs. You could simply 
stream out your current format, for example.

> c) If we have a need to mix together your 'raw logs' and other
> material (e.g. - our sl_log_script that captures DDL-like changes to
> be mixed back in), how easy|impossible is this?

As described above, in general that seems easy enough. Just insert data into 
e.g. sl_log_script, and when you receive the changes on the other side, decide 
into which table to redirect them.

Where I see a bit of a problem is the handling of replication sets, 
configuration and similar.

Currently there is a dichotomy between 'catalog tables' and 'data tables'. The 
former are not replicated but can be queried in an output plugin (that's the 
timetravel part). The latter are replicated but cannot be queried. All system 
catalog tables are in the 'catalog' category by their nature, but I have played 
with a system column that allows other tables to be treated as catalog tables 
as well.

If you want to filter data on the source - which probably makes sense? - you 
would currently need such an additional catalog table, one that is not replicated 
but can be queried by the output plugin. But I guess the contents of that table 
would also need to be replicated...

I wonder if we should replicate changes to such user-defined catalog tables as 
well; that should be relatively easy, and if it's not wanted the output plugin 
can easily filter them out (if (class_form->relusercat)).
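
Filtering those out in the plugin would then be a couple of lines, something like 
this (entirely hypothetical: relusercat is just that experimental pg_class column 
and does not exist in stock PostgreSQL, and the function is made up):

/* Hypothetical: skip changes to user-defined catalog tables. */
static bool
is_user_catalog(Relation rel)
{
    Form_pg_class class_form = RelationGetForm(rel);

    return class_form->relusercat;  /* user-defined catalog: don't forward */
}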

Does that clear things up?

Greetings,

Andres
-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

