Re: [Slony1-general] slony backup during pg_dump

2019-06-14 Thread Glyn Astill
On Thursday, 13 June 2019, 18:11:34 BST, Christopher Browne wrote:

> On Thu, 13 Jun 2019 at 12:48, Tory M Blue wrote:

>> I saw nothing in the locks when I looked. We did notice that in our second
>> dump, while we are using a -n table, we are not including a -N _slonschema.
>> Now that should work!! But we are going to add the -N _slonschema as well as
>> the -n table, just to see. Will poke it some more this eve!
>
> The relevant documentation is thus...
> http://slony.info/documentation/2.2/appendix.html#AEN9598
>
> It's a bit dated; at the time that was written, pg_dump did not yet have the
> --exclude-schema option :-)
>
> I think that's the "smoking gun", there!  Yes indeed, add -N _slonschema,
> that should help!
Late to the party, but this is the exact behaviour when the slony schema isn't
excluded in 2.0 onwards; my analysis from when this hit us after an upgrade
from 1.2.x is here:
[Slony1-general] pg_dump and replication lag in 2.0.7



Re: [Slony1-general] slony backup during pg_dump

2019-06-13 Thread Christopher Browne
On Thu, 13 Jun 2019 at 12:48, Tory M Blue  wrote:

> I saw nothing in the locks when I looked. We did notice that in our second
> dump, while we are using a -n table, we are not including a -N
> _slonschema. Now that should work!! But we are going to add the -N
> _slonschema as well as the -n table, just to see. Will poke it some more
> this eve!
>

The relevant documentation is thus...
http://slony.info/documentation/2.2/appendix.html#AEN9598

It's a bit dated; at the time that was written, pg_dump did not yet have
the --exclude-schema option :-)

I think that's the "smoking gun", there!  Yes indeed, add -N _slonschema,
that should help!
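
For the record, a full invocation along those lines might look like the
following (a sketch only: "myschema" is a placeholder for the schema being
dumped, and _slonschema stands in for the actual Slony schema, i.e. "_"
plus the cluster name):

  pg_dump -n myschema -N _slonschema -Fc -f clsdb.dump clsdb

With the Slony schema excluded, the dump takes no locks on the sl_* tables
at all.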


Re: [Slony1-general] slony backup during pg_dump

2019-06-13 Thread Tory M Blue
On Thu, Jun 13, 2019 at 7:59 AM Christopher Browne wrote:

> On Wed, Jun 12, 2019 at 11:58 PM Tory M Blue  wrote:
>
>> First postgres 9.x
>> CentOS 7.2
>> Replication flows:
>> primary -> secondary -> 3 query slaves
>>
>> Okay here it is,
>>
>> My dump started:
>>
>> Start export of schema clsdb for user postgres at Wed Jun 12 19:00:48 PDT
>> 2019
>>
>> So if I'm reading this correctly, it's 20:45 but something is showing that
>> the last _received_event_ts was at 19:38? Over an hour ago?
>> st_lag_time 1 hour?
>>
>> Primary:
>>
>> select * from _cls.sl_status limit 3;
>>
>> -[ RECORD 1 ]-------------+------------------------------
>> st_origin                 | 11
>> st_received               | 14
>> st_last_event             | 5036030946
>> st_last_event_ts          | 2019-06-12 20:45:29.465785-07
>> st_last_received          | 5036030138
>> st_last_received_ts       | 2019-06-12 19:38:11.065976-07
>> st_last_received_event_ts | 2019-06-12 19:38:03.311267-07
>> st_lag_num_events         | 808
>> st_lag_time               | 01:07:33.039694
>> -[ RECORD 2 ]-------------+------------------------------
>> st_origin                 | 11
>> st_received               | 15
>> st_last_event             | 5036030946
>> st_last_event_ts          | 2019-06-12 20:45:29.465785-07
>> st_last_received          | 5036030138
>> st_last_received_ts       | 2019-06-12 19:38:12.202539-07
>> st_last_received_event_ts | 2019-06-12 19:38:03.311267-07
>> st_lag_num_events         | 808
>> st_lag_time               | 01:07:33.039694
>> -[ RECORD 3 ]-------------+------------------------------
>> st_origin                 | 11
>> st_received               | 12
>> st_last_event             | 5036030946
>> st_last_event_ts          | 2019-06-12 20:45:29.465785-07
>> st_last_received          | 5036030138
>> st_last_received_ts       | 2019-06-12 19:38:08.342637-07
>> st_last_received_event_ts | 2019-06-12 19:38:03.311267-07
>> st_lag_num_events         | 808
>> st_lag_time               | 01:07:33.039694
>>
>>
>> I have 2.4 million rows in the sl_log_1 table, which looks like it
>> started to hockey stick right around 19:30.
>>
>>
>>
> You're reading it pretty well right.
>
> It is mighty curious that it waited 37-ish minutes for replication to fall
> off.
>
> It would be interesting (and it's too late for today) to capture the lag
> between those times.
>
> Thus, something like...
>
> #!/bin/sh
> PGDATABASE=postgresql://cbbrowne@localhost:7096/shared-sample
> psql -d $PGDATABASE -c "create table if not exists lag_by_time (as_at timestamp default now(), replication_lag interval);"
> for i in `seq 180`; do
>   psql -d $PGDATABASE -c "insert into lag_by_time (replication_lag) select min(st_lag_time) from _cls.sl_status;"
>   sleep 60
> done
>
> and launch that at about 19:00.  At whatever point replication_lag heads
> towards the Stanley Cup, that's the time at which Something Happened.
>
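
Afterwards, the samples can be read back with something like the following
(hypothetical, matching the lag_by_time table created by the script above):

  psql -d $PGDATABASE -c "select as_at, replication_lag from lag_by_time order by as_at;"

The first as_at value where replication_lag jumps marks the moment things
went sideways.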
> It's probably interesting to look at pg_stat_activity to see what
> long-running queries there might be.
>
> I'm somewhat suspicious that, despite what you were intending, perhaps the
> pg_dump is nevertheless capturing the schema _cls; it would be "sorta
> normal" if, at the point at which pg_dump starts dumping _cls material,
> replication starts lagging.
>
> In particular, anything in pg_stat_activity that started at 19:38 ought to
> be looked at with particular suspicion as a candidate for being the cause.
>
>
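
A pass over pg_stat_activity for long-running activity might look like the
following (a sketch; it assumes PostgreSQL 9.2 or later, where the columns
are pid and query rather than the older procpid and current_query):

  select pid, state, xact_start, query_start, left(query, 60) as query
    from pg_stat_activity
   where query_start < now() - interval '30 minutes'
   order by query_start;

Anything whose query_start or xact_start lands near 19:38 is the prime
suspect.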
>>
>> On the secondary (which replicates to my query DBs, where the dump is
>> run from, not from the primary), it includes these lines:
>>
>>
>> Secondary:
>>
>>
>> select * from _cls.sl_status limit 3;
>>
>> -[ RECORD 1 ]-------------+------------------------------
>> st_origin                 | 12
>> st_received               | 11
>> st_last_event             | 5008650405
>> st_last_event_ts          | 2019-06-12 20:45:35.365757-07
>> st_last_received          | 5008650000
>> st_last_received_ts       | 2019-06-12 19:38:03.815508-07
>> st_last_received_event_ts | 2019-06-12 19:38:03.338486-07
>> st_lag_num_events         | 405
>> st_lag_time               | 01:07:35.106539
>> -[ RECORD 2 ]-------------+------------------------------
>> st_origin                 | 12
>> st_received               | 13
>> st_last_event             | 5008650405
>> st_last_event_ts          | 2019-06-12 20:45:35.365757-07
>> st_last_received          | 5008650403
>> st_last_received_ts       | 2019-06-12 20:45:17.701455-07
>> st_last_received_event_ts | 2019-06-12 20:45:15.355233-07
>> st_lag_num_events         | 2
>> st_lag_time               | 00:00:23.089792
>> -[ RECORD 3 ]-------------+------------------------------
>> st_origin                 | 12
>> st_received               | 14
>> st_last_event             | 5008650405
>> st_last_event_ts          | 2019-06-12 20:45:35.365757-07
>> st_last_received          | 5008650404
>> st_last_received_ts       | 2019-06-12 20:45:26.642094-07
>> st_last_received_event_ts | 2019-06-12 20:45:25.360167-07
>> st_lag_num_events         | 1
>> st_lag_time               | 00:00:13.084858
>>
>
> May I presume that the time you ran this query was around 20:45 or 20:46?
> That looks unsurprising/consistent with the first query; events going
> through node 11, which appears to be the master, are held back to 19:38.
> And the few events coming through from other nodes are only lightly
> laggy, consistent with those nodes not generating SYNC events very often.
>

You assume correctly. I ran the queries back to back, and yes, node 11 is
the primary/master and node 12 is the standby, with nodes 13-15 just query
nodes holding only a subset of the data; they are not available for

Re: [Slony1-general] slony backup during pg_dump

2019-06-12 Thread Tory M Blue
On Wed, Jun 12, 2019 at 9:33 AM Christopher Browne wrote:

> On Tue, Jun 11, 2019 at 3:44 PM Tory M Blue  wrote:
>
>>
>>
>> On Tue, Jun 11, 2019 at 12:26 PM Christopher Browne <cbbro...@afilias.info> wrote:
>>
>>>
>>>
>>> On Tue, Jun 11, 2019 at 1:30 PM Tory M Blue  wrote:
>>>
>>>> I'm wondering if there is anything I can do to prevent what appears to
>>>> be slony replication backup (not replicating) during a pg_dump. I'm using a
>>>> -N slonschema, so wondering if there are any other techniques to prevent
>>>> slon replication delay during pg_dumps?

>>>
>>> Hmm.  I remember the "big deal" being to ensure that the Slony schema
>>> was left out, as that would definitely lead to locking that would cause
>>> replication to fall behind.
>>>
>>> It's possible that something new has come along that would lock in a
>>> troublesome way.
>>>
>>> I suggest that you take a look in pg_catalog.pg_locks to see what
>>> objects Slony is waiting on, thus, looking for:
>>>
>>>   select * from pg_catalog.pg_locks where not granted;
>>>
>>> Some further work needs to be done to trace that to exactly what objects
>>> are locked.
>>>
>>
>> Ya, will take a look. I also thought that bypassing the slon schema would
>> solve it, but apparently not; we can have a 2-10 million row count in sl_log
>> before it clears, and it's apparent that replication is not happening,
>> so I'll poke around some more. Glad to see folks are still around :)
>>
>
> Well, growth of sl_log during the backup is fine, and not an indication of
> anything untoward.  Given that you opened a transaction when pg_dump
> began, and that has not completed, that will lead to Slony not trimming
> anything out of sl_log during the backup.  That's normal, routine stuff.
> Nothing's wrong (not on the basis of that, anyways).
>
> On the origin node, you can query the view sl_status, which will show how
> far behind replication appears to be.  (The view is available on all nodes;
> it does not return terribly interesting results on nodes other than the
> origin.)
>
> Here's an example of replication being very far behind:
>
> usaorigin=# select * from _usa.sl_status limit 3;
> -[ RECORD 1 ]-+--
> st_origin | 1
> st_received   | 3
> st_last_event | 502124
> st_last_event_ts  | 2019-04-10 20:02:43.833414+00
> st_last_received  | 502124
> st_last_received_ts   | 2019-04-10 20:02:43.885806+00
> st_last_received_event_ts | 2019-04-10 20:02:43.833414+00
> st_lag_num_events | 0
> st_lag_time   | 62 days 20:23:14.029816
> -[ RECORD 2 ]-+--
> st_origin | 1
> st_received   | 4
> st_last_event | 502124
> st_last_event_ts  | 2019-04-10 20:02:43.833414+00
> st_last_received  | 502122
> st_last_received_ts   | 2019-04-10 20:02:33.870577+00
> st_last_received_event_ts | 2019-04-10 20:02:23.76373+00
> st_lag_num_events | 2
> st_lag_time   | 62 days 20:23:34.0995
> -[ RECORD 3 ]-+--
> st_origin | 1
> st_received   | 5
> st_last_event | 502124
> st_last_event_ts  | 2019-04-10 20:02:43.833414+00
> st_last_received  | 502122
> st_last_received_ts   | 2019-04-10 20:02:33.898605+00
> st_last_received_event_ts | 2019-04-10 20:02:23.76373+00
> st_lag_num_events | 2
> st_lag_time   | 62 days 20:23:34.0995
>
> That is a replication cluster where I shut down slon processes back in
> April :-)
>
> It's only behind by a few events, for the self-same reason :-)
>
> If the last event received coincides with the time at which you started
> the pg_dump run, well, that sure sounds like a smoking gun.
>
> If, on the other hand, SYNC events are being processed, just at slower
> speed than you'd like, then perhaps the problem is that the I/O induced by
> pg_dump is delaying replication, and that's more a hardware problem than a
> software one.
>
> And if replication is a few seconds behind, while the system is busy, as
> indicated by st_lag_time being a few seconds, then it's possible that there
> is no problem there.
>

First postgres 9.x
CentOS 7.2
Replication flows:
primary -> secondary -> 3 query slaves

Okay here it is,

My dump started:

Start export of schema clsdb for user postgres at Wed Jun 12 19:00:48 PDT
2019

So if I'm reading this correctly, it's 20:45 but something is showing that the
last _received_event_ts was at 19:38? Over an hour ago?
st_lag_time 1 hour?

Primary:

select * from _cls.sl_status limit 3;

-[ RECORD 1 ]-------------+------------------------------
st_origin                 | 11
st_received               | 14
st_last_event             | 5036030946
st_last_event_ts          | 2019-06-12 20:45:29.465785-07
st_last_received          | 5036030138
st_last_received_ts       | 2019-06-12 19:38:11.065976-07
st_last_received_event_ts | 2019-06-12 19:38:03.311267-07
st_lag_num_events         | 808
st_lag_time               | 01:07:33.039694
-[ RECORD 2 ]-------------+------------------------------
st_origin                 | 11
st_received               | 15
st_last_event             | 5036030946
st_last_event_ts          | 2019-06-12 20:45:29.465785-07
st_last_received          | 5036030138
st_last_received_ts       | 2019-06-12 19:38:12.202539-07
st_last_received_event_ts | 2019-06-12 19:38:03.311267-07
st_lag_num_events         | 808
st_lag_time               | 01:07:33.039694
-[ RECORD 3 ]-------------+------------------------------
st_origin                 | 11
st_received               | 12
st_last_event             | 5036030946
st_last_event_ts          | 2019-06-12 20:45:29.465785-07
st_last_received          | 5036030138
st_last_received_ts       | 2019-06-12 19:38:08.342637-07
st_last_received_event_ts | 2019-06-12 19:38:03.311267-07
st_lag_num_events         | 808
st_lag_time               | 01:07:33.039694


Re: [Slony1-general] slony backup during pg_dump

2019-06-12 Thread Christopher Browne
On Tue, Jun 11, 2019 at 3:44 PM Tory M Blue  wrote:

>
>
> On Tue, Jun 11, 2019 at 12:26 PM Christopher Browne wrote:
>
>>
>>
>> On Tue, Jun 11, 2019 at 1:30 PM Tory M Blue  wrote:
>>
>>> I'm wondering if there is anything I can do to prevent what appears to
>>> be slony replication backup (not replicating) during a pg_dump. I'm using a
>>> -N slonschema, so wondering if there are any other techniques to prevent
>>> slon replication delay during pg_dumps?
>>>
>>
>> Hmm.  I remember the "big deal" being to ensure that the Slony schema was
>> left out, as that would definitely lead to locking that would cause
>> replication to fall behind.
>>
>> It's possible that something new has come along that would lock in a
>> troublesome way.
>>
>> I suggest that you take a look in pg_catalog.pg_locks to see what objects
>> Slony is waiting on, thus, looking for:
>>
>>   select * from pg_catalog.pg_locks where not granted;
>>
>> Some further work needs to be done to trace that to exactly what objects
>> are locked.
>>
>
> Ya, will take a look. I also thought that bypassing the slon schema would
> solve it, but apparently not; we can have a 2-10 million row count in sl_log
> before it clears, and it's apparent that replication is not happening,
> so I'll poke around some more. Glad to see folks are still around :)
>

Well, growth of sl_log during the backup is fine, and not an indication of
anything untoward.  Given that you opened a transaction when pg_dump
began, and that has not completed, that will lead to Slony not trimming
anything out of sl_log during the backup.  That's normal, routine stuff.
Nothing's wrong (not on the basis of that, anyways).

On the origin node, you can query the view sl_status, which will show how
far behind replication appears to be.  (The view is available on all nodes;
it does not return terribly interesting results on nodes other than the
origin.)

Here's an example of replication being very far behind:

usaorigin=# select * from _usa.sl_status limit 3;
-[ RECORD 1 ]-+--
st_origin | 1
st_received   | 3
st_last_event | 502124
st_last_event_ts  | 2019-04-10 20:02:43.833414+00
st_last_received  | 502124
st_last_received_ts   | 2019-04-10 20:02:43.885806+00
st_last_received_event_ts | 2019-04-10 20:02:43.833414+00
st_lag_num_events | 0
st_lag_time   | 62 days 20:23:14.029816
-[ RECORD 2 ]-+--
st_origin | 1
st_received   | 4
st_last_event | 502124
st_last_event_ts  | 2019-04-10 20:02:43.833414+00
st_last_received  | 502122
st_last_received_ts   | 2019-04-10 20:02:33.870577+00
st_last_received_event_ts | 2019-04-10 20:02:23.76373+00
st_lag_num_events | 2
st_lag_time   | 62 days 20:23:34.0995
-[ RECORD 3 ]-+--
st_origin | 1
st_received   | 5
st_last_event | 502124
st_last_event_ts  | 2019-04-10 20:02:43.833414+00
st_last_received  | 502122
st_last_received_ts   | 2019-04-10 20:02:33.898605+00
st_last_received_event_ts | 2019-04-10 20:02:23.76373+00
st_lag_num_events | 2
st_lag_time   | 62 days 20:23:34.0995

That is a replication cluster where I shut down slon processes back in
April :-)

It's only behind by a few events, for the self-same reason :-)

If the last event received coincides with the time at which you started the
pg_dump run, well, that sure sounds like a smoking gun.

If, on the other hand, SYNC events are being processed, just at slower
speed than you'd like, then perhaps the problem is that the I/O induced by
pg_dump is delaying replication, and that's more a hardware problem than a
software one.

And if replication is a few seconds behind, while the system is busy, as
indicated by st_lag_time being a few seconds, then it's possible that there
is no problem there.


Re: [Slony1-general] slony backup during pg_dump

2019-06-12 Thread Steve Singer

On Tue, 11 Jun 2019, Tory M Blue wrote:




> Ya, will take a look. I also thought that bypassing the slon schema would
> solve it, but apparently not; we can have a 2-10 million row count in sl_log
> before it clears, and it's apparent that replication is not happening,
> so I'll poke around some more. Glad to see folks are still around :)
>
> Thanks
> Tory


The large number of rows in sl_log is expected. The pg_dump opens a 
transaction and the rows can't be cleaned up by the cleanupThread until all 
transactions that might use those rows are gone.
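
One quick way to spot the transaction pinning sl_log is to look for the
oldest open transactions, e.g. (a sketch, again assuming PostgreSQL 9.2+
column names):

  select pid, usename, xact_start, now() - xact_start as xact_age,
         left(query, 60) as query
    from pg_stat_activity
   where xact_start is not null
   order by xact_start
   limit 5;

The pg_dump backend should sit at the top of that list for the duration of
the backup.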


I am a bit surprised that replication stops.
I know many sites take pg_dumps from slon replicas.

Steve



Re: [Slony1-general] slony backup during pg_dump

2019-06-11 Thread Tory M Blue
On Tue, Jun 11, 2019 at 12:26 PM Christopher Browne wrote:

>
>
> On Tue, Jun 11, 2019 at 1:30 PM Tory M Blue  wrote:
>
>> I'm wondering if there is anything I can do to prevent what appears to be
>> slony replication backup (not replicating) during a pg_dump. I'm using a -N
>> slonschema, so wondering if there are any other techniques to prevent slon
>> replication delay during pg_dumps?
>>
>
> Hmm.  I remember the "big deal" being to ensure that the Slony schema was
> left out, as that would definitely lead to locking that would cause
> replication to fall behind.
>
> It's possible that something new has come along that would lock in a
> troublesome way.
>
> I suggest that you take a look in pg_catalog.pg_locks to see what objects
> Slony is waiting on, thus, looking for:
>
>   select * from pg_catalog.pg_locks where not granted;
>
> Some further work needs to be done to trace that to exactly what objects
> are locked.
>
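
To trace those to actual objects and sessions, a join along these lines is
one option (a sketch, assuming PostgreSQL 9.2+, where pg_stat_activity has
pid and query columns):

  select l.pid, l.locktype, l.mode, l.granted, c.relname,
         a.xact_start, left(a.query, 60) as query
    from pg_catalog.pg_locks l
    left join pg_catalog.pg_class c on c.oid = l.relation
    join pg_catalog.pg_stat_activity a on a.pid = l.pid
   order by l.granted, c.relname;

Rows where granted is false are the waiters; granted rows on the same
relation show who is blocking them.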

Ya, will take a look. I also thought that bypassing the slon schema would
solve it, but apparently not; we can have a 2-10 million row count in sl_log
before it clears, and it's apparent that replication is not happening,
so I'll poke around some more. Glad to see folks are still around :)

Thanks
Tory