I'm pretty sure this turned out to be only tangentially related to bucardo.

The EC2 server had been configured with the drive at 1500 IOPS with burst to 3000 IOPS.  The disconnections we were getting happened in tandem with AWS resetting the network interface because bucardo's copy was exceeding the burst rate.  bucardo would then try again, but the delay meant that many more rows had to be deleted and then inserted via the copy, so we were trapped in an ever-expanding reset loop.

Once the drive was reset to have 10,000 IOPS, bucardo quickly caught up.

Additionally, I have not seen the VAC double free error when restarting bucardo, but I don't have an explanation for that.

Jeff

Jeff Ross
[email protected]

On 2020-04-02 12:51, David Christensen wrote:
Anything in the PostgreSQL logs around this time?
--
David Christensen
Senior Software and Database Engineer
End Point Corporation
[email protected]
785-727-1171


On Mar 31, 2020, at 10:42 AM, Jeff Ross <[email protected]> wrote:

Not sure that's going to help--or maybe this is another issue.

Getting this in the logs now:

(18916) [Tue Mar 31 11:37:32 2020] KID (load_sync) Warning! Aborting due to exception for metro.load_events:? Error was CTL request
(18916) [Tue Mar 31 11:37:32 2020] KID (load_sync) Kid has died, error is: CTL request Line: 4997
(24401) [Tue Mar 31 11:37:32 2020] KID (load_sync) Warning! Aborting due to exception for metro.load:? Error was CTL request
(24401) [Tue Mar 31 11:37:32 2020] KID (load_sync) Kid has died, error is: CTL request Line: 4997


bucardo status shows it moved on from the load table to the load_events table 
but I don't think the load table ever synced back up.


Jeff Ross
[email protected]

On 2020-03-31 09:32, Jeff Ross wrote:
Thank you David.  On the master side I had idle_in_transaction_session_timeout 
set to 10 minutes, so I altered the bucardo role to set it to 0 as suggested.

Jeff Ross
[email protected]

On 2020-03-31 09:14, David Christensen wrote:
On Mar 31, 2020, at 9:08 AM, Jeff Ross <[email protected]> wrote:

FATAL:  terminating connection due to idle-in-transaction timeout

Well, this sounds like *a* potential issue (not necessarily *the* issue).  What 
do you have the idle_in_transaction_session_timeout parameter set to?  If it’s 
particularly low (read: lower than some rate of changes), you could end up in a 
situation where the CTL connection terminates like you display, and then all 
bets are off.

I would not expect this to be a persistent issue (i.e., a Bucardo restart 
should reestablish these connections and pick up where it left off).

If you need the idle_in_transaction_session_timeout behavior, at the very 
least, you could alter the “bucardo” user to disable this GUC for that user.
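
For reference, the per-role override described above can be sketched in SQL roughly as follows. This assumes the replication role is literally named bucardo, as in this thread, and that the statements are run by a role permitted to alter it (for example a superuser). A per-role setting only becomes the default for sessions opened after the change, so existing Bucardo connections would need to be reestablished before it takes effect.

```sql
-- Show the value in effect for the current session (0 means disabled).
SHOW idle_in_transaction_session_timeout;

-- Disable the timeout for the bucardo role only; other roles keep the
-- server-wide setting.
ALTER ROLE bucardo SET idle_in_transaction_session_timeout = 0;

-- Confirm that the per-role override was recorded.
SELECT rolname, rolconfig FROM pg_roles WHERE rolname = 'bucardo';
```

Scoping the change to one role keeps the timeout as a safety net for every other client while exempting the long-lived Bucardo connections.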

HTH,

David
--
David Christensen
Senior Software and Database Engineer
End Point Corporation
[email protected]
785-727-1171


--
The contents of this e-mail and any attachments are intended solely for the use 
of the named addressee(s) and may contain confidential and/or privileged 
information. Any unauthorized use, copying, disclosure, or distribution of the 
contents of this e-mail is strictly prohibited by the sender and may be 
unlawful. If you are not the intended recipient, please notify the sender 
immediately and delete this e-mail.


_______________________________________________
Bucardo-general mailing list
[email protected]
https://bucardo.org/mailman/listinfo/bucardo-general
