I'm pretty sure this turned out to be only tangentially related to bucardo.
The EC2 server had been configured with the drive at 1500 IOPS with
burst to 3000 IOPS. The disconnections we were getting happened in
tandem with AWS resetting the network interface because bucardo's
copy was exceeding the burst rate. bucardo would then try again, but the
delay meant that many more rows had to be deleted and then re-inserted via
the copy, so we were trapped in an ever-expanding reset loop.
Once the drive was reconfigured for 10,000 IOPS, bucardo quickly caught up.
Additionally, I have not seen the VAC double free error when restarting
bucardo, but I don't have an explanation for that.
Jeff
Jeff Ross
[email protected]
On 2020-04-02 12:51, David Christensen wrote:
Anything in the PostgreSQL logs around this time?
--
David Christensen
Senior Software and Database Engineer
End Point Corporation
[email protected]
785-727-1171
On Mar 31, 2020, at 10:42 AM, Jeff Ross <[email protected]> wrote:
Not sure that's going to help--or maybe this is another issue.
Getting this in the logs now:
(18916) [Tue Mar 31 11:37:32 2020] KID (load_sync) Warning! Aborting due to exception for metro.load_events:? Error was CTL request
(18916) [Tue Mar 31 11:37:32 2020] KID (load_sync) Kid has died, error is: CTL request Line: 4997
(24401) [Tue Mar 31 11:37:32 2020] KID (load_sync) Warning! Aborting due to exception for metro.load:? Error was CTL request
(24401) [Tue Mar 31 11:37:32 2020] KID (load_sync) Kid has died, error is: CTL request Line: 4997
bucardo status shows it moved on from the load table to the load_events table
but I don't think the load table ever synced back up.
Jeff Ross
[email protected]
On 2020-03-31 09:32, Jeff Ross wrote:
Thank you David. On the master side I had idle_in_transaction_session_timeout
set to 10 minutes, so I did alter role bucardo to set it to 0 as suggested.
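
For anyone finding this thread later, the change described above would look roughly like the following. This is a sketch only: it assumes the role is named "bucardo" as in this thread and a PostgreSQL version that has the idle_in_transaction_session_timeout parameter (9.6 or later). The exact statements that were run are not shown in the thread.

```sql
-- Assumes the replication role is named "bucardo", as in this thread.
-- Run on the master as a superuser or another role allowed to alter it.
ALTER ROLE bucardo SET idle_in_transaction_session_timeout = 0;  -- 0 disables the timeout

-- Confirm the per-role override was stored.
SELECT rolname, rolconfig FROM pg_roles WHERE rolname = 'bucardo';

-- From a new session opened as bucardo, this should report 0.
SHOW idle_in_transaction_session_timeout;
```

Per-role settings only apply to sessions started after the change, so connections that are already open keep the old value until they reconnect.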
Jeff Ross
[email protected]
On 2020-03-31 09:14, David Christensen wrote:
On Mar 31, 2020, at 9:08 AM, Jeff Ross <[email protected]> wrote:
FATAL: terminating connection due to idle-in-transaction timeout
Well, this sounds like *a* potential issue (not necessarily *the* issue). What
do you have the idle_in_transaction_session_timeout parameter set to? If it’s
particularly low (read: lower than some rate of changes), you could end up in a
situation where the CTL connection terminates like you display, and then all
bets are off.
I would not expect this to be a persistent issue (i.e., a Bucardo restart
should reestablish these connections and pick up where it left off).
If you need the idle_in_transaction_session_timeout behavior, at the very
least, you could alter the “bucardo” user to disable this GUC for that user.
HTH,
David
--
David Christensen
Senior Software and Database Engineer
End Point Corporation
[email protected]
785-727-1171
--
The contents of this e-mail and any attachments are intended solely for the use
of the named addressee(s) and may contain confidential and/or privileged
information. Any unauthorized use, copying, disclosure, or distribution of the
contents of this e-mail is strictly prohibited by the sender and may be
unlawful. If you are not the intended recipient, please notify the sender
immediately and delete this e-mail.
_______________________________________________
Bucardo-general mailing list
[email protected]
https://bucardo.org/mailman/listinfo/bucardo-general