We've been running OpenLDAP since 2015 and upgraded from v2.4 to v2.6 about a year ago. 99% of the time, replication works fine. I have numerous consumers, and the only ones with regular issues are the two in AWS. This week (the worst so far), I had to restart both consumers because replication hung on 4 out of 5 days. On two of those days, I hit the be_delete issue mentioned above. On the other days, replication simply resumed and finished after the restart. I have lowered timeouts and keepalives to see if that would help; current settings are:

    idletimeout 30
    syncrepl rid=XXX ... retry="10 10 20 +" network-timeout=30 timeout=60 keepalive=10:3:10

It's unclear whether this has helped. Note that if all the operations/tasks finish quickly, the be_delete issue is unlikely. If one of the operations takes a while to finish, be_delete is more likely. I assume this is because systemd, as a last resort, sends SIGKILL when the initial SIGINT has not stopped slapd in time.
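
If the SIGKILL escalation really is the trigger, one thing that might be worth trying is giving slapd more time to stop before systemd escalates. A minimal sketch of a drop-in, assuming the unit is named slapd.service (the unit name, the drop-in path, and the 300-second value are all assumptions for illustration, not tested recommendations):

```ini
# /etc/systemd/system/slapd.service.d/override.conf
# (hypothetical path; adjust to the actual unit name on the consumer)
[Service]
# Allow up to 300 seconds for slapd to finish in-flight operations
# after the stop signal before systemd falls back to SIGKILL.
# 300 is an illustrative value only.
TimeoutStopSec=300
```

After adding the file, `systemctl daemon-reload` is needed for the change to take effect. This would not fix the underlying hang; it would only reduce the chance that a slow shutdown ends in SIGKILL.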
