Hello Alex,

Your configuration sets the `max-unacked-clients` parameter to 5. This means that when communication between the two servers is interrupted, the surviving server starts checking whether the other server still responds to DHCP queries. It assumes that the other server is not responding when the `secs` field value that clients set in their discover or rebind messages exceeds the value of `max-ack-delay` (5 seconds in your case). It must find at least 5 distinct clients retransmitting with an increased `secs` field value before it may assume that the partner is dead and take over.
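
To make the check described above more concrete, here is a minimal Python sketch of the logic. It is a simplified model and not Kea's actual code: the `ClientPacket` type and both function names are made up for illustration, I am assuming `max-ack-delay` is expressed in milliseconds while `secs` is in seconds, and the "at least" comparison simply follows the wording above, so please check the ARM for the exact rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientPacket:
    """Hypothetical view of a query seen while the partner is unreachable."""
    client_id: str  # e.g. the client's MAC address
    secs: int       # value of the DHCP `secs` field, in seconds


def unacked_clients(packets, max_ack_delay_ms):
    """Return the distinct clients whose `secs` value exceeds max-ack-delay."""
    return {p.client_id for p in packets if p.secs * 1000 > max_ack_delay_ms}


def partner_presumed_down(packets, max_ack_delay_ms, max_unacked_clients):
    """True once enough distinct clients look unanswered (illustrative rule)."""
    return len(unacked_clients(packets, max_ack_delay_ms)) >= max_unacked_clients


# Four distinct clients with secs > 5 (one of them retransmitting twice),
# plus one client whose secs value has not yet exceeded the delay.
seen = [
    ClientPacket("aa:01", 6),
    ClientPacket("aa:01", 9),
    ClientPacket("aa:02", 7),
    ClientPacket("aa:03", 8),
    ClientPacket("aa:04", 6),
    ClientPacket("aa:05", 5),
]
too_few = partner_presumed_down(seen, 5000, 5)
enough = partner_presumed_down(seen + [ClientPacket("aa:06", 10)], 5000, 5)
```

In this model, retransmissions from the same client count only once, and a client that always sends `secs` equal to 0 is never counted, which is why the `secs` behaviour of your clients matters.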

This process can take a while, depending on the lease lifetime (rebind time), the number of new or rebinding clients, etc.

You can read more about the parameters controlling the failover process in the Kea ARM:

https://kea.readthedocs.io/en/kea-2.4.0/arm/hooks.html#load-balancing-configuration

In your failover scenario, please make sure that, after the communication failure, the clients properly set the `secs` field value upon retransmissions. If you want to bypass this mechanism, you can set `max-unacked-clients` to 0.
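
If it helps, this is a small Python sketch of that change applied to an already parsed configuration. The function name, the library path and the trimmed-down example config are placeholders of mine, and I am assuming the usual layout in which the HA settings live under the `high-availability` list in the hook's `parameters`. Note that a Kea config file may contain comments, so it may not load with a strict JSON parser as-is.

```python
import copy


def set_max_unacked_clients(config, value, service="Dhcp4"):
    """Return a copy of `config` with max-unacked-clients set in every HA relationship."""
    updated = copy.deepcopy(config)
    for hook in updated.get(service, {}).get("hooks-libraries", []):
        # Only the HA hook carries a "high-availability" list in its parameters.
        for relationship in hook.get("parameters", {}).get("high-availability", []):
            relationship["max-unacked-clients"] = value
    return updated


# Trimmed-down, hypothetical configuration; the library paths are placeholders.
example = {
    "Dhcp4": {
        "hooks-libraries": [
            {"library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so"},
            {
                "library": "/usr/lib/kea/hooks/libdhcp_ha.so",
                "parameters": {
                    "high-availability": [
                        {
                            "this-server-name": "server2",
                            "mode": "hot-standby",
                            "max-unacked-clients": 5,
                        }
                    ]
                },
            },
        ]
    }
}

patched = set_max_unacked_clients(example, 0)
```

The patched configuration still has to be written back and loaded by both servers before it takes effect.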

Marcin Siodelski
Senior Software Engineer
ISC

On 9.10.2023 19:37, Alexandre Lessard wrote:
Hello everyone,

I'm new here! I'm working for an ISP as a network administrator.
I have about 7 years of experience doing all sorts of IT
work for this company. I've been using and configuring Kea DHCP for
about 2 weeks now. Prior to that, I was using ISC DHCP, but since it's
now deprecated, I'm preparing two new servers to migrate all customers
to them.

The setup:
The setup is DHCP relay with two Kea servers in HA hot-standby. There
are three particularities that I want to mention right now.

First, because I couldn't find an out-of-the-box solution, I made a
script that replicates the configuration through the API on both servers
when they are restarted. I don't think it interferes with the service,
as it is run prior to the service startup, but I don't want to
overlook it either.

Second, they both have an IP configured on their loopback interface
to be used kind of like an anycast address. That being said, I don't
use it for the HA; it's only used by the relay agents.

Third, they are Proxmox containers. I don't think it's problematic but
tell me if I'm wrong, I will make VMs for them.

My problem:
When I simulated an outage by stopping server1, only 2 (test2 and
test3) of the 4 subnets eventually recovered. Even when they recover, it
takes about 5 minutes. As far as I understand, it's supposed to be
configured at 1 minute. The two other subnets never recover.
Why do some subnets never recover?
Why do the 2 that recover take so long?

I observed that the state of server2 stays at "hot-standby" even if
the remote communication is interrupted.

I have been working on fixing that for more than 10 hours now,
and I really don't know what to look for anymore.

The config:
The Control Agents have an almost default configuration, except for the
http-host, which is set to the IP of the interface that receives the requests
(eth0).

The Dhcp6 server is disabled.

As for the Dhcp4 config, it has been saved through the API, so it is
massive! All default configs have been written in the config file. For
this reason, I won't post it here unless required, to avoid sending a
wall of config. I've put it in a public GitHub repository:
https://github.com/AlexTargo/Kea-Dhcp

If I'm missing anything, let me know, and I'll share it as soon as possible.

I hope someone has good pointers for me.

Regards
Alex

