Kea HA detects partner failures first and foremost via the control channel. The servers constantly exchange heartbeats and lease updates. If this communication is healthy (i.e., the servers receive responses to the control commands), the counters of unacked clients are set to 0, and the servers do not monitor whether their partners respond to DHCP traffic. In other words, as long as the servers can communicate with each other, the "max-unacked-clients" setting has no effect.

Partner failure detection in Kea HA always begins with a communication failure over the control channel. Usually, it is caused by a partner process crash or a network failure. The server tries to send heartbeats and lease updates to the partner, but gets no responses. Suppose the "max-response-delay" setting is 60000 (1 minute) and the "heartbeat-delay" is 10000 (10 seconds). In that case, the server sends a heartbeat to the partner every 10 seconds. If the partner doesn't respond, the server sends the heartbeat again 10 seconds later, and so on. It neither counts the unacked clients nor transitions to the partner-down state, because the communication issue can be temporary. Finally, after around six unsuccessful heartbeat attempts (6 * 10 seconds), the communication interruption becomes longer than 1 minute (60000 ms). At that point, the server assumes there may be an issue with the partner. That is when the "max-unacked-clients" setting finally starts to matter.
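The arithmetic in this example can be written down as a small Python sketch. It is only an illustration of the timing described above, not code taken from Kea; the function name is made up, and the real server measures the time since the last successful communication rather than counting attempts.

```python
import math


def failed_heartbeats_before_interrupted(heartbeat_delay_ms, max_response_delay_ms):
    """Approximate number of consecutive failed heartbeats after which the
    silence on the control channel reaches "max-response-delay".

    Hypothetical helper for illustration only.
    """
    return math.ceil(max_response_delay_ms / heartbeat_delay_ms)


# Values from the example: heartbeat-delay 10000 ms, max-response-delay 60000 ms.
attempts = failed_heartbeats_before_interrupted(10000, 60000)
print(attempts)                  # number of failed attempts
print(attempts * 10000 / 1000)   # elapsed time in seconds
```

With a shorter "heartbeat-delay" the server simply makes more attempts within the same "max-response-delay" window; the window itself is what decides when the unacked clients start to be counted.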

The server begins to analyze the DHCP messages directed to the partner server. The "secs" field in DHCPv4 and the "Elapsed Time" option in DHCPv6 should be set by a client to indicate how long the client has been trying to obtain a new lease or rebind an existing lease. Obviously, the server can't see whether the partner responds to these queries. It only gets copies of the DHCP messages sent by the client. The clients must increase these values when they retry to obtain a lease. If these values are zero, the server suspecting that the partner is down cannot determine whether the partner actually responds to DHCP. If these values are greater than 0 (and greater than "max-ack-delay"), the server can assume that the partner hasn't responded, because the clients are retrying.

For every client that sends a DHCPDISCOVER or DHCPREBIND to the partner server and (finally) sets the "secs" field value greater than "max-ack-delay", the other server bumps up its internal counter of unacked clients. Again, it only does so when it has been unable to communicate with the partner server over the control channel for longer than the configured "max-response-delay". A single successful heartbeat over the control channel clears the counters of the unacked clients and makes the server believe that the partner is healthy. The server also stops looking at the "secs" and "Elapsed Time" values. The "max-unacked-clients" setting no longer matters until the next communication issue over the control channel.
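The counting rules above can be summarized in a short Python sketch. It is a simplified model, not the Kea implementation: the class and method names are hypothetical, clients are identified by a plain string, and it assumes that "secs" is expressed in seconds while "max-ack-delay" is expressed in milliseconds.

```python
from dataclasses import dataclass, field


@dataclass
class UnackedClientTracker:
    """Simplified, hypothetical model of the unacked clients counter."""
    max_ack_delay_ms: int = 10000        # "max-ack-delay"
    max_unacked_clients: int = 10        # "max-unacked-clients"
    communication_interrupted: bool = False
    unacked: set = field(default_factory=set)

    def on_max_response_delay_exceeded(self) -> None:
        # No response over the control channel for longer than
        # "max-response-delay": start analyzing the partner's clients.
        self.communication_interrupted = True

    def on_heartbeat_success(self) -> None:
        # A single successful heartbeat clears the counter and stops
        # the analysis of "secs".
        self.communication_interrupted = False
        self.unacked.clear()

    def on_partner_query(self, client_id: str, secs: int) -> None:
        # Called for a query that the partner is expected to answer.
        if not self.communication_interrupted:
            return  # healthy control channel: the query is not analyzed
        if secs * 1000 > self.max_ack_delay_ms:
            self.unacked.add(client_id)  # each client is counted once

    def failure_detected(self) -> bool:
        return (self.communication_interrupted
                and len(self.unacked) > self.max_unacked_clients)
```

A set is used so that a client retrying many times is still counted as one unacked client, which matches the idea of counting clients rather than packets.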

If the "max-unacked-clients" value is exceeded, the server can finally transition to the partner-down state and handle both the traffic directed to itself and the traffic directed to the inoperative partner. Since the state transitions are only carried out after a heartbeat attempt, there may be a slight delay between exceeding the "max-unacked-clients" value and actually transitioning to the partner-down state.
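To give a feel for the size of that delay, here is a rough Python sketch. It assumes, purely for illustration, that heartbeat attempts happen on a fixed grid of multiples of "heartbeat-delay"; the real timing depends on when each attempt completes, and the function name is hypothetical.

```python
import math


def partner_down_time_ms(exceeded_at_ms, heartbeat_delay_ms):
    """Time of the first heartbeat attempt at or after the moment when
    "max-unacked-clients" was exceeded (simplified fixed-grid model)."""
    return math.ceil(exceeded_at_ms / heartbeat_delay_ms) * heartbeat_delay_ms


# Hypothetical example: the threshold is exceeded 73.5 s after the last
# successful heartbeat, with a heartbeat attempt every 10 s.
exceeded_at = 73500
transition_at = partner_down_time_ms(exceeded_at, 10000)
print(transition_at - exceeded_at)  # delay in milliseconds before the transition
```

In this model the extra delay is always shorter than one "heartbeat-delay".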

I looked into our documentation and realized that although all of these pieces are described there, it can be confusing because the ARM lacks a sequence diagram or an example of how the failover process can look end to end. That's something we should address.

Going over the previous emails, I see that users may expect different failover strategies, depending on the types of failures they are likely to experience in their setup. These are interesting cases, and we will discuss them internally. We could consider some alternative failure detection strategies, selectable with the HA configuration, but we should be aware that there is no one-size-fits-all solution. There is always a possibility that a true failure won't be detected, or that a false failure will be detected, leading to a split-brain situation.

It would be useful if you could open tickets in GitLab describing your failover scenarios and the desired behavior. Please disregard this request if you have already opened them.

Kind Regards,
Marcin Siodelski

Sr. Software Engineer,
ISC

On 10.01.2023 03:07, Eric Graham wrote:
This is my understanding of how the unacked clients functionality works. My explanation is based upon the DHCP4 source code and may differ for DHCP6. I will include references at the bottom of my email which I encourage double-checking for accuracy. I am not a contributor to Kea and have not thoroughly tested the conclusions I draw here.

1. The DHCP packet enters Kea. The HA hook receives the packet in the buffer4Receive[1] function. The packet contents are parsed and dropped if invalid.

2. The packet is checked to be in scope [2][3][4][5][6] (and if it isn't, the packet status is set to NEXT_STEP_DROP [21]). Whether a packet is in scope is decided by the following:
   a. If the packet is not of a type that HA handles (that is, it is not one of DHCPDISCOVER, DHCPREQUEST, DHCPDECLINE, DHCPRELEASE, or DHCPINFORM [7]), then the current server will process it [8].
   b. If HA is configured in load balancing mode, the packet is classed according to the aforementioned HBA defined in RFC 3074 section 6 [9][10]. The HBA returns the server that must handle the packet (either primary or secondary). Otherwise (the server is in hot standby mode), the packet is classed as belonging to the primary server in the HA configuration [12]. The class given in either of these conditions is the defined name of the respective server, coming from the HA section of the Kea DHCP4 configuration [13].
   c. The current server will process the packet if it is serving packets with the class determined in (2)(b).

Note: every heartbeat, the servers send each other their scopes [15]. A failed heartbeat sets the HA status to "unavailable" [24], which eventually transitions the server to partner down state.

3. If the server is in a communication interrupted state and the packet is not classed for the current server, then:
   a. Maintain a global counter, incrementing it once per packet (every successful heartbeat counts as a "poke" for the partner [16], which resets this global counter to zero [17]).
   b. Get the "secs" field of the packet. Compare the value to the value configured in the Kea DHCP4 configuration for "max-ack-delay" [18], or 10 seconds by default [19]. If the value of this field is greater than the max-ack-delay, the packet is considered unacked [20]. All packets (unacked or not) are tracked in a map containing the hardware address, client ID, and last unacked status; if the packet is received unacked and has not previously been recorded as unacked (that is, the packet's secs field just exceeded the max-ack-delay threshold for the first time), the server logs a warning message.

4. A failure is detected if the number of packets in the unacked state is greater than the "max-unacked-clients" setting of the Kea DHCP4 config [22] (or 10 by default [19]). If a failure is detected, the server eventually transitions to partner-down state [23]. More information about when exactly the server transitions to partner-down state is shown by the usages of HAService::shouldPartnerDown() [25] (in other words, I'm not digging into that tonight).
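The scope decision from step 2 can be sketched in Python as follows. This is a simplified illustration, not the Kea code: the function names are hypothetical, the hash is a stand-in (a SHA-256 byte) rather than the RFC 3074 hash that Kea uses, and taking the hash modulo the number of servers is an assumption made for the example.

```python
import hashlib

# Message types that HA applies scoping to, as listed in step 2(a).
HA_TYPES = {"DHCPDISCOVER", "DHCPREQUEST", "DHCPDECLINE",
            "DHCPRELEASE", "DHCPINFORM"}


def stand_in_hash(identifier: bytes) -> int:
    # NOT the RFC 3074 hash; any stable 0..255 value serves the illustration.
    return hashlib.sha256(identifier).digest()[0]


def responsible_server(mode: str, identifier: bytes, servers: list) -> str:
    """Return the name of the server the packet is classed for.
    `servers` lists the configured server names, primary first."""
    if mode == "load-balancing":
        return servers[stand_in_hash(identifier) % len(servers)]
    return servers[0]  # hot-standby: always the primary


def in_scope(msg_type, identifier, mode, servers, served_scopes) -> bool:
    """Decide whether the current server processes the packet."""
    if msg_type not in HA_TYPES:
        return True  # step 2(a): not an HA type, process locally
    return responsible_server(mode, identifier, servers) in served_scopes


servers = ["server1", "server2"]
mac = bytes.fromhex("0a1b2c3d4e5f")
print(in_scope("DHCPDISCOVER", mac, "load-balancing", servers, {"server1"}))
```

In partner-down state the surviving server would serve both scopes, which in this model means passing both server names in `served_scopes`, so every HA-type packet is in scope.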

[1]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_impl.cc#L60-L111
[2]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1021
[3]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1029-L1047
[4]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1034
[5]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L376
[6]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L382-L414
[7]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L51-L71
[8]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L395
[9]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L416-L446
[10]: https://www.rfc-editor.org/rfc/rfc3074
[11]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L413
[12]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L398
[13]: https://kea.readthedocs.io/en/kea-2.2.0/arm/hooks.html#load-balancing-configuration
[14]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/communication_state.cc#L617-L625
[15]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1757-L1758
[16]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1793-L1794
[17]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/communication_state.cc#L274
[18]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_config_parser.cc#L180-L181
[19]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_config.cc#L166
[20]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/communication_state.cc#L652
[21]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_impl.cc#L104
[22]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_config_parser.cc#L184-L185
[23]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1097
[24]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1799
[25]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1081-L1106


Eric Graham
DevOps Specialist
Direct: 605.990.1859
[email protected]

------------------------------------------------------------------------
*From:* Kea-users <[email protected]> on behalf of Kevin P. Fleming <[email protected]>
*Sent:* Monday, January 9, 2023 12:38 PM
*To:* [email protected] <[email protected]>
*Subject:* Re: [Kea-users] Load-Balancing Network issue between Relay and Kea
On Mon, Jan 9, 2023, at 11:54, Veronique Lefebure wrote:
Very interesting thread.

Mathias, you wrote "Expected behaviour: Kea 2 sees the unacked clients of Kea 1 and sets Kea 1 in partner-down state and handles all requests.", but if there is no traffic between DHCP clients and Kea1, then the value of max-unacked-clients on server1 cannot increase anyway, right? In other words, Kea2 cannot "see" anything?


It can 'see', because it *also* saw all of the client requests and knows which ones it expected to be handled by Kea1 (as noted earlier in the thread it even emits a log message indicating this).

Forgive my presumption, but I assumed that 'max-unacked-clients' would be a counter of 'unacked clients' which belong to a Kea server *other than this one*. I don't immediately know how counting the number of clients *this server* has not acked would be useful, although I won't be surprised to learn that it is useful to someone.


--
ISC funds the development of this software with paid support subscriptions. 
Contact us at https://www.isc.org/contact/ for more information.

To unsubscribe visit https://lists.isc.org/mailman/listinfo/kea-users.

Kea-users mailing list
[email protected]
https://lists.isc.org/mailman/listinfo/kea-users
