Re: [Kea-users] Load-Balancing Network issue between Relay and Kea

Marcin Siodelski Mon, 09 Jan 2023 23:35:09 -0800

Kea HA detects partner failures via the control channel in the first place. The servers constantly exchange heartbeats and lease updates. If this communication is healthy (i.e., servers receive responses to the control commands), the counters of unacked clients are set to 0, and the servers do not monitor whether their partners respond to DHCP. In other words, if the servers can communicate with each other, the "max-unacked-clients" setting has no effect.

Partner failure detection in Kea HA always begins with communication failure over the control channel. Usually, it is caused by a partner process crash or a network failure. The server tries to send heartbeats and lease updates to the partner, for which it gets no responses. Suppose the "max-response-delay" setting is 60000 (1 minute) and the "heartbeat-delay" is 10000 (10 seconds). In that case, the server sends a heartbeat every 10 seconds to the partner. If the partner doesn't respond, the server sends the heartbeat again 10 seconds later, and so on. It neither counts the unacked clients nor transitions to the partner-down state because the communication issue can be temporary. Finally, after around six unsuccessful heartbeat attempts (6 * 10 seconds), the communication interruption becomes longer than 1 minute (60000 ms). In that case, the server assumes there can be an issue with the partner. That is when the "max-unacked-clients" setting finally starts to matter.
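
The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration, not Kea code: the function name is made up, and it assumes heartbeats are attempted at a fixed interval starting one interval after the last successful contact, which is why the count is only approximate in practice.

```python
import math


def failed_heartbeats_before_interrupted(max_response_delay_ms: int,
                                         heartbeat_delay_ms: int) -> int:
    """Approximate number of consecutive failed heartbeats needed before
    the outage is as long as max-response-delay (hypothetical helper)."""
    if max_response_delay_ms <= 0 or heartbeat_delay_ms <= 0:
        raise ValueError("both delays must be positive")
    return math.ceil(max_response_delay_ms / heartbeat_delay_ms)


if __name__ == "__main__":
    # Values from the example: 60000 ms max-response-delay, 10000 ms heartbeat-delay.
    print(failed_heartbeats_before_interrupted(60000, 10000))  # prints 6
```

With a shorter "heartbeat-delay" the server notices the interruption at a finer granularity, but the interruption still has to last longer than "max-response-delay" before unacked clients are counted.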

The server begins to analyze the DHCP messages sent to the partner server. The "secs" field in DHCPv4 and "Elapsed Time Option" in DHCPv6 should be set by a client to indicate how long the client has been trying to ask for a new lease or rebind an existing lease. Obviously, the server can't see if the partner responds to these queries. It only gets copies of the DHCP messages sent by the client. The clients must bump these values when they retry to obtain a lease. If these values are zero, the server suspecting that the partner is down cannot determine whether the partner actually responds to DHCP. If these values are greater than 0 (and greater than "max-ack-delay") the server can assume that the partner hasn't responded to them because the clients are retrying.

For every client that sends a DHCPDISCOVER or DHCPREBIND to the partner server and (finally) sets the "secs" field value greater than "max-ack-delay", the other server bumps up its internal counter of unacked clients. Again, it does so only when it has been unable to communicate with the partner server over the control channel longer than the configured "max-response-delay". A single successful heartbeat over the control channel will clear the counters of the unacked clients and make the server believe that the partner is healthy. It will also stop looking at the "secs" and "Elapsed Time" values. The "max-unacked-clients" setting no longer matters until the next communication issue over the control channel.

If the "max-unacked-clients" value is exceeded, the server can finally transition to the partner-down state and handle both the traffic directed to itself and the traffic directed to the inoperative partner. Since the state transitions are only carried out after a heartbeat attempt, there may be a slight delay between exceeding the "max-unacked-clients" value and actually transitioning to the partner-down state.
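
To make the sequence easier to follow, here is a toy Python model of one server watching its partner. It is a sketch of the behavior described in this email, not Kea's implementation: the class and method names are invented, time is passed in explicitly as milliseconds, and the default values are only examples.

```python
from dataclasses import dataclass, field


@dataclass
class PartnerMonitor:
    """Toy model of HA failure detection (hypothetical, not Kea code)."""
    max_response_delay_ms: int = 60000   # example value from this email
    max_ack_delay_ms: int = 10000        # example value
    max_unacked_clients: int = 10        # example value
    last_contact_ms: int = 0             # time of the last successful heartbeat
    unacked: set = field(default_factory=set)  # clients counted as unacked

    def heartbeat_ok(self, now_ms: int) -> None:
        # A single successful heartbeat clears the unacked-client counters.
        self.last_contact_ms = now_ms
        self.unacked.clear()

    def communication_interrupted(self, now_ms: int) -> bool:
        # True once the control channel has been silent longer than
        # max-response-delay.
        return now_ms - self.last_contact_ms > self.max_response_delay_ms

    def partner_query_seen(self, client_id: str, secs: int, now_ms: int) -> None:
        # Queries addressed to the partner are analyzed only while the
        # control channel is considered interrupted.
        if not self.communication_interrupted(now_ms):
            return
        # "secs" is in seconds; max-ack-delay is in milliseconds.
        if secs * 1000 > self.max_ack_delay_ms:
            self.unacked.add(client_id)

    def should_partner_down(self, now_ms: int) -> bool:
        # Meant to be evaluated after a (failed) heartbeat attempt.
        return (self.communication_interrupted(now_ms)
                and len(self.unacked) > self.max_unacked_clients)


if __name__ == "__main__":
    mon = PartnerMonitor(max_unacked_clients=2)
    mon.heartbeat_ok(0)
    for client in ("aa", "bb", "cc"):
        mon.partner_query_seen(client, secs=15, now_ms=70000)
    print(mon.should_partner_down(70000))
```

Note that a client whose "secs" value stays at 0 is never counted, no matter how often it retries, so such clients cannot contribute to failure detection in this model.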

I looked into our documentation and realized that although all of these pieces are described there, it can be confusing because the ARM lacks a sequence diagram or an example of how the failover process can look end to end. That's something we should address.

Going over the previous emails, I see that users can see different failover strategies, depending on the types of failures they are likely to experience in their setup. They are interesting cases, and we will discuss them internally. We could consider some alternative failure detection strategies, selectable with the HA configuration, but we should be aware that there is no one-size-fits-all solution. There is always a possibility that a true failure won't be detected or a false failure will be detected, leading to a split-brain situation.

It would be useful if you could please open tickets in GitLab to describe your failover scenarios and the desired behavior. Please disregard it if you have already opened them.


Kind regards,
Marcin Siodelski

Sr. Software Engineer,
ISC

On 10.01.2023 03:07, Eric Graham wrote:

This is my understanding of how the unacked clients functionality works. My explanation is based upon the DHCP4 source code and may differ for DHCP6. I will include references at the bottom of my email which I encourage double-checking for accuracy. I am not a contributor to Kea and have not thoroughly tested the conclusions I draw here.
1. The DHCP packet enters Kea. The HA hook receives the packet in the buffer4Receive [1] function. The packet contents are parsed and dropped if invalid.
2. The packet is checked to be in scope [2][3][4][5][6] (and if it isn't, the packet status is set to NEXT_STEP_DROP [21]). Whether a packet is in scope is decided by the following:
   a. If the packet is not of a type that HA handles (that is, it is not one of DHCPDISCOVER, DHCPREQUEST, DHCPDECLINE, DHCPRELEASE, or DHCPINFORM [7]), then the current server will process it [8].
   b. If HA is configured in load balancing mode, the packet is classed according to the aforementioned HBA defined in RFC 3074 section 6 [9][10]. The HBA returns the server that must handle the packet (either primary or secondary). Otherwise (server is in hot standby), the packet is classed as belonging to the primary server in the HA configuration [12]. The class given in either of these conditions is the defined name of the respective server, coming from the HA section of the Kea DHCP4 configuration [13].
   c. The current server will process the packet if it is serving packets with the class determined in (2)(b).
Note: every heartbeat, the servers send each other their scopes [15]. A failed heartbeat sets the HA status to "unavailable" [24], which eventually transitions the server to partner down state.
3. If the server is in a communication interrupted state and the packet is not classed for the current server, then:
   a. Maintain a global counter, incrementing it once per packet (every successful heartbeat counts as a "poke" for the partner [16], which resets this global counter to zero [17]).
   b. Get the "secs" field of the packet. Compare the value to the value configured in the Kea DHCP4 configuration for "max-ack-delay" [18], or 10 seconds by default [19]. If the value of this field is greater than the max-ack-delay, the packet is considered unacked [20]. All packets (unacked or not) are kept track of in a map containing the hardware address, client ID, and last unacked status; if the packet is being received unacked, and it has not been previously recorded as being unacked (that is, the packet secs field just exceeded the max-ack-delay threshold for the first time), the server logs a warning message.
4. A failure is detected if the number of packets in the unacked state is greater than the "max-unacked-clients" setting of the Kea DHCP4 config [22] (or 10 by default [19]). If a failure is detected, the server eventually transitions to partner-down state [23]. More information about when exactly the server transitions to partner-down state is shown by the usages of HAService::shouldPartnerDown() [25] (in other words, I'm not digging into that tonight).
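
The scope decision in step 2 can be sketched in Python as follows. This is a rough illustration under loud assumptions, not the code from query_filter.cc: the function names and the server names "server1" and "server2" are made up, and a SHA-256 byte stands in for the RFC 3074 hash, so the server it picks for a given client will not match what Kea picks.

```python
import hashlib

# Message types that HA load-balances, per step 2(a) above.
HA_TYPES = {"DHCPDISCOVER", "DHCPREQUEST", "DHCPDECLINE",
            "DHCPRELEASE", "DHCPINFORM"}


def owner_of(msg_type, client_key, mode,
             primary="server1", secondary="server2"):
    """Return the name of the server the packet is classed for, or None
    when the packet is not an HA type (hypothetical helper)."""
    if msg_type not in HA_TYPES:
        return None
    if mode == "load-balancing":
        # Stand-in hash, NOT the RFC 3074 algorithm: split clients in two
        # buckets based on the first byte of a SHA-256 digest.
        bucket = hashlib.sha256(client_key).digest()[0] % 2
        return primary if bucket == 0 else secondary
    # Hot standby: everything is classed for the primary server.
    return primary


def in_scope(served_scopes, msg_type, client_key, mode):
    """True if a server currently serving `served_scopes` should process
    the packet (hypothetical helper)."""
    owner = owner_of(msg_type, client_key, mode)
    if owner is None:
        # Non-HA types are always processed by the receiving server.
        return True
    return owner in served_scopes


if __name__ == "__main__":
    mac = bytes.fromhex("001122334455")
    print(owner_of("DHCPDISCOVER", mac, "load-balancing"))
    print(in_scope({"server1", "server2"}, "DHCPDISCOVER", mac, "load-balancing"))
```

A server in partner-down state would serve both scopes, which is modelled here by passing both names in `served_scopes`.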
[1]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_impl.cc#L60-L111
[2]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1021
[3]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1029-L1047
[4]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1034
[5]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L376
[6]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L382-L414
[7]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L51-L71
[8]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L395
[9]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L416-L446
[10]: https://www.rfc-editor.org/rfc/rfc3074
[11]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L413
[12]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L398
[13]: https://kea.readthedocs.io/en/kea-2.2.0/arm/hooks.html#load-balancing-configuration
[14]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/communication_state.cc#L617-L625
[15]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1757-L1758
[16]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1793-L1794
[17]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/communication_state.cc#L274
[18]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_config_parser.cc#L180-L181
[19]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_config.cc#L166
[20]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/communication_state.cc#L652
[21]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_impl.cc#L104
[22]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_config_parser.cc#L184-L185
[23]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1097
[24]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1799
[25]: https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1081-L1106

*Eric Graham*
/DevOps Specialist/
Direct: 605.990.1859
[email protected]
------------------------------------------------------------------------
*From:* Kea-users <[email protected]> on behalf of Kevin P. Fleming <[email protected]>
*Sent:* Monday, January 9, 2023 12:38 PM
*To:* [email protected] <[email protected]>
*Subject:* Re: [Kea-users] Load-Balancing Network issue between Relay and Kea
On Mon, Jan 9, 2023, at 11:54, Veronique Lefebure wrote:
Very interesting thread.
Mathias, you wrote "Expected behaviour: Kea 2 sees the unacked clients of Kea 1 and sets Kea 1 in partner-down state and handles all requests.", but if there is no traffic between DHCP clients and Kea1, then the value of max-unacked-clients on server1 cannot increase anyway, right? In other words, Kea2 cannot "see" anything?
It can 'see', because it *also* saw all of the client requests and knows which ones it expected to be handled by Kea1 (as noted earlier in the thread it even emits a log message indicating this).
Forgive my presumption, but I assumed that 'max-unacked-clients' would be a counter of 'unacked clients' which belong to a Kea server *other than this one*. I don't immediately know how counting the number of clients *this server* has not acked would be useful, although I won't be surprised to learn that it is useful to someone.


--
ISC funds the development of this software with paid support subscriptions. 
Contact us at https://www.isc.org/contact/ for more information.

To unsubscribe visit https://lists.isc.org/mailman/listinfo/kea-users.

Kea-users mailing list
[email protected]
https://lists.isc.org/mailman/listinfo/kea-users
