vdombrovski opened a new issue, #8967:
URL: https://github.com/apache/cloudstack/issues/8967
<!--
Verify first that your issue/request is not already reported on GitHub.
Also test if the latest release and main branch are affected too.
Always add information AFTER of these HTML comments, but no need to delete
the comments.
-->
##### ISSUE TYPE
<!-- Pick one below and delete the rest -->
* Bug Report
##### COMPONENT NAME
<!--
Categorize the issue, e.g. API, VR, VPN, UI, etc.
-->
~~~
VR
~~~
##### CLOUDSTACK VERSION
<!--
New line separated list of affected versions, commit ID for issues on main
branch.
-->
~~~
4.17.2
~~~
Note: I believe this also affects releases 4.18 and 4.19, as no relevant
changes have been made to the VR code.
##### CONFIGURATION
<!--
Information about the configuration if relevant, e.g. basic network,
advanced networking, etc. N/A otherwise
-->
- Advanced zone
- Standard networking using virtual routers only
- /24 pool of "public" IPs available for the virtual routers
##### OS / ENVIRONMENT
<!--
Information about the environment if relevant, N/A otherwise
-->
Ubuntu Focal 20.04
##### SUMMARY
<!-- Explain the problem/feature briefly -->
In our workflow, public IPs are assigned, unassigned, and reassigned to
different networks, possibly multiple times per day (mostly, but not only, via
https://github.com/apache/cloudstack-kubernetes-provider).
After a while, we noticed that many IP addresses that are unused according to
the ACS database were actually answering L2 ARP requests (arping). In
addition, some of our IPs return 2 different MACs in the arping replies:
```
ARPING XX.YY.25.156
60 bytes from 1e:00:fe:00:04:ac (XX.YY.25.156): index=0 time=306.714 usec
60 bytes from 1e:00:63:00:04:b5 (XX.YY.25.156): index=1 time=342.809 usec
```
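For anyone who wants to check a whole subnet for the same symptom, here is a minimal sketch that parses arping output in the format shown above and reports every IP answered by more than one MAC. The function name `find_conflicts` and the regular expression are illustrative only, and the sketch assumes the reply lines look exactly like the sample above.

```python
import re
from collections import defaultdict

# Matches reply lines such as:
#   60 bytes from 1e:00:fe:00:04:ac (XX.YY.25.156): index=0 time=306.714 usec
# Assumption: the arping build in use prints replies in this format.
REPLY_RE = re.compile(
    r"bytes from\s+((?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\s+\(([^)]+)\)"
)


def find_conflicts(arping_output):
    """Return {ip: set_of_macs} for every IP answered by more than one MAC."""
    macs_by_ip = defaultdict(set)
    for line in arping_output.splitlines():
        match = REPLY_RE.search(line)
        if match:
            mac, ip = match.group(1).lower(), match.group(2)
            macs_by_ip[ip].add(mac)
    # Keep only the IPs with at least two distinct responders.
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}


SAMPLE = """ARPING XX.YY.25.156
60 bytes from 1e:00:fe:00:04:ac (XX.YY.25.156): index=0 time=306.714 usec
60 bytes from 1e:00:63:00:04:b5 (XX.YY.25.156): index=1 time=342.809 usec
"""

if __name__ == "__main__":
    for ip, macs in find_conflicts(SAMPLE).items():
        print(ip, sorted(macs))
```

Feeding it the concatenated arping output for each address in the pool should give a quick list of conflicting IPs to cross-check against the `nics` table.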
Our investigation has found such conflicts on around 10 IPs in two of our
/24 subnets, which looks like too many to be a coincidence.
We have confirmed that both MAC addresses belong to alive and healthy
virtual routers:
```
[cloud]> select instance_id from nics where mac_address="1e:00:63:00:04:b5";
+-------------+
| instance_id |
+-------------+
| 5654 |
+-------------+
1 row in set (0.010 sec)
[cloud]> select instance_id from nics where mac_address="1e:00:fe:00:04:ac";
+-------------+
| instance_id |
+-------------+
| 5624 |
+-------------+
1 row in set (0.007 sec)
```
Both routers have the IP address configured on an interface:
```
ip a | grep XX.YY.25.156
inet XX.YY.25.156/24 brd XX.YY.25.255 scope global secondary eth2
```
Inside the "databag" of the so-called "illegitimate" router causing the
conflict, we can see the IP declared (sometimes with the flag add: true,
sometimes with add: false):
```
# /etc/cloudstack/ips.json
{
"add": true,
"broadcast": "XX.YY.25.255",
"cidr": "XX.YY.25.156/24",
"device": "eth2",
"first_i_p": false,
"gateway": "XX.YY.25.1",
"is_private_gateway": false,
"netmask": "255.255.255.0",
"network": "XX.YY.25.0/24",
"new_nic": false,
"nic_dev_id": 2,
"nw_type": "public",
"one_to_one_nat": false,
"public_ip": "XX.YY.25.156",
"size": "24",
"source_nat": false,
"vif_mac_address": "1e:00:63:00:04:b5"
}
```
To my understanding, this file is read/merged whenever an IP state change
is requested; however, it is not merged properly on unassignment.
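To make the suspected leftover entries easier to spot, here is a rough sketch that compares a parsed databag against the set of public IPs the router is supposed to hold. It is not the VR's actual merge code. It assumes the databag is a JSON object mapping device names (such as `eth2`) to lists of entries shaped like the one above; the `"id"` key in the sample, the function name `stale_public_ips`, and the `expected_ips` input (for example, taken from the ACS database) are all hypothetical.

```python
import json


def stale_public_ips(databag, expected_ips):
    """List (device, public_ip, add) for public entries not in expected_ips.

    Assumption: `databag` maps device names to lists of entry dicts that
    carry the "nw_type", "public_ip" and "add" fields shown in this issue.
    """
    stale = []
    for device, entries in databag.items():
        if not isinstance(entries, list):
            continue  # skip non-list metadata keys, if any
        for entry in entries:
            if not isinstance(entry, dict):
                continue
            if entry.get("nw_type") != "public":
                continue
            ip = entry.get("public_ip")
            if ip not in expected_ips:
                # Report the "add" flag too: both true and false were seen.
                stale.append((device, ip, entry.get("add")))
    return stale


# Hypothetical databag content used only to exercise the function.
SAMPLE_DATABAG = json.loads("""
{
  "id": "ips",
  "eth2": [
    {"public_ip": "XX.YY.25.10", "nw_type": "public", "add": true},
    {"public_ip": "XX.YY.25.156", "nw_type": "public", "add": true}
  ]
}
""")

if __name__ == "__main__":
    for device, ip, add in stale_public_ips(SAMPLE_DATABAG, {"XX.YY.25.10"}):
        print(device, ip, "add=%s" % add)
```

On a real router the input would be `json.load(open("/etc/cloudstack/ips.json"))`, provided the file really has the layout assumed here.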
Obviously, deleting and recreating all the virtual routers would solve the
issue, as this cache would be cleared; however, because this would cause
massive service disruption, we would like to make sure the root cause is
fixed first.
##### STEPS TO REPRODUCE
<!--
For bugs, show exactly how to reproduce the problem, using a minimal
test-case. Use Screenshots if accurate.
For new features, show how the feature would be used.
-->
No specific command line here, but I think bouncing a public IP between
2 guest routers (in separate VPCs, of course, if these are VPC networks)
several times would result in this issue.
1. Create 2 guest networks A and B
2. Create 2 VMs, one inside network A, the other inside network B
3. Associate an IP address to network A, and create a load balancer rule for
some port of VM A (should also work with port-forwarding and static NAT)
4. Release the IP, associate it to network B, and create a load balancer rule
for some port of VM B (should also work with port-forwarding and static NAT)
5. Repeat steps 3-4 until you get an IP address conflict (or until you see
invalid entries in the databag)
##### EXPECTED RESULTS
<!-- What did you expect to happen when running the steps above? -->
The IP is properly garbage collected on the router that no longer holds it.
##### ACTUAL RESULTS
<!-- What actually happened? -->
<!-- Paste verbatim command output between quotes below -->
IP address conflict