On 06.02.26 at 09:23, Stefan Hanreich wrote:
On 2/1/26 3:31 PM, Maurice Klein wrote:
Basically, I see the vnet and subnet part as an issue.
Since this kind of setup doesn't require any defined subnets, the
current configuration doesn't fully make sense.
I guess you could still have a subnet configuration and configure all
the host addresses inside that subnet, but it's not really necessary.
Every VM route would be a /32 route, and the configured address on
that bridge (gateway field) would also be a /32.
We would still need a local IP on the PVE host that acts as a gateway
and preferably an IP for the VM inside the subnet so you can route the
traffic for the /32 IPs there. So we'd need to configure e.g.
192.0.2.0/24 as subnet, then have the host as gateway (e.g. 192.0.2.1)
and each VM gets an IP inside that subnet (which could automatically be
handled via IPAM / DHCP). Looking at other implementations (e.g.
kube-router) there's even a whole subnet pool and each node gets one
subnet from that pool - but that's easier done with containers than VMs,
so I think the approach with one shared subnet seems easier
(particularly for VM mobility).
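For illustration, that shared-subnet variant could look roughly like this on a node - purely a sketch, the addresses, prefix length and vnet name are made up:

ip address add 192.0.2.1/24 dev vnet0    # host acts as gateway for the shared subnet
ip route add 192.0.2.10/32 dev vnet0     # per-VM /32, e.g. so it can be announced to the other nodes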
I think I didn't explain that properly.
Basically, the whole idea is to have a gateway IP like 192.0.2.1/32 on
the PVE host on that bridge, and not have a /24 or similar route at all.
Guests then also have addresses, whatever they might look like.
For example, a guest could have 1.1.1.1/32 - usually always a /32,
although I guess for some use cases it could be beneficial for a guest
to get more than a /32, but let's put that aside for now.
Now there is no need or reason to define which subnet a guest is on, and
no need for it to be in the same subnet as the host.
The guest would configure its IP statically inside the guest, and it
would usually be a /32.
Now, on the PVE host, a host route to 1.1.1.1/32 would be added with the
following command:

ip route add 1.1.1.1/32 dev bridgetest

The guest configuration would look like this (simplified and shortened):
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP>
    inet 1.1.1.1/32

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.0.2.1       0.0.0.0         UG    100    0         0 eth0
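For completeness, a rough sketch of both sides (interface names illustrative). Note that the guest needs a device route to reach the off-subnet gateway, unless the default route is marked onlink:

# on the PVE host
ip address add 192.0.2.1/32 dev bridgetest    # gateway address on the bridge
ip route add 1.1.1.1/32 dev bridgetest        # host route towards the guest

# inside the guest
ip address add 1.1.1.1/32 dev eth0
ip route add 192.0.2.1/32 dev eth0            # make the gateway reachable as an on-link /32
ip route add default via 192.0.2.1 dev eth0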
Now, the biggest thing this enables: in PVE clusters, if we build for
example an iBGP full mesh, the routes get shared.
There could be any topology now, and routing would adapt.
Just as an example - while this is not a great topology, it illustrates
the point:
GW-1 GW-2
| \ / |
| \ / |
| \ / |
pve1--pve3
\ /
\ /
pve2
Any PVE node can fail and everything would still be reachable.
The shortest path will always be chosen.
Any link can fail.
Any gateway can fail.
Even multiple links failing is okay.
There is no chance for loops, because every link is point-to-point.
Much like the full-mesh Ceph setup with OSPF or OpenFabric.
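Purely as an illustration of the iBGP idea (ASN and neighbor addresses made up), redistributing the kernel /32 host routes into BGP on each node via FRR could look roughly like this:

vtysh -c 'configure terminal' \
      -c 'router bgp 65000' \
      -c 'neighbor 10.0.0.2 remote-as 65000' \
      -c 'neighbor 10.0.0.3 remote-as 65000' \
      -c 'address-family ipv4 unicast' \
      -c 'redistribute kernel' \
      -c 'exit-address-family'

Each node peers with every other node (full mesh), so whichever host currently holds the /32 route for a guest announces it.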
The same could be achieved with EVPN/VXLAN, an anycast gateway and
multiple exit nodes.
The problem is the complexity, and because the gateways only get bigger
aggregate routes (a /24 or larger) instead of the /32s, they will not
always use the optimal path, which increases latency and puts
unnecessary routing load on hosts where the VM isn't currently living.
And all that just to have one L2 domain, which often brings more
disadvantages than advantages.
I hope I explained it well now; if not, feel free to ask anything. I
could also provide more extensive documentation with screenshots of
everything.
When the tap interface of a VM gets plugged, a route needs to be created.
Per-VM routes get created with the command
ip route add 192.168.1.5/32 dev routedbridge.
The /32 gateway address needs to be configured on the bridge as well.
This could be done in the respective tap_plug / veth_create functions
inside pve-network [1]. You can override them on a per-zone basis, so
that would fit right in. We'd have to implement analogous functions for
teardown though, so we can remove the routes when updating / deleting
the tap / veth.
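In shell terms, the plug / teardown pair would roughly boil down to (device name and address purely illustrative):

ip route add 192.168.1.5/32 dev routedbridge    # on tap/veth plug
ip route del 192.168.1.5/32 dev routedbridge    # on teardown / unplug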
Someone has actually implemented a quite similar thing by utilizing
hooks and a dedicated config file for each VM - see [2]. They're using
IPv6 link-local addresses though (which I would personally also prefer),
but I'm unsure how that would work with Windows guests for instance, and
it might be weird / unintuitive for some users (see my previous mail).
Yeah, sounds good.
IPv6 support needs to be implemented for all of this as well; I'm just
starting with v4.
There also needs to be some way to configure the guests' IPs, but in
IPAM there is currently no way to set an IP for a VM - it only stores
IP/MAC bindings.
That's imo the real question left: where to store the additional IPs.
Zone config is awkward; PVE IPAM might be workable by introducing
additional fields (and, for a PoC, we could just deny using any IPAM
plugin other than that one and implement it later).
The network device is probably the best bet, since we can then utilize
the hotplug code in case an IP gets reassigned, which would be more
complicated with the other approaches. The only reason I'm reluctant is
that we'd be introducing a property there that is specific to one
particular SDN zone and unused by everything else.
I also feel like it would make sense in the network device, since it is
part of the configuration specific to that VM, but I get why you are
reluctant about that.
This honestly makes me reconsider the SDN approach a little bit.
I have an idea here that could be workable:
what if we add a field, but instead of calling it "guest IP" we call it
"routes"?
Essentially that is what it is, and it might have extra use cases apart
from what I'm trying to achieve.
That way, for this use case, you can use that field to add the needed
/32 host routes.
It wouldn't be specific to the SDN feature we build.
The SDN feature could then be more about configuring the bridge with the
right addresses and features, and enable us to later distribute the
routes via BGP and other means.
I looked into the hotplug scenarios as well, and that way those would be
solved.
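Just to sketch the idea - this is not an existing option - such a field on the network device could look something like this in the VM config:

net0: virtio=BC:24:11:00:00:01,bridge=routedbridge,routes=1.1.1.1/32

where the hypothetical routes property would make the zone add the corresponding /32 host route(s) on plug.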
There is also a potential security flaw: devices on that bridge can
steal a configured IP by simply replying to ARP.
That could be mitigated by disabling bridge learning and additionally
creating static ARP entries for the configured IPs.
That setting should be exposed in the zone configuration and probably be
on by default. There's also always the option of using IP / MAC filters
in the firewall although the static fdb / neighbor table approach is
preferable imo.
Perfect, I'm on the same page.
Implementing it via the fdb / neighbor tables also ensures that this
crucial protection is there for users who have the firewall disabled.
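For reference, pinning a guest's MAC and IP as discussed above could look roughly like this per tap device (device names, MAC and address purely illustrative):

bridge link set dev tap100i0 learning off                             # no MAC learning on the guest port
bridge fdb add bc:24:11:00:00:01 dev tap100i0 master static           # static fdb entry for the guest MAC
ip neigh replace 1.1.1.1 lladdr bc:24:11:00:00:01 dev routedbridge nud permanent   # static ARP entry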
[0] https://docs.cilium.io/en/stable/network/lb-ipam/#requesting-ips
[1]
https://git.proxmox.com/?p=pve-network.git;a=blob;f=src/PVE/Network/SDN/Zones.pm;h=4da94580e07d6b3dcb794f19ce9335412fa7bc41;hb=HEAD#l298
[2] https://siewert.io/posts/2022/announce-proxmox-vm-ips-via-bgp-1/