Nicolas Droux wrote:
> Hi Jonathan,

Hi Nicolas, thanks so much for your input. I'm a LOT closer to understanding what's going on here now.
What follows is another very long email, I'm sorry. This is a full day's research condensed into the shortest email I could manage without fear of leaving out anything important!

>> Just as a refresher, my Solaris server here is a VM running under VMware ESXi 3.5u3 (with all current patches). An extra layer of virtualisation does add extra questions, so I tried a ping test that would be entirely internal to the ESX host: pinging the global zone from the non-global [dns] zone.
>>
>> Traffic test #1
>> From within the dns zone:
>> bash-3.2# ping 192.168.1.60
>> no answer from 192.168.1.60

> So what is 192.168.1.60? I guess it's the global zone, but e1000g0 or e1000g1?

Yes, it's the global zone, which is running on e1000g0. The zone was running on e1000g1.

> If it's e1000g0 but dnsvnic0 is created on e1000g1 there will be no virtual switching between these data-links.

Ok, thanks for clearing that up for me. I'm still getting my head around the difference between a shared kernel and non-shared network stacks. The point that I was trying to make with this test was that traffic wasn't going over any physical links. Unfortunately we have 2 levels of virtualisation going on here (ESX & Crossbow) which makes the terminology that little bit harder to visualise. In this case the traffic was leaving the zone and going over the "wire" to talk to the global zone. That "wire" is a VMware vSwitch, so the network traffic in this case was entirely self-contained within the ESX server. The actual physical NIC in the physical server wasn't used, which allowed me to rule it out as a cause of this issue, along with any physical network switches :)

>> So the global zone is replying to the non-global zone, 'dns' just isn't seeing the replies. This is sounding a lot like a weird vswitch bug.

> Not necessarily. It depends on how you wired your NICs. If e1000g0 and e1000g1 are connected to the same switch,

Yeah, they are.

> then the packet can go from dnsvnic0->e1000g1->switch->e1000g0->global zone.

That's right. A "vSwitch" in this case though.

> You may not see the reply come back to dnsvnic0 via global_zone->e1000g0->switch->e1000g1 due to the same problem you described initially with unicast packets not making it to the VNIC in the VMware VM.

Well it _should_ be working this way; it's frustrating that this isn't happening. Where else would it go?

>> Next I decided to try zone-to-zone traffic:
>> Server - vnic - IP
>> Zone-template - zonevnic0 (via e1000g1) - 192.168.1.61
>> DNS - dnsvnic0 (via e1000g1) - 192.168.1.62
>>
>> This worked... DNS could ping Zone-template.

> Because in this case you are going through the virtual switch.

I expected that it would, but it's always encouraging to actually see a successful test for a change! Now when you say "virtual switch", this time we're talking about the Crossbow internal switch and not the VMware vSwitch. I just wanted to point that out for the sake of clarity as we keep digging deeper into this.

>> What really surprised me was that my snoop on e1000g1 was showing the traffic. It was my understanding that vnic-to-vnic traffic attached to the same pnic never actually went across the wire, so why is snoop on a physical interface showing vnic <> vnic traffic?

> That's done by design to allow the global zone/dom0 to see all traffic exchanged between the VMs/Zones. It's similar to a monitoring port on a physical switch.

Ah, thanks for clearing that one up :)
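For anyone following along, a setup like the one above gets put together roughly like this - the two zones are exclusive-IP zones, each with its own VNIC over e1000g1. The zonecfg fragment below is illustrative rather than a transcript of what I actually ran:

bash-3.2# dladm create-vnic -l e1000g1 zonevnic0
bash-3.2# dladm create-vnic -l e1000g1 dnsvnic0
bash-3.2# zonecfg -z dns
zonecfg:dns> set ip-type=exclusive
zonecfg:dns> add net
zonecfg:dns:net> set physical=dnsvnic0
zonecfg:dns:net> end
zonecfg:dns> commit
zonecfg:dns> exit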
>> A) Something in crossbow isn't working properly.
>> B) I'm misunderstanding how vnics talk to each other. I understand etherstubs, but it just makes sense that inter-zone traffic shouldn't be sending traffic down a bottleneck like a pNIC when it's all *internal* anyway.
>> C) The traffic isn't actually going out the physical interface across the wire, but it is going via the logical concept of the e1000g1 interface, which snoop is reporting on - which is rather confusing to an end user like me trying to diagnose this using snoop :(
>>
>> Can anyone clarify this one for me?

Based on your previous comment above, you're saying that the answer is C)? So just to confirm that point, as it's pretty crucial that I understand this distinction correctly: "snoop -d e1000g1" is showing traffic that _isn't_ actually going across the 'wire' on that 'physical' interface, but rather traffic that is passing "internally, *behind* the physical interface" - to make observability easier for administrators in the global zone. If I were able to watch the switch port that e1000g1 was plugged into, I'd see no packets doing a return loop?
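On the etherstub point from B) above: my understanding is that if I ever want to guarantee inter-zone traffic stays off the pNIC entirely, I could hang both VNICs off an etherstub instead of e1000g1. Something like this (etherstub0 is a made-up name, and I haven't actually tried this here yet):

bash-3.2# dladm create-etherstub etherstub0
bash-3.2# dladm create-vnic -l etherstub0 zonevnic0
bash-3.2# dladm create-vnic -l etherstub0 dnsvnic0

The trade-off of course is that an etherstub has no uplink, so the zones would then need another VNIC (or a routing zone) to reach anything outside the box.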
>> The WTF moment of the night was this:
>> vSwitch security in ESX is configured like this by default:
>> Promiscuous Mode: Disabled
>> MAC Address Changes: Accept
>> Forged Transmits: Accept
>>
>> These sound like reasonable defaults to me; toggling the Promiscuous flag, to my understanding, would pretty much turn the vSwitch into a "vHub"!
>>
>> I left a [non-returning] ping running between dns and the global zone, and decided to try enabling Promiscuous mode anyway. No change.
>>
>> I started a snoop up on e1000g1, and suddenly the sparse-template <> dns ping that I started in another terminal moments ago started working. I disabled the snoop, and it stopped working again.
>>
>> !!!?
>>
>> Enabling the promiscuous flag on the e1000g1 driver is suddenly "fixing" my traffic problem.
>>
>> My best interpretation of this data is that 1 of 3 things isn't working, and I'm starting to get out of my depth here fast.
>>
>> A) Crossbow itself is doing something 'funny' with the way traffic is being passed on to the vswitch, which is causing it to not send traffic for this mac address down the correct virtual port on the switch. ARP spoofing is common enough, and both of those options are already enabled, so it would seem it's something else that's causing it to get confused. Sadly there isn't any interface to the vSwitch that I'm aware of to pull some stats/logs from. Funny promiscuous ARPs? Sending traffic down both pnics? Something else to confuse the vswitch? I'm out of skills to troubleshoot this option any further.
>>
>> B) The vSwitch in ESXi has a bug. If so, why is it only affecting crossbow... ESX is very widely used, so if there was a glaring bug in the vSwitch ethernet implementation it would be very common and public knowledge. Crossbow is new enough; is it possible that I'm the first to have tried this configuration under ESX and thus am the first to notice this issue? There aren't any other options within ESX that I'm aware of that I can use to get some further data on the vSwitch itself, so I'm at a loss as to how I troubleshoot this one further.
>>
>> I'm also just using the free ESXi, so I can't contact VMware for support on this, and at this point it would be a pretty vague bug report anyway :/
>>
>> C) The Intel PRO/1000 vNIC that ESX is exposing to the VM has a bug in it, or the Solaris e1000g driver has a bug when sending crossbow traffic across it (or a combination of the two). The Intel PRO/1000 is a very common server NIC, and I'd be gobsmacked if there was a bug with a real (non-virtual) e1000g adapter that the Sun folk hadn't picked up in their prerelease testing.
>>
>> The only option for vNICs within ESX, for a 64-bit Solaris host, is the e1000 NIC. I'm trying to set up a 32-bit host to see what NIC that ends up with. If this provides a different result, that at least gives us some better information on where to start looking!
>>
>> Any further directions or feedback would be most welcome. If I'm heading in the wrong direction, please do tell me :)

> I have a theory.
>
> When you create a VNIC, Crossbow will try to associate the unicast MAC address with the NIC. Most NICs have hardware unicast filters which allow traffic for multiple unicast addresses to be received without turning the NIC into promiscuous mode. e1000g provides multiple such slots for unicast addresses.

I didn't realise that. I must have fallen behind a bit on modern network card technology. I take it there is a performance penalty when running in promiscuous mode to handle multiple MAC addresses, as the filtering is no longer done in hardware by the NIC itself?

> What could be happening is that e1000g running in the VM happily allows Crossbow to program the unicast address for the VNIC address, but the VMware back-end driver or virtual switch doesn't know about that address. So all broadcast and multicast packets are going in and out as expected, all traffic from the VNIC is going out without a problem, but when unicast packets are coming back for the unicast address of the VNIC, they never make it to the VM.

That makes a lot of sense, and I think you're quite correct about that. It's either that, or ESX is getting upset with promiscuous mode being enabled on the NIC and as a security precaution it's not allowing the traffic to be delivered to the virtual NIC in the VM. (Explored further down this email.)

I've only experienced these weird issues while using crossbow, but if the above is true then this is not a crossbow problem per se at all; it's simply that crossbow is adding MAC addresses to the [VMware] e1000g card (or enabling promiscuous mode), which is causing a problem at some layer within ESX, and there haven't been any other networking scenarios in which this would have happened prior to crossbow. (Maybe network teaming, though this is not generally done *within* a VM; there is little-to-no point!)

If this is the heart of the issue, then I should be able to replicate this without needing to use a zone at all, provided I can set up crossbow in the global zone in such a way that it uses different MAC addresses depending on the destination.... Now that I think about this, I think I did hit this when I started off with just the 1 NIC in the VM. I moved to a second e1000, separating the global/zone traffic as a sanity check quite early on.... hrm.
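A sketch of what I have in mind for that no-zone test (testvnic0 and the .63 address are made up for the example, and I haven't run this yet):

bash-3.2# dladm create-vnic -l e1000g1 testvnic0
bash-3.2# ifconfig testvnic0 plumb 192.168.1.63/24 up

Then ping 192.168.1.63 from another machine on the LAN. If the theory holds, the ARP broadcasts should arrive fine, but the unicast echo requests addressed to the VNIC's MAC should never show up inside the VM - until a snoop on e1000g1 flips the NIC into promiscuous mode.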
> If you simply enable promiscuous mode on the VMware virtual switch, then it will take these packets, but the back-end driver instance associated with e1000g might still filter out these packets by default and drop them. In order to see the packets you have to turn on promiscuous mode on e1000g1 itself, which probably causes the VMware back-end to send all packets up.

Agreed. VMware ESX provides some granularity when it comes to setting promiscuous options. It can be set globally on the whole switch, or at a "port group" level, though I don't see anywhere to toggle it on a vNIC or per-VM basis. Port groups are an administrative abstraction of a group of ports on a specific vSwitch, a bit like VLANs but without network-level tagging (though they can be used to enable/set up VLANs too). I have ALL virtual machines running off 1 vSwitch, so enabling promiscuous mode on the vSwitch (for all VMs) just to get my zone server working with crossbow isn't an attractive option. Making a dedicated *promiscuous-on* port group that only contains this one Solaris server may work better though.

> If this theory is correct, what would help is to allow the VMware back-end to send up all packets received from the VMware virtual switch without filtering. But I don't know if VMware provides that option.

I think that is what a port group will allow me to do; however, remember that by itself this didn't fix the problem. I had to have the VM's NIC in promiscuous mode too for traffic to flow correctly. I was doing this (accidentally at the time) by running snoop. Is there a better way to enable promiscuous mode on an interface within Solaris permanently? All I could dig up with google was this:
http://www.kernelfaq.com/2008/04/enabling-and-disabling-promiscuous-mode.html
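The only stopgap I can think of in the meantime is to make the accidental fix deliberate: leave a snoop running in the background and throw the capture away, on the assumption that snoop keeps the interface promiscuous for as long as it runs. Something like (untested as a permanent fixture):

bash-3.2# snoop -q -d e1000g1 -o /dev/null &

Ugly, and it obviously doesn't survive a reboot without being wrapped in an SMF service or rc script, but it should hold things together while I keep digging.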
MAC Filtering. Going back to what you said earlier about the e1000g driver handling multiple unicast MACs concurrently in hardware: in my googling I've discovered that not all e1000 NICs support this feature.
* Is there a way to tell if the VMware emulated e1000 is advertising this feature in the 'hardware' to the guest?
* Is there a way to tell if crossbow is making use of it rather than falling back to the "less fancy" promiscuous mode instead?
This would be most valuable to better understand what we're seeing here! dladm show-linkprop isn't showing me anything. I guess we're not quite there yet?
http://markmail.org/message/qiqygyqxt5t6qp5b

My current working theory is this:

*vSwitch layer*
VMware ESX knows exactly which vSwitch ports are connected to a physical NIC uplinking the vSwitch to the physical world, and which ports are connected to NICs within VMs. The vSwitch "host" ports should only ever have a single MAC address on them at any given time, as they're directly connected to a single NIC, and it enforces this limit as a security measure. This would prevent MAC spoofing attacks, for example. Recall that by default within a vSwitch "MAC Address Changes" are allowed, as are "Forged Transmits", which strongly hints at the behaviour that I'm theorising.

*NIC layer*
I'm expecting that the VMware-provided emulated e1000 NIC has no concept of MAC address slots on the vSwitch end - given the behaviour of 1 MAC address per port at the vSwitch level, why would it ever need to support multiple MACs? Within the VM however, crossbow is detecting an e1000 pNIC that does support multiple MACs, and it's making use of these slots for the VNICs' MACs as they get added, rather than toggling promiscuous mode on the e1000g.

**Outbound traffic**
ESX is allowing the "forged transmits" from the VNIC's additional MAC address, and broadcasts/multicasts are being passed through both the vSwitch and the e1000 correctly.

**Inbound traffic**

*vSwitch layer*
ESX knows which MAC address the e1000 has within the guest, and it will have this entered into its MAC forwarding table for the port that the VM is connected to. Exactly what it's doing with the VNIC's MAC that is being broadcast around as ARP requests... I have no idea. Enabling promiscuous mode at the vSwitch level bypasses/disables the MAC forwarding table, so now frames with the VNIC's MAC are getting to the right switch port. This functionality alone still doesn't fix the problem because:

*NIC layer*
The ESX end of the e1000 NIC only knows about the primary MAC address of the NIC, so it's not passing frames addressed to the VNIC's MAC address into the VM guest's end of the e1000 for further processing by crossbow. When snoop is started, the interface is set to promiscuous mode in the guest, and this is being trapped by the ESX end of the e1000, which then enables promiscuous mode on its end too. With all frames finally passing into the guest end of the e1000, crossbow can do its job and everything starts working! Phew!

I'm having to theorise much of the ESX behaviour, as there is simply no way to get the information I need out of ESX itself, but this model all seems to fit pretty well, don't you think?

Way forward:
I can focus on testing the promiscuous mode behaviour on the vSwitch port group, which may lead to a tidy workaround at that level. At the NIC level, if my theory is correct, it would seem that I really need a way to make crossbow enable promiscuous mode on the NIC rather than adding a "hardware-based MAC filter" to the e1000, as it doesn't seem that this is going to work in a VMware ESX environment.

> Nicolas.

Jonathan
--
This message posted from opensolaris.org
