Nicolas Droux wrote:
> Hi Jonathan,
Hi Nicolas, thanks so much for your input. I'm a LOT closer to understanding 
what's going on here now.

What follows is another very long email, I'm sorry. It's a full day's 
research condensed into the shortest email I could manage without fear of 
leaving out anything important!

>> Just as a refresher, my Solaris server here is VM running under VMware 
>> ESXi 3.5u3 (with all current patches). An extra layer of 
>> virtualisation does add extra questions, so I tried a ping test that 
>> would be entirely internal to the ESX host; pinging the global zone 
>> from the non-global [dns] zone.
>>
>> Traffic test #1
>> From within the dns zone:
>> bash-3.2# ping 192.168.1.60
>> no answer from 192.168.1.60
> 
> So what is 192.168.1.60? I guess it's the global zone, but e1000g0 or 
> e1000g1?
Yes, it's the global zone, which is on e1000g0. The dns zone was running on 
e1000g1.

> If it's e1000g0 but dnsvnic0 is created on e1000g1 there will be no 
> virtual switching between these data-links.
Ok, thanks for clearing that up for me. I'm still getting my head around the 
difference between a shared kernel and non-shared network stacks.

The point I was trying to make with this test was that the traffic wasn't 
going over any physical links. Unfortunately we have two levels of 
virtualisation going on here (ESX & Crossbow), which makes the terminology 
that little bit harder to visualise.

In this case the traffic was leaving the zone and going over the "wire" to talk 
to the global zone. That "wire" is a VMware vSwitch, so the network traffic in 
this case was entirely self-contained within the ESX server. The actual 
physical NIC in the physical server wasn't used, which allowed me to rule it 
out as a cause of this issue, along with any physical network switches :)

>> So the global zone is replying to the non-global zone, 'dns' just 
>> isn't seeing the replies.
>> This is sounding a lot like a weird vswitch bug.
> 
> Not necessarily. It depends on how you wired your NICs. If e1000g0 and 
> e1000g1 are connected to the same switch,
Yeah, they are.

> then the packet can go from 
> dnsvnic0->e1000g1->switch->e1000g0->global zone.
That's right. A "vSwitch" in this case though.

> You may not see the 
> reply come back to dnsvnic0 via global_zone->e1000g0->switch->e1000g1 
> due to the same problem you described initially with unicast packets not 
> making it to the VNIC in the VMware VM.
Well, it _should_ be working this way; it's frustrating that it isn't. Where 
else would the traffic go?
 
>> Next I decided to try zone-to-zone traffic:
>> Server - vnic - IP
>> Zone-template - zonevnic0 (via e1000g1) - 192.168.1.61
>> DNS - dnsvnic0 (via e1000g1) - 192.168.1.62
>>
>> This worked... DNS could ping Zone-template.
> 
> Because in this case you are going through the virtual switch.
I expected that it would, but it's always encouraging to actually see a 
successful test for a change!
Now, when you say "virtual switch", this time we're talking about the Crossbow 
internal switch and not the VMware vSwitch. I just wanted to point that out for 
the sake of clarity as we keep digging deeper into this.
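
In case it's useful context, this is roughly how the two VNICs were created. 
I'm writing the dladm syntax from memory for this build, so treat it as a 
sketch rather than a transcript; the MAC addresses are auto-generated by 
Crossbow:

bash-3.2# dladm create-vnic -l e1000g1 zonevnic0
bash-3.2# dladm create-vnic -l e1000g1 dnsvnic0
bash-3.2# dladm show-vnic
(show-vnic lists each VNIC, the link it sits over, and the MAC it was assigned)

The zones then just use zonevnic0/dnsvnic0 as their network interfaces, with 
192.168.1.61 and .62 as in the table above.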

>> What really surprised me was that my snoop on e1000g1 was 
>> showing the traffic. It was my understanding that vnic-to-vnic traffic 
>> that's attached to the same pnic never actually went across the wire, 
>> so why is snoop on a physical interface showing vnic <> vnic traffic ?
> 
> That's done by design to allow the global zone/dom0 see all traffic 
> exchanged between the VMs/Zones. It's similar to a monitoring port on a 
> physical switch.
Ah, thanks for clearing that one up :)

>> A) Something in crossbow isn't working properly.
>> B) I'm misunderstanding how vnics talk to each other. I understand 
>> etherstubs, but it just makes sense that inter-zone traffic shouldn't 
>> be sending traffic down a bottleneck like a pNIC when it's all 
>> *internal* anyway.
>> C) The traffic isn't actually going out the physical interface across 
>> the wire, but it is going via the logical concept of the e1000g1 
>> interface, which snoop is reporting on - which is rather confusing to 
>> an end user like me trying to diagnose this using snoop :(
>>
>> Can anyone clarify this one for me?

Based on your previous comment above, you're saying that the answer is C)?

So just to confirm that point, as it's pretty crucial that I understand this 
distinction correctly: "snoop -d e1000g1" is showing traffic that _isn't_ 
actually going across the 'wire' on that 'physical' interface, but rather 
traffic that is passing "internally, *behind* the physical interface", to make 
observability easier for administrators in the global zone.
If I were able to watch the switch port that e1000g1 was plugged into, I'd see 
no packets doing a return loop?
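
One thing that might help keep the two views separate: since a VNIC is itself 
a datalink, I believe I can snoop it directly as well as the underlying NIC. 
Command names from memory, so again treat this as a sketch:

bash-3.2# snoop -d dnsvnic0
(should show only traffic to/from the dns zone's VNIC)
bash-3.2# snoop -d e1000g1
(shows everything flowing "behind" e1000g1, including vnic <> vnic traffic 
that never touches the wire)

If that's right, comparing the two would make it easier to tell which packets 
are purely internal.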

>> The WTF moment of the night was this:
>> vSwitches security in ESX is configured like this by default:
>> Promiscuous Mode: Disabled
>> MAC Address Changes: Accept
>> Forged Transmits: Accept
>>
>> These sound like reasonable defaults to me, toggling the Promiscuous 
>> flag to my understanding would pretty much turn the vSwitch into a 
>> "vHub"!
>>
>> I left a [non-returning] ping running between dns and the global zone, 
>> and decided to try enabling Promiscuous mode anyway.
>> No change.
>>
>> I started a snoop up on e1000g1, and suddenly the sparse-template <> 
>> dns ping that I started in another terminal moments ago started 
>> working. I disabled the snoop, and it stopped working again.
>>
>> !!!?
>>
>> Enabling the promiscuous flag on the e1000g1 driver is suddenly 
>> "fixing" my traffic problem.
>>
>> My best interpretation of this data is that 1 of 3 things isn't 
>> working, and I'm starting to get out of my depth here fast.
>>
>> A) Crossbow itself is doing something 'funny' with the way traffic is 
>> being passed on to the vSwitch, which is causing it to not send 
>> traffic for this MAC address down the correct virtual port on the 
>> switch. ARP spoofing is common enough, and both of those options are 
>> already enabled, so it would seem something else is causing the 
>> confusion. Sadly there isn't any interface to the vSwitch that I'm 
>> aware of to pull stats/logs from.
>> Funny promiscuous ARPs? Sending traffic down both pNICs? Something 
>> else to confuse the vSwitch? I'm out of skills to troubleshoot this 
>> option any further.
>>
>> B) The vSwitch in ESXi has a bug. If so, why is it only affecting 
>> crossbow... ESX is very widely used so if there was a glaring bug in 
>> the vSwitch ethernet implementation it would be very common and public 
>> knowledge. Crossbow is new enough; is it possible that I'm the first 
>> to have tried this configuration under ESX and thus am the first to 
>> notice this issue?
>> There aren't any other options within ESX that I'm aware of that I can 
>> try to get some further data on the vSwitch itself, so I'm at a loss 
>> as to how I troubleshoot this one further.
>> I'm also just using the free ESXi, so I can't contact VMware for 
>> support on this and at this point it would be a pretty vague bug 
>> report anyway :/
>>
>> C) The Intel PRO/1000 vNIC that ESX is exposing to the VM has a bug in 
>> it, or the Solaris e1000g driver has a bug when sending Crossbow 
>> traffic across it (or a combination of the two).
>> The Intel PRO/1000 is a very common server NIC, and I'd be gobsmacked 
>> if there was a bug with a real (non-virtual) e1000g adapter that the 
>> Sun folk hadn't picked up in their prerelease testing.
>>
>> The only option for vNICs within ESX, for a 64-bit Solaris host, is 
>> the e1000 NIC. I'm trying to set up a 32-bit host to see what NIC that 
>> ends up with. If this provides a different result, that at least gives 
>> us some better information on where to start looking!
>>
>> Any further directions or feedback would be most welcome. If I'm 
>> heading in the wrong direction, please do tell me :)
> 
> I have a theory.
> 
> When you create a VNIC, Crossbow will try to associate the unicast MAC 
> address with the NIC. Most NICs have hardware unicast filters which 
> allow traffic for multiple unicast addresses to be received without 
> turning the NIC in promiscuous mode. e1000g provides multiple such slots 
> for unicast addresses.

I didn't realise that. I must have fallen behind a bit on modern network card 
technology. I take it there is a performance penalty when running in 
promiscuous mode to handle multiple MAC addresses, as the filtering is no 
longer done in hardware by the NIC itself?

> What could be happening is that e1000g running in the VM happily allows 
> Crossbow to program the unicast address for the VNIC address, but the 
> VMware back-end driver or virtual switch doesn't know about that 
> address. So all broadcast and multicast packets are going in and out as 
> expected, all traffic from the VNIC are going out without a problem, but 
> when unicast packets are coming back for the unicast address of the 
> VNIC, they never make it to the VM.

That makes a lot of sense, and I think you're quite correct about that. It's 
either that, or ESX is getting upset about promiscuous mode being enabled on 
the NIC and, as a security precaution, isn't allowing the traffic to be 
delivered to the virtual NIC in the VM. (Explored further down this email.)

I've only experienced these weird issues while using Crossbow, but if the above 
is true then this is not a Crossbow problem per se at all; it's simply that 
Crossbow is adding MAC addresses to the [VMware] e1000g card (or enabling 
promiscuous mode), which is causing a problem at some layer within ESX, and 
there haven't been any other networking scenarios in which this would have 
happened prior to Crossbow. (Maybe network teaming, though that is not 
generally done *within* a VM; there is little-to-no point!)

If this is the heart of the issue, then I should be able to replicate it 
without needing to use a zone at all, provided I can set up Crossbow in the 
global zone in such a way that it uses different MAC addresses depending on 
the destination.... Now that I think about it, I think I did hit this when I 
started off with just the one NIC in the VM. I moved to a second e1000, 
separating the global/zone traffic as a sanity check quite early on.... hrm.
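
Something like this in the global zone should reproduce it without a zone at 
all (the VNIC name and test address below are made up for the experiment, and 
the syntax is from memory):

bash-3.2# dladm create-vnic -l e1000g0 testvnic0
bash-3.2# ifconfig testvnic0 plumb 192.168.1.70 netmask 255.255.255.0 up

Then ping 192.168.1.70 from another machine on the LAN. If the theory holds, 
frames addressed to testvnic0's MAC should never make it into the VM (so the 
ping fails) until something puts e1000g0 into promiscuous mode.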

> If you simply enable promiscuous mode on the VMware virtual switch, then 
> it will take these packets, but the back-end driver instance associated 
> with e1000g might still filter out these packets by default and dropping 
> them. In order to see the packets you have to turn on promiscuous mode 
> on e1000g1 itself which probably causes the VMWare back-end to send all 
> packets up.
Agreed.

VMware ESX provides some granularity when it comes to setting promiscuous 
options.
It can be set globally on the whole switch, or at a "port group" level, though 
I don't see anywhere to toggle it on a per-vNIC or per-VM basis.

Port groups are an administrative abstraction of a group of ports on a specific 
vSwitch, a bit like VLANs but without network-level tagging (though they can 
also be used to enable/set up VLANs).

I have ALL virtual machines running off one vSwitch, so enabling promiscuous 
mode on the vSwitch (for all VMs) just to get my zone server working with 
Crossbow isn't an attractive option. Making a dedicated *promiscuous-on* port 
group that only contains this one Solaris server may work better though.

> If this theory is correct, what would help is allow the VMware back-end 
> to send up all packets received from the VMware virtual switch without 
> filtering. But I don't know if VMware provides that option.
I think that is what a port group will allow me to do; remember, however, that 
by itself this didn't fix the problem. I had to have the VM's NIC in 
promiscuous mode too before traffic would flow correctly.

I was doing this (accidentally at the time) by running snoop.
Is there a better way to enable promiscuous mode on an interface within Solaris 
permanently? All I could dig up with Google was this: 
http://www.kernelfaq.com/2008/04/enabling-and-disabling-promiscuous-mode.html
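
The only stopgap I can think of in the meantime is the ugly one: leave a 
background snoop running purely to hold the interface in promiscuous mode, 
something along the lines of:

bash-3.2# snoop -q -d e1000g1 -o /dev/null &
(-q suppresses the packet count, -o just throws the capture away; killing the 
snoop drops the interface back out of promiscuous mode)

I'd much rather find a proper driver or dladm knob for this if one exists.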



MAC filtering:
Going back to what you said earlier about the e1000g driver handling multiple 
unicast MACs concurrently in hardware: in my googling I've discovered that not 
all e1000 NICs support this feature.

*Is there a way to tell if the VMware emulated e1000 is advertising this 
feature in the 'hardware' to the guest?

*Is there a way to tell if Crossbow is making use of it, rather than falling 
back to the "less fancy" promiscuous mode? This would be most valuable for 
better understanding what we're seeing here!

dladm show-linkprop isn't showing me anything. I guess we're not quite there 
yet? http://markmail.org/message/qiqygyqxt5t6qp5b
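
The only other idea I've had for answering the second question is to watch the 
driver's entry points while a VNIC is created. Assuming the e1000g entry 
points are still named e1000g_m_unicst and e1000g_m_setpromisc in this build 
(I'm guessing at the names), something like this DTrace one-liner might show 
whether Crossbow programs a unicast slot or flips promiscuous mode instead:

bash-3.2# dtrace -n 'fbt::e1000g_m_unicst:entry { trace("unicast slot programmed"); }' \
    -n 'fbt::e1000g_m_setpromisc:entry { printf("setpromisc: %d", arg1); }'

(run that in one terminal, then dladm create-vnic in another and see which 
probe fires)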



My current working theory is this:
*vSwitch layer*
VMware ESX knows exactly which vSwitch ports are connected to a physical NIC 
uplinking the vSwitch to the physical world, and which ports are connected to 
NICs within VMs.
The vSwitch "host" ports should only ever have a single MAC address on them at 
any given time, as they're directly connected to a single NIC, and the vSwitch 
enforces this limit as a security measure. This would prevent MAC spoofing 
attacks, for example.

Recall that by default within a vSwitch "MAC Address Changes" are allowed, as 
are "Forged Transmits", which strongly hints at the behaviour that I'm 
theorising.

*NIC layer*
I'm expecting that the VMware-provided emulated e1000 NIC has no concept of MAC 
address slots on the vSwitch end - given the behaviour of one MAC address per 
port at the vSwitch level, why would it ever need to support multiple MACs?
Within the VM, however, Crossbow is detecting an e1000 pNIC that does support 
multiple MACs, and it's making use of these slots for the VNICs' MACs as they 
get added, rather than toggling promiscuous mode on the e1000g.

**Outbound traffic**
ESX is allowing the "forged transmits" from the VNIC's additional MAC address, 
and broadcasts/multicasts are being passed through both the vSwitch and the 
e1000 correctly.

**Inbound traffic**
*vSwitch layer*
ESX knows which MAC address the e1000 has within the guest, and it will have 
this entered into its MAC forwarding table for the port that the VM is 
connected to. Exactly what it's doing with the VNIC's MAC that is being 
broadcast around in ARP requests... I have no idea.

Enabling promiscuous mode at the vSwitch level bypasses/disables the MAC 
forwarding table, so now frames with the VNIC's MAC are getting to the right 
switch port. This alone still doesn't fix the problem, because:

*NIC layer*
The ESX end of the e1000 NIC only knows about the primary MAC address of the 
NIC, so it isn't passing frames addressed to the VNIC's MAC address into the 
VM guest's end of the e1000 for further processing by Crossbow.

When snoop is started, the interface is put into promiscuous mode in the guest, 
and this is being trapped by the ESX end of the e1000, which then enables 
promiscuous mode on its end as well.
With all frames finally passing into the guest end of the e1000, Crossbow can 
do its job and everything starts working!

Phew!


I'm having to theorise much of the ESX behaviour, as there is simply no way to 
get the information I need from ESX itself, but this model seems to fit pretty 
well, don't you think?


Way forward:
I can focus on testing the promiscuous mode behaviour on the vSwitch port 
group, which may lead to a tidy workaround at that level.
At the NIC level, if my theory is correct, it would seem that I really need a 
way to make Crossbow enable promiscuous mode on the NIC rather than adding a 
"hardware-based MAC filter" to the e1000, as it doesn't seem that the latter 
will work in a VMware ESX environment.

> Nicolas.
Jonathan
-- 
This message posted from opensolaris.org
