On Wed, Feb 27, 2013 at 07:39:02AM -0800, Jesse Gross wrote:
> On Tue, Feb 26, 2013 at 11:10 PM, Simon Horman <[email protected]> wrote:
> > On Tue, Feb 26, 2013 at 02:49:47PM -0800, Jesse Gross wrote:
> >> A couple high level comments:
> >> * I'd like to nail down the format of the userspace/kernel interface
> >> for multiple levels of tags. We don't need to implement multiple
> >> levels but we should have a good plan on how we would cleanly extend
> >> the interface to do that in the future if we want. Ben and I had
> >> talked about using an array of tags in a single Netlink attribute,
> >> which seems fairly clean and avoids needing lots of encap attributes.
> >
> > Is the implication here that the size of the array and thus depth of the
> > tags is fixed?
>
> I think we could use a variable sized array and then use the size from
> the Netlink attribute to get the number of elements. That way we
> could easily expand it over time.
Thanks, I think that has worked out pretty cleanly.
What I have done is to allow the attribute to be an array of one or more
struct ovs_key_mpls. This doesn't actually require any code changes because
its a superset of the previous arrangement of an element with
exactly one struct ovs_key_mpls. However, I added a little bit of code
to the datapath to allow verification parsing of the array. This is shown
in the incremental patch at the bottom of this email which I will roll
into the next revision of the patch to add MPLS support to the datapath
(the patch this thread is about).
I also added a bit of code to ovs-vswitchd to exercise this lightly
which seems to work out well enough.
>
> >> > diff --git a/datapath/actions.c b/datapath/actions.c
> >> > index f638ffc..60522be 100644
> >> > --- a/datapath/actions.c
> >> > +++ b/datapath/actions.c
> >> > + new_mpls_lse = (__be32 *)(skb_mac_header(skb) + l2_size);
> >> > + *new_mpls_lse = mpls->mpls_lse;
> >> > +
> >> > + set_ethertype(skb, mpls->mpls_ethertype);
> >> > + return 0;
> >> > +}
> >>
> >> There are potentially a number of issues with offloading here: if we
> >> have a packet that requires some form of offloading (say TSO) and we
> >> push an MPLS header on the front, then it's quite likely that the NIC
> >> can no longer handle the packet and we must emulate in software first.
> >> That would ideally mean GSO support for MPLS (as was done for vlan).
> >
> > I take it that would be work in the core kernel code.
> > Could you give me a few pointers as to what I should be looking at?
>
> It would mostly be the code around skb_gso_segment() that would need
> to skip over and replicate the MPLS tags. dev_hard_start_xmit() also
> likely needs some work to make sure that we take the software path
> instead of using hardware that can't offload.
Thanks. I have added it to the TODO list and will look into it.
No doubt I will have more questions at that time.
> >> > diff --git a/datapath/datapath.c b/datapath/datapath.c
> >> > index 87c96ae..1386917 100644
> >> > --- a/datapath/datapath.c
> >> > +++ b/datapath/datapath.c
> >> > +void skb_cb_set_l2_size(struct sk_buff *skb)
> >> > +{
> >> > + struct ethhdr *eth;
> >> > + int nh_ofs;
> >> > + __be16 dl_type = 0;
> >> > +
> >> > + skb_reset_mac_header(skb);
> >> > +
> >> > + eth = eth_hdr(skb);
> >> > + nh_ofs = sizeof(struct ethhdr);
> >> > + if (likely(eth->h_proto >= htons(ETH_TYPE_MIN))) {
> >> > + dl_type = eth->h_proto;
> >> > +
> >> > + while (dl_type == htons(ETH_P_8021Q) &&
> >> > + skb->len >= nh_ofs + sizeof(struct
> >> > vlan_hdr)) {
> >> > + struct vlan_hdr *vh = (struct
> >> > vlan_hdr*)(skb->data + nh_ofs);
> >> > + dl_type = vh->h_vlan_encapsulated_proto;
> >> > + nh_ofs += sizeof(struct vlan_hdr);
> >> > + }
> >> > +
> >> > + OVS_CB(skb)->l2_size = nh_ofs;
> >> > + } else {
> >> > + OVS_CB(skb)->l2_size = 0;
> >> > + }
> >> > +}
> >>
> >> It really seems like this should happen as part of flow extraction,
> >> similar to the initialization of the other layer pointers, rather than
> >> separately in every place that we get a packet.
> >
> > Yes, I think that should work.
> >
> > Thinking out aloud:
> >
> > I think that it should be possible to implement your idea by
> > having ovs_flow_extract() set OVS_CB(skb)->l2_size.
> >
> > Currently skb_cb_set_l2_size() is called in two locations,
> > ovs_packet_cmd_execute() and ovs_vport_receive().
> >
> > In the case of ovs_packet_cmd_execute() there is a previously a
> > call to ovs_flow_extract().
> >
> > And in the case of ovs_vport_receive() there is subsequently a
> > call to ovs_dp_process_received_packet() which in turn calls
> > ovs_flow_extract() if OVS_CB(skb)->flow is not set.
>
> All of that sounds right to me.
>
> >> > diff --git a/datapath/flow.c b/datapath/flow.c
> >> > index fad9e19..27e1920 100644
> >> > --- a/datapath/flow.c
> >> > +++ b/datapath/flow.c
> >> > @@ -728,6 +728,17 @@ int ovs_flow_extract(struct sk_buff *skb, u16
> >> > in_port, struct sw_flow_key *key,
> >> > memcpy(key->ipv4.arp.tha, arp->ar_tha, ETH_ALEN);
> >> > key_len = SW_FLOW_KEY_OFFSET(ipv4.arp);
> >> > }
> >> > + } else if (eth_p_mpls(key->eth.type)) {
> >> > + error = check_header(skb, MPLS_HLEN);
> >> > + if (unlikely(error))
> >> > + goto out;
> >> > +
> >> > + key_len = SW_FLOW_KEY_OFFSET(mpls.top_lse);
> >> > + memcpy(&key->mpls.top_lse, skb_network_header(skb),
> >> > MPLS_HLEN);
> >> > +
> >> > + /* Update network header */
> >> > + skb_set_network_header(skb, skb_network_header(skb) -
> >> > + skb->data + MPLS_HLEN);
> >>
> >> Is it possible to actually use this network header pointer?
> >
> > It is possible to use it in ovs_flow_extract_l3_onwards() in conjunction
> > with the inner/outer flow changes I have in other patches (and you have
> > asked me to drop).
> >
> > It does seem to me that it may be incorrect if there is more than one MPLS
> > LSE present in the frame.
> >
> > But I'm not sure if either of those answers relate to the point you are
> > making. Are there use-cases which concern you?
>
> I was just unsure of when it would be useful. It seems like if we're
> not going to try to act beyond the MPLS labels that we can parse then
> it makes the most sense to just leave the network layer unset (maybe
> we can use it to point to the start of the MPLS stack instead of
> needing a special l2 length then?).
I think that removing the call to skb_set_network_header() you have
isolated above is harmless enough as I don't think it is used because
if frame is MPLS then no l3+ processing is done.
I looked at making use of the network header to mark the top of
the MPLS stack and not adding the special l2 length, however, I believe
it breaks down in the situation I will now describe.
Suppose an IP frame has two actions applied, dec_ttl and push_mpls.
This is valid because an IP match may be made and ovs-vswitchd
can install set(ip) and push_mpls actions in the datapath.
Now, for some reason that I must confess that I don't fully understand
at this moment the actions are installed as push_mpls and then set(ip).
I believe that for the scheme you suggest above to work push_mpls (and
pop_mpls) need to update the network_header such that it always points to
the to of the MPLS stack. This is to allow multiple mpls actions to behave
correctly, e.g. mpls_push then mpls_set.
The problem being that set(ip) uses the network_header and if it has been
set to the top of the MPLS stack by push_mpls then the set(ip) action will
corrupt the packet.
------ incremental patch to allow OVS_KEY_ATTR_MPLS to be an array -------
diff --git a/datapath/datapath.c b/datapath/datapath.c
index eb2e91b..af0dba8 100644
--- a/datapath/datapath.c
+++ b/datapath/datapath.c
@@ -599,8 +599,7 @@ static int validate_set(const struct nlattr *a,
return -EINVAL;
if (key_type > OVS_KEY_ATTR_MAX ||
- (ovs_key_lens[key_type] != nla_len(ovs_key) &&
- ovs_key_lens[key_type] != -1))
+ ovs_flow_verify_key_len(key_type, ovs_key))
return -EINVAL;
switch (key_type) {
diff --git a/datapath/flow.c b/datapath/flow.c
index c99a6f9..7528eb4 100644
--- a/datapath/flow.c
+++ b/datapath/flow.c
@@ -858,7 +858,7 @@ void ovs_flow_tbl_remove(struct flow_table *table, struct
sw_flow *flow)
}
/* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute. */
-const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
+static const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
[OVS_KEY_ATTR_ENCAP] = -1,
[OVS_KEY_ATTR_PRIORITY] = sizeof(u32),
[OVS_KEY_ATTR_IN_PORT] = sizeof(u32),
@@ -881,6 +881,11 @@ const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
[OVS_KEY_ATTR_TUN_ID] = sizeof(__be64),
};
+/* Set the bit of a type in ovs_key_arrays if it is
+ * an array of one or more elements of size ovs_key_lens[type].
+ */
+static const u64 ovs_array_keys = (1ULL << OVS_KEY_ATTR_MPLS);
+
static int ipv4_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_len,
const struct nlattr *a[], u64 *attrs)
{
@@ -987,6 +992,27 @@ static int ipv6_flow_from_nlattrs(struct sw_flow_key
*swkey, int *key_len,
return 0;
}
+int ovs_flow_verify_key_len(u16 type, const struct nlattr *nla)
+{
+ int expected_len = ovs_key_lens[type];
+ int len = nla_len(nla);
+
+ if (len == -1)
+ return 0;
+
+ if (ovs_array_keys & (1ULL << type)) {
+ /* The type is an array.
+ * Check if len is a non-zero, non-negative multiple
+ * of expected_len.
+ */
+ if (len < expected_len || len % expected_len)
+ return -EINVAL;
+ } else if (len != expected_len)
+ return -EINVAL;
+
+ return 0;
+}
+
static int parse_flow_nlattrs(const struct nlattr *attr,
const struct nlattr *a[], u64 *attrsp)
{
@@ -997,14 +1023,12 @@ static int parse_flow_nlattrs(const struct nlattr *attr,
attrs = 0;
nla_for_each_nested(nla, attr, rem) {
u16 type = nla_type(nla);
- int expected_len;
if (type > OVS_KEY_ATTR_MAX || attrs & (1ULL << type))
return -EINVAL;
- expected_len = ovs_key_lens[type];
- if (nla_len(nla) != expected_len && expected_len != -1)
- return -EINVAL;
+ if (ovs_flow_verify_key_len(type, nla))
+ return -EINVAL;
attrs |= 1ULL << type;
a[type] = nla;
@@ -1305,7 +1329,6 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int
*key_lenp,
memcpy(swkey->ipv4.arp.tha, arp_key->arp_tha, ETH_ALEN);
} else if (eth_p_mpls(swkey->eth.type)) {
const struct ovs_key_mpls *mpls_key;
-
if (!(attrs & (1ULL << OVS_KEY_ATTR_MPLS)))
return -EINVAL;
attrs &= ~(1ULL << OVS_KEY_ATTR_MPLS);
@@ -1353,7 +1376,7 @@ int ovs_flow_metadata_from_nlattrs(struct sw_flow *flow,
int key_len, const stru
if (type <= OVS_KEY_ATTR_MAX && ovs_key_lens[type] > 0) {
int err;
- if (nla_len(nla) != ovs_key_lens[type])
+ if (ovs_flow_verify_key_len(type, nla))
return -EINVAL;
switch (type) {
diff --git a/datapath/flow.h b/datapath/flow.h
index 99429d2..27e394b 100644
--- a/datapath/flow.h
+++ b/datapath/flow.h
@@ -240,7 +240,7 @@ void ovs_flow_tbl_insert(struct flow_table *table, struct
sw_flow *flow,
void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow);
struct sw_flow *ovs_flow_tbl_next(struct flow_table *table, u32 *bucket, u32
*idx);
-extern const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1];
+int ovs_flow_verify_key_len(u16 type, const struct nlattr *nla);
int ipv4_tun_from_nlattr(const struct nlattr *attr,
struct ovs_key_ipv4_tunnel *tun_key);
int ipv4_tun_to_nlattr(struct sk_buff *skb,
diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
index 63d1cac..f7b788c 100644
--- a/include/linux/openvswitch.h
+++ b/include/linux/openvswitch.h
@@ -285,7 +285,7 @@ enum ovs_key_attr {
OVS_KEY_ATTR_IPV4_TUNNEL, /* struct ovs_key_ipv4_tunnel */
#endif
- OVS_KEY_ATTR_MPLS = 62, /* struct ovs_key_mpls */
+ OVS_KEY_ATTR_MPLS = 62, /* Array of one or more struct ovs_key_mpls */
OVS_KEY_ATTR_TUN_ID = 63, /* be64 tunnel ID */
__OVS_KEY_ATTR_MAX
};
_______________________________________________
dev mailing list
[email protected]
http://openvswitch.org/mailman/listinfo/dev