On Wed, Dec 12, 2012 at 8:04 AM, Jesse Gross <[email protected]> wrote:
> Thanks and sorry that I took so long to get back to you. > > I thought about the layering a fair amount and am still somewhat > bothered by the flow of things. I think if we make a couple small > modifications then it will fit more cleanly: > * I don't think that we really need ip_tunnel_xmit(). It's primarily > there to handle offloads but if we get in the GSO changes before > allowing upper layer protocols to send these offloaded packets then we > don't have to worry about it. Also, the send_frags() function is > there due to CAPWAP's protocol level fragmentation, which we don't > have to worry about here. Without these two we can call directly into > the GRE data plane code and have it do the complete process without > much in the way of code duplication. > OK, we can drop capwap related support in xmit, but IPIP still needs it, so I think we can keep it in ip_tunnel. > * Registering to receive packets should probably into the GRE data > plane module instead of directly to the IP tunnel code. There's > potentially a module dependency problem otherwise - if all you do is > register to receive packets then there's nothing that will actually > cause the GRE module to get loaded. This is resolved at the moment by > the use of the transmit function but it seems a little fragile. We > can definitely still use the IP tunnel code and share structures, etc. > but we would just add an extra hop. > > we can have direct dependance of gre module on ip_tunnel module by pulling hash-lookup into ip_tunnel. ip_gre has dependency on gre. So am I not sure abt fragile dependency. > At this point the upper tunneling interface would look something like this: > gre_register() > gre_unregister() > what about vxlan and future protocols? > gre_send() > > It's obviously protocol specific but I think at this point that it's > inevitable and this way at least it is consistent. I'm also aware > that it makes the IP tunnel code more of a library than I was > proposing - now that I can see the code I can understand what you were > saying that it makes more sense this way. > > On Thu, Nov 22, 2012 at 7:56 AM, Pravin B Shelar <[email protected]> > wrote: > > diff --git a/include/net/gre.h b/include/net/gre.h > > index 8266547..dda547a 100644 > > --- a/include/net/gre.h > > +++ b/include/net/gre.h > > @@ -15,4 +16,12 @@ struct gre_protocol { > > int gre_add_protocol(const struct gre_protocol *proto, u8 version); > > int gre_del_protocol(const struct gre_protocol *proto, u8 version); > > > > +struct sk_buff *gre_build_header(struct sk_buff *skb, > > + const struct tnl_ptk_info *tpi); > > +struct gre_base_hdr { > > + __be16 flags; > > + __be16 protocol; > > +}; > > +#define GRE_HEADER_SECTION 4 > > I think these are only used in gre.c, so maybe we can keep the > definitions private to that (since ideally no other code will need to > have these protocol details). > > ok. > > diff --git a/include/net/ipip.h b/include/net/ipip.h > > index 21947cf..a14bde8 100644 > > --- a/include/net/ipip.h > > +++ b/include/net/ipip.h > > What do you think about renaming this to ip_tunnel.h or similar? It > seems like too specific of a name at this point. > > ok > > +#define TUNNEL_CSUM __cpu_to_be16(0x8000) > > +#define TUNNEL_KEY __cpu_to_be16(0x2000) > > +#define TUNNEL_SEQ __cpu_to_be16(0x1000) > > I don't think that we really want to tie these directly to GRE flags. > As we add more protocols it will just get more complicated. > > ok. > > +struct tnl_ptk_info { > > + __be16 flags; > > + __be16 proto; > > + __be32 key; > > + __be32 seq; > > + int hdr_len; > > + __sum16 csum; > > Do we actually need the hdr_len and csum fields. hdr_len seems like > it ties the control plane to the data plane a little too closely. For > csum, can't we directly drop packets with invalid checksums when we > parse the header? There aren't any current users of packets with > invalid checksums and none really come to mind in the future. > > ok. > > +#define HASH_ON_KEY 1 > > I think this isn't used until later with VXLAN so maybe we should wait > until then. > > right. > > +struct ip_tunnel_ops { > > + u32 flags; > > + int (*parse_netlink_parms)(struct ip_tunnel *vxlan, > > + struct nlattr *data[], > > + struct nlattr *tb[], > > + struct ip_tunnel_parm *parms); > > + int (*get_ioctl_param)(struct net_device *dev, > > + struct ifreq *ifr, > > + struct ip_tunnel_parm *p); > > +}; > > + > > +#define IPT_HASH_BITS 10 > > +#define IPT_HASH_SIZE (1 << IPT_HASH_BITS) > > + > > +struct ip_tunnel_net { > > + struct ip_tunnel __rcu **tunnels; > > + struct net_device *fb_tunnel_dev; > > + const struct ip_tunnel_ops *ops; > > +}; > > + > > +enum ipt_type { > > + IPT_GRE, > > + IPT_VXLAN, > > Probably the VXLAN identifier should come later when VXLAN starts to > actually use this infrastructure. > > ok.. > > +#define IPT_HASH_BUCKETS 16 > > +extern struct hlist_head ipt_proto[IPT_HASH_BUCKETS]; > > The names for IPT_HASH_BUCKETS and IPT_HASH_SIZE are very similar, we > probably should choose something more descriptive. > > > +static inline void ip_tunnel_setup(struct net_device *dev, int net_id) > > +{ > > + struct ip_tunnel *tunnel = netdev_priv(dev); > > + tunnel->ipt_net_id = net_id; > > +} > > Does this need to be an inline? We could probably just make it consistent. > > > diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig > > index 5a19aeb..4110ff3 100644 > > --- a/net/ipv4/Kconfig > > +++ b/net/ipv4/Kconfig > > @@ -166,6 +166,7 @@ config IP_PNP_RARP > > config NET_IPIP > > tristate "IP: tunneling" > > select INET_TUNNEL > > + select NET_IP_TUNNEL > > IPIP isn't using anything from here yet, right? > > ok. > > diff --git a/net/ipv4/gre.c b/net/ipv4/gre.c > > index 5a903dc..b837551 100644 > > --- a/net/ipv4/gre.c > > +++ b/net/ipv4/gre.c > > +struct sk_buff *gre_build_header(struct sk_buff *skb, > > + const struct tnl_ptk_info *tpi) > > +{ > > + struct iphdr *iph = ip_hdr(skb); > > + struct gre_base_hdr *greh = (struct gre_base_hdr *)&iph[1]; > > Can we reverse the order that we build the headers so that GRE comes > before the IP header? That way you could just push each incremental > piece on, rather than having to do all of the pointer manipulation. > > ok > > + struct dst_entry *dst = skb_dst(skb); > > + > > + greh->flags = tpi->flags; > > + greh->protocol = tpi->proto; > > + if (tpi->flags&(GRE_KEY|GRE_CSUM|GRE_SEQ)) { > > + __be32 *ptr = (__be32 *)(((u8 *)greh) + tpi->hdr_len - > 4); > > + > > + if (tpi->flags&GRE_SEQ) { > > + *ptr = tpi->seq; > > + ptr--; > > + } > > + if (tpi->flags&GRE_KEY) { > > + *ptr = tpi->key; > > + ptr--; > > + } > > + if (tpi->flags&GRE_CSUM) { > > + *(__sum16 *)ptr = 0; > > + *(__sum16 *)ptr = csum_fold(skb_checksum(skb, > > + skb_transport_offset(skb), > > + skb->len - > skb_transport_offset(skb), > > + 0)); > > + } > > + } > > + skb->local_df = 1; > > + __ip_select_ident(ip_hdr(skb), dst, 0); > > We should just fix the fragmentation code to select an ID if necessary > rather than keeping this workaround. > > I will post separate patch for it. > + return skb; > > +} > > +EXPORT_SYMBOL(gre_build_header); > > + > > + > > Extra blank line. > > > +static int parse_gre_header(struct sk_buff *skb, struct tnl_ptk_info > *tpi) > > +{ > > + struct gre_base_hdr *greh = (struct gre_base_hdr *)skb->data; > > + __be32 *options = (__be32 *)(greh + 1); > > + > > + if (unlikely(greh->flags & (GRE_VERSION | GRE_ROUTING))) > > + return -EINVAL; > > + > > + tpi->flags = greh->flags; > > + tpi->proto = greh->protocol; > > + > > + tpi->hdr_len = GRE_HEADER_SECTION; > > + tpi->csum = check_checksum(skb); > > + > > + if (tpi->csum) > > + return -EINVAL; > > + > > + if (greh->flags & GRE_CSUM) { > > + tpi->hdr_len += GRE_HEADER_SECTION; > > + options++; > > + } > > Can we incrementally pull each piece off so that we don't have to pass > hdr_len around? > > ok. > > +static int ipgre_rcv_v0(struct sk_buff *skb) > > +{ > > + struct tnl_ptk_info tpi; > > + struct ipt_protocol *proto; > > + struct hlist_node *n; > > + struct hlist_head *head; > > + > > + if (!pskb_may_pull(skb, 16)) > > + goto drop; > > + > > + if (parse_gre_header(skb, &tpi) < 0) > > + goto drop; > > + > > + head = ipt_hash_bucket(IPT_GRE, 0); > > + hlist_for_each_entry_rcu(proto, n, head, node) { > > + int ret; > > + > > + if (proto->type != IPT_GRE || proto->portno != 0) > > + continue; > > + ret = proto->handler(skb, &tpi); > > It seems that it is very likely that users of this API will have the > same handler for all port numbers (and running different > implementation on different ports at the same time seems not that > common either). In that case, it would be easier to just pass the > port number to the handler and let it do its own lookup. > > Also, does it make sense to pull the complete hash and lookup block > out into ip_tunnel.c? > > ok. > > +static void ipgre_err_v0(struct sk_buff *skb, u32 info) > > +{ > > + > > + /* All the routers (except for Linux) return only > > + * 8 bytes of packet payload. It means, that precise relaying of > > + * ICMP in the real Internet is absolutely infeasible. > > + * > > + * Moreover, Cisco "wise men" put GRE key to the third word > > + * in GRE header. It makes impossible maintaining even soft > > + * state for keyed > > + * GRE tunnels with enabled checksum. Tell them "thank you". > > + * > > + * Well, I wonder, rfc1812 was written by Cisco employee, > > + * what the hell these idiots break standards established > > + * by themselves??? > > + */ > > + > > + const int type = icmp_hdr(skb)->type; > > + const int code = icmp_hdr(skb)->code; > > + struct tnl_ptk_info tpi; > > + struct hlist_node *n; > > + struct hlist_head *head; > > + struct ipt_protocol *proto; > > + > > + if (!pskb_may_pull(skb, sizeof(struct gre_base_hdr) + ETH_HLEN)) > > + return; > > This code is shared by both encapsulated IP and Ethernet so we should > be careful not to make any assumptions there. > > ok. > > + parse_gre_header(skb, &tpi); > > + > > + if (tpi.csum) > > + return; > > I don't think that we can validate the checksum here. It's likely > that the ICMP message will only return a portion of the GRE packet so > we'll falsely think the checksum is wrong. > > > + /* If only 8 bytes returned, keyed message will be dropped here > */ > > + if (tpi.flags & GRE_KEY) { > > + if ((tpi.flags & GRE_CSUM) && (tpi.hdr_len < 12)) > > + return; > > + if (tpi.hdr_len < 8) > > + return; > > + } > > + > > + switch (type) { > > + default: > > + case ICMP_PARAMETERPROB: > > + return; > > + > > + case ICMP_DEST_UNREACH: > > + switch (code) { > > + case ICMP_SR_FAILED: > > + case ICMP_PORT_UNREACH: > > + /* Impossible event. */ > > + return; > > + default: > > + /* All others are translated to HOST_UNREACH. > > + rfc2003 contains "deep thoughts" about > NET_UNREACH, > > + I believe they are just ether pollution. --ANK > > + */ > > + break; > > + } > > + break; > > + case ICMP_TIME_EXCEEDED: > > + if (code != ICMP_EXC_TTL) > > + return; > > + break; > > + > > + case ICMP_REDIRECT: > > + break; > > + } > > + > > + head = ipt_hash_bucket(IPT_GRE, 0); > > + hlist_for_each_entry_rcu(proto, n, head, node) { > > + if (proto->type != IPT_GRE || proto->portno != 0) > > + continue; > > + if (proto->err_handler(skb, info, &tpi) <= 0) > > + return; > > + } > > I think we can move the dispatch lookup above the type check to avoid > imposing a policy on the ultimate handler. > > That said, can we continue to handle path MTU discovery and ICMP > redirects at this level? It doesn't really depend on the tunnel > device and seems like a common data plane component. > > > diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c > > index 0d4eecd..829fe3d 100644 > > --- a/net/ipv4/ip_gre.c > > +++ b/net/ipv4/ip_gre.c > > /* Fallback tunnel: no source, no destination, no key, no options */ > > > > #define HASH_SIZE 16 > > I think these two lines are unused now. > > > static int ipgre_net_id __read_mostly; > > > -static void ipgre_err(struct sk_buff *skb, u32 info) > [...] > > + int dev_type; > [...] > > + if (tpi->proto == htons(ETH_P_TEB)) { > > + dev_type = ARPHRD_ETHER; > > + itn = net_generic(net, ipgre_tap_net_id); > > + } else { > > + dev_type = ARPHRD_IPGRE; > > + itn = net_generic(net, ipgre_net_id); > > } > > It doesn't look to me like dev_type is used anywhere, either here or > in ipgre_rcv(). > > Also, can we use a return code other than 1 for tunnel not found in > these two functions? Something symbolic seems better to me. > > > static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct > net_device *dev) > [...] > > + err = ip_tunnel_build_iphdr(skb, dev, tiph, gre_hlen, &iph); > > + if (err) > > + return NETDEV_TX_OK; > > + > > + tpi.flags = tunnel->parms.o_flags; > > + tpi.proto = proto; > > + tpi.key = tunnel->parms.o_key; > > + tpi.seq = htonl(tunnel->o_seqno); > > + tpi.hdr_len = tunnel->hlen; > > + if (skb_cow_head(skb, dev->needed_headroom)) { > > + dev->stats.tx_dropped++; > > dev_kfree_skb(skb); > > - skb = new_skb; > > - old_iph = ip_hdr(skb); > > - } > > Can we push this down into the encapsulation code? It seems like a > common data plane piece. > > do u meanpush it to ip_tunnel_xmit()? I am using same function in ip_gre and ovs-gre thats why i kept it here rather than ip_tunnel. > > - if (tunnel->parms.o_flags&(GRE_KEY|GRE_CSUM|GRE_SEQ)) { > > - __be32 *ptr = (__be32 *)(((u8 *)iph) + tunnel->hlen - 4); > > + u64_stats_update_begin(&tstats->syncp); > > + tstats->tx_bytes += tx_len; > > + tstats->tx_packets++; > > + u64_stats_update_end(&tstats->syncp); > > > > - if (tunnel->parms.o_flags&GRE_SEQ) { > > + if (tunnel->parms.o_flags&GRE_SEQ) > > ++tunnel->o_seqno; > > The current version increments the sequence number before > transmission, so it always goes up regardless of the outcome. That > seems a little more natural to me. > > Also, can we push the stats updates to ip_tunnel? They are > incremented there everywhere else and it would be good to keep things > in a single location to the greatest extent possible. > > > diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c > > new file mode 100644 > > index 0000000..e5e2fde > > --- /dev/null > > +++ b/net/ipv4/ip_tunnel.c > > +static void free_linked_skbs(struct sk_buff *skb) > > +{ > > + while (skb) { > > + struct sk_buff *next = skb->next; > > + kfree_skb(skb); > > + skb = next; > > + } > > +} > > + > > + > > Extra blank line. > > > +static int send_frags(struct sk_buff *skb, > > + int tunnel_hlen) > > +{ > > + int sent_len; > > + > > + sent_len = 0; > > + while (skb) { > > + struct sk_buff *next = skb->next; > > + int frag_len = skb->len - tunnel_hlen; > > + int err; > > + > > + skb->next = NULL; > > + memset(IPCB(skb), 0, sizeof(*IPCB(skb))); > > + > > + err = ip_local_out(skb); > > + skb = next; > > + if (unlikely(net_xmit_eval(err))) > > + goto free_frags; > > + sent_len += frag_len; > > + } > > + > > + return sent_len; > > + > > +free_frags: > > + /* > > + * There's no point in continuing to send fragments once one has > been > > + * dropped so just free the rest. This may help improve the > congestion > > + * that caused the first packet to be dropped. > > + */ > > + free_linked_skbs(skb); > > + return sent_len; > > +} > > + > > + > > Another double blank line here. > > > +int __net_init ip_tunnel_init_net(struct net *net, int ipt_net_id, > > + struct rtnl_link_ops *ops) > > +{ > > + struct ip_tunnel_net *itn = net_generic(net, ipt_net_id); > > + struct ip_tunnel_parm parms; > > + > > + itn->tunnels = kzalloc(sizeof(void *) * IPT_HASH_SIZE, > GFP_KERNEL); > > + if (!itn->tunnels) > > + return -ENOMEM; > > + > > + if (!ops) { > > Presumably some type of ops needs to be supplied. Should the fallback > device be created only if we have ioctl ops? > > > +/* Can be moved to ip-tunnel */ > > +int ip_tunnel_newlink(struct net *src_net, struct net_device *dev, > > I think the comment here is out of date. > > > +void ip_tunnel_uninit(struct net_device *dev) > > +{ > > + struct net *net = dev_net(dev); > > + struct ip_tunnel *tunnel = netdev_priv(dev); > > + struct ip_tunnel_net *itn; > > + > > + itn = net_generic(net, tunnel->ipt_net_id); > > + /* fb_tunnel_dev will be unregisted in net-exit call. */ > > + if (itn->fb_tunnel_dev != dev) > > + ip_tunnel_unlink(itn, netdev_priv(dev)); > > +} > > +EXPORT_SYMBOL(ip_tunnel_uninit); > > + > > + > > Double blank line here. > > > +int ipt_add_protocol(struct ipt_protocol *newp) > > +{ > > + struct hlist_head *head; > > + struct ipt_protocol *ipt = NULL; > > + struct hlist_node *n; > > + ASSERT_RTNL(); > > + > > + head = ipt_hash_bucket(newp->type, newp->portno); > > + > > + hlist_for_each_entry_rcu(ipt, n, head, node) { > > + if (ipt->type != newp->type || ipt->portno != > newp->portno) > > + continue; > > + if (ipt->priority > newp->priority) { > > + hlist_add_before_rcu(&newp->node, &ipt->node); > > + return 0; > > + } > > + if (ipt->priority == newp->priority) > > + return -1; > > Can you use a symbolic error code here? > > > +void ipt_del_protocol(struct ipt_protocol *proto) > > +{ > > + ASSERT_RTNL(); > > + > > + hlist_del_rcu(&proto->node); > > + synchronize_rcu(); > > I would use synchronize_net() here. > > > +static int __init ip_tunnel_mod_init(void) > > +{ > > + pr_info("IP_Tunnel init\n"); > > I would either make this a little more descriptive or just remove it > entirely. Right now it's not very human readable. > > > + get_random_bytes(&ip_tunnel_salt, sizeof(ip_tunnel_salt)); > > Do we actually need to hash in a random value? The hash entries > aren't controlled by anything on the network. > > ok. Thanks for review. > I'm still working on going through the rest of the ioctl/Netlink code > but I wanted to get back to you with what I have. >
_______________________________________________ dev mailing list [email protected] http://openvswitch.org/mailman/listinfo/dev
