On Wed, Dec 12, 2012 at 8:04 AM, Jesse Gross <[email protected]> wrote:

> Thanks and sorry that I took so long to get back to you.
>
> I thought about the layering a fair amount and am still somewhat
> bothered by the flow of things.  I think if we make a couple small
> modifications then it will fit more cleanly:
>  * I don't think that we really need ip_tunnel_xmit().  It's primarily
> there to handle offloads but if we get in the GSO changes before
> allowing upper layer protocols to send these offloaded packets then we
> don't have to worry about it.  Also, the send_frags() function is
> there due to CAPWAP's protocol level fragmentation, which we don't
> have to worry about here.  Without these two we can call directly into
> the GRE data plane code and have it do the complete process without
> much in the way of code duplication.
>

OK, we can drop capwap related support in xmit, but IPIP still needs it, so
I think we can keep it in ip_tunnel.


>  * Registering to receive packets should probably into the GRE data
> plane module instead of directly to the IP tunnel code.  There's
> potentially a module dependency problem otherwise - if all you do is
> register to receive packets then there's nothing that will actually
> cause the GRE module to get loaded.  This is resolved at the moment by
> the use of the transmit function but it seems a little fragile.  We
> can definitely still use the IP tunnel code and share structures, etc.
> but we would just add an extra hop.
>
>
we can have direct dependance of gre module on ip_tunnel module by pulling
hash-lookup into ip_tunnel. ip_gre has dependency on gre. So am I not sure
abt fragile dependency.


> At this point the upper tunneling interface would look something like this:
> gre_register()
> gre_unregister()
>

what about vxlan and future protocols?


> gre_send()
>
> It's obviously protocol specific but I think at this point that it's
> inevitable and this way at least it is consistent.  I'm also aware
> that it makes the IP tunnel code more of a library than I was
> proposing - now that I can see the code I can understand what you were
> saying that it makes more sense this way.
>
> On Thu, Nov 22, 2012 at 7:56 AM, Pravin B Shelar <[email protected]>
> wrote:
> > diff --git a/include/net/gre.h b/include/net/gre.h
> > index 8266547..dda547a 100644
> > --- a/include/net/gre.h
> > +++ b/include/net/gre.h
> > @@ -15,4 +16,12 @@ struct gre_protocol {
> >  int gre_add_protocol(const struct gre_protocol *proto, u8 version);
> >  int gre_del_protocol(const struct gre_protocol *proto, u8 version);
> >
> > +struct sk_buff *gre_build_header(struct sk_buff *skb,
> > +                                 const struct tnl_ptk_info *tpi);
> > +struct gre_base_hdr {
> > +       __be16 flags;
> > +       __be16 protocol;
> > +};
> > +#define GRE_HEADER_SECTION 4
>
> I think these are only used in gre.c, so maybe we can keep the
> definitions private to that (since ideally no other code will need to
> have these protocol details).
>
> ok.


> > diff --git a/include/net/ipip.h b/include/net/ipip.h
> > index 21947cf..a14bde8 100644
> > --- a/include/net/ipip.h
> > +++ b/include/net/ipip.h
>
> What do you think about renaming this to ip_tunnel.h or similar?  It
> seems like too specific of a name at this point.
>
> ok


> > +#define TUNNEL_CSUM        __cpu_to_be16(0x8000)
> > +#define TUNNEL_KEY         __cpu_to_be16(0x2000)
> > +#define TUNNEL_SEQ         __cpu_to_be16(0x1000)
>
> I don't think that we really want to tie these directly to GRE flags.
> As we add more protocols it will just get more complicated.
>
> ok.


> > +struct tnl_ptk_info {
> > +       __be16 flags;
> > +       __be16 proto;
> > +       __be32 key;
> > +       __be32 seq;
> > +       int hdr_len;
> > +       __sum16 csum;
>
> Do we actually need the hdr_len and csum fields.  hdr_len seems like
> it ties the control plane to the data plane a little too closely.  For
> csum, can't we directly drop packets with invalid checksums when we
> parse the header?  There aren't any current users of packets with
> invalid checksums and none really come to mind in the future.
>
> ok.


> > +#define HASH_ON_KEY    1
>
> I think this isn't used until later with VXLAN so maybe we should wait
> until then.
>
> right.


> > +struct ip_tunnel_ops {
> > +       u32  flags;
> > +       int (*parse_netlink_parms)(struct ip_tunnel *vxlan,
> > +                                  struct nlattr *data[],
> > +                                  struct nlattr *tb[],
> > +                                  struct ip_tunnel_parm *parms);
> > +       int (*get_ioctl_param)(struct net_device *dev,
> > +                              struct ifreq *ifr,
> > +                              struct ip_tunnel_parm *p);
> > +};
> > +
> > +#define IPT_HASH_BITS   10
> > +#define IPT_HASH_SIZE   (1 << IPT_HASH_BITS)
> > +
> > +struct ip_tunnel_net {
> > +       struct ip_tunnel __rcu **tunnels;
> > +       struct net_device *fb_tunnel_dev;
> > +       const struct ip_tunnel_ops *ops;
> > +};
> > +
> > +enum ipt_type {
> > +       IPT_GRE,
> > +       IPT_VXLAN,
>
> Probably the VXLAN identifier should come later when VXLAN starts to
> actually use this infrastructure.
>
> ok..


> > +#define IPT_HASH_BUCKETS       16
> > +extern struct hlist_head ipt_proto[IPT_HASH_BUCKETS];
>
> The names for IPT_HASH_BUCKETS and IPT_HASH_SIZE are very similar, we
> probably should choose something more descriptive.
>
> > +static inline void ip_tunnel_setup(struct net_device *dev, int net_id)
> > +{
> > +       struct ip_tunnel *tunnel = netdev_priv(dev);
> > +       tunnel->ipt_net_id = net_id;
> > +}
>
> Does this need to be an inline?  We could probably just make it consistent.
>
> > diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
> > index 5a19aeb..4110ff3 100644
> > --- a/net/ipv4/Kconfig
> > +++ b/net/ipv4/Kconfig
> > @@ -166,6 +166,7 @@ config IP_PNP_RARP
> >  config NET_IPIP
> >         tristate "IP: tunneling"
> >         select INET_TUNNEL
> > +       select NET_IP_TUNNEL
>
> IPIP isn't using anything from here yet, right?
>
> ok.


> > diff --git a/net/ipv4/gre.c b/net/ipv4/gre.c
> > index 5a903dc..b837551 100644
> > --- a/net/ipv4/gre.c
> > +++ b/net/ipv4/gre.c
> > +struct sk_buff *gre_build_header(struct sk_buff *skb,
> > +                                const struct tnl_ptk_info *tpi)
> > +{
> > +       struct iphdr *iph = ip_hdr(skb);
> > +       struct gre_base_hdr *greh = (struct gre_base_hdr *)&iph[1];
>
> Can we reverse the order that we build the headers so that GRE comes
> before the IP header?  That way you could just push each incremental
> piece on, rather than having to do all of the pointer manipulation.
>
> ok


> > +       struct dst_entry *dst = skb_dst(skb);
> > +
> > +       greh->flags = tpi->flags;
> > +       greh->protocol = tpi->proto;
> > +       if (tpi->flags&(GRE_KEY|GRE_CSUM|GRE_SEQ)) {
> > +               __be32 *ptr = (__be32 *)(((u8 *)greh) + tpi->hdr_len -
> 4);
> > +
> > +               if (tpi->flags&GRE_SEQ) {
> > +                       *ptr = tpi->seq;
> > +                       ptr--;
> > +               }
> > +               if (tpi->flags&GRE_KEY) {
> > +                       *ptr = tpi->key;
> > +                       ptr--;
> > +               }
> > +               if (tpi->flags&GRE_CSUM) {
> > +                       *(__sum16 *)ptr = 0;
> > +                       *(__sum16 *)ptr = csum_fold(skb_checksum(skb,
> > +                                          skb_transport_offset(skb),
> > +                                          skb->len -
> skb_transport_offset(skb),
> > +                                               0));
> > +               }
> > +       }
> > +       skb->local_df = 1;
> > +       __ip_select_ident(ip_hdr(skb), dst, 0);
>
> We should just fix the fragmentation code to select an ID if necessary
> rather than keeping this workaround.
>
> I will post separate patch for it.

> +       return skb;
> > +}
> > +EXPORT_SYMBOL(gre_build_header);
> > +
> > +
>
> Extra blank line.
>
> > +static int parse_gre_header(struct sk_buff *skb, struct tnl_ptk_info
> *tpi)
> > +{
> > +       struct gre_base_hdr *greh = (struct gre_base_hdr *)skb->data;
> > +       __be32 *options = (__be32 *)(greh + 1);
> > +
> > +       if (unlikely(greh->flags & (GRE_VERSION | GRE_ROUTING)))
> > +               return -EINVAL;
> > +
> > +       tpi->flags = greh->flags;
> > +       tpi->proto = greh->protocol;
> > +
> > +       tpi->hdr_len = GRE_HEADER_SECTION;
> > +       tpi->csum = check_checksum(skb);
> > +
> > +       if (tpi->csum)
> > +               return -EINVAL;
> > +
> > +       if (greh->flags & GRE_CSUM) {
> > +               tpi->hdr_len += GRE_HEADER_SECTION;
> > +               options++;
> > +       }
>
> Can we incrementally pull each piece off so that we don't have to pass
> hdr_len around?
>
> ok.


> > +static int ipgre_rcv_v0(struct sk_buff *skb)
> > +{
> > +       struct tnl_ptk_info tpi;
> > +       struct ipt_protocol *proto;
> > +       struct hlist_node *n;
> > +       struct hlist_head *head;
> > +
> > +       if (!pskb_may_pull(skb, 16))
> > +               goto drop;
> > +
> > +       if (parse_gre_header(skb, &tpi) < 0)
> > +               goto drop;
> > +
> > +       head = ipt_hash_bucket(IPT_GRE, 0);
> > +       hlist_for_each_entry_rcu(proto, n, head, node) {
> > +               int ret;
> > +
> > +               if (proto->type != IPT_GRE || proto->portno != 0)
> > +                       continue;
> > +               ret = proto->handler(skb, &tpi);
>
> It seems that it is very likely that users of this API will have the
> same handler for all port numbers (and running different
> implementation on different ports at the same time seems not that
> common either).  In that case, it would be easier to just pass the
> port number to the handler and let it do its own lookup.
>
> Also, does it make sense to pull the complete hash and lookup block
> out into ip_tunnel.c?
>
> ok.

> > +static void ipgre_err_v0(struct sk_buff *skb, u32 info)
> > +{
> > +
> > +       /* All the routers (except for Linux) return only
> > +        * 8 bytes of packet payload. It means, that precise relaying of
> > +        * ICMP in the real Internet is absolutely infeasible.
> > +        *
> > +        * Moreover, Cisco "wise men" put GRE key to the third word
> > +        * in GRE header. It makes impossible maintaining even soft
> > +        * state for keyed
> > +        * GRE tunnels with enabled checksum. Tell them "thank you".
> > +        *
> > +        * Well, I wonder, rfc1812 was written by Cisco employee,
> > +        * what the hell these idiots break standards established
> > +        * by themselves???
> > +        */
> > +
> > +       const int type = icmp_hdr(skb)->type;
> > +       const int code = icmp_hdr(skb)->code;
> > +       struct tnl_ptk_info tpi;
> > +       struct hlist_node *n;
> > +       struct hlist_head *head;
> > +       struct ipt_protocol *proto;
> > +
> > +       if (!pskb_may_pull(skb, sizeof(struct gre_base_hdr) + ETH_HLEN))
> > +               return;
>
> This code is shared by both encapsulated IP and Ethernet so we should
> be careful not to make any assumptions there.
>
> ok.


> > +       parse_gre_header(skb, &tpi);
> > +
> > +       if (tpi.csum)
> > +               return;
>
> I don't think that we can validate the checksum here.  It's likely
> that the ICMP message will only return a portion of the GRE packet so
> we'll falsely think the checksum is wrong.
>
> > +       /* If only 8 bytes returned, keyed message will be dropped here
> */
> > +       if (tpi.flags & GRE_KEY) {
> > +               if ((tpi.flags & GRE_CSUM) && (tpi.hdr_len < 12))
> > +                       return;
> > +               if (tpi.hdr_len < 8)
> > +                       return;
> > +       }
> > +
> > +       switch (type) {
> > +       default:
> > +       case ICMP_PARAMETERPROB:
> > +               return;
> > +
> > +       case ICMP_DEST_UNREACH:
> > +               switch (code) {
> > +               case ICMP_SR_FAILED:
> > +               case ICMP_PORT_UNREACH:
> > +                       /* Impossible event. */
> > +               return;
> > +               default:
> > +                       /* All others are translated to HOST_UNREACH.
> > +                          rfc2003 contains "deep thoughts" about
> NET_UNREACH,
> > +                          I believe they are just ether pollution. --ANK
> > +                        */
> > +               break;
> > +               }
> > +               break;
> > +       case ICMP_TIME_EXCEEDED:
> > +               if (code != ICMP_EXC_TTL)
> > +                       return;
> > +               break;
> > +
> > +       case ICMP_REDIRECT:
> > +               break;
> > +       }
> > +
> > +       head = ipt_hash_bucket(IPT_GRE, 0);
> > +       hlist_for_each_entry_rcu(proto, n, head, node) {
> > +               if (proto->type != IPT_GRE || proto->portno != 0)
> > +                       continue;
> > +               if (proto->err_handler(skb, info, &tpi) <= 0)
> > +                       return;
> > +       }
>
> I think we can move the dispatch lookup above the type check to avoid
> imposing a policy on the ultimate handler.
>
> That said, can we continue to handle path MTU discovery and ICMP
> redirects at this level?  It doesn't really depend on the tunnel
> device and seems like a common data plane component.
>
> > diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> > index 0d4eecd..829fe3d 100644
> > --- a/net/ipv4/ip_gre.c
> > +++ b/net/ipv4/ip_gre.c
> >  /* Fallback tunnel: no source, no destination, no key, no options */
> >
> >  #define HASH_SIZE  16
>
> I think these two lines are unused now.
>
> >  static int ipgre_net_id __read_mostly;
>
> > -static void ipgre_err(struct sk_buff *skb, u32 info)
> [...]
> > +       int dev_type;
> [...]
> > +       if (tpi->proto == htons(ETH_P_TEB)) {
> > +               dev_type = ARPHRD_ETHER;
> > +               itn = net_generic(net, ipgre_tap_net_id);
> > +       } else {
> > +               dev_type = ARPHRD_IPGRE;
> > +               itn = net_generic(net, ipgre_net_id);
> >         }
>
> It doesn't look to me like dev_type is used anywhere, either here or
> in ipgre_rcv().
>
> Also, can we use a return code other than 1 for tunnel not found in
> these two functions?  Something symbolic seems better to me.
>
> >  static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct
> net_device *dev)
> [...]
> > +       err = ip_tunnel_build_iphdr(skb, dev, tiph, gre_hlen, &iph);
> > +       if (err)
> > +               return NETDEV_TX_OK;
> > +
> > +       tpi.flags = tunnel->parms.o_flags;
> > +       tpi.proto = proto;
> > +       tpi.key = tunnel->parms.o_key;
> > +       tpi.seq =  htonl(tunnel->o_seqno);
> > +       tpi.hdr_len = tunnel->hlen;
> > +       if (skb_cow_head(skb, dev->needed_headroom)) {
> > +               dev->stats.tx_dropped++;
> >                 dev_kfree_skb(skb);
> > -               skb = new_skb;
> > -               old_iph = ip_hdr(skb);
> > -       }
>
> Can we push this down into the encapsulation code?  It seems like a
> common data plane piece.
>
> do u meanpush it to ip_tunnel_xmit()? I am using same function in ip_gre
and ovs-gre thats why i kept it here rather than ip_tunnel.


> > -       if (tunnel->parms.o_flags&(GRE_KEY|GRE_CSUM|GRE_SEQ)) {
> > -               __be32 *ptr = (__be32 *)(((u8 *)iph) + tunnel->hlen - 4);
> > +               u64_stats_update_begin(&tstats->syncp);
> > +               tstats->tx_bytes += tx_len;
> > +               tstats->tx_packets++;
> > +               u64_stats_update_end(&tstats->syncp);
> >
> > -               if (tunnel->parms.o_flags&GRE_SEQ) {
> > +               if (tunnel->parms.o_flags&GRE_SEQ)
> >                         ++tunnel->o_seqno;
>
> The current version increments the sequence number before
> transmission, so it always goes up regardless of the outcome.  That
> seems a little more natural to me.
>
> Also, can we push the stats updates to ip_tunnel?  They are
> incremented there everywhere else and it would be good to keep things
> in a single location to the greatest extent possible.
>
>
> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
> > new file mode 100644
> > index 0000000..e5e2fde
> > --- /dev/null
> > +++ b/net/ipv4/ip_tunnel.c
> > +static void free_linked_skbs(struct sk_buff *skb)
> > +{
> > +       while (skb) {
> > +               struct sk_buff *next = skb->next;
> > +               kfree_skb(skb);
> > +               skb = next;
> > +       }
> > +}
> > +
> > +
>
> Extra blank line.
>
> > +static int send_frags(struct sk_buff *skb,
> > +                     int tunnel_hlen)
> > +{
> > +       int sent_len;
> > +
> > +       sent_len = 0;
> > +       while (skb) {
> > +               struct sk_buff *next = skb->next;
> > +               int frag_len = skb->len - tunnel_hlen;
> > +               int err;
> > +
> > +               skb->next = NULL;
> > +               memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
> > +
> > +               err = ip_local_out(skb);
> > +               skb = next;
> > +               if (unlikely(net_xmit_eval(err)))
> > +                       goto free_frags;
> > +               sent_len += frag_len;
> > +       }
> > +
> > +       return sent_len;
> > +
> > +free_frags:
> > +       /*
> > +        * There's no point in continuing to send fragments once one has
> been
> > +        * dropped so just free the rest.  This may help improve the
> congestion
> > +        * that caused the first packet to be dropped.
> > +        */
> > +       free_linked_skbs(skb);
> > +       return sent_len;
> > +}
> > +
> > +
>
> Another double blank line here.
>
> > +int __net_init ip_tunnel_init_net(struct net *net, int ipt_net_id,
> > +                                 struct rtnl_link_ops *ops)
> > +{
> > +       struct ip_tunnel_net *itn = net_generic(net, ipt_net_id);
> > +       struct ip_tunnel_parm parms;
> > +
> > +       itn->tunnels = kzalloc(sizeof(void *) * IPT_HASH_SIZE,
> GFP_KERNEL);
> > +       if (!itn->tunnels)
> > +               return -ENOMEM;
> > +
> > +       if (!ops) {
>
> Presumably some type of ops needs to be supplied.  Should the fallback
> device be created only if we have ioctl ops?
>
> > +/* Can be moved to ip-tunnel */
> > +int ip_tunnel_newlink(struct net *src_net, struct net_device *dev,
>
> I think the comment here is out of date.
>
> > +void ip_tunnel_uninit(struct net_device *dev)
> > +{
> > +       struct net *net = dev_net(dev);
> > +       struct ip_tunnel *tunnel = netdev_priv(dev);
> > +       struct ip_tunnel_net *itn;
> > +
> > +       itn = net_generic(net, tunnel->ipt_net_id);
> > +       /* fb_tunnel_dev will be unregisted in net-exit call. */
> > +       if (itn->fb_tunnel_dev != dev)
> > +               ip_tunnel_unlink(itn, netdev_priv(dev));
> > +}
> > +EXPORT_SYMBOL(ip_tunnel_uninit);
> > +
> > +
>
> Double blank line here.
>
> > +int ipt_add_protocol(struct ipt_protocol *newp)
> > +{
> > +       struct hlist_head *head;
> > +       struct ipt_protocol *ipt = NULL;
> > +       struct hlist_node *n;
> > +       ASSERT_RTNL();
> > +
> > +       head = ipt_hash_bucket(newp->type, newp->portno);
> > +
> > +       hlist_for_each_entry_rcu(ipt, n, head, node) {
> > +               if (ipt->type != newp->type || ipt->portno !=
> newp->portno)
> > +                       continue;
> > +               if (ipt->priority > newp->priority) {
> > +                       hlist_add_before_rcu(&newp->node, &ipt->node);
> > +                       return 0;
> > +               }
> > +               if (ipt->priority == newp->priority)
> > +                       return -1;
>
> Can you use a symbolic error code here?
>
> > +void ipt_del_protocol(struct ipt_protocol *proto)
> > +{
> > +       ASSERT_RTNL();
> > +
> > +       hlist_del_rcu(&proto->node);
> > +       synchronize_rcu();
>
> I would use synchronize_net() here.
>
> > +static int __init ip_tunnel_mod_init(void)
> > +{
> > +       pr_info("IP_Tunnel init\n");
>
> I would either make this a little more descriptive or just remove it
> entirely.  Right now it's not very human readable.
>
> > +       get_random_bytes(&ip_tunnel_salt, sizeof(ip_tunnel_salt));
>
> Do we actually need to hash in a random value?  The hash entries
> aren't controlled by anything on the network.
>
> ok.

Thanks for review.


> I'm still working on going through the rest of the ioctl/Netlink code
> but I wanted to get back to you with what I have.
>
_______________________________________________
dev mailing list
[email protected]
http://openvswitch.org/mailman/listinfo/dev

Reply via email to