>
> Ian Donaldson wrote:
> > I have a pair of Sun Fire X2100M2's connected via 100M eth switches
> > (yeah, crippling gig-E) and running pfil 2.1.11, ip_fil4.1.16 and
> > was noticing significant TCP throughput performance differences
> > for traffic between various ethernet interfaces on the two systems.
> >
> > (both systems running Solaris 10/x86 6/06 with 26 Feb recommended
> > patch cluster, NVIDIA add-on driver patch 122530-02 for nge)
> >
> > eg: system1 bge0 -> system2 bge0    1700KB/s
> >     system1 bge1 -> system2 bge1   11000KB/s
> >     system1 nge1 -> system2 nge1   11000KB/s
> >
> > With top I noticed a significant portion of system time being consumed
> > in the bge0 test (like 50%).
> >
> > Using
> >
> > lockstat -kIi997 sleep 10
> >
> > What is curious though is that this problem only manifests itself
> > on one of the 3 interfaces I have enabled in the system, suggesting
> > something else is broken, as I would have thought that all interface
> > traffic would pass thru the same code.
> > (yes I've verified pfil module is pushed on all interfaces)
> >
> > It doesn't manifest itself on another X2100M2 system that only has
> > bge0 enabled, though.
> >
>
> Are you saying that where bge1 is used but not bge0, the problem doesn't
> arise? That would be strange! If it happened when either bge0 or bge1
> was being used, I could understand that...kinda...it'll be because the
> bge driver is communicating with IP "differently" because pfil is there
> in between.
>
Yep, as stated. Traffic over bge1 and nge1 between the two systems was
fine; only bge0 was affected.
Since then I've also discovered that this problem exists on some of our
Solaris 9 systems that run similar ipf/pfil versions (i.e. pfil_2.1.9
and ip_fil4.1.13), but not in all combinations.
eg:
- Sun Fire V60x: no problems at all. Can't reproduce it on
  either e1000g0 or e1000g1.
  (Solaris 9/x86 2003/08 base with May 2005 recommended patch cluster)
  lockstat doesn't show pfil_printmchain being called at all.
- Sun Netra T1 105 sparc: it's 100% reproducible on both interfaces
  (hme0 and hme1).
  (Solaris 9 sparc 2003/12 base with May 2005 recommended patch cluster)
  lockstat shows vsnprintf and pfil_printmchain at the top of usage.
  Throughput is abysmal: 300KB/sec. Kernel CPU usage 97%.
- Sun Fire V100 sparc: no problems at all. Can't reproduce it on either
  dmfe0 or dmfe1. pfil_printmchain showed only a handful of calls
  in the trace.
  (identical OS/patch base as for the Netra)
  Tested two similar systems. Same results.
Note that the ipf/pfil builds on the sparc systems were absolutely
identical, installed from the same package I built.
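To put the throughput figures above in perspective, here is a rough
sketch that compares them with the nominal 100 Mbit/s line rate. It
assumes KB means 1000 bytes, ignores Ethernet/IP/TCP framing overhead,
and assumes the Netra was also on a 100 Mbit/s link; the function name
is purely illustrative.

```python
# Rough comparison of measured TCP throughput against the raw line rate.
# Assumptions (not from the measurements themselves): 1 KB = 1000 bytes,
# no allowance for protocol overhead, and a 100 Mbit/s link everywhere.

LINK_MBIT = 100  # nominal link speed of the 100M switches


def percent_of_line_rate(kb_per_sec, link_mbit=LINK_MBIT):
    """Return measured throughput as a percentage of the raw line rate."""
    line_rate_kb = link_mbit * 1_000_000 / 8 / 1000  # 12500 KB/s at 100 Mbit/s
    return 100.0 * kb_per_sec / line_rate_kb


# Figures quoted in this thread, in KB/s.
measurements = {
    "X2100M2 bge0 -> bge0": 1700,
    "X2100M2 bge1 -> bge1": 11000,
    "X2100M2 nge1 -> nge1": 11000,
    "Netra T1 105 hme": 300,
}

for name, kb in measurements.items():
    pct = percent_of_line_rate(kb)
    print(f"{name:24s} {kb:6d} KB/s  {pct:5.1f}% of line rate")
```

On these assumptions the good interfaces run close to line rate while
the affected ones sit far below it, which fits with the CPU time going
into vsnprintf/pfil_printmchain rather than the wire being the limit.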
So what other factors can control whether pfil_printmchain is called?
(I couldn't spot anything in the code myself, and I hate an unsolved
mystery like this, as it's probably related to another bug which could
be way more serious.)
Ian D