> 
> Ian Donaldson wrote:
> > I have a pair of Sun Fire X2100M2's connected via 100M eth switches
> > (yeah, crippling gig-E) and running pfil 2.1.11, ip_fil4.1.16 and 
> > was noticing significant TCP throughput performance differences 
> > for traffic between various ethernet interfaces on the two systems.
> >
> > (both systems running Solaris 10/x86 6/06 with 26 Feb recommended 
> > patch cluster, NVIDIA add-on driver patch 122530-02 for nge)
> >
> > eg: system1 bge0 -> system2 bge0  1700KB/s
> >     system1 bge1 -> system2 bge1 11000KB/s
> >     system1 nge1 -> system2 nge1 11000KB/s
> >
> > With top I noticed a significant portion of system time being consumed
> > in the bge0 test (like 50%).
> >
> > Using 
> >
> >     lockstat -kIi997 sleep 10
> >
> > What is curious though is that this problem only manifests itself
> > on one of the 3 interfaces I have enabled in the system, suggesting 
> > something else is broken, as I would have thought that all interface 
> > traffic would pass thru the same code.
> > (yes I've verified pfil module is pushed on all interfaces)
> >
> > It doesn't manifest itself on another X2100M2 system that only has 
> > bge0 enabled but.
> >   
> 
> Are you saying that where bge1 is used but not bge0, the problem doesn't
> arise?  That would be strange!  If it happened when either bge0 or bge1
> was being used, I could understand that...kinda...it'll be because the
> bge driver is communicating with IP "differently" because pfil is there
> in between.
>

Yep, as stated.  Traffic over bge1 and over nge1 between the two systems
was fine; only bge0 was affected.

Since then I've also discovered that this problem exists on some of our 
Solaris 9 systems that run similar ipf/pfil versions 
(i.e. pfil_2.1.9, ip_fil4.1.13), but not in all combinations.

eg:
   - Sun Fire V60x; no problems at all.  Can't reproduce it on 
     either e1000g0 or e1000g1.  
     (Solaris 9/x86 2003/08 base with May 2005 recommended patch cluster)

     lockstat doesn't even show pfil_printmchain being called at all.

   - Sun Netra T1 105 sparc; it's 100% reproducible on both interfaces 
     (hme0 and hme1).
     (Solaris 9 sparc 2003/12 base with May 2005 recommended patch cluster)
     lockstat shows vsnprintf and pfil_printmchain at the top of usage.

     Throughput is abysmal; 300KB/sec.  Kernel CPU usage 97%.  

   - Sun Fire V100 sparc; no problems at all.  Can't reproduce it on either
     dmfe0 or dmfe1.  pfil_printmchain showed only a handful of calls 
     in the trace.
     (identical OS/patch base as for the Netra)
     Tested two similar systems.  Same results.

Note that the ipf/pfil on the sparc systems were absolutely identical;
installed from the same package I built.
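
For scale, here is a rough sketch that compares the throughput figures
quoted in this thread with the nominal line rate.  It assumes a 100 Mbit/s
link for every test (stated only for the X2100M2 pair, assumed for the
Netra), 1 KB = 1000 bytes, and no allowance for protocol overhead; the
names in it are illustrative only.

```python
# Rough comparison of the reported throughputs against a 100 Mbit/s link.
# Assumptions (not from the thread): 1 KB = 1000 bytes, protocol overhead
# ignored, and the Netra also sitting on a 100 Mbit/s link.
LINK_MBIT = 100
LINE_RATE_KB = LINK_MBIT * 1_000_000 / 8 / 1000  # nominal KB/s on the wire

# Figures as reported in this thread (KB/s).
reported = {
    "X2100M2 bge0": 1700,
    "X2100M2 bge1": 11000,
    "X2100M2 nge1": 11000,
    "Netra T1 hme": 300,
}


def percent_of_line_rate(kb_per_s):
    """Return a throughput as a percentage of the nominal line rate."""
    return 100.0 * kb_per_s / LINE_RATE_KB


for name, rate in reported.items():
    print(f"{name:14s} {rate:6d} KB/s  {percent_of_line_rate(rate):5.1f}%")
```

Under those assumptions the unaffected interfaces run at roughly 88% of
line rate, bge0 at under 14%, and the Netra at under 3%.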

So what other factors can control whether pfil_printmchain is called?
(I couldn't spot anything in the code myself, and I hate an unsolved
mystery like this, as it's probably related to another bug which could
be way more serious.)

Ian D
