On Mon, Oct 30, 2000 at 12:16:46AM -0700, Jeff V. Merkey wrote:
> On Mon, Oct 30, 2000 at 08:08:58AM +0100, Andi Kleen wrote:
> > On Sun, Oct 29, 2000 at 11:58:21PM -0700, Jeff V. Merkey wrote:
> > > On Mon, Oct 30, 2000 at 07:47:00AM +0100, Andi Kleen wrote:
> > > > On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> > > > > On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > > > > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > > > > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > > > > > but it's most impressive.
> > > > > > 
> > > > > > Can you send a summary of the NFS reports to [EMAIL PROTECTED]
> > > > > 
> > > > > Yes.  I just went home, so I am emailing from my house.  I'll post late 
> > > > > tonight or in the morning.  On 100Mbit, NFS going Linux->Linux is getting 
> > > > > about 3% better throughput than an IPX NetWare client -> NetWare 5.x on 
> > > > > the same network.  When you start loading up a Linux server, it drops off 
> > > > > sharply while NetWare keeps scaling; however, this does indicate that the 
> > > > > LAN code paths are equivalent in latency to the MSM/TSM/HSM stack in 
> > > > > NetWare.  NetWare does better caching (but we'll fix this in Linux next).  
> > > > > I think the ring transitions to user space daemons are what are causing 
> > > > > the scaling problems for Linux vs. NetWare.
> > > > 
> > > > There are no user space daemons involved in the knfsd fast path, only in
> > > > slow paths like mounting.
> > > 
> > > So why is it spawning off nfsd servers in user space?  I did notice that all
> > 
> > They just provide a process context, the actual work is only done in kernel mode.
> > 
> > > the connect logic is down in the kernel with the RPC stuff in xprt.c, and this
> > > looks to be pretty fast.  I know about the checksumming problem with TCP/IP.
> > > But cycles are cheap on today's processors, so even with this overhead, it 
> > > could still get faster.  IPX uses small packets that are less wire-efficient, 
> > > since the ratio of header size to payload is larger than what NFS in Linux
> > > is doing, plus there are more of them for an equivalent data transfer, even
> > > with packet burst.  IPX would tend to be faster if there were multiple 
> > > routers involved, since the latency of smaller packets would be lower and
> > > IPX never has to deal with the problem of fragmentation.
> > > 
> > > Is there any CR3 reloading or task switching going on for each interrupt
> > > that comes in that you know of, BTW?  EMON stats show the server's bus utilization
> > 
> > No, interrupts do not change CR3.
> > 
> > One problem in Linux 2.2 is that kernel threads reload their VM on context switch
> > (that would include the nfsd thread); this should be fixed in 2.4 with lazy mm.
> 
> When you say it reloads its VM, you mean it reloads the CR3 register?  

Yes.
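
Roughly what that boils down to on i386 in 2.2 (paraphrased from memory, not the
literal source; the function name is just illustrative):

        static inline void switch_mm_sketch(struct mm_struct *prev,
                                            struct mm_struct *next)
        {
                /* same mm: no CR3 write, TLB contents survive */
                if (prev != next)
                        asm volatile("movl %0,%%cr3" : : "r" (__pa(next->pgd)));
        }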

> This will cost you about 15 clocks up front, and about 150 clocks over
> time as each TLB entry is reloaded (plus it will do LOCK# assertions invisibly
> underneath for PTE fetches).  One major difference is that in NetWare, 
> CR3 never gets changed in any network I/O fast paths, and none of 
> the TCP or IPX processes exist in user space, so there's never this 
> background overhead of page tables being loaded in and out.  CR3 only 
> gets mucked with when someone allocates memory and pages in an address 
> frame (when memory is allocated and freed).
> 
> The user space activity is what's causing this, plus the copying.  Is there
> an easy way to completely disable multiple PTE/PDE address spaces and just
> map the entire address space linearly (like NetWare) with a compile option, 
> so I could do a true apples-to-apples comparison?

No. In 2.4 you could probably use the on-demand lazy VM mechanism Ingo described for
the nfsd processes. In 2.2 it is a bit trickier; if I remember right, lazy mm needed
quite a few changes.
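
Roughly the shape of the 2.4 context-switch logic (paraphrased, not the literal
source): a task with no user mm borrows the previous task's mm, so switching to
it needs no CR3 write at all.

        if (!next->mm) {                /* kernel thread / lazy task */
                next->active_mm = prev->active_mm;
                atomic_inc(&prev->active_mm->mm_count);
                enter_lazy_tlb(prev->active_mm, next, cpu);
        } else {
                next->active_mm = next->mm;
                switch_mm(prev->active_mm, next->mm, next, cpu); /* may write CR3 */
        }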

But before doing too many changes, I would first verify that this is really the problem.


> > Hmm, actually it should only be fixed for true kernel threads that have been 
> > started with kernel_thread(); the "pseudo kernel threads" like nfsd uses probably 
> > do not get that optimization, because they don't set their MM to init_mm. 
> > 
> > > to get disproportionately higher in Linux than NetWare 5.x, and when it hits
> > > 60% of total clock cycles, Linux starts dropping off.  NetWare 5.x is 1/8 
> > 
> > I think that can be explained by the copying.
> 
> It's more.  I am seeing tons of segment register reloads, which are very 
> heavy, and atomic PTE reads.

PTEs are read for page aging under memory pressure in vmscan.
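
Roughly the aging step there (paraphrased, not the literal 2.2 source):
test-and-clear the accessed bit in the PTE to see whether the page was
referenced since the last scan.

        if (pte_young(pte)) {
                set_pte(page_table, pte_mkold(pte)); /* clear accessed bit */
                return 0;       /* recently referenced: keep the page */
        }
        /* not referenced since the last pass: candidate for swap-out */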

Segment register reloads happen on interrupt entry, to load the kernel cs/ds (are they
that costly?). If it were really costly, you could probably check in interrupt entry
whether you're already running in kernel space and skip the reload.
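
In C terms the check would look something like this (the real entry code is
assembly; load_kernel_segments() here is a hypothetical helper standing in for
the movl $__KERNEL_DS,... sequence):

        /* low two bits of the saved CS are the CPL at interrupt time */
        if ((regs->xcs & 3) != 0)
                load_kernel_segments(); /* came from user: reload ds/es */
        /* else the kernel was interrupted and ds/es still hold
         * kernel selectors, so the reload can be skipped */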

> 
> > > the bus utilization, which would suggest clocks are being used for something
> > > other than pushing packets in and out of the box (the checksumming is done 
> > > in the tcp code when it copies the fragments, which is very low overhead).
> > 
> > No, unfortunately not. NFS uses UDP, and the UDP defragmenting doesn't do 
> > copy-checksum.
> 
> Correct.  You're right -- it's in the tcp code; however, I remember going over
> this when I was reviewing Alan's code -- that's what I was thinking of.

2.2 does not use checksum-copy-to-user in the TCP RX path; the checksum is either done
separately or in hardware.  It does csum-copy-fragment for TX, as well as for UDP.
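
For reference, the trick folds the 16-bit ones'-complement Internet checksum into
the same pass that copies the data, so each byte is only touched once.  A minimal
user-space sketch (illustrative names, not the kernel's csum_partial_copy_*
routines):

#include <stddef.h>
#include <stdint.h>

static uint16_t csum_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2) {      /* 16-bit words */
                sum += (uint32_t)(src[i] << 8) | src[i + 1];
                dst[i] = src[i];
                dst[i + 1] = src[i + 1];
        }
        if (i < len) {                          /* odd trailing byte */
                sum += (uint32_t)src[i] << 8;
                dst[i] = src[i];
        }
        while (sum >> 16)                       /* fold carries back in */
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
}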

-Andi