Linux-Development-Sys Digest #356, Volume #8     Sun, 17 Dec 00 20:13:14 EST

Contents:
  Re: A faster memcpy and bzero for x86 (Linus Torvalds)
  [OT] SPAM blocking; Was: Re: Kernel supporting more than 2GB of Ram (Anders Larsen)
  Re: How to make a BIOS call in Linux (Nix)
  Re: getty-less system? (Bryan Hackney)
  Re: A faster memcpy and bzero for x86 (Linus Torvalds)
  Re: Compiling C++ programs with GCC --> no GPL license implications (Nix)
  Re: imaginary complex value (Nix)
  Re: [OT] SPAM blocking; Was: Re: Kernel supporting more than 2GB of Ram ("Gene Heskett")
  Re: A faster memcpy and bzero for x86 (Andi Kleen)
  Re: A faster memcpy and bzero for x86 (Robert Krawitz)

----------------------------------------------------------------------------

From: [EMAIL PROTECTED] (Linus Torvalds)
Subject: Re: A faster memcpy and bzero for x86
Date: 17 Dec 2000 12:20:33 -0800

In article <[EMAIL PROTECTED]>, Robert Redelmeier  <[EMAIL PROTECTED]> wrote:
>Linus Torvalds wrote in part:
>
>> I'll see if I can see anything obvious from the profile.
>
>Well, I've looked at the 2.4.0t8 kernel (the best way for me
>to understand it is `objdump -d /usr/src/linux/vmlinux | more`)
>and memset looks healthy.  
>
>It uses `rep movsb` instead of the slightly faster `rep movsl`, but
>that only costs 16 extra CPU cycles, about 2% (863 vs 847), per 4KB
>zero.  Hardly worth any special-size code.

Clearing a page doesn't use "memset()"; it should be using
"clear_user_highpage()" - which on an x86 expands to a constant-size
memset that is different from the generic "memset()" library function.

Unless you have high-memory support enabled (ie the stuff for 1GB+ of
RAM), the code should just expand to

        movl $1024,%ecx
        xorl %eax,%eax
        rep ; stosl

for the page clearing code.
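
For illustration, here is a minimal user-space sketch (not the actual
kernel code) of why the constant-size case is special: when the length
is a compile-time constant, gcc can expand its builtin memset inline,
subject to target and optimization options, instead of calling the
generic out-of-line library routine.

        #include <string.h>

        #define PAGE_SIZE 4096

        /* Sketch only: with a constant length, gcc is free to expand
         * this builtin memset inline (typically to a "rep ; stosl"
         * style sequence) rather than call the library memset(). */
        static void clear_page_sketch(void *page)
        {
                memset(page, 0, PAGE_SIZE);
        }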

                Linus

------------------------------

From: Anders Larsen <[EMAIL PROTECTED]>
Subject: [OT] SPAM blocking; Was: Re: Kernel supporting more than 2GB of Ram
Date: Fri, 15 Dec 2000 17:21:29 +0100

Gene Heskett wrote:
> 
> Unrot13 this;
> Reply to: <[EMAIL PROTECTED]>

Why? (see below)

[snip]
>   Gene Heskett, CET, UHK       |Amiga A2k Zeus040, Linux @ 400mhz
>         email gene underscore heskett at iolinc dot net

Gene,

you *are* aware that your e-mail address is listed in the clear in the
From: header, aren't you?

> Date: 15 Dec 2000 8:41:8 -0500
> From: "Gene Heskett" <[EMAIL PROTECTED]>
> Subject: Re: Kernel supporting more than 2GB of Ram

cheers
Anders

a dot larsen at identecsolutions dot de

------------------------------

From: Nix <$}xinix{[email protected]>
Subject: Re: How to make a BIOS call in Linux
Date: 17 Dec 2000 20:23:57 +0000

On Thu, 14 Dec 2000, Kasper Dupont said:
> Otherwise you would have to put the driver
> in the boot record, and writing a portable
> harddisk driver in just 512 bytes would
> probably not be possible.

There is at least one boot loader out there that does just that. IDE
disks only, and I can't remember the name of the boot loader in question
for the life of me.

-- 
Not speaking for Boskone at the moment.

------------------------------

From: Bryan Hackney <[EMAIL PROTECTED]>
Subject: Re: getty-less system?
Date: Sun, 17 Dec 2000 15:40:48 -0600

Marty Ross wrote:
> 
> Using RedHat 6.2:
> 
> I'm trying to create a standalone system, so I need to understand exactly
> how "init" works.
> 
> If I say, for example, that my "standalone" program runs at "runlevel 4",
> and I insert a line
> into "inittab" that says:
> 
> myprog:4:respawn:/mydir/myscript
> 
> where "myscript" sets my environment, loads my daemon processes, and runs my
> application, the first daemon process I load gets terminated with "SIGTERM".
> Why?  By whom?  (presumably, by "init").  But I don't understand.  There IS
> a previous line in the "inittab" that reads:
> 
> l4:wait:/etc/rc.d/rc 4
> 
> But I assume this happens BEFORE "myscript" is executed, so I don't see the
> problem.
> 
> I have read many "startup" tutorials, and the "man" page and other tutorials
> on "init" and "inittab", and still I can't figure out....
> 
> What is a good way to run a "standalone" application (and yes, it would be
> nice if it had "normal" control of the console tty)?  I don't want to run
> "getty" or "login", yet I want some of the "controlling terminal" things
> that are done here.  Do I need to write a program to open the terminal
> device and set it up myself?  Boy, that sounds archaic!
> 
> Any help appreciated!

Don't use init at all. Do something like

init=/bin/sh

if using lilo.
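
For example (a sketch only; the image path and label are illustrative):

        # /etc/lilo.conf fragment
        image=/boot/vmlinuz
                label=standalone
                read-only
                # boot straight into a shell (or your own program)
                # instead of starting init
                append="init=/bin/sh"

You can also just type "linux init=/bin/sh" at the LILO boot prompt.
Whatever you start this way runs as process 1, with no getty/login and
no job control set up on the console, and it must never exit.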



-- 
                                 Bryan Hackney / BHC / [EMAIL PROTECTED]

------------------------------

From: [EMAIL PROTECTED] (Linus Torvalds)
Subject: Re: A faster memcpy and bzero for x86
Date: 17 Dec 2000 13:52:54 -0800

In article <[EMAIL PROTECTED]>, Robert Redelmeier  <[EMAIL PROTECTED]> wrote:
>Linus Torvalds wrote:
>> 
>> It's not just about polluting the caches.
>> 
>> It's also about polluting a shared bus, where the other CPU's that
>> aren't idle may be doing real work that _needs_ that bus.
>
>Well, there's something stinky happening anyways :)

Ok, I've analyzed this, and there is nothing stinky happening.

Or rather, the "stinky" is cache effects, not software.

>The asm program below tests various bzero routines and
>outputs four times in CPU clocks :  a userland 4kB bzero 
>with `rep stosl`, the [g]libc 4kB bzero, the time for a 
>second 32bit write and the time for the first 32bit write
>to an otherwise unused RAM page.
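
A minimal sketch of this kind of rdtsc-based timing (using gcc inline
asm on 32-bit x86 rather than Robert's standalone asm program; the
buffer and routine below are illustrative only):

        #include <stdio.h>
        #include <string.h>

        /* read the CPU cycle counter (i386 only; "=A" means edx:eax) */
        static inline unsigned long long rdtsc(void)
        {
                unsigned long long t;
                __asm__ __volatile__("rdtsc" : "=A" (t));
                return t;
        }

        int main(void)
        {
                static char buf[4096];
                unsigned long long t0, t1;

                t0 = rdtsc();
                memset(buf, 0, sizeof buf);   /* routine under test */
                t1 = rdtsc();

                printf("%llu cycles\n", t1 - t0);
                return 0;
        }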

It turns out that what you are testing is not at all what you _think_
you're testing. Which is why you don't understand the results, and why
they look so different on Linux and FreeBSD.

>Typical times are: [Abit BP6, 2 * 5.5 * 94 MHz Celerons]
>
>Linux 2.4.0t8:   847    915  33  12000 (~15% at 25000)
>FreeBSD   4.1:  1030  11000  33   5000 (all variable 15%)
>        notes:   (1)    (2)  (3)   (4)

One thing to clarify (for others) is that the numbers are shown in the
reverse order from the one in which they are actually calculated.  Ie
(4) is done first, and (1) is done last.  That's confusing, and it made
the real reason for the
numbers less apparent.

>1)  This time could be improved to 811 with MMX moves, a
>4% improvement likely to cost much more for MMX context saves.
>AMD K7 may perform differently.  The worst 32bit code I could 
>write in good conscience took 2100, the best 1600. Why FBSD 
>has variable times for a single instruction is a mystery.

NO!

(1) is the best possible time for doing a memset() WITH HOT CACHES!

Which is not necessarily the same as "best memset()" in general at all. 
The fastest way to clear memory can be very different for the hot-cache
and the cold-cache case. 

>2)  The glibc bzero routine is d@mn good!  It ought not 
>to be so deprecated.  The FreeBSD libc bzero routine s#x.

NO! Again.

(2) is, under Linux, again the same thing: it's a memset() with HOT
CACHES.

Under FreeBSD, it's a memset with mostly COLD CACHES. Those 11000 cycles
are what you need to clear memory that is not in the cache. Remember
this for later.

>3) 33 CPU cycles is exactly as expected:  32 clocks for rdtsc
>[measurement] overhead, and 1 clock for a single write to L1 
>cache which was brought in by the first [long] write.

This, actually, is not even a write to the cache, but to the write
buffers. There was no "first long write" at all. The cache never comes
into play here, really.

>4) The first write to a page is always long because of the
>copy-on-write VM philosophy.  The OS has to scrounge a freeable
>page.  But Linux takes considerably longer.  Why? Fragmentation?
>It should bzero rather than memcpy the zeropage. Most (>90%) CoW
>pagefaults are undoubtedly from the zeropage, and a memcpy
>would do nothing but eat cycles and pollute the caches!

No.  What happens is that under Linux, most of those 12000 cycles are
clearing memory that is not in the cache.  In particular, see above on
the FreeBSD memset() numbers.  Of the Linux 12000 cycles, 90%+ is for
the memset. The actual page fault handling etc takes on the order of
1000 cycles - a few hundred of this is the actual hardware cost on x86
to take a page fault.

On FreeBSD, those 5000 cycles are just mapping in an already zeroed
page. It looks to me like FreeBSD basically keeps most of the free
page list zeroed at all times, which is why you can have even 1000 pages
take a constant (fairly low) time for the mapping cost.

Ok, now for the REAL analysis of the numbers:

 - the FreeBSD memset() routine is probably fine. Call memset() twice,
   and I bet the second number will be ok.

 - FreeBSD does pre-zeroing, which makes it look good on certain things.
   In particular, this makes it look good at allocation-time.

   But what the numbers also show is that THIS IS BAD IN REAL LIFE!

   Pre-zeroing makes the kernel profile look good, which is probably
   why the BSDs do it.  What it doesn't show is that it generates all the
   wrong behaviour for the cache: it means that the cache has long since
   cooled down by the time the page is actually _used_. Look at the sum
   of all numbers (which is basically doing the same thing), and FreeBSD
   clearly loses to Linux. 

This is, in fact, the exact same performance problem that SLAB caches
have.  The thinking behind SLAB caches is that the initialization of a
data structure after allocation is expensive, so if you can avoid that
initialization, you can avoid that expense.  So slab caches remember
what the old contents were, and only initialize the fields that need it.
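
A toy illustration of that idea (not the kernel's slab allocator): the
free list keeps objects in their constructed state, so allocation only
fixes up the fields the previous user actually dirtied.

        /* Toy object cache, sketch only.  Objects on the free list
         * stay in their constructed state, so obj_alloc() hands them
         * out without re-initializing every field. */
        #include <stdlib.h>

        struct obj {
                struct obj *next;   /* free-list link, reused while free */
                int refcount;
                char payload[240];  /* keeps whatever it held last time */
        };

        static struct obj *free_list;

        static struct obj *obj_alloc(void)
        {
                struct obj *o = free_list;

                if (o)
                        free_list = o->next;
                else
                        o = malloc(sizeof *o); /* a real slab would run a
                                                  constructor here */
                if (o)
                        o->refcount = 1;  /* only the field that needs it */
                return o;
        }

        static void obj_free(struct obj *o)
        {
                /* no destruction: keep the constructed contents around */
                o->next = free_list;
                free_list = o;
        }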

This makes allocation costs much better, and makes them show up less
obviously on performance profiles.

What the people who did this and wrote all the papers on it forgot about
is that the initialization also brings the data structure into the
cache, and that subsequent _uses_ of the data structure thus get sped up
by initializing it.  So most of the time, the cost of the cache misses
is just spread out, not actually elided. 

This can be clearly seen in the memset() numbers above: the page miss is
improved, but because the page was cleared much earlier the cost of
bringing it into the cache is now transferred to user space instead of
kernel space.  Which looks good if you want to optimize the kernel, but
looks bad if you want to get the best system _performance_. 

You can obviously find cases where the pre-initialization actually does
help: if you don't touch the pages after allocating them,
pre-initialization is a clear win. It's easy to create these kinds of
benchmarks, I just don't think they are very realistic.

Anyway, I would suggest you improve your benchmark to more clearly state
what it actually tests.  I would also point out that, at the very
least, your benchmark actually shows that the FreeBSD approach sucks
quite badly, and Linux comes out the clear winner here.  You may want
to create a
benchmark that shows the reverse ;)

                Linus

------------------------------

From: Nix <$}xinix{[email protected]>
Crossposted-To: comp.lang.c++,gnu.misc.discuss
Subject: Re: Compiling C++ programs with GCC --> no GPL license implications
Date: 17 Dec 2000 21:24:47 +0000

On 15 Dec 2000, Peter Seebach stipulated:
> In article <[EMAIL PROTECTED]>, Mike Stump <[EMAIL PROTECTED]>
> wrote:
>>If what you say is true, please provide evidence of a single case that
>>has been litigated and won, where someone else besides the FSF sued.
>>I bet you cannot.
> 
> Has the FSF ever actually won a case, as opposed to getting a
> settlement, on this issue?

It has never needed to go further than nasty letters; the guilty
parties have always backed down to date.

(IIRC.)

-- 
Not speaking for Boskone at the moment.

------------------------------

From: Nix <$}xinix{[email protected]>
Subject: Re: imaginary complex value
Date: 17 Dec 2000 21:41:48 +0000

On Tue, 12 Dec 2000, Unk spake:
> While compiling glibc-2.2 I get an error; since I have little
> programming experience (other than a few programming projects at
> school) I cannot fix this myself.  I suspect it might have something to
> do with gcc; in the past I have come upon problems with "imaginary"
> numbers while bootstrap compiling both GCC and EGCS.  I'm totally
> confused and would like to know if there is a fix to this.  Below is
> the error output given to me while attempting to compile the math part
> of glibc:

What version of GCC are you using?

What is your architecture? (The output of `config.guess' from the glibc
or GCC source trees).

-- 
Not speaking for Boskone at the moment.

------------------------------

Date: 17 Dec 2000 16:49:47 -0500
From: "Gene Heskett" <[EMAIL PROTECTED]>
Subject: Re: [OT] SPAM blocking; Was: Re: Kernel supporting more than 2GB of Ram

Unrot13 this;
Reply to: <[EMAIL PROTECTED]>

Gene Heskett sends Greetings to Anders Larsen;

 AL> Gene Heskett wrote:
>> 
>> Unrot13 this;
>> Reply to: <[EMAIL PROTECTED]>

 AL> Why? (see below)

 AL> [snip]
>>   Gene Heskett, CET, UHK       |Amiga A2k Zeus040, Linux @ 400mhz
>>         email gene underscore heskett at iolinc dot net

 AL> Gene,

 AL> you *are* aware that your e-mail address is listed in the clear
 AL> in the From: header, aren't you?

Yup, darnit.  I did have it antispammed, but then my ISP did something
to my mailbox, and I can't post unless it's me in the header.  Sucks is
what it does, so I've resorted to a spam prefilter in my getmail
scripts.  That filter now has a 20kb config file of addresses and keywords
that cause a message containing them to be deleted from my mailbox
before the regular mail sucker grabs the rest.

Cheers, Gene
-- 
  Gene Heskett, CET, UHK       |Amiga A2k Zeus040, Linux @ 400mhz 
        email gene underscore heskett at iolinc dot net
#Amiga based X10 home automation program EZHome, see at:#
# <http://www.thirdwave.net/~jimlucia/amigahomeauto> #
ISP's please take note: My spam control policy is explicit!
#Any Class C address# involved in spamming me is added to my killfile
never to be seen again.  Message will be summarily deleted without dl.
This message's reply content, but not any previously quoted material, is
© 2000 by Gene Heskett, all rights reserved.
-- 


------------------------------

From: Andi Kleen <[EMAIL PROTECTED]>
Subject: Re: A faster memcpy and bzero for x86
Date: 18 Dec 2000 01:07:49 +0100

[EMAIL PROTECTED] (Linus Torvalds) writes:

> 
> This makes allocation costs much better, and makes them show up less
> obviously on performance profiles.
> 
> What the people who did this and wrote all the papers on it forgot about
> is that the initialization also brings the data structure into the
> cache, and that subsequent _uses_ of the data structure thus get sped up
> by initializing it.  So most of the time, the cost of the cache misses
> is just spread out, not actually elided.

This assumes that the complete structure is used.  Sometimes only a
part of a data structure is used in the actual fast path, but the whole
data structure must be initialized to handle the slow path correctly.

For example, one critical path I optimized for when doing the slab skb
work was the fast routing code.  It actually only touches a single
cache line in the skb header during interrupt forwarding, but
alloc_skb() was initialising the full header, forcing it into the
cache.  With the partial initialization and a kfree_skb_fast() fast
path, the normal skb allocation path got slightly faster than a very
hackish skb reuse scheme for fast routing, and it is fully generic.
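
A toy sketch of that partial-initialization idea, with made-up names
rather than the real sk_buff layout: keep the fast-path fields
together, clear only those at allocation time, and let the slow path
pay for the rest only when it is actually entered.

        #include <stdlib.h>
        #include <string.h>

        /* Toy packet buffer, not the real sk_buff. */
        struct pkt {
                /* hot part: what the fast forwarding path touches */
                void *data;
                unsigned int len;
                int hot_only;       /* cold part not yet initialized */

                /* cold part: only the slow path cares about this */
                char cold[192];
        };

        static struct pkt *pkt_alloc_fast(void)
        {
                struct pkt *p = malloc(sizeof *p);

                if (!p)
                        return NULL;
                /* touch only the hot fields; the cold part stays dirty */
                p->data = NULL;
                p->len = 0;
                p->hot_only = 1;
                return p;
        }

        static void pkt_slow_path(struct pkt *p)
        {
                if (p->hot_only) {
                        /* pay the initialization (and cache) cost now */
                        memset(p->cold, 0, sizeof p->cold);
                        p->hot_only = 0;
                }
                /* ... slow-path work ... */
        }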

This is probably also partly attributable to the better cache usage
from slab cache colouring.

[Admittedly, with 64-byte/128-byte cache lines like on the P4 or the K7
this approach makes less and less sense, but most boxes still use
32-byte cache lines.]

Also, slabs were invented for Solaris, which probably has many more
bloated data structures that are only partly used in the fast path than
Linux does.

Another path where this lazy initialization with slab is very useful is
the copying of the file table in fork.  For some time in 2.1 it used a
clever scheme that used fds_bits to touch only the file pointers that
were actually non-zero, and a slab cache to clear only the pointers
that had been non-zero for the previous user.  Most processes have only
a few fds open, or, in the case of a frequently forking server, tend to
always have the same number of fds open, so in most cases only a small
part of the files_struct pointer page needs to be touched, saving a lot
of cache traffic.  This unfortunately got broken with the big fd
patches :-/
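
A rough sketch of that kind of bitmap-guided copy, with simplified
types rather than the real files_struct layout: pointer slots whose bit
is clear are never read or written (in the scheme described above, a
slab cache guarantees those slots are already NULL).

        #define MAX_FDS        256
        #define BITS_PER_LONG  (8 * (int)sizeof(long))

        struct fdtable {
                unsigned long open_bits[MAX_FDS / BITS_PER_LONG];
                void *fd[MAX_FDS];        /* "struct file *" stand-ins */
        };

        static void fdtable_copy(struct fdtable *dst,
                                 const struct fdtable *src)
        {
                int i, j;

                for (i = 0; i < MAX_FDS / BITS_PER_LONG; i++) {
                        unsigned long bits = src->open_bits[i];

                        dst->open_bits[i] = bits;
                        if (!bits)
                                continue;   /* whole word of closed fds */
                        for (j = 0; j < BITS_PER_LONG; j++)
                                if (bits & (1UL << j))
                                        dst->fd[i * BITS_PER_LONG + j] =
                                            src->fd[i * BITS_PER_LONG + j];
                }
        }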


-Andi

------------------------------

From: Robert Krawitz <[EMAIL PROTECTED]>
Subject: Re: A faster memcpy and bzero for x86
Date: 17 Dec 2000 19:53:49 -0500

[EMAIL PROTECTED] (Linus Torvalds) writes:

> (1) is the best possible time for doing a memset() WITH HOT CACHES!
> 
> Which is not necessarily the same as "best memset()" in general at all. 
> The fastest way to clear memory can be very different for the hot-cache
> and the cold-cache case. 

That's true, although in the cold cache case there are some variants
depending upon whether the processor allocates on write.

>  - FreeBSD does pre-zeroing, which makes it look good on certain things.
>    In particular, this makes it look good at allocation-time.
> 
>    But what the numbers also show is that THIS IS BAD IN REAL LIFE!
> 
>    Pre-zeroing makes the kernel profile look good, which is probably
>    what BSD's do it. What it doesn't show is that it generates all the
>    wrong behaviour for the cache: it means that the cache has long since
>    cooled down by the time the page is actually _used_. Look at the sum
>    of all numbers (which is basically doing the same thing), and FreeBSD
>    clearly loses to Linux. 

Pre-zeroing certainly reduces the latency of page allocation, though:
if there's a sudden demand for pages, it can be satisfied in a hurry.
The intent (as I see it) is that less valuable cycles (when there's
nothing otherwise runnable) are spent clearing pages so that they're
available when they're needed.

The cache question is more complicated, though; it depends heavily on
the processor architecture, the cache size, and the future behavior of
the system (including whatever wants the pages).  Some processors can
write to memory without bringing data into the cache (they don't
allocate on write); some processors can also write without allocating
by using special instructions.  The old Pentium never allocates a cache
line on write, so clearing cold memory never involves the cache.  On
that architecture, clearing the page immediately before use is of no
benefit to the user of that page.  Pre-clearing it also won't evict
some other possibly hot cache line.  On my old P90, the FPU memcpy()
gave an overall system performance improvement, as measured by the Byte
benchmark.
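
As an illustration of such special instructions (a sketch only): on a
CPU with SSE-style non-temporal stores, a PIII or Athlon class part
rather than the old Pentium discussed above, something like the
following gcc inline asm clears a buffer without allocating cache lines
for it.  It assumes 32-bit x86, an 8-byte-aligned 4kB buffer, and that
any live FPU/MMX state has already been saved.

        /* clear 4096 bytes with non-temporal MMX stores (movntq), so
         * the writes bypass the cache instead of allocating lines */
        static void clear_nocache(void *buf)
        {
                __asm__ __volatile__(
                        "pxor    %%mm0, %%mm0\n\t"
                        "movl    $512, %%ecx\n"     /* 512 * 8 = 4096 */
                        "1:\n\t"
                        "movntq  %%mm0, (%0)\n\t"
                        "addl    $8, %0\n\t"
                        "decl    %%ecx\n\t"
                        "jnz     1b\n\t"
                        "sfence\n\t"
                        "emms"
                        : "+r" (buf)
                        :
                        : "ecx", "memory", "cc");
        }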

If writing to cold memory always causes allocation of a cache line,
the issue's somewhat different; clearing the page will make it hot
(good for whatever's going to use it next), AND it will also evict
some other cache line, which might be hot itself.  So pre-clearing a
page will hurt something else.

> You can obviously find cases where the pre-initialization actually does
> help: if you don't touch the pages after allocating them,
> pre-initialization is a clear win. It's easy to create these kinds of
> benchmarks, I just don't think they are very realistic.

What if the application is only going to touch a small number of
lines in that page?
-- 
Robert Krawitz <[EMAIL PROTECTED]>      http://www.tiac.net/users/rlk/

Tall Clubs International  --  http://www.tall.org/ or 1-888-IM-TALL-2
Member of the League for Programming Freedom -- mail [EMAIL PROTECTED]
Project lead for The Gimp Print --  http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton

------------------------------


** FOR YOUR REFERENCE **

The service address, to which questions about the list itself and requests
to be added to or deleted from it should be directed, is:

    Internet: [EMAIL PROTECTED]

You can send mail to the entire list by posting to the
comp.os.linux.development.system newsgroup.

Linux may be obtained via one of these FTP sites:
    ftp.funet.fi                                pub/Linux
    tsx-11.mit.edu                              pub/linux
    sunsite.unc.edu                             pub/Linux

End of Linux-Development-System Digest
******************************
