Re: Oops in 2.4.0-ac5

2001-01-10 Thread Ingo Molnar


On Wed, 10 Jan 2001, Alan Cox wrote:

  it.  I could never persuade Ingo to use wrmsr_eio() and check the
  return code, maybe this will change his mind.  Extract from kdb v1.7.

 I have a patch from Ingo to fix this one properly. Its just getting tested

i prefer clear oopses and bug reports instead of ignoring them. A failed
MSR write is not something to be taken lightly. If an MSR write fails, it
means that there is a serious kernel bug - we want to stop the kernel and
complain ASAP. And correct code will be much more readable that way.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[patch] highmem-2.4.0-A0 (was: Re: 2.4.0-ac6: mm/vmalloc.c compileerror)

2001-01-11 Thread Ingo Molnar


On Thu, 11 Jan 2001, Frank Davis wrote:

 Hello,
   The following error occurred while compiling 2.4.0-ac6. [...]

 vmalloc.c: In function `get_vm_area':
 vmalloc.c:188: `PKMAP_BASE' undeclared (first use in this function)

you are compiling with HIGHMEM enabled (which makes sense only if you have
more than ~900MB RAM), and i accidentally broke HIGHMEM with the vmalloc
fix in -pre1/-ac5. The attached patch fixes it.

Ingo


--- linux/include/linux/vmalloc.h.orig  Thu Jan 11 11:28:06 2001
+++ linux/include/linux/vmalloc.h   Thu Jan 11 11:28:33 2001
@@ -4,6 +4,7 @@
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/spinlock.h>
+#include <linux/highmem.h>
 
 #include <asm/pgtable.h>
 



Re: Oops in 2.4.0-ac5

2001-01-11 Thread Ingo Molnar


On Thu, 11 Jan 2001, David Woodhouse wrote:

 The bug here seems to be that we're using the same bit
 (X86_FEATURE_APIC) to report two _different_ features.

i think that the AMD APIC is truly 'compatible', but we are trying to
enable the APIC and program performance counters in an Intel-way. The MSRs
can be incompatible between steppings of the same CPU, so we should not
mark something 'incompatible' on that basis.

so the correct statement is: the UP-P6-specific way of enabling APICs does
not work on Athlons. It doesn't work on P5's either.

Ingo




Re: Updated zerocopy patch up on kernel.org

2001-01-11 Thread Ingo Molnar


On Tue, 9 Jan 2001, David S. Miller wrote:

 Nothing interesting or new, just merges up with the latest 2.4.1-pre1
 patch from Linus.

 ftp.kernel.org:/pub/linux/kernel/people/davem/zerocopy-2.4.1p1-1.diff.gz

 I haven't had any reports from anyone, which must mean that it is
 working perfectly fine and adds no new bugs, testers are thus in
 nirvana and thus have nothing to report.  :-)

(works like a charm here.)

Ingo




[patch] Lowlatency Patch for 2.4.0-ac6 and 2.4.1-pre2

2001-01-11 Thread Ingo Molnar


a new version of my multimedia-lowlatency patchset, against recent 2.4
kernels, is now available. These patches are the 2.4-adapted versions of
my 2.2 lowlatency patch, a project which is now more than 1.5 years old.

the lowlatency patch against 2.4.0-ac6 can be found at:

   
http://www.kernel.org/pub/linux/kernel/people/mingo/lowlatency-patches/lowlatency-2.4.0-ac6-A2

the lowlatency patch against 2.4.1-pre2 can be found at:

   
http://www.kernel.org/pub/linux/kernel/people/mingo/lowlatency-patches/lowlatency-2.4.1-A2

this patch still follows the 'take no prisoners' approach, is optimized on
x86 but should work on other platforms as well. The patch uses assembly
speedups and offline assembly sections to minimize the impact of
conditional schedule points as much as possible. This is the reason why
this patch does not offer a configuration option. The patch changes
lowlevel x86 assembly routines too, to make them perform with lower
latency.

on a 500 MHz 1-CPU box, typical latencies during 'everyday work' with this
patch applied are 0.1 msec or less; under high load the maximum latency
i've measured was 0.3 msec. The patch fixes latencies generated by
intense X sessions, high block IO and networking load, lots of
user-space processes, and other more unusual latency sources.
I tested every latency source i could think of; the patch tries to be a
'complete solution' and tries to squash all latency sources larger than
0.5 msecs on a typical system.

bugreports, comments, suggestions and contributions welcome!

Ingo




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Ingo Molnar


On Tue, 9 Jan 2001, Stephen Frost wrote:

   Now, the interesting bit here is that the processes can grow to be
 pretty large (200M+, up as high as 500M, higher if we let it ;) ) and what
 happens with MOSIX is that entire processes get sent over the wire to
 other machines for work.  MOSIX will also attempt to rebalance the load on
 all of the machines in the cluster and whatnot so it can often be moving
 processes back and forth.

then you'll love the zerocopy patch :-) Just use sendfile() or specify
MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
DMA-and-checksumming on cards that support it.

the discussion with Stephen is about various device-to-device schemes.
(which i don't think Mosix wants to use. Mosix wants to use
memory-to-device zero-copy, right?)
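
A minimal user-space sketch of the sendfile() path suggested above. The
helper name send_whole_file() is hypothetical, and the loop only assumes
the documented sendfile(2) behaviour on Linux (it may transfer fewer bytes
than requested). Whether the NIC then does DMA and checksumming depends on
the driver and the zerocopy patch, which this sketch cannot show.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/sendfile.h>
#include <unistd.h>

/*
 * Send up to `count` bytes of in_fd, starting at file offset 0, to out_fd.
 * Returns the number of bytes actually sent, or -1 on error.
 */
static ssize_t send_whole_file(int out_fd, int in_fd, size_t count)
{
    off_t offset = 0;       /* explicit offset: in_fd's own position is untouched */
    size_t left = count;

    assert(out_fd >= 0 && in_fd >= 0);
    while (left > 0) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal, retry */
            perror("sendfile");
            return -1;
        }
        if (n == 0)         /* end of file reached before `count` bytes */
            break;
        left -= (size_t)n;
    }
    return (ssize_t)(count - left);
}
```

A caller would typically fstat() the input file and pass st_size as count,
with out_fd being the connected socket.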

Ingo




lies, lies and web benchmarks :-)

2001-01-12 Thread Ingo Molnar


On Thu, 11 Jan 2001, Christoph Lameter wrote:

 I got into a bragging game whose webserver is the fastest with Jim
 Nelson one of the authors of the boa webserver. We finally settled on
 the Zeus test to decide the battle.

may i add my (hopefully comparable) TUX 2.0 numbers to this bragging game
:-)

TUX had logging turned on, a 1-CPU 500 MHz PIII system (with enough RAM)
was used for the test. UP kernel, nohighmem. The code used is 2.4.0-ac4 +
DaveM's zerocopy patch + Jens's blk patch + TUX 2.0 patch.

(to make these results somewhat comparable, what system did you use?)

 First boa won hands down because it supports persistant connections. Boa
 on port 6000. Khttpd on port 80:

 clameter@melchi:~$ ./zb localhost /index.html -k -c 215 -n 2 -p 6000

 Server: Boa/0.94.8.3
 Doucment Length:1666
 Requests per seconds:   590.58

 Server: kHTTPd/0.1.6
 Doucment Length:1666
 Requests per seconds:   196.59

well, TUX supports persistent (keepalive, ie. lightweight) HTTP
connections as well:

over localhost (like the above test) it gives:

 m:~ ./zb localhost /index.html -k -c 215 -n 2

 Server: TUX
 Document Length:1666
 Complete requests:  2
 Failed requests:0
 Requests per seconds:   12658.23

 Connnection Times (ms)
min   avg   max
 Connect: 0 011
 Total:   51631

Over 100mbit Ethernet (using eepro100) TUX does:

 h:~ ./zb m /index.html -k -c 215 -n 2

 Server: TUX
 Document Length:1666
 Complete requests:  2
 Failed requests:0
 Requests per seconds:   6033.18
 Transfer rate:  11002.97 kb/s

 Connnection Times (ms)
min   avg   max
 Connect: 0 012
 Total:   332  3250

As visible from the 'Transfer rate', the 100 mbit link is fully saturated.
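
As a rough check of the saturation claim, the reported transfer rate can
be converted to Mbit/s. This is a sketch under one assumption: that zb's
"kb/s" means kilobytes per second (whether 1000 or 1024 bytes is not
stated, so both are tried). The function name is made up for illustration.

```c
#include <assert.h>

/*
 * Convert a rate in "kb/s" to Mbit/s.
 * bytes_per_kb is 1024.0 or 1000.0, depending on what the tool means by "kb".
 */
static double link_mbit_per_sec(double kb_per_sec, double bytes_per_kb)
{
    assert(kb_per_sec >= 0.0 && bytes_per_kb > 0.0);
    return kb_per_sec * bytes_per_kb * 8.0 / 1e6;
}
```

With 11002.97 kb/s this gives about 88 Mbit/s (1000-byte kb) or about
90 Mbit/s (1024-byte kb), counting only what zb measured. TCP/IP and
Ethernet framing overhead come on top of that, which is consistent with a
100 mbit link being close to full.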

The eepro100 was not running in zerocopy mode, so all data was copied
once. As a comparison, over 1000 mbit Ethernet with a native zero-copy
driver (SysKonnect), TUX does:

 Server: TUX
 Document Length:1666
 Complete requests:  2
 Failed requests:0
 Keep-Alive requests:20094
 Requests per seconds:   12812.30

 Connnection Times (ms)
min   avg   max
 Connect: 0 012
 Total:  101629

(but the server is at 70% CPU utilization in the gigabit test - the
dual-PIII/500 client is 100% CPU utilized and thus not fast enough to
saturate the server. With two clients it does about 15000 reqs/sec.)

 Then we decided to switch persistant connection off... But boa still wins.

 clameter@melchi:~$ ./zb localhost /index.html -c 215 -n 2 -p 6000

 Server: Boa/0.94.8.3
 Doucment Length:1666
 Requests per seconds:   227.17

with normal, non-keepalive HTTP requests, TUX 2.0 over localhost does:

 m:~ ./zb localhost /index.html -c 215 -n 2

 Server: TUX
 Document Length:1666
 Complete requests:  2
 Failed requests:0
 Requests per seconds:   5154.64

 Connnection Times (ms)
min   avg   max
 Connect:111923
 Total:  344045

over 100mbit ethernet (eepro100) it does:

 h:~ ./zb m /index.html -c 215 -n 2

 Server: TUX
 Document Length:1666
 Complete requests:  2
 Failed requests:0
 Requests per seconds:   4435.57

 Connnection Times (ms)
min   avg   max
 Connect: 115  3020
 Total:  1847  3068

over gigabit SysKonnect zero-copy it does:

 h:~ ./zb mg /index.html -c 215 -n 2

 Server: TUX
 Document Length:1666
 Requests per seconds:   5327.65

 Connnection Times (ms)
min   avg   max
 Connect: 01116
 Total:  213981

(but the nonpersistent test puts even more load on the client, the server
is only about 60% utilized - with two clients it does about 8000
reqs/sec.)

at this point i couldnt resist - i assembled a few TUX 2.0 CGI execution
benchmarks. The CGI used for this test is a real, standard Linux ELF CGI
executable which is exec()-ed for every HTTP request: it read()s the same
/index.html file the other tests were using, write()s a HTML header to
stdout, then write()s the /index.html file to stdout and finally write()s
a HTML trailer to stdout and exit()s. [A separate process is created for
every single HTTP request]. Over localhost, TUX 2.0 CGI does:

 m:~ ./zb localhost x?/index.html -c 215 -n 2

 ---
 Server: TUX
 Document Length:1876
 Complete requests:  2
 Failed requests:0
 Requests per seconds:   1227.14

 Connnection Times (ms)
   min   avg   max
 Connect: 11232
 Total:  41   172   346
 ---

over 100 mbit Ethernet (eepro100) TUX 2.0 CGI does:

 Requests per seconds:   1336.18

(the 1000 mbit number is the same as the 100 mbit one, because the server
is saturated executing CGIs already, network 

Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware

2001-01-12 Thread Ingo Molnar


On Fri, 12 Jan 2001, Manfred Spraul wrote:

 The PPro local apic documentation says:
 
 The processor's local APIC includes an in-service entry and a holding
 entry for each priority level. To avoid losing interrupts, software
 should allocate no more than 2 interrupt vectors per priority.
 

 Ok, we must reorder the vector numbers for our own interrupts
 (0xfb-0xff), but that doesn't explain our problems: we don't loose
 reschedule interrupts, we have problems with normal interrupts - and
 there we only use 2 irq at the same priority level.

we *already* reorder vector numbers and spread them out as much as
possible. We do this in 2.2 as well. We did this almost from day 1 of
IO-APIC support. If any manually allocated IRQ vector creates a '3 vectors
in the same 16-vector region' situation then that's a bug in hw_irq.h.
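
The '3 vectors in the same 16-vector region' rule can be sketched as a
small stand-alone check. This is illustrative user-space code, not what
hw_irq.h does; the names and the helper are assumptions. It only relies
on the fact that a local APIC priority class is the vector number divided
by 16.

```c
#include <assert.h>
#include <stddef.h>

#define VECTORS_PER_CLASS 16   /* priority class = vector / 16 */
#define NR_CLASSES        16   /* vectors 0x00..0xff */
#define MAX_PER_CLASS     2    /* one in-service entry + one holding entry */

/*
 * Return the lowest priority class that holds more than MAX_PER_CLASS
 * of the given vectors, or -1 if the allocation respects the limit.
 */
static int first_overcommitted_class(const unsigned char *vectors, size_t n)
{
    int count[NR_CLASSES] = { 0 };
    size_t i;
    int c;

    assert(vectors != NULL || n == 0);
    for (i = 0; i < n; i++)
        count[vectors[i] / VECTORS_PER_CLASS]++;
    for (c = 0; c < NR_CLASSES; c++)
        if (count[c] > MAX_PER_CLASS)
            return c;
    return -1;
}
```

For example, vectors 0x31, 0x39, 0x41, 0x49 pass the check (two per
class), while five vectors in 0xfb-0xff all land in class 15 and are
flagged.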

the 'loss of interrupts' above does not include external interrupts, only
local interrupts (such as the APIC timer interrupt) can get lost in such a
situation.

(nevertheless there is something going on.)

Ingo




Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware

2001-01-12 Thread Ingo Molnar


On Fri, 12 Jan 2001, Manfred Spraul wrote:

 2.4 spreads the vectors for the external (hardware, from io apic)
 interrupts, but 5 ipi vectors have the same priority: reschedule, call
 function, tlb invalidate, apic error, spurious interrupt.

my reading of the errata is that the lost APIC timer IRQ happens only if
the APIC timer IRQ vector's priority level has more than 2 active vectors.
It's a very limited case, which does not happen in recent CPUs anyway
(such as the PIII).

 But that doesn't explain what happens with ne2k cards: neither 2.2 nor
 2.4 have more than 2 interrupts in class for the hardware interrupt
 16/19.

yep.

Ingo




Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardwarerelated?

2001-01-12 Thread Ingo Molnar


On Fri, 12 Jan 2001, Linus Torvalds wrote:

 [...] Ingo, what was the focus-cpu thing?

well, some time ago i had an ne2k card in an SMP system as well, and found
this very problem. Disabling/enabling focus-cpu appeared to make a
difference, but later on i made experiments that show that in both cases
the hang happens. I spent a good deal of time trying to fix this problem,
but failed - so any fresh ideas are more than welcome.

Ingo




Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardwarerelated?

2001-01-12 Thread Ingo Molnar


On Fri, 12 Jan 2001, Frank de Lange wrote:

 In addition, I patched apic.c (focus cpu enabled)
 In addition, I patched io_apic ((TARGET_CPUS 0xff)

please try it with the focus CPU enabling change (we want to enable that
feature, i only disabled it due to the stuck-ne2k bug), but with
TARGET_CPUS set to cpu_online_mask. (the latter is needed for certain
crappy BIOSes.)

i believe the ne2k driver change is the key.

  I have a first idea: we send an EOI to an interrupt that is masked on
  the IO apic, perhaps that causes the problems.

 Sound plausible...

does not help. I've tried it (and many other combinations). I did not find
any direct workaround for this problem. (i tried very hard.)

Ingo




Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardwarerelated?

2001-01-12 Thread Ingo Molnar


On Fri, 12 Jan 2001, Frank de Lange wrote:

 WITH or WITHOUT the changed 8390 driver? I can already give you the
 results for running WITH the changed driver: it works. I have not yet
 tried it WITHOUT the changed 8390 driver (so that would be stock 8390,
 patched apic.c, stock io_apic.c). Please let me know which you want...

WITH. patched 8390.c, patched apic.c, stock io_apic.c. My very strong
feeling is that this will be a stable combination, and that this is what
we want as a final solution.

Ingo




Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardwarerelated?

2001-01-12 Thread Ingo Molnar


On Fri, 12 Jan 2001, Frank de Lange wrote:

 BTW, does this (TARGET_CPUS cpu_online_mask) not wreak havoc with
 systems with hot-pluggable CPUs (many Suns, etc...)? Wouldn;t it be
 better to make this a config option (like the optional PCI fixes for
 crappy BIOSs)?

? this is x86-only code. There is no hot-pluggable CPU support for Linux
AFAIK. (But in any case, the code is basically ready for hot-pluggable
CPUs, just take a few precautions and change cpu_online_mask and a couple
of other things.)

Ingo




Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardwarerelated?

2001-01-12 Thread Ingo Molnar


On Fri, 12 Jan 2001, Frank de Lange wrote:

 It is. As I already mentioned in other messages, I already tested with
 JUST the patched 8390.c driver, no other patches. It was stable. I
 then patched apic.c AND io_apic.c, which did not introduce new
 instabilities. Unless you think that reverting back to a stock
 io_apic.c would cause instabilities (which would be weird, since I had
 no instabilities running only a patched 8390.c), I think the patch to
 8390.c DOES remove the symptoms all by itself. No other patches seem
 necessary to get a stable box.

okay - i just wanted to hear a definitive word from you that this fixes
your problem, because this is what we'll have to do as a final solution.
(barring any other solution.)

Ingo




Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardwarerelated?

2001-01-12 Thread Ingo Molnar


On Fri, 12 Jan 2001, Frank de Lange wrote:

 PATCHED 8390.c (using irq_safe spinlocks instead of disable_irq)
 PATCHED apic.c (focus cpu ENABLED)
 STOCK io_apic.c

 No problems under heavy network load.

 Gentleman, this (the patch to 8390.c) seems to fix the problem.

great. Back when i had the same problem, flood pinging another host (on
the local network) was the quickest way to reproduce the hang:

ping -f -s 10 otherhost

this produced an IOAPIC-hang within seconds.

Ingo




Re: Updated zerocopy patch up on kernel.org

2001-01-11 Thread Ingo Molnar


On Tue, 9 Jan 2001, David S. Miller wrote:

Is there any value to supporting fragments in a driver which
doesn't do hardware checksumming?  IIRC Alexey had a patch to do
such for Tulip, but I don't see it in the above patchset.

 I'm actually considering making the SG w/o hwcsum situation illegal.

i believe it might still make some limited sense for normal sendmsg()
and higher MTUs (or 8k NFS) - we could copy & checksum stuff into the
->tcp_page if SG is possible and thus the SG capability improves the VM.
(because we can allocate at PAGE_SIZE granularity.)

Ingo




Re: Ingo's RAID patch for 2.2.18 final?

2001-01-14 Thread Ingo Molnar


On 13 Jan 2001 [EMAIL PROTECTED] wrote:

 What is at http://www.kernel.org/pub/linux/kernel/people/mingo/
 look official enough to me...

 raid-2.2.18-B0  12-Jan-2001 10:18   392k

yep, it is the 'official' 2.2.18 RAID patch.

Ingo




Re: Is sendfile all that sexy?

2001-01-14 Thread Ingo Molnar


On Sun, 14 Jan 2001, jamal wrote:

 regular ttcp, no ZC and no sendfile. [...]
 Throughput: ~99MB/sec (for those obsessed with Mbps ~810Mbps)
 CPU abuse: server side 87% client side 22% [...]

 sendfile server.
 - throughput: 86MB/sec
 - CPU: server 100%, client 17%

i believe what you are seeing here is the overhead of the pagecache. When
using sendmsg() only, you do not read() the file every time, right? Is
ttcp using multiple threads? In that case, if sendfile() is using the
*same* file all the time, that creates SMP locking overhead.

if this is the case, what result do you get if you use a separate,
isolated file per process? (And i bet that with DaveM's pagecache
scalability patch the situation would also get much better - the global
pagecache_lock hurts.)

Ingo




Re: Is sendfile all that sexy?

2001-01-14 Thread Ingo Molnar


On 14 Jan 2001, Linus Torvalds wrote:

 Does anybody but apache actually use it?

There is a Samba patch as well that makes it sendfile() based. Various
other projects use it too (phttpd for example), some FTP servers i
believe, and khttpd and TUX.

Ingo




Re: Is sendfile all that sexy?

2001-01-14 Thread Ingo Molnar


On Sun, 14 Jan 2001, Linus Torvalds wrote:

  There is a Samba patch as well that makes it sendfile() based. Various
  other projects use it too (phttpd for example), some FTP servers i
  believe, and khttpd and TUX.

 At least khttpd uses "do_generic_file_read()", not sendfile per se. I
 assume TUX does too. Sendfile itself is mainly only useful from user
 space..

yes, you are right. TUX does it mainly to avoid some of the user-space
interfacing overhead present in sys_sendfile(), and to be able to control
packet boundaries. (ie. to have or not have the MSG_MORE flag). So TUX is
using its own sock_send_actor and own read_descriptor.
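
For comparison, the user-space analogue of controlling packet boundaries
with MSG_MORE looks roughly like the sketch below. It is not TUX code;
send_reply() is a made-up helper, and it assumes a Linux stream socket
where send(2) accepts MSG_MORE. For brevity it does not retry partial
sends.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Send a header followed by a body on a stream socket.
 * The header is sent with MSG_MORE, telling the stack that more data
 * follows, so it may hold the header back and coalesce it with the body.
 * The body is sent without the flag, which lets the stack flush.
 * Returns total bytes handed to the kernel, or -1 on error.
 */
static ssize_t send_reply(int sock, const char *hdr, const char *body)
{
    ssize_t a, b;

    assert(hdr != NULL && body != NULL);
    a = send(sock, hdr, strlen(hdr), MSG_MORE);
    if (a < 0)
        return -1;
    b = send(sock, body, strlen(body), 0);
    if (b < 0)
        return -1;
    return a + b;
}
```

On TCP this typically lets the header and the start of the body share one
segment instead of going out as a small packet followed by a large one.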

Ingo




[patch] fixes for RAID1/RAID5 hot-add/hot-remove, 2.4.0-ac9

2001-01-15 Thread Ingo Molnar


- the attached patch (against -ac9) fixes a bug in the RAID1 and RAID4/5
  code that made raidhotremove fail under certain (rare) circumstances.
  Please apply.

Ingo


--- linux/drivers/md/raid1.c.orig   Mon Dec 11 22:19:35 2000
+++ linux/drivers/md/raid1.cMon Jan 15 14:45:35 2001
@@ -832,6 +832,7 @@
    struct mirror_info *tmp, *sdisk, *fdisk, *rdisk, *adisk;
    mdp_super_t *sb = mddev->sb;
    mdp_disk_t *failed_desc, *spare_desc, *added_desc;
+   mdk_rdev_t *spare_rdev, *failed_rdev;
 
    print_raid1_conf(conf);
    md_spin_lock_irq(&conf->device_lock);
@@ -989,6 +990,10 @@
    /*
     * do the switch finally
     */
+   spare_rdev = find_rdev_nr(mddev, spare_desc->number);
+   failed_rdev = find_rdev_nr(mddev, failed_desc->number);
+   xchg_values(spare_rdev->desc_nr, failed_rdev->desc_nr);
+
    xchg_values(*spare_desc, *failed_desc);
    xchg_values(*fdisk, *sdisk);
 
--- linux/drivers/md/raid5.c.orig   Mon Jan 15 14:45:50 2001
+++ linux/drivers/md/raid5.cMon Jan 15 14:46:01 2001
@@ -1707,6 +1707,7 @@
    struct disk_info *tmp, *sdisk, *fdisk, *rdisk, *adisk;
    mdp_super_t *sb = mddev->sb;
    mdp_disk_t *failed_desc, *spare_desc, *added_desc;
+   mdk_rdev_t *spare_rdev, *failed_rdev;
 
    print_raid5_conf(conf);
    md_spin_lock_irq(&conf->device_lock);
@@ -1878,6 +1879,10 @@
    /*
     * do the switch finally
     */
+   spare_rdev = find_rdev_nr(mddev, spare_desc->number);
+   failed_rdev = find_rdev_nr(mddev, failed_desc->number);
+   xchg_values(spare_rdev->desc_nr, failed_rdev->desc_nr);
+
    xchg_values(*spare_desc, *failed_desc);
    xchg_values(*fdisk, *sdisk);
 



Re: Is sendfile all that sexy?

2001-01-15 Thread Ingo Molnar


On Mon, 15 Jan 2001, Jonathan Thackray wrote:

 It's a very useful system call and makes file serving much more
 scalable, and I'm glad that most Un*xes now have support for it
 (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
 Linux is sendpath(), which does the open() before the sendfile() all
 combined into one system call.

i believe the right model for a user-space webserver is to cache open file
descriptors, and directly hash URLs to open files. This way you can do
pure sendfile() without any open(). Not that open() is too expensive in
Linux:

 m:~/lm/lmbench-2alpha9/bin/i686-linux ./lat_syscall open
 Simple open/close: 7.5756 microseconds

 m:~/lm/lmbench-2alpha9/bin/i686-linux ./lat_syscall stat
 Simple stat: 5.4864 microseconds

 m:~/lm/lmbench-2alpha9/bin/i686-linux ./lat_syscall write
 Simple write: 0.9614 microseconds

 m:~/lm/lmbench-2alpha9/bin/i686-linux ./lat_syscall read
 Simple read: 1.1420 microseconds

 m:~/lm/lmbench-2alpha9/bin/i686-linux ./lat_syscall null
 Simple syscall: 0.6349 microseconds

(note that lmbench opens a nontrivial path, it can be cheaper than this.)

nevertheless saving the lookup can be a win.
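
The 'cache open file descriptors and hash URLs to them' model could look
something like the following. This is only a sketch: the direct-mapped
table, its sizes, the hash function and the name cached_open() are all
assumptions made for illustration, and a real server would also need
invalidation when files change.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define FD_CACHE_SLOTS    64
#define FD_CACHE_PATH_MAX 256

struct fd_cache_entry {
    char path[FD_CACHE_PATH_MAX];
    int  fd;                      /* -1 when the slot is empty */
};

static struct fd_cache_entry fd_cache[FD_CACHE_SLOTS];
static int fd_cache_ready;

/* Simple string hash (djb2 variant); any reasonable hash would do. */
static unsigned hash_path(const char *s)
{
    unsigned h = 5381;

    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/*
 * Return a read-only fd for `path`, calling open() only on a cache miss.
 * A colliding entry is evicted and closed. Returns -1 on failure.
 */
static int cached_open(const char *path)
{
    struct fd_cache_entry *e;
    unsigned i;

    assert(path != NULL);
    if (strlen(path) >= FD_CACHE_PATH_MAX)
        return -1;                /* too long for this sketch's table */
    if (!fd_cache_ready) {
        for (i = 0; i < FD_CACHE_SLOTS; i++)
            fd_cache[i].fd = -1;
        fd_cache_ready = 1;
    }
    e = &fd_cache[hash_path(path) % FD_CACHE_SLOTS];
    if (e->fd >= 0 && strcmp(e->path, path) == 0)
        return e->fd;             /* hit: no open() needed */
    if (e->fd >= 0)
        close(e->fd);             /* evict the colliding entry */
    e->fd = open(path, O_RDONLY);
    if (e->fd >= 0)
        strcpy(e->path, path);
    return e->fd;
}
```

Because the cached descriptor is shared between requests, sendfile()
should be called with an explicit offset pointer so that the file
position of the descriptor never matters.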

[ TUX uses dentries directly so there is no file opening cost - it's
pretty equivalent to sendpath(), with the difference that TUX can do
security evaluation on the (held) file prior to sending it - while
sendpath() is pretty much a shot into the dark. ]

Ingo




[patch] sendpath() support, 2.4.0-test3/-ac9

2001-01-15 Thread Ingo Molnar


On 15 Jan 2001, Linus Torvalds wrote:

   int fd = open(..)
   fstat(fd..);
   sendfile(fd..);
   close(fd);

 is any slower than

   .. cache stat() in user space based on name ..
   sendpath(name, ..);

 on any real load.

just for kicks i've implemented sendpath() support. (patch against
2.4.0-test and sample code attached) It appears to work just fine here.
With a bit of reorganization in mm/filemap.c it was quite straightforward
to do.

Jonathan, is this what Zeus needs? If yes, it could be interesting to run
a simple benchmark to compare sendpath() to open()+sendfile()?

Ingo


--- linux/mm/filemap.c.orig Mon Jan 15 22:43:21 2001
+++ linux/mm/filemap.c  Mon Jan 15 23:09:55 2001
@@ -39,6 +39,8 @@
  * page-cache, 21.05.1999, Ingo Molnar [EMAIL PROTECTED]
  *
  * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli [EMAIL PROTECTED]
+ *
+ * Started sendpath() support, (C) 2000 Ingo Molnar [EMAIL PROTECTED]
  */
 
 atomic_t page_cache_size = ATOMIC_INIT(0);
@@ -1450,15 +1452,15 @@
return written;
 }
 
-asmlinkage ssize_t sys_sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
+/*
+ * Get input file, and verify that it is ok..
+ */
+static struct file * get_verify_in_file (int in_fd, size_t count)
 {
-   ssize_t retval;
-   struct file * in_file, * out_file;
-   struct inode * in_inode, * out_inode;
+   struct inode * in_inode;
+   struct file * in_file;
+   int retval;
 
-   /*
-* Get input file, and verify that it is ok..
-*/
retval = -EBADF;
in_file = fget(in_fd);
if (!in_file)
@@ -1474,10 +1476,21 @@
    retval = locks_verify_area(FLOCK_VERIFY_READ, in_inode, in_file, in_file->f_pos, count);
if (retval)
goto fput_in;
+   return in_file;
+fput_in:
+   fput(in_file);
+out:
+   return ERR_PTR(retval);
+}
+/*
+ * Get output file, and verify that it is ok..
+ */
+static struct file * get_verify_out_file (int out_fd, size_t count)
+{
+   struct file *out_file;
+   struct inode *out_inode;
+   int retval;
 
-   /*
-* Get output file, and verify that it is ok..
-*/
retval = -EBADF;
out_file = fget(out_fd);
if (!out_file)
@@ -1491,6 +1504,29 @@
    retval = locks_verify_area(FLOCK_VERIFY_WRITE, out_inode, out_file, out_file->f_pos, count);
if (retval)
goto fput_out;
+   return out_file;
+
+fput_out:
+   fput(out_file);
+fput_in:
+   return ERR_PTR(retval);
+}
+
+asmlinkage ssize_t sys_sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
+{
+   ssize_t retval;
+   struct file * in_file, *out_file;
+
+   in_file = get_verify_in_file(in_fd, count);
+   if (IS_ERR(in_file)) {
+   retval = PTR_ERR(in_file);
+   goto out;
+   }
+   out_file = get_verify_out_file(out_fd, count);
+   if (IS_ERR(out_file)) {
+   retval = PTR_ERR(out_file);
+   goto fput_in;
+   }
 
retval = 0;
if (count) {
@@ -1524,6 +1560,56 @@
fput(in_file);
 out:
return retval;
+}
+
+asmlinkage ssize_t sys_sendpath(int out_fd, char *path, off_t *offset, size_t count)
+{
+   struct file in_file, *out_file;
+   read_descriptor_t desc;
+   loff_t pos = 0, *ppos;
+   struct nameidata nd;
+   int ret;
+
+   out_file = get_verify_out_file(out_fd, count);
+   if (IS_ERR(out_file)) {
+   ret = PTR_ERR(out_file);
+   goto err;
+   }
+   ret = user_path_walk(path, &nd);
+   if (ret)
+   goto put_out;
+   ret = -EINVAL;
+   if (!nd.dentry || !nd.dentry->d_inode)
+   goto put_in_out;
+
+   memset(&in_file, 0, sizeof(in_file));
+   in_file.f_dentry = nd.dentry;
+   in_file.f_op = nd.dentry->d_inode->i_fop;
+
+   ppos = &in_file.f_pos;
+   if (offset) {
+   if (get_user(pos, offset))
+   goto put_in_out;
+   ppos = &pos;
+   }
+   desc.written = 0;
+   desc.count = count;
+   desc.buf = (char *) out_file;
+   desc.error = 0;
+   do_generic_file_read(&in_file, ppos, &desc, file_send_actor, 0);
+
+   ret = desc.written;
+   if (!ret)
+   ret = desc.error;
+   if (offset)
+   put_user(pos, offset);
+
+put_in_out:
+   fput(out_file);
+put_out:
+   path_release(&nd);
+err:
+   return ret;
 }
 
 /*
--- linux/arch/i386/kernel/entry.S.orig Mon Jan 15 22:42:47 2001
+++ linux/arch/i386/kernel/entry.S  Mon Jan 15 22:43:12 2001
@@ -646,6 +646,7 @@
.long SYMBOL_NAME(sys_getdents64)   /* 220 */
.long SYMBOL_NAME(sys_fcntl64)
.long SYMBOL_NAME(sys_ni_syscall)   /* reserved for TUX */
+   .long

Re: 4G SGI quad Xeon - memory-related slowdowns

2001-01-15 Thread Ingo Molnar


On 15 Jan 2001, Linus Torvalds wrote:

 The performance problem is _probably_ due to the kernel having to
 double-buffer the IO requests, coupled with bad MTRR settings (ie
 memory above the 4GB range is probably marked as non-cacheable or
 something, which means that you'll get really bad performance).

the highmem-related double-buffering alone on such a category of system is
minuscule compared to other costs of IO, and considering the expected
bandwidth (20-30 MB/sec).

the MTRR part could be a problem.

 Not using the high memory will avoid the double-buffering, and will
 also avoid using memory that isn't cached. If I'm right.

 The hang still indicates that something is wrong in PAE-land, though.

it's working just fine on all 4GB+ systems tested (including 32GB
systems), Intel, Dell, Compaq boxes. So if it's a unique PAE bug, then it
must be some boundary condition.

Paul, here is the memory map of my 8GB system:

BIOS-provided physical RAM map:
 BIOS-e820: 0009d400 @  (usable)
 BIOS-e820: 2c00 @ 0009d400 (reserved)
 BIOS-e820: 0002 @ 000e (reserved)
 BIOS-e820: 03ef8000 @ 0010 (usable)
 BIOS-e820: 7c00 @ 03ff8000 (ACPI data)
 BIOS-e820: 0400 @ 03fffc00 (ACPI NVS)
 BIOS-e820: ec00 @ 0400 (usable)
 BIOS-e820: 0140 @ fec0 (reserved)
 BIOS-e820: f000 @ 0001 (usable)

and here are the MTRR settings:

[root@m mingo]# cat /proc/mtrr
reg00: base=0xf000 (3840MB), size= 256MB: uncachable, count=1
reg01: base=0x (   0MB), size=4096MB: write-back, count=1
reg02: base=0x1 (4096MB), size=2048MB: write-back, count=1
reg03: base=0x18000 (6144MB), size=1024MB: write-back, count=1
reg04: base=0x1c000 (7168MB), size= 512MB: write-back, count=1
reg05: base=0x1e000 (7680MB), size= 256MB: write-back, count=1

i'd suggest using the mem=exactmap feature to force different types of
memory maps. Eg. i'm using the following append= line to force an 800 MB
setup:

append="mem=exactmap mem=0x0009d800@0x mem=0x03ef8000@0x0010 
mem=0x2bffe000@0x0400"

such mem=exactmap lines can be constructed based on the BIOS output.
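
Constructing such a line from a list of usable ranges is mechanical. The
sketch below shows the idea; struct ram_range and build_exactmap() are
hypothetical names, and the ranges in any usage are placeholders rather
than values from the map above. Each usable range is emitted as
mem=<size>@<base>, following the format of the append= line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct ram_range {
    unsigned long long base;   /* start address of the range */
    unsigned long long size;   /* length of the range in bytes */
    int usable;                /* nonzero for "(usable)" entries */
};

/*
 * Write "mem=exactmap mem=<size>@<base> ..." for all usable ranges.
 * Returns the string length, or -1 if `out` is too small.
 */
static int build_exactmap(char *out, size_t outlen,
                          const struct ram_range *r, size_t n)
{
    size_t used, i;
    int w;

    assert(out != NULL && outlen > 0);
    w = snprintf(out, outlen, "mem=exactmap");
    if (w < 0 || (size_t)w >= outlen)
        return -1;
    used = (size_t)w;
    for (i = 0; i < n; i++) {
        if (!r[i].usable)
            continue;          /* reserved/ACPI ranges are left out */
        w = snprintf(out + used, outlen - used, " mem=0x%08llx@0x%08llx",
                     r[i].size, r[i].base);
        if (w < 0 || (size_t)w >= outlen - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

To force a smaller setup, the last usable range is simply truncated
before the string is built.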

Ingo




Re: [patch] sendpath() support, 2.4.0-test3/-ac9

2001-01-16 Thread Ingo Molnar


On Mon, 15 Jan 2001, dean gaudet wrote:

  just for kicks i've implemented sendpath() support.
 
  _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off, size_t, size)

 hey so how do you implement transmit timeouts with sendpath() ?
 (i.e. drop the client after 30 seconds of no progress.)

well this problem is not unique to sendpath(), sendfile() has it as well.

in TUX i've added per-socket connection timers, and i believe something
like this should be done in Apache as well - timers are IMO not a good
enough excuse for avoiding event-based IO models and using select() or
poll().
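
In a poll()-based user-space server, a per-connection progress timeout
can be kept with one timestamp per connection and a periodic sweep. The
following is a sketch of that idea only, not how TUX or Apache implement
it; struct conn, sweep_timeouts() and the 30 second limit (taken from the
question) are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define IDLE_LIMIT_SEC 30

struct conn {
    int    fd;
    time_t last_progress;   /* updated whenever bytes were sent */
    int    dead;            /* set once the connection has timed out */
};

/*
 * Mark connections that made no progress for IDLE_LIMIT_SEC as dead.
 * Returns the poll() timeout in milliseconds until the next live
 * connection could expire, or -1 (block indefinitely) if none is live.
 */
static int sweep_timeouts(struct conn *c, size_t n, time_t now)
{
    long next = -1;
    size_t i;

    assert(c != NULL || n == 0);
    for (i = 0; i < n; i++) {
        long left;

        if (c[i].dead)
            continue;
        left = (long)(c[i].last_progress + IDLE_LIMIT_SEC - now);
        if (left <= 0) {
            c[i].dead = 1;  /* caller closes the fd and drops the client */
            continue;
        }
        if (next < 0 || left < next)
            next = left;
    }
    return next < 0 ? -1 : (int)(next * 1000);
}
```

The event loop would call sweep_timeouts() before each poll(), pass the
returned value as the poll() timeout, and update last_progress after every
successful nonblocking send.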

Ingo




'native files', 'object fingerprints' [was: sendpath()]

2001-01-16 Thread Ingo Molnar


On Mon, 15 Jan 2001, Linus Torvalds wrote:

  _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off, size_t, size)

 You want to do a non-blocking send, so that you don't block on the
 socket, and do some simple multiplexing in your server.

 And "sendpath()" cannot do that without having to look up the name
 again, and again, and again. Which makes the performance
 "optimization" a horrible pessimisation.

yep, correct. But take a look at the trick it does with file descriptors,
i believe it could be a useful way of doing things. It basically
privatizes a struct file, without inserting it into the enumerated file
descriptors. This shows that 'native files' are possible: file structs
without file descriptor integers mapped to them.

ob'plug: this privatized file descriptor mechanism is used in TUX [TUX
privatizes files by putting them into the HTTP request structure - ie.
timeouts and continuation/nonblocking logic can be done with them]. But
TUX is trusted code, and it can pass a struct file to the VFS without
having to validate it, and TUX will also free such file descriptors.

But even user-space code could use 'native files', via the following, safe
mechanism:

1) current->native_files list, freed at exit_files() time.

2) "struct native_file" which embeds "struct file". It has the following
   fields:

struct native_file {
        unsigned long master_fingerprint[8];
        unsigned long file_fingerprint[8];
        struct file file;
};

'fingerprints' are 256 bit, true random numbers. master_fingerprint is
global to the kernel and is generated once per boot. It validates the
pointer of the structure. The master fingerprint is never known to
user-space.

file_fingerprint is a 256-bit identifier generated for this native file.
The file fingerprint and the (kernel) pointer to the native file are
returned to user-space. The cryptographic strength of these 256-bit random
numbers guarantees that no breach can occur in a reasonable period of
time. It's in essence an 'encrypted' communication between kernel and
user-space.

user-space thus can pass a pointer to the following structure:

struct safe_kpointer {
        void *kaddr;
        unsigned long fingerprint[4];
};

the kernel validates kaddr in two steps. First it validates the pointer
via the master fingerprint (every valid kernel pointer must point to a
structure that starts with a copy of the master fingerprint). Then
usage-permissions are validated by checking the file fingerprint (the
per-object fingerprint).

this is a safe, very fast [ O(1) ] object-permission model. (it's a
variation of a former idea of yours.) A process can pass object
fingerprints and kernel pointers to other processes too - thus the other
process can access the object too. Threads will 'naturally' share objects,
because fingerprints are typically stored in memory.

3) on closing a native file the fingerprint is destroyed (first byte of
the master fingerprint copy is overwritten).
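
The three steps above can be sketched in plain user-space C. This is only an illustration under stated assumptions: file_stub stands in for struct file, the fingerprints are fixed at 8 x 32 bits, master_fp is a hypothetical global that a real kernel would fill from a strong random source, and the function names are invented. A real kernel could not dereference kaddr before checking that it is a plausible, mapped kernel address, and would probably want a constant-time compare instead of memcmp().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FP_WORDS 8                      /* 8 x 32 bits = 256 bits */

struct file_stub { int dummy; };        /* stand-in for struct file */

struct native_file {
    uint32_t master_fingerprint[FP_WORDS];  /* copy of the per-boot secret */
    uint32_t file_fingerprint[FP_WORDS];    /* per-object secret */
    struct file_stub file;
};

struct safe_kpointer {
    void *kaddr;
    uint32_t fingerprint[FP_WORDS];
};

/* Per-boot secret, never handed to user-space (hypothetical global). */
static uint32_t master_fp[FP_WORDS];

static void native_file_init(struct native_file *nf, const uint32_t *file_fp)
{
    memcpy(nf->master_fingerprint, master_fp, sizeof(master_fp));
    memcpy(nf->file_fingerprint, file_fp, sizeof(nf->file_fingerprint));
}

/* Returns the embedded file if both checks pass, NULL otherwise. */
static struct file_stub *native_file_lookup(const struct safe_kpointer *kp)
{
    struct native_file *nf;

    assert(kp != NULL);
    nf = kp->kaddr;
    if (nf == NULL)
        return NULL;
    /* step 1: does the pointer lead to a live native_file at all? */
    if (memcmp(nf->master_fingerprint, master_fp, sizeof(master_fp)) != 0)
        return NULL;
    /* step 2: does the caller know this object's own fingerprint? */
    if (memcmp(nf->file_fingerprint, kp->fingerprint,
               sizeof(kp->fingerprint)) != 0)
        return NULL;
    return &nf->file;
}

/* Closing: flip the first byte of the master copy so step 1 fails. */
static void native_file_close(struct native_file *nf)
{
    ((unsigned char *)nf->master_fingerprint)[0] ^= 0xff;
}
```

The close path flips the byte instead of storing a constant, so the copy is guaranteed to differ from the master fingerprint afterwards.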

what do you think about this? I believe most of the file APIs can be /
should be reworked to use native files, and 'Unix files' would just be a
compatibility layer parallel to them. Then various applications could
convert to 'native file' usage - i believe file servers which have lots of
file descriptors would do this first.

(this 'fingerprint' mechanism can be used for any object, not only files.)

Ingo




Re: 'native files', 'object fingerprints' [was: sendpath()]

2001-01-16 Thread Ingo Molnar


On Tue, 16 Jan 2001, Andi Kleen wrote:

 On Tue, Jan 16, 2001 at 10:48:34AM +0100, Ingo Molnar wrote:
  this is a safe, very fast [ O(1) ] object-permission model. (it's a
  variation of a former idea of yours.) A process can pass object
  fingerprints and kernel pointers to other processes too - thus the other
  process can access the object too. Threads will 'naturally' share objects,
 ...

 Just setuid etc. doesn't work with that because access cannot be
 easily revoked without disturbing other clients.

well, you cannot easily close() an already shared file descriptor in
another process's context either. Is revocation so important? Why is
setuid() a problem? A native file is just like a normal file, with the
difference that not an integer but a fingerprint identifies it, and that
access and usage counts are not automatically inherited across some
explicit sharing interface.

perhaps we could get most of the advantages by allowing the relaxation of
the 'allocate first free file descriptor number' rule for normal Unix
files?

 Also the model depends on good secure random numbers, which is
 questionable in many environments (e.g. a diskless box where the
 random device effectively gets no new input)

true, although newer chipsets include hardware random generators. But
indeed, object fingerprints (tokens? ids?) make the random generator a
much more central thing.

Ingo




O_ANY [was: Re: 'native files', 'object fingerprints' [was:sendpath()]]

2001-01-16 Thread Ingo Molnar


On Tue, 16 Jan 2001, Andi Kleen wrote:

  the 'allocate first free file descriptor number' rule for normal Unix
  files?
 Not sure I follow. You mean dup2() ?

I'm sure you know this: when there are thousands of files open already,
much of the overhead of opening a new file comes from the mandatory POSIX
requirement of allocating the first not yet allocated file descriptor
integer to this file. Eg. if files 0, 1, 2, 10, 11 are already open, the
kernel must allocate file descriptor 3. Many utilities rely on this, and
the rule makes sense in a select() environment, because it compresses the
'file descriptor spectrum'. But in a non-select(), event-driven environment
it becomes unnecessary overhead.
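
To make the first-free rule concrete, here is a toy model of it, not the kernel's actual get_unused_fd(): a flat in-use table scanned from 0 upwards. MAX_FDS, open_fds, mark_open() and alloc_first_free() are names invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FDS 64

/* open_fds[i] != 0 means descriptor i is in use. */
static unsigned char open_fds[MAX_FDS];

/* Mark a specific descriptor as open (used to set up the example). */
static void mark_open(int fd)
{
    assert(fd >= 0 && fd < MAX_FDS);
    open_fds[fd] = 1;
}

/* POSIX rule: hand out the lowest-numbered free descriptor, or -1. */
static int alloc_first_free(void)
{
    int fd;

    for (fd = 0; fd < MAX_FDS; fd++) {
        if (!open_fds[fd]) {
            open_fds[fd] = 1;
            return fd;
        }
    }
    return -1;
}
```

With 0, 1, 2, 10 and 11 marked open, the scan stops at 3. With thousands of descriptors open and no gaps, every open() walks past all of them first.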

- probably the most radical solution is what i suggested, to completely
avoid the unique-mapping of file structures to an integer range, and use
the address of the file structure (and some cookies) as an identification.

- a less radical solution would be to still map file structures to an
integer range (file descriptors) and maintain file usage per process,
but relax the 'allocate first non-allocated integer in the range' rule.
I'm not sure exactly how simple this is, but something like this should
work: on close()-ing file descriptors the freed file descriptors would be
cached in a list (this needs a new, separate structure which must be
allocated/freed as well). Something like:

struct lazy_filedesc {
        int fd;
        struct file *file;
};

struct task {
        ...
        struct lazy_filedesc *lazy_files;
        ...
};

the actual filedescriptor bit of a 'lazy file' would be cleared for real
on close(), and the '*file' field is not a real file - it's NULL if at
close() time this process wasn't the last user of the file, or contains a
pointer to an allocated (but otherwise invalid) file structure. This must
happen to preserve the first-free-desc rule for normal open()s, and to
optimize freeing/allocation of file structures. Now, if the new code does a:

fd = open(...,O_ANY);

then the kernel looks at the current->lazy_files list, and tries to set
the file descriptor bit in the current->files file table. If successful
then open() uses desc->fd and desc->file (if available) for opening the
new file, and unlinks+frees the lazy descriptor. If unsuccessful then
open() frees desc->file, frees and unlinks the descriptor and goes on to
look at the next descriptor.

- worst case overhead is the extra allocation overhead of the (very small)
  lazy file descriptor. Worst-case happens only if O_ANY allocation is
  mixed in a special way with normal open()s.

- Best-case overhead saves us a get_unused_fd() call, which can be *very*
  expensive (in terms of CPU time and cache footprint) if thousands of
  files are used. If O_ANY is used mostly, then the best-case is always
  triggered.

- (the number of lazy files must be limited to some sane value)

at exit_files() time the current->lazy_files list must be processed. On
exec() it does not get inherited.

current->lazy_files has no effect on task state or semantics otherwise,
it's only an isolated 'information cache'.
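
A rough user-space model of the O_ANY idea, to show the control flow only. It leaves out the cached file structure and the limit on the list length, and struct files, open_first_free(), close_fd() and open_any() are names invented for the sketch, not kernel interfaces.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_FDS 64

struct lazy_filedesc {
    int fd;                        /* a recently closed descriptor number */
    struct lazy_filedesc *next;
};

struct files {
    unsigned char open_fds[MAX_FDS];    /* 1 = descriptor in use */
    struct lazy_filedesc *lazy_files;   /* cache of freed numbers */
};

/* Normal open(): first-free rule, a linear scan. */
static int open_first_free(struct files *f)
{
    int fd;

    for (fd = 0; fd < MAX_FDS; fd++) {
        if (!f->open_fds[fd]) {
            f->open_fds[fd] = 1;
            return fd;
        }
    }
    return -1;
}

/* close(): clear the bit for real, then remember the number. */
static void close_fd(struct files *f, int fd)
{
    struct lazy_filedesc *d;

    assert(fd >= 0 && fd < MAX_FDS && f->open_fds[fd]);
    f->open_fds[fd] = 0;
    d = malloc(sizeof(*d));
    if (d == NULL)
        return;                    /* the cache is optional */
    d->fd = fd;
    d->next = f->lazy_files;
    f->lazy_files = d;
}

/* open(..., O_ANY): try cached numbers first, skipping stale ones. */
static int open_any(struct files *f)
{
    while (f->lazy_files != NULL) {
        struct lazy_filedesc *d = f->lazy_files;
        int fd = d->fd;

        f->lazy_files = d->next;   /* unlink and free in either case */
        free(d);
        if (!f->open_fds[fd]) {
            f->open_fds[fd] = 1;   /* slot still free: take it */
            return fd;
        }
        /* stale entry: a normal open() reused this number meanwhile */
    }
    return open_first_free(f);     /* nothing cached: fall back to scan */
}
```

Because close() really clears the bit, a normal open() still sees the lowest free number; the lazy list can only go stale, it can never hand out a descriptor that is in use.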

Have i missed something important?

Ingo




Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was:sendpath()]]

2001-01-16 Thread Ingo Molnar


On Tue, 16 Jan 2001, Ingo Molnar wrote:

   struct lazy_filedesc {
   int fd;
   struct file *file;
   }

in fact "struct file" can be (ab)used for this, no need for new structures or
new fields. Eg. file->f_flags contains the cached descriptor-information.
file->f_list is used for the current->lazy_files ringlist.

this way there is no additional allocation overhead in the worst-case.

(unless i'm missing something obvious.)

Ingo




Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was:sendpath()]]

2001-01-16 Thread Ingo Molnar


On Tue, 16 Jan 2001, Peter Samuelson wrote:

 [Ingo Molnar]
  - probably the most radical solution is what i suggested, to
  completely avoid the unique-mapping of file structures to an integer
  range, and use the address of the file structure (and some cookies)
  as an identification.

 Careful, these must cast to non-negative integers, without clashing.

if you read my (radical) proposal, the identification is based on a kernel
pointer and a 256-bit random integer. So non-negative integers are not
needed. (file-IO system-calls would be modified to detect if 'Unix file
descriptors' or pointers to 'native file descriptors' are passed to them,
so this is truly radical.)

Ingo




Re: Is sendfile all that sexy?

2001-01-16 Thread Ingo Molnar


On Tue, 16 Jan 2001, Felix von Leitner wrote:

 I don't know how Linux does it, but returning the first free file
 descriptor can be implemented as O(1) operation.

only if special allocation patterns are assumed. Otherwise it cannot be a
generic O(1) solution. The first-free rule adds an implicit ordering to
the file descriptor space, and this order cannot be maintained in an O(1)
way. Linux can allocate up to a million file descriptors.

Ingo




Re: Is sendfile all that sexy?

2001-01-16 Thread Ingo Molnar


On Tue, 16 Jan 2001, Felix von Leitner wrote:

 I don't know how Linux does it, but returning the first free file
 descriptor can be implemented as O(1) operation.

to put it more accurately: the requirement is to be able to open(), use
and close() an unlimited number of file descriptors with O(1) overhead,
under any allocation pattern, with only RAM limiting the number of files.
Both of my proposals attempt to provide this. It's possible to do an O(1)
open() with an O(log(N)) close(), but that is of no practical value IMO.

Ingo




Re: Hi memory support in 2.4 not working correctly.

2001-01-17 Thread Ingo Molnar


On Wed, 17 Jan 2001, Micah Gorrell wrote:

 I have a compaq 8 way server with 4 gigs of memory.  I am running 2.4.0 and
 everything works just fine (except the gig - I'm still fighting with that)
 The only strange thing that I am seeing is that I only see 3.3 gigs of
 memory instead of the full 4.  Has anyone seen this and possibly know of a
 fix?

could you run this command against your .config:

grep -i highmem .config

what does it say?

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Wed, 17 Jan 2001, dean gaudet wrote:

 with the linux TCP_CORK API you only get one trailing small packet.
 in case you haven't heard of TCP_CORK -- when the cork is set, the
 kernel is free to send any maximum size packets it can form, but has
 to hold on to the stragglers until userland gives it more data or pops
 the cork.

TCP_CORK has been basically replaced by MSG_MORE these days. The problem
with the cork approach is that it's a persistent socket flag - and it
easily triggers programming errors when the application writer tracks the
state of the cork incorrectly. Also, removing the cork is one extra
system-call. So what you can do with MSG_MORE is to specify at
sendmsg()/writev() time, whether at the end of the buffer there is a cork
or not.

this is what TUX uses. When eg. a static HTTP request arrives it sends
reply headers shortly after having checked file permissions and stuff (but
the file is not yet sent), with MSG_MORE set. Then it sends the file, and
sendfile() keeps MSG_MORE set right until the end of the request, when it
clears it for the last fragment so the last partial packet gets flushed to
the network. In fact there is one more optimization here, if the request
is not keepalive then TUX still keeps MSG_MORE set, and closes the socket -
which will implicitly flush the output queue anyway and send any partial
packet, but will also have the FIN packet merged with the last outgoing
packet.

(if there is saturation then further merging might happen as well - if a
sendmsg() comes before a partial, but already xmit-queued packet is sent,
then the TCP layer merges this sendmsg() output with the outgoing packet.)
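
From user-space the same pattern looks roughly like the sketch below: the header send carries MSG_MORE, the final send does not. send_reply() is a hypothetical helper, not TUX code, and the fallback define only exists so the sketch still builds where MSG_MORE is unavailable.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#ifndef MSG_MORE
#define MSG_MORE 0      /* no MSG_MORE here: degrade to plain send() */
#endif

/*
 * Send headers and body as one logical reply. The first send tells the
 * stack that more data follows; the last one, sent without the flag,
 * marks the packet boundary.
 * Returns 0 on success, -1 on error or short write.
 */
static int send_reply(int sock, const char *hdr, const char *body)
{
    size_t hlen, blen;

    assert(hdr != NULL && body != NULL);
    hlen = strlen(hdr);
    blen = strlen(body);
    if (send(sock, hdr, hlen, MSG_MORE) != (ssize_t)hlen)
        return -1;
    if (send(sock, body, blen, 0) != (ssize_t)blen)
        return -1;
    return 0;
}
```

No socket state is changed and no extra system call is made, which is the difference from corking and uncorking around the two writes.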

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Wed, 17 Jan 2001, Rick Jones wrote:

 i'd heard interesting generalities but no specifics. for instance,
 when the send is small, does TCP wait exclusively for the app to
 flush, or is there an "if all else fails" sort of timer running?

yes there is a per-socket timer for this. According to RFC 1122 a TCP
stack 'MUST NOT' buffer app-sent TCP data indefinitely if the PSH bit
cannot be explicitly set by a SEND operation. Was this a trick question?
:-)

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Wed, 17 Jan 2001, Linus Torvalds wrote:

 (I also had one person point out that BSD's have the notion of
 TCP_NOPUSH, which does almost what TCP_CORK does under Linux, except
 it doesn't seem to have the notion of uncorking - you can turn NOPUSH
 off, but apparently it doesn't affect queued packets. This makes it
 even less clear why they have the ugly sendfile)

this is what MSG_MORE does. Basically i added MSG_MORE for the purpose of
getting perfect TUX packet boundaries (and was ignorant enough to not know
about BSD's NOPUSH), without an additional system-call overhead, and
without the persistency of TCP_CORK. Alexey and David agreed, and actually
implemented it correctly :-)

basically if MSG_MORE is not set that means an explicit packet boundary in
the noncontended scenario. If MSG_MORE is set then that means all full-MSS
packets are queued, while partial packets are held back (until they time out).
sendfile() uses the more flag internally - i've changed sendfile() in my
tree to specify the more flag from higher levels as well - eg. if a sent
file is embedded into other replies, or multiple files are sent.

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Wed, 17 Jan 2001, Rick Jones wrote:

 certainly, i see by your examples how cork can make life easier on the
 developer - they can putc() the reply if they want. for a persistent
 http connection, there would be the cork and uncork each time, for a
 pipelined connection, it is basically a race - how does the client
 present requests to the connection, what are the speeds of that
 connection relative to the speed of the server getting replies into
 the socket that sort of thing.

such dynamic properties should IMO never become visible to user-space
interfaces. TCP_CORK/MSG_MORE (which are both the same thing, in
a different interface) are a way to specify the logical neighborhood of
app-side SENDs. I believe the most sensible and generic thing to do is to
require MSG_MORE information from the application: 'is it likely that the
application is going to SEND something soon, or not?'.

Submitting an exact timetable of planned future SENDs (with a fully
specified probability distribution of every expected future SEND event)
would be the most informative thing to do, but this is not very practical.

Basically MSG_MORE is a simplified probability distribution of the next
SEND, and it already covers all the other (iovec, nagle, TCP_CORK)
mechanisms available, in a surprisingly easy way IMO. I believe MSG_MORE is
very robust from a theoretical point of view.

To use this information to judge saturation situations properly is
completely up to the stack.

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Thu, 18 Jan 2001, Linus Torvalds wrote:

 Yeah, and how are you going to teach a perl CGI script that writes to
 stdout to use it?

yep, correct. But you can have TCP_CORK behavior from user-space (by
setting the cork flag in user-space and writing it for all network
output), while you cannot have MSG_MORE in the TCP_CORK case. And a perl
script will likely use none of these mechanisms, it's the webserver CGI
host code that does the network send, perl CGI scripts do not send to the
network directly, they send to a pipe so the CGI host code can have
absolute control over eg. CGI-generated HTTP headers.

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Thu, 18 Jan 2001, Linus Torvalds wrote:

 Remember the UNIX philosophy: everything is a file. MSG_MORE
 completely breaks that, because the only way to use it is with
 send[msg](). It's absolutely unusable with something like a
 traditional UNIX "anonymous" application that doesn't know or care
 that it's writing to the network.

yep you are right - i only thought in terms of applications that know that
they are dealing with a network.

 In contrast, TCP_CORK has an interface much like TCP_NOPUSH, along
 with the notion of persistency. The difference between those two is
 that TCP_CORK really took the notion of persistency to the end, and
 made uncorking actually say "Ok, no more packets". You can't do that
 with TCP_NOPUSH: with TCP_NOPUSH you basically have to know what your
 last write is, and clear the bit _before_ that write if you want to
 avoid bad latencies (alternatively, you can just close the socket,
 which works equally well, and was probably the designed interface for
 the thing. That has the disadvantage of, well, closing the socket - so
 it doesn't work if you don't _know_ whether you'd write more or not).

i believe BSD's TCP_NOPUSH should add those 3 lines that are needed to
flush pending packets, this is what we do too - we do a
tcp_push_pending_frames() if the socket option TCP_CORK is cleared.

 So the three are absolutely not equivalent. I personally think that
 TCP_NOPUSH is always the wrong thing - it has the persistency without
 the ability to shut it off gracefully after the fact. In contrast,
 both MSG_MORE and TCP_CORK have well-defined behaviour but they have
 very different uses.

yep - i agree now. In terms of network-aware applications, i found
MSG_MORE to be both cheaper and less bug-prone - but for uncooperative (or
simply too generic) applications which are outputting to simple files
the only way to control buffering is some persistent mechanism.

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Thu, 18 Jan 2001 [EMAIL PROTECTED] wrote:

 Actually, TUX-1.1 (Ingo, do I not lie, did you not kill this code?)
 does this. It does not ack quickly, when complete request is received
 and still not answered, so that all the redundant acks disappear.

(it's TUX 2.0 meanwhile), and yes, TUX uses it. We speculatively delay the
ACK of a parsed input packet in the hope of merging it with the first output
packet. If the output frame does not happen for 200 msecs then we send a
standalone ACK to be RFC-conformant. This way TUX can do single-packet web
replies for small requests. (well, plus SYN-ACK and FIN-ACK)

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Thu, 18 Jan 2001, Linus Torvalds wrote:

 I think Andrea was thinking more of the case of the anonymous IO
 generator, and having the "controller" program that keeps the socket
 always in CORK mode, but uses SIOCPUSH when it doesn't know what the
 future access patterns will be.

yep.

 Again, the actual data _senders_ may not be aware of the network
 issues. They are the worker bees, and they may not know or care that
 they are pushing out data to the network.

yep.

 Ingo, you should realize that people actually _want_ to use things
 like stdio. [...]

yep, i already acknowledged that not all applications want to care about
issues like that and rather want to have a 'default behavior' - ie. a
persistent cork.

i also said that user-space (ie. libc) could maintain a persistent flag
itself (a user-space variable) much cheaper than the kernel, and could
pass the current 'more' value to the kernel, whenever sendmsg is done. I
understand that normal file IO has no 'flag' for MSG_MORE - a pity that no
extra flags can be passed in to write(). But this doesn't make it right. It
makes it a practical problem, it shows the (perhaps-) weakness of the file
API which is right now not prepared to pass 'streaming related info' along
with a send.

now if your point is that passing a flag (or flags) along every (generic)
file-write would be a mistake, that would be a point. But you didn't say
that so far.

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Thu, 18 Jan 2001, Andrea Arcangeli wrote:

  {
  int val = 1;
  setsockopt(req->sock, IPPROTO_TCP, TCP_CORK,
  (char *)&val,sizeof(val));
  val = 0;
  setsockopt(req->sock, IPPROTO_TCP, TCP_CORK,
  (char *)&val,sizeof(val));
  }
 
  differ from what you posted. It does the same in my opinion. Maybe we are
  not talking about the same thing?

 The above is equivalent to SIOCPUSH _only_ if the caller wasn't using either
 TCP_NODELAY or TCP_CORK.

why? I can restore whatever state i want, the above is just a mechanism to
force the push.
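
A sketch of what 'restore whatever state i want' could look like from user-space. force_push() and set_tcp_opt() are made-up helper names; the sketch reads TCP_CORK and TCP_NODELAY first and re-applies them after the cork/uncork pair, in case clearing the cork disturbs them, which is the concern in the quoted reply. It costs several system calls, so it is the slow way to get a push.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Set one integer TCP-level option; returns 0 or -1 like setsockopt(). */
static int set_tcp_opt(int sock, int opt, int val)
{
    return setsockopt(sock, IPPROTO_TCP, opt, &val, sizeof(val));
}

/*
 * Force out any pending partial frame by corking and then uncorking,
 * then put TCP_CORK and TCP_NODELAY back the way the caller had them.
 * Returns 0 on success, -1 on error.
 */
static int force_push(int sock)
{
    int cork = 0, nodelay = 0;
    socklen_t len;

    assert(sock >= 0);
    len = sizeof(cork);
    if (getsockopt(sock, IPPROTO_TCP, TCP_CORK, &cork, &len) < 0)
        return -1;
    len = sizeof(nodelay);
    if (getsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &nodelay, &len) < 0)
        return -1;

    if (set_tcp_opt(sock, TCP_CORK, 1) < 0)
        return -1;
    if (set_tcp_opt(sock, TCP_CORK, 0) < 0)    /* clearing the cork pushes */
        return -1;

    /* re-apply whatever the caller had set before */
    if (nodelay && set_tcp_opt(sock, TCP_NODELAY, 1) < 0)
        return -1;
    if (cork && set_tcp_opt(sock, TCP_CORK, 1) < 0)
        return -1;
    return 0;
}
```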

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Thu, 18 Jan 2001, Andrea Arcangeli wrote:

 This is a possible slow (but userspace based) implementation of SIOCPUSH:

of course this is what i meant. Let's stop wasting time on this, ok?

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Thu, 18 Jan 2001, Andrea Arcangeli wrote:

 BTW, the symmetry between getsockopt/setsockopt further bias how
 SIOCPUSH doesn't fit into the setsockopt options but it fits very well
 into the ioctl category instead. There's simply no state one can
 return via getsockopt for the SIOCPUSH functionality. It's not setting
 any option, it's just doing one thing that controls the I/O.

you convinced me. I guess i was just distracted by the common wisdom:
'ioctls are a hack'. But SIOCPUSH *IS* an 'IO control' after all :-)

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Thu, 18 Jan 2001, Andrea Arcangeli wrote:

 Agreed. However since TCP_CORK logic is more generic than MSG_MORE
 [...]

why? TCP_CORK is equivalent to MSG_MORE, it's just a different
representation of the same issue. TCP_CORK needs an extra syscall (in the
case of a push event - which might be rare), the MSG_MORE solution needs
an extra flag (which is merged with other flags in the send() case).

  i believe it should rather be a new setsockopt TCP_CORK value (or a new
  setsockopt constant), not an ioctl. Eg. a value of 2 to TCP_CORK could
  mean 'force packet boundary now if possible, and dont touch TCP_CORK
  state'.


 Doing PUSH from setsockopt(TCP_CORK) looked obviously wrong because it
 isn't setting any socket state, [...]

well, neither is clearing/setting TCP_CORK ...

 and also because the SIOCPUSH has nothing specific with TCP_CORK, as
 said it can be useful also to flush the last fragment of data pending
 in the send queue without having to wait all the unacknowledged data
 to be acknowledged from the receiver when TCP_NODELAY isn't set.

huh? in what way does the following:

{
        int val = 1;
        setsockopt(req->sock, IPPROTO_TCP, TCP_CORK,
                        (char *)&val,sizeof(val));
        val = 0;
        setsockopt(req->sock, IPPROTO_TCP, TCP_CORK,
                        (char *)&val,sizeof(val));
}

differ from what you posted? It does the same in my opinion. Maybe we are
not talking about the same thing?

 Changing the semantics of setsockopt(TCP_CORK, 2) would also break
 backwards compatibility with all 2.[24].x kernels out there.

[this is nitpicking. I'm quite sure all the code uses '1' as the value,
not 2.]

Ingo




Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Ingo Molnar


On Thu, 18 Jan 2001, Andrea Arcangeli wrote:

 I'm all for TCP_CORK but it has the disadvantage of two syscalls for
 doing the flush of the outgoing queue to the network. And one of those
 two syscalls is spurious. [...]

i believe a network-conscious application should use MSG_MORE - that has
no system-call overhead.

 + case SIOCPUSH:
 + lock_sock(sk);
 + __tcp_push_pending_frames(sk, tp, tcp_current_mss(sk), 1);
 + release_sock(sk);
 + break;

i believe it should rather be a new setsockopt TCP_CORK value (or a new
setsockopt constant), not an ioctl. Eg. a value of 2 to TCP_CORK could
mean 'force packet boundary now if possible, and dont touch TCP_CORK
state'.

Ingo




Re: Is sendfile all that sexy?

2001-01-21 Thread Ingo Molnar


On Sun, 21 Jan 2001, James Sutherland wrote:

 For many applications, yes - but think about a file server for a
 moment. 99% of the data read from the RAID (or whatever) is really
 aimed at the appropriate NIC - going via main memory would just slow
 things down.

patently wrong. Compare the bandwidth of PCI and the bandwidth of memory
controllers. PCI-PCI transfers are slower, have higher latency and use up
more valuable (PCI) bandwidth. The number of situations where PCI-PCI
transactions are the preferred method is *very* limited, and i think we
should deal with them when we see them. But this has been said at the very
beginning of this thread already, please read it all ...

Ingo




[patch] wait4-2.4.0-A0

2001-01-22 Thread Ingo Molnar


the attached patch (against -pre9) fixes a possibly dangerous sys_wait4()
prototype mismatch.

Ingo


--- linux/include/linux/sched.h.orig  Mon Jan 22 17:28:36 2001
+++ linux/include/linux/sched.h Mon Jan 22 17:29:17 2001
@@ -563,6 +563,7 @@
 #define wake_up_interruptible_all(x)   __wake_up((x),TASK_INTERRUPTIBLE, 0)
 #define wake_up_interruptible_sync(x)  __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
 #define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE,  nr)
+asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
 
 extern int in_group_p(gid_t);
 extern int in_egroup_p(gid_t);
--- linux/arch/i386/kernel/signal.c.orig  Mon Jan 22 17:28:25 2001
+++ linux/arch/i386/kernel/signal.c Mon Jan 22 17:28:31 2001
@@ -26,8 +26,6 @@
 
 #define _BLOCKABLE (~(sigmask(SIGKILL) | sigmask(SIGSTOP)))
 
-asmlinkage int sys_wait4(pid_t pid, unsigned long *stat_addr,
-int options, unsigned long *ru);
 asmlinkage int FASTCALL(do_signal(struct pt_regs *regs, sigset_t *oldset));
 
 int copy_siginfo_to_user(siginfo_t *to, siginfo_t *from)



[patch] new, scalable timer implementation, smptimers-2.4.0-B1

2001-01-28 Thread Ingo Molnar


a new, 'ultra SMP scalable' implementation of Linux kernel timers is now
available for download:

http://www.redhat.com/~mingo/scalable-timers/smptimers-2.4.0-B1

the patch is against 2.4.1-pre10 or ac12. The timer design in this
implementation is a work of David Miller, Alexey Kuznetsov and myself.

Internals: the current 2.4 timer implementation uses a global spinlock for
synchronizing access to the global timer lists. This causes excessive
cacheline ping-pongs and visible performance degradation under very high
TCP networking load (and other, timer-intensive operations).

The new implementation introduces per-CPU timer lists and per-CPU
spinlocks that protect them. All timer operations, add_timer(),
del_timer() and mod_timer() are still O(1) and cause no cacheline
contention at all (because all data structures are separated). All
existing semantics of Linux timers are preserved, so the patch is
'transparent' to all other subsystems.

In addition, the role of TIMER_BH has been redefined, and run_local_timers
is used directly from APIC timer interrupts to run timers (not from
TIMER_BH). This means that timer expiry is per-CPU as well - it is global
in vanilla 2.4. Every timer is started and expired on the CPU where it has
been added. Timers get migrated between CPUs if mod_timer() is done on
another CPU (eg. because a process using them migrates to another CPU).
In the typical case timer handling is completely localized to one CPU.
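
to illustrate the per-CPU layout, here is a minimal user-space sketch. It is not the smptimers code itself: NR_CPUS_DEMO, struct demo_timer, struct demo_timer_base and all demo_* helpers are made-up names, and the per-CPU spinlock is reduced to a flag so that the sketch stays single-threaded:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_DEMO 4               /* hypothetical CPU count for the sketch */

struct demo_timer {
    struct demo_timer *next, *prev;  /* doubly linked, so unlinking is O(1) */
    unsigned long expires;
    int cpu;                         /* base the timer is queued on, -1 if none */
};

struct demo_timer_base {
    int lock_held;                   /* stand-in for a per-CPU spinlock */
    struct demo_timer head;          /* sentinel of a circular list */
    int nr_pending;
};

static struct demo_timer_base demo_bases[NR_CPUS_DEMO];

static void demo_init_bases(void)
{
    for (int i = 0; i < NR_CPUS_DEMO; i++) {
        demo_bases[i].head.next = demo_bases[i].head.prev = &demo_bases[i].head;
        demo_bases[i].lock_held = 0;
        demo_bases[i].nr_pending = 0;
    }
}

/* Queue a timer on the list of the CPU doing the add: only that CPU's data is touched. */
static void demo_add_timer(struct demo_timer *t, int this_cpu)
{
    struct demo_timer_base *base = &demo_bases[this_cpu];

    assert(!base->lock_held);
    base->lock_held = 1;             /* a kernel would take the per-CPU spinlock here */
    t->next = &base->head;
    t->prev = base->head.prev;
    base->head.prev->next = t;
    base->head.prev = t;
    t->cpu = this_cpu;
    base->nr_pending++;
    base->lock_held = 0;
}

/* Unlink in O(1); returns 1 if the timer was pending, 0 otherwise. */
static int demo_del_timer(struct demo_timer *t)
{
    struct demo_timer_base *base;

    if (t->cpu < 0)
        return 0;
    base = &demo_bases[t->cpu];
    assert(!base->lock_held);
    base->lock_held = 1;
    t->prev->next = t->next;
    t->next->prev = t->prev;
    t->cpu = -1;
    base->nr_pending--;
    base->lock_held = 0;
    return 1;
}

/* Re-arm: the timer migrates to the CPU that calls the modify operation. */
static void demo_mod_timer(struct demo_timer *t, unsigned long expires, int this_cpu)
{
    demo_del_timer(t);
    t->expires = expires;
    demo_add_timer(t, this_cpu);
}
```

a timer has to start out with cpu set to -1. Adding it on CPU 0 and then modifying it from CPU 2 moves it from the first list to the third one, without touching any shared state.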

The new timers still maintain 'semantical compatibility' with older
concepts such as the IRQ lock and manipulation of TIMER_BH state. These
constructs are quite rare already, in 2.5 they can be removed completely.

the patch has been sanity tested on UP-pure, UP-APIC, UP-IOAPIC and SMP
systems. Reports/comments/questions/suggestions welcome!

Ingo




[patch] raid-B1, 2.4.1-pre11, fixes, cleanups

2001-01-29 Thread Ingo Molnar


On Tue, 30 Jan 2001, Neil Brown wrote:

  -#define MAX_MD_BOOT_DEVS 8
  +#define MAX_MD_BOOT_DEVS MAX_MD_DEVS

 Actually, this is not fine.  Check the code that says:

indeed - it will work only up to 32 devices.

i've fixed the code to not have this assumption - it's init-time code only
anyway. There are also a few other cleanups in raid-2.4.1-B1:

 - CONFIG_MD_BOOT gone. __init thing and people actually use it.

 - CONFIG_AUTODETECT_RAID gone. __init thing, and can be turned off via
   command-line. Needs special partition ID to be activated anyway.

 - static-init cleanups (no need to initialize to zero)

 - new RAID_AUTORUN ioctl for initrd kernels to be able to start up
   autostart arrays.
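
the static-init cleanup relies on a plain C guarantee: objects with static storage duration and no explicit initializer start out as zero. A small sketch, with demo_detected_devices, demo_dev_cnt and the demo_* functions being made-up stand-ins for the md.c variables:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for detected_devices[] and dev_cnt in md.c. */
static int demo_detected_devices[128];   /* no initializer: every element is 0 */
static int demo_dev_cnt;                 /* no initializer: starts at 0 */

/* Returns 1 if the array and the counter are still all zero. */
static int demo_all_zero(void)
{
    size_t n = sizeof(demo_detected_devices) / sizeof(demo_detected_devices[0]);

    for (size_t i = 0; i < n; i++)
        if (demo_detected_devices[i] != 0)
            return 0;
    return demo_dev_cnt == 0;
}

/* Mirrors the shape of md_autodetect_dev(): record a device, bounded at 127 entries. */
static void demo_record_dev(int dev)
{
    assert(demo_dev_cnt >= 0);
    if (demo_dev_cnt < 127)
        demo_detected_devices[demo_dev_cnt++] = dev;
}
```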

code is cleaner and simpler now. The patch removes a total of 7 lines,
while adding a new feature :-)

Dave, does this patch do the trick for you? (raid-2.4.1-B1 is against the
2.4.1-pre11 kernel.)

Ingo


--- linux/fs/partitions/msdos.c.orig Wed Jul 19 08:29:16 2000
+++ linux/fs/partitions/msdos.c Mon Jan 29 23:41:53 2001
@@ -36,7 +36,7 @@
 #include "check.h"
 #include "msdos.h"
 
-#if CONFIG_BLK_DEV_MD && CONFIG_AUTODETECT_RAID
+#if CONFIG_BLK_DEV_MD
 extern void md_autodetect_dev(kdev_t dev);
 #endif
 
@@ -136,7 +136,7 @@
add_gd_partition(hd, current_minor,
 this_sector+START_SECT(p)*sector_size,
 NR_SECTS(p)*sector_size);
-#if CONFIG_BLK_DEV_MD && CONFIG_AUTODETECT_RAID
+#if CONFIG_BLK_DEV_MD
if (SYS_IND(p) == LINUX_RAID_PARTITION) {
md_autodetect_dev(MKDEV(hd->major,current_minor));
}
@@ -448,7 +448,7 @@
continue;
add_gd_partition(hd, minor, first_sector+START_SECT(p)*sector_size,
 NR_SECTS(p)*sector_size);
-#if CONFIG_BLK_DEV_MD && CONFIG_AUTODETECT_RAID
+#if CONFIG_BLK_DEV_MD
if (SYS_IND(p) == LINUX_RAID_PARTITION) {
md_autodetect_dev(MKDEV(hd->major,minor));
}
--- linux/include/linux/raid/md_u.h.orig Tue Nov 14 22:16:37 2000
+++ linux/include/linux/raid/md_u.h Mon Jan 29 23:41:53 2001
@@ -22,6 +22,7 @@
 #define GET_ARRAY_INFO _IOR (MD_MAJOR, 0x11, mdu_array_info_t)
 #define GET_DISK_INFO  _IOR (MD_MAJOR, 0x12, mdu_disk_info_t)
 #define PRINT_RAID_DEBUG   _IO (MD_MAJOR, 0x13)
+#define RAID_AUTORUN   _IO (MD_MAJOR, 0x14)
 
 /* configuration */
 #define CLEAR_ARRAY            _IO (MD_MAJOR, 0x20)
--- linux/drivers/md/md.c.orig  Mon Dec 11 22:19:35 2000
+++ linux/drivers/md/md.c   Mon Jan 29 23:42:53 2001
@@ -2033,68 +2033,65 @@
 struct {
int set;
int noautodetect;
+} raid_setup_args md__initdata;
 
-} raid_setup_args md__initdata = { 0, 0 };
-
-void md_setup_drive(void) md__init;
+void md_setup_drive (void) md__init;
 
 /*
  * Searches all registered partitions for autorun RAID arrays
  * at boot time.
  */
-#ifdef CONFIG_AUTODETECT_RAID
-static int detected_devices[128] md__initdata = { 0, };
-static int dev_cnt=0;
+static int detected_devices[128] md__initdata;
+static int dev_cnt;
+
 void md_autodetect_dev(kdev_t dev)
 {
if (dev_cnt >= 0 && dev_cnt < 127)
detected_devices[dev_cnt++] = dev;
 }
-#endif
 
-int md__init md_run_setup(void)
+
+static void autostart_arrays (void)
 {
-#ifdef CONFIG_AUTODETECT_RAID
mdk_rdev_t *rdev;
int i;
 
-   if (raid_setup_args.noautodetect)
-   printk(KERN_INFO "skipping autodetection of RAID arrays\n");
-   else {
-
-   printk(KERN_INFO "autodetecting RAID arrays\n");
+   printk(KERN_INFO "autodetecting RAID arrays\n");
 
-   for (i=0; i<dev_cnt; i++) {
-   kdev_t dev = detected_devices[i];
+   for (i=0; i<dev_cnt; i++) {
+   kdev_t dev = detected_devices[i];
 
-   if (md_import_device(dev,1)) {
-   printk(KERN_ALERT "could not import %s!\n",
-  partition_name(dev));
-   continue;
-   }
-   /*
-* Sanity checks:
-*/
-   rdev = find_rdev_all(dev);
-   if (!rdev) {
-   MD_BUG();
-   continue;
-   }
-   if (rdev->faulty) {
-   MD_BUG();
-   continue;
-   }
-   md_list_add(&rdev->pending, &pending_raid_disks);
+   if (md_import_device(dev,1)) {
+   printk(KERN_ALERT "could not import %s!\n",
+  

Re: Desk check of raid5.c patch from mtew@cds.duke.edu?

2001-01-30 Thread Ingo Molnar


On Mon, 29 Jan 2001, Quim K Holland wrote:

 I've been following the recent 2.4.1-pre series and am wondering why
 the following one-liner (obviously correct) patch has not been
 applied. [...]

 -   return_ok = bh->b_reqnext;
 +   return_fail = bh->b_reqnext;

oops - i do have it in my tree, somehow it escaped.

Ingo




Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todo with ECN)

2001-01-30 Thread Ingo Molnar


On Tue, 30 Jan 2001, jamal wrote:

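 Kernel     |  tput  | sender-CPU | receiver-CPU |
 -------------------------------------------------
 2.4.0-pre3 | 99MB/s |   87%      |  23%         |
 NSF        |        |            |              |
 -------------------------------------------------
 2.4.0-pre3 | 68     |   8%       |  8%          |
 +ZC  SF    | MB/s   |            |              |
 -------------------------------------------------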

isn't the CPU utilization difference amazing? :-)

a couple of questions:

- is this UDP or TCP based? (UDP i guess)

- what wsize/rsize are you using? What do these requests look like on the
  network, ie. are they sufficiently MTU-sized?

- what happens if you run multiple instances of the testcode, does it
  saturate bandwidth (or CPU)?

Ingo





Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todo with ECN)

2001-01-30 Thread Ingo Molnar


On Tue, 30 Jan 2001, jamal wrote:

  - is this UDP or TCP based? (UDP i guess)
 
 TCP

well then i'd suggest to do:

echo 10 10 10 > /proc/sys/net/ipv4/tcp_wmem

does this make any difference?

Ingo




Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todo with ECN)

2001-01-31 Thread Ingo Molnar


On Wed, 31 Jan 2001, Malcolm Beattie wrote:

 Without the raised tcp_wmem setting I was getting 81 MByte/s. With
 tcp_wmem set as above I got 86 MByte/s. Nice increase. Any other
 setting I can tweak apart from {r,w}mem_max and tcp_{w,r}mem? The CPU
 on the client (350 MHz PII) is the bottleneck: gensink4 maxes out at
 69 Mbyte/s pulling TCP from the server and 94 Mbyte/s pushing. (The
 other system, 733 MHz PIII pushes 100MByte/s UDP with ttcp but the
 client drops most of it).

you can speed up the client significantly by using the MSG_TRUNC option
('truncate message'). It will zap incoming data without copying it into
user-space. (you can use this for the 'bulk transfer' part - the initial
protocol handling code needs to see the actual data.) This way you should
be able to saturate the server even more.
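
a minimal sketch of such a 'zapping' receive loop on Linux, assuming an already connected TCP socket. demo_discard() is a made-up helper name, and the scratch buffer is only passed for form, since with MSG_TRUNC the TCP code discards the bytes instead of copying them:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Discard up to `want` bytes of incoming TCP payload without copying them
 * to user space. Returns the number of bytes discarded (less than `want`
 * if the peer closed the connection early), or -1 on error.
 */
static ssize_t demo_discard(int fd, size_t want)
{
    static char scratch[65536];   /* valid buffer, not filled when MSG_TRUNC is set on TCP */
    size_t done = 0;

    assert(fd >= 0);
    while (done < want) {
        size_t chunk = want - done;
        ssize_t n;

        if (chunk > sizeof(scratch))
            chunk = sizeof(scratch);
        n = recv(fd, scratch, chunk, MSG_TRUNC);
        if (n < 0) {
            if (errno == EINTR)
                continue;         /* interrupted by a signal, retry */
            return -1;
        }
        if (n == 0)               /* peer closed the connection */
            break;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

the protocol header would still be read with a normal recv(), and only the bulk part would go through something like demo_discard().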

Ingo




Re: start_thread question...

2001-05-20 Thread Ingo Molnar


On Sun, 20 May 2001, Dave Airlie wrote:

 I'm implementing start_thread for the VAX port and am wondering does
 start_thread have to return to load_elf_binary? I'm working on the
 init thread and what is happening is it is returning the whole way
 back to the execve caller .. which I know shouldn't happen.

start_thread() doesn't do what one would intuitively think it does.

start_thread() simply prepares the new task's register set to be ready to
start user-space (which task is the current task as well, so certain
current CPU registers might have to be manually bootstrapped as well), but
start_thread() does not actually start execution of user-space code yet.

(a more correct name for start_thread() would be prepare_user_thread().)

 so I suppose what I'm looking for is the point where the user space
 code gets control... is it when the registers are set in the
 start_thread? if so how does start_thread return

execution starts when the process returns from sys_execve(). By that time
we have already changed pagetables and other context information, dropped
basically everything from the previous context - without actually doing a
context-switch. In fact sys_execve() has an implicit context-switch,
without ever changing the kernel-stack though.

 On the VAX we have to call a return from interrupt to get to user
 space and I'm trying to figure out where this should happen...

this is how it happens on x86 too. Basically you start the new binary by
returning from a syscall that has bootstrapped all userspace context -
this approach should work on any architecture. (because every architecture
has to be able to execute user-space code after syscalls.)
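
as a rough model of that flow (not any architecture's real code: struct demo_regs and all demo_* functions are made up for illustration), start_thread() only rewrites the saved user-mode frame, and the ordinary syscall return path is what actually enters the new binary:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced register frame saved on syscall entry. */
struct demo_regs {
    unsigned long ip;      /* user-space instruction pointer */
    unsigned long sp;      /* user-space stack pointer */
};

/*
 * What start_thread() conceptually does: point the saved frame at the
 * new program's entry and stack. Nothing gets executed here.
 */
static void demo_start_thread(struct demo_regs *regs,
                              unsigned long entry, unsigned long stack_top)
{
    assert(regs != NULL);
    regs->ip = entry;
    regs->sp = stack_top;
}

/* Stand-in for the exec path: load the image, prepare registers, return. */
static long demo_sys_execve(struct demo_regs *regs,
                            unsigned long elf_entry, unsigned long new_stack)
{
    /* the old address space is dropped and the new image mapped here (omitted) */
    demo_start_thread(regs, elf_entry, new_stack);
    return 0;              /* control goes back up the normal syscall return path */
}

/* Stand-in for the return-from-syscall code: reports where user space resumes. */
static unsigned long demo_ret_from_sys_call(const struct demo_regs *regs)
{
    return regs->ip;
}
```

so start_thread() returning all the way up to the execve caller is expected - the 'jump' to the new program happens when that caller returns to user space with the modified frame.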

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: 2.4.4 del_timer_sync oops in schedule_timeout

2001-05-20 Thread Ingo Molnar


On Sat, 19 May 2001, Jacob Luna Lundberg wrote:

 This is 2.4.4 with the aic7xxx driver version 6.1.13 dropped in.

 Unable to handle kernel paging request at virtual address 78626970

this appears to be some sort of DMA-corruption or other memory scribble
problem. hex 78626970 is ASCII "pibx" (read in little-endian byte order),
which points in the direction of some sort of disk-related DMA corruption.
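
the decoding can be reproduced with a few lines of C. demo_addr_to_ascii() is a made-up helper; it lists the four bytes in the order they sit in memory on a little-endian machine such as x86:

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/*
 * Render a 32-bit value as the four bytes it occupies in memory on a
 * little-endian machine (least significant byte first). Non-printable
 * bytes are shown as '.'.
 */
static void demo_addr_to_ascii(uint32_t value, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        unsigned char byte = (unsigned char)(value >> (8 * i));

        out[i] = isprint(byte) ? (char)byte : '.';
    }
    out[4] = '\0';
}
```

for 0x78626970 the bytes are 0x70, 0x69, 0x62, 0x78, ie. 'p', 'i', 'b', 'x'.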

we haven't had any similar crash in del_timer_sync() for ages.

Ingo




Re: 2.4.4 del_timer_sync oops in schedule_timeout

2001-05-20 Thread Ingo Molnar


On Sun, 20 May 2001, Jacob Luna Lundberg wrote:

   Unable to handle kernel paging request at virtual address 78626970
  this appears to be some sort of DMA-corruption or other memory scribble
  problem. hexa 78626970 is ASCII pibx, which shows in the direction of
  some sort of disk-related DMA corruption.
  we havent had any similar crash in del_timer_sync() for ages.

 Ahh.  Thanks then, I'll go look hard at the disk in that box.  :)

not necessarily the disk. it can be any sort of overheating or other
thermal noise (unlikely), or SCSI/IDE cable problem (likely), or driver
problem (likely too). Disk faults typically show very different symptoms.

Ingo




Re: Linux-2.4.5

2001-05-26 Thread Ingo Molnar


On Sat, 26 May 2001, Andrea Arcangeli wrote:

 On Sat, May 26, 2001 at 02:11:15PM -0400, Ben LaHaise wrote:
  No.  It does not fix the deadlock.  Neither does the patch you posted.

 can you give a try if you can deadlock 2.4.5aa1 just in case, and post a
 SYSRQ+T + system.map if it still deadlocks?

Andrea, can you rather start running the Cerberus testsuite instead? All
these deadlocks happen pretty early during the test, and we've been fixing
tons of these deadlocks, and no, it's not easy.

Ingo




[patch] severe softirq handling performance bug, fix, 2.4.5

2001-05-26 Thread Ingo Molnar


i've been seeing really bad average TCP latencies on certain gigabit cards
(~300-400 microseconds instead of the expected 100-200 microseconds), ever
since softnet went into the main kernel, and never found a real
explanation for it, until today.

the problem always went away when i tried to use tcpdump or strace, so the
bug remained hidden and was hard to prove that it actually existed. (apart
from the bad lat_tcp numbers.) We found many related bugs, but this
problem remained. tcpdumps done on the network did not show any fault of
the TCP stack. The lat_tcp latencies fluctuated a lot, but for certain
cards the latencies were stable, so i suspected some sort of hw problem.
The loopback networking device never showed these problems, which added to
the mystery.

the problem turned out to be a severe softirq handling bug in the x86
code.

background: soft interrupts were introduced as a generic kernel framework
around January 2000, as part of the softnet networking-rewrite, that
predated the final scalability rewrite of the Linux TCP/IP networking
code. Soft interrupts have unique semantics, they can be best described as
'IRQ-triggered atomic system calls'. (unlike bottom halves, soft-IRQs do
not preempt kernel code.)

soft-IRQs, as their name suggests, are used from device interrupts ('hard
interrupts') to trigger 'background' work related to interrupts. Soft-IRQs
are triggered per-CPU, and they are supposed to execute whenever nothing
else is done by the kernel on that particular CPU. Softirqs are executed
with interrupts enabled, so hard interrupts can re-enable them while they
are executing. do_softirq() is a kernel function that returns with IRQs
disabled and at this point it's guaranteed that there are no more pending
softirqs for this CPU.

this mechanism was the intention, but not the reality. In two important
and frequently used code paths it was possible for an active soft-IRQ to
go unnoticed: i measured as long as 140 milliseconds (!!!) latency
between softirq activation and softirq execution in certain cases. This is
obviously bad behavior.

the two error cases are:

 #1 hard-IRQ interrupts user-space code, activates softirq, and returns to
user-space code

 #2 hard-IRQ interrupts the idle task, activates softirq and returns to
the idle task.

category #1 is easy to fix, in entry.S we have to check for active softirqs
not only in the exception and ret-from-syscall cases, but also in the
IRQ-ret-to-userspace case.

category #2 is subtle, because the idle process is kernel code, so
returning to it we do not execute active softirqs. The two main types of
idle handlers both had a window to 'miss' softirq execution:

- the HLT-based default handler could be called after schedule()'s check
  for softirqs, but after enabling IRQs. In this case an interrupt handler
  has a window to activate a softirq and neither the IRQ return code, nor
  the idle loop would execute it immediately. The fix is to do a softirq
  check right before the safe_halt call.

- the idle-poll handler does not check for softirqs either, it now does
  this in every iteration.
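
the idea behind the idle-loop fix can be modelled in user-space like this. It is only a sketch: the pending mask is a single global instead of per-CPU state, and every demo_* name is invented for the example:

```c
#include <assert.h>

static unsigned int demo_softirq_pending;   /* per-CPU bitmask in a real kernel */
static int demo_softirqs_run;               /* counts handler invocations */
static int demo_halts;                      /* counts simulated HLT instructions */

/* Run and clear everything that is pending, one handler call per set bit. */
static void demo_do_softirq(void)
{
    while (demo_softirq_pending) {
        unsigned int active = demo_softirq_pending;

        demo_softirq_pending = 0;
        for (; active; active >>= 1)
            if (active & 1)
                demo_softirqs_run++;
    }
}

/* A hard interrupt that activates softirq number nr. */
static void demo_hardirq_raises(unsigned int nr)
{
    assert(nr < 32);
    demo_softirq_pending |= 1u << nr;
}

/* One pass of the idle loop: check for pending softirqs right before halting. */
static void demo_idle_once(void)
{
    /* interrupts would be disabled at this point in the real idle handler */
    if (demo_softirq_pending)
        demo_do_softirq();
    demo_halts++;          /* stands in for enabling IRQs and executing HLT */
}
```

without the check in demo_idle_once(), a softirq activated just before the halt would sit there until the next interrupt or reschedule.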

with the attached softirq-2.4.5-A0 patch applied to vanilla 2.4.5, i see
picture-perfect lat_tcp latencies of 109 microseconds over real gigabit
network. I see very stable (and very good) TUX latencies as well. TCP
bandwidth got better as well, probably due to the caching-locality bonus
when executing softirqs right after hardirqs.

[I'd like to ask everyone who had TCP latency problems (or other
networking performance problems) to test 2.4.5 with this patch applied -
thanks!]

impact of the bug: all softirq-using code is affected, mostly networking.
The loopback net driver was not affected because it's not interrupt-based.
The bug went away due to strace or tcpdump because those two utilities
pumped system-calls into the system which 'fixed' the softirq handling
bug.

(other softirq-based code is the tasklet code, and the keyboard code is
using tasklets, so the keyboard code might be affected as well.)

Ingo


--- linux/arch/i386/kernel/entry.S.orig Sat May 26 19:20:48 2001
+++ linux/arch/i386/kernel/entry.S  Sat May 26 19:21:52 2001
@@ -214,7 +214,6 @@
 #endif
jne   handle_softirq

-ret_with_reschedule:
cmpl $0,need_resched(%ebx)
jne reschedule
cmpl $0,sigpending(%ebx)
@@ -275,7 +274,7 @@
movl EFLAGS(%esp),%eax  # mix EFLAGS and CS
movb CS(%esp),%al
testl $(VM_MASK | 3),%eax   # return to VM86 mode or non-supervisor?
-   jne ret_with_reschedule
+   jne ret_from_sys_call
jmp restore_all
 
ALIGN
--- linux/arch/i386/kernel/process.c.orig   Sat May 26 19:21:56 2001
+++ linux/arch/i386/kernel/process.c Sat May 26 19:28:06 2001
@@ -79,8 +79,12 @@
  */
 static void default_idle(void)
 {
+   int this_cpu = smp_processor_id();
+
if (current_cpu_data.hlt_works_ok && !hlt_counter) {
__cli();
+  

Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-20 Thread Ingo Molnar


On Sun, 20 May 2001, Alexander Viro wrote:

 Linus, as much as I'd like to agree with you, you are hopeless
 optimist. 90% of drivers contain code written by stupid gits.

90% of drivers contain code written by people who do driver development in
their spare time, with limited resources, most of the time serving as a
learning exercise. And they do this freely and for fun. Accusing them of
being 'stupid gits' is just mischaracterising the situation. People do not
get born as VFS hackers, there is a very steep learning curve, and only a
few make it to have knowledge like you. Much of the learning curve of
various people has traces in drivers/*, it's more like the history of
Linux than some coherent image of people's capabilities.

Ingo




Re: [patch] softirq-2.4.5-B0

2001-05-27 Thread Ingo Molnar


On Sun, 27 May 2001, David S. Miller wrote:

 Hooray, some sanity in this thread finally :-)

[ finally i had some sleep after a really long debugging session :-| ]

   the attached softirq-2.4.5-B0 patch fixes this problem by calling
   do_softirq()  from local_bh_enable() [if the bh count is 0, to avoid
   recursion].

 Yikes!  I do not like this fix.

i think we have no choice, unfortunately.

and i think function calls are not that scary anymore, especially not with
regparms and similar compiler optimizations. The function is simple, the
function just goes in and returns in 90% of the cases, which should be
handled nicely by most BTBs.

we have other fundamental primitives that are a function call too, eg.
dget(), and they are used just as frequently. In 2.4 we were moving
inlined code into functions in a number of cases, and it appeared to work
out well in most cases.

 I'd rather local_bh_enable() not become a more heavy primitive.

 I know, in one respect it makes sense because it parallels how
 hardware interrupts work, but now this thing is a function call
 instead of a counter bump :-(

i believe the important thing is that the function has no serialization or
other 'heavy' stuff. BHs had the misdesign of not being restarted after
being re-enabled, and it caused performance problems - we should not
repeat history.

Ingo




Re: [patch] severe softirq handling performance bug, fix, 2.4.5

2001-05-27 Thread Ingo Molnar


On Sun, 27 May 2001, Andrea Arcangeli wrote:

 Yes the stock kernel.

yep you are right.

i had this fixed too at a certain point, there is one subtle issue: under
certain circumstances tasklets re-activate the tasklet softirq(s) from
within the softirq handler, which leads to infinite loops if we just
naively restart softirq handling. This fix is not in the -B0 patch yet.

Ingo




[patch] ioapic-2.4.5-A1

2001-05-29 Thread Ingo Molnar


the attached ioapic-2.4.5-A1 patch includes a number of important IO-APIC
related fixes (against 2.4.5-ac3):

 - correctly handle bridged devices that are not listed in the mptable
   directly. This fixes eg. dual-port eepro100 devices on Compaq boxes
   with such PCI layout:

   -+-[0d]---0b.0
    +-[05]-+-02.0
    |      \-0b.0
    \-[00]-+-02.0
           +-03.0-[01]--+-04.0  <=== eth0
           |            \-05.0  <=== eth1
           +-0b.0
           +-0c.0
           +-0d.0
           +-0e.0
           +-0f.0
           +-14.0
           +-14.1
           +-19.0
           +-1a.0
           \-1b.0

   without the patch the eepro100 devices get misdetected as XT-PIC IRQs
   and their interrupts are stuck.

 - the srcbus entry in the mptable does not have to be translated into
   a PCI-bus value.

 - add more APIC versions to the whitelist

 - initialize mp_bus_id_to_pci_bus[] correctly, so that we can detect
   nonlisted/bridged PCI busses more accurately.
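
the initialization issue is easy to show in isolation: `{ -1, }` sets only the first element to -1 and leaves the rest at 0, while the GNU C range designator (a gcc extension, also accepted by clang) sets all of them. DEMO_MAX_BUSSES and the demo_* names below are made up for the sketch:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_BUSSES 32   /* hypothetical size, stands in for MAX_MP_BUSSES */

/* Only element 0 is -1; the remaining 31 elements are implicitly 0. */
static int demo_partial[DEMO_MAX_BUSSES] = { -1, };

/* GNU C range designator: every element is -1. */
static int demo_full[DEMO_MAX_BUSSES] = { [0 ... DEMO_MAX_BUSSES - 1] = -1 };

/* Count how many entries are marked as "no PCI bus" (-1). */
static int demo_count_unset(const int *map)
{
    int n = 0;

    assert(map != NULL);
    for (int i = 0; i < DEMO_MAX_BUSSES; i++)
        if (map[i] == -1)
            n++;
    return n;
}
```

with the old initializer every bus id except 0 looked like it mapped to PCI bus 0, so a check against -1 could not detect nonlisted busses.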

the patch should only affect systems that were not working properly
before, but it might break broken-mptable systems - we'll see.

Ingo


--- linux/arch/i386/kernel/io_apic.c.orig   Tue May 29 12:13:15 2001
+++ linux/arch/i386/kernel/io_apic.c Tue May 29 12:19:55 2001
@@ -256,10 +256,16 @@
  */
 static int pin_2_irq(int idx, int apic, int pin);
 
-int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pci_pin)
+int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pin)
 {
int apic, i, best_guess = -1;
 
+   Dprintk("querying PCI -> IRQ mapping bus:%d, slot:%d, pin:%d.\n",
+   bus, slot, pin);
+   if (mp_bus_id_to_pci_bus[bus] == -1) {
+   printk(KERN_WARNING "PCI BIOS passed nonexistent PCI bus %d!\n", bus);
+   return -1;
+   }
for (i = 0; i < mp_irq_entries; i++) {
int lbus = mp_irqs[i].mpc_srcbus;
 
@@ -270,14 +276,14 @@
 
if ((mp_bus_id_to_type[lbus] == MP_BUS_PCI) &&
!mp_irqs[i].mpc_irqtype &&
-   (bus == mp_bus_id_to_pci_bus[mp_irqs[i].mpc_srcbus]) &&
+   (bus == lbus) &&
(slot == ((mp_irqs[i].mpc_srcbusirq >> 2) & 0x1f))) {
int irq = pin_2_irq(i,apic,mp_irqs[i].mpc_dstirq);
 
if (!(apic || IO_APIC_IRQ(irq)))
continue;
 
-   if (pci_pin == (mp_irqs[i].mpc_srcbusirq & 3))
+   if (pin == (mp_irqs[i].mpc_srcbusirq & 3))
return irq;
/*
 * Use the first all-but-pin matching entry as a
@@ -738,9 +744,11 @@
printk(KERN_DEBUG " register #01: %08X\n", *(int *)&reg_01);
printk(KERN_DEBUG "... : max redirection entries: %04X\n", reg_01.entries);
if ((reg_01.entries != 0x0f) && /* older (Neptune) boards */
+   (reg_01.entries != 0x11) &&
(reg_01.entries != 0x17) && /* typical ISA+PCI boards */
(reg_01.entries != 0x1b) && /* Compaq Proliant boards */
(reg_01.entries != 0x1f) && /* dual Xeon boards */
+   (reg_01.entries != 0x20) &&
(reg_01.entries != 0x22) && /* bigger Xeon boards */
(reg_01.entries != 0x2E) &&
(reg_01.entries != 0x3F)
--- linux/arch/i386/kernel/mpparse.c.orig   Tue May 29 12:13:15 2001
+++ linux/arch/i386/kernel/mpparse.c Tue May 29 12:13:46 2001
@@ -36,7 +36,7 @@
  */
 int apic_version [MAX_APICS];
 int mp_bus_id_to_type [MAX_MP_BUSSES];
-int mp_bus_id_to_pci_bus [MAX_MP_BUSSES] = { -1, };
+int mp_bus_id_to_pci_bus [MAX_MP_BUSSES] = { [0 ... MAX_MP_BUSSES-1] = -1 };
 int mp_current_pci_id;
 
 /* I/O APIC entries */
--- linux/arch/i386/kernel/pci-irq.c.orig   Tue May 29 12:13:15 2001
+++ linux/arch/i386/kernel/pci-irq.c Tue May 29 12:13:46 2001
@@ -660,10 +660,12 @@
if (pin) {
pin--;  /* interrupt pins are numbered starting from 1 */
irq = IO_APIC_get_PCI_irq_vector(dev->bus->number, PCI_SLOT(dev->devfn), pin);
-/*
- * Will be removed completely if things work out well with fuzzy parsing
- */
-#if 0
+   /*
+* Busses behind bridges are typically not listed in the MP-table.
+* In this case we have to look up the IRQ based on the parent bus,
+* parent slot, and pin number. The SMP code detects such bridged
+* busses itself so we should get into this branch reliably.
+*/
if (irq < 0 && dev->bus->parent) { /* go back to the bridge */
struct pci_dev * bridge = dev->bus->self;
 
@@ -674,7 +676,6 @@
printk(KERN_WARNING "PCI: using PPB(B%d,I%d,P%d) to get irq %d\n", 
  

[patch] raid-2.4.5-A0, minor fix

2001-05-29 Thread Ingo Molnar


the attached patch (against 2.4.5-ac3) fixes a compiler warning (triggered
by gcc 2.96) in the RAID include files.

Ingo


--- linux/include/linux/raid/md_k.h.orig Tue May 29 12:50:30 2001
+++ linux/include/linux/raid/md_k.h Tue May 29 12:50:40 2001
@@ -38,6 +38,7 @@
case RAID5: return 5;
}
panic("pers_to_level()");
+   return 0;
 }
 
 extern inline int level_to_pers (int level)



[patch] softirq-2.4.5-E5

2001-05-29 Thread Ingo Molnar


the attached softirq-2.4.5-E5 patch (against 2.4.5-ac3) tries to solve all
softirq, tasklet and scheduling latency problems i could identify while
testing TCP latencies over gigabit connections. The list of problems, as
of 2.4.5-ac3:

 - the need_resched check in the arch/i386/kernel/entry.S syscall/irq
   return code has a race that makes it possible to miss a reschedule for
   up to smp_num_cpus*HZ jiffies.

 - the softirq check in entry.S has a race as well.

 - on x86, APIC interrupts do not trigger do_softirq(). This is especially
   problematic with the smptimers patch, which is APIC-irq driven.

 - local_bh_disable() blocks the execution of do_softirq(), and it takes
   a nondeterministic amount of time after local_bh_enable() for the next
   do_softirq() to be triggered.

 - do_softirq() does not execute softirqs that got activated meanwhile,
   and the next do_softirq() run happens after a nondeterministic amount
   of time.

 - the tasklet design re-enables their driving softirq occasionally, which
   makes 'complete' softirq processing impossible.

the patch (tries to) solve all these problems. The changes:

 - all softirqs are guaranteed to be handled after do_softirq()  returns
   (even those which are activated during softirq run)

 - softirq handling is immediately restarted if bhs are re-enabled again.

 - the tasklet code got rewritten (but externally visible semantics are
   kept) to not rely on marking the softirq busy. The new code is a bit
   tricky, but it should be correct.

 - some code got a bit slower, some code got a bit faster. I believe most
   of the changes made the softirq/tasklet implementation clearer.

 - some minor uninlining of too big inline functions, and other cleanup
   was done as well.

 - no global serialization was added to any part of the softirq or tasklet
   code, so scalability is not impacted.
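
the first two changes can be modelled roughly as follows. This is a single-CPU user-space sketch with invented demo_* names, not the code from the patch: do_softirq() keeps restarting until nothing is pending, and re-enabling bhs immediately runs whatever was activated while they were disabled:

```c
#include <assert.h>

static unsigned int demo_pending;     /* bitmask of activated softirqs */
static int demo_bh_count;             /* > 0 while softirqs are blocked */
static int demo_net_rx_runs;          /* how often the handler has run */

static void demo_do_softirq(void);

/* Handler for softirq 0; re-raises itself twice to mimic work arriving mid-run. */
static void demo_net_rx_action(void)
{
    if (++demo_net_rx_runs < 3)
        demo_pending |= 1u << 0;
}

static void demo_local_bh_disable(void)
{
    demo_bh_count++;
}

/* Re-enable softirqs and immediately run whatever piled up meanwhile. */
static void demo_local_bh_enable(void)
{
    assert(demo_bh_count > 0);
    if (--demo_bh_count == 0 && demo_pending)
        demo_do_softirq();
}

static void demo_do_softirq(void)
{
    if (demo_bh_count)                /* blocked or already running: no recursion */
        return;
    demo_bh_count++;
    while (demo_pending) {            /* restart until nothing is pending */
        unsigned int active = demo_pending;

        demo_pending = 0;
        if (active & (1u << 0))
            demo_net_rx_action();
    }
    demo_bh_count--;
}
```

note that a handler which unconditionally re-raises itself would keep this loop spinning, which is why the tasklet code must not rely on re-activating its driving softirq.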

the patch is stable under every workload i tried, handles softirqs and
tasklets with the minimum possible latency, thus it maximizes cache
locality. The patch has no known bugs, and with it the kernel has no
lost-wakeup or lost-softirq problem that i know of. TCP latencies and TCP
throughput are picture-perfect.

Comments?

Ingo


--- linux/kernel/softirq.c.orig Fri Dec 29 23:07:24 2000
+++ linux/kernel/softirq.c  Tue May 29 17:41:14 2001
@@ -52,12 +52,12 @@
int cpu = smp_processor_id();
__u32 active, mask;
 
+   local_irq_disable();
if (in_interrupt())
-   return;
+   goto out;
 
local_bh_disable();
 
-   local_irq_disable();
mask = softirq_mask(cpu);
active = softirq_active(cpu) & mask;
 
@@ -71,7 +71,6 @@
local_irq_enable();
 
h = softirq_vec;
-   mask &= ~active;
 
do {
if (active & 1)
@@ -82,12 +81,13 @@
 
local_irq_disable();
 
-   active = softirq_active(cpu);
-   if ((active &= mask) != 0)
+   active = softirq_active(cpu) & mask;
+   if (active)
goto retry;
}
 
-   local_bh_enable();
+   __local_bh_enable();
+out:
 
/* Leave with locally disabled hard irqs. It is critical to close
 * window for infinite recursion, while we help local bh count,
@@ -121,6 +121,45 @@
 
 struct tasklet_head tasklet_vec[NR_CPUS] __cacheline_aligned;
 
+void tasklet_schedule(struct tasklet_struct *t)
+{
+   unsigned long flags;
+   int cpu;
+
+   cpu = smp_processor_id();
+   local_irq_save(flags);
+   /*
+* If nobody is running it then add it to this CPU's
+* tasklet queue.
+*/
+   if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state) &&
+   tasklet_trylock(t)) {
+   t->next = tasklet_vec[cpu].list;
+   tasklet_vec[cpu].list = t;
+   __cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
+   tasklet_unlock(t);
+   }
+   local_irq_restore(flags);
+}
+
+void tasklet_hi_schedule(struct tasklet_struct *t)
+{
+   unsigned long flags;
+   int cpu;
+
+   cpu = smp_processor_id();
+   local_irq_save(flags);
+
+   if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state) &&
+   tasklet_trylock(t)) {
+   t->next = tasklet_hi_vec[cpu].list;
+   tasklet_hi_vec[cpu].list = t;
+   __cpu_raise_softirq(cpu, HI_SOFTIRQ);
+   tasklet_unlock(t);
+   }
+   local_irq_restore(flags);
+}
+
 static void tasklet_action(struct softirq_action *a)
 {
int cpu = smp_processor_id();
@@ -129,37 +168,37 @@
local_irq_disable();
list = tasklet_vec[cpu].list;
tasklet_vec[cpu].list = NULL;
-   local_irq_enable();
 
-   while (list != NULL) {
+   while (list) {
 

Re: IRQ handling in SMP environment, kernel 2.4.3

2001-05-29 Thread Ingo Molnar


On Tue, 29 May 2001, Hilik Stein wrote:

 I am running a Linux machine with a 1GB Ethernet card which takes a
 huge amount of packets, which results in many HW interrupts. is it
 possible to make sure that only CPU #1 handles all the hardware
 interrupts generated by the NIC ? or even all the hardware interrupts
 in the systems if its too much to ask to filter IRQ based on origin ?
 thanks Hilik Stein

yes this is possible with the 2.4 kernels. Check out
Documentation/IRQ-affinity.txt. You can bind hardware interrupts to any
CPU (or arbitrary group of CPUs).
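
the affinity setting is a hex CPU bitmask written to /proc/irq/<N>/smp_affinity, where <N> is the IRQ number of the device. A small sketch that builds the mask string for a single CPU (demo_affinity_mask() is a made-up helper, and the 8-digit formatting is just one accepted way to write the value):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the hex CPU bitmask that binds an IRQ to one CPU.
 * Returns 0 on success, -1 if the CPU number does not fit into a
 * 32-bit mask or the buffer is too small.
 */
static int demo_affinity_mask(unsigned int cpu, char *buf, size_t len)
{
    int n;

    assert(buf != NULL);
    if (cpu >= 32)
        return -1;
    n = snprintf(buf, len, "%08x", 1u << cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

for CPU #1 this gives 00000002, which would then be written to the smp_affinity file of the NIC's IRQ.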

Ingo




Re: Emulate RDTSC

2001-05-29 Thread Ingo Molnar


On Tue, 29 May 2001, Jaswinder Singh wrote:

 What is a nice way (in accuracy and performance) to emulate RDTSC in
 Linux for those architectures that don't support RDTSC, like Hitachi
 SH processors?

if the hardware provides no way to get some accurate estimation of current
time, then there is no way to solve this problem in a generic way.
Typically there are some cycle-accuracy counters in the CPU (ideal
situation), or sometimes there is a counter in some external device (eg.
the i8254 timer counter), but access to these tend to be slow and
typically they are quite coarse as well.

Ingo




Re: [ PATCH ]: disable pcspeaker kernel: 2.4.2 - 2.4.5

2001-05-30 Thread Ingo Molnar


By making this (logical, and needed) feature unconditional, your patch's
size and complexity are reduced by 80%. (see the attached
pc_speaker.patch2)

Ingo



diff -u --recursive linux-2.4.5/drivers/char/vt.c linux-2.4.5-nc/drivers/char/vt.c
--- linux-2.4.5/drivers/char/vt.c   Fri Feb  9 20:30:22 2001
+++ linux-2.4.5-nc/drivers/char/vt.c Wed May  9 23:47:36 2001
@@ -40,6 +41,7 @@
 #include <asm/vc_ioctl.h>
 #endif /* CONFIG_FB_COMPAT_XPMAC */
 
+extern int pcspeaker_enabled;
 char vt_dont_switch;
 extern struct tty_driver console_driver;
 
@@ -112,6 +117,9 @@
unsigned int count = 0;
unsigned long flags;
 
+   /* is the pcspeaker enabled or disabled ? 0=disabled,1=enabled */
+   if (!pcspeaker_enabled)
+   return;
if (hz > 20 && hz < 32767)
count = 1193180 / hz;

diff -u --recursive linux-2.4.5/include/linux/sysctl.h linux-2.4.5-nc/include/linux/sysctl.h
--- linux-2.4.5/include/linux/sysctl.h	Tue May 29 17:56:29 2001
+++ linux-2.4.5-nc/include/linux/sysctl.h	Mon May 28 19:24:08 2001
@@ -118,7 +118,8 @@
 	KERN_SHMPATH=48,	/* string: path to shm fs */
 	KERN_HOTPLUG=49,	/* string: path to hotplug policy agent */
 	KERN_IEEE_EMULATION_WARNINGS=50, /* int: unimplemented ieee instructions */
-	KERN_S390_USER_DEBUG_LOGGING=51  /* int: dumps of user faults */
+	KERN_S390_USER_DEBUG_LOGGING=51,  /* int: dumps of user faults */
+	KERN_DISABLE_PC_SPEAKER=52 /* int: speaker on or off */
 };
 
 
diff -u --recursive linux-2.4.5/kernel/sysctl.c linux-2.4.5-nc/kernel/sysctl.c
--- linux-2.4.5/kernel/sysctl.c Tue May 29 17:55:59 2001
+++ linux-2.4.5-nc/kernel/sysctl.c  Wed May  9 23:44:30 2001
@@ -48,6 +49,7 @@
 extern int nr_queued_signals, max_queued_signals;
 extern int sysrq_enabled;
 
+int pcspeaker_enabled;
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
@@ -212,6 +217,8 @@
 	 0444, NULL, &proc_dointvec},
 	{KERN_RTSIGMAX, "rtsig-max", &max_queued_signals, sizeof(int),
 	 0644, NULL, &proc_dointvec},
+	{KERN_DISABLE_PC_SPEAKER, "pcspeaker", &pcspeaker_enabled, sizeof(int),
+	 0644, NULL, &proc_dointvec},
 #ifdef CONFIG_SYSVIPC
 	{KERN_SHMMAX, "shmmax", &shm_ctlmax, sizeof (size_t),
 	 0644, NULL, &proc_doulongvec_minmax},




Re: [ PATCH ]: disable pcspeaker kernel: 2.4.2 - 2.4.5

2001-05-30 Thread Ingo Molnar


 less code / one int more in the kernel
 or
 more code and #ifs / one int less in the kernel

if the #ifdefs bloat the code 4 times the size of the simple patch, then
we obviously want 4 bytes more in the kernel.

 And what about the code from kernel/sys.c ? The version you provided
 doesn't take care of what's the default value of pcspeaker. This would
 make it undefined, which is not really good.

the default value is 0, that is good enough.

Ingo




Re: pte_page

2001-05-30 Thread Ingo Molnar


On Wed, 30 May 2001 [EMAIL PROTECTED] wrote:

 I use the 'pgd_offset', 'pmd_offset', 'pte_offset' and 'pte_page'
 inside a module to get the physical address of a user space virtual
 address. The physical address returned by 'pte_page' is not page
 aligned whereas the virtual address was page aligned. Can somebody
 tell me the reason?

__pa(page_address(pte_page(pte))) is the address you want. [or
pte_val(*pte) & ~(PAGE_SIZE-1) on x86 but this is platform-dependent.]

 Also, can i use these functions to get the physical address of a
 kernel virtual address using init_mm?

nope. Eg. on x86 these functions only walk normal 4K page pagetables, they
do not walk 4MB pages correctly. (which are set up on pentiums and better
CPUs, unless mem=nopentium is specified.)

a kernel virtual address can be decoded by simply doing __pa(kaddr). If
the page is a highmem page [and you have the struct page pointer] then you
can do [(page - mem_map) << PAGE_SHIFT] to get the physical address, but
only on systems where mem_map[] starts at physical address 0.
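
The two calculations above can be modelled in userspace as a rough sketch. Everything here is an assumption made for illustration: the DEMO_ constants mirror 4K pages on non-PAE x86, struct page is an empty stand-in for the kernel structure, and the function names are hypothetical. The first helper is only valid when mem_map[0] describes physical address 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12                        /* 4K pages, non-PAE x86 */
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

struct page { int dummy; };   /* stand-in for the kernel's struct page */

/*
 * Physical address of the frame described by `page`: the index of the
 * entry within mem_map[] is the page frame number.
 */
static uint64_t page_to_phys_sketch(const struct page *mem_map,
                                    const struct page *page)
{
    assert(page >= mem_map);
    return (uint64_t)(page - mem_map) << DEMO_PAGE_SHIFT;
}

/* Frame address held in a PTE value: drop the low flag bits. */
static unsigned long pte_frame_sketch(unsigned long pte_val)
{
    return pte_val & DEMO_PAGE_MASK;
}
```

Masking with the page mask is what makes the result page aligned; keeping the low bits instead would return the PTE flag bits.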

Ingo




Re: [ PATCH ]: disable pcspeaker kernel: 2.4.2 - 2.4.5

2001-05-30 Thread Ingo Molnar


On Wed, 30 May 2001, Nico Schottelius wrote:

  the default value is 0, that is good enough.

 hmm.. I don't think so... value of 1 would be much better, because
 0 normally disables the speaker.

i confused the value. Yes, an initialization to 1 would be correct,
ie.:

+++ linux-2.4.5-nc/kernel/sysctl.c  Wed May  9 23:44:30 2001
@@ -48,6 +49,7 @@
 extern int nr_queued_signals, max_queued_signals;
 extern int sysrq_enabled;

+int pcspeaker_enabled = 1;
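
As a userspace model of what the flag does, here is a small sketch. It assumes the range check in the vt.c hunk reads "hz above 20 and below 32767"; the function name speaker_count is hypothetical, and returning 0 stands in for "leave the speaker silent".

```c
#include <assert.h>

#define PIT_TICK_RATE 1193180      /* i8254 input clock in Hz, as in the patch */

static int pcspeaker_enabled = 1;  /* 1 = enabled (the default), 0 = disabled */

/*
 * Return the i8254 divisor for a tone of `hz`, or 0 if no tone should
 * be produced (speaker disabled or frequency out of range).
 */
static unsigned int speaker_count(unsigned int hz)
{
    if (!pcspeaker_enabled)
        return 0;
    if (hz > 20 && hz < 32767)
        return PIT_TICK_RATE / hz;
    return 0;
}
```

With the default of 1 the behaviour is unchanged until someone writes 0 to the sysctl; with a default of 0 every beep would be dropped after boot.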

Ingo




Re: pte_page

2001-05-30 Thread Ingo Molnar


On Wed, 30 May 2001, Pete Wyckoff wrote:

  __pa(page_address(pte_page(pte))) is the address you want. [or
  pte_val(*pte) & ~(PAGE_SIZE-1) on x86 but this is platform-dependent.]

 Does this work on x86 non-kmapped highmem user pages too?  (i.e. is
 page->virtual valid for every potential user page.)

you are right, the highmem-compatible solution is to use page - mem_map as
the physical page index.

Ingo




Re: [patch] severe softirq handling performance bug, fix, 2.4.5

2001-05-27 Thread Ingo Molnar


On Sun, 27 May 2001, Andrea Arcangeli wrote:

 I mean everything is fine until the same softirq is marked active
 again under do_softirq, in such case neither the do_softirq in do_IRQ
 will run it (because we are in the critical section and we hold the
 per-cpu locks), nor we will run it again ourself from the underlying
 do_softirq to avoid live locking into do_softirq.

if you mean the stock kernel, this scenario you describe is not how it
behaves, because only IRQ contexts can mark a softirq active again. And
those IRQ contexts will run do_IRQ() naturally, so while *this*
do_softirq() invocation wont run those reactivated softirqs, the IRQ
context that just triggered the softirq will do so.

the real source of softirq latencies is the local_bh_disable()/enable()
behavior, see my previous patch.

Ingo





Re: [CHECKER] 4 security holes in 2.4.4-ac8

2001-05-29 Thread Ingo Molnar


On Tue, 29 May 2001, Dawson Engler wrote:

  Believe it or not, this one is OK :-)
 
  All callers pass in a pointer to a local stack kernel variable
  in raddr.

 Ah.  I assumed that sys_* meant that all pointers were from user
 space --- is this generally not the case?  (Also, are there other
 functions called directly from user space that don't have the sys_*
 prefix?)

to automate this for the Stanford checker i've attached the
'getuserfunctions' script that correctly extracts these function names
from the 2.4.5 x86 entry.S file.

unfortunately the validation of the script will always be manual work,
although for the lifetime of the 2.4 kernel the actual format of the
entry.S file is not going to change. To make this automatic, i've added a
md5sum to the script itself, if entry.S changes then someone has to review
the changes manually. It's important to watch the md5 checksum, because
new system-calls can be added in 2.4 as well.

a few interesting facts. Functions that are called from entry.S but do not
have the sys_ prefix:

 do_nmi
 do_signal
 do_softirq
 old_mmap
 old_readdir
 old_select
 save_v86_state
 schedule
 schedule_tail
 syscall_trace
 do_divide_error
 do_coprocessor_error
 do_simd_coprocessor_error
 do_debug
 do_int3
 do_overflow
 do_bounds
 do_invalid_op
 do_coprocessor_segment_overrun
 do_double_fault
 do_invalid_TSS
 do_segment_not_present
 do_stack_segment
 do_general_protection
 do_alignment_check
 do_page_fault
 do_machine_check
 do_spurious_interrupt_bug

functions in the kernel source that have the sys_ prefix and use
asmlinkage but are not called from the x86 entry.S file:

 sys_accept
 sys_bind
 sys_connect
 sys_gethostname
 sys_getpeername
 sys_getsockname
 sys_getsockopt
 sys_listen
 sys_msgctl
 sys_msgget
 sys_msgrcv
 sys_msgsnd
 sys_recv
 sys_recvfrom
 sys_recvmsg
 sys_semctl
 sys_semget
 sys_semop
 sys_send
 sys_sendmsg
 sys_sendto
 sys_setsockopt
 sys_shmat
 sys_shmctl
 sys_shmdt
 sys_shmget
 sys_shutdown
 sys_socket
 sys_socketpair
 sys_utimes

the list is pretty big. There are 33 functions that are called from
entry.S but do not have the sys_ prefix or do not have the asmlinkage
declaration.

NOTE: there are other entry points into the kernel's 'protection domain'
as well, and not all of them are through function interfaces. Some of
these interfaces pass untrusted pointers and/or untrusted parameters
directly, but most of them pass a pointer to a CPU registers structure
which is stored on the kernel stack (thus the pointer can be trusted), but
the contents of the registers structure are untrusted and must not be used
unchecked.

1) IRQ handling, trap handling, exception handling entry points. I've
attached the 'getentrypoints' script that extracts these addresses from
the i386 tree:

 divide_error
 debug
 int3
 overflow
 bounds
 invalid_op
 device_not_available
 double_fault
 coprocessor_segment_overrun
 invalid_TSS
 segment_not_present
 stack_segment
 general_protection
 spurious_interrupt_bug
 coprocessor_error
 alignment_check
 machine_check
 simd_coprocessor_error
 system_call
 lcall7
 lcall27

all of these functions get parameters passed that are untrusted.

2) bootup parameter passing.

there is a function entry point, start_kernel, but there is also lots of
implicit parameter passing, values filled out by the boot code, and
parameters stored in hardware devices (eg. PCI settings and more). These
all are theoretical protection domain entry points, but impossible to
check automatically - the validity of current system state will have to be
checked manually. (and in most cases it can be trusted - but not all
cases.) Some 'unexpected' boot-time entry points: initialize_secondary on
SMP systems for example.

3) manually constructed unsafe entry points which are hard to automate.
include/asm-i386/hw_irq.h's BUILD macros are used in a number of places.
One type of IRQ building uses do_IRQ() as an entry point. The SMP
code builds the following entry points:

 reschedule_interrupt
 invalidate_interrupt
 call_function_interrupt
 apic_timer_interrupt
 error_interrupt
 spurious_interrupt

most of these pass no parameters, but apic_timer_interrupt does get
untrusted parameters.

4) BIOS exit/entry points, eg in the APM code. Impossible to check, we
have to trust the BIOS's code.


i think this mail should be a more or less complete description of all
entry points into the kernel. (Let me know if i missed any of them, or any
of the scripts misidentifies entry points.)

Ingo



grep -E 'set_trap_gate|set_system_gate|set_call_gate' arch/i386/*/*.c arch/i386/*/*.h \
  | grep -v 'static void' | cut -d, -f2- | sed 's/&//g' | cut -d\) -f1




if [ "`md5sum arch/i386/kernel/entry.S`" != "0e19b0892f4bd25015f5f1bfe90b441a  arch/i386/kernel/entry.S" ]; then
  echo "entry.S file's MD5sum changed! Please revalidate the changes and change the md5sum in this script."; exit -1; fi

(grep 'call S' arch/i386/kernel/entry.S  | grep '[()]'; grep '\.long SYMBOL_NAME' 

Re: Reserving a (large) memory block

2000-08-31 Thread Ingo Molnar


On Thu, 31 Aug 2000, Alan Cox wrote:

 We then just follow the bios. You can also reserve blocks of memory by
 hacking arch/i386/mm/init.c and marking them reserved

in 2.4 there is an explicit interface for this that also guarantees that
the allocation consists of fully valid RAM (no matter how complex the RAM
map): alloc_bootmem(). We allocate 300MB+ worth of mem_map[] with this on
multi-gigabyte boxes.

Ingo




[Announce] TUX alpha source code release

2000-09-01 Thread Ingo Molnar


We are pleased to announce that the TUX kernel-space HTTP-subsystem is
available for download at:

ftp://ftp.redhat.com/pub/redhat/tux/tux-hawaii/

WARNING: this is a developer-only, alpha release. The 1.0 'consumer'
release will happen by the end of September. This release is useless to
you unless you are a kernel developer, and even in that case it might eat
your data, start World War III, or drink your coffee. As of now, it is
possible to cause TUX to BUG() via a simple browser URL - sanity checks
are not handled too nicely yet. You have been warned. Further TUX
development will be coordinated through the [EMAIL PROTECTED] mailing
list.  (See the attached README file for how to subscribe.)

Many thanks to Michael K. Johnson and Matt Wilson for making the August
release possible :-)

Ingo

the README file:
---
TUX: The Hawaii Release

This is a developer-only preview of TUX.  It is incomplete; some features
are stubbed out or incompletely implemented, and packaging is incomplete.
It is provided in source-code-only form at this time.  It is not yet
intended for enterprise use.  It has not undergone a security audit.
It is not guaranteed to run SPECweb99 correctly, because there has been
further development since the accepted SPECweb99 runs were done, and
that development may well have diverged from SPECweb99 and web standards
compliance, and has probably changed performance.  It is here so that
developers besides those at Red Hat can participate in the process of
finalizing the TUX APIs, configuration options, and documentation.

  ***   If it breaks, you get to keep both pieces.
  ***   You have been warned!

The development work will happen on the mailing list [EMAIL PROTECTED]
You can join the list by sending a mail with a subject of "Subscribe"
to [EMAIL PROTECTED]

The packages are included only in source form.  The tux package contains
some basic documentation in Docbook format; if you build the tux
package it will build html documentation from the Docbook.  The kernel
package, when built, builds the tux package, currently only for the i686
enterprise kernel binary package.  Because TUX's memory model clashes
with what is needed for the enterprise kernel, that will change; it's
just a placeholder for now.  The packages have been tested more on the
Pinstripe beta than on Red Hat Linux 6.2 at this time.  They will receive
more testing on Red Hat Linux 6.2 in the future.

We plan to make a full release by the end of September.  That release
will be validated against SPECweb99, will have updated APIs, better
configuration, richer documentation, and will be made more readily
available, with a defect tracking program to support it.

For more information, join the mailing list.




Re: thread rant

2000-09-02 Thread Ingo Molnar


On Sat, 2 Sep 2000, Alexander Viro wrote:

 Why? I would say that bad thing about SysV shared memory is that it's
 _not_ sufficiently filesystem-thing - a special API where 'create a
 file on ramfs and bloody mmap() it' would be sufficient. Why bother
 with special sets of syscalls?

what i mean is that i dont like the cleanup issues associated with SysV
shared memory - eg. it can hang around even if all users have exited, so
auto-cleanup of resources is not possible.
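
To illustrate the cleanup problem, here is a small sketch using the standard SysV calls shmget(), shmat(), shmctl() and shmdt(). The function name shm_scratch_demo is hypothetical. It relies on the Linux behaviour that a segment marked with IPC_RMID stays usable by processes that already have it attached and is freed at the last detach; a process that dies before reaching the IPC_RMID call still leaves the segment behind.

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Create a private segment, attach it, and immediately mark it for
 * removal so the kernel frees it once the last attachment goes away.
 * Returns 0 on success, -1 on failure.
 */
static int shm_scratch_demo(void)
{
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    char *p = shmat(id, NULL, 0);
    if (p == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }

    /* Without this call the segment outlives every process that used it. */
    if (shmctl(id, IPC_RMID, NULL) != 0) {
        shmdt(p);
        return -1;
    }

    /* The mapping is still usable while it is attached. */
    strcpy(p, "still usable while attached");
    int ok = strcmp(p, "still usable while attached") == 0;

    return (shmdt(p) == 0 && ok) ? 0 : -1;
}
```

The explicit IPC_RMID step is the manual cleanup that a file mapped with mmap() does not need in the same way.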

Ingo




Re: zero-copy TCP

2000-09-03 Thread Ingo Molnar


On 2 Sep 2000, Jes Sorensen wrote:

 You can't DMA directly from a file cache page unless you have a
 network card that does scatter/gather DMA and surprise surprise,
 80-90% of the cards on the market don't support this. [...]

exactly. The TUX patch solves this by copying 'multi-fragment skbs' into a
temporary single-fragment skb, if the card doesnt support scatter-gather,
64-bit DMA. This way the copying is delayed as much as possible, to the
point where we queue the packet to the network device.
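
The fallback copy can be pictured with a small userspace sketch. The struct frag descriptor and the linearize() helper are hypothetical and are not the TUX skb layout; they only show the idea of gathering several fragments into one contiguous buffer at the last possible moment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fragment descriptor; not the TUX skb_frag layout. */
struct frag {
    const void *data;
    size_t len;
};

/*
 * Copy `nfrags` fragments into one freshly allocated contiguous buffer.
 * Returns the buffer (caller frees) and stores the total length in
 * *out_len; returns NULL if the allocation fails.
 */
static void *linearize(const struct frag *frags, size_t nfrags, size_t *out_len)
{
    size_t total = 0;

    assert(frags != NULL || nfrags == 0);
    for (size_t i = 0; i < nfrags; i++)
        total += frags[i].len;

    unsigned char *buf = malloc(total ? total : 1);
    if (!buf)
        return NULL;

    size_t off = 0;
    for (size_t i = 0; i < nfrags; i++) {
        if (frags[i].len == 0)
            continue;               /* nothing to copy for empty fragments */
        memcpy(buf + off, frags[i].data, frags[i].len);
        off += frags[i].len;
    }

    *out_len = total;
    return buf;
}
```

A driver for a card with scatter-gather would skip this step and hand the fragment list to the hardware directly.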

 Besides that you need to do copy-on-write if you want to be able to do
 zero copy on write() from user space [...]

i agree that this is hard - i'm not sure whether we want to go through the
pain of making anonymous-buffer write()s zero-copy. My plan is to enable
sendfile() first - it should cover all the important high-performance
server cases. The point is that a write() is only used if some sort of
dynamic data is generated on the fly. If data is generated once and sent
once then there is not much win in zero-copy. If data is generated once and
reused multiple times afterwards then it should rather be written into a
temporary file and then it can be sent out via sendfile().

Ingo




Re: zero-copy TCP

2000-09-03 Thread Ingo Molnar


On Sat, 2 Sep 2000, Jeff V. Merkey wrote:

 **ALL** NetWare network drivers support a scatter/gather programming
 interface, whether the hardware does or not. In NetWare, the drivers
 get passed a fragment list in what's called an ECB (Event Control
 Block). It's the driver's responsibility to assemble the fragment lists.
 We did it this way to support scatter/gather cards and non-scatter
 gather cards in one interface. Those drivers that do not support
 scatter gather DMA operations copy to a local buffer to assemble the
 packet. [...]

this is exactly what TUX implements and what i mentioned in my first email
that started this thread. Was a complete overhaul of the TCP/IP stack
needed like you claim? Not at all - the Linux TCP/IP code was so generic,
that to my surprise i saw the first zero-copy TCP transfer after only one
day of hacking.

Ingo




Re: zero-copy TCP

2000-09-03 Thread Ingo Molnar


On Sun, 3 Sep 2000 [EMAIL PROTECTED] wrote:

 If we go for a Linux-specific solution anyway, maybe one could add
 another send{,to,msg} flag that makes send*(2)'s buffer access
 non-atomic. That way, the kernel only needs to make sure the pages
 don't disappear, but there's no need for expensive MMU games.
 
 Of course, this would give applications a way for generating packets
 with an incorrect TCP/UDP checksum, [...]

i believe such zero-copy send should only be allowed for drivers which can
guarantee correct checksums. (ie. cards which do Tx-checksums) The other
drivers will still copy. I dont think this is a problem - the number of
cards that can do scatter-gather DMA but cannot do TX-checksumming is
rather low. (i only know about the Tulip.) All modern cards do
TX-checksumming and scatter-gather DMA.

Ingo




Re: zero-copy TCP

2000-09-03 Thread Ingo Molnar


On Sun, 3 Sep 2000, Andi Kleen wrote:

 I did the same for fragment RX some months ago (simple fragment lists
 that were copy-checksummed to user space). Overall it is probably
 better to use a kiovec, because that can be more easily used in nfsd
 and sendfile.

the basic fragment type introduced by the TUX changes is a 'struct
skb_frag', which has csum, size, *page, page_offset, frag_done, *data and
*private fields - this is more than normal kiovecs offer. But i think
kiovecs can be extended to do all this (if Stephen & everybody else
agrees), i just didnt want to touch it for the time being.

Ingo





Re: zero-copy TCP

2000-09-03 Thread Ingo Molnar


On Sun, 3 Sep 2000, Andi Kleen wrote:

 You can already cause incorrect checksums on the wire just by passing
 a partly unmapped address (the zero-the-rest exception handler in
 csum_copy_generic in i386 forgets to add in the carry)
 
 I do not believe it is a big deal, packets with bad checksum are not
 really a problem (you can usually do other better DoS that do not need
 it)

i think it's a quality of implementation issue. The csum_copy_generic
thing is a bug. Allowing incorrect checksums to be sent out would be a
design bug. I think some RFCs do even forbid the sending of incorrect
packets?
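
For readers unfamiliar with the carry issue, here is a plain C sketch of the Internet checksum in the style of RFC 1071. It is the generic algorithm, not the i386 csum_copy_generic assembly discussed above; the function name inet_checksum is an illustrative choice. The folding loop is the "add in the carry" step.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Internet checksum over `len` bytes, treating the data as big-endian
 * 16-bit words and padding an odd trailing byte with zero.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Dropping the fold leaves the carry bits out of the result, which is how a packet with a wrong checksum ends up on the wire.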

Ingo




Re: zero-copy TCP

2000-09-05 Thread Ingo Molnar


On Tue, 5 Sep 2000, Jeff V. Merkey wrote:

   while (x)
   {
      x = x->next
   }
  
   all over the place that increases latency. [...]
  
  i challenge you to show one such place in the 2.4.0-test8-pre2 kernel. If
  it's all over the place and if it increases latency, you certainly can
  show at least one such place.
 
 When I have time to do this exercise, I will. [...]

well, your original claim (quoted above) shows that you have identified
numerous such places already, so you dont have to do any additional
'exercise'. The "all over the place" code shouldnt be too hard to find
again - please just say filename and line number in any kernel version of
your choice and we'll look into it.

Ingo




Re: zero-copy TCP

2000-09-05 Thread Ingo Molnar


On Tue, 5 Sep 2000, Jeff V. Merkey wrote:

 Alright Ingo, you asked for it. I am going through it now and going
 over ALL my notes. I will catalog ALL of them and post it. Is this
 what you really want?

yes, this would be the best indeed, to get those places fixed. But if you
dont want to spend your time on that then it's enough to just post a
single incident of such inefficiency and list-walking that impacts latency
like you claim.

Ingo




Re: zero-copy TCP

2000-09-05 Thread Ingo Molnar


On Tue, 5 Sep 2000, Jeff V. Merkey wrote:

 The origin of this comment was related to a comparison of the
 MSM/TSM/CSM layer in NetWare and Linux. I've already said that Alan's
 code handles fast paths well and from what I've seen is comparable to
 NetWare. [...]

can we thus take this as a retraction of your below quoted three
derogatory comments?

" The entire Linux Network subsystem needs an overhaul. "

" In networking, the enemy is LATENCY for fast performance.  That's why
  NetWare can handle 5000 users and Linux barfs on 100 in similar tests.
  Copying increases latency, and the long code paths in the Linux Network
  layer. "


" Alan, Please.  I'm in your code and there are copies all over the
  place.  I agree you have a "fast path" for most stuff, but there's all
  kinds of handles lookups, linear list searching like

  while (x)
  {
     x = x->next
  }
 
  all over the place that increases latency. "

Ingo




Re: zero-copy TCP

2000-09-05 Thread Ingo Molnar


On Wed, 6 Sep 2000, Chris Wedgwood wrote:

 [...] The point is that a write() is only used if some sort of
 dynamic data is generated on the fly.
 
 There are existing applications out there that use mmap+write
 (caching the maps), it would be nice for the authors of these not to
 have to _require_ non-portable sendfile semantics for the best
 performance.

this is not just an interface question, mmap()+write() is conceptually
inferior to a sendfile(). [if the goal is to send the same data multiple
times.]

Ingo




Re: Clearing of Ram?

2000-09-06 Thread Ingo Molnar


On Wed, 6 Sep 2000, Frank Peters wrote:

 My question is who cleared it the kernel or the malloc function in
 glibc?? (i found some code in glibc but nothing in kernel) thx

it's the second clear_user_highpage() in mm/memory.c that does the page
clearing in the typical malloc()-ed memory case. It's only allocated and
cleared once you or glibc accesses it, with page granularity.
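
The effect is visible from userspace. The sketch below assumes a Linux or BSD style mmap() with MAP_ANONYMOUS; the helper name fresh_pages_are_zero is hypothetical. It checks that freshly mapped anonymous pages read as zero when first touched. Note that malloc() itself makes no such promise, since it may hand back memory that the process has already used and freed.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of anonymous memory and report whether every byte
 * reads as zero. Returns 1 if all zero, 0 if not, -1 if mmap() fails.
 */
static int fresh_pages_are_zero(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }

    munmap(p, len);
    return all_zero;
}
```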

Ingo




Re: [ANNOUNCE] Withdrawl of Open Source NDS Project/NTFS/M2FS for Linux

2000-09-06 Thread Ingo Molnar


On Wed, 6 Sep 2000, J. Dow wrote:

 If the Kernel Debugger creates faulty solutions through lack of
 thinking, and asking why, then surely printk is at least as bad
 because it allows somebody to view the operation of the kernel through
 a keyhole darkly. [...]

i'd like to quote David here, because i cannot put it any simpler:

 " It is hoped that because it isn't the default, some new people
   will take the quantum leap to actually try debugging using the
   best debugger any of us have, our brains, instead of relying on
   automated tools. "

my claim (which others share) is that we need more people who can debug
the really tough problems (for which there are no tools in any OS) with
their brains, and also we need people who will produce code with fewer bugs
in the future.

There is also the important question of 'bug prevention'. The kernel isnt
some magical soup which must be debugged only, code is *added* and
debugged. If people who write code use more code reviews to fix bugs, then
as a side-effect they'll sooner or later write code that is less prone to
bugs. This is because they identify the bug-risks based on the code
pattern - if you use a debugger mainly then you dont really see the code
pattern but the current state of the system, which you validate. So the
difference is this:

 - compare code, algorithm and concept with the original intention;
   analyze the symptoms and find the bug

 - compare the system state discovered through the debugger with the
   intended state of the system. Potentially step through the code before
   and after the faulty behavior, try to identify the 'point of bug' and
   constantly compare actual system state with intended system state.
   (it's certainly more complex than this, but you get the point.) This is
   why tools/features visualizing system state are so popular.

i claim that the second behavior is 'passive', 'disconnected' and has no
connection to the code itself, and thus tends to lead to inferior code. It
leads to the frequent behavior of 'patching the state', not modifying the
code itself. Eg. 'ok, we have a NULL here, lets return then so it wont
crash later in the function.'

The first behavior IMO produces a more 'integrated' coding style, where
designing, writing and debugging code is closely interwoven, and naturally
leads to higher quality code. Eg. 'we must never get a NULL here, who
called this function and why??'.

Ingo




Re: [ANNOUNCE] Withdrawl of Open Source NDS Project/NTFS/M2FS for Linux

2000-09-05 Thread Ingo Molnar


On Tue, 5 Sep 2000, Alan Cox wrote:

 I spend my time thinking. But I prefer to spend it thinking about the
 bug not about finding it and how long fsck takes. [...]

if we only optimize for the debugging time spent by seasoned kernel
developers then you are completely right. But if we optimize for new
kernel developers learning the right methodology, and if we optimize for
the *development* process (not the *release* process) of the kernel then
reducing the amount of debugging functionality is the right choice.

 Things like GUI source level kernel debugging, nice graphs of things
 like cache line reloads between two points and run time spinlock
 deadlock validation and lock tracking (the last one is on my todo list
 only right now) are rather useful

IMO there was only one historically hard spinlock-related problem that
needed solving, this is the 'locks up hard' problem (which is solved). The
rest was never really a debugging obstacle, 99% of the spinlock related
bugs manifest themselves in clear, unambiguous lockups.

there is another type of bug that is tough to find without an automatic
tool - memory leaks.

I dont think there is any other systematic bug (besides hard lockups and
memory leaks) that occurs often and can only be effectively found via
debugging tools.

Ingo




Re: [ANNOUNCE] Withdrawl of Open Source NDS Project/NTFS/M2FS for Linux

2000-09-05 Thread Ingo Molnar


On Tue, 5 Sep 2000, Jeff V. Merkey wrote:

 Your arguments are personal, not technical. [...]

no, my arguments are technical, but are simply focused towards the
conceptual (horizontal) development of Linux, not the vertical
development of Linux (drivers) and support issues.

Ingo




Re: [ANNOUNCE] Withdrawl of Open Source NDS Project/NTFS/M2FS for Linux

2000-09-05 Thread Ingo Molnar


On Tue, 5 Sep 2000, Jeff V. Merkey wrote:

 A kernel debugger will reduce development costs. [...]

... of Jeff V. Merkey - possibly. You are too much focused on your own
needs, you dont contribute a bit to the generic kernel and the kernel
infrastructure itself.

Ingo




Re: Terrible elevator performance in kernel 2.4.0-test8

2000-09-14 Thread Ingo Molnar


i'm seeing similar problems. I think these problems started when the
elevator was rewritten, i believe it broke the proper unplugging of IO
devices. Does your performance problem get fixed by the attached
workaround?

Ingo

On Thu, 14 Sep 2000, Robert Cohen wrote:

 For a while, Ive been seeing a performance problem with 2.4.0-test
 kernels.
 The benchmark I am using is an netatalk performance benchmark.
 But I think this is a general performance problem, not appletalk
 related.
 The benchmark has a varying number of clients reading and writing 30 Meg
 files.
 The symptom I see is that with more than 2 or 3 clients, I see a sudden
 and gigantic reduction in write performance. At the same time I can hear
 the disk seeking wildly. And the throughput reported by "vmstat 5" drops
 from 2000-3000 to 100-200.
 
 What I believe is happening is that the elevator isn't merging the
 requests properly.
 I think that this may be the same problem reported here
 http://www.uwsg.indiana.edu/hypermail/linux/kernel/0008.2/0389.html
 
 When stracing the netatalk servers, I can see that they are reading from
 the network then doing an 8k write and repeating.
 If I try to simulate the problem by running multiple iozones doing 8k
 writes, I dont see the same kind of problems. 
 However, in a non networked benchmark like iozone, each process is doing
 many writes in its timeslice. And these writes coalesce naturally.
 In the networked benchmark, the read from the network is introducing
 enough delay that we get a context switch and the writes to different
 files become interleaved.
 This is precisely the sort of situation that the elevator is supposed to
 help with.
 
 With kernel version 2.4.0-test1-ac22, I saw adequate performance.
 In this version, the default elevator settings had a max_bombs value of
 32.
 
 In 2.4.0-test3 - test6, the default max_bombs value became 0. And the
 performance with this setting was terrible.
 If I increase max_bombs with elvtune, the performance markedly improves.
 Although I still saw a tendency for a client to get write starved.
 
 In 2.4.0-test, the max_bombs value has been eliminated so I can't change
 it. I was hoping that that meant that the algorithm had been improved.
 Unfortunately, the benchmarks don't show any improvement.
 
 --
 Robert Cohen
 Unix Support, TLTSU
 Australian National University
 


--- linux/kernel/sched.c.orig   Sun Sep  3 10:03:35 2000
+++ linux/kernel/sched.c        Mon Sep  4 09:23:07 2000
@@ -508,6 +508,7 @@
 	if (tq_scheduler)
 		goto handle_tq_scheduler;
 tq_scheduler_back:
+	run_task_queue(&tq_disk);
 
 	prev = current;
 	this_cpu = prev->processor;
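
A toy model of the behaviour the workaround targets, as a sketch only. It assumes that queued requests sit in a plugged queue until something runs the disk task queue, and that the one-line change makes every pass through schedule() do so. toy_queue, toy_submit(), toy_unplug() and toy_schedule() are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names are kernel APIs. */
#define TOY_MAX_PENDING 64

struct toy_queue {
	int plugged;         /* 1 while requests are being held back */
	size_t npending;     /* requests queued but not yet dispatched */
	size_t ndispatched;  /* requests handed to the "device" */
};

/* Queue one request; a plugged queue only accumulates it. */
static void toy_submit(struct toy_queue *q)
{
	assert(q->npending < TOY_MAX_PENDING);
	q->plugged = 1;
	q->npending++;
}

/* Unplug: dispatch everything that accumulated so far. */
static void toy_unplug(struct toy_queue *q)
{
	q->ndispatched += q->npending;
	q->npending = 0;
	q->plugged = 0;
}

/*
 * Model of a context switch. unplug_on_schedule stands in for the
 * added run_task_queue() call; without it the requests stay queued.
 */
static void toy_schedule(struct toy_queue *q, int unplug_on_schedule)
{
	if (unplug_on_schedule && q->plugged)
		toy_unplug(q);
}
```

In this model two toy_submit() calls followed by toy_schedule(&q, 0) leave both requests pending, while toy_schedule(&q, 1) dispatches them. Whether the real elevator loses unplug events in this way is exactly the open question in the mail above.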



Re: [PATCH] old+new RAID for 2.2.17+

2000-09-22 Thread Ingo Molnar


i strongly disagree. It's a nightmare to have three variants of the same
code at once. (mdtools, raidtools and raidtools2.) This mess has been
cleaned up in 2.4, and we shouldn't touch 2.2's RAID code beyond bugfixes.
This is not support for 'old hardware', it's support for the very same
thing.

moving the RAID files into a separate directory is a natural cleanup in
the context of 2.4, but it's just causing confusion in 2.2.

Ingo




refill_inactive()

2000-09-24 Thread Ingo Molnar


i'm wondering about the following piece of code in refill_inactive():

	if (current->need_resched && (gfp_mask & __GFP_IO)) {
		__set_current_state(TASK_RUNNING);
		schedule();
	}

shouldn't this be __GFP_WAIT? It's true that __GFP_IO implies __GFP_WAIT
(because IO cannot be done without potentially scheduling), so the code is
not buggy, but the above 'yielding' of the CPU should be done in the
GFP_BUFFER case as well. (which is __GFP_WAIT but not __GFP_IO)
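
The difference between the two tests can be sketched in plain C. This is an illustration, not kernel code: the DEMO_ bit values are made up, DEMO_GFP_BUFFER carries only the property stated above (wait allowed, IO not allowed), and both function names are hypothetical.

```c
#include <assert.h>

/* Illustrative bit values only; the real flags are defined in the kernel's mm.h. */
#define DEMO_GFP_WAIT 0x01u
#define DEMO_GFP_IO   0x04u

/* Only the property discussed here: may wait, may not start IO. */
#define DEMO_GFP_BUFFER (DEMO_GFP_WAIT)
#define DEMO_GFP_KERNEL (DEMO_GFP_WAIT | DEMO_GFP_IO)

/* The existing test: yield only if the allocation may do IO. */
static int yields_with_io_test(int need_resched, unsigned int gfp_mask)
{
	/* IO implies WAIT, so a mask with IO but without WAIT is invalid. */
	assert(!(gfp_mask & DEMO_GFP_IO) || (gfp_mask & DEMO_GFP_WAIT));
	return need_resched && (gfp_mask & DEMO_GFP_IO) != 0;
}

/* The suggested test: yield whenever the allocation may sleep. */
static int yields_with_wait_test(int need_resched, unsigned int gfp_mask)
{
	assert(!(gfp_mask & DEMO_GFP_IO) || (gfp_mask & DEMO_GFP_WAIT));
	return need_resched && (gfp_mask & DEMO_GFP_WAIT) != 0;
}
```

For DEMO_GFP_KERNEL both tests agree. For DEMO_GFP_BUFFER only the __GFP_WAIT-style test yields the CPU, which is the case the question is about.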

Objections?

Ingo




__GFP_IO shrink_[d|i]cache_memory()?

2000-09-24 Thread Ingo Molnar


i've seen a couple of GFP_BUFFER allocation deadlocks in an atypical
system which had lots of RAM allocated to inodes. The reason for the
deadlock is that the shrink_*() functions cannot be called if __GFP_IO is
not set. Nothing else can be freed at that point, so the try_again: loop
in page_alloc() gets into an infinite loop.

as an immediate solution the previous __GFP_WAIT suggestion solves the
deadlock - because the GFP_BUFFER allocator yields the CPU and kswapd can
run and do the dcache/icache shrinking. [i cannot reproduce any deadlocks
after doing this change.]

as a longer term solution, i'm wondering how hard it would be to propagate
gfp_mask into the shrink_*() functions, and prevent recursion similarly to
the swap-out logic? This way even GFP_BUFFER allocators could touch/free
the dcache/icache.

Ingo




Re: __GFP_IO shrink_[d|i]cache_memory()?

2000-09-24 Thread Ingo Molnar


On Sun, 24 Sep 2000, Linus Torvalds wrote:

 [...] I don't think shrinking the inode cache is actually illegal when
 __GFP_IO isn't set. In fact, it's probably only the buffer cache itself
 that has to avoid recursion - the other stuff doesn't actually do any
 IO.

i just found this out by example, i'm running the shrink_[i|d]cache stuff
even if __GFP_IO is not set, and no problems so far. (and much better
balancing behavior)

Ingo




[patch] vmfixes-2.4.0-test9-B2

2000-09-24 Thread Ingo Molnar


the attached vmfixes-B2 patch adds the following fixes/cleanups:

vmscan.c:

 - check for __GFP_WAIT not __GFP_IO when yielding the CPU. This fixes
   GFP_BUFFER deadlocks. In fact since no caller to do_try_to_free_pages()
   can expect that function to not block, we don't test for __GFP_WAIT
   either. [GFP_KSWAPD is the only caller without __GFP_WAIT set.]

 - do shrink_[d|i]cache_memory() even if !__GFP_IO. This improves balance.

 - push the __GFP_IO test into shm_swap().

 - after shm_swap() do not test for !count but for <= 0, because count
   could go negative if in the future the shrink_ functions return more
   than 1, and we could then get into an infinite loop. Same after
   swap_out() and refill_inactive_scan(). No performance penalty, the
   test for zero is exchanged for a test for sign.

 - kmem_cache_reap() is done within refill_inactive(), so it's
   unnecessary to call it at the beginning of do_try_to_free_pages().
   Moved to the else branch. (i saw kmem_cache_reap() show up in profiles)

 - (small codestyle cleanup.)
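
The reason for the sign test can be shown with a small standalone sketch. It assumes a hypothetical loop in which every step subtracts a fixed amount from count; steps_until_done() and its parameters are illustrative and do not appear in the patch.

```c
#include <assert.h>

/*
 * Hypothetical loop: each step subtracts freed_per_step from count.
 * Returns the number of steps taken before the exit test fires, or
 * max_steps if it never fires within that bound.
 */
static int steps_until_done(int count, int freed_per_step,
			    int use_sign_test, int max_steps)
{
	int steps = 0;

	assert(freed_per_step > 0 && max_steps > 0);

	while (steps < max_steps) {
		steps++;
		count -= freed_per_step;
		/* Old style: exact zero. New style: zero or negative. */
		if (use_sign_test ? (count <= 0) : (count == 0))
			return steps;
	}
	return max_steps;
}
```

With count starting at 3 and 2 subtracted per step, the exact-zero test never fires because count goes 1, -1, -3 and so on, while the sign test stops at the second step. With a decrement of 1 the two tests behave identically.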


page_alloc.c:

 - in __alloc_pages(), the infinite allocation loop yields the CPU if
   necessary. This prevents a potential lockup on UP, and even on SMP it
   can prevent livelocks. (i saw this happen.)

mm.h:

 - made the GFP_ flag definitions easier to parse for humans :-)

 - remove shrink_mmap() prototype, it doesn't exist anymore.

shm.c:

 - the trivial test for __GFP_IO.

swap_state.c, filemap.c:

 - (shrink_mmap doesn't exist anymore, it's refill_inactive.)

(The patch applies and compiles cleanly, and is tested under various VM
loads i use.)

Ingo


--- linux/mm/vmscan.c.orig  Sun Sep 24 11:41:38 2000
+++ linux/mm/vmscan.c   Sun Sep 24 12:20:27 2000
@@ -119,7 +119,7 @@
 * our scan.
 *
 * Basically, this just makes it possible for us to do
-* some real work in the future in "shrink_mmap()".
+* some real work in the future in "refill_inactive()".
 */
if (!pte_dirty(pte)) {
flush_cache_page(vma, address);
@@ -159,7 +159,7 @@
 * NOTE NOTE NOTE! This should just set a
 * dirty bit in 'page', and just drop the
 * pte. All the hard work would be done by
-* shrink_mmap().
+* refill_inactive().
 *
 * That would get rid of a lot of problems.
 */
@@ -891,7 +891,7 @@
do {
made_progress = 0;
 
-   if (current->need_resched && (gfp_mask & __GFP_IO)) {
+   if (current->need_resched) {
__set_current_state(TASK_RUNNING);
schedule();
}
@@ -899,34 +899,32 @@
while (refill_inactive_scan(priority, 1) ||
swap_out(priority, gfp_mask, idle_time)) {
made_progress = 1;
-   if (!--count)
+   if (--count <= 0)
goto done;
}
 
-   /* Try to get rid of some shared memory pages.. */
-   if (gfp_mask & __GFP_IO) {
-   /*
-* don't be too light against the d/i cache since
-* shrink_mmap() almost never fail when there's
-* really plenty of memory free. 
-*/
-   count -= shrink_dcache_memory(priority, gfp_mask);
-   count -= shrink_icache_memory(priority, gfp_mask);
-   /*
-* Not currently working, see fixme in shrink_?cache_memory
-* In the inner funtions there is a comment:
-* "To help debugging, a zero exit status indicates
-*  all slabs were released." (-arca?)
-* lets handle it in a primitive but working way...
-*  if (count <= 0)
-*  goto done;
-*/
+   /*
+* don't be too light against the d/i cache since
+* refill_inactive() almost never fail when there's
+* really plenty of memory free. 
+*/
+   count -= shrink_dcache_memory(priority, gfp_mask);
+   count -= shrink_icache_memory(priority, gfp_mask);
+   /*
+* Not currently working, see fixme in shrink_?cache_memory
+* In the inner funtions there is a comment:
+* "To help debugging, a zero exit status indicates
+*  all slabs were released." (-arca?)
+* lets handle it in a primitive but working way...
+*  if (count <= 0)
+*  goto done;
+*/
 
-   while (shm_swap(priority, gfp_mask)) {
-

Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-24 Thread Ingo Molnar


On Sun, 24 Sep 2000, Andrea Arcangeli wrote:

 ext2_new_block (or whatever that runs getblk with the superblock lock
 acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
 prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
 put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D

nasty indeed, sigh. Shouldn't ext2_new_block drop the superblock lock in
places where we might block?

Ingo




Re: 1023rd thread crashes 2.4.0-test8 from non-root user

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Mark Hahn wrote:

  The problem is large numbers of threads in 2.4.0-test8 can result in a
  hard crash of the entire kernel.  This can be done as a non-root user.
 
 this appears to be reproducible (128M duron, haven't tried intel UP/SMP):

i've done some experimentation, and to me it appears we overload the
queued signal limit of bash, or something like that? The Ctrl-C thing
definitely creates a lot of signals. And the default limit for queued
signals [kernel/signal.c:max_queued_signals] is 1024 ...

so i think this is threading-unrelated, to me it (tentatively) looks like
a signal handling bug.
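
A small user-space probe can show that queued signals are a bounded resource. This is a sketch under assumptions: it uses the POSIX realtime signal calls sigqueue() and sigtimedwait(), it is meant for a single-threaded test program, and count_queueable_signals() is a hypothetical helper. On current Linux systems the bound is a per-user resource limit rather than the global max_queued_signals mentioned above, so the numbers will not match a 2.4.0-test kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/*
 * Queue up to `cap` blocked realtime signals to ourselves and return
 * how many the kernel accepted before sigqueue() failed or cap was hit.
 */
static int count_queueable_signals(int cap)
{
	sigset_t set, old;
	union sigval value;
	struct timespec no_wait = { 0, 0 };
	int queued = 0;

	assert(cap > 0);

	/* Block the signal so that queued instances stay pending. */
	sigemptyset(&set);
	sigaddset(&set, SIGRTMIN);
	sigprocmask(SIG_BLOCK, &set, &old);

	value.sival_int = 0;
	while (queued < cap && sigqueue(getpid(), SIGRTMIN, value) == 0)
		queued++;

	/* Drain what we queued before restoring the old signal mask. */
	while (sigtimedwait(&set, NULL, &no_wait) > 0)
		;

	sigprocmask(SIG_SETMASK, &old, NULL);
	return queued;
}
```

Calling count_queueable_signals(4096) and printing the result shows whether the running system accepts that many queued signals. A return value below the cap means the queue limit was reached first.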

Ingo




Re: 1023rd thread crashes 2.4.0-test8 from non-root user

2000-09-25 Thread Ingo Molnar


indeed, after changing max_queued_signals to 4096, i cannot crash the
kernel anymore with 2000 threads.

Ingo



