Re: Oops in 2.4.0-ac5
On Wed, 10 Jan 2001, Alan Cox wrote:

> [...] it. I could never persuade Ingo to use wrmsr_eio() and check
> the return code, maybe this will change his mind. Extract from kdb
> v1.7. I have a patch from Ingo to fix this one properly. It's just
> getting tested.

i prefer clear oopses and bug reports instead of ignoring them. A failed MSR write is not something to be taken lightly: if an MSR write fails, it means there is a serious kernel bug - we want to stop the kernel and complain ASAP. And correct code will be much more readable that way.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
[patch] highmem-2.4.0-A0 (was: Re: 2.4.0-ac6: mm/vmalloc.c compile error)
On Thu, 11 Jan 2001, Frank Davis wrote:

> Hello,
>    The following error occurred while compiling 2.4.0-ac6. [...]
> vmalloc.c: In function `get_vm_area':
> vmalloc.c:188: `PKMAP_BASE' undeclared (first use in this function)

you are compiling with HIGHMEM enabled (which makes sense only if you have more than ~900MB RAM), and i accidentally broke HIGHMEM with the vmalloc fix in -pre1/-ac5. The attached patch fixes it.

	Ingo

--- linux/include/linux/vmalloc.h.orig	Thu Jan 11 11:28:06 2001
+++ linux/include/linux/vmalloc.h	Thu Jan 11 11:28:33 2001
@@ -4,6 +4,7 @@
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/spinlock.h>
+#include <linux/highmem.h>
 #include <asm/pgtable.h>
Re: Oops in 2.4.0-ac5
On Thu, 11 Jan 2001, David Woodhouse wrote:

> The bug here seems to be that we're using the same bit
> (X86_FEATURE_APIC) to report two _different_ features.

i think that the AMD APIC is truly 'compatible', but we are trying to enable the APIC and program performance counters in an Intel-specific way. The MSRs can be incompatible between steppings of the same CPU, so we should not mark something 'incompatible' on that basis. So the correct statement is: the UP-P6-specific way of enabling APICs does not work on Athlons. It doesn't work on P5's either.

	Ingo
Re: Updated zerocopy patch up on kernel.org
On Tue, 9 Jan 2001, David S. Miller wrote:

> Nothing interesting or new, just merges up with the latest 2.4.1-pre1
> patch from Linus.
>
>	ftp.kernel.org:/pub/linux/kernel/people/davem/zerocopy-2.4.1p1-1.diff.gz
>
> I haven't had any reports from anyone, which must mean that it is
> working perfectly fine and adds no new bugs, testers are thus in
> nirvana and thus have nothing to report. :-)

(works like a charm here.)

	Ingo
[patch] Lowlatency Patch for 2.4.0-ac6 and 2.4.1-pre2
a new version of my multimedia-lowlatency patchset, against recent 2.4 kernels, is now available. These patches are the 2.4-adapted versions of my 2.2 lowlatency patch - a project that has now reached an age of 1.5+ years.

the lowlatency patch against 2.4.0-ac6 can be found at:

  http://www.kernel.org/pub/linux/kernel/people/mingo/lowlatency-patches/lowlatency-2.4.0-ac6-A2

the lowlatency patch against 2.4.1-pre2 can be found at:

  http://www.kernel.org/pub/linux/kernel/people/mingo/lowlatency-patches/lowlatency-2.4.1-A2

this patch still follows the 'take no prisoners' approach: it is optimized for x86 but should work on other platforms as well. The patch uses assembly speedups and offline assembly sections to minimize the impact of conditional schedule points as much as possible - this is the reason why this patch does not offer a configuration option. The patch changes lowlevel x86 assembly routines too, to make them perform with lower latency.

on a 500 MHz 1-CPU box, typical latencies during 'everyday work' with this patch applied are 0.1 msec or less; under high load the maximum latency i've measured was 0.3 msec. The patch fixes latencies generated by intense X sessions, high block-IO and networking load, load from lots of user-space processes, and other, more unusual latency sources as well. I tested every latency source i could think of - the patch tries to be a 'complete solution' and tries to squash all latency sources larger than 0.5 msec on a typical system.

bugreports, comments, suggestions and contributions welcome!

	Ingo
Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
On Tue, 9 Jan 2001, Stephen Frost wrote:

> Now, the interesting bit here is that the processes can grow to be
> pretty large (200M+, up as high as 500M, higher if we let it ;) ) and
> what happens with MOSIX is that entire processes get sent over the
> wire to other machines for work. MOSIX will also attempt to rebalance
> the load on all of the machines in the cluster and whatnot so it can
> often be moving processes back and forth.

then you'll love the zerocopy patch :-) Just use sendfile() or specify MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card DMA-and-checksumming on cards that support it. The discussion with Stephen is about various device-to-device schemes (which i dont think Mosix wants to use - Mosix wants to use memory-to-device zero-copy, right?)

	Ingo
lies, lies and web benchmarks :-)
On Thu, 11 Jan 2001, Christoph Lameter wrote:

> I got into a bragging game about whose webserver is the fastest with
> Jim Nelson, one of the authors of the boa webserver. We finally
> settled on the Zeus test to decide the battle.

may i add my (hopefully comparable) TUX 2.0 numbers to this bragging game :-)

TUX had logging turned on; a 1-CPU 500 MHz PIII system (with enough RAM) was used for the test. UP kernel, nohighmem. The code used is 2.4.0-ac4 + DaveM's zerocopy patch + Jens' blk patch + TUX 2.0 patch. (to make these results somewhat comparable: what system did you use?)

> First boa won hands down because it supports persistent connections.
> Boa on port 6000. Khttpd on port 80:
>
> clameter@melchi:~$ ./zb localhost /index.html -k -c 215 -n 2 -p 6000
> Server: Boa/0.94.8.3
> Document Length: 1666
> Requests per seconds: 590.58
>
> Server: kHTTPd/0.1.6
> Document Length: 1666
> Requests per seconds: 196.59

well, TUX supports persistent (keepalive, ie. lightweight) HTTP connections as well. Over localhost (like the above test) it gives:

m:~ ./zb localhost /index.html -k -c 215 -n 2
Server: TUX
Document Length: 1666
Complete requests: 2
Failed requests: 0
Requests per seconds: 12658.23
Connection Times (ms) min avg max
Connect: 0 011
Total: 51631

Over 100mbit Ethernet (using eepro100) TUX does:

h:~ ./zb m /index.html -k -c 215 -n 2
Server: TUX
Document Length: 1666
Complete requests: 2
Failed requests: 0
Requests per seconds: 6033.18
Transfer rate: 11002.97 kb/s
Connection Times (ms) min avg max
Connect: 0 012
Total: 332 3250

As visible from the 'Transfer rate', the 100 mbit link is fully saturated. The eepro100 was not running in zerocopy mode, so all data was copied once.
As a comparison, over 1000 mbit Ethernet with a native zero-copy driver (SysKonnect), TUX does:

Server: TUX
Document Length: 1666
Complete requests: 2
Failed requests: 0
Keep-Alive requests: 20094
Requests per seconds: 12812.30
Connection Times (ms) min avg max
Connect: 0 012
Total: 101629

(but the server is at 70% CPU utilization in the gigabit test - the dual-PIII/500 client is 100% CPU utilized and thus not fast enough to saturate the server. With two clients it does about 15000 reqs/sec.)

> Then we decided to switch persistent connections off... But boa still
> wins.
>
> clameter@melchi:~$ ./zb localhost /index.html -c 215 -n 2 -p 6000
> Server: Boa/0.94.8.3
> Document Length: 1666
> Requests per seconds: 227.17

with normal, non-keepalive HTTP requests, TUX 2.0 over localhost does:

m:~ ./zb localhost /index.html -c 215 -n 2
Server: TUX
Document Length: 1666
Complete requests: 2
Failed requests: 0
Requests per seconds: 5154.64
Connection Times (ms) min avg max
Connect: 111923
Total: 344045

over 100mbit ethernet (eepro100) it does:

h:~ ./zb m /index.html -c 215 -n 2
Server: TUX
Document Length: 1666
Complete requests: 2
Failed requests: 0
Requests per seconds: 4435.57
Connection Times (ms) min avg max
Connect: 115 3020
Total: 1847 3068

over gigabit SysKonnect zero-copy it does:

h:~ ./zb mg /index.html -c 215 -n 2
Server: TUX
Document Length: 1666
Requests per seconds: 5327.65
Connection Times (ms) min avg max
Connect: 01116
Total: 213981

(but the nonpersistent test puts even more load on the client - the server is only about 60% utilized; with two clients it does about 8000 reqs/sec.)

at this point i couldnt resist - i assembled a few TUX 2.0 CGI execution benchmarks. The CGI used for this test is a real, standard Linux ELF CGI executable which is exec()-ed for every HTTP request: it read()s the same /index.html file the other tests were using, write()s a HTML header to stdout, then write()s the /index.html file to stdout, and finally write()s a HTML trailer to stdout and exit()s.
[A separate process is created for every single HTTP request.] Over localhost, TUX 2.0 CGI does:

m:~ ./zb localhost x?/index.html -c 215 -n 2
---
Server: TUX
Document Length: 1876
Complete requests: 2
Failed requests: 0
Requests per seconds: 1227.14
Connection Times (ms) min avg max
Connect: 11232
Total: 41 172 346
---

over 100 mbit Ethernet (eepro100) TUX 2.0 CGI does:

Requests per seconds: 1336.18

(the 1000 mbit number is the same as the 100 mbit one, because the server is saturated executing CGIs already, network
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, 12 Jan 2001, Manfred Spraul wrote:

> The PPro local apic documentation says:
>
>   The processor's local APIC includes an in-service entry and a
>   holding entry for each priority level. To avoid losing interrupts,
>   software should allocate no more than 2 interrupt vectors per
>   priority.
>
> Ok, we must reorder the vector numbers for our own interrupts
> (0xfb-0xff), but that doesn't explain our problems: we don't lose
> reschedule interrupts, we have problems with normal interrupts - and
> there we only use 2 irqs at the same priority level.

we *already* reorder vector numbers and spread them out as much as possible. We do this in 2.2 as well; we did this almost from day 1 of IO-APIC support. If any manually allocated IRQ vector creates a '3 vectors in the same 16-vector region' situation then that's a bug in hw_irq.h. Also, the 'loss of interrupts' above does not include external interrupts - only local interrupts (such as the APIC timer interrupt) can get lost in such a situation. (nevertheless there is something going on.)

	Ingo
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, 12 Jan 2001, Manfred Spraul wrote:

> 2.4 spreads the vectors for the external (hardware, from io apic)
> interrupts, but 5 ipi vectors have the same priority: reschedule,
> call function, tlb invalidate, apic error, spurious interrupt.

my reading of the errata is that the lost-APIC-timer-IRQ problem happens only if the APIC timer IRQ vector's priority level has more than 2 active vectors. It's a very limited case, which does not happen on recent CPUs anyway (such as the PIII).

> But that doesn't explain what happens with ne2k cards: neither 2.2
> nor 2.4 have more than 2 interrupts in the class for hardware
> interrupt 16/19.

yep.

	Ingo
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, 12 Jan 2001, Linus Torvalds wrote:

> [...] Ingo, what was the focus-cpu thing?

well, some time ago i had an ne2k card in an SMP system as well, and ran into this very problem. Disabling/enabling focus-cpu appeared to make a difference, but later on i made experiments that showed that the hang happens in both cases. I spent a good deal of time trying to fix this problem, but failed - so any fresh ideas are more than welcome.

	Ingo
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, 12 Jan 2001, Frank de Lange wrote:

> In addition, I patched apic.c (focus cpu enabled)
> In addition, I patched io_apic (TARGET_CPUS 0xff)

please try it with the focus-CPU enabling change (we want to enable that feature, i only disabled it due to the stuck-ne2k bug), but with TARGET_CPUS set to cpu_online_mask. (this latter is needed for certain crappy BIOSes.) i believe the ne2k driver change is the key.

> I have a first idea: we send an EOI to an interrupt that is masked on
> the IO apic, perhaps that causes the problems. Sound plausible...

does not help - I've tried it (and many other combinations). I did not find any direct workaround for this problem. (i tried very hard.)

	Ingo
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, 12 Jan 2001, Frank de Lange wrote:

> WITH or WITHOUT the changed 8390 driver? I can already give you the
> results for running WITH the changed driver: it works. I have not yet
> tried it WITHOUT the changed 8390 driver (so that would be stock
> 8390, patched apic.c, stock io_apic.c). Please let me know which you
> want...

WITH: patched 8390.c, patched apic.c, stock io_apic.c. My very strong feeling is that this will be a stable combination, and that this is what we want as the final solution.

	Ingo
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, 12 Jan 2001, Frank de Lange wrote:

> BTW, does this (TARGET_CPUS cpu_online_mask) not wreak havoc with
> systems with hot-pluggable CPUs (many Suns, etc...)? Wouldn't it be
> better to make this a config option (like the optional PCI fixes for
> crappy BIOSes)?

? this is x86-only code. There is no hot-pluggable CPU support in Linux AFAIK. (But in any case, the code is basically ready for hot-pluggable CPUs - just take a few precautions and change cpu_online_mask and a couple of other things.)

	Ingo
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, 12 Jan 2001, Frank de Lange wrote:

> It is. As I already mentioned in other messages, I already tested
> with JUST the patched 8390.c driver, no other patches. It was stable.
> I then patched apic.c AND io_apic.c, which did not introduce new
> instabilities. Unless you think that reverting back to a stock
> io_apic.c would cause instabilities (which would be weird, since I
> had no instabilities running only a patched 8390.c), I think the
> patch to 8390.c DOES remove the symptoms all by itself. No other
> patches seem necessary to get a stable box.

okay - i just wanted to hear a definitive word from you that this fixes your problem, because this is what we'll have to do as the final solution (barring any other solution).

	Ingo
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, 12 Jan 2001, Frank de Lange wrote:

> PATCHED 8390.c (using irq_safe spinlocks instead of disable_irq)
> PATCHED apic.c (focus cpu ENABLED)
> STOCK io_apic.c
>
> No problems under heavy network load. Gentlemen, this (the patch to
> 8390.c) seems to fix the problem.

great. Back when i had the same problem, flood-pinging another host (on the local network) was the quickest way to reproduce the hang:

	ping -f -s 10 otherhost

this produced an IOAPIC hang within seconds.

	Ingo
Re: Updated zerocopy patch up on kernel.org
On Tue, 9 Jan 2001, David S. Miller wrote:

> > Is there any value to supporting fragments in a driver which
> > doesn't do hardware checksumming? IIRC Alexey had a patch to do
> > such for Tulip, but I don't see it in the above patchset.
>
> I'm actually considering making the SG w/o hwcsum situation illegal.

i believe it might still make some limited sense for normal sendmsg() and higher MTUs (or 8k NFS) - we could copy-and-checksum stuff into the ->tcp_page if SG is possible, and thus the SG capability improves the VM (because we can allocate at PAGE_SIZE granularity).

	Ingo
Re: Ingo's RAID patch for 2.2.18 final?
On 13 Jan 2001 [EMAIL PROTECTED] wrote:

> What is at http://www.kernel.org/pub/linux/kernel/people/mingo/ looks
> official enough to me...
>
>   raid-2.2.18-B0    12-Jan-2001 10:18   392k

yep, it is the 'official' 2.2.18 RAID patch.

	Ingo
Re: Is sendfile all that sexy?
On Sun, 14 Jan 2001, jamal wrote:

> regular ttcp, no ZC and no sendfile. [...]
> Throughput: ~99MB/sec (for those obsessed with Mbps, ~810Mbps)
> CPU abuse: server side 87%, client side 22%
> [...]
> sendfile server.
> - throughput: 86MB/sec
> - CPU: server 100%, client 17%

i believe what you are seeing here is the overhead of the pagecache. When using sendmsg() only, you do not read() the file every time, right? Is ttcp using multiple threads? In that case, if the sendfile() is using the *same* file all the time, it is creating SMP locking overhead. If this is the case, what result do you get if you use a separate, isolated file per process? (And i bet that with DaveM's pagecache scalability patch the situation would also get much better - the global pagecache_lock hurts.)

	Ingo
Re: Is sendfile all that sexy?
On 14 Jan 2001, Linus Torvalds wrote:

> Does anybody but apache actually use it?

There is a Samba patch as well that makes it sendfile()-based. Various other projects use it too (phttpd for example), some FTP servers i believe, and khttpd and TUX.

	Ingo
Re: Is sendfile all that sexy?
On Sun, 14 Jan 2001, Linus Torvalds wrote:

> > There is a Samba patch as well that makes it sendfile()-based.
> > Various other projects use it too (phttpd for example), some FTP
> > servers i believe, and khttpd and TUX.
>
> At least khttpd uses "do_generic_file_read()", not sendfile per se. I
> assume TUX does too. Sendfile itself is mainly only useful from user
> space..

yes, you are right. TUX does it mainly to avoid some of the user-space interfacing overhead present in sys_sendfile(), and to be able to control packet boundaries (ie. to have or not have the MSG_MORE flag). So TUX is using its own sock_send_actor and its own read_descriptor.

	Ingo
[patch] fixes for RAID1/RAID5 hot-add/hot-remove, 2.4.0-ac9
the attached patch (against -ac9) fixes a bug in the RAID1 and RAID4/5 code that made raidhotremove fail under certain (rare) circumstances. Please apply.

	Ingo

--- linux/drivers/md/raid1.c.orig	Mon Dec 11 22:19:35 2000
+++ linux/drivers/md/raid1.c	Mon Jan 15 14:45:35 2001
@@ -832,6 +832,7 @@
 	struct mirror_info *tmp, *sdisk, *fdisk, *rdisk, *adisk;
 	mdp_super_t *sb = mddev->sb;
 	mdp_disk_t *failed_desc, *spare_desc, *added_desc;
+	mdk_rdev_t *spare_rdev, *failed_rdev;
 
 	print_raid1_conf(conf);
 	md_spin_lock_irq(&conf->device_lock);
@@ -989,6 +990,10 @@
 	/*
 	 * do the switch finally
 	 */
+	spare_rdev = find_rdev_nr(mddev, spare_desc->number);
+	failed_rdev = find_rdev_nr(mddev, failed_desc->number);
+	xchg_values(spare_rdev->desc_nr, failed_rdev->desc_nr);
+
 	xchg_values(*spare_desc, *failed_desc);
 	xchg_values(*fdisk, *sdisk);
--- linux/drivers/md/raid5.c.orig	Mon Jan 15 14:45:50 2001
+++ linux/drivers/md/raid5.c	Mon Jan 15 14:46:01 2001
@@ -1707,6 +1707,7 @@
 	struct disk_info *tmp, *sdisk, *fdisk, *rdisk, *adisk;
 	mdp_super_t *sb = mddev->sb;
 	mdp_disk_t *failed_desc, *spare_desc, *added_desc;
+	mdk_rdev_t *spare_rdev, *failed_rdev;
 
 	print_raid5_conf(conf);
 	md_spin_lock_irq(&conf->device_lock);
@@ -1878,6 +1879,10 @@
 	/*
 	 * do the switch finally
 	 */
+	spare_rdev = find_rdev_nr(mddev, spare_desc->number);
+	failed_rdev = find_rdev_nr(mddev, failed_desc->number);
+	xchg_values(spare_rdev->desc_nr, failed_rdev->desc_nr);
+
 	xchg_values(*spare_desc, *failed_desc);
 	xchg_values(*fdisk, *sdisk);
Re: Is sendfile all that sexy?
On Mon, 15 Jan 2001, Jonathan Thackray wrote:

> It's a very useful system call and makes file serving much more
> scalable, and I'm glad that most Un*xes now have support for it
> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
> Linux is sendpath(), which does the open() before the sendfile(), all
> combined into one system call.

i believe the right model for a user-space webserver is to cache open file descriptors and directly hash URLs to open files. This way you can do pure sendfile() without any open(). Not that open() is too expensive in Linux:

m:~/lm/lmbench-2alpha9/bin/i686-linux ./lat_syscall open
Simple open/close: 7.5756 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux ./lat_syscall stat
Simple stat: 5.4864 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux ./lat_syscall write
Simple write: 0.9614 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux ./lat_syscall read
Simple read: 1.1420 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux ./lat_syscall null
Simple syscall: 0.6349 microseconds

(note that lmbench opens a nontrivial path; it can be cheaper than this.) Nevertheless, saving the lookup can be a win.

[ TUX uses dentries directly so there is no file-opening cost - it's pretty much equivalent to sendpath(), with the difference that TUX can do security evaluation on the (held) file prior to sending it, while sendpath() is pretty much a shot in the dark. ]

	Ingo
[patch] sendpath() support, 2.4.0-test3/-ac9
On 15 Jan 2001, Linus Torvalds wrote:

>	int fd = open(..)
>	fstat(fd..);
>	sendfile(fd..);
>	close(fd);
>
> is any slower than
>
>	.. cache stat() in user space based on name ..
>	sendpath(name, ..);
>
> on any real load.

just for kicks i've implemented sendpath() support. (patch against 2.4.0-test and sample code attached) It appears to work just fine here. With a bit of reorganization in mm/filemap.c it was quite straightforward to do.

Jonathan, is this what Zeus needs? If yes, it could be interesting to run a simple benchmark to compare sendpath() to open()+sendfile().

	Ingo

--- linux/mm/filemap.c.orig	Mon Jan 15 22:43:21 2001
+++ linux/mm/filemap.c	Mon Jan 15 23:09:55 2001
@@ -39,6 +39,8 @@
  * page-cache, 21.05.1999, Ingo Molnar [EMAIL PROTECTED]
  *
  * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli [EMAIL PROTECTED]
+ *
+ * Started sendpath() support, (C) 2000 Ingo Molnar [EMAIL PROTECTED]
  */
 
 atomic_t page_cache_size = ATOMIC_INIT(0);
@@ -1450,15 +1452,15 @@
 	return written;
 }
 
-asmlinkage ssize_t sys_sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
+/*
+ * Get input file, and verify that it is ok..
+ */
+static struct file * get_verify_in_file (int in_fd, size_t count)
 {
-	ssize_t retval;
-	struct file * in_file, * out_file;
-	struct inode * in_inode, * out_inode;
+	struct inode * in_inode;
+	struct file * in_file;
+	int retval;
 
-	/*
-	 * Get input file, and verify that it is ok..
-	 */
 	retval = -EBADF;
 	in_file = fget(in_fd);
 	if (!in_file)
@@ -1474,10 +1476,21 @@
 	retval = locks_verify_area(FLOCK_VERIFY_READ, in_inode, in_file, in_file->f_pos, count);
 	if (retval)
 		goto fput_in;
+	return in_file;
 
+fput_in:
+	fput(in_file);
+out:
+	return ERR_PTR(retval);
+}
 
+/*
+ * Get output file, and verify that it is ok..
+ */
+static struct file * get_verify_out_file (int out_fd, size_t count)
+{
+	struct file *out_file;
+	struct inode *out_inode;
+	int retval;
 
-	/*
-	 * Get output file, and verify that it is ok..
-	 */
 	retval = -EBADF;
 	out_file = fget(out_fd);
 	if (!out_file)
@@ -1491,6 +1504,29 @@
 	retval = locks_verify_area(FLOCK_VERIFY_WRITE, out_inode, out_file, out_file->f_pos, count);
 	if (retval)
 		goto fput_out;
+	return out_file;
+
+fput_out:
+	fput(out_file);
+fput_in:
+	return ERR_PTR(retval);
+}
+
+asmlinkage ssize_t sys_sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
+{
+	ssize_t retval;
+	struct file * in_file, *out_file;
+
+	in_file = get_verify_in_file(in_fd, count);
+	if (IS_ERR(in_file)) {
+		retval = PTR_ERR(in_file);
+		goto out;
+	}
+	out_file = get_verify_out_file(out_fd, count);
+	if (IS_ERR(out_file)) {
+		retval = PTR_ERR(out_file);
+		goto fput_in;
+	}
 
 	retval = 0;
 	if (count) {
@@ -1524,6 +1560,56 @@
 	fput(in_file);
 out:
 	return retval;
+}
+
+asmlinkage ssize_t sys_sendpath(int out_fd, char *path, off_t *offset, size_t count)
+{
+	struct file in_file, *out_file;
+	read_descriptor_t desc;
+	loff_t pos = 0, *ppos;
+	struct nameidata nd;
+	int ret;
+
+	out_file = get_verify_out_file(out_fd, count);
+	if (IS_ERR(out_file)) {
+		ret = PTR_ERR(out_file);
+		goto err;
+	}
+	ret = user_path_walk(path, &nd);
+	if (ret)
+		goto put_out;
+	ret = -EINVAL;
+	if (!nd.dentry || !nd.dentry->d_inode)
+		goto put_in_out;
+
+	memset(&in_file, 0, sizeof(in_file));
+	in_file.f_dentry = nd.dentry;
+	in_file.f_op = nd.dentry->d_inode->i_fop;
+
+	ppos = &in_file.f_pos;
+	if (offset) {
+		if (get_user(pos, offset))
+			goto put_in_out;
+		ppos = &pos;
+	}
+	desc.written = 0;
+	desc.count = count;
+	desc.buf = (char *) out_file;
+	desc.error = 0;
+	do_generic_file_read(&in_file, ppos, &desc, file_send_actor, 0);
+
+	ret = desc.written;
+	if (!ret)
+		ret = desc.error;
+	if (offset)
+		put_user(pos, offset);
+
+put_in_out:
+	fput(out_file);
+put_out:
+	path_release(&nd);
+err:
+	return ret;
 }
 
 /*
--- linux/arch/i386/kernel/entry.S.orig	Mon Jan 15 22:42:47 2001
+++ linux/arch/i386/kernel/entry.S	Mon Jan 15 22:43:12 2001
@@ -646,6 +646,7 @@
 	.long SYMBOL_NAME(sys_getdents64)	/* 220 */
 	.long SYMBOL_NAME(sys_fcntl64)
 	.long SYMBOL_NAME(sys_ni_syscall)	/* reserved for TUX */
+	.long
Re: 4G SGI quad Xeon - memory-related slowdowns
On 15 Jan 2001, Linus Torvalds wrote:

> The performance problem is _probably_ due to the kernel having to
> double-buffer the IO requests, coupled with bad MTRR settings (ie
> memory above the 4GB range is probably marked as non-cacheable or
> something, which means that you'll get really bad performance).

the highmem-related double-buffering alone, on such a category of system, is minuscule compared to the other costs of IO, and considering the expected bandwidth (20-30 MB/sec). The MTRR part could be a problem.

> Not using the high memory will avoid the double-buffering, and will
> also avoid using memory that isn't cached. If I'm right. The hang
> still indicates that something is wrong in PAE-land, though.

it's working just fine on all 4GB+ systems tested (including 32GB systems) - Intel, Dell, Compaq boxes. So if it's a unique PAE bug, then it must be some boundary condition.

Paul, here is the memory map of my 8GB system:

BIOS-provided physical RAM map:
 BIOS-e820: 0009d400 @ (usable)
 BIOS-e820: 2c00 @ 0009d400 (reserved)
 BIOS-e820: 0002 @ 000e (reserved)
 BIOS-e820: 03ef8000 @ 0010 (usable)
 BIOS-e820: 7c00 @ 03ff8000 (ACPI data)
 BIOS-e820: 0400 @ 03fffc00 (ACPI NVS)
 BIOS-e820: ec00 @ 0400 (usable)
 BIOS-e820: 0140 @ fec0 (reserved)
 BIOS-e820: f000 @ 0001 (usable)

and here are the MTRR settings:

[root@m mingo]# cat /proc/mtrr
reg00: base=0xf000 (3840MB), size= 256MB: uncachable, count=1
reg01: base=0x ( 0MB), size=4096MB: write-back, count=1
reg02: base=0x1 (4096MB), size=2048MB: write-back, count=1
reg03: base=0x18000 (6144MB), size=1024MB: write-back, count=1
reg04: base=0x1c000 (7168MB), size= 512MB: write-back, count=1
reg05: base=0x1e000 (7680MB), size= 256MB: write-back, count=1

i'd suggest using the mem=exact feature to force different types of memory maps. Eg. i'm using the following append= line to force an 800 MB setup:

append="mem=exactmap mem=0x0009d800@0x mem=0x03ef8000@0x0010 mem=0x2bffe000@0x0400"

such mem=exactmap lines can be constructed based on the BIOS output.
	Ingo
Re: [patch] sendpath() support, 2.4.0-test3/-ac9
On Mon, 15 Jan 2001, dean gaudet wrote:

> > just for kicks i've implemented sendpath() support.
> >
> > _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off,
> >            size_t, size)
>
> hey so how do you implement transmit timeouts with sendpath()?
> (i.e. drop the client after 30 seconds of no progress.)

well, this problem is not unique to sendpath() - sendfile() has it as well. In TUX i've added per-socket connection timers, and i believe something like this should be done in Apache as well: timers are IMO not a good enough excuse for avoiding event-based IO models and using select() or poll().

	Ingo
'native files', 'object fingerprints' [was: sendpath()]
On Mon, 15 Jan 2001, Linus Torvalds wrote:

> > _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off,
> >            size_t, size)
>
> You want to do a non-blocking send, so that you don't block on the
> socket, and do some simple multiplexing in your server. And
> "sendpath()" cannot do that without having to look up the name again,
> and again, and again. Which makes the performance "optimization" a
> horrible pessimisation.

yep, correct. But take a look at the trick it does with file descriptors - i believe it could be a useful way of doing things. It basically privatizes a struct file, without inserting it into the enumerated file descriptors. This shows that 'native files' are possible: file structs without file descriptor integers mapped to them.

ob'plug: this privatized file descriptor mechanism is used in TUX. [TUX privatizes files by putting them into the HTTP request structure - ie. timeouts and continuation/nonblocking logic can be done with them.] But TUX is trusted code: it can pass a struct file to the VFS without having to validate it, and TUX will also free such file descriptors. But even user-space code could use 'native files', via the following safe mechanism:

1) current->native_files list, freed at exit_files() time.

2) "struct native_file", which embeds "struct file". It has the following fields:

	struct native_file {
		unsigned long master_fingerprint[8];
		unsigned long file_fingerprint[8];
		struct file file;
	};

'fingerprints' are 256-bit, true random numbers. master_fingerprint is global to the kernel and is generated once per boot; it validates the pointer of the structure. The master fingerprint is never known to user-space. file_fingerprint is a 256-bit identifier generated for this native file. The file fingerprint and the (kernel) pointer to the native file are returned to user-space. The cryptographical safety of these 256-bit random numbers guarantees that no breach can occur in a reasonable period of time.
It's in essence an 'encrypted' communication between kernel and user-space. user-space thus can pass a pointer to the following structure:

struct safe_kpointer {
	void *kaddr;
	unsigned long fingerprint[4];
};

the kernel can validate kaddr by 1) validating the pointer via the master fingerprint (every valid kernel pointer must point to a structure that starts with the master fingerprint's copy). Then usage permissions are validated by checking the file fingerprint (the per-object fingerprint). this is a safe, very fast [ O(1) ] object-permission model. (it's a variation of a former idea of yours.) A process can pass object fingerprints and kernel pointers to other processes too - thus the other process can access the object too. Threads will 'naturally' share objects, because fingerprints are typically stored in memory. 3) on closing a native file the fingerprint is destroyed (the first byte of the master fingerprint copy is overwritten). what do you think about this? I believe most of the file APIs can be / should be reworked to use native files, and 'Unix files' would just be a compatibility layer parallel to them. Then various applications could convert to 'native file' usage - i believe file servers which have lots of file descriptors would do this first. (this 'fingerprint' mechanism can be used for any object, not only files.) Ingo
Re: 'native files', 'object fingerprints' [was: sendpath()]
On Tue, 16 Jan 2001, Andi Kleen wrote: On Tue, Jan 16, 2001 at 10:48:34AM +0100, Ingo Molnar wrote: this is a safe, very fast [ O(1) ] object-permission model. (it's a variation of a former idea of yours.) A process can pass object fingerprints and kernel pointers to other processes too - thus the other process can access the object too. Threads will 'naturally' share objects, ... Just setuid etc. doesn't work with that because access cannot be easily revoked without disturbing other clients. well, you cannot easily close() an already shared file descriptor in another process's context either. Is revocation so important? Why is setuid() a problem? A native file is just like a normal file, with the difference that not an integer but a fingerprint identifies it, and that access and usage counts are not automatically inherited across some explicit sharing interface. perhaps we could get most of the advantages by allowing the relaxation of the 'allocate first free file descriptor number' rule for normal Unix files? Also the model depends on good secure random numbers, which is questionable in many environments (e.g. a diskless box where the random device effectively gets no new input) true, although newer chipsets include hardware random generators. But indeed, object fingerprints (tokens? ids?) make the random generator a much more central thing. Ingo
O_ANY [was: Re: 'native files', 'object fingerprints' [was:sendpath()]]
On Tue, 16 Jan 2001, Andi Kleen wrote: the 'allocate first free file descriptor number' rule for normal Unix files? Not sure I follow. You mean dup2() ? I'm sure you know this: when there are thousands of files open already, much of the overhead of opening a new file comes from the mandatory POSIX requirement of allocating the first not yet allocated file descriptor integer to this file. Eg. if files 0, 1, 2, 10, 11 are already open, the kernel must allocate file descriptor 3. Many utilities rely on this, and the rule makes sense in a select() environment, because it compresses the 'file descriptor spectrum'. But in a non-select(), event-driven environment it becomes unnecessary overhead. - probably the most radical solution is what i suggested, to completely avoid the unique-mapping of file structures to an integer range, and use the address of the file structure (and some cookies) as an identification. - a less radical solution would be to still map file structures to an integer range (file descriptors) and maintain file usage per process, but relax the 'allocate first non-allocated integer in the range' rule. I'm not sure exactly how simple this is, but something like this should work: on close()-ing file descriptors the freed file descriptors would be cached in a list (this needs a new, separate structure which must be allocated/freed as well). Something like:

struct lazy_filedesc {
	int fd;
	struct file *file;
};

struct task {
	...
	struct lazy_filedesc *lazy_files;
	...
};

the actual file descriptor bit of a 'lazy file' would be cleared for real on close(), and the '*file' argument is not a real file - it's NULL if at close() time this process wasn't the last user of the file, or contains a pointer to an allocated (but otherwise invalid) file structure. This must happen to ensure the first-free-desc rule, and to optimize freeing/allocation of file structures.
Now, if the new code does a: fd = open(...,O_ANY); then the kernel looks at the current->lazy_files list, and tries to set the file descriptor bit in the current->files file table. If successful then open() uses desc->fd and desc->file (if available) for opening the new file, and unlinks+frees the lazy descriptor. If unsuccessful then open() frees desc->file, frees and unlinks the descriptor and goes on to look at the next descriptor. - worst case overhead is the extra allocation overhead of the (very small) lazy file descriptor. Worst-case happens only if O_ANY allocation is mixed in a special way with normal open()s. - Best-case overhead saves us a get_unused_fd() call, which can be *very* expensive (in terms of CPU time and cache footprint) if thousands of files are used. If O_ANY is used mostly, then the best-case is always triggered. - (the number of lazy files must be limited to some sane value) at exit_files() time the current->lazy_files list must be processed. On exec() it does not get inherited. current->lazy_files has no effect on task state or semantics otherwise, it's only an isolated 'information cache'. Have i missed something important? Ingo
Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was:sendpath()]]
On Tue, 16 Jan 2001, Ingo Molnar wrote: struct lazy_filedesc { int fd; struct file *file; }; in fact "struct file" can be (ab)used for this, no need for new structures or new fields. Eg. file->f_flags contains the cached descriptor information. file->f_list is used for the current->lazy_files ring list. this way there is no additional allocation overhead in the worst case. (unless i'm missing something obvious.) Ingo
Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was:sendpath()]]
On Tue, 16 Jan 2001, Peter Samuelson wrote: [Ingo Molnar] - probably the most radical solution is what i suggested, to completely avoid the unique-mapping of file structures to an integer range, and use the address of the file structure (and some cookies) as an identification. Careful, these must cast to non-negative integers, without clashing. if you read my (radical) proposal, the identification is based on a kernel pointer and a 256-bit random integer. So non-negative integers are not needed. (file-IO system-calls would be modified to detect if 'Unix file descriptors' or pointers to 'native file descriptors' are passed to them, so this is truly radical.) Ingo
Re: Is sendfile all that sexy?
On Tue, 16 Jan 2001, Felix von Leitner wrote: I don't know how Linux does it, but returning the first free file descriptor can be implemented as an O(1) operation. only if special allocation patterns are assumed. Otherwise it cannot be a generic O(1) solution. The first-free rule adds an implicit ordering to the file descriptor space, and this order cannot be maintained in an O(1) way. Linux can allocate up to a million file descriptors. Ingo
Re: Is sendfile all that sexy?
On Tue, 16 Jan 2001, Felix von Leitner wrote: I don't know how Linux does it, but returning the first free file descriptor can be implemented as O(1) operation. to put it more accurately: the requirement is to be able to open(), use and close() an unlimited number of file descriptors with O(1) overhead, under any allocation pattern, with only RAM limiting the number of files. Both of my proposals attempt to provide this. It's possible to open() O(1) but do a O(log(N)) close(), but that is of no practical value IMO. Ingo
Re: Hi memory support in 2.4 not working correctly.
On Wed, 17 Jan 2001, Micah Gorrell wrote: I have a compaq 8 way server with 4 gigs of memory. I am running 2.4.0 and everything works just fine (except the gig - I'm still fighting with that) The only strange thing that I am seeing is that I only see 3.3 gigs of memory instead of the full 4. Has anyone seen this and possibly know of a fix? could you run this command against your .config: grep -i highmem .config what does it say? Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Wed, 17 Jan 2001, dean gaudet wrote: with the linux TCP_CORK API you only get one trailing small packet. in case you haven't heard of TCP_CORK -- when the cork is set, the kernel is free to send any maximum size packets it can form, but has to hold on to the stragglers until userland gives it more data or pops the cork. TCP_CORK has been basically replaced by MSG_MORE these days. The problem with the cork approach is that it's a persistent socket flag - and it easily triggers programming errors when the application writer tracks the state of the cork incorrectly. Also, removing the cork is one extra system-call. So what you can do with MSG_MORE is to specify at sendmsg()/writev() time, whether at the end of the buffer there is a cork or not. this is what TUX uses. When eg. a static HTTP request arrives it sends reply headers shortly after having checked file permissions and stuff (but the file is not yet sent), with MSG_MORE set. Then it sends the file, and sendfile() keeps MSG_MORE set right until the end of the request, when it clears it for the last fragment so the last partial packet gets flushed to the network. In fact there is one more optimization here, if the request is not keepalive then TUX still keeps MSG_MORE set, and closes the socket - which will implicitly flush the output queue anyway and send any partial packet, but will also have the FIN packet merged with the last outgoing packet. (if there is saturation then further merging might happen as well - if a sendmsg() comes before a partial, but already xmit-queued packet is sent, then the TCP layer merges this sendmsg() output with the outgoing packet.) Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Wed, 17 Jan 2001, Rick Jones wrote: i'd heard interesting generalities but no specifics. for instance, when the send is small, does TCP wait exclusively for the app to flush, or is there an "if all else fails" sort of timer running? yes there is a per-socket timer for this. According to RFC 1122 a TCP stack 'MUST NOT' buffer app-sent TCP data indefinitely if the PSH bit cannot be explicitly set by a SEND operation. Was this a trick question? :-) Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Wed, 17 Jan 2001, Linus Torvalds wrote: (I also had one person point out that BSD's have the notion of TCP_NOPUSH, which does almost what TCP_CORK does under Linux, except it doesn't seem to have the notion of uncorking - you can turn NOPUSH off, but apparently it doesn't affect queued packets. This makes it even less clear why they have the ugly sendfile) this is what MSG_MORE does. Basically i added MSG_MORE for the purpose of getting perfect TUX packet boundaries (and was ignorant enough to not know about BSD's NOPUSH), without an additional system-call overhead, and without the persistency of TCP_CORK. Alexey and David agreed, and actually implemented it correctly :-) basically if MSG_MORE is not set that means an explicit packet boundary in the noncontended scenario. If MSG_MORE is set then that means all full-MSS packets are queued, partial packets are not queued (but are timing out). sendfile() uses the more flag internally - i've changed sendfile() in my tree to specify the more flag from higher levels as well - eg. if a sent file is embedded into other replies, or multiple files are sent. Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Wed, 17 Jan 2001, Rick Jones wrote: certainly, i see by your examples how cork can make life easier on the developer - they can putc() the reply if they want. for a persistent http connection, there would be the cork and uncork each time, for a pipelined connection, it is basically a race - how does the client present requests to the connection, what are the speeds of that connection relative to the speed of the server getting replies into the socket that sort of thing. such dynamic properties should IMO never become visible to user-space interfaces. TCP_CORK/MSG_MORE (which are both the same thing, in a different interface) are a way to specify logical neighborhood of app-side SENDs. I believe the most sensible and generic thing to do is to require MSG_MORE information from the application: 'is it likely that the application is going to SEND something soon, or not?'. Submitting an exact timetable of planned future SENDs (with a fully specified probability distribution of every expected future SEND event) would be the most informative thing to do, but this is not very practical. Basically MSG_MORE is a simplified probability distribution of the next SEND, and it already covers all the other (iovec, nagle, TCP_CORK) mechanisms available, in a surprisingly easy way IMO. I believe MSG_MORE is very robust from a theoretical point of view. To use this information to judge saturation situations properly is completely up to the stack. Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Thu, 18 Jan 2001, Linus Torvalds wrote: Yeah, and how are you going to teach a perl CGI script that writes to stdout to use it? yep, correct. But you can have TCP_CORK behavior from user-space (by setting the cork flag in user-space and writing it for all network output), while you cannot have MSG_MORE in the TCP_CORK case. And a perl script will likely use none of these mechanisms, it's the webserver CGI host code that does the network send, perl CGI scripts do not send to the network directly, they send to a pipe so the CGI host code can have absolute control over eg. CGI-generated HTTP headers. Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Thu, 18 Jan 2001, Linus Torvalds wrote: Remember the UNIX philosophy: everything is a file. MSG_MORE completely breaks that, because the only way to use it is with send[msg](). It's absolutely unusable with something like a traditional UNIX "anonymous" application that doesn't know or care that it's writing to the network. yep you are right - i only thought in terms of applications that know that they are dealing with a network. In contrast, TCP_CORK has an interface much like TCP_NOPUSH, along with the notion of persistency. The difference between those two is that TCP_CORK really took the notion of persistency to the end, and made uncorking actually say "Ok, no more packets". You can't do that with TCP_NOPUSH: with TCP_NOPUSH you basically have to know what your last write is, and clear the bit _before_ that write if you want to avoid bad latencies (alternatively, you can just close the socket, which works equally well, and was probably the designed interface for the thing. That has the disadvantage of, well, closing the socket - so it doesn't work if you don't _know_ whether you'd write more or not). i believe BSD's TCP_NOPUSH should add those 3 lines that are needed to flush pending packets, this is what we do too - we do a tcp_push_pending_frames() if the socket option TCP_CORK is cleared. So the three are absolutely not equivalent. I personally think that TCP_NOPUSH is always the wrong thing - it has the persistency without the ability to shut it off gracefully after the fact. In contrast, both MSG_MORE and TCP_CORK have well-defined behaviour but they have very different uses. yep - i agree now. In terms of network-aware applications, i found MSG_MORE to be both cheaper and less bug-prone - but for uncooperative (or simply too generic) applications which are outputting to simple files there is no way to control buffering, only some persistent mechanism.
Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Thu, 18 Jan 2001 [EMAIL PROTECTED] wrote: Actually, TUX-1.1 (Ingo, do I not lie, did you not kill this code?) does this. It does not ack quickly, when complete request is received and still not answered, so that all the redundant acks disappear. (it's TUX 2.0 meanwhile), and yes, TUX uses it. We speculatively delay ACK of parsed input packet in the hope of merging it with the first output packet. If the output frame does not happen for 200 msecs then we send a standalone ACK to be RFC-conform. This way TUX can do single-packet web replies for small requests. (well, plus SYN-ACK and FIN-ACK) Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Thu, 18 Jan 2001, Linus Torvalds wrote: I think Andrea was thinking more of the case of the anonymous IO generator, and having the "controller" program that keeps the socket always in CORK mode, but uses SIOCPUSH when it doesn't know what the future access patterns will be. yep. Again, the actual data _senders_ may not be aware of the network issues. They are the worker bees, and they may not know or care that they are pushing out data to the network. yep. Ingo, you should realize that people actually _want_ to use things like stdio. [...] yep, i already acknowledged that not all applications want to care about issues like that and rather want to have a 'default behavior' - ie. a persistent cork. i also said that user-space (ie. libc) could maintain a persistent flag itself (a user-space variable) much cheaper than the kernel, and could pass the current 'more' value to the kernel, whenever sendmsg is done. I understand that normal file IO has no 'flag' for MSG_MORE - a pity that no extra flags can be passed in to write(). But this doesn't make it right. It makes it a practical problem, it shows the (perhaps-) weakness of the file API which is right now not prepared to pass 'streaming related info' along with a send, but doesn't make it right. now if your point is that passing a flag (or flags) along every (generic) file-write would be a mistake, that would be a point. But you didn't say that so far. Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Thu, 18 Jan 2001, Andrea Arcangeli wrote:

{
	int val = 1;
	setsockopt(req->sock, IPPROTO_TCP, TCP_CORK, (char *)&val, sizeof(val));
	val = 0;
	setsockopt(req->sock, IPPROTO_TCP, TCP_CORK, (char *)&val, sizeof(val));
}

differ from what you posted. It does the same in my opinion. Maybe we are not talking about the same thing? The above is equivalent to SIOCPUSH _only_ if the caller wasn't using either TCP_NODELAY or TCP_CORK. why? I can restore whatever state i want, the above is just a mechanism to force the push. Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Thu, 18 Jan 2001, Andrea Arcangeli wrote: This is a possible slow (but userspace based) implementation of SIOCPUSH: of course this is what i meant. Let's stop wasting time on this, ok? Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Thu, 18 Jan 2001, Andrea Arcangeli wrote: BTW, the symmetry between getsockopt/setsockopt further shows how SIOCPUSH doesn't fit into the setsockopt options but it fits very well into the ioctl category instead. There's simply no state one can return via getsockopt for the SIOCPUSH functionality. It's not setting any option, it's just doing one thing that controls the I/O. you convinced me. I guess i was just distracted by the common wisdom: 'ioctls are a hack'. But SIOCPUSH *IS* an 'IO control' after all :-) Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Thu, 18 Jan 2001, Andrea Arcangeli wrote: Agreed. However since TCP_CORK logic is more generic than MSG_MORE [...] why? TCP_CORK is equivalent to MSG_MORE, it's just a different representation of the same issue. TCP_CORK needs an extra syscall (in the case of a push event - which might be rare), the MSG_MORE solution needs an extra flag (which is merged with other flags in the send() case). i believe it should rather be a new setsockopt TCP_CORK value (or a new setsockopt constant), not an ioctl. Eg. a value of 2 to TCP_CORK could mean 'force packet boundary now if possible, and don't touch TCP_CORK state'. Doing PUSH from setsockopt(TCP_CORK) looked obviously wrong because it isn't setting any socket state, [...] well, neither is clearing/setting TCP_CORK ... and also because the SIOCPUSH has nothing specific with TCP_CORK, as said it can be useful also to flush the last fragment of data pending in the send queue without having to wait all the unacknowledged data to be acknowledged from the receiver when TCP_NODELAY isn't set. huh? in what way does the following:

{
	int val = 1;
	setsockopt(req->sock, IPPROTO_TCP, TCP_CORK, (char *)&val, sizeof(val));
	val = 0;
	setsockopt(req->sock, IPPROTO_TCP, TCP_CORK, (char *)&val, sizeof(val));
}

differ from what you posted. It does the same in my opinion. Maybe we are not talking about the same thing? Changing the semantics of setsockopt(TCP_CORK, 2) would also break backwards compatibility with all 2.[24].x kernels out there. [this is nitpicking. I'm quite sure all the code uses '1' as the value, not 2.] Ingo
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Thu, 18 Jan 2001, Andrea Arcangeli wrote: I'm all for TCP_CORK but it has the disadvantage of two syscalls for doing the flush of the outgoing queue to the network. And one of those two syscalls is spurious. [...] i believe a network-conscious application should use MSG_MORE - that has no system-call overhead.

+	case SIOCPUSH:
+		lock_sock(sk);
+		__tcp_push_pending_frames(sk, tp, tcp_current_mss(sk), 1);
+		release_sock(sk);
+		break;

i believe it should rather be a new setsockopt TCP_CORK value (or a new setsockopt constant), not an ioctl. Eg. a value of 2 to TCP_CORK could mean 'force packet boundary now if possible, and don't touch TCP_CORK state'. Ingo
Re: Is sendfile all that sexy?
On Sun, 21 Jan 2001, James Sutherland wrote: For many applications, yes - but think about a file server for a moment. 99% of the data read from the RAID (or whatever) is really aimed at the appropriate NIC - going via main memory would just slow things down. patently wrong. Compare the bandwidth of PCI and the bandwidth of memory controllers. It's both slower, has higher latency and uses up more valuable (PCI) bandwidth to do PCI-PCI transfers. The number of situations where PCI-PCI transactions are the preferred method are *very* limited, and i think we should deal with them when we see them. But this has been said at the very beginning of this thread already, please read it all ... Ingo
[patch] wait4-2.4.0-A0
the attached patch (against -pre9) fixes a possibly dangerous sys_wait4() prototype mismatch. Ingo

--- linux/include/linux/sched.h.orig	Mon Jan 22 17:28:36 2001
+++ linux/include/linux/sched.h	Mon Jan 22 17:29:17 2001
@@ -563,6 +563,7 @@
 #define wake_up_interruptible_all(x)	__wake_up((x),TASK_INTERRUPTIBLE, 0)
 #define wake_up_interruptible_sync(x)	__wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
 #define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, nr)
+asmlinkage long sys_wait4(pid_t pid, unsigned int * stat_addr, int options, struct rusage * ru);
 
 extern int in_group_p(gid_t);
 extern int in_egroup_p(gid_t);
--- linux/arch/i386/kernel/signal.c.orig	Mon Jan 22 17:28:25 2001
+++ linux/arch/i386/kernel/signal.c	Mon Jan 22 17:28:31 2001
@@ -26,8 +26,6 @@
 
 #define _BLOCKABLE (~(sigmask(SIGKILL) | sigmask(SIGSTOP)))
 
-asmlinkage int sys_wait4(pid_t pid, unsigned long *stat_addr,
-			 int options, unsigned long *ru);
 asmlinkage int FASTCALL(do_signal(struct pt_regs *regs, sigset_t *oldset));
 
 int copy_siginfo_to_user(siginfo_t *to, siginfo_t *from)
[patch] new, scalable timer implementation, smptimers-2.4.0-B1
a new, 'ultra SMP scalable' implementation of Linux kernel timers is now available for download: http://www.redhat.com/~mingo/scalable-timers/smptimers-2.4.0-B1 the patch is against 2.4.1-pre10 or ac12. The timer design in this implementation is a work of David Miller, Alexey Kuznetsov and myself. Internals: the current 2.4 timer implementation uses a global spinlock for synchronizing access to the global timer lists. This causes excessive cacheline ping-pongs and visible performance degradation under very high TCP networking load (and other, timer-intensive operations). The new implementation introduces per-CPU timer lists and per-CPU spinlocks that protect them. All timer operations, add_timer(), del_timer() and mod_timer() are still O(1) and cause no cacheline contention at all (because all data structures are separated). All existing semantics of Linux timers are preserved, so the patch is 'transparent' to all other subsystems. In addition, the role of TIMER_BH has been redefined, and run_local_timers is used directly from APIC timer interrupts to run timers (not from TIMER_BH). This means that timer expiry is per-CPU as well - it is global in vanilla 2.4. Every timer is started and expired on the CPU where it has been added. Timers get migrated between CPUs if mod_timer() is done on another CPU (because eg. a process using them migrates to another CPU.). In the typical case timer handling is completely localized to one CPU. The new timers still maintain 'semantical compatibility' with older concepts such as the IRQ lock and manipulation of TIMER_BH state. These constructs are quite rare already, in 2.5 they can be removed completely. the patch has been sanity tested on UP-pure, UP-APIC, UP-IOAPIC and SMP systems. Reports/comments/questions/suggestions welcome! Ingo
[patch] raid-B1, 2.4.1-pre11, fixes, cleanups
On Tue, 30 Jan 2001, Neil Brown wrote: -#define MAX_MD_BOOT_DEVS 8 +#define MAX_MD_BOOT_DEVS MAX_MD_DEVS Actually, this is not fine. Check the code that says: indeed - it will work only up to 32 devices. i've fixed the code to not have this assumption - it's init-time code only anyway. There are also a few other cleanups in raid-2.4.1-B1: - CONFIG_MD_BOOT gone. __init thing and people actually use it. - CONFIG_AUTODETECT_RAID gone. __init thing, and can be turned off via command-line. Needs special partition ID to be activated anyway. - static-init cleanups (no need to initialize to zero) - new RAID_AUTORUN ioctl for initrd kernels to be able to start up autostart arrays. code is cleaner and simpler now. The patch removes a total of 7 lines, while adding a new feature :-) Dave, does this patch do the trick for you? (raid-2.4.1-B1 is against the 2.4.1-pre11 kernel.) Ingo

--- linux/fs/partitions/msdos.c.orig	Wed Jul 19 08:29:16 2000
+++ linux/fs/partitions/msdos.c	Mon Jan 29 23:41:53 2001
@@ -36,7 +36,7 @@
 #include "check.h"
 #include "msdos.h"
 
-#if CONFIG_BLK_DEV_MD && CONFIG_AUTODETECT_RAID
+#if CONFIG_BLK_DEV_MD
 extern void md_autodetect_dev(kdev_t dev);
 #endif
@@ -136,7 +136,7 @@
 	add_gd_partition(hd, current_minor, this_sector+START_SECT(p)*sector_size, NR_SECTS(p)*sector_size);
-#if CONFIG_BLK_DEV_MD && CONFIG_AUTODETECT_RAID
+#if CONFIG_BLK_DEV_MD
 	if (SYS_IND(p) == LINUX_RAID_PARTITION) {
 		md_autodetect_dev(MKDEV(hd->major,current_minor));
 	}
@@ -448,7 +448,7 @@
 		continue;
 	add_gd_partition(hd, minor, first_sector+START_SECT(p)*sector_size, NR_SECTS(p)*sector_size);
-#if CONFIG_BLK_DEV_MD && CONFIG_AUTODETECT_RAID
+#if CONFIG_BLK_DEV_MD
 	if (SYS_IND(p) == LINUX_RAID_PARTITION) {
 		md_autodetect_dev(MKDEV(hd->major,minor));
 	}
--- linux/include/linux/raid/md_u.h.orig	Tue Nov 14 22:16:37 2000
+++ linux/include/linux/raid/md_u.h	Mon Jan 29 23:41:53 2001
@@ -22,6 +22,7 @@
 #define GET_ARRAY_INFO		_IOR (MD_MAJOR, 0x11, mdu_array_info_t)
 #define GET_DISK_INFO		_IOR (MD_MAJOR, 0x12, mdu_disk_info_t)
 #define PRINT_RAID_DEBUG	_IO (MD_MAJOR, 0x13)
+#define RAID_AUTORUN		_IO (MD_MAJOR, 0x14)
 
 /* configuration */
 #define CLEAR_ARRAY		_IO (MD_MAJOR, 0x20)
--- linux/drivers/md/md.c.orig	Mon Dec 11 22:19:35 2000
+++ linux/drivers/md/md.c	Mon Jan 29 23:42:53 2001
@@ -2033,68 +2033,65 @@
 struct {
 	int set;
 	int noautodetect;
+} raid_setup_args md__initdata;
-} raid_setup_args md__initdata = { 0, 0 };
-
-void md_setup_drive(void) md__init;
+void md_setup_drive (void) md__init;
 
 /*
  * Searches all registered partitions for autorun RAID arrays
  * at boot time.
  */
-#ifdef CONFIG_AUTODETECT_RAID
-static int detected_devices[128] md__initdata = { 0, };
-static int dev_cnt=0;
+static int detected_devices[128] md__initdata;
+static int dev_cnt;
+
 void md_autodetect_dev(kdev_t dev)
 {
 	if (dev_cnt >= 0 && dev_cnt < 127)
 		detected_devices[dev_cnt++] = dev;
 }
-#endif
 
-int md__init md_run_setup(void)
+static void autostart_arrays (void)
 {
-#ifdef CONFIG_AUTODETECT_RAID
 	mdk_rdev_t *rdev;
 	int i;
 
-	if (raid_setup_args.noautodetect)
-		printk(KERN_INFO "skipping autodetection of RAID arrays\n");
-	else {
-
-		printk(KERN_INFO "autodetecting RAID arrays\n");
+	printk(KERN_INFO "autodetecting RAID arrays\n");
 
-		for (i=0; i<dev_cnt; i++) {
-			kdev_t dev = detected_devices[i];
+	for (i=0; i<dev_cnt; i++) {
+		kdev_t dev = detected_devices[i];
 
-			if (md_import_device(dev,1)) {
-				printk(KERN_ALERT "could not import %s!\n",
-					partition_name(dev));
-				continue;
-			}
-			/*
-			 * Sanity checks:
-			 */
-			rdev = find_rdev_all(dev);
-			if (!rdev) {
-				MD_BUG();
-				continue;
-			}
-			if (rdev->faulty) {
-				MD_BUG();
-				continue;
-			}
-			md_list_add(&rdev->pending, &pending_raid_disks);
+		if (md_import_device(dev,1)) {
+			printk(KERN_ALERT "could not import %s!\n",
+
Re: Desk check of raid5.c patch from mtew@cds.duke.edu?
On Mon, 29 Jan 2001, Quim K Holland wrote: I've been following the recent 2.4.1-pre series and am wondering why the following one-liner (obviously correct) patch has not been applied. [...] - return_ok = bh->b_reqnext; + return_fail = bh->b_reqnext; oops - i do have it in my tree, somehow it escaped. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todo with ECN)
On Tue, 30 Jan 2001, jamal wrote: Kernel | tput | sender-CPU | receiver-CPU: 2.4.0-pre3 NSF | 99MB/s | 87% | 23%; 2.4.0-pre3 +ZC SF | 68MB/s | 8% | 8% - isn't the CPU utilization difference amazing? :-) a couple of questions: - is this UDP or TCP based? (UDP i guess) - what wsize/rsize are you using? What do these requests look like on the network, ie. are they sufficiently MTU-sized? - what happens if you run multiple instances of the testcode, does it saturate bandwidth (or CPU)? Ingo
Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todo with ECN)
On Tue, 30 Jan 2001, jamal wrote: - is this UDP or TCP based? (UDP i guess) TCP well then i'd suggest to do: echo 10 10 10 > /proc/sys/net/ipv4/tcp_wmem does this make any difference? Ingo
Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todo with ECN)
On Wed, 31 Jan 2001, Malcolm Beattie wrote: Without the raised tcp_wmem setting I was getting 81 MByte/s. With tcp_wmem set as above I got 86 MByte/s. Nice increase. Any other setting I can tweak apart from {r,w}mem_max and tcp_{w,r}mem? The CPU on the client (350 MHz PII) is the bottleneck: gensink4 maxes out at 69 Mbyte/s pulling TCP from the server and 94 Mbyte/s pushing. (The other system, 733 MHz PIII pushes 100MByte/s UDP with ttcp but the client drops most of it). you can speed up the client significantly by using the MSG_TRUNC option ('truncate message'). It will zap incoming data without copying it into user-space. (you can use this for the 'bulk transfer' part - the initial protocol handling code needs to see the actual data.) This way you should be able to saturate the server even more. Ingo
Re: start_thread question...
On Sun, 20 May 2001, Dave Airlie wrote: I'm implementing start_thread for the VAX port and am wondering does start_thread have to return to load_elf_binary? I'm working on the init thread and what is happening is it is returning the whole way back to the execve caller .. which I know shouldn't happen. start_thread() doesn't do what one would intuitively think it does. start_thread() simply prepares the new task's register set to be ready to start user-space (which task is the current task as well, so certain current CPU registers might have to be manually bootstrapped as well), but start_thread() does not actually start execution of user-space code yet. (a more correct name for start_thread() would be prepare_user_thread().) so I suppose what I'm looking for is the point where the user space code gets control... is it when the registers are set in the start_thread? if so how does start_thread return execution starts when the process returns from sys_execve(). By that time we have already changed pagetables and other context information, dropped basically everything from the previous context - without actually doing a context-switch. In fact sys_execve() has an implicit context-switch, without ever changing the kernel-stack though. On the VAX we have to call a return from interrupt to get to user space and I'm trying to figure out where this should happen... this is how it happens on x86 too. Basically you start the new binary by returning from a syscall that has bootstrapped all userspace context - this approach should work on any architecture. (because every architecture has to be able to execute user-space code after syscalls.) Ingo
Re: 2.4.4 del_timer_sync oops in schedule_timeout
On Sat, 19 May 2001, Jacob Luna Lundberg wrote: This is 2.4.4 with the aic7xxx driver version 6.1.13 dropped in. Unable to handle kernel paging request at virtual address 78626970 this appears to be some sort of DMA-corruption or other memory scribble problem. hex 78626970 is ASCII "pibx", which points in the direction of some sort of disk-related DMA corruption. we haven't had any similar crash in del_timer_sync() for ages. Ingo
Re: 2.4.4 del_timer_sync oops in schedule_timeout
On Sun, 20 May 2001, Jacob Luna Lundberg wrote: Unable to handle kernel paging request at virtual address 78626970 this appears to be some sort of DMA-corruption or other memory scribble problem. hex 78626970 is ASCII "pibx", which points in the direction of some sort of disk-related DMA corruption. we haven't had any similar crash in del_timer_sync() for ages. Ahh. Thanks then, I'll go look hard at the disk in that box. :) not necessarily the disk. it can be any sort of overheating or other thermal noise (unlikely), or SCSI/IDE cable problem (likely), or driver problem (likely too). Disk faults typically show very different symptoms. Ingo
Re: Linux-2.4.5
On Sat, 26 May 2001, Andrea Arcangeli wrote: On Sat, May 26, 2001 at 02:11:15PM -0400, Ben LaHaise wrote: No. It does not fix the deadlock. Neither does the patch you posted. can you give a try if you can deadlock 2.4.5aa1 just in case, and post a SYSRQ+T + system.map if it still deadlocks? Andrea, can you rather start running the Cerberus testsuite instead? All these deadlocks happen pretty early during the test, and we've been fixing tons of these deadlocks, and no, it's not easy. Ingo
[patch] severe softirq handling performance bug, fix, 2.4.5
i've been seeing really bad average TCP latencies on certain gigabit cards (~300-400 microseconds instead of the expected 100-200 microseconds), ever since softnet went into the main kernel, and never found a real explanation for it, until today. the problem always went away when i tried to use tcpdump or strace, so the bug remained hidden and was hard to prove that it actually existed. (apart from the bad lat_tcp numbers.) We found many related bugs, but this problem remained. tcpdumps done on the network did not show any fault of the TCP stack. The lat_tcp latencies fluctuated a lot, but for certain cards the latencies were stable, so i suspected some sort of hw problem. The loopback networking device never showed these problems, which added to the mystery. the problem turned out to be a severe softirq handling bug in the x86 code. background: soft interrupts were introduced as a generic kernel framework around January 2000, as part of the softnet networking-rewrite, that predated the final scalability rewrite of the Linux TCP/IP networking code. Soft interrupts have unique semantics, they can be best described as 'IRQ-triggered atomic system calls'. (unlike bottom halves, soft-IRQs do not preempt kernel code.) soft-IRQs, like their name suggests, are used from device interrupts ('hard interrupts') to trigger 'background' work related to interrupts. Soft-IRQs are triggered per-CPU, and they are supposed to execute whenever nothing else is done by the kernel on that particular CPU. Softirqs are executed with interrupts enabled, so hard interrupts can re-activate them while they are executing. do_softirq() is a kernel function that returns with IRQs disabled and at this point it's guaranteed that there are no more pending softirqs for this CPU. this mechanism was the intention, but not the reality. In two important and frequently used code paths it was possible for an active soft-IRQ to go unnoticed: i measured as long as 140 milliseconds (!!!) 
latency between softirq activation and softirq execution in certain cases. This is obviously bad behavior. the two error cases are: #1 hard-IRQ interrupts user-space code, activates softirq, and returns to user-space code #2 hard-IRQ interrupts the idle task, activates softirq and returns to the idle task. category #1 is easy to fix, in entry.S we have to check active softirqs not only in the exception and ret-from-syscall cases, but also in the IRQ-ret-to-userspace case. category #2 is subtle, because the idle process is kernel code, so returning to it we do not execute active softirqs. The main two types of idle handlers both had a window to 'miss' softirq execution: - the HLT-based default handler could be called after schedule()'s check for softirqs, but after enabling IRQs. In this case an interrupt handler has a window to activate a softirq and neither the IRQ return code, nor the idle loop would execute it immediately. The fix is to do a softirq check right before the safe_halt call. - the idle-poll handler did not check for softirqs either, it now does this in every iteration. with the attached softirq-2.4.5-A0 patch applied to vanilla 2.4.5, i see picture-perfect lat_tcp latencies of 109 microseconds over a real gigabit network. I see very stable (and very good) TUX latencies as well. TCP bandwidth got better as well, probably due to the caching-locality bonus when executing softirqs right after hardirqs. [I'd like to ask everyone who had TCP latency problems (or other networking performance problems) to test 2.4.5 with this patch applied - thanks!] impact of the bug: all softirq-using code is affected, mostly networking. The loopback net driver was not affected because it's not interrupt-based. The bug went away due to strace or tcpdump because those two utilities pumped system-calls into the system which 'fixed' the softirq handling bug. 
(other softirq-based code is the tasklet code, and the keyboard code is using tasklets, so the keyboard code might be affected as well.) Ingo --- linux/arch/i386/kernel/entry.S.orig Sat May 26 19:20:48 2001 +++ linux/arch/i386/kernel/entry.S Sat May 26 19:21:52 2001 @@ -214,7 +214,6 @@ #endif jne handle_softirq -ret_with_reschedule: cmpl $0,need_resched(%ebx) jne reschedule cmpl $0,sigpending(%ebx) @@ -275,7 +274,7 @@ movl EFLAGS(%esp),%eax # mix EFLAGS and CS movb CS(%esp),%al testl $(VM_MASK | 3),%eax # return to VM86 mode or non-supervisor? - jne ret_with_reschedule + jne ret_from_sys_call jmp restore_all ALIGN --- linux/arch/i386/kernel/process.c.orig Sat May 26 19:21:56 2001 +++ linux/arch/i386/kernel/process.c Sat May 26 19:28:06 2001 @@ -79,8 +79,12 @@ */ static void default_idle(void) { + int this_cpu = smp_processor_id(); + if (current_cpu_data.hlt_works_ok && !hlt_counter) { __cli(); +
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sun, 20 May 2001, Alexander Viro wrote: Linus, as much as I'd like to agree with you, you are hopeless optimist. 90% of drivers contain code written by stupid gits. 90% of drivers contain code written by people who do driver development in their spare time, with limited resources, most of the time serving as a learning exercise. And they do this freely and for fun. Accusing them of being 'stupid gits' is just mischaracterising the situation. People do not get born as VFS hackers, there is a very steep learning curve, and only a few make it to having knowledge like yours. Much of the learning curve of various people has traces in drivers/*, it's more like the history of Linux than some coherent image of people's capabilities. Ingo
Re: [patch] softirq-2.4.5-B0
On Sun, 27 May 2001, David S. Miller wrote: Hooray, some sanity in this thread finally :-) [ finally i had some sleep after a really long debugging session :-| ] the attached softirq-2.4.5-B0 patch fixes this problem by calling do_softirq() from local_bh_enable() [if the bh count is 0, to avoid recursion]. Yikes! I do not like this fix. i think we have no choice, unfortunately. and i think function calls are not that scary anymore, especially not with regparms and similar compiler optimizations. The function is simple, the function just goes in and returns in 90% of the cases, which should be handled nicely by most BTBs. we have other fundamental primitives that are a function call too, eg. dget(), and they are used just as frequently. In 2.4 we were moving inlined code into functions in a number of cases, and it appeared to work out well in most cases. I'd rather local_bh_enable() not become a more heavy primitive. I know, in one respect it makes sense because it parallels how hardware interrupts work, but now this thing is a function call instead of a counter bump :-( i believe the important thing is that the function has no serialization or other 'heavy' stuff. BHs had the misdesign of not being restarted after being re-enabled, and it caused performance problems - we should not repeat history. Ingo
Re: [patch] severe softirq handling performance bug, fix, 2.4.5
On Sun, 27 May 2001, Andrea Arcangeli wrote: Yes the stock kernel. yep you are right. i had this fixed too at a certain point, there is one subtle issue: under certain circumstances tasklets re-activate the tasklet softirq(s) from within the softirq handler, which leads to infinite loops if we just naively restart softirq handling. This fix is not in the -B0 patch yet. Ingo
[patch] ioapic-2.4.5-A1
the attached ioapic-2.4.5-A1 patch includes a number of important IO-APIC related fixes (against 2.4.5-ac3): - correctly handle bridged devices that are not listed in the mptable directly. This fixes eg. dual-port eepro100 devices on Compaq boxes with such PCI layout: -+-[0d]---0b.0 +-[05]-+-02.0 | \-0b.0 \-[00]-+-02.0 +-03.0-[01]--+-04.0=== eth0 |\-05.0=== eth1 +-0b.0 +-0c.0 +-0d.0 +-0e.0 +-0f.0 +-14.0 +-14.1 +-19.0 +-1a.0 \-1b.0 without the patch the eepro100 devices get misdetected as XT-PIC IRQs and their interrupts are stuck. - the srcbus entry in the mptable does not have to be translated into a PCI-bus value. - add more APIC versions to the whitelist - initialize mp_bus_id_to_pci_bus[] correctly, so that we can detect nonlisted/bridged PCI busses more accurately. the patch should only affect systems that were not working properly before, but it might break broken-mptable systems - we'll see. Ingo --- linux/arch/i386/kernel/io_apic.c.orig Tue May 29 12:13:15 2001 +++ linux/arch/i386/kernel/io_apic.c Tue May 29 12:19:55 2001 @@ -256,10 +256,16 @@ */ static int pin_2_irq(int idx, int apic, int pin); -int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pci_pin) +int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pin) { int apic, i, best_guess = -1; + Dprintk("querying PCI -> IRQ mapping bus:%d, slot:%d, pin:%d.\n", + bus, slot, pin); + if (mp_bus_id_to_pci_bus[bus] == -1) { + printk(KERN_WARNING "PCI BIOS passed nonexistent PCI bus %d!\n", bus); + return -1; + } for (i = 0; i < mp_irq_entries; i++) { int lbus = mp_irqs[i].mpc_srcbus; @@ -270,14 +276,14 @@ if ((mp_bus_id_to_type[lbus] == MP_BUS_PCI) && !mp_irqs[i].mpc_irqtype && - (bus == mp_bus_id_to_pci_bus[mp_irqs[i].mpc_srcbus]) && + (bus == lbus) && (slot == ((mp_irqs[i].mpc_srcbusirq >> 2) & 0x1f))) { int irq = pin_2_irq(i,apic,mp_irqs[i].mpc_dstirq); if (!(apic || IO_APIC_IRQ(irq))) continue; - if (pci_pin == (mp_irqs[i].mpc_srcbusirq & 3)) + if (pin == (mp_irqs[i].mpc_srcbusirq & 3)) return irq; /* * Use the first all-but-pin 
matching entry as a @@ -738,9 +744,11 @@ printk(KERN_DEBUG "register #01: %08X\n", *(int *)&reg_01); printk(KERN_DEBUG "... : max redirection entries: %04X\n", reg_01.entries); if ((reg_01.entries != 0x0f) && /* older (Neptune) boards */ + (reg_01.entries != 0x11) && (reg_01.entries != 0x17) && /* typical ISA+PCI boards */ (reg_01.entries != 0x1b) && /* Compaq Proliant boards */ (reg_01.entries != 0x1f) && /* dual Xeon boards */ + (reg_01.entries != 0x20) && (reg_01.entries != 0x22) && /* bigger Xeon boards */ (reg_01.entries != 0x2E) && (reg_01.entries != 0x3F) --- linux/arch/i386/kernel/mpparse.c.orig Tue May 29 12:13:15 2001 +++ linux/arch/i386/kernel/mpparse.c Tue May 29 12:13:46 2001 @@ -36,7 +36,7 @@ */ int apic_version [MAX_APICS]; int mp_bus_id_to_type [MAX_MP_BUSSES]; -int mp_bus_id_to_pci_bus [MAX_MP_BUSSES] = { -1, }; +int mp_bus_id_to_pci_bus [MAX_MP_BUSSES] = { [0 ... MAX_MP_BUSSES-1] = -1 }; int mp_current_pci_id; /* I/O APIC entries */ --- linux/arch/i386/kernel/pci-irq.c.orig Tue May 29 12:13:15 2001 +++ linux/arch/i386/kernel/pci-irq.c Tue May 29 12:13:46 2001 @@ -660,10 +660,12 @@ if (pin) { pin--; /* interrupt pins are numbered starting from 1 */ irq = IO_APIC_get_PCI_irq_vector(dev->bus->number, PCI_SLOT(dev->devfn), pin); -/* - * Will be removed completely if things work out well with fuzzy parsing - */ -#if 0 + /* +* Busses behind bridges are typically not listed in the MP-table. +* In this case we have to look up the IRQ based on the parent bus, +* parent slot, and pin number. The SMP code detects such bridged +* busses itself so we should get into this branch reliably. +*/ if (irq < 0 && dev->bus->parent) { /* go back to the bridge */ struct pci_dev * bridge = dev->bus->self; @@ -674,7 +676,6 @@ printk(KERN_WARNING "PCI: using PPB(B%d,I%d,P%d) to get irq %d\n",
[patch] raid-2.4.5-A0, minor fix
the attached patch (against 2.4.5-ac3) fixes a compiler warning (triggered by gcc 2.96) in the RAID include files. Ingo --- linux/include/linux/raid/md_k.h.orig Tue May 29 12:50:30 2001 +++ linux/include/linux/raid/md_k.h Tue May 29 12:50:40 2001 @@ -38,6 +38,7 @@ case RAID5: return 5; } panic("pers_to_level()"); + return 0; } extern inline int level_to_pers (int level)
[patch] softirq-2.4.5-E5
the attached softirq-2.4.5-E5 patch (against 2.4.5-ac3) tries to solve all softirq, tasklet and scheduling latency problems i could identify while testing TCP latencies over gigabit connections. The list of problems, as of 2.4.5-ac3: - the need_resched check in the arch/i386/kernel/entry.S syscall/irq return code has a race that makes it possible to miss a reschedule for up to smp_num_cpus*HZ jiffies. - the softirq check in entry.S has a race as well. - on x86, APIC interrupts do not trigger do_softirq(). This is especially problematic with the smptimers patch, which is APIC-irq driven. - local_bh_disable() blocks the execution of do_softirq(), and it takes a nondeterministic amount of time after local_bh_enable() for the next do_softirq() to be triggered. - do_softirq() does not execute softirqs that got activated meanwhile, and the next do_softirq() run happens after a nondeterministic amount of time. - the tasklet design re-enables their driving softirq occasionally, which makes 'complete' softirq processing impossible. the patch (tries to) solve all these problems. The changes: - all softirqs are guaranteed to be handled after do_softirq() returns (even those which are activated during softirq run) - softirq handling is immediately restarted if bhs are re-enabled again. - the tasklet code got rewritten (but externally visible semantics are kept) to not rely on marking the softirq busy. The new code is a bit tricky, but it should be correct. - some code got a bit slower, some code got a bit faster. I believe most of the changes made the softirq/tasklet implementation clearer. - some minor uninlining of too big inline functions, and other cleanup was done as well. - no global serialization was added to any part of the softirq or tasklet code, so scalability is not impacted. the patch is stable under every workload i tried, handles softirqs and tasklets with the minimum possible latency, thus it maximizes cache locality. 
The patch has no known bug, and the kernel has no lost-wakeup or lost-softirq problem i know of. TCP latencies and TCP throughput are picture-perfect. Comments? Ingo --- linux/kernel/softirq.c.orig Fri Dec 29 23:07:24 2000 +++ linux/kernel/softirq.c Tue May 29 17:41:14 2001 @@ -52,12 +52,12 @@ int cpu = smp_processor_id(); __u32 active, mask; + local_irq_disable(); if (in_interrupt()) - return; + goto out; local_bh_disable(); - local_irq_disable(); mask = softirq_mask(cpu); active = softirq_active(cpu) & mask; @@ -71,7 +71,6 @@ local_irq_enable(); h = softirq_vec; - mask &= ~active; do { if (active & 1) @@ -82,12 +81,13 @@ local_irq_disable(); - active = softirq_active(cpu); - if ((active &= mask) != 0) + active = softirq_active(cpu) & mask; + if (active) goto retry; } - local_bh_enable(); + __local_bh_enable(); +out: /* Leave with locally disabled hard irqs. It is critical to close * window for infinite recursion, while we help local bh count, @@ -121,6 +121,45 @@ struct tasklet_head tasklet_vec[NR_CPUS] __cacheline_aligned; +void tasklet_schedule(struct tasklet_struct *t) +{ + unsigned long flags; + int cpu; + + cpu = smp_processor_id(); + local_irq_save(flags); + /* +* If nobody is running it then add it to this CPU's +* tasklet queue. 
+*/ + if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state) && + tasklet_trylock(t)) { + t->next = tasklet_vec[cpu].list; + tasklet_vec[cpu].list = t; + __cpu_raise_softirq(cpu, TASKLET_SOFTIRQ); + tasklet_unlock(t); + } + local_irq_restore(flags); +} + +void tasklet_hi_schedule(struct tasklet_struct *t) +{ + unsigned long flags; + int cpu; + + cpu = smp_processor_id(); + local_irq_save(flags); + + if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state) && + tasklet_trylock(t)) { + t->next = tasklet_hi_vec[cpu].list; + tasklet_hi_vec[cpu].list = t; + __cpu_raise_softirq(cpu, HI_SOFTIRQ); + tasklet_unlock(t); + } + local_irq_restore(flags); +} + static void tasklet_action(struct softirq_action *a) { int cpu = smp_processor_id(); @@ -129,37 +168,37 @@ local_irq_disable(); list = tasklet_vec[cpu].list; tasklet_vec[cpu].list = NULL; - local_irq_enable(); - while (list != NULL) { + while (list) {
Re: IRQ handling in SMP environment, kernel 2.4.3
On Tue, 29 May 2001, Hilik Stein wrote: I am running a Linux machine with a 1GB Ethernet card which takes a huge amount of packets, which results in many HW interrupts. is it possible to make sure that only CPU #1 handles all the hardware interrupts generated by the NIC? or even all the hardware interrupts in the system, if it's too much to ask to filter IRQs based on origin? thanks Hilik Stein yes this is possible with the 2.4 kernels. Check out Documentation/IRQ-affinity.txt. You can bind hardware interrupts to any CPU (or arbitrary group of CPUs). Ingo
Re: Emulate RDTSC
On Tue, 29 May 2001, Jaswinder Singh wrote: What is the nice way (in accuracy and performance) to emulate RDTSC in Linux for those architectures that don't support RDTSC, like the Hitachi SH processors? if the hardware provides no way to get some accurate estimation of current time, then there is no way to solve this problem in a generic way. Typically there are some cycle-accuracy counters in the CPU (ideal situation), or sometimes there is a counter in some external device (eg. the i8254 timer counter), but access to these tends to be slow and typically they are quite coarse as well. Ingo
Re: [ PATCH ]: disable pcspeaker kernel: 2.4.2 - 2.4.5
By making this (logical, and needed) feature unconditional, your patch's size and complexity is reduced by 80%. (see the attached pc_speaker.patch2) Ingo diff -u --recursive linux-2.4.5/drivers/char/vt.c linux-2.4.5-nc/drivers/char/vt.c --- linux-2.4.5/drivers/char/vt.c Fri Feb 9 20:30:22 2001 +++ linux-2.4.5-nc/drivers/char/vt.c Wed May 9 23:47:36 2001 @@ -40,6 +41,7 @@ #include <asm/vc_ioctl.h> #endif /* CONFIG_FB_COMPAT_XPMAC */ +extern int pcspeaker_enabled; char vt_dont_switch; extern struct tty_driver console_driver; @@ -112,6 +117,9 @@ unsigned int count = 0; unsigned long flags; + /* is the pcspeaker enabled or disabled ? 0=disabled,1=enabled */ + if (!pcspeaker_enabled) + return; if (hz > 20 && hz < 32767) count = 1193180 / hz; diff -u --recursive linux-2.4.5/include/linux/sysctl.h linux-2.4.5-nc/include/linux/sysctl.h --- linux-2.4.5/include/linux/sysctl.h Tue May 29 17:56:29 2001 +++ linux-2.4.5-nc/include/linux/sysctl.h Mon May 28 19:24:08 2001 @@ -118,7 +118,8 @@ KERN_SHMPATH=48,/* string: path to shm fs */ KERN_HOTPLUG=49,/* string: path to hotplug policy agent */ KERN_IEEE_EMULATION_WARNINGS=50, /* int: unimplemented ieee instructions */ - KERN_S390_USER_DEBUG_LOGGING=51 /* int: dumps of user faults */ + KERN_S390_USER_DEBUG_LOGGING=51, /* int: dumps of user faults */ + KERN_DISABLE_PC_SPEAKER=52 /* int: speaker on or off */ }; diff -u --recursive linux-2.4.5/kernel/sysctl.c linux-2.4.5-nc/kernel/sysctl.c --- linux-2.4.5/kernel/sysctl.c Tue May 29 17:55:59 2001 +++ linux-2.4.5-nc/kernel/sysctl.c Wed May 9 23:44:30 2001 @@ -48,6 +49,7 @@ extern int nr_queued_signals, max_queued_signals; extern int sysrq_enabled; +int pcspeaker_enabled; /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ static int maxolduid = 65535; static int minolduid; @@ -212,6 +217,8 @@ 0444, NULL, proc_dointvec}, {KERN_RTSIGMAX, "rtsig-max", &max_queued_signals, sizeof(int), 0644, NULL, proc_dointvec}, + {KERN_DISABLE_PC_SPEAKER, "pcspeaker", &pcspeaker_enabled, 
sizeof(int), +0644, NULL, proc_dointvec}, #ifdef CONFIG_SYSVIPC {KERN_SHMMAX, "shmmax", &shm_ctlmax, sizeof (size_t), 0644, NULL, proc_doulongvec_minmax},
Re: [ PATCH ]: disable pcspeaker kernel: 2.4.2 - 2.4.5
less code / one int more in the kernel or more code and #ifs / one int less in the kernel if the #ifdefs bloat the code 4 times the size of the simple patch, then we obviously want 4 bytes more in the kernel. And what about the code from kernel/sys.c ? The version you provided doesn't take care of what's the default value of pcspeaker. This would make it undefined, which is not really good. the default value is 0, that is good enough. Ingo
Re: pte_page
On Wed, 30 May 2001 [EMAIL PROTECTED] wrote: I use the 'pgt_offset', 'pmd_offset', 'pte_offset' and 'pte_page' inside a module to get the physical address of a user space virtual address. The physical address returned by 'pte_page' is not page aligned whereas the virtual address was page aligned. Can somebody tell me the reason? __pa(page_address(pte_page(pte))) is the address you want. [or pte_val(*pte) & ~(PAGE_SIZE-1) on x86, but this is platform-dependent.] Also, can i use these functions to get the physical address of a kernel virtual address using init_mm? nope. Eg. on x86 these functions only walk normal 4K page pagetables, they do not walk 4MB pages correctly. (which are set up on pentiums and better CPUs, unless mem=nopentium is specified.) a kernel virtual address can be decoded by simply doing __pa(kaddr). If the page is a highmem page [and you have the struct page pointer] then you can do [(page - mem_map) << PAGE_SHIFT] to get the physical address, but only on systems where mem_map[] starts at physical address 0. Ingo
Re: [ PATCH ]: disable pcspeaker kernel: 2.4.2 - 2.4.5
On Wed, 30 May 2001, Nico Schottelius wrote: the default value is 0, that is good enough. hmm.. I don't think so... value of 1 would be much better, because 0 normally disables the speaker. i confused the value. Yes, an initialization to 1 would be correct, i.e.: +++ linux-2.4.5-nc/kernel/sysctl.c Wed May 9 23:44:30 2001 @@ -48,6 +49,7 @@ extern int nr_queued_signals, max_queued_signals; extern int sysrq_enabled; +int pcspeaker_enabled = 1; Ingo
Re: pte_page
On Wed, 30 May 2001, Pete Wyckoff wrote: __pa(page_address(pte_page(pte))) is the address you want. [or pte_val(*pte) & ~(PAGE_SIZE-1) on x86, but this is platform-dependent.] Does this work on x86 non-kmapped highmem user pages too? (i.e. is page->virtual valid for every potential user page.) you are right, the highmem-compatible solution is to use (page - mem_map) as the physical page index. Ingo
Re: [patch] severe softirq handling performance bug, fix, 2.4.5
On Sun, 27 May 2001, Andrea Arcangeli wrote: I mean everything is fine until the same softirq is marked active again under do_softirq, in such case neither the do_softirq in do_IRQ will run it (because we are in the critical section and we hold the per-cpu locks), nor we will run it again ourself from the underlying do_softirq to avoid live locking into do_softirq. if you mean the stock kernel, this scenario you describe is not how it behaves, because only IRQ contexts can mark a softirq active again. And those IRQ contexts will run do_IRQ() naturally, so while *this* do_softirq() invocation won't run those reactivated softirqs, the IRQ context that just triggered the softirq will do so. the real source of softirq latencies is the local_bh_disable()/enable() behavior, see my previous patch. Ingo
Re: [CHECKER] 4 security holes in 2.4.4-ac8
On Tue, 29 May 2001, Dawson Engler wrote: Believe it or not, this one is OK :-) All callers pass in a pointer to a local stack kernel variable in raddr. Ah. I assumed that sys_* meant that all pointers were from user space --- is this generally not the case? (Also, are there other functions called directly from user space that don't have the sys_* prefix?) to automate this for the Stanford checker i've attached the 'getuserfunctions' script that correctly extracts these function names from the 2.4.5 x86 entry.S file. unfortunately the validation of the script will always be manual work, although for the lifetime of the 2.4 kernel the actual format of the entry.S file is not going to change. To make this automatic, i've added a md5sum to the script itself, if entry.S changes then someone has to review the changes manually. It's important to watch the md5 checksum, because new system-calls can be added in 2.4 as well. a few interesting facts. Functions that are called from entry.S but do not have the sys_ prefix: do_nmi do_signal do_softirq old_mmap old_readdir old_select save_v86_state schedule schedule_tail syscall_trace do_divide_error do_coprocessor_error do_simd_coprocessor_error do_debug do_int3 do_overflow do_bounds do_invalid_op do_coprocessor_segment_overrun do_double_fault do_invalid_TSS do_segment_not_present do_stack_segment do_general_protection do_alignment_check do_page_fault do_machine_check do_spurious_interrupt_bug functions in the kernel source that have the sys_ prefix and use asmlinkage but are not called from the x86 entry.S file: sys_accept sys_bind sys_connect sys_gethostname sys_getpeername sys_getsockname sys_getsockopt sys_listen sys_msgctl sys_msgget sys_msgrcv sys_msgsnd sys_recv sys_recvfrom sys_recvmsg sys_semctl sys_semget sys_semop sys_send sys_sendmsg sys_sendto sys_setsockopt sys_shmat sys_shmctl sys_shmdt sys_shmget sys_shutdown sys_socket sys_socketpair sys_utimes the list is pretty big. 
There are 33 functions that are called from entry.S but do not have the sys_ prefix or do not have the asmlinkage declaration. NOTE: there are other entry points into the kernel's 'protection domain' as well, and not all of them are through function interfaces. Some of these interfaces pass untrusted pointers and/or untrusted parameters directly, but most of them pass a pointer to a CPU registers structure which is stored on the kernel stack (thus the pointer can be trusted), but the contents of the registers structure are untrusted and must not be used unchecked.

1) IRQ handling, trap handling, exception handling entry points. I've attached the 'getentrypoints' script that extracts these addresses from the i386 tree:

  divide_error debug int3 overflow bounds invalid_op
  device_not_available double_fault coprocessor_segment_overrun
  invalid_TSS segment_not_present stack_segment general_protection
  spurious_interrupt_bug coprocessor_error alignment_check
  machine_check simd_coprocessor_error system_call lcall7 lcall27

all of these functions get parameters passed that are untrusted.

2) bootup parameter passing. there is a function entry point, start_kernel, but there is also lots of implicit parameter passing, values filled out by the boot code, and parameters stored in hardware devices (eg. PCI settings and more). These all are theoretical protection domain entry points, but impossible to check automatically - the validity of current system state will have to be checked manually. (and in most cases it can be trusted - but not all cases.) Some 'unexpected' boot-time entry points: initialize_secondary on SMP systems for example.

3) manually constructed unsafe entry points which are hard to automate. include/asm-i386/hw_irq.h's BUILD macros are used in a number of places. One type of IRQ building uses do_IRQ() as an entry point. 
The SMP code builds the following entry points:

  reschedule_interrupt invalidate_interrupt call_function_interrupt
  apic_timer_interrupt error_interrupt spurious_interrupt

most of these pass no parameters, but apic_timer_interrupt does get untrusted parameters.

4) BIOS exit/entry points, eg in the APM code. Impossible to check, we have to trust the BIOS's code.

i think this mail should be a more or less complete description of all entry points into the kernel. (Let me know if i missed any of them, or any of the scripts misidentifies entry points.)

Ingo

grep -E 'set_trap_gate|set_system_gate|set_call_gate' arch/i386/*/*.c arch/i386/*/*.h | grep -v 'static void' | cut -d, -f2- | sed 's///g' | cut -d\) -f1

if [ "`md5sum arch/i386/kernel/entry.S`" != "0e19b0892f4bd25015f5f1bfe90b441a  arch/i386/kernel/entry.S" ]; then
	echo "entry.S file's MD5sum changed! Please revalidate the changes and change the md5sum in this script."
	exit -1
fi
(grep 'call S' arch/i386/kernel/entry.S | grep '[()]'; grep '\.long SYMBOL_NAME'
Re: Reserving a (large) memory block
On Thu, 31 Aug 2000, Alan Cox wrote: We then just follow the bios. You can also reserve blocks of memory by hacking arch/i386/mm/init.c and marking them reserved in 2.4 there is an explicit interface for this that also guarantees that the allocation consists of fully valid RAM (no matter how complex the RAM map): alloc_bootmem(). We allocate 300MB+ worth of mem_map[] with this on multi-gigabyte boxes. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
[Announce] TUX alpha source code release
We are pleased to announce that the TUX kernel-space HTTP-subsystem is available for download at: ftp://ftp.redhat.com/pub/redhat/tux/tux-hawaii/ WARNING: this is a developer-only, alpha release. The 1.0 'consumer' release will happen by the end of September. This release is useless to you unless you are a kernel developer, and even in that case it might eat your data, start World War III, or drink your coffee. As of now, it is possible to cause TUX to BUG() via a simple browser URL - sanity checks are not handled too nicely yet. You have been warned. Further TUX development will be coordinated through the [EMAIL PROTECTED] mailing list. (See the attached README file for how to subscribe.) Many thanks to Michael K. Johnson and Matt Wilson for making the August release possible :-) Ingo the README file: --- TUX: The Hawaii Release This is a developer-only preview of TUX. It is incomplete; some features are stubbed out or incompletely implemented, and packaging is incomplete. It is provided in source-code-only form at this time. It is not yet intended for enterprise use. It has not undergone a security audit. It is not guaranteed to run SPECweb99 correctly, because there has been further development since the accepted SPECweb99 runs were done, and that development may well have diverged from SPECweb99 and web standards compliance, and has probably changed performance. It is here so that developers besides those at Red Hat can participate in the process of finalizing the TUX APIs, configuration options, and documentation. *** If it breaks, you get to keep both pieces. *** You have been warned! The development work will happen on the mailing list [EMAIL PROTECTED] You can join the list by sending a mail with a subject of "Subscribe" to [EMAIL PROTECTED] The packages are included only in source form. The tux package contains some basic documentation in Docbook format; if you build the tux package it will build html documentation from the Docbook. 
The kernel package, when built, builds the tux package, currently only for the i686 enterprise kernel binary package. Because TUX's memory model clashes with what is needed for the enterprise kernel, that will change; it's just a placeholder for now. The packages have been tested more on the Pinstripe beta than on Red Hat Linux 6.2 at this time. They will receive more testing on Red Hat Linux 6.2 in the future. We plan to make a full release by the end of September. That release will be validated against SPECweb99, will have updated APIs, better configuration, richer documentation, and will be made more readily available, with a defect tracking program to support it. For more information, join the mailing list. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: thread rant
On Sat, 2 Sep 2000, Alexander Viro wrote: Why? I would say that bad thing about SysV shared memory is that it's _not_ sufficiently filesystem-thing - a special API where 'create a file on ramfs and bloody mmap() it' would be sufficient. Why bother with special sets of syscalls? what i mean is that i dont like the cleanup issues associated with SysV shared memory - eg. it can hang around even if all users have exited, so auto-cleanup of resources is not possible. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
On 2 Sep 2000, Jes Sorensen wrote: You can't DMA directly from a file cache page unless you have a network card that does scatter/gather DMA and surprise surprise, 80-90% of the cards on the market don't support this. [...] exactly. The TUX patch solves this by copying 'multi-fragment skbs' into a temporary single-fragment skb, if the card doesnt support scatter-gather or 64-bit DMA. This way the copying is delayed as much as possible, to the point where we queue the packet to the network device. Besides that you need to do copy-on-write if you want to be able to do zero copy on write() from user space [...] i agree that this is hard - i'm not sure whether we want to go to the pain of enabling anonymous-buffer write()s to do zero-copy. My plan is to enable sendfile() first - it should cover all the important high-performance server cases. The point is that a write() is only used if some sort of dynamic data is generated on the fly. If data is generated once and sent once then there is not much win in zero-copy. If data is generated once and reused multiple times afterwards then it should rather be written into a temporary file, and then it can be sent out via sendfile(). Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
On Sat, 2 Sep 2000, Jeff V. Merkey wrote: **ALL** Netware network drivers support a scatter/gather programming interface, whether the hardware does or not. In NetWare, the drivers get passed a fragment list in what's called an ECB (Event Control Block). It's the driver's responsibility to assemble the fragment lists. We did it this way to support scatter/gather cards and non-scatter-gather cards in one interface. Those drivers that do not support scatter gather DMA operations copy to a local buffer to assemble the packet. [...] this is exactly what TUX implements and what i mentioned in my first email that started this thread. Was a complete overhaul of the TCP/IP stack needed like you claim? Not at all - the Linux TCP/IP code was so generic, that to my surprise i saw the first zero-copy TCP transfer after only one day of hacking. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
On Sun, 3 Sep 2000 [EMAIL PROTECTED] wrote: If we go for a Linux-specific solution anyway, maybe one could add another send{,to,msg} flag that makes send*(2)'s buffer access non-atomic. That way, the kernel only needs to make sure the pages don't disappear, but there's no need for expensive MMU games. Of course, this would give applications a way for generating packets with an incorrect TCP/UDP checksum, [...] i believe such zero-copy send should only be allowed for drivers which can guarantee correct checksums. (ie. cards which do Tx-checksums) The other drivers will still copy. I dont think this is a problem - the number of cards that can do scatter-gather DMA but cannot do TX-checksumming is rather low. (i only know about the Tulip.) All modern cards do TX-checksumming and scatter-gather DMA. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
On Sun, 3 Sep 2000, Andi Kleen wrote: I did the same for fragment RX some months ago (simple fragment lists that were copy-checksummed to user space). Overall it is probably better to use a kiovec, because that can be more easily used in nfsd and sendfile. the basic fragment type introduced by the TUX changes is a 'struct skb_frag', which has csum, size, *page, page_offset, frag_done, *data and *private fields - this is more than normal kiovecs offer. But i think kiovecs can be extended to do all this (if Stephen and everybody else agrees), i just didnt want to touch it for the time being. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
On Sun, 3 Sep 2000, Andi Kleen wrote: You can already cause incorrect checksums on the wire just by passing a partly unmapped address (the zero-the-rest exception handler in csum_copy_generic in i386 forgets to add in the carry) I do not believe it is a big deal, packets with bad checksum are not really a problem (you can usually do other better DoS that do not need it) i think it's a quality of implementation issue. The csum_copy_generic thing is a bug. Allowing incorrect checksums to be sent out would be a design bug. I think some RFCs do even forbid the sending of incorrect packets? Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
On Tue, 5 Sep 2000, Jeff V. Merkey wrote: while (x) { x = x->next } all over the place that increases latency. [...] i challenge you to show one such place in the 2.4.0-test8-pre2 kernel. If it's all over the place and if it increases latency, you certainly can show at least one such place. When I have time to do this exercise, I will. [...] well, your original claim (quoted above) shows that you have identified numerous such places already, so you dont have to do any additional 'exercise'. The "all over the place" code shouldnt be too hard to find again - please just say filename and line number in any kernel version of your choice and we'll look into it. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
On Tue, 5 Sep 2000, Jeff V. Merkey wrote: Alright Ingo, you asked for it. I am going through it now and going over ALL my notes. I will catalog ALL of them and post it. Is this what you really want? yes, this would be the best indeed, to get those places fixed. But if you dont want to spend your time on that then it's enough to just post a single incident of such inefficiency and list-walking that impacts latency like you claim. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
On Tue, 5 Sep 2000, Jeff V. Merkey wrote: The origin of this comment was related to a comparison of the MSM/TSM/CSM layer in NetWare and Linux. I've already said that Alan's code handles fast paths well and from what I've seen is comparable to NetWare. [...] can we thus take this as a retraction of your below quoted three derogatory comments? " The entire Linux Network subsystem needs an overhaul. " " In networking, the enemy is LATENCY for fast performance. That's why NetWare can handle 5000 users and Linux barfs on 100 in similar tests. Copying increases latency, and the long code paths in the Linux Network layer. " " Alan, Please. I'm in your code and there are copies all over the place. I agree you have a "fast path" for most stuff, but there's all kinds of handles lookups, linear list searching like while (x) { x = x->next } all over the place that increases latency. " Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
On Wed, 6 Sep 2000, Chris Wedgwood wrote: [...] The point is that a write() is only used if some sort of dynamic data is generated on the fly. There are existing applications out there that use mmap+write (caching the maps), it would be nice for the authors of these not to have to _require_ non-portable sendfile semantics for the best performance. this is not just an interface question, mmap()+write() is conceptually inferior to a sendfile(). [if the goal is to send the same data multiple times.] Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Clearing of Ram?
On Wed, 6 Sep 2000, Frank Peters wrote: My question is: who cleared it, the kernel or the malloc function in glibc? (i found some code in glibc but nothing in the kernel) thx it's the second clear_user_highpage() in mm/memory.c that does the page clearing in the typical malloc()-ed memory case. It's only allocated and cleared once you or glibc accesses it, with page granularity. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [ANNOUNCE] Withdrawl of Open Source NDS Project/NTFS/M2FS forLinux
On Wed, 6 Sep 2000, J. Dow wrote: If the Kernel Debugger creates faulty solutions through lack of thinking, and asking why, then surely printk is at least as bad because it allows somebody to view the operation of the kernel through a keyhole darkly. [...]

i'd like to quote David here, because i cannot put it any simpler: " It is hoped that because it isn't the default, some new people will take the quantum leap to actually try debugging using the best debugger any of us have, our brains, instead of relying on automated tools. "

my claim (which others share) is that we need more people who can debug the really tough problems (for which there are no tools in any OS) with their brains, and also we need people who will produce code with less bugs in the future. There is also the important question of 'bug prevention'. The kernel isnt some magical soup which must be debugged only, code is *added* and debugged. If people who write code use more code reviews to fix bugs, then as a side-effect they'll sooner or later write code that is less prone to bugs. This is because they identify the bug-risks based on the code pattern - if you use a debugger mainly then you dont really see the code pattern but the current state of the system, which you validate. So the difference is this:

 - compare code, algorithm and concept with the original intention; analyze the symptoms and find the bug

 - compare the system state discovered through the debugger with the intended state of the system. Potentially step through the code before and after the faulty behavior, try to identify the 'point of bug' and constantly compare actual system state with intended system state. (it's certainly more complex than this, but you get the point.) This is why tools/features visualizing system state are so popular.

i claim that the second behavior is 'passive', 'disconnected' and has no connection to the code itself, and thus tends to lead to inferior code. 
It leads to the frequent behavior of 'patching the state', not modifying the code itself. Eg. 'ok, we have a NULL here, lets return then so it wont crash later in the function.' The first behavior IMO produces a more 'integrated' coding style, where designing, writing and debugging code is closely interwoven, and naturally leads to higher quality code. Eg. 'we must never get a NULL here, who called this function and why??'. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [ANNOUNCE] Withdrawl of Open Source NDS Project/NTFS/M2FS forLinux
On Tue, 5 Sep 2000, Alan Cox wrote: I spend my time thinking. But I prefer to spend it thinking about the bug not about finding it and how long fsck takes. [...] if we only optimize for the debugging time spent by seasoned kernel developers then you are completely right. But if we optimize for new kernel developers learning the right methodology, and if we optimize for the *development* process (not the *release* process) of the kernel, then reducing the amount of debugging functionality is the right choice. Things like GUI source level kernel debugging, nice graphs of things like cache line reloads between two points and run time spinlock deadlock validation and lock tracking (the last one is on my todo list only right now) are rather useful. IMO there was only one historically hard spinlock-related problem that needed solving, this is the 'locks up hard' problem (which is solved). The rest was never really a debugging obstacle, 99% of the spinlock related bugs manifest themselves in clear, unambiguous lockups. there is another type of bug that is tough to find without an automatic tool - memory leaks. I dont think there is any other systematic bug (besides hard lockups and memory leaks) that occurs often and can only be effectively found via debugging tools. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [ANNOUNCE] Withdrawl of Open Source NDS Project/NTFS/M2FS forLinux
On Tue, 5 Sep 2000, Jeff V. Merkey wrote: Your arguments are personal, not technical. [...] no, my arguments are technical, but are simply focused towards the conceptual (horizontal) development of Linux, not the vertical development of Linux (drivers) and support issues. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [ANNOUNCE] Withdrawl of Open Source NDS Project/NTFS/M2FS forLinux
On Tue, 5 Sep 2000, Jeff V. Merkey wrote: A kernel debugger will reduce development costs. [...] ... of Jeff V. Merkey - possibly. You are too much focused on your own needs, you dont contribute a bit to the generic kernel and the kernel infrastructure itself. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Terrible elevator performance in kernel 2.4.0-test8
i'm seeing similar problems. I think these problems started when the elevator was rewritten; i believe it broke the proper unplugging of IO devices. Does your performance problem get fixed by the attached workaround? Ingo On Thu, 14 Sep 2000, Robert Cohen wrote: For a while, I've been seeing a performance problem with 2.4.0-test kernels. The benchmark I am using is a netatalk performance benchmark. But I think this is a general performance problem, not appletalk related. The benchmark has a varying number of clients reading and writing 30 Meg files. The symptom I see is that with more than 2 or 3 clients, I see a sudden and gigantic reduction in write performance. At the same time I can hear the disk seeking wildly. And the throughput reported by "vmstat 5" drops from 2000-3000 to 100-200. What I believe is happening is that the elevator isn't merging the requests properly. I think that this may be the same problem reported here http://www.uwsg.indiana.edu/hypermail/linux/kernel/0008.2/0389.html When stracing the netatalk servers, I can see that they are reading from the network then doing an 8k write and repeating. If I try to simulate the problem by running multiple iozones doing 8k writes, I don't see the same kind of problems. However, in a non-networked benchmark like iozone, each process is doing many writes in its timeslice. And these writes coalesce naturally. In the networked benchmark, the read from the network is introducing enough delay that we get a context switch and the writes to different files become interleaved. This is precisely the sort of situation that the elevator is supposed to help with. With kernel version 2.4.0-test1-ac22, I saw adequate performance. In this version, the default elevator settings had a max_bombs value of 32. In 2.4.0-test3 - test6, the default max_bombs value became 0. And the performance with this setting was terrible. If I increase max_bombs with elvtune, the performance markedly improves. 
Although I still saw a tendency for a client to get write starved. In 2.4.0-test, the max_bombs value has been eliminated so I can't change it. I was hoping that that meant that the algorithm had been improved. Unfortunately, the benchmarks don't show any improvement. -- Robert Cohen Unix Support, TLTSU Australian National University - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/

--- linux/kernel/sched.c.orig	Sun Sep  3 10:03:35 2000
+++ linux/kernel/sched.c	Mon Sep  4 09:23:07 2000
@@ -508,6 +508,7 @@
 	if (tq_scheduler)
 		goto handle_tq_scheduler;
 tq_scheduler_back:
+	run_task_queue(&tq_disk);
 	prev = current;
 	this_cpu = prev->processor;
Re: [PATCH] old+new RAID for 2.2.17+
i strongly disagree. It's a nightmare to have three variants of the same code at once. (mdtools, raidtools and raidtools2.) This mess has been cleaned up in 2.4, and we shouldnt touch 2.2's RAID code beyond bugfixes. This is not support for 'old hardware', it's support for the very same thing. moving the RAID files into a separate directory is a natural cleanup in the context of 2.4, but it's just causing confusion in 2.2. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
refill_inactive()
i'm wondering about the following piece of code in refill_inactive():

	if (current->need_resched && (gfp_mask & __GFP_IO)) {
		__set_current_state(TASK_RUNNING);
		schedule();
	}

shouldnt this be __GFP_WAIT? It's true that __GFP_IO implies __GFP_WAIT (because IO cannot be done without potentially scheduling), so the code is not buggy, but the above 'yielding' of the CPU should be done in the GFP_BUFFER case as well. (which is __GFP_WAIT but not __GFP_IO) Objections? Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
__GFP_IO shrink_[d|i]cache_memory()?
i've seen a couple of GFP_BUFFER allocation deadlocks in an atypical system which had lots of RAM allocated to inodes. The reason for the deadlock is that the shrink_*() functions cannot be called if __GFP_IO is not set. Nothing else can be freed at that point, so the try_again: loop in page_alloc() gets into an infinite loop. as an immediate solution the previous __GFP_WAIT suggestion solves the deadlock - because the GFP_BUFFER allocator yields the CPU and kswapd can run and do the dcache/icache shrinking. [i cannot reproduce any deadlocks after doing this change.] as a longer term solution, i'm wondering how hard it would be to propagate gfp_mask into the shrink_*() functions, and prevent recursion similarly to the swap-out logic? This way even GFP_BUFFER allocators could touch/free the dcache/icache. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: __GFP_IO shrink_[d|i]cache_memory()?
On Sun, 24 Sep 2000, Linus Torvalds wrote: [...] I don't think shrinking the inode cache is actually illegal when GFP_IO isn't set. In fact, it's probably only the buffer cache itself that has to avoid recursion - the other stuff doesn't actually do any IO. i just found this out by example, i'm running the shrink_[i|d]cache stuff even if __GFP_IO is not set, and no problems so far. (and much better balancing behavior) Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
[patch] vmfixes-2.4.0-test9-B2
the attached vmfixes-B2 patch adds the following fixes/cleanups:

vmscan.c:
 - check for __GFP_WAIT not __GFP_IO when yielding the CPU. This fixes GFP_BUFFER deadlocks. In fact since no caller to do_try_to_free_pages() can expect that function to not block, we dont test for __GFP_WAIT either. [GFP_KSWAPD is the only caller without __GFP_WAIT set.]
 - do shrink_[d|i]cache_memory() even if !__GFP_IO. This improves balance.
 - push the __GFP_IO test into shm_swap().
 - after shm_swap() do not test for !count but for <= 0, because count could be negative if in the future the shrink_ functions return bigger than 1, and we could then get into an infinite loop. Same after swap_out() and refill_inactive_scan(). No performance penalty, test for zero is exchanged with test for sign.
 - kmem_cache_reap() is done within refill_inactive(), so it's unnecessary to call it at the beginning of do_try_to_free_pages(). Moved to the else branch. (i saw kmem_cache_reap() show up in profiles)
 - (small codestyle cleanup.)

page_alloc.c:
 - in __alloc_pages(), the infinite allocation loop yields the CPU if necessary. This prevents a potential lockup on UP, and even on SMP it can prevent livelocks. (i saw this happen.)

mm.h:
 - made the GFP_ flag definitions easier to parse for humans :-)
 - remove shrink_mmap() prototype, it doesnt exist anymore.

shm.c:
 - the trivial test for __GFP_IO.

swap_state.c, filemap.c:
 - (shrink_mmap doesnt exist anymore, it's refill_inactive.)

(The patch applies and compiles cleanly, and is tested under various VM loads i use.)

Ingo

--- linux/mm/vmscan.c.orig	Sun Sep 24 11:41:38 2000
+++ linux/mm/vmscan.c	Sun Sep 24 12:20:27 2000
@@ -119,7 +119,7 @@
 	 * our scan.
 	 *
 	 * Basically, this just makes it possible for us to do
-	 * some real work in the future in "shrink_mmap()".
+	 * some real work in the future in "refill_inactive()".
 	 */
 	if (!pte_dirty(pte)) {
 		flush_cache_page(vma, address);
@@ -159,7 +159,7 @@
 	 * NOTE NOTE NOTE! This should just set a
 	 * dirty bit in 'page', and just drop the
 	 * pte. All the hard work would be done by
-	 * shrink_mmap().
+	 * refill_inactive().
 	 *
 	 * That would get rid of a lot of problems.
 	 */
@@ -891,7 +891,7 @@
 	do {
 		made_progress = 0;
-		if (current->need_resched && (gfp_mask & __GFP_IO)) {
+		if (current->need_resched) {
 			__set_current_state(TASK_RUNNING);
 			schedule();
 		}
@@ -899,34 +899,32 @@
 		while (refill_inactive_scan(priority, 1) ||
 		       swap_out(priority, gfp_mask, idle_time)) {
 			made_progress = 1;
-			if (!--count)
+			if (--count <= 0)
 				goto done;
 		}
-		/* Try to get rid of some shared memory pages.. */
-		if (gfp_mask & __GFP_IO) {
-			/*
-			 * don't be too light against the d/i cache since
-			 * shrink_mmap() almost never fail when there's
-			 * really plenty of memory free.
-			 */
-			count -= shrink_dcache_memory(priority, gfp_mask);
-			count -= shrink_icache_memory(priority, gfp_mask);
-			/*
-			 * Not currently working, see fixme in shrink_?cache_memory
-			 * In the inner funtions there is a comment:
-			 * "To help debugging, a zero exit status indicates
-			 * all slabs were released." (-arca?)
-			 * lets handle it in a primitive but working way...
-			 * if (count <= 0)
-			 *	goto done;
-			 */
+		/*
+		 * don't be too light against the d/i cache since
+		 * refill_inactive() almost never fail when there's
+		 * really plenty of memory free.
+		 */
+		count -= shrink_dcache_memory(priority, gfp_mask);
+		count -= shrink_icache_memory(priority, gfp_mask);
+		/*
+		 * Not currently working, see fixme in shrink_?cache_memory
+		 * In the inner funtions there is a comment:
+		 * "To help debugging, a zero exit status indicates
+		 * all slabs were released." (-arca?)
+		 * lets handle it in a primitive but working way...
+		 * if (count <= 0)
+		 *	goto done;
+		 */

-		while (shm_swap(priority, gfp_mask)) {
-
Re: [patch] vmfixes-2.4.0-test9-B2
On Sun, 24 Sep 2000, Andrea Arcangeli wrote: ext2_new_block (or whatever that runs getblk with the superblock lock acquired) -> getblk -> GFP -> shrink_dcache_memory -> prune_dcache -> prune_one_dentry -> dput -> dentry_iput -> iput -> inode->i_sb->s_op->put_inode -> ext2_discard_prealloc -> ext2_free_blocks -> lock_super -> deadlock nasty indeed, sigh. Shouldnt ext2_new_block drop the superblock lock in places where we might block? Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 1023rd thread crashes 2.4.0-test8 from non-root user
On Mon, 25 Sep 2000, Mark Hahn wrote: The problem is large numbers of threads in 2.4.0-test8 can result in a hard crash of the entire kernel. This can be done as a non-root user. this appears to be reproducible (128M duron, haven't tried intel UP/SMP): i've done some experimentation, and to me it appears we overload the queued signal limit of bash, or something like that? The Ctrl-C thing definitely creates a lot of signals. And the default limit for queued signals [kernel/signal.c:max_queued_signals] is 1024 ... so i think this is threading-unrelated, to me it (tentatively) looks to be a signal handling bug. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 1023rd thread crashes 2.4.0-test8 from non-root user
indeed, after changing max_queued_signals to 4096, i cannot crash the kernel anymore with 2000 threads. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/