Re: [PATCH 1/1] irqchip: exynos-combiner: Save IRQ enable set on suspend
>>>>> "Javier" == Javier Martinez Canillas >>>>> writes: Javier> The Exynos interrupt combiner IP looses its state when the SoC s/looses/loses/ Peter C -- Dr Peter Chubb peter.chubb AT nicta.com.au http://www.ssrg.nicta.com.au Software Systems Research Group/NICTA -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Documentation: ARM: EXYNOS: Describe boot loaders interface
>>>>> "Krzysztof" == Krzysztof Kozlowski writes: Krzysztof> Various boot loaders for Exynos based boards use certain Krzysztof> memory addresses during booting for different Krzysztof> purposes. Mostly this is one of following : 1. as a CPU Krzysztof> boot address, 2. for storing magic cookie related to low Krzysztof> power mode (AFTR, sleep). Krzysztof> The document, based solely on kernel source code, tries to Krzysztof> group the information scattered over different files. This Krzysztof> would help in the future when adding support for new SoC or Krzysztof> when extending features related to low power modes. Is it worth grabbing the info from U-Boot and documenting it here (it's not documented other than in the Hardkernel U-Boot source)? I can send you the info, or you can see it in https://github.com/hardkernel/u-boot/blob/odroidxu3-v2012.07/board/samsung/smdk5420/lowlevel_init.S at symbol nscode_base near line 104 -- Dr Peter Chubb peter.chubb AT nicta.com.au http://www.ssrg.nicta.com.au Software Systems Research Group/NICTA
Re: [PATCH] debug: Do not permit CONFIG_DEBUG_STACK_USAGE=y on IA64 or PARISC
>>>>> "Ingo" == Ingo Molnar writes: Ingo> * James Bottomley wrote: >> Since the problem is an invalid assumption about how the stack >> grows, why not just condition it on that. We actually have a >> config option for this: CONFIG_STACK_GROWSUP. But for some reason >> ia64 doesn't define this, why not, Tony? It looks deliberate >> because you have replaced a lot of >> >> #ifdef CONFIG_STACK_GROWSUP >> >> with >> >> #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64) >> >> but not all of them. Ingo> Yes, that's another possible solution, assuming that it's really Ingo> only about the up/down difference. Ingo> Thanks, IA64 has two stacks -- the standard one, that grows down, and the register stack engine backing store, that grows up. The usual mechanisms for stack growth are used, so only some of the bits predicated on `STACK_GROWSUP' are useful. Peter C -- Dr Peter Chubb peter.chubb AT nicta.com.au http://www.ssrg.nicta.com.au Software Systems Research Group/NICTA
Fix compilation with gcc 4.2
gcc-4.2 is a lot more picky about its symbol handling. EXPORT_SYMBOL no longer works on symbols that are undefined or defined with static scope. For example, with CONFIG_PROFILING off, I see:
kernel/profile.c:206: error: __ksymtab_profile_event_unregister causes a section type conflict
kernel/profile.c:205: error: __ksymtab_profile_event_register causes a section type conflict
This patch moves the EXPORTs inside the #ifdef CONFIG_PROFILING, so we only try to export symbols that are defined.
Also, in kernel/kprobes.c there's an EXPORT_SYMBOL_GPL() for jprobe_return, which if CONFIG_JPROBES is undefined is a static inline and gives the same error.
And in drivers/acpi/resources/rsxface.c, there's an ACPI_EXPORT_SYMBOL() for a static symbol. If it's static, it's not accessible from outside the compilation unit, so it should not be exported.
These three changes allow building a zx1_defconfig kernel with gcc 4.2 on IA64.
Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>
Index: linux-2.6-git/kernel/profile.c
===
--- linux-2.6-git.orig/kernel/profile.c 2007-08-09 12:10:19.921216500 +1000
+++ linux-2.6-git/kernel/profile.c 2007-08-09 12:10:26.061162039 +1000
@@ -199,11 +199,11 @@
 EXPORT_SYMBOL_GPL(register_timer_hook);
 EXPORT_SYMBOL_GPL(unregister_timer_hook);
 EXPORT_SYMBOL_GPL(task_handoff_register);
 EXPORT_SYMBOL_GPL(task_handoff_unregister);
+EXPORT_SYMBOL_GPL(profile_event_register);
+EXPORT_SYMBOL_GPL(profile_event_unregister);
 #endif /* CONFIG_PROFILING */
-EXPORT_SYMBOL_GPL(profile_event_register);
-EXPORT_SYMBOL_GPL(profile_event_unregister);
 #ifdef CONFIG_SMP
 /*
Index: linux-2.6-git/kernel/kprobes.c
===
--- linux-2.6-git.orig/kernel/kprobes.c 2007-08-09 12:14:48.898830198 +1000
+++ linux-2.6-git/kernel/kprobes.c 2007-08-09 14:09:50.180322576 +1000
@@ -1063,6 +1063,8 @@
 EXPORT_SYMBOL_GPL(register_kprobe);
 EXPORT_SYMBOL_GPL(unregister_kprobe);
 EXPORT_SYMBOL_GPL(register_jprobe);
 EXPORT_SYMBOL_GPL(unregister_jprobe);
-EXPORT_SYMBOL_GPL(jprobe_return);
+
+#ifdef CONFIG_KPROBES
 EXPORT_SYMBOL_GPL(register_kretprobe);
 EXPORT_SYMBOL_GPL(unregister_kretprobe);
+#endif
Index: linux-2.6-git/drivers/acpi/resources/rsxface.c
===
--- linux-2.6-git.orig/drivers/acpi/resources/rsxface.c 2007-08-09 13:06:59.040346772 +1000
+++ linux-2.6-git/drivers/acpi/resources/rsxface.c 2007-08-09 13:12:03.125801491 +1000
@@ -474,8 +474,6 @@ acpi_rs_match_vendor_resource(struct acp
 	return (AE_CTRL_TERMINATE);
 }
-ACPI_EXPORT_SYMBOL(acpi_rs_match_vendor_resource)
-
 /***
  *
  * FUNCTION:acpi_walk_resources
--
Dr Peter Chubb http://www.gelato.unsw.edu.au [EMAIL PROTECTED]
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
Re: [RFC] Deferred interrupt handling.
The problem you're having is essentially the same as the user-level interrupt handler problem I've been dealing with for ages. The basic rule is: don't share interrupts between devices on the host and devices in the guest. But you *can* share interrupts between devices in a single guest. If you want the code, see http://www.gelato.unsw.edu.au/cgi-bin/viewvc.cgi/cvs/kernel/usrdrivers/latest/ and look at generic-irq.patch and fasync (which adds asynchronous notifications). For the KVM work it'll need modifying a little, but the basic infrastructure is there. We've currently got this working to pass interrupts to a type-II (hosted) virtual machine monitor running a guest kernel with native drivers. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au http://www.ertos.nicta.com.au ERTOS within National ICT Australia
Re: linux-ia64 build warning messages
>>>>> "Russ" == Russ Anderson <[EMAIL PROTECTED]> writes: Russ> Tony Luck wrote: >> > I used the sn2_defconfig in the tree :) >> >> So there is something odd happening. Russ complained that he was >> still seeing several errors from the sn2_defconfig build too when I >> posted the "last fix" to Len. But I don't see them when I build. Russ> An additional data point. I have a copy of Tony's test tree Russ> pulled down on March 30th that builds without the warning Russ> messages. The copy of Tony's test tree pulled down on May 22nd Russ> does have warning messages. I'm building both with the same Russ> compiler (etc). I'm fairly certain a tree I pulled down in Russ> April built without warnings. I've since blown away that tree. Change request 85bd2fddd68e757da8e1af98f857f61a3c9ce647 introduced section-mismatch checking for vmlinux, which caused all these warnings to become visible. It looks as if gcc can create references from .sdata to .init.sdata depending on what optimisations it chooses to do. Ideally we could teach gcc to put its constants in the same section they reference. But I'm no gcc guru. The alternative is to get modpost to ignore such references, at the cost of perhaps missing a real problem somewhere. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au http://www.ertos.nicta.com.au ERTOS within National ICT Australia
Re: BUG: sleeping function called from invalid context at kernel/fork.c:385
I see many many section mismatches when compiling with gcc 4.1 and binutils 2.17.50.20070426. They appear to be from .sdata to .init.data. This is with basic zx1_defconfig with a few mods. The reason appears to be compiler weirdness.
WARNING: init/built-in.o(.sdata+0x30): Section mismatch: reference to .init.data:ino (after 'root_mountflags')
(initramfs.s contains a 32-word table `head'. Code like:
	static __initdata struct hash {..} *head[32];
	for (p = head; p < head + 32; p++)
is generating:
	.section .sdata
	L24: .data8 head#+256
Rather than adding 256 to head at run time, the compiler loads L24 and uses that for the comparison. This triggers the warning.)
WARNING: arch/ia64/kernel/built-in.o(.sdata+0x110): Section mismatch: reference to .init.data:rsvd_region (between 'ia64_sal' and 'ia64_i_cache_stride_shift')
WARNING: mm/built-in.o(.sdata+0x48): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x50): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x58): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x60): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x68): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x70): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x78): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x80): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x3c8): Section mismatch: reference to .init.data: (between 'swap_list' and 'slab_early_init')
WARNING: mm/built-in.o(.sdata+0x3d8): Section mismatch: reference to .init.data:initkmem_list3 (between 'swap_list' and 'slab_early_init')
WARNING: mm/built-in.o(.sdata+0x3e0): Section mismatch: reference to .init.data:initkmem_list3 (between 'swap_list' and 'slab_early_init')
WARNING: drivers/built-in.o(.data.rel.local+0x20a8): Section mismatch: reference to .init.text:acpi_processor_start (between 'acpi_processor_driver' and 'acpi_thermal_driver')
WARNING: drivers/built-in.o(.data.rel+0x1d80): Section mismatch: reference to .init.text:serial8250_console_setup (between 'serial8250_console' and 'dpm_active')
WARNING: drivers/built-in.o(.sdata+0x788): Section mismatch: reference to .init.data: (between 'first.20152' and 'enabled')
WARNING: drivers/built-in.o(.sdata+0x790): Section mismatch: reference to .init.data: (between 'first.20152' and 'enabled')
WARNING: drivers/built-in.o(.sdata+0xa18): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xa20): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xa28): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xac8): Section mismatch: reference to .init.data: (between 'Symbios_trailer.24436' and 'try_direct_io')
WARNING: drivers/built-in.o(.sdata+0xb00): Section mismatch: reference to .init.data: (between 'st_max_sg_segs' and 'osst_version')
WARNING: arch/ia64/hp/common/built-in.o(.data.rel.local+0xa8): Section mismatch: reference to .init.text:acpi_sba_ioc_add (between 'acpi_sba_ioc_driver' and 'ioc_seq_ops')
WARNING: arch/ia64/hp/common/built-in.o(.sdata+0x0): Section mismatch: reference to .init.data:__setup_str_sba_page_override before 'reserve_sba_gart' (at offset -0x204c2613)
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
Re: [PATCH Resend] - SN: validate smp_affinity mask on intr redirect
Jack> }
Jack> +
Jack> +bool is_affinity_mask_valid(cpumask_t cpumask)
Jack> +{
Jack> +	if (ia64_platform_is("sn2")) {
Jack> +		/* Only allow one CPU to be specified in the smp_affinity mask */
Jack> +		if (cpus_weight(cpumask) != 1)
Jack> +			return false;
Why not just:
	return cpus_weight(cpumask) == 1;
It's a Boolean; treat it as one. (If you thought the average kernel programmer (who's s/he?) understood the logical implication rule it could be:
	return !ia64_platform_is("sn2") || cpus_weight(cpumask) == 1;
)
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
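For anyone who wants to convince themselves that the implication form matches the nested ifs, here is a small user-space sketch. It assumes the quoted function (which is cut off in the quote) falls through to `return true`; `is_sn2` and `weight` are hypothetical stand-ins for ia64_platform_is("sn2") and cpus_weight(cpumask), not the kernel helpers themselves.

```c
#include <stdbool.h>

/* Shape of the quoted patch, assuming it ends with "return true". */
static bool nested_form(bool is_sn2, int weight)
{
	if (is_sn2) {
		/* Only allow one CPU in the mask on sn2 */
		if (weight != 1)
			return false;
	}
	return true;
}

/* "is_sn2 implies weight == 1", written as !A || B. */
static bool implication_form(bool is_sn2, int weight)
{
	return !is_sn2 || weight == 1;
}

/* Compare both forms over a small range of inputs. */
static bool forms_agree(void)
{
	for (int sn2 = 0; sn2 <= 1; sn2++)
		for (int weight = 0; weight <= 4; weight++)
			if (nested_form(sn2, weight) != implication_form(sn2, weight))
				return false;
	return true;
}
```

Calling forms_agree() checks every combination of platform and mask weight from 0 to 4; the one-line return is only a drop-in replacement if the rest of the original function really does return true.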
Re: [QUICKLIST 0/4] Arch independent quicklists V2
> "Jeremy" == Jeremy Fitzhardinge <[EMAIL PROTECTED]> writes: Jeremy> And do the same in pte pages for actual mapped pages? Or do Jeremy> you think they would be too densely populated for it to be Jeremy> worthwhile? We've been doing some measurements on how densely clumped ptes are. On 32-bit platforms, they're pretty dense. On IA64, quite a bit sparser, depending on the workload of course. I think that's mostly because of the larger pagesize on IA64 -- with 64k pages, you don't need very many to map a small object. I'm hoping IanW can give more details.
Re: Ski for huge page size !
>>>>> "sudhnesh" == sudhnesh adapawar <[EMAIL PROTECTED]> writes: sudhnesh> Hey all ! I am thinking to use ski simulator as I can get sudhnesh> the ia64 (Itanium 2)simulated on ia32 archiSo can I use sudhnesh> this product for the project related to huge page size ??? sudhnesh> Will the problems related to huge pages such as sudhnesh> swapping,IO,etc...will be covered if I use ski with 2.6 sudhnesh> kernel image configured for ia64 archi with huge page size sudhnesh> support ? Should work perfectly. We've been using Ski for similar work, looking at SuperPage support. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au http://www.ertos.nicta.com.au ERTOS within National ICT Australia
Re: How to boot 2.6 kernel using hp ski simulator ???
Please check out http://www.gelato.unsw.edu.au/IA64wiki/SkiSimulator for lots of info on Ski. It works fine with Linux 2.6, and for hugepage work too.
> 1) I used 'make ARCH=ia64 menuconfig' to configure and followed the
> steps to get kernel image of version 2.6 ! I also selected the generic
> type as Ski-simulator and also selected the HP-ski drivers something
> simscsi,etc.etc.
I suggest you start with
	make sim_defconfig
Your symptoms look like a misconfigured or misbuilt vmlinux. The sim_defconfig
If you're running on IA32, then you need something like:
	make CROSS_COMPILE=ia64-linux-gnu ARCH=ia64 boot
to build kernel and bootloader.
You need to get or build yourself a disk image. Instructions for building are at http://www.gelato.unsw.edu.au/IA64wiki/skidiskimage
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
RE: ip_contrack refuses to load if built UP as a module on IA64
This patch makes UP and SMP do the same thing as far as module per-cpu data go. Unfortunately it affects core code.
To repeat the problem: IA64 keeps per-cpu data in a small data area that is referenced by a 22-bit offset, for both UP and SMP cases. If a module defines per-cpu data, it too will end up in the small-data area. But the module loader at present special-cases the UP treatment of per-cpu data, assumes that it is in the GP-relative data area, and does nothing (for SMP it allocates space, and copies initialised data items into it).
The effect is that modules defining per-cpu data fail to load if they're built UP, because of an impossible relocation.
The appended patch makes the treatment of per-cpu data uniform between UP and SMP cases. For most architectures, the per-cpu data section will be empty for UP, and so the per-cpu setup code will not be invoked.
Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>
diff --git a/arch/ia64/kernel/module.c b/arch/ia64/kernel/module.c
--- a/arch/ia64/kernel/module.c
+++ b/arch/ia64/kernel/module.c
@@ -951,4 +951,10 @@ percpu_modcopy (void *pcpudst, const voi
 		if (cpu_possible(i))
 			memcpy(pcpudst + __per_cpu_offset[i], src, size);
 }
+#else
+void
+percpu_modcopy (void *pcpudst, const void *src, unsigned long size)
+{
+	memcpy(pcpudst, src, size);
+}
 #endif /* CONFIG_SMP */
diff --git a/kernel/module.c b/kernel/module.c
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -209,7 +209,6 @@ static struct module *find_module(const
 	return NULL;
 }
-#ifdef CONFIG_SMP
 /* Number of blocks used and allocated. */
 static unsigned int pcpu_num_used, pcpu_num_allocated;
 /* Size of each block. -ve means used. */
@@ -352,29 +351,7 @@ static int percpu_modinit(void)
 	return 0;
 }
 __initcall(percpu_modinit);
-#else /* ... !CONFIG_SMP */
-static inline void *percpu_modalloc(unsigned long size, unsigned long align,
-				    const char *name)
-{
-	return NULL;
-}
-static inline void percpu_modfree(void *pcpuptr)
-{
-	BUG();
-}
-static inline unsigned int find_pcpusec(Elf_Ehdr *hdr,
-					Elf_Shdr *sechdrs,
-					const char *secstrings)
-{
-	return 0;
-}
-static inline void percpu_modcopy(void *pcpudst, const void *src,
-				  unsigned long size)
-{
-	/* pcpusec should be 0, and size of that section should be 0. */
-	BUG_ON(size != 0);
-}
-#endif /* CONFIG_SMP */
+
 #ifdef CONFIG_MODULE_UNLOAD
 #define MODINFO_ATTR(field)\
'mdio_bus_exit' in discarded section .text.exit
When building with CONFIG_PHYLIB=y on Itanium, I see:
`mdio_bus_exit' referenced in section `.init.text' of drivers/built-in.o: defined in discarded section `.exit.text' of drivers/built-in.o
I believe that mdio_bus_exit should not be declared __exit, because it is referenced from __init sections in, say, phy_init().
Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -170,7 +170,7 @@ int __init mdio_bus_init(void)
 	return bus_register(&mdio_bus_type);
 }
-void __exit mdio_bus_exit(void)
+void mdio_bus_exit(void)
 {
 	bus_unregister(&mdio_bus_type);
 }
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: Where is the performance bottleneck?
>>>>> "Holger" == Holger Kiehl <[EMAIL PROTECTED]> writes: Holger> Hello I have a system with the following setup: (4-way CPUs, 8 spindles on two controllers) Try using XFS. See http://scalability.gelato.org/DiskScalability_2fResults --- ext3 is single threaded and tends not to get the full benefit of either the multiple spindles or the multiple processors. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever*
Include assembly entry points in TAGS
As it stands, etags doesn't find labels in the IA64 or i386 assembler source code, because they're disguised inside a preprocessor macro. I propose the attached fix, which adds a regular expression to enable labels disguised by ENTRY() and GLOBAL_ENTRY() macros.
There's a similar problem for MIPS, which needs to match LEAF(entrypoint)
Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>
diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -1187,7 +1187,7 @@ cscope: FORCE
 	$(call cmd,cscope)
 quiet_cmd_TAGS = MAKE $@
-cmd_TAGS = $(all-sources) | etags -
+cmd_TAGS = $(all-sources) | etags --regex='{asm}/\(GLOBAL_\)?ENTRY(\([^)]+\))/\2/' -
 # Exuberant ctags works better with -I
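To show what the expression in the patch is meant to pick out, here is a user-space sketch using POSIX regcomp(). etags uses Emacs-style regex syntax, so the pattern below is my own translation into POSIX extended syntax, and entry_label() is a hypothetical helper written only for this illustration. The second group is the label that the patch asks etags to record.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the label inside ENTRY(...) or GLOBAL_ENTRY(...) into out.
 * Returns 1 if a label was found, 0 otherwise.
 */
static int entry_label(const char *line, char *out, size_t outlen)
{
	regex_t re;
	regmatch_t m[3];	/* whole match, optional GLOBAL_, label */
	int found = 0;

	if (outlen == 0)
		return 0;
	/* POSIX ERE rendering of \(GLOBAL_\)?ENTRY(\([^)]+\)) */
	if (regcomp(&re, "(GLOBAL_)?ENTRY\\(([^)]+)\\)", REG_EXTENDED) != 0)
		return 0;

	if (regexec(&re, line, 3, m, 0) == 0 && m[2].rm_so >= 0) {
		size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);

		if (len >= outlen)
			len = outlen - 1;	/* truncate to fit */
		memcpy(out, line + m[2].rm_so, len);
		out[len] = '\0';
		found = 1;
	}
	regfree(&re);
	return found;
}
```

A line such as GLOBAL_ENTRY(ia64_switch_to) yields ia64_switch_to, and an ordinary instruction line yields nothing. MIPS would need LEAF added as a further alternative, as the message notes.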
Re: fcntl(F_GETLEASE) semantics??
>>>>> "Trond" == Trond Myklebust <[EMAIL PROTECTED]> writes: Trond> to den 11.08.2005 Klokka 09:48 (+1000) skreiv Peter Chubb: >> Hi, The LTP test fcntl23 is failing. It does, in essence, fd = >> open(xxx, O_RDWR|O_CREAT, 0777); if (fcntl(fd, F_SETLEASE, F_RDLCK) >> == -1) fail; >> >> fcntl always returns EAGAIN here. The manual page says that a read >> lease causes notification when `another process' opens the file for >> writing or truncates it. The kernel implements `any process' >> (including the current one). >> >> Which semantics are correct? Personally I think that what the >> kernel implements is correct (you can't get a read lease unsless >> there are no writers _at_ _all_) Trond> A read lease should mean that there are no writers at all. Trond> If we were to allow the current process to open for write, then Trond> that would still mean that nobody else can get a lease. In Trond> effect you have been granted a lease with exclusive semantics Trond> (i.e. a write lease). You might as well request that instead of Trond> pretending it is a read lease. So the manual page is wrong. Fine. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever*
fcntl(F_GETLEASE) semantics??
Hi,
The LTP test fcntl23 is failing. It does, in essence,
	fd = open(xxx, O_RDWR|O_CREAT, 0777);
	if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
		fail;
fcntl always fails with EAGAIN here. The manual page says that a read lease causes notification when `another process' opens the file for writing or truncates it. The kernel implements `any process' (including the current one).
Which semantics are correct? Personally I think that what the kernel implements is correct (you can't get a read lease unless there are no writers _at_ _all_)
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
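A stand-alone reproduction of the case, as a sketch rather than the LTP test itself: read_lease_on_rdwr_fd() is a hypothetical helper, the temporary file under /tmp is my assumption, and the fallback definition of F_SETLEASE is only there in case the C library headers hide it. On Linux I would expect the call to fail with EAGAIN, but a filesystem without lease support may report a different error.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* glibc exposes F_SETLEASE only with this */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef F_SETLEASE
#define F_SETLEASE 1024		/* Linux value, used only if the headers hide it */
#endif

/*
 * Ask for a read lease on a descriptor that is open for writing.
 * Returns 0 if the lease was granted, otherwise the errno value.
 */
static int read_lease_on_rdwr_fd(void)
{
	char path[] = "/tmp/leasedemoXXXXXX";
	int fd = mkstemp(path);		/* mkstemp opens the file O_RDWR */
	int err = 0;

	if (fd == -1)
		return errno;
	if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
		err = errno;		/* the process itself counts as a writer */
	close(fd);
	unlink(path);
	return err;
}
```

Opening the file O_RDONLY instead is the way to find out whether the filesystem hands out read leases at all.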
Re: How to get the physical page addresses from a kernel virtual address for DMA SG List?
You may want to take a look at the user-mode driver infrastructure patches, which do almost exactly what you're trying to do. Get them from http://www.gelato.unsw.edu.au/cgi-bin/viewcvs.cgi/cvs/kernel/usrdrivers/kernel-2.6.12-rc3/ -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever*
Re: Hangcheck problem
>>>>> "Noah" == Noah Silverman <[EMAIL PROTECTED]> writes: Noah> Sorry 2.6.7 Noah> Burton Windle wrote: >> Kernel version? Are you running on an x86 machine without TSC, e.g., a 486? The hangcheck timer then devolves into using jiffies, and a single jiffy error gives you the printout you mention. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever*
Re: How to measure time accurately.
>>>>> "Chris" == Chris Friesen <[EMAIL PROTECTED]> writes: Chris> krishna wrote: >> Hi All, >> >> Can any one tell me how to measure time accurately for a block of C >> code in device drivers. For example, If I want to measure the time >> duration of firmware download. Chris> Most cpus have some way of getting at a counter or decrementer Chris> of various frequencies. Usually it requires low-level hardware Chris> knowledge and often it needs assembly code. As a device driver is inside the Linux kernel (unless you're writing a user-mode device driver :-)) you can use the get_cycles() macro that's defined for most architectures. It provides a snapshot of the cycle-counter. Caveats: 1. If you're running with power management, the cycle counter ticks at a variable rate. 2. If you're on a multiprocessor, the cycle counters of different processors need not be synchronised. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever*
Re: LBD/filesystems over 2TB: is it safe?
>>>>> "jniehof" == jniehof <[EMAIL PROTECTED]> writes:
jniehof> Someone posted to the LBD list last December regarding some
jniehof> supposedly horrible bugs in large filesystems:
jniehof> https://www.gelato.unsw.edu.au/archives/lbd/2004-December/75.html
jniehof> https://www.gelato.unsw.edu.au/archives/lbd/2004-December/74.html
The changes in those emails are irrelevant --- they fail to take into account the properties of the filesystems that they modify, which mean that the 32-bit quantities being shifted will not overflow. They're typically of the form:
- iblock = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ iblock = (sector_t) index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
Now, on a 32-bit processor with 4k pages, PAGE_CACHE_SHIFT is 12, and i_blkbits is also 12 if you're using 4k blocks (which you have to do to get a large filesystem). So this does nothing and is safe. The on-disk format for ext[23] uses 32-bit block numbers, so your maximum filesystem size is 16TB, and your maximum value of iblock is 2^32-1.
Please do benchmark XFS and ext3 on your system before choosing. Our tests (to be published in Linux.Conf.Au next month) show that XFS is significantly faster for some workloads. Also its scalability to very large filesystems is much more mature than ext3.
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
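The arithmetic can be checked in user space. This is a sketch only: it uses uint32_t and uint64_t in place of the kernel's page index and sector_t, and the function names are made up for the illustration.

```c
#include <stdint.h>

#define PAGE_CACHE_SHIFT 12	/* 4k pages, as in the message */

/* The shift as originally written: done in 32 bits. */
static uint32_t iblock_32(uint32_t index, unsigned int blkbits)
{
	return index << (PAGE_CACHE_SHIFT - blkbits);
}

/* The shift with the proposed cast: widened first, then shifted. */
static uint64_t iblock_64(uint32_t index, unsigned int blkbits)
{
	return (uint64_t)index << (PAGE_CACHE_SHIFT - blkbits);
}

/* Largest filesystem addressable with 32-bit block numbers. */
static uint64_t max_fs_bytes(unsigned int blkbits)
{
	return (UINT64_C(1) << 32) << blkbits;
}
```

With 4k blocks the shift count is 0, so both versions return the index unchanged for every input, and max_fs_bytes(12) is 2^44 bytes, i.e. 16TB. The two only diverge when the block size is smaller than the page size and the index is very large, which is the combination the message argues the on-disk format rules out.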
Re: forkbombing Linux distributions
>>>>> "William" == William Beebe <[EMAIL PROTECTED]> writes:
William> Sure enough, I created the following script and ran it as a
William> non-root user:
William> #!/bin/bash
William> $0 & $0 &
There are two approaches to fixing this.
1. Rate limit fork(). Unfortunately some legitimate usages do a lot of forking, and you don't really want to slow them down.
2. Limit (per user) the number of processes allowed. This is what's currently done; and if you as administrator want to, you can set RLIMIT_NPROC (the nproc item) in /etc/security/limits.conf.
On an almost-single-user system such as most desktops, there isn't much point in setting this. On shared systems, it can be useful.
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
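The same limit can also be applied from inside a program before it starts forking. A minimal sketch, assuming a POSIX system that provides RLIMIT_NPROC (Linux does); cap_nproc() and soft_nproc() are hypothetical helper names, not part of any library.

```c
#include <sys/resource.h>

/* Current soft limit on processes, or RLIM_INFINITY if it can't be read. */
static rlim_t soft_nproc(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NPROC, &rl) != 0)
		return RLIM_INFINITY;
	return rl.rlim_cur;
}

/*
 * Lower the soft RLIMIT_NPROC to at most cap.  The hard limit is left
 * alone, and a soft limit already below cap is not raised.
 * Returns 0 on success, -1 on error.
 */
static int cap_nproc(rlim_t cap)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NPROC, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > cap) {
		rl.rlim_cur = cap;
		if (setrlimit(RLIMIT_NPROC, &rl) != 0)
			return -1;
	}
	return 0;
}
```

Once the user already has that many processes, fork() in the calling process fails with EAGAIN, which is what stops the script above from taking the machine down. The limit counts all of the user's processes, not just the children of the caller.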
Re: vm_dirty_ratio seems a bit large.
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes: Andrew> Robin Holt <[EMAIL PROTECTED]> wrote: >> One other issue we have is the vm_dirty_ratio and background_ratio >> adjustments are a little coarse with these memory sizes. Since our >> minimum adjustment is 1%, we are adjusting by 40GB on the largest >> configuration from above. The hardware we are shipping today is >> capable of going to far greater amounts of memory, but we don't >> have customers demanding that yet. I would like to plan ahead for >> that and change vm_dirty_ratio from a straight percent into a >> millipercent (thousandth of a percent). Would that type of change >> be acceptable? Andrew> Oh drat. I think such a change would require a new set of Andrew> /proc entries. No, you could just extend them to understand fixed point. Keep printing integers as integers, print non-integers with one (or two: will we ever need 0.01% increments?) decimal places. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever*
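Roughly what I have in mind, as a user-space sketch and not the sysctl handler itself. The unit (hundredths of a percent) and both function names are assumptions made for the illustration.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Parse "40", "12.5" or "0.05" into hundredths of a percent.
 * Returns -1 on malformed input or a value above 100%.
 */
static long parse_centipercent(const char *s)
{
	long whole = 0, frac = 0, value;
	int digits = 0;

	if (*s < '0' || *s > '9')
		return -1;
	while (*s >= '0' && *s <= '9') {
		whole = whole * 10 + (*s++ - '0');
		if (whole > 100)
			return -1;
	}
	if (*s == '.') {
		s++;
		while (*s >= '0' && *s <= '9' && digits < 2) {
			frac = frac * 10 + (*s++ - '0');
			digits++;
		}
		if (digits == 0)
			return -1;
		if (digits == 1)
			frac *= 10;	/* "12.5" means 12.50 */
	}
	if (*s != '\0' && *s != '\n')
		return -1;		/* trailing junk or a third decimal */
	value = whole * 100 + frac;
	return value > 10000 ? -1 : value;
}

/* Print integers as integers, others with one or two decimal places. */
static int format_centipercent(long v, char *buf, size_t len)
{
	if (v % 100 == 0)
		return snprintf(buf, len, "%ld", v / 100);
	if (v % 10 == 0)
		return snprintf(buf, len, "%ld.%ld", v / 100, (v % 100) / 10);
	return snprintf(buf, len, "%ld.%02ld", v / 100, v % 100);
}
```

Existing scripts that write whole numbers keep working, and a value that was set as an integer reads back as an integer, so nothing that parses the current /proc output should break.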
Can no longer build ipv6 built-in (2.6.11, today's BK head)
Changeset [EMAIL PROTECTED]|ChangeSet|20050310043957|06845 added cleanup to ipv6_init(), which calls ip6_route_cleanup() ip6_route_cleanup() is marked __exit so cannot be called from an __init section -- it's discarded by the linker from the image (although it'll be retained in a module). You get errors like this: ip6_route_cleanup: discarded in section `.exit.text' from net/built-in.o xfrm6_fini: discarded in section `.exit.text' from net/built-in.o fib6_gc_cleanup: discarded in section `.exit.text' from net/built-in.o ipv6_packet_cleanup: discarded in section `.exit.text' from net/built-in.o A simple fix is to delete the __exit from the various functions now that they're called other than at module_exit. Signed-off-by: Peter Chubb <[EMAIL PROTECTED]> Index: linux-2.5-import/net/ipv6/route.c === --- linux-2.5-import.orig/net/ipv6/route.c 2005-03-16 10:12:44.742595387 +1100 +++ linux-2.5-import/net/ipv6/route.c 2005-03-16 13:01:50.246678866 +1100 @@ -2116,7 +2116,7 @@ #endif } -void __exit ip6_route_cleanup(void) +void ip6_route_cleanup(void) { #ifdef CONFIG_PROC_FS proc_net_remove("ipv6_route"); Index: linux-2.5-import/net/ipv6/ipv6_sockglue.c === --- linux-2.5-import.orig/net/ipv6/ipv6_sockglue.c 2005-03-16 10:12:44.736736056 +1100 +++ linux-2.5-import/net/ipv6/ipv6_sockglue.c 2005-03-16 13:24:19.095793200 +1100 @@ -698,7 +698,7 @@ dev_add_pack(&ipv6_packet_type); } -void __exit ipv6_packet_cleanup(void) +void ipv6_packet_cleanup(void) { dev_remove_pack(&ipv6_packet_type); } Index: linux-2.5-import/net/ipv6/ip6_fib.c === --- linux-2.5-import.orig/net/ipv6/ip6_fib.c2005-03-15 12:28:44.819748921 +1100 +++ linux-2.5-import/net/ipv6/ip6_fib.c 2005-03-16 13:27:46.423351526 +1100 @@ -1218,7 +1218,7 @@ panic("cannot create fib6_nodes cache"); } -void __exit fib6_gc_cleanup(void) +void fib6_gc_cleanup(void) { del_timer(&ip6_fib_timer); kmem_cache_destroy(fib6_node_kmem); Index: linux-2.5-import/net/ipv6/xfrm6_policy.c === --- linux-2.5-import.orig/net/ipv6/xfrm6_policy.c 
2005-03-15 12:28:44.853928319 +1100 +++ linux-2.5-import/net/ipv6/xfrm6_policy.c2005-03-16 13:53:28.890552848 +1100 @@ -276,7 +276,7 @@ xfrm_policy_register_afinfo(&xfrm6_policy_afinfo); } -static void __exit xfrm6_policy_fini(void) +static void xfrm6_policy_fini(void) { xfrm_policy_unregister_afinfo(&xfrm6_policy_afinfo); } @@ -287,7 +287,7 @@ xfrm6_state_init(); } -void __exit xfrm6_fini(void) +void xfrm6_fini(void) { //xfrm6_input_fini(); xfrm6_policy_fini(); Index: linux-2.5-import/net/ipv6/xfrm6_state.c === --- linux-2.5-import.orig/net/ipv6/xfrm6_state.c2005-03-15 12:28:44.854904874 +1100 +++ linux-2.5-import/net/ipv6/xfrm6_state.c 2005-03-16 13:29:30.183337361 +1100 @@ -129,7 +129,7 @@ xfrm_state_register_afinfo(&xfrm6_state_afinfo); } -void __exit xfrm6_state_fini(void) +void xfrm6_state_fini(void) { xfrm_state_unregister_afinfo(&xfrm6_state_afinfo); } -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Tue, 15 Mar 2005 14:47:42 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> What I really want to do is deprivilege the driver code as much as
>> possible. Whatever a driver does, the rest of the system should
>> keep going. That way malicious or buggy drivers can only affect
>> the processes that are trying to use the device they manage.
>> Moreover, it should be possible to kill -9 a driver, then restart
>> it, without the rest of the system noticing more than a hiccup. To
>> do this, step one is to run the driver in user space, so that it's
>> subject to the same resource management control as any other
>> process. Step two, which is a lot harder, is to connect the driver
>> back into the kernel so that it can be shared. Tun/Tap can be used
>> for network devices, but it's really too slow -- you need zero-copy
>> and shared notification.

Jon> Have you considered running the drivers in a domain under Xen?

See the paper presented by Karlsruhe at OSDI:

Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Götz: Unmodified Device Driver Reuse and Improved System Dependability via Virtual Machines. OSDI '04.

They're using L4, rather than Xen, as the paravirtualisation layer.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes: Jon> On Mon, 14 Mar 2005 12:42:27 +1100, Peter Chubb Jon> <[EMAIL PROTECTED]> wrote: >> >>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes: >> >> >> The scenario I'm thinking about with these patches are things >> like >> low-latency user-level networking between nodes in a >> cluster, where >> for good performance even with a kernel driver >> you don't want to >> share your interrupt line with anything else. >> Jon> The code needs to refuse to install if the IRQ line is shared. >> It does. The request_irq() call explicitly does not include >> SA_SHARED in its flags, so if the line is shared, it'll return an >> error to user space when the driver tries to open the file >> representing the interrupt. Jon> Please put some big comments warning people about adding Jon> SA_SHARED. I can easily see someone thinking that they are fixing Jon> a bug by adding it. I'd probably even write a paragraph about Jon> what will happen if SA_SHARED is added. Will do. The main problem here is X86, as other architectures either don't care, or have enough interrupt lines. And the people who are paying me for this kind of thing all run IA64 What I really want to do is deprivilege the driver code as much as possible. Whatever a driver does, the rest of the system should keep going. That way malicious or buggy drivers can only affect the processes that are trying to use the device they manage. Moreover, it should be possible to kill -9 a driver, then restart it, without the rest of the system noticing more than a hiccup. To do this, step one is to run the driver in user space, so that it's subject to the same resource management control as any other process. Step two, which is a lot harder, is to connect the driver back into the kernel so that it can be shared. Tun/Tap can be used for network devices, but it's really too slow -- you need zero-copy and shared notification. 
-- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever*
inode_lock heavily contended in 2.6.11
When running reaim7 on a 12-way IA64 on an ext2 filesystem on a ram disc, I see very heavy contention on inode_lock.

lockstat output shows:

SPINLOCKS         HOLD            WAIT
  UTIL   CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)    TOTAL NOWAIT  SPIN RJECT  NAME
 46.8% 52.4%   1.9us( 130us)   20us(8073us)(21.5%)  5072151  47.6% 52.4%    0%  inode_lock
 15.9% 59.5%   3.8us(  61us)   18us(7067us)( 3.9%)   852983  40.5% 59.5%    0%    __sync_single_inode+0xf0
  9.2% 59.0%   1.2us(  25us)   20us(8073us)( 7.8%)  1596487  41.0% 59.0%    0%    generic_osync_inode+0xe0
(etc).

Is anyone else seeing this on more realistic workloads?

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes: Jon> On Mon, 14 Mar 2005 12:42:27 +1100, Peter Chubb Jon> <[EMAIL PROTECTED]> wrote: >> >>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes: >> >> >> The scenario I'm thinking about with these patches are things >> like >> low-latency user-level networking between nodes in a >> cluster, where >> for good performance even with a kernel driver >> you don't want to >> share your interrupt line with anything else. Jon> Instead of making up a new API what about making a library of Jon> calls that emulates the common entry points used by device Jon> drivers. The version I did for UML could take the same driver and Jon> run it in user space or the kernel without changing source Jon> code. I found this very useful. The in-kernel device drivers interface is very large --- I want to start with something a bit simpler. We do have a compatibility library, as yet unreleased, that allows the same drivers to run in-kernel or in user space. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Sat, 12 Mar 2005 10:11:18 -0700 (MST), Zwane Mwaikambo
Jon> <[EMAIL PROTECTED]> wrote:
>> Alan's proposal sounds very plausible and additionally if we find
>> that we have an irq line screaming we could use the same supplied
>> information to disable userspace interrupt handled devices first.

Jon> I like it too and it would help Xen. Now we just need to modify
Jon> 800 device drivers to use it.

It's incomplete. But you probably knew that...

The main problem I see is that even with the proposed interface, you'd need to disable the interrupt in the interrupt controller, because merely acknowledging an interrupt to a device doesn't stop it from interrupting. And you really want the device to stop asserting the interrupt before doing an EOI, unless you're going to mask the interrupt.

So you'd need to have an interface that not only acknowledged the current interrupt but also prevented the device from interrupting. That typically means reading a status register (slow!) and then setting one or more bits in one or more control registers.

Also, for a user level driver you really want to do the EOI before invoking user space. Otherwise, depending on the interrupt controller, lower numbered interrupts could be masked until the user space returns, which might be a long time off.

Reading the status register is typically one of the slowest single parts of a device driver (latency can be > 2 usec), so you don't really want to have to read it again within the driver. You'd probably want to pass it as part of the interrupt arguments to the driver.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes: >> The scenario I'm thinking about with these patches are things like >> low-latency user-level networking between nodes in a cluster, where >> for good performance even with a kernel driver you don't want to >> share your interrupt line with anything else. Jon> The code needs to refuse to install if the IRQ line is shared. It does. The request_irq() call explicitly does not include SA_SHARED in its flags, so if the line is shared, it'll return an error to user space when the driver tries to open the file representing the interrupt. Jon> Also what about SMP, if you shut the IRQ off on one CPU isn't it Jon> still enabled on all of the others? Nope. disable_irq_nosync() talks to the interrupt controller, which is common to all the processors. The main problem is that it's slow, because it has to go off-chip. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes: Jon> On Fri, 11 Mar 2005 11:29:20 +0100, Pavel Machek <[EMAIL PROTECTED]> Jon> wrote: >> Hi! >> >> > As many of you will be aware, we've been working on >> infrastructure for > user-mode PCI and other drivers. The first >> step is to be able to > handle interrupts from user >> space. Subsequent patches add > infrastructure for setting up DMA >> for PCI devices. >> > >> > The user-level interrupt code doesn't depend on the other >> patches, and > is probably the most mature of this patchset. >> >> Okay, I like it; it means way easier PCI driver development. Jon> It won't help with PCI driver development. I tried implementing Jon> this for UML. If your driver has any bugs it won't get the Jon> interrupts acknowledged correctly and you'll end up rebooting. That's not actually true, at least when we developed drivers here. The only times we had to reboot were the times we mucked up the dma register settings, and dma'd all over the kernel by mistake... -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes: Jon> On Fri, 11 Mar 2005 14:36:10 +1100, Peter Chubb Jon> <[EMAIL PROTECTED]> wrote: >> As many of you will be aware, we've been working on infrastructure >> for user-mode PCI and other drivers. The first step is to be able >> to handle interrupts from user space. Subsequent patches add >> infrastructure for setting up DMA for PCI devices. Jon> I've tried implementing this before and could not get around the Jon> interrupt problem. Most interrupts on the x86 architecture are Jon> shared. Disabling the IRQ at the PIC blocks all of the shared Fortunately, most interrupts on IA64, ARM, etc., are unshared. And with PCI-Express, the problem will go away. Even on X86, things aren't all bad: one can usually find a PCI slot which doesn't share interrupts with anything you care about. The scenario I'm thinking about with these patches are things like low-latency user-level networking between nodes in a cluster, where for good performance even with a kernel driver you don't want to share your interrupt line with anything else. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)
>>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes: Greg> On Fri, Mar 11, 2005 at 07:34:46PM +1100, Peter Chubb wrote: >> >>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes: >> Greg> On Fri, Mar 11, 2005 at 02:37:17PM +1100, Peter Chubb wrote: >> >> +/* + * The PCI subsystem is implemented as yet-another pseudo >> >> filesystem, + * albeit one that is never mounted. + * This is >> its >> magic number. + */ +#define USR_PCI_MAGIC (0x12345678) >> Greg> If you make it a real, mountable filesystem, then you don't need Greg> to have any of your new syscalls, right? Why not just do that Greg> instead? >> >> >> The only call that would go is usr_pci_open() -- you'd still need >> usr_pci_map() Greg> see mmap(2) mmap maps a file's contents into your own virtual memory. usr_pci_map maps part of your own virtual memory into pci bus space for a particular device (using the IOMMU if your machine has one), and returns a scatterlist of bus addresses to hand to the device. Different semantics entirely. Greg> In fact, both of the above can be done today from /proc/bus/pci/ Greg> right? Nope. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)
On Gwe, 2005-03-11 at 03:36, Peter Chubb wrote: > +static irqreturn_t irq_proc_irq_handler(int irq, void *vidp, struct pt_regs > *regs) > +{ > + struct irq_proc *idp = (struct irq_proc *)vidp; > + > + BUG_ON(idp->irq != irq); > + disable_irq_nosync(irq); > + atomic_inc(&idp->count); > + wake_up(&idp->q); > + return IRQ_HANDLED; Alan> You just deadlocked the machine in many configurations. You can't use Alan> disable_irq for this trick you have to tell the kernel how to handle it. Can you elaborate, please? In particular, why doesn't essentially the same action (disabling an interrupt before the EOI) in note_interrupt() not lock up the machine? I can see there'd be problems if the code allowed shared interrupts, but it doesn't. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Microstate Accounting for 2.6.11
>>>>> "Andi" == Andi Kleen <[EMAIL PROTECTED]> writes:

Andi> Andrew Morton <[EMAIL PROTECTED]> writes:
>> Why does the kernel need this feature?
>>
>> Have you any numbers on the overhead?

Andi> It does RDTSC and lots of complicated stuff twice for each
Andi> system call. On P4 this will be extremely slow (> 1000 cycles
Andi> combined). It is pretty unlikely that whatever it does justifies
Andi> this extreme overhead in a critical fast path.

Not really `lots of complicated stuff'. Just swap a timer and set a flag on entry:

	msp->timers[msp->laststate] += now - msp->lastchange
	msp->lastchange = now
	msp->laststate = ONCPU_SYS
	msp->cflags |= MSA_SYS

And swap timers and clear the flag on exit. The flag's needed to force return to ONCPU_SYS rather than ONCPU_USR if the task is preempted or interrupted while in a system call.

If there's a simpler, cheaper, faster way to track time spent in system calls (as opposed to time spent in interrupt handlers, or on the run queue) then I'd like to know what it is.

And I recognise there are lots of people who don't want this, but there are some who do. I've maintained this patch since mid 2003, and have seen a steady trickle of downloads: one or two a week.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
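The entry/exit bookkeeping described in this message can be modelled in userspace roughly as follows. This is only a sketch: the field names (timers, laststate, lastchange, cflags) and ONCPU_SYS, ONCPU_USR and MSA_SYS come from the message, but the INTERRUPTED state, the function names and the explicit `now` argument are assumptions made so the example is self-contained.

```c
#include <stdint.h>

/* INTERRUPTED is an assumed state name, used here for illustration. */
enum msa_state { ONCPU_USR, ONCPU_SYS, INTERRUPTED, NR_MSA_STATES };

#define MSA_SYS 0x1u	/* set while the task is inside a system call */

struct microstates {
	uint64_t timers[NR_MSA_STATES];	/* accumulated time per state */
	uint64_t lastchange;		/* timestamp of last state change */
	enum msa_state laststate;	/* state being timed right now */
	unsigned int cflags;
};

/* Charge the elapsed time to the old state, then start timing `next`. */
static void msa_switch(struct microstates *msp, enum msa_state next,
		       uint64_t now)
{
	msp->timers[msp->laststate] += now - msp->lastchange;
	msp->lastchange = now;
	msp->laststate = next;
}

void enter_syscall(struct microstates *msp, uint64_t now)
{
	msa_switch(msp, ONCPU_SYS, now);
	msp->cflags |= MSA_SYS;
}

void leave_syscall(struct microstates *msp, uint64_t now)
{
	msa_switch(msp, ONCPU_USR, now);
	msp->cflags &= ~MSA_SYS;
}

void enter_irq(struct microstates *msp, uint64_t now)
{
	msa_switch(msp, INTERRUPTED, now);
}

/* The flag decides whether we resume in ONCPU_SYS or ONCPU_USR. */
void leave_irq(struct microstates *msp, uint64_t now)
{
	msa_switch(msp, (msp->cflags & MSA_SYS) ? ONCPU_SYS : ONCPU_USR, now);
}
```

Each transition costs one clock read, one subtraction, one addition and two stores, which is the whole per-syscall overhead being argued about. Without the MSA_SYS flag, an interrupt taken in the middle of a system call would have no way to know that the remaining time belongs to ONCPU_SYS.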
Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)
>>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes: Greg> On Fri, Mar 11, 2005 at 02:37:17PM +1100, Peter Chubb wrote: >> +/* + * The PCI subsystem is implemented as yet-another pseudo >> filesystem, + * albeit one that is never mounted. + * This is its >> magic number. + */ +#define USR_PCI_MAGIC (0x12345678) Greg> If you make it a real, mountable filesystem, then you don't need Greg> to have any of your new syscalls, right? Why not just do that Greg> instead? The only call that would go is usr_pci_open() -- you'd still need usr_pci_map(), usr_pci_unmap() and usr_pci_get_consistent(). -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Microstate Accounting for 2.6.11
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> Peter Chubb <[EMAIL PROTECTED]> wrote:
>> Timing data on threads at present is pretty crude: when the timer
>> interrupt occurs, a tick is added to either system time or user
>> time for the currently running thread. Thus in an unpatched kernel
>> one can distinguish three timed states: On-cpu in userspace, on-cpu
>> in system space, and not running.
>>
>> The actual number of states is much larger. A thread can be on a
>> runqueue or the expired queue (i.e., ready to run but not running),
>> sleeping on a semaphore or on a futex, having its time stolen to
>> service an interrupt, etc., etc.
>>
>> This patch adds timers per-state to each struct task_struct, so
>> that time in all these states can be tracked. This patch contains
>> the core code to do the timing, and to initialise the timers.
>> Subsequent patches enable the code (by adding Kconfig options) and
>> add hooks to track state changes.

Andrew> Why does the kernel need this feature?

I find that it's useful when trying to work out why a thread is going more slowly than it needs to. Userspace tools in the CVS repository at gelato.unsw.edu.au let you graph in real time the time spent in each state, so you get graphs like this: http://gelato.unsw.edu.au/patches/snapshot.png which shows mplay skipping because of a slow disk/filesystem.

Andrew> Have you any numbers on the overhead?

Around 5% on LMbench context switch numbers for uniprocessor; negligible on SMP (but SMP context switch results are horrible at the moment according to LMbench2: almost 16 usec). select on 10 fds goes from 1.665 usec to 1.701 usec.

Andrew> The preempt_disable() in sys_msa() seems odd.

Yes, I only added that yesterday. It's to prevent migration while updating the current timer. All the other places where the current timer is updated are naturally protected against this. It should probably be a local_irq_disable() instead.
Peter C
Microstate accounting, IA64 support
Microstate Accounting: Add suppoort for IA64. linux-2.6-ustate/arch/ia64/Kconfig | 25 +++ linux-2.6-ustate/arch/ia64/kernel/entry.S| 44 +++ linux-2.6-ustate/arch/ia64/kernel/irq_ia64.c | 21 +++- linux-2.6-ustate/arch/ia64/kernel/ivt.S |8 +++- linux-2.6-ustate/include/asm-ia64/msa.h | 33 linux-2.6-ustate/include/asm-ia64/unistd.h |1 7 files changed, 129 insertions(+), 5 deletions(-) Index: linux-2.6-ustate/arch/ia64/Kconfig === --- linux-2.6-ustate.orig/arch/ia64/Kconfig 2005-03-10 09:13:01.780632777 +1100 +++ linux-2.6-ustate/arch/ia64/Kconfig 2005-03-10 09:16:14.593655619 +1100 @@ -302,6 +302,31 @@ little bigger and slows down execution a bit, but it is generally a good idea to turn this on. If you're unsure, say Y. +config MICROSTATE + bool "Microstate accounting" + help + This option causes the kernel to keep very accurate track of + how long your threads spend on the runqueues, running, or asleep or + stopped. It will slow down your kernel. + Times are reported in /proc/pid/msa and through a new msa() + system call. +choice + depends on MICROSTATE + prompt "Microstate timing source" + default MICROSTATE_ITC + help + On IA64 one can use two timeing sources for the microstate + accounting; the on-chip interval counter, or Linux's + time-of-day clock. The first is very cheap; the other is + more accurate on SMP systems. + +config MICROSTATE_ITC + bool "Use on-chip ITC for microstate timing" + +config MICROSTATE_TOD + bool "Use time-of-day clock for microstate timings" +endchoice + config IA64_PALINFO tristate "/proc/pal support" help Index: linux-2.6-ustate/include/asm-ia64/msa.h === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6-ustate/include/asm-ia64/msa.h 2005-03-10 09:16:14.594632174 +1100 @@ -0,0 +1,33 @@ +/ + * asm-ia64/msa.h + * + * Provide an architecture-specific clock. 
+ */ + +#ifndef _ASM_IA64_MSA_H +#define _ASM_IA64_MSA_H + +#include +#include +#include + + +# if defined(CONFIG_MICROSTATE_ITC) +# define MSA_NOW(now) do { now = (clk_t)get_cycles(); } while (0) + +# define MSA_TO_NSEC(clk) ((10*clk) / cpu_data(smp_processor_id())->itc_freq) + +# elif defined(CONFIG_MICROSTATE_TOD) +static inline void msa_now(clk_t *nsp) { + struct timeval tv; + do_gettimeofday(&tv); + *nsp = tv.tv_sec * 100 + tv.tv_usec; +} +# define MSA_NOW(x) msa_now(&x) +# define MSA_TO_NSEC(clk) ((clk) * 1000) + +# else +# include +# endif + +#endif /* _ASM_IA64_MSA_H */ Microstate Accounting: Track time in system calls for IA64 arch/ia64/kernel/entry.S | 44 arch/ia64/kernel/ivt.S |8 ++-- 2 files changed, 50 insertions(+), 2 deletions(-) Index: linux-2.6-ustate/arch/ia64/kernel/entry.S === --- linux-2.6-ustate.orig/arch/ia64/kernel/entry.S 2005-03-10 09:13:01.149778160 +1100 +++ linux-2.6-ustate/arch/ia64/kernel/entry.S 2005-03-10 09:16:15.157128068 +1100 @@ -589,6 +589,46 @@ .ret4: br.cond.sptk ia64_leave_kernel END(ia64_strace_leave_kernel) +#ifdef CONFIG_MICROSTATE +/* + * preserve input registers, + * and r8 + */ +GLOBAL_ENTRY(invoke_msa_end_syscall) + .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8) + alloc loc1=ar.pfs,8,4,0,0 + mov loc0=rp + .body + ;; + mov loc2=ret0 + mov loc3=ret2 + br.call.sptk.many rp=msa_end_syscall +1: mov rp=loc0 + mov ret0=loc2 + mov ret2=loc3 + mov ar.pfs=loc1 + br.ret.sptk.many rp +END(invoke_msa_end_syscall) +/* + * Preserves in0-7, and all callee-save registers. 
+ */ +GLOBAL_ENTRY(invoke_msa_start_syscall) + .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8) + alloc loc1=ar.pfs,8,4,0,0 + mov loc0=rp + .body + mov loc2=r3 + mov loc3=r15 + ;; + br.call.sptk.many rp=msa_start_syscall +1: mov r15=loc3 + mov r3=loc2 + mov ar.pfs=loc1 + mov rp=loc0 + br.ret.sptk.many rp +END(invoke_msa_start_syscall) +#endif /* CONFIG_MICROSTATE */ + GLOBAL_ENTRY(ia64_ret_from_clone) PT_REGS_UNWIND_INFO(0) { /* @@ -671,6 +711,10 @@ */ ENTRY(ia64_leave_syscall) PT_REGS_UNWIND_INFO(0) +#ifdef CONFIG_MICROSTATE + br.call.sptk.many rp=invoke_msa_end_syscall +1: +#endif /* * work.need_resched etc. mustn't get changed by this CPU before it returns to * user- or fsys-mode, hence we di
Microstate Accounting for 2.6.11, patch 4/6
Microstate accounting: Account for time in interrupt handlers for I386. arch/i386/kernel/irq.c | 13 - 1 files changed, 12 insertions(+), 1 deletion(-) Index: linux-2.6-ustate/arch/i386/kernel/irq.c === --- linux-2.6-ustate.orig/arch/i386/kernel/irq.c2005-03-10 09:13:00.115606274 +1100 +++ linux-2.6-ustate/arch/i386/kernel/irq.c 2005-03-10 09:16:16.032121680 +1100 @@ -55,6 +55,8 @@ #endif irq_enter(); + msa_start_irq(irq); + #ifdef CONFIG_DEBUG_STACKOVERFLOW /* Debugging check for stack overflow: is there less than 1KB free? */ { @@ -101,6 +103,7 @@ #endif __do_IRQ(irq, regs); + msa_finish_irq(irq); irq_exit(); return 1; @@ -221,10 +224,18 @@ seq_printf(p, "%3d: ",i); #ifndef CONFIG_SMP seq_printf(p, "%10u ", kstat_irqs(i)); +#ifdef CONFIG_MICROSTATE + seq_printf(p, "%10llu", msa_irq_time(0, i)); +#endif #else for (j = 0; j < NR_CPUS; j++) - if (cpu_online(j)) + if (cpu_online(j)) { seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); +#ifdef CONFIG_MICROSTATE + seq_printf(p, "%10llu", msa_irq_time(j, i)); +#endif + } + #endif seq_printf(p, " %14s", irq_desc[i].handler->typename); seq_printf(p, " %s", action->name); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Microstate Accounting for 2.6.11, patch 6/6
Microstate accounting: Track time spent asleep while paging, in poll() or select(), or on a futex separately from other sleeps. fs/select.c |2 ++ kernel/futex.c |2 ++ mm/memory.c |6 +- Index: linux-2.6-ustate/mm/memory.c === --- linux-2.6-ustate.orig/mm/memory.c 2005-03-10 09:12:59.492564100 +1100 +++ linux-2.6-ustate/mm/memory.c2005-03-10 09:16:16.583875465 +1100 @@ -2079,6 +2079,7 @@ if (is_vm_hugetlb_page(vma)) return VM_FAULT_SIGBUS; /* mapping truncation does this. */ + msa_next_state(current, PAGING_SLEEP); /* * We need the page table lock to synchronize with kswapd * and the SMP-safe atomic PTE updates. @@ -2098,10 +2099,13 @@ if (!pte) goto oom; - return handle_pte_fault(mm, vma, address, write_access, pte, pmd); + int ret = handle_pte_fault(mm, vma, address, write_access, pte, pmd); + msa_next_state(current, MSA_UNKNOWN); + return ret; oom: spin_unlock(&mm->page_table_lock); + msa_next_state(current, MSA_UNKNOWN); return VM_FAULT_OOM; } Index: linux-2.6-ustate/kernel/futex.c === --- linux-2.6-ustate.orig/kernel/futex.c2005-03-10 09:12:58.843154938 +1100 +++ linux-2.6-ustate/kernel/futex.c 2005-03-10 09:16:17.109262256 +1100 @@ -39,6 +39,7 @@ #include #include #include +#include #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8) @@ -571,6 +572,7 @@ * wakes us up. */ + msa_next_state(current, FUTEX_SLEEP); /* add_wait_queue is the barrier after __set_current_state. 
*/ __set_current_state(TASK_INTERRUPTIBLE); add_wait_queue(&q.waiters, &wait); Index: linux-2.6-ustate/fs/select.c === --- linux-2.6-ustate.orig/fs/select.c 2005-03-10 09:12:59.182996124 +1100 +++ linux-2.6-ustate/fs/select.c2005-03-10 09:16:16.843639194 +1100 @@ -256,6 +256,7 @@ retval = table.error; break; } + msa_next_state(current, POLL_SLEEP); __timeout = schedule_timeout(__timeout); } __set_current_state(TASK_RUNNING); @@ -447,6 +448,7 @@ count = wait->error; if (count) break; + msa_next_state(current, POLL_SLEEP); timeout = schedule_timeout(timeout); } __set_current_state(TASK_RUNNING); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Microstate Accounting for 2.6.11, patch 5/6
Microstate accounting: Add the I386 system call.

 arch/i386/kernel/entry.S  |    2 +-
 include/asm-i386/unistd.h |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6-ustate/arch/i386/kernel/entry.S
===
--- linux-2.6-ustate.orig/arch/i386/kernel/entry.S 2005-03-10 09:16:14.888575341 +1100
+++ linux-2.6-ustate/arch/i386/kernel/entry.S 2005-03-10 09:16:15.446188457 +1100
@@ -876,7 +876,7 @@
  .long sys_mq_getsetattr
  .long sys_ni_syscall /* reserved for kexec */
  .long sys_waitid
- .long sys_ni_syscall /* 285 */ /* available */
+ .long sys_msa /* 285 */ /* available */
  .long sys_add_key
  .long sys_request_key
  .long sys_keyctl
Index: linux-2.6-ustate/include/asm-i386/unistd.h
===
--- linux-2.6-ustate.orig/include/asm-i386/unistd.h 2005-03-10 09:13:00.813843194 +1100
+++ linux-2.6-ustate/include/asm-i386/unistd.h 2005-03-10 09:16:15.448141568 +1100
@@ -290,7 +290,7 @@
 #define __NR_mq_getsetattr (__NR_mq_open+5)
 #define __NR_sys_kexec_load 283
 #define __NR_waitid 284
-/* #define __NR_sys_setaltroot 285 */
+#define __NR_sys_msa 285
 #define __NR_add_key 286
 #define __NR_request_key 287
 #define __NR_keyctl 288
Microstate Accounting for 2.6.11, patch 3/6
Microstate accounting: Provide I386-dependent MSA clocks, and Kconfig options. arch/i386/Kconfig | 39 ++- include/asm-i386/msa.h | 49 + 2 files changed, 87 insertions(+), 1 deletion(-) Signed-off-by: Peter Chubb <[EMAIL PROTECTED]> Index: linux-2.6-ustate/arch/i386/Kconfig === --- linux-2.6-ustate.orig/arch/i386/Kconfig 2005-03-11 09:59:38.773632446 +1100 +++ linux-2.6-ustate/arch/i386/Kconfig 2005-03-11 09:59:38.777538666 +1100 @@ -923,8 +923,45 @@ If unsure, say Y. Only embedded should say N here. -endmenu +config MICROSTATE + bool "Microstate accounting" + help + This option causes the kernel to keep very accurate track of +how long your threads spend on the runqueues, running, or asleep or +stopped. It will slow down your kernel. +Times are reported in /proc/pid/msa and through a new msa() +system call. + +choice + depends on MICROSTATE + prompt "Microstate timing source" + default MICROSTATE_TSC + +config MICROSTATE_PM + bool "Use Power-Management timer for microstate timings" + depends on X86_PM_TIMER + help +If your machine is ACPI enabled and uses power-management, then the +TSC runs at a variable rate, which will distort the +microstate measurements. This timer, although having +slightly more overhead, and a lower resolution (279 +nanoseconds or so) will always run at a constant rate. + +config MICROSTATE_TSC + bool "Use on-chip TSC for microstate timings" + depends on X86_TSC + help + If your machine's clock runs at constant rate, then this timer +gives you cycle precision in measureing times spent in microstates. + +config MICROSTATE_TOD + bool "Use time-of-day clock for microstate timings" + help + If none of the other timers are any good for you, this timer +will give you micro-second precision. 
+endchoice +endmenu menu "Power management options (ACPI, APM)" depends on !X86_VOYAGER Index: linux-2.6-ustate/include/asm-i386/msa.h === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6-ustate/include/asm-i386/msa.h 2005-03-11 09:59:38.779491777 +1100 @@ -0,0 +1,49 @@ +/ + * asm-i386/msa.h + * + * Provide an architecture-specific clock. + */ + +#ifndef _ASM_I386_MSA_H +# define _ASM_I386_MSA_H + +# include + + +# if defined(CONFIG_MICROSTATE_TSC) +/* + * Use the processor's time-stamp counter as a timesource + */ +# include +# include + +# define MSA_NOW(now) rdtscll(now) + +extern unsigned long cpu_khz; +# define MSA_TO_NSEC(clk) ({ clk_t _x = ((clk) * 100ULL); do_div(_x, cpu_khz); _x; }) + +# elif defined(CONFIG_MICROSTATE_PM) +/* + * Use the system's monotonic clock as a timesource. + * This will only be enabled if the Power Management Timer is enabled. + */ +unsigned long long monotonic_clock(void); +# define MSA_NOW(now) do { now = monotonic_clock(); } while (0) +# define MSA_TO_NSEC(clk) (clk) + +# elif defined(CONFIG_MICROSTATE_TOD) +/* + * Fall back to gettimeofday. + * This one is incompatible with interrupt-time measurement on some processors. + */ +static inline void msa_now(clk_t *nsp) { + struct timeval tv; + do_gettimeofday(&tv); + *nsp = tv.tv_sec * 100 + tv.tv_usec; +} +# define MSA_NOW(x) msa_now(&x) +# define MSA_TO_NSEC(clk) ((clk) * 1000) +# endif + + +#endif /* _ASM_I386_MSA_H */ I386 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
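The unit conversions behind MSA_TO_NSEC can be checked in user space with a small sketch. The archived copy of the patch shows `100ULL` and `* 100` as the scale factors; the sketch assumes the intended factor is 1000000 in both places, which is what the units require, but that is an inference rather than something the surviving text confirms. The function names are illustrative and are not part of the patch.

```c
#include <stdint.h>

/*
 * Convert TSC cycles to nanoseconds, given the CPU clock in kHz.
 * cycles / (khz * 1000) seconds == cycles * 1000000 / khz nanoseconds.
 * Hypothetical helper: the patch does this inside a macro with do_div().
 * The multiplication overflows 64 bits after roughly 1.8e13 cycles.
 */
uint64_t tsc_cycles_to_nsec(uint64_t cycles, uint64_t cpu_khz)
{
    return (cycles * 1000000ULL) / cpu_khz;
}

/*
 * Time-of-day fallback: fold seconds and microseconds into one
 * microsecond count (assumed factor 1000000), as msa_now() would.
 */
uint64_t tod_to_usec(uint64_t sec, uint64_t usec)
{
    return sec * 1000000ULL + usec;
}

/* Microsecond clock to nanoseconds, matching MSA_TO_NSEC(clk) = clk * 1000. */
uint64_t usec_to_nsec(uint64_t usec)
{
    return usec * 1000ULL;
}
```

For instance, on a 2 GHz machine (cpu_khz of 2000000), two million cycles come out as one millisecond.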
Microstate Accounting for 2.6.11, patch 2/6
Microstate Accounting: Add hooks into the scheduler to track state changes. Arrange for parent process's child times to be updated at process exit. kernel/sched.c |8 kernel/exit.c |3 +++ Index: linux-2.6-ustate/kernel/sched.c === --- linux-2.6-ustate.orig/kernel/sched.c2005-03-11 09:59:31.109628035 +1100 +++ linux-2.6-ustate/kernel/sched.c 2005-03-11 09:59:31.116463921 +1100 @@ -635,6 +635,7 @@ */ static inline void __activate_task(task_t *p, runqueue_t *rq) { + msa_set_timer(p, ONACTIVEQUEUE); enqueue_task(p, rq->active); rq->nr_running++; } @@ -1238,6 +1239,7 @@ if (unlikely(!current->array)) __activate_task(p, rq); else { + msa_set_timer(p, ONACTIVEQUEUE); p->prio = current->prio; list_add_tail(&p->run_list, ¤t->run_list); p->array = current->array; @@ -2422,6 +2424,7 @@ if (!rq->expired_timestamp) rq->expired_timestamp = jiffies; if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { + msa_next_state(p, ONEXPIREDQUEUE); enqueue_task(p, rq->expired); if (p->static_prio < rq->best_expired_prio) rq->best_expired_prio = p->static_prio; @@ -2733,6 +2736,7 @@ array = rq->active; rq->expired_timestamp = 0; rq->best_expired_prio = MAX_PRIO; + msa_flip_expired(prev); } else schedstat_inc(rq, sched_noswitch); @@ -2773,6 +2777,8 @@ rq->curr = next; ++*switch_count; + msa_switch(prev, next); + prepare_arch_switch(rq, next); prev = context_switch(rq, prev, next); barrier(); @@ -3693,6 +3699,8 @@ */ if (rt_task(current)) target = rq->active; + else + msa_next_state(current, ONEXPIREDQUEUE); if (current->array->nr_active == 1) { schedstat_inc(rq, yld_act_empty); Index: linux-2.6-ustate/kernel/exit.c === --- linux-2.6-ustate.orig/kernel/exit.c 2005-03-11 09:59:36.360564796 +1100 +++ linux-2.6-ustate/kernel/exit.c 2005-03-11 09:59:36.364471017 +1100 @@ -93,6 +93,9 @@ } sched_exit(p); + + msa_update_parent(p->parent, p); + write_unlock_irq(&tasklist_lock); spin_unlock(&p->proc_lock); proc_pid_flush(proc_dentry); - To unsubscribe from this list: send the line "unsubscribe 
linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Microstate Accounting for 2.6.11, patch 3/
Microstate Accounting: Track time in system calls and interrupts, i386 code. Signed-off-by; Peter Chubb <[EMAIL PROTECTED]> arch/i386/kernel/entry.S | 16 arch/i386/kernel/irq.c | 13 - Index: linux-2.6-ustate/arch/i386/kernel/entry.S === --- linux-2.6-ustate.orig/arch/i386/kernel/entry.S 2005-03-10 09:13:01.448604031 +1100 +++ linux-2.6-ustate/arch/i386/kernel/entry.S 2005-03-10 09:16:14.888575341 +1100 @@ -222,10 +222,18 @@ /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */ testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp) jnz syscall_trace_entry +#ifdef CONFIG_MICROSTATE + pushl %eax + call msa_start_syscall + popl%eax +#endif cmpl $(nr_syscalls), %eax jae syscall_badsys call *sys_call_table(,%eax,4) movl %eax,EAX(%esp) +#ifdef CONFIG_MICROSTATE + call msa_end_syscall +#endif cli movl TI_flags(%ebp), %ecx testw $_TIF_ALLWORK_MASK, %cx @@ -250,9 +258,17 @@ cmpl $(nr_syscalls), %eax jae syscall_badsys syscall_call: +#ifdef CONFIG_MICROSTATE + pushl %eax + call msa_start_syscall + popl%eax +#endif call *sys_call_table(,%eax,4) movl %eax,EAX(%esp) # store the return value syscall_exit: +#ifdef CONFIG_MICROSTATE + call msa_end_syscall +#endif cli # make sure we don't miss an interrupt # setting need_resched or sigpending # between sampling and the iret Index: linux-2.6-ustate/arch/i386/kernel/irq.c === --- linux-2.6-ustate.orig/arch/i386/kernel/irq.c2005-03-10 09:13:00.115606274 +1100 +++ linux-2.6-ustate/arch/i386/kernel/irq.c 2005-03-10 09:16:16.032121680 +1100 @@ -55,6 +55,8 @@ #endif irq_enter(); + msa_start_irq(irq); + #ifdef CONFIG_DEBUG_STACKOVERFLOW /* Debugging check for stack overflow: is there less than 1KB free? 
*/ { @@ -101,6 +103,7 @@ #endif __do_IRQ(irq, regs); + msa_finish_irq(irq); irq_exit(); return 1; @@ -221,10 +224,18 @@ seq_printf(p, "%3d: ",i); #ifndef CONFIG_SMP seq_printf(p, "%10u ", kstat_irqs(i)); +#ifdef CONFIG_MICROSTATE + seq_printf(p, "%10llu", msa_irq_time(0, i)); +#endif #else for (j = 0; j < NR_CPUS; j++) - if (cpu_online(j)) + if (cpu_online(j)) { seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); +#ifdef CONFIG_MICROSTATE + seq_printf(p, "%10llu", msa_irq_time(j, i)); +#endif + } + #endif seq_printf(p, " %14s", irq_desc[i].handler->typename); seq_printf(p, " %s", action->name); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Microstate Accounting for 2.6.11
Microstate Accounting

Timing data on threads at present is pretty crude: when the timer interrupt occurs, a tick is added to either system time or user time for the currently running thread. Thus in an unpatched kernel one can distinguish three timed states: on-cpu in userspace, on-cpu in system space, and not running.

The actual number of states is much larger. A thread can be on a runqueue or the expired queue (i.e., ready to run but not running), sleeping on a semaphore or on a futex, having its time stolen to service an interrupt, etc., etc.

This patch adds timers per-state to each struct task_struct, so that time in all these states can be tracked.

This patch contains the core code to do the timing, and to initialise the timers. Subsequent patches enable the code (by adding Kconfig options) and add hooks to track state changes.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

 include/asm-generic/msa.h  |  21
 include/linux/msa-kernel.h |  99
 include/linux/msa.h        |  46
 include/linux/sched.h      |   4
 kernel/Makefile            |   2
 kernel/fork.c              |   2
 kernel/msa.c               | 472
 7 files changed, 645 insertions(+), 1 deletion(-)

Index: linux-2.6-ustate/kernel/msa.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6-ustate/kernel/msa.c 2005-03-11 09:58:20.574030768 +1100 @@ -0,0 +1,472 @@ +/* + * Microstate accounting. + * Try to account for various states much more accurately than + * the normal code does. + * + * Copyright (c) Peter Chubb 2005 + * UNSW and National ICT Australia + * This code is released under the Gnu Public Licence, version 2. + */ + + +#include +#include +#include +#include +#ifdef CONFIG_MICROSTATE +#include +#include +#include +#include + +#include + +/* + * Track time spend in interrupt handlers. 
+ */ +struct msa_irq { + clk_t times; + clk_t last_entered; +}; + +/* + * When the scheduler last swapped active and expired queues + */ +static DEFINE_PER_CPU(clk_t, queueflip_time); + +/* + * Time spent in interrupt handlers + */ +static DEFINE_PER_CPU(struct msa_irq[NR_IRQS+1], msa_irq); + + +/** + * msa_switch: Update microstate timers when switching from one task to another. + * @prev, @next: The prev task is coming off the processor; + *the new task is about to run on the processor. + * + * Update the times in both prev and next. It may be necessary to infer the + * next state for each task. + * + */ +void +msa_switch(struct task_struct *prev, struct task_struct *next) +{ + struct microstates *msprev = &prev->microstates; + struct microstates *msnext = &next->microstates; + clk_t now; + enum thread_state next_state; + int interrupted = msprev->cur_state == INTERRUPTED; + + preempt_disable(); + + MSA_NOW(now); + + if (msprev->flags & QUEUE_FLIPPED) { + __get_cpu_var(queueflip_time) = now; + msprev->flags &= ~QUEUE_FLIPPED; + } + + /* +* If the queues have been flipped, +* update the state as of the last flip time. 
+*/ + if (msnext->cur_state == ONEXPIREDQUEUE) { + clk_t qfp = per_cpu(queueflip_time, msnext->lastqueued); + msnext->cur_state = ONACTIVEQUEUE; + msnext->timers[ONEXPIREDQUEUE] += qfp - msnext->last_change; + msnext->last_change = qfp; + } + + msprev->timers[msprev->cur_state] += now - msprev->last_change; + msnext->timers[msnext->cur_state] += now - msnext->last_change; + + /* Update states */ + switch (msprev->next_state) { + case MSA_UNKNOWN: + /* +* Infer from actual state +*/ + switch (prev->state) { + case TASK_INTERRUPTIBLE: + next_state = INTERRUPTIBLE_SLEEP; + break; + + case TASK_UNINTERRUPTIBLE: + next_state = UNINTERRUPTIBLE_SLEEP; + break; + + case TASK_STOPPED: + next_state = STOPPED; + break; + + case EXIT_DEAD: + case EXIT_ZOMBIE: + next_state = ZOMBIE; + break; + + case TASK_RUNNING: + next_state = ONACTIVEQUEUE; + break; + + default: + next_state = MSA_UNKNOWN; + break; + + } + break; + + case PAGING_SLEEP: /* + * Sleep states
User mode drivers: part 2: PCI device handling (patch 2/2 for 2.6.11)
User-level drivers: Add system calls for I386 and IA64. Signed-Off-By: Peter Chubb <[EMAIL PROTECTED]> # # arch/i386/kernel/entry.S |4 # arch/ia64/kernel/entry.S |8 # include/asm-i386/unistd.h |6 +- # include/asm-ia64/unistd.h |4 # 4 files changed, 17 insertions(+), 5 deletions(-) # Index: linux-2.6.11-usrdrivers/arch/ia64/kernel/entry.S === --- linux-2.6.11-usrdrivers.orig/arch/ia64/kernel/entry.S 2005-03-11 13:59:28.940744950 +1100 +++ linux-2.6.11-usrdrivers/arch/ia64/kernel/entry.S2005-03-11 13:59:41.236542676 +1100 @@ -1577,10 +1577,10 @@ data8 sys_add_key data8 sys_request_key data8 sys_keyctl - data8 sys_ni_syscall - data8 sys_ni_syscall// 1275 - data8 sys_ni_syscall - data8 sys_ni_syscall + data8 sys_usr_pci_open + data8 sys_usr_pci_mmap // 1275 + data8 sys_usr_pci_munmap + data8 sys_usr_pci_get_consistent data8 sys_ni_syscall data8 sys_ni_syscall Index: linux-2.6.11-usrdrivers/include/asm-i386/unistd.h === --- linux-2.6.11-usrdrivers.orig/include/asm-i386/unistd.h 2005-03-11 13:59:28.942698059 +1100 +++ linux-2.6.11-usrdrivers/include/asm-i386/unistd.h 2005-03-11 13:59:41.245331667 +1100 @@ -294,8 +294,12 @@ #define __NR_add_key 286 #define __NR_request_key 287 #define __NR_keyctl288 +#define __NR_usr_pci_open 289 +#define __NR_usr_pci_mmap (__NR_usr_pci_open+1) +#define __NR_usr_pci_munmap(__NR_usr_pci_open+2) +#define __NR_usr_pci_get_consistent(__NR_usr_pci_open+3) -#define NR_syscalls 289 +#define NR_syscalls 293 /* * user-visible error numbers are in the range -1 - -128: see Index: linux-2.6.11-usrdrivers/include/asm-ia64/unistd.h === --- linux-2.6.11-usrdrivers.orig/include/asm-ia64/unistd.h 2005-03-11 13:59:28.942698059 +1100 +++ linux-2.6.11-usrdrivers/include/asm-ia64/unistd.h 2005-03-11 13:59:41.247284776 +1100 @@ -263,6 +263,10 @@ #define __NR_add_key 1271 #define __NR_request_key 1272 #define __NR_keyctl1273 +#define __NR_usr_pci_open 1274 +#define __NR_usr_pci_mmap 1275 +#define __NR_usr_pci_unmap 1276 +#define __NR_usr_pci_get_consistent 1277 
#ifdef __KERNEL__ Index: linux-2.6.11-usrdrivers/arch/i386/kernel/entry.S === --- linux-2.6.11-usrdrivers.orig/arch/i386/kernel/entry.S 2005-03-11 13:59:28.941721505 +1100 +++ linux-2.6.11-usrdrivers/arch/i386/kernel/entry.S2005-03-11 13:59:41.248261330 +1100 @@ -864,5 +864,9 @@ .long sys_add_key .long sys_request_key .long sys_keyctl + .long sys_usr_pci_open + .long sys_usr_pci_mmap /* 290 */ + .long sys_usr_pci_munmap + .long sys_usr_pci_get_consistent syscall_table_size=(.-sys_call_table) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)
USER LEVEL DRIVERS: enable PCI device drivers at user space.

This patch adds the capability for suitably privileged user-level processes to enable a PCI device, and set up DMA for it. A subsequent patch hooks up the actual system calls.

There are four new system calls:

long usr_pci_open(int bus, int slot, int function, __u64 dma_mask);
    Returns a file descriptor for the PCI device described by bus,slot,function. It also enables the device, and sets it up as a bus-mastering DMA device, with the specified dma mask.
    Error codes are:
      ENOMEM: insufficient kernel memory to fulfil your request
      ENOENT: the specified device doesn't exist, or is otherwise invisible to Linux.
      EBUSY:  another driver has claimed the device
      EIO:    the specified dma mask is invalid for this device.
      ENFILE: too many open files

long usr_pci_get_consistent(int fd, size_t size, void **vaddrp, unsigned long *dmaaddrp)
    Call pci_alloc_consistent() to get size worth of pci consistent memory (currently an error if size != PAGESIZE); map the allocated memory into the user's address space; return the virtual user address in *vaddrp, and the bus address in *dmaaddrp.
    Error codes are:
      EINVAL: the file descriptor was not one obtained from usr_pci_open(), or size != PAGESIZE
      ENOMEM: insufficient appropriate memory or insufficient free virtual address space in the user program.
      EFAULT: vaddrp or dmaaddrp didn't point to writeable memory.
    The mapping obtained can be cleaned up with munmap().

long usr_pci_mmap(int fd, struct mapping_info *mp)
    Map some memory for DMA to/from the device represented by fd, which was obtained from usr_pci_open().
    struct mapping_info contains:
      void *virtaddr -- the virtual address to dma to
      int size -- how many bytes to set up
      struct usr_pci_sglist *sglist -- a pointer to a scatterlist
      int nents -- how many entries in the scatterlist
      enum dma_data_direction direction -- which way the dma is going to happen.
    The scatterlist should be sized at least size/PAGESIZE + 2. usr_pci_mmap() will call pci_map_sg() on the virtual region, then copy the resulting scatterlist into *sglist. The nents field will be updated with the actual number of scatterlist entries filled in.
    Failure codes are:
      EINVAL: the fd wasn't obtained from usr_pci_open, or direction wasn't one of DMA_TO_DEVICE, DMA_FROM_DEVICE or DMA_BIDIRECTIONAL, or the size of the scatterlist is insufficient to map the region.
      EFAULT: mp was a bad pointer, or the region of memory spanned by (virtaddr, virtaddr + size) was not all mapped.
      ENOMEM: insufficient appropriate memory

long usr_pci_munmap(int fd, struct mapping_info *mp)
    Unmap a dma region mapped by usr_pci_mmap(). The struct mapping_info is the same one used in usr_pci_mmap().
    Error codes are:
      EINVAL: the fd wasn't obtained from usr_pci_open, or the struct mapping_info was never mapped for this device

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

# drivers/Makefile       |   3
# drivers/pci/Kconfig    |   6
# drivers/usr/Makefile   |   2
# drivers/usr/sys.c      | 952
# include/linux/usrdrv.h |  63
# 5 files changed, 1026 insertions(+)

Index: linux-2.6.11-usrdrivers/drivers/Makefile === --- linux-2.6.11-usrdrivers.orig/drivers/Makefile 2005-03-11 12:25:29.169139978 +1100 +++ linux-2.6.11-usrdrivers/drivers/Makefile2005-03-11 12:25:41.159270471 +1100 @@ -13,6 +13,9 @@ # was used and do nothing if so obj-$(CONFIG_PNP) += pnp/ +# User level device drivers +obj-$(CONFIG_USRDEV) += usr/ + # char/ comes before serial/ etc so that the VT console is the boot-time # default. 
obj-y += char/ Index: linux-2.6.11-usrdrivers/drivers/usr/Makefile === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.11-usrdrivers/drivers/usr/Makefile2005-03-11 12:25:41.160247026 +1100 @@ -0,0 +1,2 @@ +obj-y += sys.o +obj-$(CONFIG_USRBLKDEV) += blkdev.o Index: linux-2.6.11-usrdrivers/drivers/usr/sys.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.11-usrdrivers/drivers/usr/sys.c 2005-03-11 14:15:59.897394833 +1100 @@ -0,0 +1,952 @@ +/* + * Expose PCI-DMA interface to user mode. + * + * Copyrig
User mode drivers: part 1, interrupt handling (patch for 2.6.11)
As many of you will be aware, we've been working on infrastructure for user-mode PCI and other drivers. The first step is to be able to handle interrupts from user space. Subsequent patches add infrastructure for setting up DMA for PCI devices.

The user-level interrupt code doesn't depend on the other patches, and is probably the most mature of this patchset.

This patch adds a new file to /proc/irq// called irq. Suitably privileged processes can open this file. Reading the file returns the number of interrupts (if any) that have occurred since the last read. If the file is opened in blocking mode, reading it blocks until an interrupt occurs. poll(2) and select(2) work as one would expect, to allow interrupts to be one of many events to wait for. (If you don't like the file interface, a special system call could return the file descriptor instead.)

Interrupts are usually masked; while a thread is in poll(2) or read(2) on the file they are unmasked.

All architectures that use CONFIG_GENERIC_HARDIRQ are supported by this patch.

A low latency user level interrupt handler would do something like this, on a CONFIG_PREEMPT kernel:

    int irqfd;
    int n_ints;
    struct sched_param sched_param;

    irqfd = open("/proc/irq/513/irq", O_RDONLY);
    mlockall(MCL_CURRENT | MCL_FUTURE);
    sched_param.sched_priority = sched_get_priority_max(SCHED_FIFO) - 10;
    sched_setscheduler(0, SCHED_FIFO, &sched_param);

    while (read(irqfd, &n_ints, sizeof n_ints) == sizeof n_ints) {
        ... talk to device to handle interrupt
    }

If you don't care about latency, then forget about the mlockall() and setting the priority, and you don't need CONFIG_PREEMPT.
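The handler loop described above can be split into two small helpers. This is a minimal sketch, assuming the read semantics described in this mail (each successful read returns one int holding the count of interrupts since the last read); the helper names are made up for illustration, and the irq file itself only exists on a kernel carrying this patch.

```c
#include <errno.h>
#include <sched.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Lock memory and switch to SCHED_FIFO, 10 below the maximum priority,
 * as suggested in the mail.  Needs privilege; returns 0 or -1.
 * Hypothetical helper name.
 */
int make_low_latency(void)
{
    struct sched_param sp;

    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)
        return -1;
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO) - 10;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0)
        return -1;
    return 0;
}

/*
 * Block until the descriptor reports interrupts and return how many
 * were counted, or -1 on error, short read or end of file.
 * Retries when the read is interrupted by a signal.
 */
int wait_for_irqs(int irqfd)
{
    int n_ints;
    ssize_t got;

    do {
        got = read(irqfd, &n_ints, sizeof n_ints);
    } while (got < 0 && errno == EINTR);

    if (got != (ssize_t)sizeof n_ints)
        return -1;
    return n_ints;
}
```

A driver would open the irq file, optionally call make_low_latency(), and then loop on wait_for_irqs(), talking to the device each time it returns a positive count. Because wait_for_irqs() only depends on read(2), it can be exercised against a pipe on a kernel without the patch.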
Signed-off-by: Peter Chubb <[EMAIL PROTECTED]> kernel/irq/proc.c | 163 ++ 1 files changed, 153 insertions(+), 10 deletions(-) Index: linux-2.6.11-usrdrivers/kernel/irq/proc.c === --- linux-2.6.11-usrdrivers.orig/kernel/irq/proc.c 2005-03-11 10:30:57.875619102 +1100 +++ linux-2.6.11-usrdrivers/kernel/irq/proc.c 2005-03-11 10:45:07.146928168 +1100 @@ -9,6 +9,8 @@ #include #include #include +#include +#include "internals.h" static struct proc_dir_entry *root_irq_dir, *irq_dir[NR_IRQS]; @@ -90,27 +92,168 @@ action->dir = proc_mkdir(name, irq_dir[irq]); } +struct irq_proc { + unsigned long irq; + wait_queue_head_t q; + atomic_t count; + char devname[TASK_COMM_LEN]; +}; + +static irqreturn_t irq_proc_irq_handler(int irq, void *vidp, struct pt_regs *regs) +{ + struct irq_proc *idp = (struct irq_proc *)vidp; + + BUG_ON(idp->irq != irq); + disable_irq_nosync(irq); + atomic_inc(&idp->count); + wake_up(&idp->q); + return IRQ_HANDLED; +} + + +/* + * Signal to userspace an interrupt has occured. + */ +static ssize_t irq_proc_read(struct file *filp, char __user *bufp, size_t len, loff_t *ppos) +{ + struct irq_proc *ip = (struct irq_proc *)filp->private_data; + irq_desc_t *idp = irq_desc + ip->irq; + int pending; + + DEFINE_WAIT(wait); + + if (len < sizeof(int)) + return -EINVAL; + + pending = atomic_read(&ip->count); + if (pending == 0) { + if (idp->status & IRQ_DISABLED) + enable_irq(ip->irq); + if (filp->f_flags & O_NONBLOCK) + return -EWOULDBLOCK; + } + + while (pending == 0) { + prepare_to_wait(&ip->q, &wait, TASK_INTERRUPTIBLE); + pending = atomic_read(&ip->count); + if (pending == 0) + schedule(); + finish_wait(&ip->q, &wait); + if (signal_pending(current)) + return -ERESTARTSYS; + } + + if (copy_to_user(bufp, &pending, sizeof pending)) + return -EFAULT; + + *ppos += sizeof pending; + + atomic_sub(pending, &ip->count); + return sizeof pending; +} + + +static int irq_proc_open(struct inode *inop, struct file *filp) +{ + struct irq_proc *ip; + struct proc_dir_entry *ent = 
PDE(inop); + int error; + + ip = kmalloc(sizeof *ip, GFP_KERNEL); + if (ip == NULL) + return -ENOMEM; + + memset(ip, 0, sizeof(*ip)); + strcpy(ip->devname, current->comm); + init_waitqueue_head(&ip->q); + atomic_set(&ip->count, 0); + ip->irq = (unsigned long)ent->data; + + error = request_irq(ip->irq, + irq_proc_irq_handler, + SA_INTERRUPT, + ip->devname, + ip); + if (error < 0) { +
Re: binary drivers and development
> "John" == John Richard Moser <[EMAIL PROTECTED]> writes:

John> I've done more thought, here's a small list of advantages on
John> using binary drivers, specifically considering UDI. You can
John> consider a different implementation for binary drivers as well,
John> with most of the same advantages.

Almost all these advantages are also present for user-mode drivers... and getting drivers out of the kernel, where possible, is a much better approach IMHO than trying to maintain a leaky in-kernel interface.

The problem with in-kernel interfaces, even if set in concrete, is that any binary driver can go outside the interface --- there's no encapsulation --- and so break when the kernel changes.

Peter C
Re: Reading large /proc entry from kernel module
>>>>> "Kristian" == Kristian Sørensen <[EMAIL PROTECTED]> writes:

Kristian> Hi all! I have some trouble reading a 2346 byte /proc entry
Kristian> from our Umbrella kernel module.

Kristian> static int umb_proc_write(struct file *file, const char *buffer,
Kristian> unsigned long count, void *data) {
Kristian> char *policy;
Kristian> int *lbuf;
Kristian> int i;

Here's your problem: lbuf should be a char *, not an int *. When you look at lbuf[0] you'll get the first four characters packed into the int.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
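The difference between the two pointer types can be seen in a few lines of user-space C. This is only an illustration with made-up helper names, assuming a 4-byte int; memcpy() stands in for the int * dereference so the sketch avoids alignment and aliasing problems while reading exactly the same bytes.

```c
#include <stdint.h>
#include <string.h>

/* What lbuf[0] yields when lbuf is a char *: one character. */
char first_element_as_char(const char *buf)
{
    return buf[0];
}

/*
 * What lbuf[0] yields when lbuf is an int *: the first four bytes of
 * the buffer packed into one integer.  Which byte lands in the low
 * bits depends on the machine's endianness.
 */
uint32_t first_element_as_int(const char *buf)
{
    uint32_t word;

    memcpy(&word, buf, sizeof word);
    return word;
}
```

With a buffer that starts "allow", the char version returns 'a', while the int version returns a value built from 'a', 'l', 'l' and 'o' together, so comparing it against a single character never matches.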
Re: [PATCH] Fixing address space lock contention in 2.6.11
Sorry, forgot the `signed-off-by'...

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
[PATCH] Fixing address space lock contention in 2.6.11
Hi,

As part of the Gelato scalability focus group, we've been running OSDL's Re-AIM7 benchmark with an I/O intensive load with varying numbers of processors. The current kernel shows severe contention on the tree_lock in the address space structure when running on tmpfs or ext2 on a RAM disk.

Lockstat output for a 12-way:

SPINLOCKS        HOLD             WAIT
  UTIL    CON    MEAN(  MAX )     MEAN(  MAX )(% CPU)      TOTAL NOWAIT   SPIN RJECT  NAME
         5.5%   0.4us(3177us)     28us(  20ms)(44.2%)  131821954  94.5%   5.5% 0.00%  *TOTAL*
 72.3%  13.1%   0.5us( 9.5us)     29us(  20ms)(42.5%)   50542055  86.9%  13.1%    0%  find_lock_page+0x30
 23.8%     0%   385us(3177us)      0us                     23235   100%     0%    0%  exit_mmap+0x50
 11.5%  0.82%   0.1us( 101us)     17us(5670us)( 1.6%)   50665658  99.2%  0.82%    0%  dnotify_parent+0x70

Replacing the spinlock with a multi-reader lock fixes this problem, without unduly affecting anything else.

Here are the benchmark results (jobs per minute at a 50-client level, average of 5 runs, standard deviation in parens) on an HP Olympia with 3 cells, 12 processors, and dnotify turned off (after this spinlock, the spinlock in dnotify_parent is the worst contended for this workload).

                  tmpfs                                ext2
#CPUs   spinlock      rwlock                 spinlock     rwlock
   1     7556(15)     7588(17)    +0.42%     3744(20)     3791(16)    +1.25%
   2    13743(31)    13791(33)    +0.35%     6405(30)     6413(24)    +0.12%
   4    23334(111)   22881(154)   -2%        9648(51)     9595(50)    -0.55%
   8    33580(240)   36163(190)   +7.7%     13183(63)    13070(68)    -0.85%
  12    28748(170)   44064(238)   +53%      12681(49)    14504(105)   +14%

And on a pentium3 single processor:

   1     4177(4)      4169(2)     -0.2%      3811(4)      3820(3)     +0.23%

I'm not sure what's happening in the 4-processor case. The important thing to note is that with a spinlock, the benchmark shows worse performance for a 12-way than for an 8-way box; with the patch, the 12-way performs better, as expected. We've done some runs with 16-way as well; without the patch below, the 16-way performs worse than the 12-way.

Anyway, here's the patch to convert the address space lock to a rwlock, and allow multiple processes to scan an address-space's radix tree at once.
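The percentage columns in the benchmark table are the relative change in jobs per minute from the spinlock kernel to the rwlock kernel. A minimal sketch of that arithmetic, with a helper name chosen here for illustration:

```c
/*
 * Relative change, in percent, from the spinlock result to the rwlock
 * result.  Positive means the rwlock kernel completed more jobs per
 * minute.  Hypothetical helper, not part of the patch.
 */
double percent_change(double spinlock_jpm, double rwlock_jpm)
{
    return (rwlock_jpm - spinlock_jpm) / spinlock_jpm * 100.0;
}
```

For the 12-way tmpfs row, (44064 - 28748) / 28748 is about 0.533, which is the +53% shown; for the 4-way tmpfs row, (22881 - 23334) / 23334 is about -0.019, shown rounded as -2%.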
= drivers/mtd/devices/block2mtd.c 1.4 vs edited = --- 1.4/drivers/mtd/devices/block2mtd.c 2005-02-02 19:27:37 +11:00 +++ edited/drivers/mtd/devices/block2mtd.c 2005-02-22 14:28:23 +11:00 @@ -59,7 +59,7 @@ void cache_readahead(struct address_spac end_index = ((isize - 1) >> PAGE_CACHE_SHIFT); - spin_lock_irq(&mapping->tree_lock); + read_lock_irq(&mapping->tree_lock); for (i = 0; i < PAGE_READAHEAD; i++) { pagei = index + i; if (pagei > end_index) { @@ -71,16 +71,16 @@ void cache_readahead(struct address_spac break; if (page) continue; - spin_unlock_irq(&mapping->tree_lock); + read_unlock_irq(&mapping->tree_lock); page = page_cache_alloc_cold(mapping); - spin_lock_irq(&mapping->tree_lock); + read_lock_irq(&mapping->tree_lock); if (!page) break; page->index = pagei; list_add(&page->lru, &page_pool); ret++; } - spin_unlock_irq(&mapping->tree_lock); + read_unlock_irq(&mapping->tree_lock); if (ret) read_cache_pages(mapping, &page_pool, filler, NULL); } = fs/buffer.c 1.271 vs edited = --- 1.271/fs/buffer.c 2005-02-18 20:44:07 +11:00 +++ edited/fs/buffer.c 2005-02-22 14:31:41 +11:00 @@ -875,7 +875,7 @@ int __set_page_dirty_buffers(struct page spin_unlock(&mapping->private_lock); if (!TestSetPageDirty(page)) { - spin_lock_irq(&mapping->tree_lock); + read_lock_irq(&mapping->tree_lock); if (page->mapping) {/* Race with truncate? */ if (!mapping->backing_dev_info->memory_backed) inc_page_state(nr_dirty); @@ -883,7 +883,7 @@ int __set_page_dirty_buffers(struct page page_index(page), PAGECACHE_TAG_DIRTY); } - spin_unlock_irq(&mapping->tree_lock); + read_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); } = fs/inode.c 1.143 vs edited = --- 1.143/fs/inode.c2005-01-21 16:02:13 +11:00 +++ edited/fs/inode.c 2005-02-22 14:16:33 +11:00 @@ -196,7 +196,7 @@ void inode_init_once(struct inode *inode sema_init(&inode->i_sem, 1); init_rwsem(&inode->i_alloc_sem); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
Re: [PATCH] Linux-2.6.11-rc5: kernel/sys.c setrlimit() RLIMIT_RSS cleanup
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> <[EMAIL PROTECTED]> wrote:
>> $ ulimit -m 10 bash: ulimit: max memory size: cannot modify
>> limit: Function not implemented

Andrew> I don't know about this. The change could cause existing
Andrew> applications and scripts to fail. Sure, we'll do that
Andrew> sometimes but this doesn't seem important enough.

What's more, there have been (and still are) out-of-tree patches to enforce rlimit-RSS in various ways. There just hasn't been consensus yet on the best implementation.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: Xterm Hangs - Possible scheduler defect?
>>>>> "Chad" == Chad N Tindel <[EMAIL PROTECTED]> writes:

Chad> I would make the following assertion for any kernel:

Chad> No single userspace thread of execution running on an SMP system
Chad> should be able to hose a box by going CPU-bound, bug in the
Chad> software or no bug. Any kernel should be able to handle this
Chad> case and shift general work over to other processors.

In many Unices, crucial kernel threads run at realtime priority with a static priority higher than is accessible to user code.

That being said, however, you've got to be a privileged user to set a very high real-time priority on a thread, and if you do, you'd better know what you're doing. Any SCHED_FIFO thread should run for a time, then sleep for a time, or it *will* DOS everything else on the processor.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
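The "run for a time, then sleep for a time" pattern can be sketched as a loop that does one bounded slice of work and then blocks, giving lower-priority tasks on that processor a chance to run. This is a minimal illustration with hypothetical names; it leaves out the privileged sched_setscheduler(SCHED_FIFO) call and assumes each work slice is short and bounded.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/*
 * Run `slices` bounded chunks of work, sleeping rest_ns nanoseconds
 * (which must be below one second) after each chunk.  Returns the
 * number of chunks completed.
 */
int run_in_slices(void (*work)(void *), void *arg, int slices, long rest_ns)
{
    struct timespec rest = { .tv_sec = 0, .tv_nsec = rest_ns };
    int done;

    for (done = 0; done < slices; done++) {
        work(arg);               /* must return promptly */
        nanosleep(&rest, NULL);  /* block, so other tasks get the CPU */
    }
    return done;
}

/* Trivial example slice: count how often it was called. */
void count_slice(void *arg)
{
    ++*(int *)arg;
}
```

A real-time thread that never reaches a blocking call like the nanosleep() here is the case the reply warns about.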
Re: Help enabling PCI interrupts on Dell/SMP and Sun/SMP systems.
>>>>> "Alan" == Alan Kilian <[EMAIL PROTECTED]> writes:

Alan> kernel: SSE: Found a DeCypher card. kernel: ACPI: PCI
Alan> interrupt :13:03.0[A] -> GSI 36 (level, low) -> IRQ 217

If ACPI has set this device up to use interrupt 217, why are you registering it on IRQ 5?

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Repeatable hang with XFS under 2.6.11-rc4
Running Reaim-7 on a 4G ram disk with 4 processors on Itanium...

Every few runs, as the multiprocessing level increases, we see 22
processes hung in sync(), all except one waiting in sync_filesystems()
and that one waiting in pagebuf_iowait().

There's lots of free memory, the ram-disk is not full, ...
Load average is low; nothing in the logs or on the console.

[EMAIL PROTECTED]:/proc# vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd     free    buff  cache   si   so    bi    bo   in    cs us sy  id wa
 0  0      0 23027552 1091472 218496    0    0     1 42107   12     6  1 21  78  0
 0  0      0 23027552 1091472 218496    0    0     0     0 4110    10  0  0 100  0
 0  0      0 23027552 1091472 218496    0    0     0     0 4109     8  0  0 100  0
 0  0      0 23027488 1091472 218496    0    0     0    32 4114    15  0  0 100  0
 0  0      0 23027488 1091472 218496    0    0     0     0 4110     9  0  0 100  0
 0  0      0 23027488 1091472 218496    0    0     0     0 4109     9  0  0 100  0

[EMAIL PROTECTED]:/proc/fs/xfs# df /mnt/ram-disk
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/ram1              1038336    127800    910536  13% /mnt/ram-disk

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
JBD problems in linux 2.6.11 rc3
sp=e000165af810 bsp=e000165a9520
[] die_if_kernel+0x40/0x60 sp=e000165af810 bsp=e000165a94f0
[] ia64_bad_break+0x220/0x340 sp=e000165af810 bsp=e000165a94c8
[] ia64_leave_kernel+0x0/0x260 sp=e000165af8a0 bsp=e000165a94c8
[] cascade+0xf0/0x100 sp=e000165afa70 bsp=e000165a9468
[] run_timer_softirq+0x370/0x460 sp=e000165afa70 bsp=e000165a93d8
[] __do_softirq+0x200/0x240 sp=e000165afa90 bsp=e000165a9338
[] do_softirq+0x80/0xe0 sp=e000165afa90 bsp=e000165a92d8
[] irq_exit+0x80/0xa0 sp=e000165afa90 bsp=e000165a92c0
[] ia64_handle_irq+0x110/0x140 sp=e000165afa90 bsp=e000165a9288
[] ia64_leave_kernel+0x0/0x260 sp=e000165afa90 bsp=e000165a9288
[] ia64_spinlock_contention+0x20/0x60 sp=e000165afc60 bsp=e000165a9288
[] _spin_lock+0x40/0x60 sp=e000165afc60 bsp=e000165a9280
[] journal_dirty_data+0x1b0/0x760 sp=e000165afc60 bsp=e000165a9230
[] ext3_journal_dirty_data+0x30/0xa0 sp=e000165afc60 bsp=e000165a9200
[] walk_page_buffers+0x160/0x180 sp=e000165afc60 bsp=e000165a9180
[] ext3_ordered_commit_write+0x70/0x180 sp=e000165afc60 bsp=e000165a9128
[] generic_file_buffered_write+0x520/0xca0 sp=e000165afc60 bsp=e000165a9030
[] __generic_file_aio_write_nolock+0x420/0x6e0 sp=e000165afd10 bsp=e000165a8fb8
[] generic_file_aio_write+0xd0/0x240 sp=e000165afd30 bsp=e000165a8f60
[] ext3_file_write+0x60/0x220 sp=e000165afd40 bsp=e000165a8f28
[] do_sync_write+0x130/0x180 sp=e000165afd40 bsp=e000165a8ee8
[] vfs_write+0x1d0/0x2a0 sp=e000165afe20 bsp=e000165a8ea0
[] sys_write+0x80/0xe0 sp=e000165afe20 bsp=e000165a8e20
[] ia64_ret_from_syscall+0x0/0x20 sp=e000165afe30 bsp=e000165a8e20
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
OOPS when using UDF fs
When I try to write to a UDF fs on a USB-connected Ricoh dvd-burner
(specifically, create a directory), I get:

usb-storage: Attempting to get CSW...
usb-storage: usb_stor_bulk_transfer_buf: xfer 13 bytes
usb-storage: Status code 0; transferred 13/13
usb-storage: -- transfer complete
usb-storage: Bulk status result = 0
usb-storage: Bulk Status S 0x53425355 T 0x80b f R 0 Stat 0x0
usb-storage: -- Result from auto-sense is 0
usb-storage: -- code: 0x70, key: 0x5, ASC: 0x2c, ASCQ: 0x0
usb-storage: (Unknown Key): (unknown ASC/ASCQ)
usb-storage: scsi cmd done, result=0x2
usb-storage: *** thread sleeping.
end_request: I/O error, dev sr0, sector 1096
Unable to handle kernel paging request at virtual address 2101
 printing eip:
e1a3562e
*pde =
Oops: [#1]
PREEMPT
Modules linked in: loop udf usb_storage sr_mod orinoco_cs orinoco hermes pcmcia ehci_hcd uhci_hcd yenta_socket rsrc_nonstatic pcmcia_core snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd snd_page_alloc i2c_i801
CPU:0
EIP:0060:[pg0+558581294/1067963392] Not tainted VLI
EFLAGS: 00010293 (2.6.11-rc3)
EIP is at udf_get_filelongad+0x1e/0x50 [udf]
eax: 21c1 ebx: 2101 ecx: ce301e30 edx: 000d2d2a
esi: 2101 edi: 2101 ebp: ce301e30 esp: ce301d84
ds: 007b es: 007b ss: 0068
Process cp (pid: 4869, threadinfo=ce30 task=cdefda00)
Stack: 0112 d03e5c6c e1a2dadc 0001 01301d9c ce301d9c c81b4740 ca82714c db5f8400 c0155f3f 0002 ce301e28 d03e5ca4 ce301e40 ce301e30 e1a2d9a6 ce301e34 ce301e3c ce301e40 0001
Call Trace:
 [pg0+558549724/1067963392] udf_current_aext+0xcc/0x1b0 [udf]
 [__wait_on_buffer+47/64] __wait_on_buffer+0x2f/0x40
 [pg0+558549414/1067963392] udf_next_aext+0x46/0xb0 [udf]
 [pg0+558577131/1067963392] udf_discard_prealloc+0xcb/0x2b0 [udf]
 [d_rehash+116/144] d_rehash+0x74/0x90
 [pg0+558533743/1067963392] udf_clear_inode+0x2f/0x40 [udf]
 [clear_inode+180/208] clear_inode+0xb4/0xd0
 [pg0+558529176/1067963392] udf_new_block+0xc8/0xda [udf]
 [generic_forget_inode+270/320] generic_forget_inode+0x10e/0x140
 [iput+83/112] iput+0x53/0x70
 [pg0+558533527/1067963392] udf_new_inode+0x337/0x34a [udf]
 [do_no_page+413/832] do_no_page+0x19d/0x340
 [pg0+558557856/1067963392] udf_mkdir+0x0/0x220 [udf]
 [generic_forget_inode+270/320] generic_forget_inode+0x10e/0x140
 [iput+83/112] iput+0x53/0x70
 [pg0+558533527/1067963392] udf_new_inode+0x337/0x34a [udf]
 [do_no_page+413/832] do_no_page+0x19d/0x340
 [pg0+558557856/1067963392] udf_mkdir+0x0/0x220 [udf]
 [pg0+558557925/1067963392] udf_mkdir+0x45/0x220 [udf]
 [__d_lookup+161/384] __d_lookup+0xa1/0x180
 [dput+30/576] dput+0x1e/0x240
 [cached_lookup+29/128] cached_lookup+0x1d/0x80
 [pg0+558557856/1067963392] udf_mkdir+0x0/0x220 [udf]
 [vfs_mkdir+95/160] vfs_mkdir+0x5f/0xa0
 [sys_mkdir+145/224] sys_mkdir+0x91/0xe0
 [syscall_call+7/11] syscall_call+0x7/0xb
Code: e0 74 a3 e1 e8 f4 63 6e de eb ea 89 f6 83 ec 08 85 c0 89 5c 24 04 89 c3 74 33 85 c9 74 2f 8b 01 85 c0 78 21 83 c0 10 39 d0 77 1a <8b> 13 85 d2 74 14 8b 54 24 0c 85 d2 74 02 89 01 89 d8 8b 5c 24
Re: Support for Large Block Devices
>>>>> "Maciej" == Maciej Soltysiak <[EMAIL PROTECTED]> writes:

Maciej> Hi, I was wondering... Why is "Support for Large Block
Maciej> Devices" still an option?

Maciej> Shouldn't it be compiled in always? Or maybe there are some
Maciej> cons like incompatibility or something?

It's not compiled in always on 32-bit platforms, because
  1. Most people don't have more than 2TB in a single block device
  2. 64-bit sizes mean increased size of various structures (i.e.,
     less cache-friendly), and slightly slower operations.

On 64-bit platforms it *is* always enabled.

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
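The 2TB figure comes straight from a 32-bit sector number combined with 512-byte sectors: 2^32 * 512 bytes = 2^41 bytes. A small sketch of that arithmetic in plain C follows; max_device_bytes() and SECTOR_SIZE are illustrative names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL  /* bytes per sector, the usual block-layer unit */

/* Largest device size, in bytes, addressable with a sector index that is
 * `bits` bits wide. */
uint64_t max_device_bytes(unsigned bits)
{
    assert(bits > 0 && bits <= 54);  /* keep the product within 64 bits */
    return (1ULL << bits) * SECTOR_SIZE;
}
```

With bits = 32 this gives 2199023255552 bytes, i.e. 2 TiB. Widening the index to 64 bits lifts that limit, at the price of every structure that stores a sector number growing by four bytes.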
Re: [PATCH]sched: Isochronous class v2 for unprivileged soft rt scheduling
>>>>> "Jack" == Jack O'Quin <[EMAIL PROTECTED]> writes:

Jack> Looks like we need to do another study to determine which
Jack> filesystem works best for multi-track audio recording and
Jack> playback. XFS looks promising, but only if they get the latency
Jack> right. Any experience with that?

The nice thing about audio/video and XFS is that if you know ahead of
time the max size of a file (and you usually do -- because you know
ahead of time how long a take is going to be) you can precreate the
file as a contiguous chunk, then just fill it in, for minimum disc
latency.

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
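As a rough illustration of precreating a file before recording into it, here is a sketch that uses the portable posix_fallocate() call. This is a stand-in chosen for the example, not the XFS-specific reservation interface, and precreate_file() is an invented name. It reserves the space up front; how contiguous the result ends up depends on the filesystem and on how fragmented its free space is.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or open) `path` and reserve `bytes` of disk space for it.
 * Returns an open descriptor, or -1 if creation or reservation failed. */
int precreate_file(const char *path, off_t bytes)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;
    /* posix_fallocate() returns an error number directly, not via errno. */
    if (posix_fallocate(fd, 0, bytes) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The recorder then writes into the already-allocated range, so no block allocation has to happen on the time-critical path.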
Re: [PATCH RFC] 'spinlock/rwlock fixes' V3 [1/1]
>>>>> "Chris" == Chris Wedgwood <[EMAIL PROTECTED]> writes:

Chris> On Wed, Jan 19, 2005 at 07:01:04PM -0800, Andrew Morton wrote:

Chris> It still isn't enough to rid of the rwlock_read_locked and
Chris> rwlock_write_locked usage in kernel/spinlock.c as those are
Chris> needed for the cpu_relax() calls so we have to decide on
Chris> suitable names still...

I suggest reversing the sense of the macros, and having read_can_lock()
and write_can_lock().

Meaning:
	read_can_lock() --- a read_lock() would have succeeded
	write_can_lock() --- a write_lock() would have succeeded.

IA64 implementation:

#define read_can_lock(x)  (*(volatile int *)(x) >= 0)
#define write_can_lock(x) (*(volatile int *)(x) == 0)

Then use them as
	!read_can_lock(x)
where you want the old semantics. The compiler ought to be smart
enough to optimise the boolean ops.

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
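To see how the suggested macros behave, here is a userspace toy, not kernel code. It assumes the lock word reads as negative while a writer holds it and otherwise counts readers; toy_rwlock_t is invented for the demonstration, and the two is_locked macros are simply defined by negation to show the "old semantics".

```c
#include <assert.h>

/* Toy stand-in for the lock word (assumed layout): negative while a writer
 * holds the lock, otherwise the number of readers currently inside. */
typedef struct { volatile int word; } toy_rwlock_t;

#define read_can_lock(x)   (*(volatile int *)(x) >= 0)
#define write_can_lock(x)  (*(volatile int *)(x) == 0)

/* The old "is locked" sense, recovered by negation. */
#define read_is_locked(x)  (!read_can_lock(x))
#define write_is_locked(x) (!write_can_lock(x))
```

With the word at 0 both macros are true; with two readers inside only read_can_lock() is true; with a writer inside both are false.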
Re: Horrible regression with -CURRENT from "Don't busy-lock-loop in preemptable spinlocks" patch
>>>>> "Ingo" == Ingo Molnar <[EMAIL PROTECTED]> writes:

Ingo> * Peter Chubb <[EMAIL PROTECTED]> wrote:
>> Here's a patch that adds the missing read_is_locked() and
>> write_is_locked() macros for IA64. When combined with Ingo's
>> patch, I can boot an SMP kernel with CONFIG_PREEMPT on.
>>
>> However, I feel these macros are misnamed: read_is_locked() returns
>> true if the lock is held for writing; write_is_locked() returns
>> true if the lock is held for reading or writing.

Ingo> well, 'read_is_locked()' means: "will a read_lock() succeed"

Fail, surely?

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: Horrible regression with -CURRENT from "Don't busy-lock-loop in preemptable spinlocks" patch
Here's a patch that adds the missing read_is_locked() and
write_is_locked() macros for IA64. When combined with Ingo's patch, I
can boot an SMP kernel with CONFIG_PREEMPT on.

However, I feel these macros are misnamed: read_is_locked() returns true
if the lock is held for writing; write_is_locked() returns true if the
lock is held for reading or writing.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.6-bklock/include/asm-ia64/spinlock.h
===
--- linux-2.6-bklock.orig/include/asm-ia64/spinlock.h 2005-01-18 13:46:08.138077857 +1100
+++ linux-2.6-bklock/include/asm-ia64/spinlock.h 2005-01-19 08:58:59.303821753 +1100
@@ -126,8 +126,20 @@
 #define RW_LOCK_UNLOCKED (rwlock_t) { 0, 0 }
 #define rwlock_init(x) do { *(x) = RW_LOCK_UNLOCKED; } while(0)
+
 #define rwlock_is_locked(x) (*(volatile int *) (x) != 0)
+/* read_is_locked -- - would read_trylock() fail?
+ * @lock: the rwlock in question.
+ */
+#define read_is_locked(x) (*(volatile int *) (x) < 0)
+
+/**
+ * write_is_locked - would write_trylock() fail?
+ * @lock: the rwlock in question.
+ */
+#define write_is_locked(x) (*(volatile int *) (x) != 0)
+
 #define _raw_read_lock(rw) \
 do { \
 rwlock_t *__read_lock_ptr = (rw); \

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
Re: [PATCH] VM fixes + RSS limits 2.4.0-test13-pre5
Ingo wrote:
> On Wed, Jan 03, 2001 at 09:43:54AM -0200, Rik van Riel wrote:
> > On Fri, 28 Dec 2000, Mike Sklar wrote:
> > > If I wanted to adjust the rlim_cur value of a running
> > > processes, is there any sort of interface for that?
> >
> > Hmmm, I don't think there is an interface to adjust the
> > per-process ulimit settings on-the-fly ...
> >
> > Does anybody know if there's an interface for this ?

> If you don't mean "kill -TERM", no there isn't. It would be evil
> to the process anyway.

The RSS limits patch I sent to linux-kernel some time ago provided an
experimental /proc interface to allow exactly this. The patch against
2.2.16 is still on our FTP server at
ftp://ftp-au.aurema.com/private/aurpjc31/linux-2216-rsslimit.diff.bz2

Here's the patch against 2.4.0. The main differences between this and
Rik's patch are:
  -- you choose soft or hard limits at kernel config time with my
     patch; with Rik's you get both (rlim_cur is `soft' rlim_max is
     `hard')
  -- Rik's patch does some extra stuff to the VM code as well as the
     RSS limits
  -- Rik's patch doesn't affect swap behaviour (except in so far as
     processes over their RSS limit will tend to swap, which reduces
     memory pressure on all other processes); my patch means that
     processes over RSS limit suffer somewhat
  -- My patch puts the limit into the struct mm for slightly more
     cache-friendly behaviour, and to allow later interfacing with
     per-user resource-management software (it should be possible to
     write a kernel module to adjust RSS limits to implement per-user
     limits without affecting per-process RLIMIT values)
  -- My patch has a /proc interface to allow setting rlimit[RLIMIT_RSS]
  -- my patch implements the rss accounting fields so that time -v
     gives reasonable output

Index: linux-2.4.0/CREDITS
===
RCS file: /wrk/CVSROOT/linux-2.4/CREDITS,v
retrieving revision 1.1.1.5
diff -u -b -u -r1.1.1.5 CREDITS
--- linux-2.4.0/CREDITS 2001/01/04 23:02:54 1.1.1.5
+++ linux-2.4.0/CREDITS 2001/01/08 04:41:41
@@ -491,6 +491,24 @@
 S: Stanford, California 94305
 S: USA
+N: Kingsley Cheung
+E: [EMAIL PROTECTED]
+D: Page fault calculation
+D: /proc//rss support
+D: kswapd improvements regarding process RSS limits
+S: Aurema Pty Limited
+S: PO Box 305, Strawberry Hills NSW 2012,
+S: Australia
+
+N: Peter Chubb
+E: [EMAIL PROTECTED]
+D: Page fault calculation
+D: /proc//rss support
+D: kswapd improvements regarding process RSS limits
+S: Aurema Pty Limited
+S: PO Box 305, Strawberry Hills NSW 2012,
+S: Australia
+
 N: Juan Jose Ciarlante
 W: http://juanjox.kernelnotes.org/
 E: [EMAIL PROTECTED]
Index: linux-2.4.0/Documentation/Configure.help
===
RCS file: /wrk/CVSROOT/linux-2.4/Documentation/Configure.help,v
retrieving revision 1.1.1.6
diff -u -b -u -r1.1.1.6 Configure.help
--- linux-2.4.0/Documentation/Configure.help 2001/01/07 21:44:33 1.1.1.6
+++ linux-2.4.0/Documentation/Configure.help 2001/01/08 04:41:41
@@ -16955,6 +16955,50 @@
 another UltraSPARC-IIi-cEngine boardset with a 7-segment display, you should say N to this option.
+RSS Softlimits (EXPERIMENTAL)
+CONFIG_RSS_SOFTLIMIT
+  If you want the setrlimit(RLIMIT_RSS, ...) system call to work, say
+  Y either here or for RSS Hardlimits. If you don't understand this
+  you don't need it, so say N.
+
+  RSS Softlimits will make it more likely that pages will be stolen
+  from processes that have a resident set size (i.e., real memory
+  footprint) greater than their limit. Processes with a limit set
+  that is below their actual need may still exceed their limits, and
+  in this instance kswapd may work excessively hard.
+
+  Because of the way that RSS is measured and controlled, the limit is
+  approximate only.
+
+  It is harmless to have RSS Softlimits and RSS Hardlimits both set.
+
+RSS Hardlimits (EXPERIMENTAL)
+CONFIG_RSS_HARDLIMIT
+  If you want the setrlimit(RLIMIT_RSS, ...) system call to work, say
+  Y either here or for RSS Softlimits. If you don't understand this
+  you don't need it, so say N.
+
+  RSS Hardlimits changes the behaviour of the kernel at page-fault
+  time. If a process is over its RSS limit when it wants to get a new
+  page, then with this configuration option enabled the process's
+  memory space will be reduced before the page-fault continues.
+
+  Because of the way that RSS is measured and controlled, the actual
+  memory footprint of a process may exceed the set limit for a short
+  time.
+
+  It is harmless to have RSS Softlimits and RSS Hardlimits both set.
+
+Support for /proc/pid/rss (EXPERIMENTAL)
+CONFIG_PROC_RSS
+