Re: Subtle MM bug
Andrea Arcangeli [EMAIL PROTECTED] writes:

On Wed, Jan 10, 2001 at 11:46:03AM +, David Woodhouse wrote: So the VM code spends a fair amount of time scanning lists of pages which it really can't do anything about?

Yes.

Would it be possible to put such pages on different list, so that the VM

Currently to unmap the other pages we have to waste time on those unfreeable pages as well. Once I or other developer finishes with the reverse lookup from page to pte-chain (an implementation from DaveM just exists) we'll be able to put them in a separate lru, but it's certainly not a 2.4.1-pre2 thing.

Why do we even want to do reverse page tables? It seems everyone is assuming this is a good thing and except for being a touch more flexible I don't see what this buys us (besides more locked memory). My impression with the MM stuff is that everyone except linux is trying hard to clone BSD instead of thinking through the issues ourselves. And because of the extra overhead this doesn't look to be a win on a heavily loaded box with no swap. And probably only glibc mmaped.

Eric

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Subtle MM bug
Ralf Baechle [EMAIL PROTECTED] writes:

On Thu, Jan 11, 2001 at 12:56:57AM +0100, David Weinehall wrote: The MMU on these systems is a CAM, and the mmu table is thus backwards to convention. (It also means you can notionally map two physical addresses to one virtual but that's undefined in the implementation ;))

Are there any other (not yet supported) platforms with similar (or other unrelated, but hard to support because of the current architecture of the kernel) problems? (No, I have no secret trumps up my sleeve, I'm just curious.)

Having reverse mappings is the least sucky way to handle virtual aliases of certain types of MIPS caches.

Hmm. I would think that increasing the logical page size in the kernel would be the trivial way to handle virtual aliases. (i.e.) with a large enough page size you can't actually have a virtual alias.

You could also play some games with simply allocating pages only with the proper high bits. These games might also be useful on architectures for L2 caches which have more significant physical bits than PAGE_SHIFT bits.

But how does a reverse mapping help to handle virtual aliases? What are those caches doing?

The only model in my head is having a virtually indexed cache where you have more index bits than PAGE_SHIFT bits.

Eric
Re: Subtle MM bug
Ralf Baechle [EMAIL PROTECTED] writes:

On Fri, Jan 12, 2001 at 09:11:43PM +, Russell King wrote: Eric W. Biederman writes: Hmm. I would think that increasing the logical page size in the kernel would be the trivial way to handle virtual aliases. (i.e.) with a large enough page size you can't actually have a virtual alias.

There are types of caches out there that no matter how large the page size, you will always have alias issues. These are ones where the cache lines are indexed independent of virtual address (and therefore can have funny cache line replacement algorithms). And yes, you guessed which processor has it. ;)

Odd. Does this affect correctness?

I recently spoke with some CPU architecture researcher at some university about cache architectures; I suspect in the near future we'll see more funny cache indexing and replacement algorithms ...

But I doubt many of those will run incorrectly, as opposed to just less efficiently, if the OS doesn't help you avoid aliases.

Eric
Caches, page coloring, virtual indexed caches, and more
Ralf Baechle [EMAIL PROTECTED] writes:

On Fri, Jan 12, 2001 at 09:10:54AM -0700, Eric W. Biederman wrote: Having reverse mappings is the least sucky way to handle virtual aliases of certain types of MIPS caches.

Hmm. I would think that increasing the logical page size in the kernel would be the trivial way to handle virtual aliases. (i.e.) with a large enough page

O.k. I stepped back and took a little refresher to make certain I know what is going on. The only problem besides context switches with a virtually mapped cache is that without some care you can have multiple cache blocks for the same data. This is what we must avoid to be correct. I admit that using a reverse mapping is one way we could prevent these duplicate blocks.

#define VIRT_INDEX_BITS 18 /* number of bits in the L1 virtually indexed cache */

These are the places I know of in the kernel that create page mappings: fork, anonymous pages, mmap, sysv shared memory, mremap, kmap.

fork just duplicates something that is already there but in a different mm, so no bad virtual aliases are created.

anonymous pages only belong to one process, and have effectively only one mapping so again not a problem. Unless you need kmap. To make that work well we'd have to make the restriction that the swap cache index and the virtual address are identical in their VIRT_INDEX_BITS. That's better than doing it in alloc_pages especially as you never alloc high order swap pages but it worries me a little. This is fairly close to what we do with swap clustering but it's still a pain.

shared mmap. This is the important one. Since we have a logical backing store this is easy to handle. We just enforce that the virtual address in a process that we mmap something to must match the logical address to VIRT_INDEX_BITS. The effect is the same as a larger page size with virtually no overhead.

sysv shared memory is exactly the same as shared mmap. Except instead of a file offset you have an offset into the sysv segment.

mremap. Linux specific but pretty much the same as mmap, but easier. We just enforce that the virtual address of the source of mremap, and the destination of mremap match on VIRT_INDEX_BITS.

kmap is a little different. Using VIRT_INDEX_BITS is a little subtle but should work. Currently kmap is used only with the page cache so we can take advantage of the page->index field. From page->index we can compute the logical offset of the page and make certain the page is mapped with all VIRT_INDEX_BITS the same as a mmap alias.

kmap and the swap cache are a little different. Since index holds the location of a page on the swap file we'd have to make that index be the same for VIRT_INDEX_BITS as well.

size you can't actually have a virtual alias.

That's a possible solution; I'm not clear how bad the overhead would be. Right now a virtual alias is a relatively rare event and we don't want the common case of no virtual alias to pay a high price. Or?

I guess the question is how big would these logical pages need to be? Answer big enough to turn your virtually indexed cache into a physically indexed cache. Which means they would have to be cache size. Increasing PAGE_SIZE a few bits shouldn't be bad but going up two orders of magnitude would likely skewer your swapping, and memory management performance. You'd just have way too few pages. But I have a better suggestion so see above.

You could also play some games with simply allocating pages only with the proper high bits. These games might also be useful on architectures for L2 caches which have more significant physical bits than PAGE_SHIFT bits.

An alternative but less efficient solution. I tried to implement it; I ran into problems with running out of larger pages as soon as I had to split order 2 pages into 4 order 0 pages to implement this; the fragmentation was _really_ bad.

O.k. this is scratched off my list of possible good ideas. Duh. This fails for exactly the same reason as increasing page size. At 256K cache and 4K PAGE_SIZE you'd need 256/4 = 64 different types of pages, fairly nasty.

But how does a reverse mapping help to handle virtual aliases? What are those caches doing?

You leave only mappings of one color accessible. All other mappings are made inaccessible in the page table, so accessing will result in a TLB fault. The TLB fault handler then flushes the active mappings, makes them inaccessible by clearing the MIPS hw dirty / accessible bits, then makes the mapping of the new color accessible in the page table. This is already possible right now but doing the necessary reverse mappings can be rather inefficient as is.

Hmm. This doesn't sound right. And this sounds like a silly way to use reverse mappings anyway, since you can do it up front in mmap and their kin. Which means you don't have to slow any of the page fault logic up.

The only model in my head is having a virtually
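The colour arithmetic in this message can be sketched in a few lines of C. This is only an illustration, assuming the VIRT_INDEX_BITS of 18 and the 4K page size used in the message; page_color() and mapping_is_alias_free() are names made up for this sketch and are not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT      12   /* 4K pages, as in the message */
#define VIRT_INDEX_BITS 18   /* 256K virtually indexed cache, as in the message */

/* Number of distinct page colours: cache size divided by page size. */
#define NR_COLORS (1u << (VIRT_INDEX_BITS - PAGE_SHIFT))

/* Hypothetical helper: the colour is the set of cache index bits above the page offset. */
static unsigned page_color(uint64_t addr)
{
    return (unsigned)((addr >> PAGE_SHIFT) & (NR_COLORS - 1));
}

/*
 * Hypothetical helper: mapping logical offset `off` at virtual address
 * `vaddr` cannot create a bad alias if both agree in VIRT_INDEX_BITS,
 * i.e. if they have the same colour.
 */
static bool mapping_is_alias_free(uint64_t vaddr, uint64_t off)
{
    assert(VIRT_INDEX_BITS >= PAGE_SHIFT);
    return page_color(vaddr) == page_color(off);
}
```

With these values NR_COLORS works out to the 256/4 = 64 page types mentioned above, and a mapping is acceptable exactly when its virtual address and its logical offset land in the same one.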
Re: Caches, page coloring, virtual indexed caches, and more
Ralf Baechle [EMAIL PROTECTED] writes:

On Mon, Jan 15, 2001 at 01:41:06AM -0700, Eric W. Biederman wrote: (Cc list truncated since probably not so many people do care ...)

shared mmap. This is the important one. Since we have a logical backing store this is easy to handle. We just enforce that the virtual address in a process that we mmap something to must match the logical address to VIRT_INDEX_BITS. The effect is the same as a larger page size with virtually no overhead.

I'm told this is going to break software. Bad, since otherwise it'd be such a nice silver bullet solution.

Heck if we wanted to we could even lie about PAGE_SIZE, and say it was huge. I'd have to have a clear example before I give it up that easily. mmap has never allowed totally arbitrary offsets, and mmap(MAP_FIXED) is highly discouraged so I'd like to see it. And on architectures that don't need this it should compile out with no overhead.

sysv shared memory is exactly the same as shared mmap. Except instead of a file offset you have an offset into the sysv segment.

No, it's simpler in the MIPS case. The ABI guys were nice and did define that the virtual addresses have to be a multiple of 256kbyte which is more than sufficient to kill the problem.

If VIRT_INDEX_BITS == 18 and because you can only map starting at the beginning of a sysv shared memory segment this is exactly what my code boils down to.

mremap. Linux specific but pretty much the same as mmap, but easier. We just enforce that the virtual address of the source of mremap, and the destination of mremap match on VIRT_INDEX_BITS.

Correct and as mremap doesn't take any address argument we won't break any expectations on the properties of the returned address in mmap.

kmap is a little different. Using VIRT_INDEX_BITS is a little subtle but should work. Currently kmap is used only with the page cache so we can take advantage of the page->index field. From page->index we can compute the logical offset of the page and make certain the page is mapped with all VIRT_INDEX_BITS the same as a mmap alias.

Yup. It gets somewhat trickier due to the page cache being in KSEG0, a memory area which is essentially like a 512mb page that is hardwired in the CPU. It's preferable to stick with since it means we never take any TLB faults for pages in the page cache on MIPS.

Good. Then we don't need (at least for mips) to worry about this case. I was just thinking through the general case.

kmap and the swap cache are a little different. Since index holds the location of a page on the swap file we'd have to make that index be the same for VIRT_INDEX_BITS as well.

That's a possible solution; I'm not clear how bad the overhead would be. Right now a virtual alias is a relatively rare event and we don't want the common case of no virtual alias to pay a high price. Or?

I guess the question is how big would these logical pages need to be?

Depending on the CPU 8kb to 32kb; the hardware supports page sizes 4kb, 16kb, 64kb ... 16mb.

If all you need is 32kb that is better than the 256K number I had in my head. Still as far as an application is concerned the results are the same as my silver bullet above.

Answer big enough to turn your virtually indexed cache into a physically indexed cache. Which means they would have to be cache size.

For above mentioned CPU versions which have 8kb resp. 16kb per primary cache we want 32kb as mentioned.

Increasing PAGE_SIZE a few bits shouldn't be bad but going up two orders of magnitude would likely skewer your swapping, and memory management performance. You'd just have way too few pages. But I have a better suggestion so see above.

O.k. this is scratched off my list of possible good ideas. Duh. This fails for exactly the same reason as increasing page size. At 256K cache and 4K PAGE_SIZE you'd need 256/4 = 64 different types of pages, fairly nasty.

You say it; yet it seems like it could be part of a good solution. Just forcefully allocating a single page by splitting a large page and before that even swapping until we can actually allocate a higher order page is bad.

I totally agree. Larger pages don't suck but are unnecessary. At least I haven't been convinced otherwise yet.

Hmm. This doesn't sound right. And this sounds like a silly way to use reverse mappings anyway, since you can do it up front in mmap and their kin. Which means you don't have to slow any of the page fault logic up.

Then how do you handle something like:

fd = open(TESTFILE, O_RDWR | O_CREAT, 0664);
res = write(fd, one, 4096);
mmap(addr, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
mmap(addr + PAGE_SIZE, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If both mappings are immediately created accessible you'll directly end up with aliases. There is no choice
Re: Caches, page coloring, virtual indexed caches, and more
Anton Blanchard [EMAIL PROTECTED] writes:

At least for sparc it's already supported. Right now I don't feel like looking into the 2.4 solution but check out srmmu_vac_update_mmu_cache in the 2.2 kernel.

I killed that hack now that we align all shared mmaps to the same virtual colour :)

Nice. Where do you do this? And how do you handle the case of aliases with kseg, the giant kernel mapping?

Eric
Re: Caches, page coloring, virtual indexed caches, and more
Anton Blanchard [EMAIL PROTECTED] writes:

Hi,

Where do you do this? And how do you handle the case of aliases with kseg, the giant kernel mapping.

Aliases between user and kernel mappings of a page are handled by flush_page_to_ram (the old interface) or {copy,clear}_user_page, flush_dcache_page and update_mmu_cache (new interface). Sparc64 already uses the new interface and there are patches for ppc and ia64 to use it. The new interface allows flushes to be avoided, leading to rather nice performance increases. See Documentation/cachetlb.txt for more info.

Thanks. Well, they are a step in the right direction. But they are still racy, especially on SMP.

The bad case is: Process A in kernel space calls flush_dcache_page. Then process B in a separate thread writes to the first word in a cache line. Then Process A writes to the last word in the cache line. Assuming the virtual addresses from Process A and Process B are of a different color this gives two non overlapping writes with a well defined meaning, which the kernel gets wrong. In particular the ram will only see one write or the other, not both.

What it looks like to me is that SHMLBA needs to be extended to normal mmaps, making all pages in user space (page->index << PAGE_SHIFT) % SHMLBA virtually aligned. And whenever we access a page in the page cache that is not appropriately virtually aligned in the fixed kernel mapping, we can use the kmap infrastructure to map it to a better kernel location. If we reuse the same optimizations from flush_dcache_page it shouldn't be any worse, and in the pathological cases it will be faster, while removing the races seen above.

Any thoughts?

Eric
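The alignment rule proposed above (every user mapping of a page placed at an address congruent to its file offset modulo SHMLBA) might be sketched as follows. This is a rough illustration, not the kernel's implementation: the 256K SHMLBA is assumed from the MIPS figure quoted earlier in the thread, and colour_align() is a hypothetical helper that expects a page-aligned hint.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define SHMLBA     (256u * 1024u)   /* assumption: 256K, must be a power of two */

/*
 * Hypothetical helper: return the lowest address >= hint whose offset
 * within an SHMLBA-sized window equals that of file page `pgoff`,
 * i.e. addr % SHMLBA == (pgoff << PAGE_SHIFT) % SHMLBA.
 */
static uint64_t colour_align(uint64_t hint, uint64_t pgoff)
{
    uint64_t want = (pgoff << PAGE_SHIFT) % SHMLBA;   /* required offset in window */
    uint64_t base = hint & ~((uint64_t)SHMLBA - 1);   /* start of hint's window */
    uint64_t addr = base + want;

    assert((SHMLBA & (SHMLBA - 1)) == 0);
    if (addr < hint)          /* fell below the hint: use the next window */
        addr += SHMLBA;
    return addr;
}
```

Any two mappings of the same file page chosen this way differ by a multiple of SHMLBA, so they index the same lines in a virtually indexed cache no larger than SHMLBA.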
Re: Q: Linux rebooting directly into linux.
Werner Almesberger [EMAIL PROTECTED] writes:

I agree writing the code to understand the table may be a significant issue. On the other hand I still think it is worth a look, being able to unify option parsing for multiple platforms is not a small gain, nor is getting out from short sighted vendor half standards.

Well, you certainly have a point where stupid vendors and BIOS nonsense are concerned. However, if we ignore LinuxBIOS for a moment, each platform already has a set of configuration parameter passing conventions imposed by the firmware. So we need to be able to handle this anyway, and most of the information is highly platform-specific.

Well, I never intended for my UBE stuff to handle probeable information, and after thinking about it, it does seem reasonable to say that a table in a firmware rom (or generated by one) is as probeable as a table in a device rom.

LinuxBIOS is a special case, because you have your own firmware. But what you're suggesting is basically yet another parameter format, which needs to incorporate and possibly unify much of the information contained in all those platform-specific formats. I'm not sure it's worth the effort.

Well I half agree. I think where I'm going to go is to propose some new BIOS tables, as there are some truly broken platforms out there. In particular on alpha you can't even build a variant motherboard where the only change is the connection of interrupts to PCI slots without needing a kernel patch.

Agreed with BIOS bugs ;-) Where probing is possible, is it reliable ?

I hereby define probing as only being possible where you get reliable results. Thus PCI is included, pnp-ISA probably is, and straight-ISA is not.

The one thing I am most against is having to make BIOS calls. It is entirely too easy for a firmware constructor to be in a rush and mess it up, and to crash the whole boot process.

I'd take some baroque BIOS parameter table over yet another mandatory boot command line parameter any time ...

Definitely.

Hmm. I wonder how hard it would be to add -fPIC to the compilation line for that file. But I'm not certain that would do what I want in this instance...

Are there actually architectures where the compiler generates position-dependent code even if you're careful ? (I.e. all functions inlined, only auto variables.)

O.k. I have looked (I'm just polishing up my port to alpha). And yes, this can happen. It is not so much the code being position dependent as the code depending on the relative positions of the text and data segments. On the alpha there is a pointer to a globals area, and even using sufficiently large constants is enough to cause an access to a static variable.

As for always having all functions inline and using only auto variables, and no string constants, that is just asking for trouble. When something goes wrong it is way too tempting to insert a bit of debugging code and boom, the code is broken.

Eric
Re: limit on number of kmapped pages
David Wragg [EMAIL PROTECTED] writes:

While testing some kernel code of mine on a machine with CONFIG_HIGHMEM enabled, I've run into the limit on the number of pages that can be kmapped at once. I was surprised to find it was so low -- only 2MB/4MB of address space for kmap (according to the value of LAST_PKMAP; vmalloc gets a much more generous 128MB!).

kmap is for quick transitory mappings. kmap is not for permanent mappings. At least that was my impression. The persistence is intended to just kill error prone cases.

My code allocates a large number of pages (4MB-worth would be typical) to act as a buffer; interrupt handlers/BHs copy data into this buffer, then a kernel thread moves filled pages into the page cache and replaces them with newly allocated pages. To avoid overhead on IRQs/BHs, all the pages in the buffer are kmapped. But with CONFIG_HIGHMEM if I try to kmap 512 pages or more at once, the kernel locks up (fork() starts blocking inside kmap(), etc.).

This may be a reasonable use, I'm not certain. It wasn't the application kmap was designed to deal with though...

There are ways I could work around this (either by using kmap_atomic, or by adding another kernel thread that maintains a window of kmapped pages within the buffer). But I'd prefer not to have to add a lot of code specific to the CONFIG_HIGHMEM case.

Why do you need such a large buffer? And why do the pages need to be kmapped? If you are doing dma there is no such requirement... And unless you are running on something faster than a PCI bus I can't imagine why you need a buffer that big.

My hunch is that it makes sense to do the kmap, and the i/o in the bottom_half. What is wrong with that?

kmap should be quick and fast because it is for transitory mappings. It shouldn't be something whose overhead you are trying to avoid. If kmap is that expensive then kmap needs to be fixed, instead of your code working around a perceived problem.

At least that is what it looks like from here.

Eric
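The 2MB/4MB figure and the lockup at 512 pages fit together with simple arithmetic, sketched here in C. The LAST_PKMAP values of 512 and 1024 are assumptions inferred from the sizes quoted in the message, not read from any kernel tree, and kmap_window_bytes() is a made-up helper.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12   /* 4K pages assumed */

/* Hypothetical helper: bytes of address space covered by `last_pkmap` kmap slots. */
static size_t kmap_window_bytes(size_t last_pkmap)
{
    assert(last_pkmap > 0);
    return last_pkmap << PAGE_SHIFT;
}

/* Hypothetical helper: how many pages a buffer of `bytes` would pin if fully kmapped. */
static size_t pages_needed(size_t bytes)
{
    return (bytes + ((size_t)1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
}
```

Under these assumptions a 512-slot window is exactly 2MB, so a permanently kmapped 4MB buffer (1024 pages) would consume every slot of even the larger window and leave none for anyone else, which matches the blocking described above.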
Re: limit on number of kmapped pages
David Wragg [EMAIL PROTECTED] writes:

I'd still like to know what the basis for the current kmap limit setting is.

Mostly at one point kmap_atomic was all there was. It was only the difficulty of implementing copy_from_user with kmap_atomic that convinced people we needed something more. So actually if we can kmap several megabytes at once the kmap limit is quite high.

Eric
Re: [PATCH] vma limited swapin readahead
Marcelo Tosatti [EMAIL PROTECTED] writes:

On Wed, 31 Jan 2001, Stephen C. Tweedie wrote: Hi, On Wed, Jan 31, 2001 at 01:05:02AM -0200, Marcelo Tosatti wrote: However, the pages which are contiguous on swap are not necessarily contiguous in the virtual memory area where the fault happened. That means the swapin readahead code may read pages which are not related to the process which suffered a page fault.

Yes, but reading extra sectors is cheap, and throwing the pages out of memory again if they turn out not to be needed is also cheap. The on-disk swapped pages are likely to have been swapped out at roughly the same time, which is at least a modest indicator of being of the same age and likely to have been in use at the same time in the past.

You're throwing away pages from memory to do the readahead. These pages might be more useful than the pages which you're reading from swap.

Possibly. However the win (lower latency) from getting swapin readahead is probably even bigger. And you are throwing out the least desirable pages in memory.

I'd like to see at least some basic performance numbers on this, though.

I'm not sure if limiting the readahead the way my patch does is a better choice, too.

A better choice is probably to make certain the read and write paths are in sync so that you can know the readahead is going to do you some good. This is a little tricky though.

Unless you can see a big performance win somewhere please don't submit this to Linus for inclusion.

Eric
Re: [PATCH] vma limited swapin readahead
David Gould [EMAIL PROTECTED] writes:

Hmmm, arguably reading pages we do not want is a mistake. I should think that if a big performance win is required to justify a design choice, it should be especially required to show such a win for doing something that on its face is wrong.

The case for files has already been justified. The performance gain of reading pages that are contiguous on disk has been justified. The only thing that has not been shown is that swap pages that are used together are located near each other in swap.

As for design choices, simplicity, maintainability and comprehensibility tend to be more important than absolute performance. This lets bugs be fixed, and the big changes that tend to be the biggest wins happen.

I am skeptical of the argument that we can win by replacing "the least desirable" pages with pages that are even less desirable and that we have no recent indication of any need for. It seems possible under heavy swap to discard quite a portion of the useful pages in favor of junk that just happened to have a lucky disk address.

I won't argue that. My gut just says we should work to improve the disk addresses, so it isn't luck. ;) And only if we fail in that, hack up the efficient simple policy that we have for reading disk data in.

Of course since I'm not actually writing the code at the moment this is all hot air :)

Eric
Re: [Question] Explanation of zero-copy networking
Jamie Lokier [EMAIL PROTECTED] writes:

Richard B. Johnson wrote: However, PCI to memory copying runs at about 300 megabytes per second on modern PCs and memory to memory copying runs at over 1,000 megabytes per second. In the future, these speeds will increase.

That would be big expensive modern PCs then. Our clusters of 700MHz boxes are strictly limited to 132 megabytes per second over PCI...

300 Megabytes per second is definitely an odd number for a PCI bus. But 132 Megabytes per second is actually high, the continuous burst speeds are:

32bit 33Mhz: 33*1000*1000*32/(1024*1024*8) = 125.9 Megabytes/second
64bit 33Mhz: 33*1000*1000*64/(1024*1024*8) = 251.8 Megabytes/second
32bit 66Mhz: 66*1000*1000*32/(1024*1024*8) = 251.8 Megabytes/second
64bit 66Mhz: 66*1000*1000*64/(1024*1024*8) = 503.5 Megabytes/second

The possibility of getting continuous bursts is actually low, if nothing else you have an interrupt acknowledgement 100 times per second. But if you are pushing the bus it should deliver close to its burst potential. But the ISA traffic doing subtractive decode can be nasty because you get 4 PCI cycles before you even get acknowledgement from the PCI/ISA bridge that there is something to transfer to.

Eric

- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
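The burst figures follow from one formula, shown here as a small C sketch. It assumes, as the message does, one transfer of the full bus width per clock and a megabyte of 1024*1024 bytes; pci_burst_mib_per_s() and print_pci_table() are names invented for the sketch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Peak burst rate in megabytes (2^20 bytes) per second, assuming one
 * transfer of `width_bits` on every clock of a `clock_mhz` bus.
 */
static double pci_burst_mib_per_s(double clock_mhz, unsigned width_bits)
{
    assert(width_bits % 8 == 0);
    return clock_mhz * 1000.0 * 1000.0 * width_bits / (1024.0 * 1024.0 * 8.0);
}

/* Print the four bus width / clock combinations discussed in the message. */
static void print_pci_table(void)
{
    const double clocks[] = { 33.0, 66.0 };
    const unsigned widths[] = { 32, 64 };

    for (int c = 0; c < 2; c++)
        for (int w = 0; w < 2; w++)
            printf("%ubit %.0fMhz: %.1f Megabytes/second\n",
                   widths[w], clocks[c],
                   pci_burst_mib_per_s(clocks[c], widths[w]));
}
```

The often quoted 132 comes from the same product counted in decimal megabytes: 33*1000*1000*32/8 is 132,000,000 bytes per second, so the two numbers describe the same bus in different units.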
Re: Possible PCI subsystem bug in 2.4
Maciej W. Rozycki [EMAIL PROTECTED] writes:

On 4 May 2001, Eric W. Biederman wrote: The example that sticks out in my head is we rely on the MP table to tell us if the local apic is in pic_mode or in virtual wire mode. When all we really have to do is ask it.

You can't. IMCR is write-only and may involve chipset-specific side-effects. Then even if IMCR exists, a system's firmware might have chosen the virtual wire mode for whatever reason (e.g. broken hardware).

Admittedly you can't directly detect IMCR state. But triggering an interrupt on the bootstrap processor local apic, and failing to receive it should be proof the IMCR is at work. Alternatively if I'm wrong about the wiring, disabling all interrupts at the apic level and receiving one is a second proof that IMCR is at work.

Further I don't think a processor with an onboard apic works with an IMCR register.

What I was thinking of earlier is that you can detect an apic or ioapic in virtual wire mode, which the current code and the intel MP spec treat as the opposite possibility.

Eric
Re: PATCH: Enable IP PNP for 2.4.4-ac8
H . J . Lu [EMAIL PROTECTED] writes:

On Fri, May 11, 2001 at 04:28:05PM -0700, David S. Miller wrote: H . J . Lu writes: 2.4.4-ac8 disables IP auto config by default even if CONFIG_IP_PNP is defined. Here is a patch.

It doesn't make any sense to enable this unless parameters are given to the kernel via the kernel command line or from firmware settings.

From Configure.help:

IP: kernel level autoconfiguration CONFIG_IP_PNP This enables automatic configuration of IP addresses of devices and of the routing table during kernel boot, based on either information supplied on the kernel command line or by BOOTP or RARP protocols. You need to say Y only for diskless machines requiring network access to boot (in which case you want to say Y to Root file system on NFS as well), because all other machines configure the network in their startup scripts.

It works fine for 2.4.4. However, in 2.4.4-ac8, even if I select CONFIG_IP_PNP, I have to pass ip= to kernel, in addition to nfsroot=x.x.x.x:/foo/bar. With 2.4.4, I can just pass nfsroot=x.x.x.x:/foo/bar to kernel.

O.k. Configure.help needs to be updated. ip=on or ip=bootp or ip=dhcp work fine. I wonder if I forgot to forward port the docs?

This same situation exists for 2.2.18 2.2.19 as well.

The only way to get long term stability out of this is to write a user space client you can put in a ramdisk. One of these days...

Eric
Re: LANANA: To Pending Device Number Registrants
Daniel Phillips [EMAIL PROTECTED] writes:

On Tuesday 15 May 2001 23:20, Nicolas Pitre wrote: Personally, I'd really like to see /dev/ttyS0 be the first detected serial port on a system, /dev/ttyS1 the second, etc.

There are well-defined rules for the first four on PC's. The ttySx better match the labels the OEM put on the box.

Actually it would be better to have the OEM put a label in the firmware, and then have a way to query the device for its label.

The legacy rules are nice but serial ports are done with superio chips now. And superio chips are almost all ISA PNP chips without device enumeration and isolation.

Eric
Re: PATCH: Enable IP PNP for 2.4.4-ac8
H . J . Lu [EMAIL PROTECTED] writes:

On Sun, May 13, 2001 at 01:24:18PM -0600, Eric W. Biederman wrote: H . J . Lu [EMAIL PROTECTED] writes: It doesn't make any sense. When I specify CONFIG_IP_PNP and BOOTP/DHCP, I want a kernel with IP config using BOOTP/DHCP. I would expect IP config is turned on for BOOTP/DHCP by default. You can turn it off by passing ip=off to kernel. Did I miss something?

Since you have to set the command line anyway ip=dhcp is no extra burden and it lets you use the same kernel to boot off the hard drive etc.

Why do I have to set ip=dhcp? If I have selected CONFIG_IP_PNP and DHCP in my kernel configuration, should it be on by default?

I agree it isn't intuitive, and if nfsroot=xxx is specified it should probably turn on if there is missing information. But if you have to select the command line anyway

Mostly I like the situation where I can compile it in and turn it on when I need it, instead of having to do things differently depending on whether it is compiled in or not. ip=on is all it really takes.

This same situation exists for 2.2.18 2.2.19 as well. The only way to get long term stability out of this is to write a user space client, you can put in a ramdisk. One of these days...

It doesn't work with diskless machines which don't support ramdisk during boot.

I don't believe that is a real world situation. I boot diskless all of the time and supporting a ramdisk is trivial. You just have a program that slaps a kernel, a ramdisk, and some command line arguments into a single image, along with a touch of adapter code to set the kernel parameters correctly, and then boot that.

Let me guess. Your diskless machines are mostly x86.

Mostly, but not exclusively.

Have you tried ramdisk on diskless alpha, arm, m68k, mips, ppc, sh, sparc, booting over network?

First the booting situation on linux with respect to multiple platforms sucks. We pass parameters in weird ways on every platform. The command line is the only thing that stays mostly the same. I'm looking at what it takes to clean that up, so we can have multiplatform bootloaders.

I have implemented what it takes to attach a ramdisk, and if you can boot an arbitrary kernel it isn't hard to have a program that attaches a ramdisk.

Now although I believe this is the right direction to go, you will notice I ported the dhcp IP auto configuration from 2.2.19 to 2.4.x, buying a little more time to get this working.

Eric
Re: LANANA: To Pending Device Number Registrants
Jonathan Lundell [EMAIL PROTECTED] writes:

At 10:42 AM +0200 2001-05-19, Kai Henningsen wrote:

Jeff Garzik's ethtool extension at least tells me the PCI bus/dev/fcn, though, and from that I can write a userland mapping function to the physical location.

I don't see how PCI bus/dev/fcn lets you do that.

I know from system documentation, or can figure out once and for all by experimentation, the correspondence between PCI bus/dev/fcn and physical locations. Jeff's extension gives me the mapping between eth# and PCI bus/dev/fcn, which is not otherwise available (outside the kernel).

Just a second, let me re-enumerate your PCI busses and change all of the bus numbers.

Not that this is a bad thought. It is just that you need to know the tree of PCI busses/bridges up to the root on the machine in question.

Eric
Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
Ben LaHaise [EMAIL PROTECTED] writes:

Hey folks,

The work-in-progress patch for-demonstration-purposes-only below consists of 3 major components, and is meant to start discussion about the future direction of device naming and its interaction with the block layer. The main motivations here are the wasting of minor numbers for partitions, and the duplication of code between user and kernel space in areas such as partition detection, uuid location, lvm setup, mount by label, journal replay, and so on...

1. Generic lookup method and argument parsing (fs/lookupargs.c)

This code implements a lookup function which is for demonstration purposes used in fs/block_dev.c. The general idea is to pass additional parameters to device drivers on open via a comma separated list of options following the device's name. Sample uses:

/dev/sda/raw - open sda in raw mode.
/dev/sda/limit=102400 - open sda with a limit of 100K
/dev/sda/offset=1024,limit=2048 - open a device that gives a view of sda at an offset of 1KB to 2KB

GAhh!! Ben please think /proc/sys. One value per ``file''.

3. Userspace partition code proposal

Given the above two bits, here's a brief explanation of a proposal to move management of the partitioning scheme into userspace, along with portions of raid startup, lvm, uuid and mount by label code needed for mounting the root filesystem.

Consider that the device node currently known as /dev/hda5 can also be viewed as /dev/hda at offset 512000 with a limit of 10GB. With the extensions in fs/block_dev.c, you could replace /dev/hda5 with /dev/hda/offset=512000,limit=1024000.

Now, by putting the partition parsing code into a libpart and binding mount to a libpart, the root filesystem mounting code can be run out of an initrd image. The use of mount gives us the ability to mount filesystems by UUID, by label or other exotic schemes without having to add any additional code to the kernel.

But you need to use uclibc or a similar library to get the code size down small enough, so you don't quadruple the size of your boot image.

As for wasting minors: if you are going to rework partitions they should have dynamic device numbers that are assigned when the partition is discovered by the system. I admit a hot-plug partition sounds incongruous, but it should be fairly simple to implement.

If your real root is on a ``hot-plug'' device then it does look like you need an initrd to help select your root partition. Hmm. The code is simple enough that code in the kernel shouldn't be bad. And the interface can be simple as well. Have:

/dev/sda/partitions/1
/dev/sda/partitions/2
/dev/sda/partitions/3
/dev/sda/partitions/4
/dev/sda/partitions/5

and also

/dev/sda/partitions/1/uuid
/dev/sda/partitions/1/label
/dev/sda/partitions/1/offset
/dev/sda/partitions/1/limit

to expose what the kernel found in its initial scan of the partitions.

For creating partitions you might want to do:

cat 1024 2048 /dev/sda/newpartition

Though if you could do it with create, and writes to offset and limit, that would be a little nicer.

Al, would it work to have the lookup method for /dev/sda automatically mount an instance of scsifs on /dev/hda (from an internal mount), and then have dput drop that mount? I skimmed the code and it looks possible.

Soft mounting a fs isn't strictly necessary for the case above, but it looks simplest to keep the list of partitions permanently in the dcache. We would also need to modify permission to take a vfsmnt argument so your permissions to a device file could vary depending on which device file you start with.

Eric
Re: PATCH: Enable IP PNP for 2.4.4-ac8
H . J . Lu [EMAIL PROTECTED] writes:

It doesn't make any sense. When I specify CONFIG_IP_PNP and BOOTP/DHCP, I want a kernel with IP config using BOOTP/DHCP. I would expect IP config to be turned on for BOOTP/DHCP by default. You can turn it off by passing ip=off to the kernel. Did I miss something?

Since you have to set the command line anyway, ip=dhcp is no extra burden, and it lets you use the same kernel to boot off the hard drive etc.

This same situation exists for 2.2.18 and 2.2.19 as well. The only way to get long-term stability out of this is to write a user space client you can put in a ramdisk. One of these days...

It doesn't work with diskless machines which don't support ramdisk during boot.

I don't believe that is a real-world situation. I boot diskless all of the time, and supporting a ramdisk is trivial. You just have a program that slaps a kernel, a ramdisk, and some command line arguments into a single image, along with a touch of adapter code to set the kernel parameters correctly, and then boot that.

Eric
Re: PATCH: Enable IP PNP for 2.4.4-ac8
David Woodhouse [EMAIL PROTECTED] writes:

[EMAIL PROTECTED] said:

There wasn't even DHCP support before so yes you did. As you can't get the nfs mount point from bootp.

Wasn't there a default? The Indy behind me seems to try to mount /tftpboot/172.16.18.195, so I put a filesystem there just to make it happy. It's a 2.4.3 kernel.

Duh. I forgot about the default path.

Well I think in the CONFIG_BLK_DEV=n case it might wind up being a ramfs or tmpfs image. Something like a simplified version of tar.

Well, if it stops working and stays broken, I suppose I'll just have to hack up a built-in command line option. ISTR ARM already has such an option. I'd rather it didn't break, though.

The clean way to handle it, and I'll take a look, is to have root=/dev/nfs (and the rdev equivalent) set ip=on if it isn't already. The current 2.4.4 behavior of root=/dev/hda3 doing ip autoconfig when the code is compiled into the kernel is just bad.

Eric
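The policy proposed at the end of this message can be sketched in a few lines of user-space C. This is an illustration only, not the kernel's option parsing: `parse_ip_arg` and `want_ip_autoconfig` are hypothetical names, and treating every `ip=` value other than `off` as a request for configuration is a simplifying assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tri-state for the ip= kernel parameter. */
enum ip_setting { IP_UNSET, IP_OFF, IP_ON };

/* Classify the value given after "ip=" (NULL when the parameter is absent). */
static enum ip_setting parse_ip_arg(const char *val)
{
    if (val == NULL)
        return IP_UNSET;
    if (strcmp(val, "off") == 0)
        return IP_OFF;
    /* Simplification: any other value (on, dhcp, ...) asks for configuration. */
    return IP_ON;
}

/*
 * Decide whether IP autoconfiguration should run. An explicit ip= setting
 * always wins; with no ip= at all, only root=/dev/nfs turns it on, so a
 * root=/dev/hda3 boot does not autoconfigure merely because the code is
 * compiled in.
 */
static int want_ip_autoconfig(const char *root, const char *ip_val)
{
    enum ip_setting ip = parse_ip_arg(ip_val);

    assert(root != NULL);
    if (ip == IP_UNSET)
        return strcmp(root, "/dev/nfs") == 0;
    return ip == IP_ON;
}
```

Under these assumptions a disk boot stays quiet unless ip= is given, while an NFS-root boot configures itself even when ip= is omitted.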
Re: DVD blockdevice buffers
Stephen C. Tweedie [EMAIL PROTECTED] writes:

Hi,

On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:

On Wed, 23 May 2001, Stephen C. Tweedie wrote:

that the filesystems already do. And you can do it a lot _better_ than the current buffer-cache-based approach. Done right, you can actually do all IO in page-sized chunks, BUT fall down on sector-sized things for the cases where you want to.

Right, but you still lose the caching in that case. The write works, but the cache becomes nothing more than a buffer.

No. It is still cached. You find the buffer with page->buffers, and when all of them are up-to-date (whether from read-in or from having written to them all), you just mark the whole page up-to-date.

It works, but *only* if the application writes a whole page worth of data. From the previous emails I had the understanding that this application is writing small data items in random 512-byte blocks. It is not writing the rest of the page. The page never becomes uptodate. That in itself isn't a problem, but readpage() can't tell the underlying layers that only a part of the page is wanted, so there's no way to tell readpage that the page is in fact partially uptodate.

And just telling the application to write the rest of the page too isn't going to cut it, because the rest of the page may contain other objects which aren't in cache so we can't write them without first reading the page. The only alternative is to change the on-disk layout, forcing a minimum PAGESIZE on the IO chunks.

This _works_. Try it on ext2 or NFS today.

Not for this workload. Now, maybe it's not an interesting workload. But shifting the uptodate granularity from buffer to page sized _does_ impact the effectiveness of the cache for such an application.

So in short: the page cache supports _today_ all the optimizations.

For write, perhaps; but for subsequent read, generic_read_page doesn't see any of the data in the page unless the whole page has been written.

generic_read_page???

block_read_full_page seems to handle this correctly, at least with respect to keeping the data around and not doing the I/O on data we already have. But it still reads in the unpopulated parts of the page even if it is unnecessary.

The case we don't get quite right is partial reads that hit cached data on a page that doesn't have PG_uptodate set. We don't actually need to do the I/O on the surrounding page to satisfy the read request. But we do, because generic_file_read doesn't even think about that case.

For the small random read case we could use a mapping->a_ops->readpartialpage function that sees if a request can be satisfied entirely from cached data. But this is just to allow generic_file_read to handle this case.

Eric
Re: DVD blockdevice buffers
Linus Torvalds [EMAIL PROTECTED] writes:

On 25 May 2001, Eric W. Biederman wrote:

For the small random read case we could use a mapping->a_ops->readpartialpage

No, if so I'd prefer to just change readpage() to take the same kinds of arguments commit_page() does, namely the beginning and end of the read area.

No. I obviously picked a bad name, and a bad place to start.

int data_uptodate(struct page *page, unsigned offset, unsigned len)

This is really an extension to PG_uptodate, not readpage. It should never ever do any I/O. It should just implement a check to see if we have all of the data wanted already in the page in the page cache. As simply a buffer checking entity it will likely share virtually 0 code with read_page.

Filesystems could choose to ignore the arguments completely, and just act the way they already do - filling in the whole page. OR a filesystem might know that the page is partially up-to-date (because of a partial write), and just return an immediate "this area is already uptodate" return code or something. Or it could even fill in the page partially, and just unlock it (but not mark it up-to-date: the reader then has to wait for the page and then look at PG_error to decide whether the partial read succeeded or not).

First, mm/filemap.c has generic cache management, so it should make the decision. The logic is: does this page have the data in cache? If so just return it. Otherwise read all that you can at once. So we either want a virtual function that can make the decision on a per-filesystem basis whether we have the data we need in the page cache, or we need to convert the buffer_head into a more generic entity so everyone can use it.

I don't think it really matters, I have to say. It would be very easy to implement (all the buffer-based filesystems already use the common fs/buffer.c readpage, so it would really need changes in just one place, along with some expanded prototypes with ignored arguments in some other places). But it _could_ be a performance helper for some strange loads (write a partial page and immediately read it back - what a stupid program), and more importantly Al Viro felt earlier that a partial read approach might help his metadata-in-page-cache stuff because metadata tends to sometimes be scattered wildly across the disk.

Maybe. I think despite the similarities (partial pages) Al and I are looking at two entirely different problems.

So then we'd have

int (*readpage)(struct file *, struct page *, unsigned offset, unsigned len);

and the semantics would be:

- the function needs to start IO for _at_least_ the page area [offset, offset+len[
- return error code for _immediate_ errors (ie not asynchronous)
- if there was an asynchronous read error, we set PG_error
- if the page is fully populated, we set PG_uptodate
- if the page was not fully populated, but the partial read succeeded, the filesystem needs to have some way of keeping track of the partial success (page->buffers is obviously the way for a block-based one), and must _not_ set PG_uptodate.
- after the asynchronous operation (whether complete, partial or unsuccessful), the page is unlocked to tell the reader that it is done.

Now, this would be coupled with:

- generic_file_read() does the read-ahead decisions, and may decide that we really only need a partial page.

But NOTE! The above is meant to potentially avoid unnecessary IO and thus speed up the read-in. HOWEVER, it _will_ slow down the case where we first would read a small part of the page and then soon afterwards read in the rest of the page. I suspect that is the common case by far, and that the current whole-page approach is the faster one in 99% of all cases. So I'm not at all convinced that the above is actually worth it.

I don't want partial I/O at all. And I always want to see reads reading in all of the data for a page. I just want an interface where we can say: hey, we don't actually have to do any I/O for this read request, give them back their data.

If somebody can show that the above is worth it and worth implementing (ie the Al Viro kind of "I have a real-life scenario where I'd like to use it"), and implements it (should be a fairly trivial exercise), then I'll happily accept new semantics like this. But I do _not_ want to see another new function (partialread()), and I do _not_ want to see synchronous interfaces (Al's first suggestion).

My naming mistake. I don't want to see this logic combined with readpage. That is an entirely different case.

I can't see how adding a slow case to PageUptodate to check for a partially uptodate page could hurt our performance. And I can imagine how it could help.

Eric
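The data_uptodate() check described in this message can be illustrated with a toy user-space model. Everything here except the function name and its (offset, len) arguments is an assumption: `struct toy_page`, the 4096-byte page, the 512-byte block size, and the per-block validity bitmap merely stand in for whatever a filesystem would track through its buffers. The point is only that the check does no I/O and has a fast path for a fully up-to-date page.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE  4096u
#define TOY_BLOCK_SIZE 512u

/* Toy stand-in for a page cache page; not the kernel's struct page. */
struct toy_page {
    int uptodate;              /* plays the role of PG_uptodate */
    unsigned char block_valid; /* bit i set: block i holds valid cached data */
};

/*
 * Return 1 if every byte of [offset, offset + len) is already cached,
 * 0 otherwise. Never starts any I/O.
 */
static int data_uptodate(const struct toy_page *page, unsigned offset,
                         unsigned len)
{
    unsigned first, last, i;

    assert(page != NULL);
    if (len == 0 || offset >= TOY_PAGE_SIZE || len > TOY_PAGE_SIZE - offset)
        return 0;
    if (page->uptodate)
        return 1; /* fast path: the whole page is valid */

    /* Slow path: every block touched by the range must be valid. */
    first = offset / TOY_BLOCK_SIZE;
    last = (offset + len - 1) / TOY_BLOCK_SIZE;
    for (i = first; i <= last; i++)
        if (!(page->block_valid & (1u << i)))
            return 0;
    return 1;
}
```

A read path built on this would copy data straight out of the page when the check succeeds and fall back to reading the whole page otherwise, which matches the "no partial I/O" position taken above.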
Re: getting include-files from arch/arch/subdir
"Heusden, Folkert van" [EMAIL PROTECTED] writes:

Hi,

ADC why not
ADC #include <arch/i386/etc.h>
ADC Amit

Since that is not cross-platform. I like a solution which does the #include transparently for alpha/i386/etc.

Umm. Then the include file should probably rest under the include hierarchy. Say:

#include <asm/i386/etc.h>

That makes it clear the code is exported to someone else... Going down into the arch tree looks ugly.

Eric
Re: [PATCH] Prevent OOM from killing init
Rik van Riel [EMAIL PROTECTED] writes:

On Wed, 21 Mar 2001, Patrick O'Rourke wrote:

Since the system will panic if the init process is chosen by the OOM killer, the following patch prevents select_bad_process() from picking init.

One question ... has the OOM killer ever selected init on anybody's system ?

I think that the scoring algorithm should make sure that we never pick init, unless the system is screwed so badly that init is broken or the only process left ;)

Is there ever a case where killing init is the right thing to do? My impression is that if init is selected the whole machine dies. If you can kill init and still have a machine that mostly works, then I guess it makes some sense not to kill it.

Guaranteeing not to select init can buy you peace of mind, because init, if properly set up, can put the machine back together again, while not special-casing init means something weird might happen and init would be selected.

Eric
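The special case being debated can be shown with a toy selection loop. This is not the kernel's select_bad_process(): `struct toy_task`, its `badness` field, and `toy_select_bad_process` are hypothetical stand-ins, and the only behaviour illustrated is that pid 1 is skipped no matter how high it scores.

```c
#include <assert.h>
#include <stddef.h>

/* Toy task record; the real scoring inputs are not modelled here. */
struct toy_task {
    int pid;
    unsigned long badness; /* higher means a better candidate to kill */
};

/*
 * Pick the task with the highest badness, never returning init (pid 1).
 * Returns NULL when no task other than init is available.
 */
static const struct toy_task *
toy_select_bad_process(const struct toy_task *tasks, size_t n)
{
    const struct toy_task *chosen = NULL;
    size_t i;

    assert(tasks != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (tasks[i].pid == 1)
            continue; /* guarantee: init is never selected */
        if (chosen == NULL || tasks[i].badness > chosen->badness)
            chosen = &tasks[i];
    }
    return chosen;
}
```

The NULL return is the case mentioned in the quoted text where init is the only process left; what the caller should do then is left open in this sketch.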
Re: [PATCH] Prevent OOM from killing init
Guest section DW [EMAIL PROTECTED] writes:

On Wed, Mar 21, 2001 at 08:48:54PM -0300, Rik van Riel wrote:

On Wed, 21 Mar 2001, Patrick O'Rourke wrote:

Since the system will panic if the init process is chosen by the OOM killer, the following patch prevents select_bad_process() from picking init.

There is a dozen other processes that must not be killed. Init is just a random example.

Not killing init provides enough for recovery if you truly hit an out of memory situation. With 2.4.x at least it is a box misconfiguration that causes it. The 2.2.x VM doesn't always try to swap, and free things up hard enough, before reporting out of memory. But even the 2.2.x problems are rare.

One question ... has the OOM killer ever selected init on anybody's system ?

Last week I installed SuSE 7.1 somewhere. During the install: "VM: killing process rpm", leaving the installer rather confused. (An empty machine, 256MB, 144MB swap, I think 2.2.18.)

swap < RAM. Ouch! This is a misconfiguration on a machine that actually starts swapping, and where out of memory problems are a reality. The fact that an installer would trigger swapping on a 256MB machine is a second problem.

Last month I had a computer algebra process running for a week. Killed. But this computation was the only task this machine had. Its sole reason of existence. Too bad - zero information out of a week's computation. (I think 2.4.0.)

It looks like you didn't have enough resources on that machine, period. I pretty much trust 2.4.x in this department. Did that machine also have its swap misconfigured?

Clearly, Linux cannot be reliable if any process can be killed at any moment. I am not happy at all with my recent experiences.

Hmm. It should definitely not be at any moment. It should only be when resources are exhausted. So putting enough swap on a machine should be enough to stop this from ever happening.

Eric
Q: Linux rebooting directly into linux.
I have recently developed a patch that allows Linux to boot directly into another Linux kernel. With the code freeze it appears inappropriate to submit it at this time. Linus, in principle do you have any trouble with this kind of functionality?

The immediate applications of this code are:

- Clusters can network boot over arbitrary network interfaces, and the network driver only needs to be written and maintained in one place.
- Multiplatform boot loaders can be written.
- The Linux kernel can be included in a boot ROM and you can still boot other Linux kernels.
- Kernel developers can have a fast interface for booting into a development kernel.

The interface is designed to be simple and inflexible yet very powerful. To that end the code just takes an ELF binary and a command line. The started image also takes an environment generated by the kernel of all of the unprobeable hardware details.

ELF was picked for its multiplatform support and the sheer simplicity of its program header. Plus you can use standard tools to generate ELF images fairly easily.

The environment passed to a loaded image is designed to expand and handle new data types without breaking old decoders. They just break because they don't support the new hardware :)

Linus, the path I envision is that this code gets integrated early in 2.5. This includes cleaning up the boot paths so all our C code has to deal with is this new format. Then backporting the functionality to 2.4 and possibly 2.2.

The kernel patches can be found in:
ftp://ftp.linuxnetworx.com/pub/kexec-patches-1.0.tar.gz
(This is a patchset with 4 patches:
1 Ingo Molnar's improved apic support
2 My enhancements upon it so we restore the apics to their boot state when we shut down.
3 My 2 line patch to make certain that in smp_send_stop the last cpu running is the boot cpu. (Required by the MP spec...)
4 The code to support execing a new kernel.
)

The code to generate an image bootable by this new syscall is in:
ftp://ftp.linuxnetworx.com/pub/mkelfImage-1.0.tar.gz
(This is a perl script that takes a kernel and possibly a ramdisk and a command line and generates an elfimage suitable to be booted in this new infrastructure)

Eric

p.s. Linus, the code is not included inline because I don't expect it to be included just yet.
Re: Better testing of hardware (was: Defective Read Hat)
"Stephen Gutknecht (linux-kernel)" [EMAIL PROTECTED] writes:

A Linux Kernel compile test does a really good job of testing the hard disk, RAM, and CPU... as it executes all types of instructions and the final output depends on all prior steps completing correctly. On a really fast system (900Mhz) it might make sense to run it twice, once to "warm up" the CPU and other components. Most "benchmarks" just test speed, not the actual stability or data integrity (they write results to a device but don't check for data corruption, or they test only one device at a time, not all at once).

Also note that a Linux Kernel compile stresses memory because of the very pointer-heavy data structures of gcc. This means that memory corruption is most likely to flip a bit in a pointer, and cause a bad pointer.

Eric
Re: Ext2 Performances
Aaron Sethman [EMAIL PROTECTED] writes:

You might want to take a look at using reiserfs on the 130GB partition, as it is journalled and doesn't need to be fsck'ed.

No. All journaling filesystems need to be fsck'ed. A correctly operating one simply doesn't need to be fsck'ed because of unexpected loss of the operating system, which greatly reduces the probability. If an error is detected in the filesystem, fsck is still what you have to do to correct it.

Eric
Re: LKCD from SGI
Peter Samuelson [EMAIL PROTECTED] writes:

[Matt D. Robinson]

Any way we can standardize 'make install' in the kernel? It's disturbing to have different install mechanisms per platform ... I can make the changes for a few platforms.

2.5 material, already on the todo list.

What is the thought on this? There is an issue with different boot loaders needing rather dramatically different formats...

Eric
Re: Booting AMD Elan520 without BIOS
Ronald G Minnich [EMAIL PROTECTED] writes:

On Fri, 24 Nov 2000, I+D wrote:

I'm trying to boot an AMD Elan520 board without bios with kernel 2.4.0-test10 configured for i486 and PCI direct access. This kernel boots correctly from HD using the bios provided with the evaluation board but kernel 2.4.0-test8 and previous hang after "Ok booting the kernel".

well, first I want your code for linuxbios :-)

The last message I see is "Calibrating delay loop" (I see this thanks to the Jtag debugger for Elan520 because I haven't configured the VGA board yet).

you don't have clock interrupts on. If you are able to single step you'll probably see it in the loop spinning on jiffies. This is one of our regular problems with a new mainboard.

This can also easily be a misconfiguration of the local apic. It might need to be put into virtual wire mode.

Eric
Re: PROBLEM: crashing kernels
Alan Cox [EMAIL PROTECTED] writes:

been compiled into the kernel, and not as a module) always gave the errors:

eth0: Transmit timed out: status 0050 0090 at 134704418/134704432
eth0: Trying to restart the transmitter...

Known problem. This one might be fixed in current 2.2.18pre. Some people see it, some don't.

I have another data point on this problem. I have seen it most with 2.4.0-test9. But I'll look at 2.2.18pre.

I can trigger this bug fairly reliably by warm booting several times in a row. My Linux warm booting directly into Linux code triggers this one fairly reliably :) Also, putting another NIC in seems to help trigger it as well.

The 2.4.0-testxxx watchdog seems eventually to handle this case, but it takes 1/2 hour or so to actually kick in and reset the card.

Eric
Re: SCSI problem on aic7xxx on L440GX+ using LinuxBIOS
Ronald G Minnich [EMAIL PROTECTED] writes:

Eric, here is the ksymoops (end of message) from that earlier failure. I'm just wondering if anyone out there has seen anything like this. Also, if anyone sees anything odd about the scsi configuration that would help too. Thanks in advance ...

Ron.

vger.rutgers.edu died a couple of months ago. vger.kernel.org is the new machine the linux kernel mailing list is on. I'm forwarding this there. I don't know how much help we can get on a bug report against 2.4.0-test6 though.

Eric

ron

On 30 Nov 2000, Eric W. Biederman wrote:

Ronald G Minnich [EMAIL PROTECTED] writes:

This is 2.4.0-test6, on an L440GX, running linuxbios. The node comes up and appears to run fine:

(scsi0) Adaptec AIC-7896/7 Ultra2 SCSI host adapter found at PCI 0/12/0
(scsi0) Wide Channel A, SCSI ID=7, 32/255 SCBs
(scsi0) Downloading sequencer code... 392 instructions downloaded
(scsi1) Adaptec AIC-7896/7 Ultra2 SCSI host adapter found at PCI 0/12/1
(scsi1) Wide Channel B, SCSI ID=7, 32/255 SCBs
(scsi1) Downloading sequencer code... 392 instructions downloaded
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
Adaptec AIC-7896/7 Ultra2 SCSI host adapter
scsi1 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
Adaptec AIC-7896/7 Ultra2 SCSI host adapter
scsi : 2 hosts.
(scsi0:0:1:0) Synchronous at 40.0 Mbyte/sec, offset 31.
Vendor: QUANTUM Model: ATLAS 10K 9SCA Rev: UCH0
Type: Direct-Access ANSI SCSI revision: 03
Detected scsi disk sda at scsi0, channel 0, id 1, lun 0
Vendor: VA Linux Model: Fullon 2x2 Rev: 1.01
Type: Processor ANSI SCSI revision: 02
scsi : detected 1 SCSI disk total.
SCSI device sda: hdwr sector= 512 bytes. Sectors= 17938986 [8759 MB] [8.8 GB]
Partition check:
sda: sda1 sda2 sda3
. . .
Welcome to Red Hat Linux
Press 'I' to enter interactive startup.
Mounting proc filesystem [ OK ]
Configuring kernel parameters [ OK ]
hwclock: Can't open /dev/tty1, errno=19: No such device.
Setting clock (utc): Thu Nov 30 23:07:43 /etc/localtime 2000 [ OK ]
Loading default keymap/etc/rc.d/rc.sysinit: /dev/tty0: No such device [FAILED]
Activating swap partitions [ OK ]
Setting hostname rpc4 [ OK ]
Checking root filesystem
/dev/sda1 contains a file system with errors, check forced.
/dev/sda1: Inode 84024 has illegal block(s).
[/sbin/fsck.ext2 -- /] fsck.ext2 -a /dev/sda1
/dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. (i.e., without -a or -p options)
---

But in the middle of an fsck

Anyway, I'm wondering if anyone has seen anything like this at all on the aic7xxx driver. I also had a working L440GX that used IDE for /, and when I insmod aic7xxx.o I do see this same error. Any suggestions on this problem would be appreciated.

Hmm. This looks like a kernel bug, probably triggered by lack of bios support. Could you run the oops through ksymoops so we have a clue what is wrong? If we knew where the kernel was crashing perhaps we could fix it.

Sorry, here's the ksymoops

Oops:
CPU:0
EIP:0010:[c012b234]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206
eax: c141 ebx: 0002 ecx: 0012c8f2 edx: 08458b00
esi: 0008 edi: 0801 ebp: 0096 esp: cfb67d08
ds: 0018 es: 0018 ss: 0018
Process fsck.ext2 (pid: 49, stackpage=cfb67000)
Stack: 0021 cfb67e9c cfb67f20 0801 107e c012fd73 0801 0012c8f2 0400 cfea9c00 ffea 0400 25004400 cbe56a20 0012c90d 0801 0001 cfb67e9c 4b234000
Call Trace: [c012fd73] [c01299c2] [c0129b5b] [c010ac4f]
Code: 39 4a 04 75 10 0f b7 42 08 3b 44 24 24 75 06 66 39 7a 0c 74
EIP; c012b234 getblk+7c/124 =
Trace; c012fd73 block_read+2df/540
Trace; c01299c2 sys_lseek+5e/94
Trace; c0129b5b sys_read+8b/a0
Trace; c010ac4f system_call+33/38
Code; c012b234 getblk+7c/124
_EIP:
Code; c012b234 getblk+7c/124 =
0: 39 4a 04 cmp %ecx,0x4(%edx) =
Code; c012b237 getblk+7f/124
3: 75 10 jne 15 _EIP+0x15 c012b249 getblk+91/124
Code; c012b239 getblk+81/124
5: 0f b7 42 08 movzwl 0x8(%edx),%eax
Code; c012b23d getblk+85/124
9: 3b 44 24 24 cmp 0x24(%esp,1),%eax
Code; c012b241 getblk+89/124
d: 75 06 jne 15 _EIP+0x15 c012b249 getblk+91/124
Code; c012b243 getblk+8b/124
f: 66 39 7a 0c cmp %di,0xc(%edx)
Code; c012b247 getblk+8f/124
13: 74 00 je 15 _EIP+0x15 c012b249 getblk+91/124
Unable to handle kernel NULL
Re: [patch] O_SYNC patch 3/3, add inode dirty buffer list support to ext2
"Jeff V. Merkey" [EMAIL PROTECTED] writes:

Cool. ORACLE is going to **SMOKE** on EXT2 with this change.

<Pessimism>
Hmm, I don't see how ORACLE is going to **SMOKE**. Last I looked ORACLE would need a query optimizer that always would find the best possible index, and much less overhead, to **SMOKE**. Last I looked table reads were 10x slower than file reads.
</Pessimism>

Eric
Re: [RFC] Generic deferred file writing
Linus Torvalds [EMAIL PROTECTED] writes: In short, I don't see _those_ kinds of issues. I do see error reporting as a major issue, though. If we need to do proper low-level block allocation in order to get correct ENOSPC handling, then the win from doing deferred writes is not very big. To get ENOSPC handling 99% correct all we need to do is decrement a counter, that remembers how many disks blocks are free. If we need a better estimate than just the data blocks it should not be hard to add an extra callback to the filesystem. There look to be some interesting cases to handle when we fill up a filesystem. Before actually failing and returning ENOSPC the filesystem might want to fsync itself. And see how correct it's estimates were. But that is the rare case and shouldn't affect performance. rant In the long term VFS support for deferred writes looks like a major win. Knowing how large a file is before we write it to disk allows very efficient disk organization, and fast file access (esp combined with an extent based fs). Support for compressing files in real time falls out naturally. Support for filesystems maintain coherency by never writing the same block back to the same disk location also appears. /rant One other thing to think about for the VFS/MM layer is limiting the total number of dirty pages in the system (to what disk pressure shows the disk can handle), to keep system performance smooth when swapping. All cases except mmaped files are easy, and they can be handled by a modified page fault handler that directly puts the dirty bit on the struct page. (Except that is buggy with respect to clearing the dirty bit on the struct page.) In reality we would have to create a queue of pointers to dirty pte's from the page fault handler and depending on a timer or memory pressure move the dirty bits to the actual page. Combined with the code to make sync and fsync to work on the page cache we msync would be obsolete? 
Of course the most important part is that when all of that is working, the VFS/MM layer would be perfect. World domination would be achieved. For who would be caught using an OS with an imperfect VFS layer :) Eric
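[Editor's sketch] The "decrement a counter" idea for ENOSPC handling in this message can be illustrated in userspace C. This is a minimal sketch, not kernel code: `struct fs_space`, `reserve_blocks` and `release_blocks` are hypothetical names, and the counter is assumed to track only data blocks, which is the "99% correct" estimate the message describes.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical per-filesystem counter of blocks not yet promised to anyone. */
struct fs_space {
    atomic_long free_blocks;
};

/*
 * Reserve nblocks at write() time, before any on-disk allocation happens.
 * Returns 0 on success or -ENOSPC if the estimate says the disk is full.
 */
static int reserve_blocks(struct fs_space *fs, long nblocks)
{
    long old = atomic_load(&fs->free_blocks);

    do {
        if (old < nblocks)
            return -ENOSPC; /* a real fs might fsync and re-check first */
    } while (!atomic_compare_exchange_weak(&fs->free_blocks, &old,
                                           old - nblocks));
    return 0;
}

/* Return a reservation, e.g. when dirty data is truncated before writeout. */
static void release_blocks(struct fs_space *fs, long nblocks)
{
    atomic_fetch_add(&fs->free_blocks, nblocks);
}
```

The point of the design is that the error is reported when the write is accepted, while the actual block placement can still be deferred until the file size is known.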
Re: [RFC] Generic deferred file writing
Linus Torvalds [EMAIL PROTECTED] writes: On 30 Dec 2000, Eric W. Biederman wrote: One other thing to think about for the VFS/MM layer is limiting the total number of dirty pages in the system (to what disk pressure shows the disk can handle), to keep system performance smooth when swapping. This is a separate issue, and I think that it is most closely tied in to the "RSS limit" kind of patches because of the memory mapping issues. If you've seen the RSS rlimit patch (it's been posted a few times this week), then you could think of that modified by a "Resident writable pages Set Size" approach. Building on the RSS limit approach sounds much simpler than the way I was thinking. Not just for shared mappings - this is also an issue with limiting swapout. (I actually don't think that RSS is all that interesting, it's really the "potentially dirty RSS" that counts for VM behaviour - everything else can be dropped easily enough) Definitely. Now the only tricky bit is how do we sense when we are overloading the swap disks. Well that is the next step. I'll take a look and see what it takes to keep statistics on dirty mapped pages. Eric
Re: Happy new year^H^H^H^Hkernel..
Kai Germaschewski [EMAIL PROTECTED] writes: On Tue, 2 Jan 2001, Gerold Jury wrote: I have reversed the patches part by part, the only thing that makes a difference is the diversion services. The reason for this remains unknown to me. I think I found it. Could everybody who was getting the crash on ISDN line hangup try if the following patch fixes the problem? I think the problem was that we relied on divert_if being initialized to zero automatically, which didn't happen because it was not declared static and therefore not in .bss (*is this true?*). All variables with static storage (not with static scope), if not explicitly initialized, are placed in the bss segment. In particular this means that adding/removing a static changes nothing. Eric
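[Editor's sketch] The reply's point about static storage can be checked with a few lines of C. The language guarantees that every object with static storage duration starts out zero when it has no initializer; the `static` keyword at file scope only changes linkage. Where the linker places such objects (.bss versus a common block) is toolchain-dependent. The names below (`struct divert_ops`, `global_if`, `local_if`) are hypothetical stand-ins, not the ISDN code.

```c
#include <stddef.h>

struct divert_ops;                   /* hypothetical stand-in type */

struct divert_ops *global_if;        /* external linkage, no initializer */
static struct divert_ops *local_if;  /* internal linkage, no initializer */

/* Returns 1 if all three objects with static storage start out as zero. */
int all_start_zeroed(void)
{
    static int counter;              /* block scope, still static storage */

    return global_if == NULL && local_if == NULL && counter == 0;
}
```

So adding or removing `static` cannot change the initial value by itself; it can only change which definition other files end up sharing.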
Re: Happy new year^H^H^H^Hkernel..
Russell King [EMAIL PROTECTED] writes: Kai Germaschewski writes: The patch is right, the explanation was wrong. Sorry, I didn't CC l-k when I found what was really going on. Other source files used a global initialized variable "divert_if" as well, so this became the same one as the one referenced in isdn_common.c. That's why it wasn't zero, it was explicitly initialized elsewhere. However, making divert_if static in isdn_common.c fixes the problem, because now it's really local to this file and therefore initialized to NULL. Maybe someone should compile the kernel with everything built in and -fno-common to catch stuff like this? Maybe we should always compile the kernel with -fno-common? Sounds good. We probably need to wait until after 2.4.0 is released to make the change though. Eric
Re: ramfs problem... (unlink of sparse file in D state)
Chris Wedgwood [EMAIL PROTECTED] writes: On Sat, Jan 06, 2001 at 03:58:20PM +, Alan Cox wrote: Ext2 handles large files almost properly. (properly on 2.2 + patches) NFSv3 handles large files but might be missing the O_LARGEFILE check. I believe reiserfs went to at least 4Gig. reiserfs 3.6.x under 2.4.x should go much higher unless i am reading something wrong <pause> yup, it does. as for NFS, I'm not sure how to pass O_LARGEFILE via the protocol and since NFS isn't really POSIX like anyhow decided we might as well just ignore it and have all sys_open calls for NFS look like O_LARGEFILE was specified Umm. No. The object of the LFS stuff is so that programs that can't handle large files don't shoot themselves in the foot. You don't need to pass O_LARGEFILE over the protocol and knfsd doesn't need to handle it. But without specifying O_LARGEFILE you should be limited to 2GB on 32bit systems. Moving some of the LFS checks into the VFS does sound good. When I looked at one of the BSDs a while ago, they had a max file size in (the superblock?) and the VFS did basic max file size checking. And I think it handled all of the LFS API at the VFS layer as well. Alan, these are two separate but related issues. Putting the LFS checks and max filesize checks into the VFS sounds right for 2.4.x because it fixes lots of filesystems with just a couple of lines of code. Eric
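[Editor's sketch] The two checks discussed here, the 2GB limit for files opened without O_LARGEFILE and a per-filesystem maximum size, fit in one generic helper. This is a userspace illustration under stated assumptions: `SKETCH_O_LARGEFILE`, `SKETCH_NON_LFS_MAX` and `check_write_limit` are hypothetical names, and returning -EFBIG for both cases is a simplification rather than a claim about what any kernel version does.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_O_LARGEFILE 0x1u                  /* hypothetical flag bit */
#define SKETCH_NON_LFS_MAX INT64_C(0x7fffffff)   /* 2GB - 1 */

/*
 * Generic size check a VFS layer could run before calling into the
 * filesystem. fs_max_bytes is the per-filesystem limit (for example a
 * value kept in the superblock).
 */
static int check_write_limit(int64_t pos, int64_t len, unsigned flags,
                             int64_t fs_max_bytes)
{
    int64_t end;

    if (pos < 0 || len < 0 || len > INT64_MAX - pos)
        return -EINVAL;                          /* would overflow */
    end = pos + len;
    if (!(flags & SKETCH_O_LARGEFILE) && end > SKETCH_NON_LFS_MAX)
        return -EFBIG;                           /* non-LFS program protected */
    if (end > fs_max_bytes)
        return -EFBIG;                           /* filesystem cannot store it */
    return 0;
}
```

Because the check needs only the open flags and one number per filesystem, it does not have to be repeated in every filesystem's write path.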
Re: 2.4 todo list update
Rik van Riel [EMAIL PROTECTED] writes: The following bugs _could_ be fixed ... I'm not 100% certain but they're probably gone (could somebody confirm/deny?): * mm->rss is modified in some places without holding the page_table_lock As of linux-2.4.0-test13-pre7 I can confirm that this bug still exists. The most obvious location is in zap_page_range; there may be others as well. Eric
Re: ramfs problem... (unlink of sparse file in D state)
Alan Cox [EMAIL PROTECTED] writes: Putting the LFS checks, max filesize checks into the VFS sounds right for 2.4.x because it fixes lots of filesystems, with just a couple of lines of code. Rather more than that, and it only fixes those using generic_file_* True. But it is noticeably fewer lines of code than doing it all once for each fs. Eric
Re: Patch (repost): cramfs memory corruption fix
Rik van Riel [EMAIL PROTECTED] writes: On Sun, 7 Jan 2001, Linus Torvalds wrote: On Sun, 7 Jan 2001, Alan Cox wrote: -ac has the rather extended ramfs with resource limits and stuff. That one also has rather more extended bugs 8). AFAIK none of those are in the vanilla ramfs code This is actually where I agree with whoever it was that said that ramfs as it stands now (without the limit checking etc) is much nicer simply because it can act as an example of how to do a simple filesystem. I wonder what to do about this - the limits are obviously useful, as would the "use swap-space as a backing store" thing be. At the same time I'd really hate to lose the lean-mean-clean ramfs. Sounds like a job for ... drum roll ... tmpfs!! If you need tmpfs the VFS layer is broken. For 99% of everything performance is determined by VFS layer caching. A fs that uses swap space as a backing store is not a big win. You just have a fs that doesn't support sync and you can add a mount option to a normal fs if you want that. I've written the filesystem and it was a dumb idea. Ramfs with (maybe) some basic limits has a place. tmpfs is just extra code to maintain. Eric
Re: Related VIA PCI crazyness?
Linus Torvalds [EMAIL PROTECTED] writes: On Sun, 7 Jan 2001, Albert Cranford wrote: Could anybody with a VIA chip who has the energy please do something for me: - enable DEBUG in arch/i386/kernel/pci-i386.h - do a "/sbin/lspci -xxvvv" on the interrupt routing chip (it's the "ISA bridge" chip - the VIA numbers are 82c586, 82c596, the PCI numbers for them are 1106:0586 and 1106:0596, I think) - do a cat /proc/pci Does this help. Ahh, no. A SMP kernel (or one with UP IO-APIC) is not going to be helpful for this, actually. SMP will take the irq data from the MP block, not the pirq table (that can be considered something of a misfeature right now, but getting the mixture of PCI irq redirection from the MP tables and the pirq irq routing information right together is probably not worth it - especially as I don't think any MS OS has ever done that either, so the BIOS writers have never experienced that combination - so it's almost guaranteed to result in strange results). pirq is specific to the legacy i8259 interrupt handler. MP is specific to some kind of IO-APIC. Right now when we enable the IO-APIC we disable the legacy i8259 controller. And I'm not even certain you can have them both enabled at the same time. So except for not having an option to disable use of the IO-APIC I don't see what we could do better. Eric
Re: Subtle MM bug
Zlatko Calusic [EMAIL PROTECTED] writes: Yes, but a lot more data on the swap also means degraded performance, because the disk head has to seek around in the much bigger area. Are you sure this is all OK? I don't think we have more data on the swap, just more data has an allocated home on the swap. With the earlier allocation we should (I haven't verified) allocate contiguous chunks of memory contiguously on the swap. And reusing the same swap pages helps out with this. Eric
Re: Subtle MM bug
Linus Torvalds [EMAIL PROTECTED] writes: On 8 Jan 2001, Eric W. Biederman wrote: Zlatko Calusic [EMAIL PROTECTED] writes: Yes, but a lot more data on the swap also means degraded performance, because the disk head has to seek around in the much bigger area. Are you sure this is all OK? I don't think we have more data on the swap, just more data has an allocated home on the swap. I think Zlatko's point is that because of the extra allocations, we will have worse locality (more seeks etc). Clearly we should not actually do any more actual IO. But the sticky allocation _might_ make the IO we do be more spread out. The tradeoff when implemented correctly is that writes will tend to be more spread out and reads should be better clustered together. To offset that, I think the sticky allocation makes us much better able to handle things like clustering etc more intelligently, which is why I think it's very much worth it. But let's not close our eyes to potential downsides. Certainly, keeping our eyes open is a good thing. But it has been apparent for a long time that by doing allocation as we were doing it, we were taking a performance hit when it came to heavy swapping. So I'm relieved that we are now being more aggressive. From the sounds of it what we are currently doing actually sucks worse for some heavy loads. But it still feels like the right direction. It's been my impression that work loads where we are actively swapping are a lot different from work loads where we really don't swap. To the extent that it might make sense to make the actively swapping case a config option to get our attention in the code. It would be nice to have a linux kernel for once that handles heavy swapping (below the level of thrashing) gracefully. :) Eric
Re: Linux's implementation of poll() not scalable?
Dan Kegel [EMAIL PROTECTED] writes: It's harder to write correct programs that use edge-triggered events. Huh? The race between when an event is reported, and when you take action on it effectively means all events are edge triggered. So making the interface clearly edge triggered seems to be a win for correctness. Eric
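[Editor's sketch] One practical consequence of the race described in this reply: a readiness notification is only a hint, so the handler should use a non-blocking descriptor and keep reading until the kernel reports that nothing is left. The helper below is a minimal sketch of that pattern; `drain_fd` is a hypothetical name, and the caller is assumed to have set O_NONBLOCK on the descriptor.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Read everything currently available from a non-blocking descriptor.
 * Returns the number of bytes consumed, or -1 on a real error.
 * Finding nothing to read (EAGAIN) is a normal outcome, not a failure.
 */
static long drain_fd(int fd)
{
    char buf[256];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);

        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;                        /* end of file */
        if (errno == EINTR)
            continue;                            /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                        /* drained for now */
        return -1;
    }
}
```

A program written this way behaves the same whether the notification interface is level-triggered or edge-triggered, which is the correctness argument the reply is making.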
Re: Updated 2.4 TODO List -- new addition WAS(test9 PCI resource collisions (fwd)
"David S. Miller" [EMAIL PROTECTED] writes: Date: Tue, 24 Oct 2000 13:50:10 -0700 (PDT) From: Linus Torvalds [EMAIL PROTECTED] Does the above make it work for you? I don't know if PCI even has the notion of transparent bridging, and quite frankly I doubt it does. The above would be nothing but a hack that basically says "I don't understand the resources of this bridge, so I'll just say it bridges everything". I bet PCI allows no such thing, thus to be totally safe I would conditionalize this feature on the specific bridge. Ie. only allow it for this bridge type, because I bet it is just some bug in the the address comparators which makes the bridge interpret zero ranges as "forward and respond to everything". I'm not certain of the details but I do know that it is legal. To date I've only heard of it on ISA bridges, in particular the PIIXE. It's some kind of passive listening mode as opposed to actually claiming the bus cycles. This only would make sense if the bridge snooped config space access to devices behind it, so that it knew what addresses to forward and respond to. Just responding to "everything" would not work for obvious reasons. Right but I don't think you actually have to respond. Not that I think this is a good idea, but it does appear to be legal. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: guarantee_memory() syscall?
Raul Miller [EMAIL PROTECTED] writes: Can anyone tell me about the viability of a guarantee_memory() syscall? [I'm thinking: it would either kill the process, or allocate all virtual memory needed for its shared libraries, buffers, allocated memory, etc. Furthermore, it would render this process immune to the OOM killer, unless it allocated further memory.] Except for the OOM killer semantics mlockall already exists. Eric
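[Editor's sketch] For reference, the existing call the reply points to is used as below. `mlockall` with `MCL_CURRENT | MCL_FUTURE` asks the kernel to keep all current and future mappings of the process resident. The wrapper name `pin_all_memory` is hypothetical; the call commonly fails with EPERM or ENOMEM when the process lacks privilege or exceeds its locked-memory limit, so the result has to be checked.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Lock every page the process has mapped now and every page it maps later.
 * Returns 0 on success or a negative errno value on failure.
 */
static int pin_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -errno;
    return 0;
}
```

As the reply notes, this covers the "allocate and keep resident" half of the proposal; it does not change how the OOM killer treats the process.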
Re: /proc xml data
Joe [EMAIL PROTECTED] writes: I remember hearing about various debates about the /proc structure. I was wondering if anyone had ever considered storing some of the data in xml format rather than its current format? Things like /proc/meminfo and cpuinfo may work well in this format as then it would be easy to write a generic xml parser that could then be used to parse any of the data. In the case of the meminfo it would be a matter of changing the lines in fs/proc/array.c function get_meminfo(char * buffer) from "MemTotal: %8lu kB\n" to something like "<memtotal>%8lu kB</memtotal>\n" The general consensus is that if we have a major reorganization in proc, the rule will be one value per file. And let directories do the grouping. Eric
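[Editor's sketch] The difference between the current format and the "one value per file" rule shows up in the parser each one needs. The first helper below parses a line in the `"MemTotal: %8lu kB\n"` format quoted in the message; the second is what a reader would need if a file held only the number. Both function names are hypothetical, and the one-value-per-file layout is the proposal from the reply, not an existing interface.

```c
#include <stdio.h>

/* Parse a line in the existing "MemTotal: <n> kB" format. Returns 0 on success. */
static int parse_memtotal(const char *line, unsigned long *kb)
{
    return sscanf(line, "MemTotal: %lu kB", kb) == 1 ? 0 : -1;
}

/* One value per file: the content is just the number, no label to match. */
static int parse_single_value(const char *text, unsigned long *value)
{
    return sscanf(text, "%lu", value) == 1 ? 0 : -1;
}
```

With one value per file the label moves into the file name and the grouping into the directory, so neither an XML parser nor a per-file format string is needed.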
Re: non-gcc linux?
Ion Badulescu [EMAIL PROTECTED] writes: On Sun, 5 Nov 2000 23:42:25 +0100, Marc Lehmann [EMAIL PROTECTED] wrote: On Sun, Nov 05, 2000 at 04:06:37PM -0500, Jakub Jelinek [EMAIL PROTECTED] wrote: for SGI, or SGI would have to be willing to assign some code to FSF. Which is the standard procedure that the FSF requires for all its programs to be able to defend them ... or sell them under a different license. Not that they would, but they could, if they really wanted to. The wording of the standard copyright assignment to the FSF binds the FSF so that it can only release the code under a free software license. Eric
Re: Persistent module storage [was Linux 2.4 Status / TODO page]
David Woodhouse [EMAIL PROTECTED] writes: The current situation is equivalent to stopping forwarding packets each time an app on the local machine decides it wants to send its own packets, after a period of inactivity. Defaulting to zero on boot is fine. Defaulting to zero after the module has been auto-unloaded and auto-loaded again is less good. Well we don't have auto unload. And module persistent data for the second load case causes chaos with the goal of having exactly the same code in modules and compiled in kernel code. It would probably be better (in this case) to increment the module count when the mixer settings go above 0, and decrement it when the settings go totally to 0. This prevents an unwanted unload. But for reliability and code simplicity there does not yet seem to be a case for persistent module storage. Eric
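[Editor's sketch] The suggestion in the reply, holding a module reference while the mixer has a non-zero setting, can be modelled with a plain counter. This is an illustration only: `struct mixer_sketch`, `mixer_set_volume` and the `use_count` pointer are hypothetical, and a real driver would use the kernel's module reference counting instead of a bare long.

```c
/* Hypothetical mixer state; use_count stands in for the module use count. */
struct mixer_sketch {
    int volume;        /* current setting, 0 means silent */
    int pinned;        /* 1 while this mixer holds a reference */
    long *use_count;   /* count that blocks unloading while non-zero */
};

/* Take a reference on the first non-zero setting, drop it on return to 0. */
static void mixer_set_volume(struct mixer_sketch *m, int volume)
{
    if (volume < 0)
        volume = 0;
    if (volume > 0 && !m->pinned) {
        ++*m->use_count;
        m->pinned = 1;
    } else if (volume == 0 && m->pinned) {
        --*m->use_count;
        m->pinned = 0;
    }
    m->volume = volume;
}
```

The `pinned` flag keeps the count balanced when the volume changes between two non-zero values, so the module is held exactly once for as long as there is state worth keeping.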
Re: Installing kernel 2.4
Horst von Brand [EMAIL PROTECTED] writes: I'd prefer to be a guinea pig for one of 3 or 4 generic kernels distributed in binary than of one of the hundreds of possibilities of patching a kernel together at boot, plus the (presumably rather complex and fragile) machinery to do so *before* the kernel is booted, thank you very much. Plus I'm getting pissed off by how long a boot takes as it stands today... Just for reference I can boot from power-on to login prompt in 12 seconds. With Linux. The big change is nuking the BIOS. They just want it to boot, and run with the same level of ease of use and stability they get with NT and NetWare and other stuff they are used to. This is an easy choice from where I'm sitting. Easy: i386. Or i486 (I very much doubt your customers run on less, and this should be generic enough). It's also possible to do a two stage boot. Stage 1: an i386 kernel. Stage 2: the specific kernel for the machine. This adds about a second to the whole boot process. Eric
Re: test11-pre2 compile error undefined reference to `bust_spinlocks' WHAT?!
Andrew Morton [EMAIL PROTECTED] writes: George Anzinger wrote: The notion of releasing a spin lock by initializing it seems IMHO, on the face of it, way off. Firstly the protected area is no longer protected which could lead to undefined errors/ crashes and secondly, any future use of spinlocks to control preemption could have a lot of trouble with this, principally because the locker is unknown. In the case at hand, it would seem that an unlocked path to the console is a more correct answer that gives the system a far better chance of actually remaining viable. Does bust_spinlocks() muck up the preemptive kernel's spinlock counting? Would you prefer spin_trylock()/spin_unlock()? It doesn't matter - if we call bust_spinlocks() the kernel is known to be dead meat and there is a fsck in your near future. We are still trying to find out why kumon@fujitsu's 8-way is crashing on the test10-pre5 sched.c. Looks like it's fixed in test11-pre2 but we want to know _why_ it's fixed. And at present each time he hits the bug, his printk() deadlocks. So bust_spinlocks() is a RAS feature :) A very important one - it's terrible when your one-in-a-trillion bug happens and there are no diagnostics. It's a work-in-progress. There are a lot of things which can cause printk to deadlock: - console_lock - timerlist_lock - global_irq_lock (console code does global_cli) - log_wait.lock - tasklist_lock (printk does wake_up) (*) - runqueue_lock (printk does wake_up) I'll be proposing a better patch for this in a few days. Hmm. I would like to suggest we look at non locking variants of things. i.e. Data structure version changing with swap. Eric
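[Editor's sketch] One reading of "data structure version changing with swap" is the publish-by-compare-and-swap pattern: a writer builds a complete new version off to the side and installs it with a single atomic operation, so readers never take a lock that could be left held by a crashed CPU. The sketch below uses C11 atomics and is an interpretation of the one-line suggestion, not the author's design. `struct log_state`, `publish_state` and `read_state` are hypothetical, and freeing old versions safely is deliberately left out.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical immutable snapshot; writers never modify a published one. */
struct log_state {
    unsigned long version;
    char text[64];
};

/* The currently published version; NULL until the first publish. */
static _Atomic(struct log_state *) current_state;

/*
 * Install next if the published version is still expected.
 * Returns 1 on success, 0 if another writer got there first.
 */
static int publish_state(struct log_state *expected, struct log_state *next)
{
    next->version = expected ? expected->version + 1 : 1;
    return atomic_compare_exchange_strong(&current_state, &expected, next);
}

/* Readers take one atomic load and then use the snapshot without locking. */
static struct log_state *read_state(void)
{
    return atomic_load(&current_state);
}
```

A writer that loses the race simply rebuilds its update on top of the newer version and tries again, which is why no path in this scheme can deadlock.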
Re: Q: Linux rebooting directly into linux.
Michael Rothwell [EMAIL PROTECTED] writes: "Eric W. Biederman" wrote: I have recently developed a patch that allows linux to directly boot into another linux kernel. This would rock. One place I can think of using it is with distro installers. The installer boots a generic i386 kernel, and then installs an optimized (i.e, PIII, etc.) kernel for run-time. This would rock? It already does. Of course the installers need to actually use this. Eric
Re: Q: Linux rebooting directly into linux.
"H. Peter Anvin" [EMAIL PROTECTED] writes: Followup to: [EMAIL PROTECTED] By author:[EMAIL PROTECTED] (Eric W. Biederman) In newsgroup: linux.dev.kernel The interface is designed to be simple and inflexible yet very powerful. To that end the code just takes an elf binary, and a command line. The started image also takes an environment generated by the kernel of all of the unprobeable hardware details. Isn't this what milo does on alpha? Similar milo uses kernel drivers in it's own framework. This has proved to be a major maintenance problem. Milo is nearly a kernel fork. The design is for the long term to get this incorporated into the kernel, and even if not a small kernel patch should be easier to maintain that a harness for calling kernel drivers. I'm working on something similiar in "Genesis". It pretty much is (or rather, will be) a kernel *port*, not a fork; the port is such that it can run on top of a simple BIOS extender and thus access the boot media. Hmm. You must mean similiar to milo. Have fun. With linuxBIOS I'm working exactly the other way. Killing off the BIOS. And letting the initial firmware be just a boot loader. The reduction is complexity should make it more reliable. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Q: Linux rebooting directly into linux.
Adam Lazur [EMAIL PROTECTED] writes: Eric W. Biederman ([EMAIL PROTECTED]) said: Michael Rothwell [EMAIL PROTECTED] writes: This would rock. One place I can think of using it is with distro installers. The installer boots a generic i386 kernel, and then installs an optimized (i.e, PIII, etc.) kernel for run-time. This would rock? It already does. Of course the installers need to actually use this. Actually, along the lines of what Scyld uses two kernel monte for with their Beowulf2 distribution. They boot a network enabled kernel which pulls a kernel off of a server and then uses two kernel monte to boot with that one. This allows you to centrally admin your cluster with one server. Good stuff... Yep. You can also do this with etherboot flashed onto a NIC. I also intend to use my work for this functionality. FYI I work for Linux Networx, which builds hardware for linux clusters. The fact that Scyld is using arp and a fixed network socket is a design decision I don't agree with. Truly slick will be when linuxBIOS is solid. Then you even get remote control of the BIOS, and remote booting all from within the BIOS. Only time will tell if it is worth the effort :) Eric
Re: Q: Linux rebooting directly into linux.
Adam Lazur [EMAIL PROTECTED] writes: Eric W. Biederman ([EMAIL PROTECTED]) said: I have recently developed a patch that allows linux to directly boot into another linux kernel. With the code freeze it appears inappropriate to submit it at this time. Aside from what looks to be support for SMP, how does this differ from the two kernel monte stuff at http://scyld.com/software/monte.html ? I admit that LOBOS, two kernel monte, and the one by Werner Almesberger were all related work that I looked at. And I acknowledge there were some good ideas I pilfered from all of them. There are a couple of differences. But the big one is I'm trying to do it right. In particular this means fixing the problem where the problem is. Additionally I'm killing backwards compatibility with a lot of short-sighted things. And multiplatform support is in the plan. So long term this should run on alpha, and x86, and sparc and everything else out there that linux supports. This means that you can have a multiplatform boot loader. There will have to be glue code out there to get started from different firmware on different machines but that is it. Additionally mine is the only one that has a real chance of booting a non-linux kernel. Gathering the non-probeable hardware information is hard. Currently my implementation is the only one that does not simply copy the boot parameters page that is given to the linux kernel. Unlike two kernel monte, mine deliberately has no reliance upon a BIOS. There is another major difference as well. kexec is part of the work on the linuxBIOS project, where the goal is to have a very minimal firmware before booting into linux, and to use that initial linux kernel as the firmware hardware drivers. What this means is kexec is being developed from a point of view that needs it. If you don't have a BIOS kexec is a must.
Eric
Re: Q: Linux rebooting directly into linux.
"H. Peter Anvin" [EMAIL PROTECTED] writes: "Eric W. Biederman" wrote: Hmm. You must mean similiar to milo. Have fun. With linuxBIOS I'm working exactly the other way. Killing off the BIOS. And letting the initial firmware be just a boot loader. The reduction is complexity should make it more reliable. ... except that you have to handle every single motherboard architecture out there now. Agreed that is a bit of a risk. Mostly you just have to handle the chipset of the boards and there are a finite number of them. Only time will tell if this is truly feasible. I think it is certainly work a try. And I don't have to handle every single one just all of the ones I need it to run on :) With the my kexec patch I'm just getting the infrastructure ready, and that is functionality that can be used independently of linuxBIOS. If booting linux from linux would help with what you are doing I love to work together on that. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: bzImage ~ 900K with i386 test11-pre2
Andrea Arcangeli [EMAIL PROTECTED] writes: On Sun, Nov 12, 2000 at 06:14:36AM -0700, Eric W. Biederman wrote: x86-64 doesn't load the segment registers at all before use. Yes, before switching to 64bit long mode we never do any data access. We do a stack access to clear eflags only while we still run in legacy mode with paging disabled and so we only rely on ss to be valid when the bootloader jumps at 0x10 for executing the head.S code (and not anymore on the gdt_48 layout). Nope, you rely on cs and ds as well. cs is just a duh: the code's running so it must be valid. But ds is needed for lgdt. I can tell you don't have real hardware. The non obviousness... I need to retract this a bit. You are still building a compressed image, and the code in boot/compressed/head.S remains unchanged and loads segment registers, so it works by luck. If you didn't build a compressed image you would be in trouble. Current code definitely works fine on the simnow simulator so if current code shouldn't work because it's buggy then at least the simulator is sure buggy as well (and that isn't going to be the case as its behaviour is in full sync with the specs as far as I can see). Add a target for a noncompressed image and then build. It should be interesting to watch. So while you load the gdt before you set a segment register later, which is good, the more important part was still missed. Sorry but I don't see the missing part. Are you sure you're not missing this part of the x86-64 specs? Nope, because what I was complaining about is in 32 bit mode. :) Data and Stack Segments: In 64-bit mode, the contents of the ES, DS, and SS segment registers are ignored. All fields (base, limit, and attribute) in the corresponding segment descriptor registers (hidden part) are also ignored. Hmm. I'll have to look and see if FS and GS are also ignored. Address calculations in 64-bit mode that reference the ES, DS, or SS segments, are treated as if the segment base is zero.
Rather than perform limit checks, the processor instead checks that all virtual-address references are in canonical form. Cool, I like this bit. The segments are finally dead. O.k. on Monday I'll dig up my patch and that clears this up. Sure, go ahead if you weren't missing that basic part of the long mode specs. Thanks. Nope. Though I suspect we should do the switch to 64bit mode in setup.S and not have these issues pollute head.S at all. Eric
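[Editor's sketch] The canonical-form check quoted from the x86-64 specification can be written out as a small predicate. An address is canonical when all bits above the implemented virtual-address width are copies of the top implemented bit. The sketch assumes 48 implemented bits, which is not stated in the thread; `SKETCH_VA_BITS` and `is_canonical` are hypothetical names.

```c
#include <stdint.h>

#define SKETCH_VA_BITS 48   /* assumed implemented virtual-address width */

/* Returns 1 if bits 63..47 of addr are all zeros or all ones. */
static int is_canonical(uint64_t addr)
{
    uint64_t upper = addr >> (SKETCH_VA_BITS - 1);   /* keeps bits 63..47 */

    return upper == 0 || upper == (UINT64_MAX >> (SKETCH_VA_BITS - 1));
}
```

This single comparison is what replaces the per-segment base and limit checks of the 32-bit modes discussed earlier in the message.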
Re: bzImage ~ 900K with i386 test11-pre2
Andrea Arcangeli [EMAIL PROTECTED] writes: On Sun, Nov 12, 2000 at 06:14:36AM -0700, Eric W. Biederman wrote: x86-64 doesn't load the segment registers at all before use. Yes, before switching to 64bit long mode we never do any data access. We do a stack access to clear eflags only while we still run in legacy mode with paging disabled and so we only rely on ss to be valid when the bootloader jumps at 0x10 for executing the head.S code (and not anymore on the gdt_48 layout). Actually it just occurred to me that this stack access is buggy. You haven't set up a stack yet. Only boot/compressed/head.S did, and that location isn't safe to use. Eric
Re: Q: Linux rebooting directly into linux.
Erik Andersen [EMAIL PROTECTED] writes: On Thu Nov 09, 2000 at 01:18:24AM -0700, Eric W. Biederman wrote: I have recently developed a patch that allows linux to directly boot into another linux kernel. Looks very cool. I'm curious about your decision to use ELF images. This makes it much less convenient to use due to the kernel postprocessing, and makes it that the kernel binary from which you initially boot is not necessarily the same as the binary that you re-boot into. The decision here was that I needed to pass a vector of physical address, length, data pairs. The elf program header is dead simple and provides it. So I either had to invent a complicated argument passing mechanism for a syscall or have the kernel parse a file. Wouldn't it be more reasonable to simply try to exec whatever file is provided? If the concern is initrds; they can be simply pasted into the kernel binary. That's exactly what my preprocessing does. vmlinux is also an elf binary. As is arch/i386/boot/bvmlinux but it is compressed. All mkelfImage does is the pasting of initrd's, command lines, and just a touch of argument conversion code. What I deliberately don't do is allow or need setup.S, which does syscalls, to run. All it does are BIOS calls, and store them in a nasty data structure. I have replaced that data structure with something that is maintainable. I would like very much to not need mkelfImage. However that requires further changes to the kernel, and I cannot boot an unpatched kernel with that method. Eric
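[Editor's sketch] The claim that the ELF program header table already is a "vector of physical address, length, data" can be seen by walking it. The sketch below collects the PT_LOAD entries of a 32-bit ELF image held in memory. It is not the kexec code: `struct load_segment` and `collect_segments` are hypothetical names, the image is assumed to be in host byte order, and `<elf.h>` is assumed to be available as on Linux with glibc.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One entry of the vector: where to put it, how big, and the bytes. */
struct load_segment {
    uint32_t paddr;              /* physical load address */
    uint32_t filesz;             /* bytes present in the image */
    uint32_t memsz;              /* bytes occupied in memory */
    const unsigned char *data;   /* points into the image */
};

/* Fill out[] from the PT_LOAD headers. Returns the count, or -1 on error. */
static int collect_segments(const unsigned char *image, size_t size,
                            struct load_segment *out, int max)
{
    Elf32_Ehdr eh;
    int count = 0;

    if (size < sizeof eh)
        return -1;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS32 ||
        eh.e_phentsize != sizeof(Elf32_Phdr))
        return -1;

    for (unsigned i = 0; i < eh.e_phnum; i++) {
        Elf32_Phdr ph;
        size_t off = (size_t)eh.e_phoff + (size_t)i * sizeof ph;

        if (off > size || size - off < sizeof ph)
            return -1;           /* header table runs past the image */
        memcpy(&ph, image + off, sizeof ph);
        if (ph.p_type != PT_LOAD)
            continue;
        if (ph.p_offset > size || size - ph.p_offset < ph.p_filesz ||
            count == max)
            return -1;           /* segment data missing or out[] full */
        out[count].paddr = ph.p_paddr;
        out[count].filesz = ph.p_filesz;
        out[count].memsz = ph.p_memsz;
        out[count].data = image + ph.p_offset;
        count++;
    }
    return count;
}
```

A loader then only has to copy `filesz` bytes to `paddr` and zero the rest up to `memsz` for each entry, which is why no separate argument format had to be invented.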
Re: Q: Linux rebooting directly into linux.
Erik Andersen [EMAIL PROTECTED] writes: On Tue Nov 14, 2000 at 07:59:18AM -0700, Eric W. Biederman wrote: All mkelfImage does is the pasting of initrd's, command lines, and just a touch of argument conversion code. You can link in an initrd using linker magic, i.e. $(OBJCOPY) --add-section=image=kernel --add-section=initrd=initrd.gz Hmm this is certainly possible. My impression is that this doesn't currently work on x86. I would love to be wrong. This is done in ppc/boot/Makefile for example. It might be a nice thing to add a .config option to optionally specify an initrd to link into the kernel image. Similarly, several architectures have a CONFIG_CMDLINE which could also do the job (see arch/ppc/config.in for example). Presumably, by doing such things you could avoid needing to use mkelfImage. Agreed. And I would like to see that. With the 2.4 code freeze it is too late to do that today. Also mkelfImage gives me backwards compatibility for now. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Addressing logically the buffer cache
Juan [EMAIL PROTECTED] writes: Alexander Viro wrote: On Tue, 14 Nov 2000, Juan wrote: Hi! Is there any patch or project to address the buffer cache logically? Now, you use three parameters to find a buffer in cache: device, block number, and block size. But what if I want to find a buffer using a super block, an inode number, and a block number within the file specified by the inode number? What's wrong with using the pagecache and per-page buffer_heads? Suppose you are implementing a log-structured file system and a process adds a new logical block to a file. Besides, suppose that the segment is 512 KBytes in size. Usually, you don't want to write the segment before it is full. The logical block hasn't got a physical address because you don't build the segment until it is written to disk. So, what happens if another process wants to access the new block? You can't assign a physical address to the new block because the address can change when the buffer is written to disk. So you don't assign a buffer head until you make the final decision. There are some interesting issues with how you track that your data is dirty, but otherwise all is well. Perhaps I'm wrong, but I think that the implementation of the BSD-LFS needs to address the buffer cache logically. The linux vfs is quite different from the Berkeley one. The linux page cache is much closer to the Berkeley block cache than the deprecated linux block cache is. Eric
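What "logical" addressing means here can be sketched in user space. The following is only an illustration, not kernel code and not the page cache API: `struct lbuf`, `lbuf_insert` and `lbuf_find` are hypothetical names. The lookup key is (super block, inode number, logical block), and the disk address is an optional attribute filled in later, once the segment has been laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 64u

/* Hypothetical buffer descriptor, keyed by logical position in a file. */
struct lbuf {
    const void *sb;       /* owning super block (opaque here) */
    uint32_t ino;         /* inode number */
    uint32_t lblock;      /* block number within the file */
    int has_disk_addr;    /* 0 until the segment is laid out */
    uint32_t disk_block;  /* valid only if has_disk_addr is set */
    struct lbuf *next;    /* hash chain */
};

static struct lbuf *buckets[NBUCKETS];

static unsigned int lbuf_hash(uint32_t ino, uint32_t lblock)
{
    return (ino * 31u + lblock) % NBUCKETS;
}

/* Make a buffer findable by its logical key; no disk address is needed. */
static void lbuf_insert(struct lbuf *b)
{
    unsigned int h;

    assert(b != NULL);
    h = lbuf_hash(b->ino, b->lblock);
    b->next = buckets[h];
    buckets[h] = b;
}

/* Look a buffer up by (super block, inode, logical block). */
static struct lbuf *lbuf_find(const void *sb, uint32_t ino, uint32_t lblock)
{
    struct lbuf *b;

    for (b = buckets[lbuf_hash(ino, lblock)]; b != NULL; b = b->next)
        if (b->sb == sb && b->ino == ino && b->lblock == lblock)
            return b;
    return NULL;
}
```

A second process can find the new block through `lbuf_find` while `has_disk_addr` is still 0, which is the situation described for a log-structured file system before the segment is written.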
Re: Swapping over NFS in Linux 2.4?
Rik van Riel [EMAIL PROTECTED] writes: On Wed, 15 Nov 2000, Andreas Osterburg wrote: Because I set up a diskless Linux-workstation, I want to swap over NFS. For this purpose I found only patches for "older" Linux-versions (2.0, 2.1, 2.2?). Does anyone know whether there are patches for 2.4 or does anyone know another solution for this problem? 1. you can swap over NBD 2. if you point me to the swap-over-nfs patches you have found, I can try to make them work on 2.4 ;) Rik, all we need to do now is convert the swapout code to address space methods, just like the block device was. This has a number of interesting effects. One of which is that brw_page should no longer have any users, simplifying fs/buffer.c. Further, this is equivalent to mounting an nfs file loop back, which the address space methods now allow, but it is more direct. Which means that if this reveals any bugs or lock ups in nfs, they were already there. This has been on my want to do list for a while but I'm busy reinventing booting so I haven't gotten to it. Eric
Re: Q: Linux rebooting directly into linux.
Werner Almesberger [EMAIL PROTECTED] writes: Eric W. Biederman wrote: There are a couple of differences. But the big one is I'm trying to do it right. So why do you need a file-based interface then ? ;-) When possible it is nice to set as much policy as possible, without removing functionality. Since this is a highly privileged operation anyway, you may as well trust user space to use the right data format ... Hmm. I hadn't thought of it from that angle. I don't think I have much code tied up in format checks so I'm not too worried. If something goes wrong it is simple a question of where it will crash. Doing some checking simply allows for better debugging of problems :) One thing I'm going to have to consider though is if the memory regions that the new kernel is going into are actually memory. The pro argument is that checking for reserved areas of memory catches changes to an architecture that were unexpected. The recent issues with the extended BIOS area growing are a good example of this. I get the impression that you incur quite a lot of overhead just to make it fit with the exec interface. I agree that it's conceptually nice, and it looks cleanly done, but I don't quite see the practical value. (Except, perhaps, that this allowed you to pick the rather cute name "kexec" ;-) Well there is that. Somehow implementing scatter/gather from a user space process seemed like a potential mess, and extra work. In part I am starting with a network boot loader, so building a file format that works was needed anyway. As far as overhead my impression is that there is none in speed, and only one or two extra ones functions in space. Additionally mine is the only one that has a real chance of booting a non-linux kernel. Hmm, I think all approaches could boot a non-Linux kernel, but ... bootimg is close. I was thinking a couple of directions here. - Mine is the only interface that can boot a non-Linux kernel natively. 
Bootimg doesn't count because it doesn't do anything natively :) In particular every other boot loader passes the nasty empty zero page to the new kernel, definitely requiring a chain loader. With an OS neutral format, cataloging the non-probable hardware details, and providing those details in an extensible format, I gain a lot in easy extensibility. I need to find time soon and write up all of the file format details in an RFC like the GRUB multiboot spec. Possibly even submit it to the IETF as an RFC for compatible booting and multiple platforms. And this raises an important point. Lazy programmers tend to go with whatever is easiest. Having a good file format, and making this the easy case, should reduce the number of formats supported and increase boot interoperability. Most of what was said on this score with GRUB I agree with. I would even be following the GRUB multiboot spec except it doesn't allow passing of the unprobeable hardware details and it doesn't allow easy expansion of what it does pass. This is the big reason I'm not in favor of the bootimg approach, which doesn't define anything. As far as loading is concerned, bootimg probably has an advantage there, because you can put things together in memory (e.g. some OS-specific chain loader), without going to secondary storage. Well, ramfs is hardly secondary storage, though it has a touch more overhead. And you only need to do this for the non-common case. Getting images to adapt to a specific bootloader isn't too hard. Every other boot loader in the world does it. (Proof of concept: bootimg is able to load all currently supported kernel image formats on ia32.) I do concede that bootimg has this ability as well in theory. I actually have booted multiboot compliant images in an earlier version of my patch and the cost to support both formats in a kernel loader is negligible. My mkelfImage builds linux kernels that support being booted both ways. 
As far as execution is concerned, you're probably slightly better off with an approach that goes back to real mode. (Or use a chain loader - this can be transparent to the kernel.) But then, I'm not sure if you can re-animate the BIOS in any consistent way, so your choice of operating systems may be quite limited, or you have to provide your own BIOS substitute. Agreed if the goal is to boot code is designed to start with a single sector loaded at 0x7c00. If I really care I might worry about that. Since linux preserved the first page of memory which includes the interrupt table reanimating the BIOS might not be so bad. My primary non-linux target are the BSD's, and various experimental OS's. And in those cases why go to the pain of dropping out of protected mode if you are going to just load back into it again. All of what I do is colored by the fact that my most important environment I have no BIOS. So for me I can't reanimate the BIOS because it isn't there. Once this bullet is bitten though this
Re: bzImage ~ 900K with i386 test11-pre2
Andi Kleen [EMAIL PROTECTED] writes: [This is quite a bizarre discussion, but I'll answer anyways. I am not exactly sure what your point is] Let me step aside a second and explain where I'm coming from. As a spin off of the work of the linuxBIOS project I have implemented a system call that implements exec functionality at the kernel level. Essentially allowing you to warm boot linux from linux. To get this to work no bios calls are involved, so I'm not using setup.S. This also has the interesting side effect of allowing a boot loader to be written that will work on all linux platforms. (I have currently just begun my port to alpha). In the process of the above I have learned quite a bit about how the current boot loader works. And want eventually to convert linux to not need wrapper code to use my bootloader. Booting vmlinux is fun :) On Sun, Nov 12, 2000 at 11:57:15AM -0700, Eric W. Biederman wrote: I can tell you don't have real hardware. The non obviousness I need to retract this a bit. You are still building a compressed image, and the code in the boot/compressed/head.S remains unchanged and loads segment registers, so it works by luck. If you didn't build a compressed image you would be in trouble. boot/compressed/head.S does run in 32bit legacy mode, where you of course need segment registers. After you got into long mode segments are only needed to jump between 32/64bit code segments and and for a the data segment of the 32bit emulation (+ the iretd bug currently which I hope will be fixed in final hardware) Also note that boot/compressed/* currently does not even link, because the x86-64 toolchain cannot generate relocated 32bit code ATM (the linker chokes on the 32bit relocations) The tests we did so far used a precompiled relocated binary compressed/{head,misc}.o from a IA32 build. ... Sure, go ahead if you weren't missing that basic part of the long mode specs. Thanks. Nope. 
Though I suspect we should do the switch to 64bit mode in setup.S and not have these issues pollute head.S at all. I see no advantage in doing it there instead of in head.S. After reading through the long mode specs I now agree. If you could be in long mode with the mmu disabled that would be a different story, but you can't and it isn't. I was thinking of symmetry with the x86 and how much easier everything is if you only use one processor mode for the initial boot strap. No need for super assemblers etc. Oh well. On x86 there are some real advantages to moving the segment loads into setup.S from the various head.S's and they still apply (although to a lesser extent) to x86-64. This causes less code confusion. For my kexec stuff I now need to think really hard about how I want to handle x86-64. What I was thinking would work well in general is to start the processor in its native/optimal mode with the mmu disabled. With x86-64 I can't do this unfortunately :( Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Advanced Linux Kernel/Enterprise Linux Kernel
Daniel Phillips [EMAIL PROTECTED] writes: Actually, I was planning on putting in a hack to do something like that: calculate a checksum after every buffer data update and check it after write completion, to make sure nothing scribbled in the buffer in the interim. This would also pick up some bad memory problems. Be very careful that this just applies to metadata. For normal data, a buffer changing while the write is in flight is a valid case. Weird but valid. Eric
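The checksum idea can be sketched in user space as follows. This is an illustration only, not the hack being proposed: `struct tracked_buf`, `buf_update` and `buf_intact` are hypothetical names, the 512-byte size is an assumption, and FNV-1a merely stands in for whatever checksum would actually be chosen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 512

/* Hypothetical metadata buffer with a checksum taken at each update. */
struct tracked_buf {
    unsigned char data[BUF_SIZE];
    uint32_t sum;  /* checksum recorded at the last legitimate update */
};

/* 32-bit FNV-1a; any reasonable checksum would serve here. */
static uint32_t buf_checksum(const unsigned char *p, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* The only sanctioned way to modify the buffer: write, then re-checksum. */
static void buf_update(struct tracked_buf *b, size_t off,
                       const void *src, size_t len)
{
    assert(off <= BUF_SIZE && len <= BUF_SIZE - off);
    memcpy(b->data + off, src, len);
    b->sum = buf_checksum(b->data, BUF_SIZE);
}

/* Called at write completion: nonzero if nothing scribbled in the buffer. */
static int buf_intact(const struct tracked_buf *b)
{
    return buf_checksum(b->data, BUF_SIZE) == b->sum;
}
```

As the reply notes, a check like this only makes sense for metadata, where every modification goes through a known path; ordinary file data may legitimately change between the update and the completion of the write.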
Re: [PATCH] swap=device kernel commandline
Werner Almesberger [EMAIL PROTECTED] writes: Rik van Riel wrote: Did you try to load an initrd on a low-memory machine? It shouldn't work and it probably won't ;) You must be really low on memory ;-) # zcat initrd.gz | wc -c 409600 (ash, pwd, chroot, pivot_root, smount, and still about 82 kB free.) Hmm. And that's without trying to be small. I have one that loads a second kernel over the network, using dhcp to configure its interface and tftp to fetch the image, and boots it; that is only 20kb uncompressed. Compressed, I can fit that plus a kernel and a minimal BIOS all in 512K with some room to spare... Eric
Re: [PATCH] swap=device kernel commandline
Werner Almesberger [EMAIL PROTECTED] writes: Eric W. Biederman wrote: I have one that loads a second kernel over the network, using dhcp to configure its interface and tftp to fetch the image, and boots it; that is only 20kb uncompressed. Neat ;-) My goal is actually not only size, but also to have a relatively normal build environment, e.g. my example is with shared newlib, regular ash, and - unfortunately rather wasteful - glibc's ld.so. But a tftp loader in 20kB is rather good. Now the next challenge is the same thing with NFS. Then we can finally kill nfsroot ;-) Hmm. What does it take to mount an NFS partition? Anyway, all I did was write a tiny libc that is just a bunch of wrappers for syscalls, and some string functions. Then I just wrote a straightforward C program to do the job. Except for my added kexec call I can compile with glibc :) Now if glibc wouldn't link in 200k of unused crap when you make a trivial static binary I'd much prefer to use it... Though I wish it was possible to have a ramfs preloader instead of initrd. An initramfs would allow me to not even compile in the block device driver layer, and be more efficient. Eric
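The shape of such a tiny libc can be sketched as below. This is not the code described in the mail: the `tiny_*` names are made up, and the sketch leans on the C library's generic `syscall()` entry point for brevity, whereas a real freestanding libc would issue the trap itself with a few lines of per-architecture assembly.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* String helpers: the only non-syscall code a loader like this needs. */
static size_t tiny_strlen(const char *s)
{
    const char *p = s;

    assert(s != NULL);
    while (*p)
        p++;
    return (size_t)(p - s);
}

static int tiny_strcmp(const char *a, const char *b)
{
    while (*a && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}

/* Thin wrappers: each one just forwards its arguments to the kernel. */
static long tiny_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}

static long tiny_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Built from the pieces above: write a string to standard output. */
static long tiny_puts(const char *s)
{
    return tiny_write(1, s, tiny_strlen(s));
}
```

Everything else in the program (DHCP, TFTP, the final reboot call) can then be ordinary C written against wrappers like these, which is why the result stays small when linked statically.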
Re: Q: Linux rebooting directly into linux.
Werner Almesberger [EMAIL PROTECTED] writes: Eric W. Biederman wrote: Well there is that. Somehow implementing scatter/gather from a user space process seemed like a potential mess, and extra work. Did you look at kiobufs ? I think they may just have the right functionality. I always wanted bootimg to be able to memory-map things to reduce memory pressure, and it seems now all the ingredients are in place. Your file-based approach could probably use brw_kiovec. When I looked kiobufs seemed to do a good gather but not a good scatter. The code wasn't trivially reusable, and the structures had a lot of overhead. I need to find time soon and write up all of the file format details in an RFC like the GRUB multiboot spec. Possibly even submit it to the IETF as an RFC for compatible booting and multiple platforms. Hmm, if you succeed in selling the format as an integral part of your network boot protocol, this may even work ;-) Well I'd sell it to promote interoperability. What I'm doing protocol wise has been RFC sanctions for years. It's just that every vendor invents their own format. So interoperability is a problem. This is the big reason I'm not in favor of the bootimg approach, that doesn't define anything. Oh, it does - but the policy is implemented in user space. And, of course, it's rather simple. But I'm a little confused with your UBE. It only seems to copy the e820 information, so you still seem to rely on e.g. the SMP tables the BIOS stores in memory. Also, I don't quite see where you're using the saved information. What am I missing ? Defining all of the parameters for the UBE is a separate issue. It comes next in a couple of weeks. The rebooting is done the rest is not yet. As far as where I use the information is used, look in do_kexec. Right after kimage_get_chunk which figures out where it is safe to put the information. 
However, parameter passing like UBE may solve the following two potential problems: - kernel 1 copies tables marked by "magic" numbers in memory, then boots kernel 2, which trips over the copy - kernel 1 doesn't know about a table and damages it, then boots kernel 2, which recognizes the table, and trips over it But I think we don't need to copy or even convert the entire tables for this. After all, any OS that boots on i386 already knows how to parse the BIOS-provided tables, so I think it's better to directly re-use this code than to invent a new format. A few flags or maybe a short list should be sufficient for the problems I've described above. I agree writing the code to understand the table may be a significant issue. On the other hand I still think it is worth a look, being able to unify option parsing for multiple platforms is not a small gain, nor is getting out from short sighted vendor half standards. Besides which most tables seem to contain a lot of information that is probeable. Which just makes them a waste of BIOS space, and sources of bugs. My primary non-linux target are the BSD's, and various experimental OS's. And in those cases why go to the pain of dropping out of protected mode if you are going to just load back into it again. Yep, I fully agree. Compiling the code in it's own file and putting it in it's own section of the kernel for size would probably do it though. This is exactly what bootimg does :) Being sure the code is PIC is a little tricky though. Yes, for now I cheat and depend on gcc to generate code that just happens to be PIC. Hmm. I wonder how hard it would be to add -fPIC to the compilation line for that file. But I'm not certain that would do what I want in this instance... Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: neighbour table?
David Ford [EMAIL PROTECTED] writes: Andrew Park wrote: I get a message neighbour table overflow What does that mean? It seems that net/ipv4/route.c is the place where it prints this. But under what circumstances does this happen? Thanks It means you set the link state of eth0 up before lo. Be sure lo is established before eth0 and you won't see this message. Hmm. How does the interaction work? I've been meaning to track it down for a while but haven't yet. From the cases I have observed it seems to be connected with arp requests that aren't answered (i.e. when something is misconfigured and you try to nfsroot off of the wrong ip on your subnet). And I keep thinking neighbour table underflow would have been a better message. Eric
Re: Q: Linux rebooting directly into linux.
Werner Almesberger [EMAIL PROTECTED] writes: Eric W. Biederman wrote: The code wasn't trivially reusable, and the structures had a lot of overhead. There's some overhead, but I think it's not too bad. I'll give it a try ... The rebooting is done the rest is not yet. Ah, and I already wondered where in all the APIC code you've hidden the magic to avoid the config data clobbering issues ;-) Nope. That just comes in two parts. The first chunk is the work on the apic so the deadlock detector can run on UP kernels. From Ingo Molanar. The second part are my cleanups so we up the apic in a sane state upon reboot. I agree writing the code to understand the table may be a significant issue. On the other hand I still think it is worth a look, being able to unify option parsing for multiple platforms is not a small gain, nor is getting out from short sighted vendor half standards. Well, you certainly have a point where stupid vendors and BIOS nonsense are concerned. However, if we ignore LinuxBIOS for a moment, each platform already has a set of configuration parameter passing conventions imposed by the firmware. So we need to be able to handle this anyway, and most of the information is highly platform-specific. LinuxBIOS is a special case, because you have your own firmware. But what you're suggesting is basically yet another parameter format, which needs to incorporate and possibly unify much of the information contained in all those platform-specific formats. I'm not sure it's worth the effort. And, besides, I think it complicates the kernel, because you either have to add a parallel set of functions extracting and processing data from the "native" or the UBE environment, or you have to add a converter between "native" and UBE for each platform. Or do you have a better plan ? My initial plan was to have two parallel table parsers. The ones we have now. And another based on UBE. If we find the information we need via UBE use that. If not fall back to the old way. 
But the tables are only half of it. Right now we have all kinds of weirdness going through the empty_zero_page at boot time. A lot of that I plan on just gathering in UBE format instead of random data in random locations. Since Setup.S implements this it should be transparent to most everything. But I need to see how well that works first before I'm too committed either way. For x86 it isn't too big of a deal. For other platforms though, where the firmware comes in multiple flavors, converting everything looks like it could be a real win. I guess what I'm most after is improving the linux BIOS abstraction layer. We mostly have one, and only do BIOS calls before really starting the kernel (except for some stupid BIOS standards like APM). When I started with bootimg, I also thought that we'd need some parameter passing mechanism, a bit similar to UBE (although I would have tried to be more text-based). Then I realized that there are actually only a few tables, and we can just keep them in memory. And some of them need to be modified before we can re-use them. (Trivial example: the boot command line. Video modes are a similar, although much more complicated issue.) I agree with tables that we need to be careful. A lossy conversion can be a real problem. The empty_zero_page is my first candidate, and I'll see where it goes from there. One of the more ugly challenges that I've already run into is that there are multiple tables for specifying how interrupts are routed. (On a modern PC the irq number is dynamically assigned.) I would rather have one good table than two that fight each other. But the point is that looking through the parameters and figuring out what works and what makes sense will take some doing, and I'm not promising to do any more than clean up the empty_zero_page. Besides which most tables seem to contain a lot of information that is probeable. Which just makes them a waste of BIOS space, and sources of bugs. 
Agreed with BIOS bugs ;-) Where probing is possible, is it reliable ? It'd take some baroque BIOS parameter table over yet another mandatory boot command line parameter any time ... Hmm. I wonder how hard it would be to add -fPIC to the compilation line for that file. But I'm not certain that would do what I want in this instance... Are there actually architectures where the compiler generates position-dependent code even if you're careful ? (I.e. all functions inlined, only auto variables.) I don't know yet. And since that part is machine specific, x86 is really the only case that matters. I just don't quite trust the compiler. But next rev I'll make certain to steal this code from bootimg. Given a normal architecture I believe no references to global data should be sufficient, to ensure the code is pic. Inlines are interesting because they aren't always inlined. To be really certain you can specify -fPIC a
Re: Kernel 2.5 Workshop RealVideo streams -- next time, please get better audio.
Miles Lane [EMAIL PROTECTED] writes: http://www.osdn.com/conferences/kernel/ Thanks to all responsible for getting these captures of the Kernel 2.5 Workshop presentations put together. There is one major shortcoming of the recordings. Usually, only the comments of the presenter(s) can be heard. This reduces the value of these recordings substantially, since the comments, insights and give-and-take of the other kernel developers would help us get a much more complete understanding of the areas being presented -- try listening to Andy Grover's Power Management presentation and you'll see what I mean. I actually managed to get almost all of it by simply pressing my ear against my speaker, and then pulling back quickly when the main speaker was talking. So my question is, what would it take to get some automatic software volume correction going? This looks like it would be the easiest fix of all. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Longstanding elf fix (2.2.19 fix)
A little while ago I was playing with building an elf self extracting binary. In doing so I discovered that the linux kernel does not handle elf program headers with multiple BSS segments.

Eric

Binary files linux-2.2.19/drivers/char/conmakehash and linux-2.2.19.elf-fix/drivers/char/conmakehash differ
Binary files linux-2.2.19/drivers/char/hfmodem/gentbl and linux-2.2.19.elf-fix/drivers/char/hfmodem/gentbl differ
diff -uNrX linux-exclude-files linux-2.2.19/fs/binfmt_elf.c linux-2.2.19.elf-fix/fs/binfmt_elf.c
--- linux-2.2.19/fs/binfmt_elf.c	Fri Apr 20 13:25:11 2001
+++ linux-2.2.19.elf-fix/fs/binfmt_elf.c	Sun Apr 22 17:55:42 2001
@@ -71,18 +71,6 @@
 #endif
 };
 
-static void set_brk(unsigned long start, unsigned long end)
-{
-	start = ELF_PAGEALIGN(start);
-	end = ELF_PAGEALIGN(end);
-	if (end <= start)
-		return;
-	do_mmap(NULL, start, end - start,
-		PROT_READ | PROT_WRITE | PROT_EXEC,
-		MAP_FIXED | MAP_PRIVATE, 0);
-}
-
-
 /* We need to explicitly zero any fractional pages
    after the data section (i.e. bss).  This would
    contain the junk from the file that should not
@@ -213,6 +201,28 @@
 	return sp;
 }
 
+static inline unsigned long
+elf_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type)
+{
+	unsigned long start, data_len, mem_len, offset;
+	unsigned long map_addr;
+
+	start = ELF_PAGESTART(addr);
+	data_len = ELF_PAGEALIGN(eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	mem_len = ELF_PAGEALIGN(eppnt->p_memsz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	offset = eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr);
+
+	if (eppnt->p_filesz) {
+		map_addr = do_mmap(filep, start, data_len, prot, type, offset);
+		do_mmap(NULL, map_addr + data_len, mem_len - data_len, prot,
+			MAP_FIXED | MAP_PRIVATE, 0);
+		padzero(map_addr + eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	} else {
+		map_addr = do_mmap(NULL, start, mem_len, prot, MAP_PRIVATE, 0);
+	}
+	return(map_addr);
+}
+
 
 /* This is much more generalized than the library routine read function,
    so we keep this separate.  Technically the library read function
@@ -293,12 +303,7 @@
 #endif
 	    }
 
-	    map_addr = do_mmap(file,
-			    load_addr + ELF_PAGESTART(vaddr),
-			    eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr),
-			    elf_prot,
-			    elf_type,
-			    eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr));
+	    map_addr = elf_map(file, load_addr + vaddr, eppnt, elf_prot, elf_type);
 	    if (map_addr > -1024UL) /* Real error */
 		goto out_close;
 
@@ -325,23 +330,6 @@
 	    }
 	}
 
-	/* Now use mmap to map the library into memory. */
-
-	/*
-	 * Now fill out the bss section.  First pad the last page up
-	 * to the page boundary, and then perform a mmap to make sure
-	 * that there are zero-mapped pages up to and including the
-	 * last bss page.
-	 */
-	padzero(elf_bss);
-	elf_bss = ELF_PAGESTART(elf_bss + ELF_EXEC_PAGESIZE - 1); /* What we have mapped so far */
-
-	/* Map the last of the bss segment */
-	if (last_bss > elf_bss)
-		do_mmap(NULL, elf_bss, last_bss - elf_bss,
-			PROT_READ|PROT_WRITE|PROT_EXEC,
-			MAP_FIXED|MAP_PRIVATE, 0);
-
 	*interp_load_addr = load_addr;
 	/*
 	 * AUDIT: is everything deallocated properly if this happens
@@ -660,12 +648,7 @@
 		if (elf_ex.e_type == ET_EXEC || load_addr_set) {
 			elf_flags |= MAP_FIXED;
 		}
-
-		error = do_mmap(file, ELF_PAGESTART(load_bias + vaddr),
-				(elf_ppnt->p_filesz +
-				ELF_PAGEOFFSET(elf_ppnt->p_vaddr)),
-				elf_prot, elf_flags, (elf_ppnt->p_offset -
-				ELF_PAGEOFFSET(elf_ppnt->p_vaddr)));
+		error = elf_map(file, load_bias + vaddr, elf_ppnt, elf_prot, elf_flags);
 
 		if (!load_addr_set) {
 			load_addr_set = 1;
@@ -760,13 +743,6 @@
 	current->mm->start_code = start_code;
 	current->mm->end_data = end_data;
 	current->mm->start_stack = bprm->p;
-
-	/* Calling set_brk effectively mmaps the pages that we need
-	 * for the bss and break sections
-	 */
-	set_brk(elf_bss, elf_brk);
-
-	padzero(elf_bss);
 
 #if 0
 	printk("(start_brk) %x\n" , current->mm->start_brk);
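The arithmetic at the heart of the new elf_map helper can be checked in user space. The sketch below is not kernel code: it assumes a 4096-byte page, and `struct seg_layout` and `compute_layout` are hypothetical names. It reproduces how one program header is split into a file-backed part and an anonymous, zero-filled part, which is what lets every segment carry its own BSS.

```c
#include <assert.h>

#define PAGE_SZ       4096UL   /* assumed page size */
#define PAGESTART(v)  ((v) & ~(PAGE_SZ - 1))
#define PAGEOFFSET(v) ((v) & (PAGE_SZ - 1))
#define PAGEALIGN(v)  (((v) + PAGE_SZ - 1) & ~(PAGE_SZ - 1))

/* Hypothetical result type describing how one segment is mapped. */
struct seg_layout {
    unsigned long start;     /* page-aligned start of the mapping */
    unsigned long data_len;  /* page-aligned length backed by the file */
    unsigned long mem_len;   /* page-aligned total length in memory */
    unsigned long zero_from; /* first byte that must read as zero */
};

/*
 * Mirror the computations in elf_map for a segment with the given
 * virtual address, file size and memory size.
 */
static struct seg_layout compute_layout(unsigned long vaddr,
                                        unsigned long filesz,
                                        unsigned long memsz)
{
    struct seg_layout l;

    assert(filesz <= memsz);
    l.start = PAGESTART(vaddr);
    l.data_len = PAGEALIGN(filesz + PAGEOFFSET(vaddr));
    l.mem_len = PAGEALIGN(memsz + PAGEOFFSET(vaddr));
    /* padzero() in the patch starts clearing at this address. */
    l.zero_from = l.start + filesz + PAGEOFFSET(vaddr);
    return l;
}
```

For a segment at 0x8049234 with 0x100 bytes in the file and 0x3000 bytes in memory, this gives one file-backed page followed by `mem_len - data_len` = 0x3000 bytes of anonymous memory, with zeroing starting at 0x8049334.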
[PATCH] Longstanding elf fix (2.4.3 fix)
A little while ago I was playing with building an elf self extracting binary. In doing so I discovered that the linux kernel does not handle elf program headers with multiple BSS segments. In building a patch for 2.4.3 I also discovered that we are not taking the mmap_sem around do_brk in the exec paths. Attached is a patch that corrects both of these problems.

Eric

diff -uNrX linux-exclude-files linux-2.4.3/arch/mips/kernel/irixelf.c linux-2.4.3.elf-fix2/arch/mips/kernel/irixelf.c
--- linux-2.4.3/arch/mips/kernel/irixelf.c	Fri Apr 20 12:06:40 2001
+++ linux-2.4.3.elf-fix2/arch/mips/kernel/irixelf.c	Sun Apr 22 17:00:28 2001
@@ -130,7 +130,9 @@
 	end = PAGE_ALIGN(end);
 	if (end <= start)
 		return;
+	down_write(&current->mm->mmap_sem);
 	do_brk(start, end - start);
+	up_write(&current->mm->mmap_sem);
 }
 
 
@@ -379,7 +381,9 @@
 
 	/* Map the last of the bss segment */
 	if (last_bss > len) {
+		down_write(&current->mm->mmap_sem);
 		do_brk(len, (last_bss - len));
+		up_write(&current->mm->mmap_sem);
 	}
 	kfree(elf_phdata);
 
@@ -567,8 +571,10 @@
 	unsigned long v;
 	struct prda *pp;
 
+	down_write(&current->mm->mmap_sem);
 	v = do_brk (PRDA_ADDRESS, PAGE_SIZE);
-
+	up_write(&current->mm->mmap_sem);
+
 	if (v < 0)
 		return;
 
@@ -858,8 +864,11 @@
 
 	len = (elf_phdata->p_filesz + elf_phdata->p_vaddr+ 0xfff) & 0xf000;
 	bss = elf_phdata->p_memsz + elf_phdata->p_vaddr;
-	if (bss > len)
-		do_brk(len, bss-len);
+	if (bss > len) {
+		down_write(&current->mm->mmap_sem);
+		do_brk(len, bss-len);
+		up_write(&current->mm->mmap_sem);
+	}
 	kfree(elf_phdata);
 	return 0;
 }
diff -uNrX linux-exclude-files linux-2.4.3/arch/s390x/kernel/binfmt_elf32.c linux-2.4.3.elf-fix2/arch/s390x/kernel/binfmt_elf32.c
--- linux-2.4.3/arch/s390x/kernel/binfmt_elf32.c	Fri Apr 20 12:06:43 2001
+++ linux-2.4.3.elf-fix2/arch/s390x/kernel/binfmt_elf32.c	Sun Apr 22 17:00:28 2001
@@ -188,16 +188,29 @@
 static unsigned long
 elf_map32 (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type)
 {
+	unsigned long start, data_len, mem_len, offset;
 	unsigned long map_addr;
 
 	if(!addr)
 		addr = 0x4000;
 
-	down_write(&current->mm->mmap_sem);
-	map_addr = do_mmap(filep, ELF_PAGESTART(addr),
-			   eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr), prot, type,
-			   eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr));
-	up_write(&current->mm->mmap_sem);
+	start = ELF_PAGESTART(addr);
+	data_len = ELF_PAGEALIGN(eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	mem_len = ELF_PAGEALIGN(eppnt->p_memsz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	offset = eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr);
+
+	if (eppnt->p_filesz) {
+		down_write(&current->mm->mmap_sem);
+		map_addr = do_mmap(filep, start, data_len, prot, type, offset);
+		do_mmap(NULL, map_addr + data_len, mem_len - data_len, prot,
+			MAP_FIXED | MAP_PRIVATE, 0);
+		up_write(&current->mm->mmap_sem);
+		padzero(map_addr + eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	} else {
+		down_write(&current->mm->mmap_sem);
+		map_addr = do_mmap(NULL, start, mem_len, prot, MAP_PRIVATE, 0);
+		up_write(&current->mm->mmap_sem);
+	}
 	return(map_addr);
 }
 
diff -uNrX linux-exclude-files linux-2.4.3/arch/sparc64/kernel/binfmt_aout32.c linux-2.4.3.elf-fix2/arch/sparc64/kernel/binfmt_aout32.c
--- linux-2.4.3/arch/sparc64/kernel/binfmt_aout32.c	Fri Apr 20 12:06:44 2001
+++ linux-2.4.3.elf-fix2/arch/sparc64/kernel/binfmt_aout32.c	Sun Apr 22 17:00:28 2001
@@ -49,7 +49,9 @@
 	end = PAGE_ALIGN(end);
 	if (end <= start)
 		return;
+	down_write(&current->mm->mmap_sem);
 	do_brk(start, end - start);
+	up_write(&current->mm->mmap_sem);
 }
 
 /*
@@ -245,10 +247,17 @@
 	if (N_MAGIC(ex) == NMAGIC) {
 		loff_t pos = fd_offset;
 		/* Fuck me plenty... */
+		down_write(&current->mm->mmap_sem);
 		error = do_brk(N_TXTADDR(ex), ex.a_text);
+		up_write(&current->mm->mmap_sem);
+
 		bprm->file->f_op->read(bprm->file, (char *) N_TXTADDR(ex),
 			  ex.a_text, &pos);
+
+		down_write(&current->mm->mmap_sem);
 		error = do_brk(N_DATADDR(ex), ex.a_data);
+		up_write(&current->mm->mmap_sem);
+
 		bprm->file->f_op->read(bprm->file, (char *) N_DATADDR(ex),
 			  ex.a_data, &pos);
 		goto beyond_if;
@@ -256,8 +265,10 @@
 
 	if (N_MAGIC(ex) == OMAGIC) {
 		loff_t pos = fd_offset;
+		down_write(&current->mm->mmap_sem);
 		do_brk(N_TXTADDR(ex) & PAGE_MASK,
 			ex.a_text+ex.a_data + PAGE_SIZE - 1);
+		up_write(&current->mm->mmap_sem);
 		bprm->file->f_op->read(bprm->file, (char *) N_TXTADDR(ex),
 			  ex.a_text+ex.a_data, &pos);
 	} else {
@@ -271,7 +282,9 @@
 
 		if (!bprm->file->f_op->mmap) {
 			loff_t pos = fd_offset;
+			down_write(&current->mm->mmap_sem);
 			do_brk(0, ex.a_text+ex.a_data);
+			up_write(&current->mm->mmap_sem);
 			bprm->file->f_op->read(bprm->file,(char *)N_TXTADDR(ex),
 				  ex.a_text+ex.a_data, &pos);
 			goto beyond_if;
@@ -382,7 +395,9 @@
 	len = PAGE_ALIGN(ex.a_text + ex.a_data);
 	bss = ex.a_text + ex.a_data + ex.a_bss;
 	if (bss > len) {
+		down_write(&current->mm->mmap_sem);
 		error = do_brk(start_addr + len, bss - len);
+		up_write(&current->mm->mmap_sem);
 		retval =
[PATCH] Add DHCP to 2.4.x ipconfig support
Here is a forward port of the 2.2.x improvements to ipconfig.c, especially support for DHCP.

Eric

diff -uNr linux-2.4.3/Documentation/Configure.help linux-2.4.3.ipdhcp/Documentation/Configure.help
--- linux-2.4.3/Documentation/Configure.help	Fri Apr 20 12:06:37 2001
+++ linux-2.4.3.ipdhcp/Documentation/Configure.help	Sun Apr 22 16:03:26 2001
@@ -3961,6 +3961,21 @@
   want to use BOOTP, a BOOTP server must be operating on your network.
   Read Documentation/nfsroot.txt for details.
 
+DHCP support
+CONFIG_IP_PNP_DHCP
+  If you want your Linux box to mount its whole root filesystem (the
+  one containing the directory /) from some other computer over the
+  net via NFS and you want the IP address of your computer to be
+  discovered automatically at boot time using the DHCP protocol (a
+  special protocol designed for doing this job), say Y here. In case
+  the boot ROM of your network card was designed for booting Linux and
+  does DHCP itself, providing all necessary information on the kernel
+  command line, you can say N here.
+
+  If unsure, say Y. Note that if you want to use DHCP, a DHCP server
+  must be operating on your network. Read Documentation/nfsroot.txt
+  for details.
+
 RARP support
 CONFIG_IP_PNP_RARP
   If you want your Linux box to mount its whole root file system (the
diff -uNr linux-2.4.3/include/net/ipconfig.h linux-2.4.3.ipdhcp/include/net/ipconfig.h
--- linux-2.4.3/include/net/ipconfig.h	Mon Jan 4 16:31:35 1999
+++ linux-2.4.3.ipdhcp/include/net/ipconfig.h	Sun Apr 22 16:03:26 2001
@@ -6,16 +6,33 @@
  * Automatic IP Layer Configuration
  */
 
-extern __u32 root_server_addr;
-extern u8 root_server_path[];
-extern u32 ic_myaddr;
-extern u32 ic_servaddr;
-extern u32 ic_gateway;
-extern u32 ic_netmask;
-extern int ic_enable;
-extern int ic_host_name_set;
-extern int ic_set_manually;
-extern int ic_proto_enabled;
+/* The following are initdata: */
 
-#define IC_BOOTP 1
-#define IC_RARP 2
+extern int ic_enable;		/* Enable or disable the whole shebang */
+
+extern int ic_proto_enabled;	/* Protocols enabled (see IC_xxx) */
+extern int ic_host_name_set;	/* Host name set by ipconfig? */
+extern int ic_set_manually;	/* IPconfig parameters set manually */
+
+extern u32 ic_myaddr;		/* My IP address */
+extern u32 ic_netmask;		/* Netmask for local subnet */
+extern u32 ic_gateway;		/* Gateway IP address */
+
+extern u32 ic_servaddr;		/* Boot server IP address */
+
+extern u32 root_server_addr;	/* Address of NFS server */
+extern u8 root_server_path[];	/* Path to mount as root */
+
+
+
+/* The following are persistent (not initdata): */
+
+extern int ic_proto_used;	/* Protocol used, if any */
+extern u32 ic_nameserver;	/* DNS server IP address */
+extern u8 ic_domain[];		/* DNS (not NIS) domain name */
+
+/* bits in ic_proto_{enabled,used} */
+#define IC_PROTO	0xFF	/* Protocols mask: */
+#define IC_BOOTP	0x01	/* BOOTP (or DHCP, see below) */
+#define IC_RARP		0x02	/* RARP */
+#define IC_USE_DHCP	0x100	/* If on, use DHCP instead of BOOTP */
diff -uNr linux-2.4.3/net/ipv4/Config.in linux-2.4.3.ipdhcp/net/ipv4/Config.in
--- linux-2.4.3/net/ipv4/Config.in	Tue Nov 7 15:12:02 2000
+++ linux-2.4.3.ipdhcp/net/ipv4/Config.in	Sun Apr 22 16:03:26 2001
@@ -20,6 +20,7 @@
 fi
 bool ' IP: kernel level autoconfiguration' CONFIG_IP_PNP
 if [ "$CONFIG_IP_PNP" = "y" ]; then
+   bool 'IP: DHCP support' CONFIG_IP_PNP_DHCP
    bool 'IP: BOOTP support' CONFIG_IP_PNP_BOOTP
    bool 'IP: RARP support' CONFIG_IP_PNP_RARP
 # not yet ready..
diff -uNr linux-2.4.3/net/ipv4/ipconfig.c linux-2.4.3.ipdhcp/net/ipv4/ipconfig.c
--- linux-2.4.3/net/ipv4/ipconfig.c	Mon Mar 26 18:20:57 2001
+++ linux-2.4.3.ipdhcp/net/ipv4/ipconfig.c	Sun Apr 22 16:55:36 2001
@@ -1,10 +1,10 @@
 /*
  * $Id: ipconfig.c,v 1.35 2000/12/30 06:46:36 davem Exp $
  *
- * Automatic Configuration of IP -- use BOOTP or RARP or user-supplied
- * information to configure own IP address and routes.
+ * Automatic Configuration of IP -- use DHCP, BOOTP, RARP, or
+ * user-supplied information to configure own IP address and routes.
  *
- * Copyright (C) 1996--1998 Martin Mares [EMAIL PROTECTED]
+ * Copyright (C) 1996-1998 Martin Mares [EMAIL PROTECTED]
  *
  * Derived from network configuration code in fs/nfs/nfsroot.c,
  * originally Copyright (C) 1995, 1996 Gero Kuhlmann and me.
@@ -16,6 +16,16 @@
  * Fixed ip_auto_config_setup calling at startup in the new Linker Magic
  * initialization scheme.
  *	- Arnaldo Carvalho de Melo [EMAIL PROTECTED], 08/11/1999
+ *
+ * DHCP support added. To users this looks like a whole separate
+ * protocol, but we know it's just a bag on the side of BOOTP.
+ *	-- Chip Salzenberg [EMAIL PROTECTED], May 2000
+ *
+ * Ported DHCP support from 2.2.16 to 2.4.0-test4
+ *	-- Eric Biederman [EMAIL PROTECTED], 30 Aug 2000
+ *
+ * Merged changes from 2.2.19 into 2.4.3
+ *	-- Eric Biederman [EMAIL PROTECTED], 22 April Aug 2001
  */
 
 #include <linux/config.h>
@@ -36,6 +46,7 @@
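The flag layout in the ipconfig.h hunk above can be read as "low byte selects protocols, bit 0x100 modifies BOOTP into DHCP". The following is a minimal sketch of that reading, not code from the patch: the macro values are copied from the hunk, but the helper names `ic_protocols` and `ic_wants_dhcp` are hypothetical, and how ipconfig.c itself tests these bits is not visible in the truncated diff.

```c
/* Values copied from the ipconfig.h hunk in the patch. */
#define IC_PROTO    0xFF   /* Protocols mask */
#define IC_BOOTP    0x01   /* BOOTP (or DHCP, see IC_USE_DHCP) */
#define IC_RARP     0x02   /* RARP */
#define IC_USE_DHCP 0x100  /* If on, use DHCP instead of BOOTP */

/* Hypothetical helper: strip modifier bits, keep only protocol bits. */
static int ic_protocols(int flags)
{
    return flags & IC_PROTO;
}

/* Hypothetical helper: DHCP is BOOTP with the IC_USE_DHCP modifier set. */
static int ic_wants_dhcp(int flags)
{
    return (flags & IC_BOOTP) && (flags & IC_USE_DHCP);
}
```

Keeping the DHCP bit outside the `IC_PROTO` mask fits the header comment that DHCP rides on BOOTP rather than being a separate protocol entry.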
Re: [PATCH] Longstanding elf fix (2.4.3 fix)
David S. Miller [EMAIL PROTECTED] writes:

Eric W. Biederman writes: In building a patch for 2.4.3 I also discovered that we are not taking the mmap_sem around do_brk in the exec paths.

Does that really matter?

In the library loader I can certainly see it making a difference.

Who else can get at the address space? We are a singly referenced address space at that point... perhaps ptrace?

In practice I don't see it being a big deal. But reliable code is made by closing all of the little loopholes. It also improves consistency, as all of the calls to do_mmap are already protected in the exec paths. And of course, since much of the code in the kernel is built by copying a good example, neglecting the locking without a big comment invites trouble elsewhere, like in elf_load_library, where we could have multiple threads running.

Eric
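The "be locally correct" convention argued for in this message can be modelled in a few lines of userspace C. This is only a sketch under loud assumptions: `toy_mm`, `toy_down_write`, `toy_up_write`, `toy_do_brk` and `toy_set_brk` are hypothetical stand-ins, not kernel APIs, and a boolean replaces the real semaphore, so it checks the calling convention rather than real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parts of mm_struct that matter here. */
struct toy_mm {
    bool mmap_sem_held;     /* models mmap_sem held for writing */
    unsigned long brk_end;  /* end of the most recently mapped region */
};

static void toy_down_write(struct toy_mm *mm)
{
    assert(!mm->mmap_sem_held);  /* no recursive locking */
    mm->mmap_sem_held = true;
}

static void toy_up_write(struct toy_mm *mm)
{
    assert(mm->mmap_sem_held);
    mm->mmap_sem_held = false;
}

/* Like do_brk() in the patch: the caller must already hold the lock. */
static unsigned long toy_do_brk(struct toy_mm *mm, unsigned long addr,
                                unsigned long len)
{
    assert(mm->mmap_sem_held);   /* catches any unlocked caller */
    mm->brk_end = addr + len;
    return addr;
}

/* Like the patched set_brk(): lock, extend, unlock. */
static void toy_set_brk(struct toy_mm *mm, unsigned long start,
                        unsigned long end)
{
    if (end <= start)
        return;
    toy_down_write(mm);
    toy_do_brk(mm, start, end - start);
    toy_up_write(mm);
}
```

Putting the assertion inside `toy_do_brk` is the design point: a caller copied from an unlocked example fails immediately instead of racing silently.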
Re: [PATCH] Longstanding elf fix (2.4.3 fix)
David S. Miller [EMAIL PROTECTED] writes:

Eric W. Biederman writes: In building a patch for 2.4.3 I also discovered that we are not taking the mmap_sem around do_brk in the exec paths.

Does that really matter? Who else can get at the address space? We are a singly referenced address space at that point... perhaps ptrace?

Well, looking a little more closely than I did last night, it looks like access_process_vm (called from ptrace) can cause what amounts to a page fault at pretty arbitrary times. ptrace is protected by the big kernel lock, but exec isn't, so that doesn't help.

Hmm. ptrace does require that the process be stopped in all cases before it does anything, and that probably saves us. This is subtle enough that I'd rather be locally correct, and not have to worry about someone enhancing ptrace...

I'm actually a little curious what the big kernel lock in ptrace buys us. I suspect it could be a performance issue with user mode linux, where you have multiple processes being ptraced at the same time.

Eric
Re: [PATCH] Longstanding elf fix (2.4.3 fix)
Linus Torvalds [EMAIL PROTECTED] writes:

On 23 Apr 2001, Eric W. Biederman wrote: ptrace is protected by the big kernel lock, but exec isn't so that doesn't help. Hmm. ptrace does require that the process be stopped in all cases

Right. Ptrace definitely cannot access a process at arbitrary times. In fact, it is very serialized indeed, in that it can only access a process at signal points, ie effectively when it is returning to user space.

With threads, of course, that doesn't help us. But with threads, the other threads could have caused the same page faults, so ptrace() isn't actually adding any new cases in that sense.

I'd be a lot more worried about /proc accesses. access_process_vm is also called from /proc to get the environment and the command line. I don't know if it has other locks it might serialize on, probably not. With execve it's a very small window...

execve() doesn't really need the mm semaphore, but on the other hand it would be cleaner to get it, and it won't really hurt (there can not be any real contention on it anyway - the only contention might come through /proc, and I haven't looked at what that might imply).

load-library should definitely get it. I thought it did already, but.. Did you have a patch? Maybe I missed it.

I'll include it again. I had it attached as a plain text attachment; I don't know if that is a problem or not. In the case I spotted, we were getting the mm semaphore for do_mmap but not for do_brk. So we only get it 50% of the time...

The other thing my patch does is update elf_map so it now handles ELF files with multiple bss sections.
Eric

diff -uNrX linux-exclude-files linux-2.4.3/arch/mips/kernel/irixelf.c linux-2.4.3.elf-fix2/arch/mips/kernel/irixelf.c
--- linux-2.4.3/arch/mips/kernel/irixelf.c	Fri Apr 20 12:06:40 2001
+++ linux-2.4.3.elf-fix2/arch/mips/kernel/irixelf.c	Sun Apr 22 17:00:28 2001
@@ -130,7 +130,9 @@
 	end = PAGE_ALIGN(end);
 	if (end <= start)
 		return;
+	down_write(&current->mm->mmap_sem);
 	do_brk(start, end - start);
+	up_write(&current->mm->mmap_sem);
 }
 
@@ -379,7 +381,9 @@
 
 	/* Map the last of the bss segment */
 	if (last_bss > len) {
+		down_write(&current->mm->mmap_sem);
 		do_brk(len, (last_bss - len));
+		up_write(&current->mm->mmap_sem);
 	}
 	kfree(elf_phdata);
 
@@ -567,8 +571,10 @@
 	unsigned long v;
 	struct prda *pp;
 
+	down_write(&current->mm->mmap_sem);
 	v = do_brk (PRDA_ADDRESS, PAGE_SIZE);
-	
+	up_write(&current->mm->mmap_sem);
+
 	if (v < 0)
 		return;
 
@@ -858,8 +864,11 @@
 	len = (elf_phdata->p_filesz + elf_phdata->p_vaddr+ 0xfff) & 0xf000;
 	bss = elf_phdata->p_memsz + elf_phdata->p_vaddr;
-	if (bss > len)
-		do_brk(len, bss-len);
+	if (bss > len) {
+		down_write(&current->mm->mmap_sem);
+		do_brk(len, bss-len);
+		up_write(&current->mm->mmap_sem);
+	}
 	kfree(elf_phdata);
 	return 0;
 }
diff -uNrX linux-exclude-files linux-2.4.3/arch/s390x/kernel/binfmt_elf32.c linux-2.4.3.elf-fix2/arch/s390x/kernel/binfmt_elf32.c
--- linux-2.4.3/arch/s390x/kernel/binfmt_elf32.c	Fri Apr 20 12:06:43 2001
+++ linux-2.4.3.elf-fix2/arch/s390x/kernel/binfmt_elf32.c	Sun Apr 22 17:00:28 2001
@@ -188,16 +188,29 @@
 static unsigned long
 elf_map32 (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type)
 {
+	unsigned long start, data_len, mem_len, offset;
 	unsigned long map_addr;
 
 	if(!addr)
 		addr = 0x4000;
 
-	down_write(&current->mm->mmap_sem);
-	map_addr = do_mmap(filep, ELF_PAGESTART(addr),
-			   eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr), prot, type,
-			   eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr));
-	up_write(&current->mm->mmap_sem);
+	start = ELF_PAGESTART(addr);
+	data_len = ELF_PAGEALIGN(eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	mem_len = ELF_PAGEALIGN(eppnt->p_memsz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	offset = eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr);
+
+	if (eppnt->p_filesz) {
+		down_write(&current->mm->mmap_sem);
+		map_addr = do_mmap(filep, start, data_len, prot, type, offset);
+		do_mmap(NULL, map_addr + data_len, mem_len - data_len, prot,
+			MAP_FIXED | MAP_PRIVATE, 0);
+		up_write(&current->mm->mmap_sem);
+		padzero(map_addr + eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	} else {
+		down_write(&current->mm->mmap_sem);
+		map_addr = do_mmap(NULL, start, mem_len, prot, MAP_PRIVATE, 0);
+		up_write(&current->mm->mmap_sem);
+	}
 	return(map_addr);
 }
diff -uNrX linux-exclude-files linux-2.4.3/arch/sparc64
Re: [PATCH] Longstanding elf fix (2.4.3 fix)
Manfred Spraul [EMAIL PROTECTED] writes:

Well looking a little more closely than I did last night it looks like access_process_vm (called from ptrace) can cause what amounts to a page fault at pretty arbitrary times.

It's also used for several /proc/pid files. I remember that I got crashes with concurrent exec+cat /proc/pid/cmdline until down(mmap_sem) was added into setup_arg_pages().

O.k. Then the race I'm catching is real, though because it is confined to bss sections we are quite unlikely to trigger it.

Eric
Re: PATCH: trident , pci_enable_device moved
Jeff Garzik [EMAIL PROTECTED] writes:

Andres Salomon wrote: This is what I was told (it was only needed for secondary video devices). From that, I would expect that all video devices would need it, just in case they happened to be the second card. Am I missing some subtlety in some of the video drivers/chipsets that wouldn't allow them to be used as a second video device (therefore not requiring pci_enable_device)?

They do need pci_enable_device, both primary and secondary displays. For the primary display it's safe to call pci_enable_device. For secondary displays, you have to first disable I/O decoding for all VGA devices before you can enable a secondary display. You don't want more than one device decoding the legacy VGA region at any one time. Some cards have the capability to relocate the VGA region, which is nice.

The bigger problem is initializing secondary displays; every video card has a proprietary video BIOS initialization sequence that is run by main BIOS on startup. You can either duplicate this sequence with C code, which is sometimes difficult due to lack of docs or variety of boards, or you can execute the video BIOS with an x86 emulator.

Note: With linuxBIOS (and some other embedded linux setups) even a primary display doesn't get initialized until you start linux, so if you can properly initialize your display, please do it.

Eric
Re: Alpha compile problem solved by Andrea (pte_alloc)
Do you know if anyone has fixed the lazy vmalloc code? I know that as of early 2.4 it was broken on alpha. At the time I noticed it I didn't have time to pursue it, but before I forget to even put in a bug report I thought I'd ask if you know anything about it?

Eric
Re: Alpha compile problem solved by Andrea (pte_alloc)
Andrea Arcangeli [EMAIL PROTECTED] writes:

On Sun, Apr 29, 2001 at 05:27:10PM -0600, Eric W. Biederman wrote: Do you know if anyone has fixed the lazy vmalloc code? I know that as of early 2.4 it was broken on alpha. At the time I noticed it I didn't have time to pursue it, but before I forget to even put in a bug report I thought I'd ask if you know anything about it?

On alpha it's racy if you set CONFIG_ALPHA_LARGE_VMALLOC y (so don't do that as you don't need it). As long as you use only 1 entry of the pgd for the whole vmalloc space (CONFIG_ALPHA_LARGE_VMALLOC n) alpha is safe.

Hmm. I was having reproducible problems with CONFIG_ALPHA_LARGE_VMALLOC n. Enabling the large vmalloc was my workaround, because the large vmalloc went back to the pre-lazy allocation code. I was getting repeatable problems inside of an mtd driver. The problem I had was that entries failed to propagate across different tasks. I think it was something like the first pgd was lazily allocated and not propagated. I don't have SRM on my 264 alpha (for reference on which code paths were followed).

OTOH x86 is racy and there's no workaround available at the moment.

Well, racy is easier to work with than just plain non-functional.

Eric
Re: ServerWorks LE and MTRR
Steffen Persvold [EMAIL PROTECTED] writes:

[EMAIL PROTECTED] wrote: On Sun, 29 Apr 2001, Steffen Persvold wrote: I've learned it the hard way, I have two types: Compaq DL360 (rev 5) and a Tyan S2510 (rev 6). On the Compaq machine I constantly get data corruption on the last double word (4 bytes) in a 64 byte PCI burst when I use write combining on the CPU. On the Tyan however the transfer is always ok.

Are you sure that is not due to board design differences?

No, I can't be 100% certain that the layout of the board isn't the reason, since I haven't asked ServerWorks about this and it doesn't say anything in their docs (yes, my company has the NDA, so I shouldn't get too much into detail here), but if this was the case it would be totally wrong to disable write combining on any LE chipset.

The test case that I have been using to trigger this is sort of special because we are using SCI shared memory adapters to write (with PIO) into remote nodes' memory, and the bandwidth tends to get quite high (approx 170 MByte/sec on LE with write combining). I've been able to run this case on 5 different motherboards using the LE and HE-SL ServerWorks chipsets, but only two of them are LE (the DL360 and the S2510). Everything works fine with write-combining on every motherboard except the DL360 (which has rev 5).

One basic test case that I haven't tried could be to enable write-combining on your PCI graphics adapter memory and see if the X display gets screwed up.

I will try to get some information from ServerWorks about this problem, but I'm not sure if ServerWorks would be happy if I told you the answer (because of the NDA).

I'd like to put my small plug in that this makes me a little nervous. It could also be a problem with the firmware (aka BIOS) mis-setting something up. Working with linuxBIOS I have seen burst writes (enabled with write-combining or write-back) cause data corruption, when non-burst writes to memory don't cause problems, if the memory controller is set up wrong. (This was with Intel 440GX and 440BX chipsets.)

Eric
Re: How can do to disable the L1 cache in linux ?
Alex Huang [EMAIL PROTECTED] writes:

Dear All, How can I disable the L1 cache in linux? Are there some commands or directives to disable it?

Play with the MTRRs and disable caching on memory. Stupid, but it should get you what you want.

Eric
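One way to "play with the MTRRs" from userspace is the text interface of /proc/mtrr, which accepts lines of the form `base=0x... size=0x... type=...`. The sketch below only formats such a line and does not write it anywhere; the function name `format_mtrr_request` is hypothetical. Note the assumptions: marking a RAM range `uncachable` bypasses every cache level for that range (not just L1), is very slow, and can hang a live system if applied carelessly.

```c
#include <stdio.h>

/*
 * Format one request line for /proc/mtrr into buf.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if buf is too small. Nothing is written to /proc here.
 */
static int format_mtrr_request(char *buf, size_t buflen,
                               unsigned long base, unsigned long size,
                               const char *type)
{
    int n = snprintf(buf, buflen, "base=0x%lx size=0x%lx type=%s\n",
                     base, size, type);

    if (n < 0 || (size_t)n >= buflen)
        return -1;   /* output error or truncated */
    return n;
}
```

A caller with root privileges would then write the resulting line to /proc/mtrr; existing overlapping entries may need to be removed first, which this sketch does not attempt.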
Re: serial console problems with 2.4.4
Fabrice Gautier [EMAIL PROTECTED] writes:

On Wed, 02 May 2001 11:54:11 +0200 Reto Baettig [EMAIL PROTECTED] wrote: Hi, I just installed 2.4.4 on our alpha SMP boxes (ES40) and now I have problems with the serial console:

I got the same kind of problem when upgrading from 2.4.2 to 2.4.3 and using busybox as init/getty. The problem was a bug in busybox. The console initialisation code was not correct.

sulogin does not accept input from the serial line
mingetty does not accept input from the serial line
agetty works fine

So this is probably a sulogin/mingetty problem. They should set the CREAD flag in your tty c_cflag. The patch for busybox replaced the line

	tty.c_cflag |= HUPCL|CLOCAL

by

	tty.c_cflag |= CREAD|HUPCL|CLOCAL

Hope this helps.

This part is correct. However the kernel sets CREAD by default. sysvinit (and possibly other inits) clears CREAD. I wish I knew where the breakage actually occurred. And then sulogin/mingetty need to re-enable it.

It's not too big of a deal, except the serial code doesn't accept SAKs when CREAD is clear.

Eric
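The busybox change quoted above amounts to making sure the receiver is enabled before a getty-like program reads from the line. A minimal sketch using the standard termios calls follows; the function names `console_enable_receiver` and `console_fix_fd` are hypothetical, and real programs set many more fields than this.

```c
#include <termios.h>

/* Apply the flag set from the busybox fix to an already-fetched termios. */
static void console_enable_receiver(struct termios *tty)
{
    tty->c_cflag |= CREAD | HUPCL | CLOCAL;
}

/*
 * Read the current settings of fd, enable the receiver, write them back.
 * Returns 0 on success and -1 on failure (for example, fd is not a tty).
 */
static int console_fix_fd(int fd)
{
    struct termios tty;

    if (tcgetattr(fd, &tty) < 0)
        return -1;
    console_enable_receiver(&tty);
    return tcsetattr(fd, TCSANOW, &tty);
}
```

Using `|=` on the fetched settings, rather than assigning a fresh value, keeps whatever baud rate and character size the kernel or an earlier init stage configured.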
Re: serial console problems with 2.4.4
Fabrice Gautier [EMAIL PROTECTED] writes:

On 02 May 2001 10:37:21 -0600 [EMAIL PROTECTED] (Eric W. Biederman) wrote: Fabrice Gautier [EMAIL PROTECTED] writes: So this is probably a sulogin/mingetty problem. They should set the CREAD flag in your tty c_cflag. The patch for busybox replaced the line tty.c_cflag |= HUPCL|CLOCAL by tty.c_cflag |= CREAD|HUPCL|CLOCAL. Hope this helps.

This part is correct. However the kernel sets CREAD by default.

Are you sure? Wasn't this the behaviour for 2.4.2 but changed in 2.4.3?

init=/bin/bash works fine over a serial console in 2.4.4. So I am certain. I get the impression that something in 2.4.3 fixed CREAD handling, and we started noticing the buggy user space.

sysvinit (and possibly other inits) clears CREAD.

In my case I was using busybox as init. So there is no sysvinit or any other init called before this line.

The busybox init is also clearing CREAD (as of 0.51 anyway).

I wish I knew where the breakage actually occurred.

Just look at this diff on serial.c between 2.4.2 and 2.4.3:

If it was a real diff between 2.4.2 and 2.4.3 I would agree, however it looks like your attempt to fix 2.4.3.

Eric

--- serial.c	Sat Apr 21 17:22:53 2001
+++ ../../../linux-2.4.2/drivers/char/serial.c	Sat Feb 17 01:02:36 2001
@@ -1764,8 +1765,8 @@
 	/*
 	 * !!! ignore all characters if CREAD is not set
 	 */
-//	if ((cflag & CREAD) == 0)
-//		info->ignore_status_mask |= UART_LSR_DR;
+	if ((cflag & CREAD) == 0)
+		info->ignore_status_mask |= UART_LSR_DR;
 	save_flags(flags); cli();
 	if (uart_config[info->state->type].flags & UART_STARTECH) {
 		serial_outp(info, UART_LCR, 0xBF);
Re: Possible PCI subsystem bug in 2.4
Alan Cox [EMAIL PROTECTED] writes:

I suspect it would be safe to round up to the next megabyte, possibly up to 64MB or so. But much more would make me nervous. Any suggestions?

I'd go for 1MByte simply because I've not seen an EBDA/NVRAM area that large stuck at the top of RAM. 1Mb would fix the Dell. (It was only when I saw your email it suddenly clicked and I grabbed the bootup log)

There are a couple of options here.

1) Read the MTRRs. Unless the BIOS is braindead it will set up that area as write-back. At any rate we shouldn't ever try to allocate a pci region that is write-back cached.

2) Read the memory locations from the northbridge. It's not possible on every chipset (lack of documentation), but with the linuxBIOS project we code for a couple of them, and we are working on more all of the time.

Eric
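The round-up heuristic discussed at the top of this message is a one-line alignment computation. The sketch below shows it in isolation; `round_up_mb` and `MB` are hypothetical names, and where the kernel would apply the result (presumably to the first address considered free for PCI allocation above the reported top of RAM) is an assumption, not something the thread spells out.

```c
#define MB (1UL << 20)

/*
 * Round addr up to the next 1 MByte boundary.
 * An address that is already aligned is returned unchanged.
 */
static unsigned long round_up_mb(unsigned long addr)
{
    return (addr + MB - 1) & ~(MB - 1);
}
```

For instance, a reported top of RAM at 0x1FFF0000 (64 KBytes short of 512 MBytes, as happens when firmware reserves a small area there) rounds up to 0x20000000, which skips the reserved area.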
Re: smp_send_stop() and disable_local_APIC()
Matt D. Robinson [EMAIL PROTECTED] writes:

It looks like around 2.3.30 or so, someone added the call disable_local_APIC() to smp_send_stop(). I'm not sure what the intention was, but I'm getting some strange behavior as a result based on some code I'm writing. Basically, I'm doing the following ...

	panic()
	{
		/* do whatever you want, notifier list, etc. */
		smp_send_stop();
		write_system_memory();
		/* then do whatever */
	}

write_system_memory() does a write of all system memory pages to some block device. It uses kiobufs as the way to get the pages to disk, doing brw_kiovec() on those pages (using either the IDE or SCSI driver to write the data).

IDE is less likely to hang than SCSI, as it tends to use legacy ISA interrupt lines.

The weird behavior I see is that sometimes, smp_send_stop() being called causes the system to hang up (not every time).

Doing event-driven I/O after disabling the interrupt controller... hmm, I wonder why.

If we don't call smp_send_stop() on those systems, everything works fine. This looks to be directly caused by the disabling of the APIC, which we may need to dump pages to local disk. This only applies to some people's systems -- not everyone displays the same behavior. I'm sure it's good to disable the APIC, but there's no clean way to wait on disabling the APIC until after I'm done writing pages out. My questions are:

1) Why was disable_local_APIC() added to stop_this_cpu() and smp_send_stop()?

Completeness?

2) Is there a better way around this to disable all the other CPUs without disabling the APIC?

I don't know what a good way is. Since there is a kernel panic, it should only be something truly fatal. Given that, reusing anything that hasn't been designed to run in that situation is playing with fire.

Eric
Re: Possible PCI subsystem bug in 2.4
Alan Cox [EMAIL PROTECTED] writes:

There are a couple of options here. 1) read the MTRRs unless the BIOS is braindead it will set up that area as write-back. At any rate we shouldn't ever try to allocate a pci region that is write-back cached.

'unless the BIOS is braindead'. Right. We only got into this problem because the BIOS _was_ braindead.

Well, I did provide a suggestion so you don't have to second guess... Usually it's actually easier to read the memory size from the northbridge than to parse the E820 map.

However, since messing up the MTRRs and messing up the E820 memory map are different kinds of braindamage, it is worth a shot. Personally I think MTRRs are much easier to get right, because you don't need to take into account what the BIOS is going to do, just where your RAM is.

As for braindead BIOSes in general, any comments on totally nuking them?

Seriously. With the general attitude of distrusting BIOSes I have been amazed at the number of things linux expects the BIOS to get right. In practice windows seems to trust the BIOS much less than linux does.

Eric
Re: Possible PCI subsystem bug in 2.4
Alan Cox [EMAIL PROTECTED] writes:

Seriously. With the general attitude of distrusting BIOS's I have been amazed at the number of things linux expects the BIOS to get right. In practice windows seem to trust the BIOS much less than linux does.

It becomes more and more obvious over time exactly why. One problem however is that windows gets away with this because many vendors ship random extra gunge for their box with the system. We don't yet have that power.

Right. So we always need to keep heuristics in our toolbox to fall back on, so we can run on boards with incomplete information. However there are a lot of things we can do that we aren't currently doing.

The example that sticks out in my head is that we rely on the MP table to tell us if the local apic is in pic_mode or in virtual wire mode, when all we really have to do is ask it.

Eric
Re: smp_send_stop() and disable_local_APIC()
Matt D. Robinson [EMAIL PROTECTED] writes:

It's an SMP (and only when your system crashes on a CPU other than 0) problem. I did some more checking of this to verify the specifics of the behavior. Thanks for the sarcasm, though. :)

O.k. That makes perfect sense then. See below.

All I wanted was clarification as to why it was added in the first place, and whether there was a better way around the scenario. I think Ingo added the code, but I never heard back from him. Thanks for the response.

Welcome.

Linux attempts to properly shut down the apics when we are shutting down, and part of that is returning the apics to the mode they were in before we got control. To do that you need to disable every cpu but the bootstrap processor, and return the bootstrap processor to either virtual wire mode or pic_mode. So of course it will be the only cpu getting interrupts, because we are in legacy mode.

I would say it probably makes sense to add an additional call, smp_send_panic_stop, that does exactly what you need instead of what is needed on the normal shutdown path.

Eric
Re: Break 2.4 VM in five easy steps
Jeffrey W. Baker [EMAIL PROTECTED] writes:

On Tue, 5 Jun 2001, Derek Glidden wrote: After reading the messages to this list for the last couple of weeks and playing around on my machine, I'm convinced that the VM system in 2.4 is still severely broken. This isn't trying to test extreme low-memory pressure, just how the system handles recovering from going somewhat into swap, which is a real day-to-day problem for me, because I often run a couple of apps that most of the time live in RAM, but during heavy computation runs, can go a couple hundred megs into swap for a few minutes at a time. Whenever that happens, my machine always starts acting up afterwards, so I started investigating and found some really strange stuff going on.

I reboot each of my machines every week, to take them offline for intrusion detection. I use 2.4 because I need advanced features of iptables that ipchains lacks. Because the 2.4 VM is so broken, and because my machines are frequently deeply swapped, they can sometimes take over 30 minutes to shutdown. They hang of course when the shutdown rc script turns off the swap. The first few times this happened I assumed they were dead.

Interesting. Is it constant disk I/O? Or constant CPU utilization? In any case you should be able to comment that line out of your shutdown rc script and be in perfectly good shape.

Eric
Re: Break 2.4 VM in five easy steps
Andrew Morton [EMAIL PROTECTED] writes:

Jeffrey W. Baker wrote: Because the 2.4 VM is so broken, and because my machines are frequently deeply swapped,

The swapoff algorithms in 2.2 and 2.4 are basically identical. The problem *appears* worse in 2.4 because it uses lots more swap.

And 2.4 does delayed swap deallocation. We don't appear to optimize the case where a page is only used by the swap cache. That should be able to save some cpu overhead if nothing else.

And I do know that in the early 2.2 timeframe, swapoff was used to generate an artificially high VM load, for testing the VM. It looks like that testing procedure has been abandoned :)

they can sometimes take over 30 minutes to shutdown.

Yes. The sys_swapoff() system call can take many minutes of CPU time. It basically does:

	for (each page in swap device) {
		for (each process) {
			for (each page used by this process)
				stuff
		}
	}

It's interesting that you've found a case where this actually has an operational impact.

Agreed.

Haven't looked at it closely, but I think the algorithm could become something like:

	for (each process) {
		for (each page in this process) {
			if (page is on target swap device)
				get_it_off()
		}
	}
	for (each page in swap device) {
		if (it is busy)
			complain()
	}

You would need to handle the shared memory case as well. But otherwise this looks sound. I would suggest going through page->address_space->i_mmap_shared to find all of the potential mappings, but the swapper address space is used by all processes that have pages in swap.

That's 10^4 to 10^6 times faster.

It looks like it could be. The bottleneck should be disk I/O; if it is not, we have a noticeably inefficient algorithm.

Eric
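The "10^4 to 10^6 times faster" estimate follows from the loop shapes alone. As a rough cost model (a sketch, not a measurement), assume S pages on the swap device, P processes and N pages per process: the existing loop nest visits about S*P*N entries, while the proposed two-pass version visits about P*N + S. The function names below are hypothetical and the uniform N per process is a simplifying assumption.

```c
/* Cost model only: counts loop visits, not real kernel work. */
typedef unsigned long long steps_t;

/* Existing shape: for each swap page, for each process, for each page. */
static steps_t swapoff_steps_nested(steps_t swap_pages, steps_t nprocs,
                                    steps_t pages_per_proc)
{
    return swap_pages * nprocs * pages_per_proc;
}

/* Proposed shape: one pass over all process pages, then one over swap. */
static steps_t swapoff_steps_two_pass(steps_t swap_pages, steps_t nprocs,
                                      steps_t pages_per_proc)
{
    return nprocs * pages_per_proc + swap_pages;
}
```

With 65536 swap pages (256 MBytes of 4 KByte pages), 100 processes and 10000 pages each, the model gives 65,536,000,000 visits against 1,065,536, a ratio of roughly 6 x 10^4, which sits inside the quoted range.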
Re: Break 2.4 VM in five easy steps
Derek Glidden [EMAIL PROTECTED] writes: John Alvord wrote: On Wed, 06 Jun 2001 11:31:28 -0400, Derek Glidden [EMAIL PROTECTED] wrote: I'm beginning to be amazed at the Linux VM hackers' attitudes regarding this problem. I expect this sort of behaviour from academics - ignoring real actual problems being reported by real actual people really and actually experiencing and reporting them because technically or theoretically they shouldn't be an issue or because the literature [documentation] says otherwise - but not from this group. There have been multiple comments that a fix for the problem is forthcoming. Is there some reason you have to keep talking about it? Because there have been many more comments that "The rule for 2.4 is 'swap == 2*RAM' and that's the way it is" and "disk space is cheap - just add more" than there have been "this is going to be fixed", which is extremely discouraging and doesn't instill me with all sorts of confidence that this problem is being taken seriously. The hard rule will always be that to cover all pathological cases swap must be greater than RAM. Because in the worst case all RAM will be in the swap cache. That this is more than just the worst case in 2.4 is problematic. I.e. in the worst case: Virtual Memory = RAM + (swap - RAM). You can't improve the worst case. We can improve the worst case that many people are facing. Or are you saying that if someone is unhappy with a particular situation, they should just keep their mouth shut and accept it? It's worth complaining about. It is also worth digging into and finding out what the real problem is. I have a hunch that this whole conversation on swap sizes being irritating is hiding the real problem. Eric
Re: Requirement: swap = RAM x 2.5 ??
Jeff Garzik [EMAIL PROTECTED] writes: I'm sorry but this is a regression, plain and simple. Previous versions of Linux have worked great on diskless workstations with NO swap. Swap is extra space to be used if we have it and nothing else. Given the slow speed of disks, some additional rules apply if you want to use them efficiently when you are using swap. In the worst case when swapping is being used you get: Virtual Memory = RAM + (swap - RAM). That cannot be improved. You can increase the likelihood that that case won't come up, but that is a different matter entirely. I suspect in practice that we are suffering more from lazy reclamation of swap pages than from a more aggressive swap cache. Eric
Re: Break 2.4 VM in five easy steps
Derek Glidden [EMAIL PROTECTED] writes: The problem I reported is not that 2.4 uses huge amounts of swap but that trying to recover that swap off of disk under 2.4 can leave the machine in an entirely unresponsive state, while 2.2 handles identical situations gracefully. The interesting thing from other reports is that it appears to be kswapd using up CPU resources. Not the swapout code at all. So it appears to be a fundamental VM issue. And calling swapoff is just a good way to trigger it. If you could confirm this by calling swapoff sometime other than at reboot time, that might help. Say, while running top on the console. Eric
Re: Break 2.4 VM in five easy steps
Mike Galbraith [EMAIL PROTECTED] writes: On 6 Jun 2001, Eric W. Biederman wrote: Derek Glidden [EMAIL PROTECTED] writes: The problem I reported is not that 2.4 uses huge amounts of swap but that trying to recover that swap off of disk under 2.4 can leave the machine in an entirely unresponsive state, while 2.2 handles identical situations gracefully. The interesting thing from other reports is that it appears to be kswapd using up CPU resources. Not the swapout code at all. So it appears to be a fundamental VM issue. And calling swapoff is just a good way to trigger it. If you could confirm this by calling swapoff sometime other than at reboot time. That might help. Say by running top on the console. The thing goes comatose here too. SCHED_RR vmstat doesn't run, console switch is nogo... After running his memory hog, swapoff took 18 seconds. I hacked a bleeder valve for dead swap pages, and it dropped to 4 seconds.. still utterly comatose for those 4 seconds though. What happens if you put the following at the top of the while(1) loop in try_to_unuse: if (need_resched) schedule(); It should be outside all of the locks. It might just be a matter of everything serializing on the SMP locks, and the kernel refusing to preempt itself. Eric
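The suggestion above, a voluntary reschedule point at the top of a long loop and outside all locks, can be illustrated with a user-space analogy. This is a sketch under stated assumptions and not the try_to_unuse() code: scan_entries(), the lock_held marker and the simulated tick are hypothetical, and POSIX sched_yield() stands in for the kernel's schedule().

```c
#include <assert.h>
#include <sched.h>

/* Hypothetical stand-ins: in the kernel the flag is set for the running
 * task when something else should get the CPU; here a fake "tick" sets it. */
static volatile int need_resched;
static int lock_held;

/* Scans nentries items and returns how many times the loop gave up the CPU.
 * tick_every > 0 simulates a timer tick every tick_every iterations. */
int scan_entries(int nentries, int tick_every)
{
    int yields = 0;

    for (int i = 0; i < nentries; i++) {
        if (tick_every > 0 && i % tick_every == 0)
            need_resched = 1;            /* simulated timer tick */

        /* Reschedule point: top of the loop, before any lock is taken. */
        if (need_resched) {
            assert(!lock_held);
            need_resched = 0;            /* cleared by hand in this toy */
            sched_yield();
            yields++;
        }

        lock_held = 1;
        /* ... per-entry work would go here ... */
        lock_held = 0;
    }
    return yields;
}
```

Calling scan_entries(10, 3) yields on iterations 0, 3, 6 and 9. The point of placing the check before the lock is taken is that the loop never gives up the CPU while holding something other tasks would then wait on.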
Re: Break 2.4 VM in five easy steps
LA Walsh [EMAIL PROTECTED] writes: Eric W. Biederman wrote: The hard rule will always be that to cover all pathological cases swap must be greater than RAM. Because in the worse case all RAM will be in thes swap cache. That this is more than just the worse case in 2.4 is problematic. I.e. In the worst case: Virtual Memory = RAM + (swap - RAM). Hmmm... so my 512M laptop only really has 256M? Um... I regularly run more than 256M of programs. I don't want it to swap -- it's a special, weird condition if I do start swapping. I don't want to waste 1G of HD (5%) for something I never want to use. IRIX runs just fine with swap < RAM. In Irix, your Virtual Memory = RAM + swap. Seems like the Linux kernel requires more swap than other old OS's (SunOS3 (virtual mem = min(mem,swap))). I *thought* I remember that restriction being lifted in SunOS4 when they upgraded the VM. Even though I worked there for 6 years, that was 6 years ago... There are certain scenarios where you can't avoid virtual mem = min(RAM,swap). Which is what I was trying to say (bad formula). What happens is that pages get referenced evenly enough and quickly enough that you simply cannot reuse the on-disk pages. Basically in the worst case all of RAM is pretty much in flight doing I/O. This is true of all paging systems. However, just because in the worst case virtual mem = min(RAM,swap) is no reason other cases should use that much swap. If you are doing a lot of swapping it is more efficient to plan on mem = min(RAM,swap) as well, because frequently you can save on I/O operations by simply reusing the existing swap page. You can't improve the worst case. We can improve the worst case that many people are facing. --- Other OS's don't have this pathological 'worst case' scenario. Even my Windows [vm]box seems to operate fine with swap < MEM. On IRIX, virtual space closely approximates physical + disk memory. It's a theoretical worst case and they all have it.
In practice, however, it is very hard to find a workload where practically every page in the system is close to the I/O point. Except for removing pages that aren't used, paging with swap < RAM is not useful. Simply removing pages that aren't in active use but might possibly be used someday is a common case, so it is worth supporting. It's worth complaining about. It is also worth digging into and find out what the real problem is. I have a hunch that this hole conversation on swap sizes being irritating is hiding the real problem. --- Okay, admission of ignorance. When we speak of swap space, is this term inclusive of both demand paging space and swap-out-entire-programs space or one or another? Linux has no method to swap out an entire program, so when I speak of swapping I'm actually thinking of paging. Eric
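The capacity formulas quoted in this message can be checked with a few lines of arithmetic. The sketch below only evaluates the three expressions as they appear in the thread; the function names are hypothetical, and the 256 MB swap size used in the example is an assumption inferred from the "512M laptop only really has 256M" question, not a figure stated by the poster.

```c
#include <assert.h>

/* Sizes are in megabytes. Function names are made up for illustration. */

/* "Virtual Memory = RAM + (swap - RAM)", which reduces to swap. */
long vm_first_formula(long ram_mb, long swap_mb)
{
    assert(ram_mb >= 0 && swap_mb >= 0);
    return ram_mb + (swap_mb - ram_mb);
}

/* "virtual mem = min(RAM,swap)", the corrected worst-case statement. */
long vm_min_formula(long ram_mb, long swap_mb)
{
    assert(ram_mb >= 0 && swap_mb >= 0);
    return ram_mb < swap_mb ? ram_mb : swap_mb;
}

/* "Virtual Memory = RAM + swap", the behaviour attributed to IRIX. */
long vm_additive(long ram_mb, long swap_mb)
{
    assert(ram_mb >= 0 && swap_mb >= 0);
    return ram_mb + swap_mb;
}
```

With 512 MB of RAM and an assumed 256 MB of swap, the first two expressions both give 256 and the additive one gives 768, which is the discrepancy the question about the laptop points at. With swap larger than RAM the first two expressions differ: for 512 MB of RAM and 1024 MB of swap, the first gives 1024 and the second gives 512.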
Re: Break 2.4 VM in five easy steps
[EMAIL PROTECTED] (Linus Torvalds) writes: Somebody interested in trying the above add? And looking for other more obvious bandaid fixes. It won't fix swapoff per se, but it might make it bearable and bring it to the 2.2.x levels. A little bit. The one really bad behavior of not letting any other processes run seems to be fixed with an explicit: if (need_resched) { schedule(); } What I can't figure out is why this is necessary, because we should be sleeping in alloc_pages if nowhere else. I suppose if the bulk of our effort really is freeing dead swap cache pages we can spin without sleeping, and never let another process run because we are busily recycling dead swap cache pages. Does this sound right? If this is going on I think we need to look at our delayed deallocation policy a little more carefully. I suspect we should have code in kswapd actively removing these dead swap cache pages. After we get the latency improvements in exit these pages do absolutely nothing for us except clog up the whole system, and generally give the 2.4 VM a bad name. Anyone care to check my analysis? Is anybody interested in making swapoff() better? Please speak up.. Interested. But finding the time... Eric