Re: Subtle MM bug

2001-01-10 Thread Eric W. Biederman

Andrea Arcangeli [EMAIL PROTECTED] writes:

 On Wed, Jan 10, 2001 at 11:46:03AM +, David Woodhouse wrote:
  So the VM code spends a fair amount of time scanning lists of pages which 
  it really can't do anything about?
 
 Yes.
 
  Would it be possible to put such pages on different list, so that the VM
 
 Currently to unmap the other pages we have to waste time on those unfreeable
 pages as well.
 
 Once I or other developer finishes with the reverse lookup from page to
 pte-chain (an implementation from DaveM just exists) we'll be able to put them
 in a separate lru, but it's certainly not a 2.4.1-pre2 thing.

Why do we even want to do reverse page tables?
It seems everyone is assuming this is a good thing and except for being
a touch more flexible I don't see what this buys us (besides more locked memory).

My impression with the MM stuff is that everyone except linux is
trying hard to clone BSD instead of thinking through the issues
ourselves.

And because of the extra overhead this doesn't look to be a win on a
heavily loaded box with no swap and probably only glibc mmapped.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Subtle MM bug

2001-01-12 Thread Eric W. Biederman

Ralf Baechle [EMAIL PROTECTED] writes:

 On Thu, Jan 11, 2001 at 12:56:57AM +0100, David Weinehall wrote:
 
   The MMU on these systems is a CAM, and the mmu table is thus backwards to
   convention. (It also means you can notionally map two physical addresses to
   one virtual but thats undefined in the implementation ;))
  
  Are there any other (not yet supported) platforms with similar (or other
  unrelated, but hard to support because of the current architecture of
  the kernel) problems?
  
  (No, I have no secret trumps up my sleeve, I'm just curious.)
 
 Having a reverse mappings is the least sucky way to handle virtual aliases
 of certain types of MIPS caches.

Hmm.  I would think that increasing the logical page size in the kernel would
be the trivial way to handle virtual aliases, i.e. with a large enough page
size you can't actually have a virtual alias.

You could also play some games with simply allocating pages only with the
proper high bits.  These games might also be useful on architectures with L2
caches that have more significant physical bits than PAGE_SHIFT bits.

But how does a reverse mapping help to handle virtual aliases?  What are those
caches doing?  The only model in my head is having a virtually indexed cache
where you have more index bits than PAGE_SHIFT bits.

Eric




Re: Subtle MM bug

2001-01-14 Thread Eric W. Biederman

Ralf Baechle [EMAIL PROTECTED] writes:

 On Fri, Jan 12, 2001 at 09:11:43PM +, Russell King wrote:
 
  Eric W. Biederman writes:
   Hmm.  I would think that increasing the logical page size in the kernel
   would be the trivial way to handle virtual aliases.  (i.e.) with a large
   enough page size you can't actually have a virtual alias.
  
  There are types of caches out there that no matter how large the page size,
  you will always have alias issues.  These are ones where the cache lines
  are indexed independent of virtual address (and therefore can have funny
  cache line replacement algorithms).
  
  And yes, you guessed which processor has it. ;)

Odd.  Does this affect correctness?

 I recently spoke with some CPU architecture researcher at some university
 about cache architectures; I suspect in the near future we'll see more
 funny cache indexing and replacment algorithems ...

But I doubt many of those will run incorrectly, rather than just less
efficiently, if the OS doesn't help you avoid aliases.


Eric



Caches, page coloring, virtual indexed caches, and more

2001-01-15 Thread Eric W. Biederman

Ralf Baechle [EMAIL PROTECTED] writes:

 On Fri, Jan 12, 2001 at 09:10:54AM -0700, Eric W. Biederman wrote:
 
   Having a reverse mappings is the least sucky way to handle virtual aliases
   of certain types of MIPS caches.
  
  Hmm.  I would think that increasing the logical page size in the kernel would
  be the trivial way to handle virtual aliases.  (i.e.) with a large enough page

O.k. I stepped back and took a little refresher to make certain I know what
is going on.  The only problem besides context switches with a virtually mapped
cache is that without some care you can have multiple cache blocks for
the same data. This is what we must avoid to be correct.

I admit that using a reverse mapping is one way we could prevent these
duplicate blocks. 

#define VIRT_INDEX_BITS 18 /* number of bits in the L1 virtually indexed cache */

These are the places I know of in the kernel that create page mappings:
fork, anonymous pages, mmap, sysv shared memory, mremap, kmap.

fork just duplicates something that is already there but in a
different mm, so no bad virtual aliases are created.

anonymous pages only belong to one process, and have effectively only
one mapping so again not a problem.  Unless you need kmap.  To make
that work well we'd have to make the restriction that the swap cache
index and the virtual address are identical in their VIRT_INDEX_BITS.
That's better than doing it in alloc_pages especially as you never
alloc high order swap pages but it worries me a little.   This is
fairly close to what we do with swap clustering but it's still
a pain.

shared mmap.  This is the important one.  Since we have a logical
backing store this is easy to handle.  We just enforce that the
virtual address in a process that we mmap something to must match the
logical address to VIRT_INDEX_BITS.  The effect is the same as a
larger page size with virtually no overhead.

sysv shared memory is exactly the same as shared mmap.  Except instead
of a file offset you have an offset into the sysv segment.

mremap.  Linux specific but pretty much the same as mmap, but easier.
We just enforce that the virtual address of the source of mremap,
and the destination of mremap match on VIRT_INDEX_BITS.

kmap is a little different.  Using VIRT_INDEX_BITS is a little
subtle but should work.  Currently kmap is used only with the page
cache so we can take advantage of the page->index field.  From page->index
we can compute the logical offset of the page and make certain the
page is mapped with all VIRT_INDEX_BITS the same as an mmap alias.

kmap and the swap cache are a little different.  Since index holds
the location of a page on the swap file we'd have to make that index
be the same for VIRT_INDEX_BITS as well.
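
As a rough illustration of the rule described above for shared mmap, sysv
shared memory and mremap, the check could be sketched in C as follows.  This
is only a sketch: a PAGE_SHIFT of 12 and the helper names same_color and
color_align are assumptions made for the example, not existing kernel
interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12   /* assumption: 4K pages */
#define VIRT_INDEX_BITS 18   /* as in the #define above */

/* Bits above the page offset that still take part in cache indexing. */
#define COLOR_MASK ((((uintptr_t)1 << VIRT_INDEX_BITS) - 1) & \
                    ~(((uintptr_t)1 << PAGE_SHIFT) - 1))

/* Hypothetical helper: nonzero if a mapping of logical offset `offset`
 * at virtual address `vaddr` agrees with it in all index bits. */
static int same_color(uintptr_t vaddr, uint64_t offset)
{
    return ((vaddr ^ (uintptr_t)offset) & COLOR_MASK) == 0;
}

/* Hypothetical helper: lowest address >= hint whose index bits match
 * the logical offset.  The hint must be page aligned. */
static uintptr_t color_align(uintptr_t hint, uint64_t offset)
{
    uintptr_t span = (uintptr_t)1 << VIRT_INDEX_BITS;
    uintptr_t addr;

    assert((hint & (((uintptr_t)1 << PAGE_SHIFT) - 1)) == 0);
    addr = (hint & ~(span - 1)) + ((uintptr_t)offset & COLOR_MASK);
    if (addr < hint)
        addr += span;   /* move on to the next index-sized window */
    return addr;
}
```

With these assumptions an mmap without MAP_FIXED would place the mapping at
color_align(hint, offset), and a MAP_FIXED request would be refused when
same_color(addr, offset) is false.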


 
  size you can't actually have a virtual alias.
 
 That's a possible solution; I'm not clear how bad the overhead would be.
 Right now a virtual alias is a relativly rare event and we don't want the
 common case of no virtual alias to make pay a high price.  Or?

I guess the question is how big would these logical pages need to be?
Answer: big enough to turn your virtually indexed cache into a
physically indexed cache.  Which means they would have to be cache
size.

Increasing PAGE_SIZE a few bits shouldn't be bad but going up two
orders of magnitude would likely skewer your swapping, and memory
management performance.  You'd just have way too few pages.

But I have a better suggestion so see above.

  You could also play some games with simply allocating pages only with the
  proper proper high bits.   These games might also be useful on architectures
  for L2 caches who have significant physical bits than PAGE_SHIFT bits.
 
 An alternative but less efficient solution.  I tried to implement it; I ran
 into problems with running out of larger pages soon as I had to split order 2
 pages into 4 order 0 pages to implement this; the fragmentation was _really_
 bad.

O.k. this is scratched off my list of possible good ideas.  Duh.  This
fails for exactly the same reason as increasing the page
size.  At a 256K cache and 4K PAGE_SIZE you'd need 256/4 = 64 different
types of pages, fairly nasty.
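
The arithmetic behind that figure can be written down as a small C helper.
This is a sketch only; page_colors is a name made up for the example, and the
"index span" is assumed to be the number of bytes covered by the virtual
index bits (256K when VIRT_INDEX_BITS is 18).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many distinct page "types" (colors) an
 * allocator would have to keep apart, given the bytes covered by the
 * cache index and the page size. */
static size_t page_colors(size_t index_span_bytes, size_t page_bytes)
{
    assert(page_bytes > 0);
    assert(index_span_bytes % page_bytes == 0);
    return index_span_bytes / page_bytes;
}
```

page_colors(256 * 1024, 4096) gives the 64 types mentioned above, which is
the same as 2^(VIRT_INDEX_BITS - PAGE_SHIFT) = 2^(18 - 12).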
 
  But how does a reverse mapping help to handle virtual aliases?  What are those
 
  caches doing?
 
 You leave only mappings of one color accessible.  All other mappings are made
 unaccessible in the page table, so accessing will result in a TLB fault.
 The TLB fault handler then flushes the active mappings, makes them
 unaccessible by clearing the MIPS hw dirty / accessible bits, then makes the
 mapping of the new color accessible in the page table.  This is already
 possible right now but doing the necessary reverse mappings can be rather
 inefficient as is.

Hmm.  This doesn't sound right.  And this sounds like a silly way to
use reverse mappings anyway, since you can do it up front in mmap and
its kin.  Which means you don't have to slow down any of the page fault
logic.

 
  The only model in my head is having a virtually

Re: Caches, page coloring, virtual indexed caches, and more

2001-01-15 Thread Eric W. Biederman

Ralf Baechle [EMAIL PROTECTED] writes:

 On Mon, Jan 15, 2001 at 01:41:06AM -0700, Eric W. Biederman wrote:
 
 (Cc list truncated since probably not so many people do care ...)
 
  shared mmap.  This is the important one.  Since we have a logical
  backing store this is easy to handle.  We just enforce that the
  virtual address in a process that we mmap something to must match the
  logical address to VIRT_INDEX_BITS.  The effect is the same as a
  larger page size with virtually no overhead.
 
 I'm told this is going to break software.  Bad since it's otherwise it'd
 be such a nice silver bullet solution.

Heck if we wanted to we could even lie about PAGE_SIZE, and say it was huge.
I'd have to have a clear example before I give it up that easily.
mmap has never allowed totally arbitrary offsets, and mmap(MAP_FIXED)
is highly discouraged so I'd like to see it.

And on architectures that don't need this it should compile out with
no overhead.

 
  sysv shared memory is exactly the same as shared mmap.  Except instead
  of a file offset you have an offset into the sysv segment.
 
 No, it's simpler in the MIPS case.  The ABI guys were nice and did define
 that the virtual addresses have to be multiple of 256kbyte which is
 more than sufficient to kill the problem.

If VIRT_INDEX_BITS == 18 and because you can only map starting at
the beginning of a sysv shared memory segment this is exactly what
my code boils down to.

 
  mremap.  Linux specific but pretty much the same as mmap, but easier.
  We just enforce that the virtual address of the source of mremap,
  and the destination of mremap match on VIRT_INDEX_BITS.
 
 Correct and as mremap doesn't take any address argument we won't break
 any expecations on the properties of the returned address in mmap.
 
  kmap is a little different.  using VIRT_INDEX_BITS is a little
  subtle but should work.  Currently kmap is used only with the page
  cache so we can take advantage of the page->index field.  From page->index
  we can compute the logical offset of the page and make certain the
  page mapped with all VIRT_INDEX_BITS the same as a mmap alias.
 
 Yup.  It gets somewhat tricker due to the page cache being in in KSEG0,
 an memory area which is essentially like a 512mb page that is hardwired
 in the CPU.  It's preferable to stick with since it means we never take
 any TLB faults for pages in the page cache on MIPS.

Good.  Then we don't need (at least for mips) to worry about this case.
I was just thinking through the general case.  

  kmap and the swap cache are a little different.  Since index holds
  the location of a page on the swap file we'd have to make that index
  be the same for VIRT_INDEX_BITS as well.
 
   That's a possible solution; I'm not clear how bad the overhead would be.
   Right now a virtual alias is a relativly rare event and we don't want the
   common case of no virtual alias to make pay a high price.  Or?
  
  I guess the question is how big would these logical pages need to be?
 
 Depending of the CPU 8kb to 32kb; the hardware supports page sizes 4kb, 16kb,
 64kb ... 16mb.

If all you need is 32kb that is better than the 256K number I had in my head.
Still as far as an application is concerned the results are the same as
my silver bullet above.

  Answer big enough to turn your virtually indexed cache into a
  physically indexed cache.  Which means they would have to be cache
  size.  
 
 For above mentioned CPU versions which have 8kb rsp. 16kb per primary cache
 we want 32kb as mentioned.
 
  Increasing PAGE_SIZE a few bits shouldn't be bad but going up two
  orders of magnitude would likely skewer your swapping, and memory
  management performance.  You'd just have way to few pages.
  
  But I have a better suggestion so see above.
 
  O.k. this is scratched off my list of possible good ideas.  Duh.  This
  fails for exactly the same reason as increasing as increasing page
  size.  at 256K cache and 4K PAGE_SIZE you'd need 256/4 = 64 different
  types of pages, fairly nasty.
 
 You say it; yet it seems like it could be part of a good solution.  Just
 forcefully allocating a single page by splitting a large page and before
 that even swapping until we can actually allocate a higher order page is
 bad.

I totally agree.  Larger pages don't suck but are unnecessary.  At least
I haven't been convinced otherwise yet.


  Hmm.  This doesn't sound right.  And this sounds like a silly way to
  use reverse mappings anyway, since you can do it up front in mmap and
  their kin.  Which means you don't have to slow any of the page fault
  logic up.
 
 Then how do you handle something like:
 
   fd = open(TESTFILE, O_RDWR | O_CREAT, 664);
   res = write(fd, one, 4096);
   mmap(addr, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
   mmap(addr + PAGE_SIZE, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
 
 If both mappings are immediately created accessible you'll directly endup
 with aliases.  There is no choice

Re: Caches, page coloring, virtual indexed caches, and more

2001-01-16 Thread Eric W. Biederman

Anton Blanchard [EMAIL PROTECTED] writes:

  
  
  At least for sparc it's already supported.  Right now I don't feel like
  looking into the 2.4 solution but checkout srmmu_vac_update_mmu_cache in
  the 2.2 kernel.
 
 I killed that hack now that we align all shared mmaps to the same virtual
 colour :)

Nice.

Where do you do this?  And how do you handle the case of aliases with kseg,
the giant kernel mapping?

Eric



Re: Caches, page coloring, virtual indexed caches, and more

2001-01-17 Thread Eric W. Biederman

Anton Blanchard [EMAIL PROTECTED] writes:

 Hi,
  
  Where do you do this?  And how do you handle the case of aliases with kseg,
  the giant kernel mapping.
 
 Aliases between user and kernel mappings of a page are handled by
 flush_page_to_ram (the old interface) or {copy,clear}_user_page,
 flush_dcache_page and update_mmu_cache (new interface). Sparc64 already
 uses the new interface and there are patches for ppc and ia64 to use it.
 
 The new interface allows flushes to be avoided, leading to rather nice
 performance increases.
 
 See Documentation/cachetlb.txt for more info.

Thanks,

Well, they are a step in the right direction.
But they are still racy, especially on SMP.

The bad case is:
Process A in kernel space calls flush_dcache_page.
Then Process B in a separate thread writes to the first word in a
cache line.  Then Process A writes to the last word in the cache line.

Assuming the virtual addresses from Process A and Process B are of a
different color this gives two non overlapping writes with a well
defined meaning, which the kernel gets wrong.  In particular the ram
will only see one write or the other not both.
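
To make the lost write concrete, here is a toy model in C.  It is a sketch
under simplifying assumptions: each alias gets its own copy of the line, the
cache writes back whole lines, and all names (cache_line, line_fill,
line_writeback, aliased_writes_surviving) are invented for the example.

```c
#include <assert.h>
#include <string.h>

#define LINE_WORDS 8   /* assumption: 8 words per cache line */

struct cache_line { unsigned words[LINE_WORDS]; };

/* Load a whole line from RAM into one cached copy. */
static void line_fill(struct cache_line *l, const unsigned *ram)
{
    memcpy(l->words, ram, sizeof l->words);
}

/* Write a whole cached copy back over the line in RAM. */
static void line_writeback(const struct cache_line *l, unsigned *ram)
{
    memcpy(ram, l->words, sizeof l->words);
}

/* Two aliases of different color cache the same line.  B writes the
 * first word, A writes the last word, then both lines are written
 * back in the given order.  Returns how many of the two writes are
 * visible in RAM afterwards. */
static int aliased_writes_surviving(int a_written_back_last)
{
    unsigned ram[LINE_WORDS] = { 0 };
    struct cache_line a, b;

    assert(a_written_back_last == 0 || a_written_back_last == 1);
    line_fill(&a, ram);
    line_fill(&b, ram);
    b.words[0] = 0xB;                /* Process B: first word */
    a.words[LINE_WORDS - 1] = 0xA;   /* Process A: last word  */
    if (a_written_back_last) {
        line_writeback(&b, ram);
        line_writeback(&a, ram);
    } else {
        line_writeback(&a, ram);
        line_writeback(&b, ram);
    }
    return (ram[0] == 0xB) + (ram[LINE_WORDS - 1] == 0xA);
}
```

In this model only one of the two writes survives whichever line is written
back last, because the later write-back carries a stale copy of the other
word.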

What it looks like to me is that SHMLBA needs to be extended to normal
mmappings, making all pages in user space
(page->index << PAGE_SHIFT) % SHMLBA
virtually aligned.

And whenever we access a page in the page cache that is not
appropriately virtually aligned in the fixed kernel mapping,
we can use the kmap infrastructure to map it to a better kernel
location.  If we reuse the same optimizations from flush_dcache_page
it shouldn't be any worse, and in the pathological cases it will be
faster, while removing the races seen above.
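
A minimal sketch in C of the test this implies for the fixed kernel mapping.
The PAGE_SHIFT of 12, the SHMLBA value of 256K and the helper names
page_color and kernel_mapping_aligned are all assumptions for illustration;
the real SHMLBA is architecture specific.

```c
#include <assert.h>

#define PAGE_SHIFT 12            /* assumption: 4K pages */
#define SHMLBA     (1UL << 18)   /* assumption: 256K alignment unit */

/* Hypothetical helper: the alignment class a page cache page should
 * have, computed from its index as in the expression above. */
static unsigned long page_color(unsigned long index)
{
    return (index << PAGE_SHIFT) % SHMLBA;
}

/* Hypothetical helper: nonzero if the page's address in the fixed
 * kernel mapping already has that alignment, so no kmap-style remap
 * to a better kernel location would be needed. */
static int kernel_mapping_aligned(unsigned long kaddr, unsigned long index)
{
    assert((kaddr & ((1UL << PAGE_SHIFT) - 1)) == 0);   /* page aligned */
    return kaddr % SHMLBA == page_color(index);
}
```

Only pages for which kernel_mapping_aligned() is false would take the slower
path of being remapped before the kernel touches them.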

Any thoughts?

Eric



Re: Q: Linux rebooting directly into linux.

2001-01-18 Thread Eric W. Biederman

Werner Almesberger [EMAIL PROTECTED] writes:


  I agree writing the code to understand the table may be a significant
  issue.  On the other hand I still think it is worth a look, being
  able to unify option parsing for multiple platforms is not a small
  gain, nor is getting out from short sighted vendor half standards.

 
 Well, you certainly have a point where stupid vendors and BIOS nonsense
 are concerned. However, if we ignore LinuxBIOS for a moment, each
 platform already has a set of configuration parameter passing conventions
 imposed by the firmware. So we need to be able to handle this anyway, and
 most of the information is highly platform-specific.

Well, I never intended for my UBE stuff to handle probeable
information.  After thinking about it, it does seem reasonable to
say that a table in a firmware rom (or generated by one) is as
probeable as a table in a device rom.

 LinuxBIOS is a special case, because you have your own firmware. But
 what you're suggesting is basically yet another parameter format, which
 needs to incorporate and possibly unify much of the information
 contained in all those platform-specific formats. I'm not sure it's worth
 the effort.

Well I half agree.  I think where I'm going to go is to propose some
new BIOS tables, as there are some truly broken platforms out there.
In particular on alpha you can't even build a variant motherboard
where the only change is the connection of interrupts to PCI slots
without needing a kernel patch.

 Agreed with BIOS bugs ;-) Where probing is possible, is it reliable ?

I hereby define probing as only being possible where you get reliable
results.  Thus PCI is included, pnp-ISA probably is, and straight-ISA
is not.  

The one thing I am most against is having to make BIOS calls.  It is
entirely too easy for a firmware constructor to be in a rush and mess
it up, and to crash the whole boot process.

 It'd take some baroque BIOS parameter table over yet another mandatory
 boot command line parameter any time ...

Definitely.

 
  Hmm. I wonder how hard it would be to add -fPIC to the compilation
  line for that file.  But I'm not certain that would do what I want
  in this instance...
 
 Are there actually architectures where the compiler generates
 position-dependent code even if you're careful ? (I.e. all functions
 inlined, only auto variables.)

O.k. I have looked (I'm just polishing up my port to alpha).  And yes
this can happen.  It is not so much the code being position
dependent as the code depending on the relative positions of the text
and data segments.  On the alpha there is a pointer to a globals area and
even using sufficiently large constants is enough to cause an access
to a static variable.

As for always having all functions inline and using only auto
variables, and no string constants, that is just asking for trouble.
When something goes wrong it is way too tempting to insert a bit of
debugging code and boom the code is broken.

Eric



Re: limit on number of kmapped pages

2001-01-23 Thread Eric W. Biederman

David Wragg [EMAIL PROTECTED] writes:

 While testing some kernel code of mine on a machine with
 CONFIG_HIGHMEM enabled, I've run into the limit on the number of pages
 that can be kmapped at once.  I was surprised to find it was so low --
 only 2MB/4MB of address space for kmap (according to the value of
 LAST_PKMAP; vmalloc gets a much more generous 128MB!).

kmap is for quick transitory mappings.  kmap is not for permanent mappings.
At least that was my impression.  The persistence is intended to just
kill error prone cases.

 My code allocates a large number of pages (4MB-worth would be typical)
 to act as a buffer; interrupt handlers/BHs copy data into this buffer,
 then a kernel thread moves filled pages into the page cache and
 replaces them with newly allocated pages.  To avoid overhead on
 IRQs/BHs, all the pages in the buffer are kmapped.  But with
 CONFIG_HIGHMEM if I try to kmap 512 pages or more at once, the kernel
 locks up (fork() starts blocking inside kmap(), etc.).

This may be a reasonable use, I'm not certain.  It wasn't the application
kmap was designed to deal with though...
 
 There are ways I could work around this (either by using kmap_atomic,
 or by adding another kernel thread that maintains a window of kmapped
 pages within the buffer).  But I'd prefer not to have to add a lot of
 code specific to the CONFIG_HIGHMEM case.

Why do you need such a large buffer?  And why do the pages need to be kmapped?
If you are doing dma there is no such requirement...  And unless you are
running on something faster than a PCI bus I can't imagine why you need
a buffer that big.  My hunch is that it makes sense to do the kmap,
and the i/o in the bottom_half.  What is wrong with that?

kmap should be quick and fast because it is for transitory mappings.
It shouldn't be something whose overhead you are trying to avoid.
If kmap is that expensive then kmap needs to be fixed, instead
of your code working around a perceived problem.

At least that is what it looks like from here.

Eric



Re: limit on number of kmapped pages

2001-01-24 Thread Eric W. Biederman

David Wragg [EMAIL PROTECTED] writes:

 I'd still like to know what the basis for the current kmap limit
 setting is.

Mostly at one point kmap_atomic was all there was.  It was only the
difficulty of implementing copy_from_user with kmap_atomic that convinced
people we needed something more.  So actually if we can kmap several
megabytes at once the kmap limit is quite high.

Eric




Re: [PATCH] vma limited swapin readahead

2001-01-31 Thread Eric W. Biederman

Marcelo Tosatti [EMAIL PROTECTED] writes:

 On Wed, 31 Jan 2001, Stephen C. Tweedie wrote:
 
  Hi,
  
  On Wed, Jan 31, 2001 at 01:05:02AM -0200, Marcelo Tosatti wrote:
   
   However, the pages which are contiguous on swap are not necessarily
   contiguous in the virtual memory area where the fault happened. That means
   the swapin readahead code may read pages which are not related to the
   process which suffered a page fault.
   
  Yes, but reading extra sectors is cheap, and throwing the pages out of
  memory again if they turn out not to be needed is also cheap.  The
  on-disk swapped pages are likely to have been swapped out at roughly
  the same time, which is at least a modest indicator of being of the
  same age and likely to have been in use at the same time in the past.
 
 You're throwing away pages from memory to do the readahead. 
 
 This pages might be more useful than the pages which you're reading from
 swap. 

Possibly.  However the win (lower latency) from getting swapin
readahead is probably even bigger.  And you are throwing out the least
desirable pages in memory.

  I'd like to see at lest some basic performance numbers on this,
  though.
 
 I'm not sure if limiting the readahead the way my patch does is a better
 choice, too.

A better choice is probably to make certain the read and write paths are in
sync so that you can know the readahead is going to do you some good.
This is a little tricky though.  

Unless you can see a big performance win somewhere please don't submit
this to Linus for inclusion.


Eric



Re: [PATCH] vma limited swapin readahead

2001-02-01 Thread Eric W. Biederman

David Gould [EMAIL PROTECTED] writes:

 Hmmm, arguably reading pages we do not want is a mistake. I should think that
 if a big performance win is required to justify a design choice, it should
 be especially required to show such a win for doing something that on its
 face is wrong.

The case for files has already been justified.
The performance gain of reading pages that are contiguous on disk has
been justified.
The only thing that has not been shown is that swap pages that
are used together are located near each other in swap.

As for design choices, simplicity, maintainability and
comprehensibility tend to be more important than absolute
performance.  This lets bugs be fixed, and the big changes that tend
to be the biggest wins happen.

 I am skeptical of the argument that we can win by replacing "the least
 desirable" pages with pages were even less desireable and that we have
 no recent indication of any need for. It seems possible under heavy swap
 to discard quite a portion of the useful pages in favor of junk that just
 happenned to have a lucky disk address.

I won't argue that.  My gut just says we should work to improve the
disk addresses, so it isn't luck. ;)  And only if we fail in that
should we hack up the efficient, simple policy that we have for
reading disk data in.

Of course since I'm not actually writing the code at the moment
this is all hot air :)

Eric




Re: [Question] Explanation of zero-copy networking

2001-05-13 Thread Eric W. Biederman

Jamie Lokier [EMAIL PROTECTED] writes:

 Richard B. Johnson wrote:
  However, PCI to memory copying runs at about 300 megabytes per
  second on modern PCs and memory to memory copying runs at over 1,000
  megabytes per second. In the future, these speeds will increase.
 
 That would be big expensive modern PCs then.  Our clusters of 700MHz
 boxes are strictly limited to 132 megabytes per second over PCI...

300 Megabytes per second is definitely an odd number for a PCI bus.
But 132 Megabytes per second is actually high, the continuous burst
speeds are:
32bit 33MHz: 33*1000*1000*32/(1024*1024*8) = 125.8 Megabytes/second
64bit 33MHz: 33*1000*1000*64/(1024*1024*8) = 251.7 Megabytes/second
32bit 66MHz: 66*1000*1000*32/(1024*1024*8) = 251.7 Megabytes/second
64bit 66MHz: 66*1000*1000*64/(1024*1024*8) = 503.5 Megabytes/second
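
The same calculation as a small C function, for anyone who wants to try other
bus widths or clocks.  This is a sketch: the function name
pci_burst_mib_per_s is made up, the clock is taken as a whole number of MHz
as in the figures above, and a megabyte is 1024*1024 bytes.

```c
#include <assert.h>

/* Hypothetical helper: peak burst rate in megabytes (2^20 bytes) per
 * second, assuming one full-width transfer on every clock cycle. */
static double pci_burst_mib_per_s(unsigned bus_bits, unsigned clock_mhz)
{
    double bytes_per_s;

    assert(bus_bits % 8 == 0);
    bytes_per_s = (double)clock_mhz * 1000.0 * 1000.0 * (bus_bits / 8);
    return bytes_per_s / (1024.0 * 1024.0);
}
```

pci_burst_mib_per_s(32, 33) evaluates to roughly 125.89, and doubling either
the width or the clock doubles the result.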

The possibility of getting a continuous burst is actually low; if
nothing else you have an interrupt acknowledgement 100 times per
second.  But if you are pushing the bus it should deliver close to
its burst potential.  But the ISA traffic doing subtractive decode
can be nasty because you get 4 PCI cycles before you even get
acknowledgement from the PCI/ISA bridge that there is something to
transfer to.

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Possible PCI subsystem bug in 2.4

2001-05-13 Thread Eric W. Biederman

Maciej W. Rozycki [EMAIL PROTECTED] writes:

 On 4 May 2001, Eric W. Biederman wrote:
 
  The example that sticks out in my head is we rely on the MP table to
  tell us if the local apic is in pic_mode or in virtual wire mode.
  When all we really have to do is ask it.
 
  You can't.  IMCR is write-only and may involve chipset-specific
 side-effects.  Then even if IMCR exists, a system's firmware might have
 chosen the virtual wire mode for whatever reason (e.g. broken hardware). 

Admittedly you can't directly detect IMCR state.  But
triggering an interrupt on the bootstrap processor local apic, and
failing to receive it should be proof the IMCR is at work.
Alternatively, if I'm wrong about the wiring, disabling all interrupts
at the apic level and receiving one is a second proof that IMCR is at
work.  Further, I don't think a processor with an onboard apic works
with an IMCR register.

What I was thinking of earlier is that you can detect an apic or
ioapic in virtual wire mode, which the current code and the intel MP
spec treat as the opposite possibility.

Eric






Re: PATCH: Enable IP PNP for 2.4.4-ac8

2001-05-13 Thread Eric W. Biederman

H . J . Lu [EMAIL PROTECTED] writes:

 On Fri, May 11, 2001 at 04:28:05PM -0700, David S. Miller wrote:
  
  H . J . Lu writes:
2.4.4-ac8 disables IP auto config by default even if CONFIG_IP_PNP is
defined.  Here is a patch.
  
  It doesn't make any sense to enable this unless parameters are
  given to the kernel via the kernel command line or from firmware
  settings.
 
 From Configure.help:
 
 IP: kernel level autoconfiguration
 CONFIG_IP_PNP
   This enables automatic configuration of IP addresses of devices and
   of the routing table during kernel boot, based on either information
   supplied on the kernel command line or by BOOTP or RARP protocols.
   You need to say Y only for diskless machines requiring network 
   access to boot (in which case you want to say Y to Root file system
   on NFS as well), because all other machines configure the network 
   in their startup scripts.
 
 It works fine for 2.4.4. However, in 2.4.4-ac8, even if I select
 CONFIG_IP_PNP, I have to pass ip= to kernel, in addition to
 nfsroot=x.x.x.x:/foo/bar. With 2.4.4, I can just pass
 nfsroot=x.x.x.x:/foo/bar to kernel.

O.k. Configure.help needs to be updated. ip=on or ip=bootp or
ip=dhcp work fine.  I wonder if I forgot to forward port the docs?

This same situation exists for 2.2.18  2.2.19 as well.

The only way to get long term stability out of this is to write
a user space client you can put in a ramdisk.  One of these days...

Eric



Re: LANANA: To Pending Device Number Registrants

2001-05-17 Thread Eric W. Biederman

Daniel Phillips [EMAIL PROTECTED] writes:

 On Tuesday 15 May 2001 23:20, Nicolas Pitre wrote:
  Personally, I'd really like to see /dev/ttyS0 be the first detected
  serial port on a system, /dev/ttyS1 the second, etc.
 
 There are well-defined rules for the first four on PC's.  The ttySx 
 better match the labels the OEM put on the box.

Actually it would be better to have the OEM put a label in the
firmware, and then have a way to query the device for its label.

The legacy rules are nice but serial ports are done with superio chips
now.  And superio chips are almost all ISA PNP chips without device
enumeration, and isolation. 

Eric




Re: PATCH: Enable IP PNP for 2.4.4-ac8

2001-05-13 Thread Eric W. Biederman

H . J . Lu [EMAIL PROTECTED] writes:

 On Sun, May 13, 2001 at 01:24:18PM -0600, Eric W. Biederman wrote:
  H . J . Lu [EMAIL PROTECTED] writes:
  
   It doesn't make any sense. When I specify CONFIG_IP_PNP and
   BOOTP/DHCP, I want a kernel with IP config using BOOTP/DHCP. I would
   expect IP config to be turned on for BOOTP/DHCP by default. You can turn
   it off by passing ip=off to kernel. Did I miss something?
  
  Since you have to set the command line anyway ip=dhcp is no extra
  burden and it lets you use the same kernel to boot off the hard drive etc.
 
 Why do I have to set ip=dhcp? If I have selected CONFIG_IP_PNP and
 DHCP in my kernel configuration, should it be on by default?

I agree it isn't intuitive, and if nfsroot=xxx is specified it should
probably turn on if there is missing information.

But if you have to select the command line anyway

Mostly I like the situation where I can compile it in and turn it on
when I need it, instead of having to do things differently depending on
whether it is compiled in or not.

ip=on is all it really takes.

This same situation exists for 2.2.18 and 2.2.19 as well.

The only way to get long term stability out of this is to write
a user space client that you can put in a ramdisk.  One of these days...
   
   It doesn't work with diskless machines which don't support ramdisk
   during boot.
  
  I don't believe that is a real world situation.
  
  I boot diskless all of the time and supporting a ramdisk is trivial.  You
  just have a program that slaps a kernel, a ramdisk, and some command
  line arguments into a single image, along with a touch of adapter code
  to set the kernel parameters correctly and then boot that.
 
 Let me guess. Your diskless machines are mostly x86.

Mostly, but not exclusively.

 Have you tried
 ramdisk on diskless alpha, arm, m68k, mips, ppc, sh, sparc, booting
 over network?

First, the booting situation on Linux with respect to multiple platforms
sucks.  We pass parameters in weird ways on every platform.  The command
line is the only thing that stays mostly the same.  I'm looking at what
it takes to clean that up, so we can have multiplatform bootloaders.

I have implemented what it takes to attach a ramdisk, and if you can
boot an arbitrary kernel it isn't hard to have a program that attaches
a ramdisk.  

Now although I believe this is the right direction to go, you will
notice I ported the dhcp IP auto configuration from 2.2.19 to 2.4.x,
buying a little more time to get this working.

Eric



Re: LANANA: To Pending Device Number Registrants

2001-05-20 Thread Eric W. Biederman

Jonathan Lundell [EMAIL PROTECTED] writes:

 At 10:42 AM +0200 2001-05-19, Kai Henningsen wrote:
   Jeff Garzik's ethtool
   extension at least tells me the PCI bus/dev/fcn, though, and from
   that I can write a userland mapping function to the physical
   location.
 
  I don't see how PCI bus/dev/fcn lets you do that.
 
 I know from system documentation, or can figure out once and for all 
 by experimentation, the correspondence between PCI bus/dev/fcn and 
 physical locations. Jeff's extension gives me the mapping between 
 eth# and PCI bus/dev/fcn, which is not otherwise available (outside 
 the kernel).

Just a second, let me re-enumerate your PCI buses and change all of the bus
numbers.  Not that this is a bad thought.  It is just that you need to know
the tree of PCI buses/bridges up to the root on the machine in question.
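
To make that concrete, here is a rough sketch of naming a device by its
path of slot/function hops from the root instead of by bus number.  The
struct pci_hop type and pci_topo_path() are made up for illustration and
are not kernel interfaces; the point is only that the path leaves out the
bus numbers, which can change when the buses are re-enumerated.

```c
#include <stdio.h>
#include <stddef.h>

/* One hop on the way from the root bus to the device (hypothetical type). */
struct pci_hop {
    unsigned dev;   /* device (slot) number on the parent bus, 0-31 */
    unsigned fn;    /* function number, 0-7 */
};

/*
 * Format the hops as "dev.fn/dev.fn/...".  Returns 0 on success and -1 if
 * buf is too small.  Bus numbers are deliberately absent: they are assigned
 * during enumeration, while the chain of slots through the bridges is fixed
 * by the hardware.
 */
int pci_topo_path(const struct pci_hop *hops, size_t nhops,
                  char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0)
        return -1;
    buf[0] = '\0';
    for (size_t i = 0; i < nhops; i++) {
        int n = snprintf(buf + used, len - used, "%s%02x.%x",
                         i ? "/" : "", hops[i].dev, hops[i].fn);
        if (n < 0 || (size_t)n >= len - used)
            return -1;      /* output would have been truncated */
        used += (size_t)n;
    }
    return 0;
}
```

A userland mapping table keyed on such a path would keep working after a
renumbering, as long as the cards stay in the same slots.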

Eric



Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace

2001-05-19 Thread Eric W. Biederman

Ben LaHaise [EMAIL PROTECTED] writes:

 Hey folks,
 
 The work-in-progress patch for-demonstration-purposes-only below consists
 of 3 major components, and is meant to start discussion about the future
 direction of device naming and its interaction with the block layer.  The main
 motivations here are the wasting of minor numbers for partitions, and the
 duplication of code between user and kernel space in areas such as
 partition detection, uuid location, lvm setup, mount by label, journal
 replay, and so on...

 
 1. Generic lookup method and argument parsing (fs/lookupargs.c)
 
   This code implements a lookup function which is for demonstration
   purposes used in fs/block_dev.c.  The general idea is to pass
   additional parameters to device drivers on open via a comma
   separated list of options following the device's name.  Sample
   uses:
 
   /dev/sda/raw                     - open sda in raw mode.
   /dev/sda/limit=102400            - open sda with a limit of 100K
   /dev/sda/offset=1024,limit=2048  - open a device that gives a view of sda
                                      at an offset of 1KB to 2KB

GAhh!!

Ben please think /proc/sys.  One value per ``file''.

 3. Userspace partition code proposal
 
   Given the above two bits, here's a brief explanation of a
   proposal to move management of the partitioning scheme into
   userspace, along with portions of raid startup, lvm, uuid and
   mount by label code needed for mounting the root filesystem.
 
   Consider that the device node currently known as /dev/hda5 can
   also be viewed as /dev/hda at offset 512000 with a limit of 10GB.
   With the extensions in fs/block_dev.c, you could replace /dev/hda5
   with /dev/hda/offset=512000,limit=1024000.  Now, by putting
   the partition parsing code into a libpart and binding mount to a
   libpart, the root filesystem mounting code can be run out of an
   initrd image.  The use of mount gives us the ability to mount
   filesystems by UUID, by label or other exotic schemes without
   having to add any additional code to the kernel.

But you need to use uclibc or a similar library to get the code size down
small enough, so you don't quadruple the size of your boot image.

As for wasting minors: if you are going to rework partitions they
should have dynamic device numbers that are assigned when the
partition is discovered by the system.   I admit a hot-plug partition
sounds incongruous but it should be fairly simple to implement.

If your real root is on a ``hot-plug'' device then it does look
like you need an initrd to help select your root partition.  Hmm. The
code is simple enough that code in the kernel shouldn't be bad.  And the
interface can be simple as well.

Have:
/dev/sda/partitions/1
/dev/sda/partitions/2
/dev/sda/partitions/3
/dev/sda/partitions/4
/dev/sda/partitions/5
and also
/dev/sda/partitions/1/uuid
/dev/sda/partitions/1/label
/dev/sda/partitions/1/offset
/dev/sda/partitions/1/limit

to expose what the kernel found in its initial scan of the partitions.

For creating partitions you might want to do:
echo 1024 2048 > /dev/sda/newpartition
Though if you could do it with create, and writes
to offset and limit, that would be a little nicer.

Al, would it work to have the lookup method for /dev/sda automatically
mount an instance of scsifs on /dev/hda (from an internal mount), and
then have dput drop that mount?  I skimmed the code and it looks
possible.

Soft mounting a fs isn't strictly necessary for the case above, but
it looks simplest to keep the list of partitions permanently in the
dcache.  We would also need to modify permission to take a vfsmnt
argument so your permissions to a device file could vary depending on
which device file you start with.

Eric






Re: PATCH: Enable IP PNP for 2.4.4-ac8

2001-05-13 Thread Eric W. Biederman

H . J . Lu [EMAIL PROTECTED] writes:

 It doesn't make any sense. When I specify CONFIG_IP_PNP and
 BOOTP/DHCP, I want a kernel with IP config using BOOTP/DHCP. I would
 expect IP config to be turned on for BOOTP/DHCP by default. You can turn
 it off by passing ip=off to kernel. Did I miss something?

Since you have to set the command line anyway ip=dhcp is no extra
burden and it lets you use the same kernel to boot off the hard drive etc.

  This same situation exists for 2.2.18 and 2.2.19 as well.
  
  The only way to get long term stability out of this is to write
  a user space client that you can put in a ramdisk.  One of these days...
 
 It doesn't work with diskless machines which don't support ramdisk
 during boot.

I don't believe that is a real world situation.

I boot diskless all of the time and supporting a ramdisk is trivial.  You
just have a program that slaps a kernel, a ramdisk, and some command
line arguments into a single image, along with a touch of adapter code
to set the kernel parameters correctly and then boot that.

Eric



Re: PATCH: Enable IP PNP for 2.4.4-ac8

2001-05-15 Thread Eric W. Biederman

David Woodhouse [EMAIL PROTECTED] writes:

 [EMAIL PROTECTED] said:
   There wasn't even DHCP support before so yes you did.   As you can't
  get the nfs mount point from bootp. 
 
 Wasn't there a default? The Indy behind me seems to try to mount
 /tftpboot/172.16.18.195, so I put a filesystem there just to make it happy.
 
 It's a 2.4.3 kernel.

Duh.  I forgot about the default path.

   Well I think in the CONFIG_BLK_DEV=n case it might wind up being a
  ramfs or tmpfs image.  Something like a simplified version of tar. 
 
 Well, if it stops working and stays broken, I suppose I'll just have to 
 hack up a built-in command line option. ISTR ARM already has such an option.
 
 I'd rather it didn't break, though.

The clean way to handle it, and I'll take a look, is to have
root=/dev/nfs (and the rdev equivalent) set ip=on if it isn't
already.  The current 2.4.4 behavior of root=/dev/hda3 doing ip
autoconfig when the code is compiled into the kernel is just bad.
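
Roughly, the decision I have in mind looks like the sketch below.  This
is not the code in the tree; enum ip_opt and ip_autoconfig_wanted() are
names invented here to show the rule: an explicit ip= always wins, and
without one only an NFS root implies autoconfiguration.

```c
#include <string.h>

/* Tri-state for the ip= option as seen on the command line (hypothetical). */
enum ip_opt { IP_UNSET, IP_OFF, IP_ON };

/*
 * Decide whether IP autoconfiguration should run.
 * An explicit ip= setting always wins; otherwise autoconfig is implied
 * only when the root device is NFS, so root=/dev/hda3 stays quiet.
 */
int ip_autoconfig_wanted(enum ip_opt ip, const char *root)
{
    if (ip != IP_UNSET)
        return ip == IP_ON;
    return root != NULL && strcmp(root, "/dev/nfs") == 0;
}
```

With that rule nfsroot users get the old behavior back, and a kernel
with the code compiled in can still boot off a local disk untouched.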

Eric



Re: DVD blockdevice buffers

2001-05-25 Thread Eric W. Biederman

Stephen C. Tweedie [EMAIL PROTECTED] writes:

 Hi,
 
 On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:
  
  On Wed, 23 May 2001, Stephen C. Tweedie wrote:
    that the filesystems already do. And you can do it a lot _better_ than the
    current buffer-cache-based approach. Done right, you can actually do all
    IO in page-sized chunks, BUT fall down on sector-sized things for the
    cases where you want to.
  
   Right, but you still lose the caching in that case.  The write works,
   but the cache becomes nothing more than a buffer.
  
  No. It is still cached. You find the buffer with page->buffers, and when
  all of them are up-to-date (whether from read-in or from having written
  to them all), you just mark the whole page up-to-date.
 
 It works, but *only* if the application writes a whole page worth of
 data.  From the previous emails I had the understanding that this
 application is writing small data items in random 512-byte blocks.  It
 is not writing the rest of the page.  The page never becomes uptodate.
 That in itself isn't a problem, but readpage() can't tell the
 underlying layers that only a part of the page is wanted, so there's
 no way to tell readpage that the page is in fact partially uptodate.
 
 And just telling the application to write the rest of the page too
 isn't going to cut it, because the rest of the page may contain other
 objects which aren't in cache so we can't write them without first
 reading the page.  The only alternative is to change the on-disk
 layout, forcing a minimum PAGESIZE on the IO chunks.
 
  This _works_. Try it on ext2 or NFS today.
 
 Not for this workload.  Now, maybe it's not an interesting workload.
 But shifting the uptodate granularity from buffer to page sized _does_
 impact the effectiveness of the cache for such an application. 
 
  So in short: the page cache supports _today_ all the optimizations.
 
 For write, perhaps; but for subsequent read, generic_read_page
 doesn't see any of the data in the page unless the whole page has been
 written.

generic_read_page???

block_read_full_page seems to handle this correctly.  At least
with respect to keeping the data around, and not doing the I/O
on data we already have.  But it still reads in the unpopulated
parts of the page even if it is unnecessary.

The case we don't get quite right is partial reads that hit cached
data on a page that doesn't have PG_uptodate set.  We don't actually
need to do the I/O on the surrounding page to satisfy the read
request.  But we do because generic_file_read doesn't even think about
that case.

For the small random read case we could use a
mapping->a_ops->readpartialpage
function that sees if a request can be satisfied entirely
from cached data.  But this is just to allow generic_file_read
to handle this case.

Eric



Re: DVD blockdevice buffers

2001-05-25 Thread Eric W. Biederman

Linus Torvalds [EMAIL PROTECTED] writes:

 On 25 May 2001, Eric W. Biederman wrote:
  
  For the small random read case we could use a
  mapping->a_ops->readpartialpage
 
 No, if so I'd prefer to just change readpage() to take the same kinds of
 arguments commit_page() does, namely the beginning and end of the read
 area. 

No.

I obviously picked a bad name, and a bad place to start.

int data_uptodate(struct page *page, unsigned offset, unsigned len)

This is really an extension to PG_uptodate, not readpage.  It should
never ever do any I/O.  It should just implement a check to see
if we have all of the data wanted already in the page in the page
cache.  As simply a buffer checking entity it will likely share
virtually no code with readpage.
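
A minimal sketch of the kind of check I mean, outside the kernel.  The
struct page_state below is a made-up stand-in for struct page plus its
buffers, and the 4096-byte page and 512-byte block sizes are assumptions
for the example; only the data_uptodate name and argument order follow
the prototype above.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES  4096u   /* assumed page size */
#define BLOCK_SIZE_BYTES 512u    /* assumed block size, 8 blocks per page */

/* Hypothetical stand-in for struct page plus its per-block buffer state. */
struct page_state {
    int uptodate;           /* whole page valid (plays the role of PG_uptodate) */
    uint8_t block_valid;    /* bit i set => block i holds valid data */
};

/*
 * Return 1 if bytes [offset, offset+len) are all present in the cache,
 * 0 otherwise.  Never starts I/O; it only inspects state.
 */
int data_uptodate(const struct page_state *page, unsigned offset, unsigned len)
{
    unsigned first, last, i;

    /* Reject empty or out-of-page ranges. */
    if (len == 0 || offset >= PAGE_SIZE_BYTES || len > PAGE_SIZE_BYTES - offset)
        return 0;
    if (page->uptodate)
        return 1;           /* fast path: the whole page is valid */

    /* Slow path: every block touched by the range must be valid. */
    first = offset / BLOCK_SIZE_BYTES;
    last = (offset + len - 1) / BLOCK_SIZE_BYTES;
    for (i = first; i <= last; i++)
        if (!(page->block_valid & (1u << i)))
            return 0;
    return 1;
}
```

The read path would call this before deciding to start I/O on the page;
if it returns 1 the data can be copied out directly.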

 Filesystems could choose to ignore the arguments completely, and just act
 the way they already do - filling in the whole page.
 
 OR a filesystem might know that the page is partially up-to-date (because
 of a partial write), and just return an immediate "this area is already
 uptodate" return code or something. Or it could even fill in the page
 partially, and just unlock it (but not mark it up-to-date: the reader then
 has to wait for the page and then look at PG_error to decide whether the
 partial read succeeded or not).

First, mm/filemap.c has generic cache management, so it should make the
decision.

The logic is: does this page have the data in cache?
If so just return it.

Otherwise read all that you can at once.

So we either want a virtual function that can make the decision on
a per-filesystem basis whether we have the data we need in the page cache,
or we need to convert the buffer_head into a more generic entity
so everyone can use it.

 I don't think it really matters, I have to say. It would be very easy to
 implement (all the buffer-based filesystems already use the common
 fs/buffer.c readpage, so it would really need changes in just one place,
 along with some expanded prototypes with ignored arguments in some other
 places).
 
 But it _could_ be a performance helper for some strange loads (write a
 partial page and immediately read it back - what a stupid program), and
 more importantly Al Viro felt earlier that a partial read approach might
 help his metadata-in-page-cache stuff because metadata tends to sometimes
 be scattered wildly across the disk.

Maybe.  I think despite the similarities (partial pages) Al and I are
looking at two entirely different problems.

 So then we'd have
 
   int (*readpage)(struct file *, struct page *, unsigned offset, unsigned len);
 
 
 and the semantics would be:
  - the function needs to start IO for _at_least_ the page area
    [offset, offset+len[
  - return error code for _immediate_ errors (ie not asynchronous)
  - if there was an asynchronous read error, we set PG_error
  - if the page is fully populated, we set PG_uptodate
  - if the page was not fully populated, but the partial read succeeded,
    the filesystem needs to have some way of keeping track of the partial
    success (page->buffers is obviously the way for a block-based one),
    and must _not_ set PG_uptodate.
  - after the asynchronous operation (whether complete, partial or
    unsuccessful), the page is unlocked to tell the reader that it is done.
 
 Now, this would be coupled with:
  - generic_file_read() does the read-ahead decisions, and may decide that
    we really only need a partial page.
 
 But NOTE! The above is meant to potentially avoid unnecessary IO and thus
 speed up the read-in. HOWEVER, it _will_ slow down the case where we first
 would read a small part of the page and then soon afterwards read in the
 rest of the page. I suspect that is the common case by far, and that the
 current whole-page approach is the faster one in 99% of all cases. So I'm
 not at all convinced that the above is actually worth it.

I don't want partial I/O at all.  And I always want to see reads
reading in all of the data for a page.  I just want an interface
where we can say "hey, we don't actually have to do any I/O for this
read request, give them back their data."

 If somebody can show that the above is worth it and worth implementing (ie
 the Al Viro kind of "I have a real-life scenario where I'd like to use
 it"), and implements it (should be a fairly trivial exercise), then I'll
 happily accept new semantics like this.
 
 But I do _not_ want to see another new function (partialread()), and I
 do _not_ want to see synchronous interfaces (Al's first suggestion).

My naming mistake.  I don't want to see this logic combined with
readpage.  That is an entirely different case.

I can't see how adding a slow case to PageUptodate to check for a
partially uptodate page could hurt our performance.  And I can imagine
how it could help.

Eric

Re: getting include-files from arch/arch/subdir

2000-10-24 Thread Eric W. Biederman

"Heusden, Folkert van" [EMAIL PROTECTED] writes:

 Hi,
 
 ADC why not 
 ADC #include <arch/i386/etc.h>
 ADC Amit
 
 Since that is not cross-platform. I like a solution which does the #include
 transparently for alpha/i386/etc.

Umm. Then the include file should probably rest under the include hierarchy.
Say: #include <asm/i386/etc.h>

That makes it clear the code is exported to someone else...
Going down into the arch tree looks ugly.

Eric



Re: [PATCH] Prevent OOM from killing init

2001-03-22 Thread Eric W. Biederman

Rik van Riel [EMAIL PROTECTED] writes:

 On Wed, 21 Mar 2001, Patrick O'Rourke wrote:
 
  Since the system will panic if the init process is chosen by
  the OOM killer, the following patch prevents select_bad_process()
  from picking init.
 
 One question ... has the OOM killer ever selected init on
 anybody's system ?
 
 I think that the scoring algorithm should make sure that
 we never pick init, unless the system is screwed so badly
 that init is broken or the only process left ;)

Is there ever a case where killing init is the right thing to do?
My impression is that if init is selected the whole machine dies.
If you can kill init and still have a machine that mostly works,
then I guess it makes some sense not to kill it.

Guaranteeing not to select init can buy you peace of mind because
init, if properly set up, can put the machine back together again, while
not special casing init means something weird might happen and init
would be selected.
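
In other words, the special case amounts to something like the sketch
below.  This is not the patch and not the real select_bad_process();
struct proc and its badness field are simplified stand-ins invented for
the example.

```c
#include <stddef.h>

/* Hypothetical, simplified process record. */
struct proc {
    int pid;
    unsigned long badness;  /* higher means a better candidate to kill */
};

/*
 * Pick the highest-scoring process, but never init (pid 1), no matter
 * how it scores.  Returns NULL if there is no other candidate.
 */
const struct proc *select_victim(const struct proc *procs, size_t n)
{
    const struct proc *victim = NULL;

    for (size_t i = 0; i < n; i++) {
        if (procs[i].pid == 1)
            continue;       /* the explicit guarantee: init is exempt */
        if (victim == NULL || procs[i].badness > victim->badness)
            victim = &procs[i];
    }
    return victim;
}
```

The scoring can still be as clever as you like; the guarantee just does
not depend on the scoring getting it right.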

Eric



Re: [PATCH] Prevent OOM from killing init

2001-03-22 Thread Eric W. Biederman

Guest section DW [EMAIL PROTECTED] writes:

 On Wed, Mar 21, 2001 at 08:48:54PM -0300, Rik van Riel wrote:
  On Wed, 21 Mar 2001, Patrick O'Rourke wrote:
 
   Since the system will panic if the init process is chosen by
   the OOM killer, the following patch prevents select_bad_process()
   from picking init.
 
 There is a dozen other processes that must not be killed.
 Init is just a random example.

Not killing init provides enough for recovery if you truly hit
an out of memory situation.  With 2.4.x at least it is a box
misconfiguration that causes it.   The 2.2.x VM doesn't always try
to swap, and free things up hard enough, before reporting out of
memory.  But even the 2.2.x problems are rare.

 
  One question ... has the OOM killer ever selected init on
  anybody's system ?
 
 Last week I installed SuSE 7.1 somewhere.
 During the install: "VM: killing process rpm",
 leaving the installer rather confused.
 (An empty machine, 256MB, 144MB swap, I think 2.2.18.)

swap < RAM.  Ouch!  This is a misconfiguration on a machine that
actually starts swapping, and where out of memory problems are a
reality.  The fact an installer would trigger swapping on a 256MB
machine is a second problem.

 Last month I had a computer algebra process running for a week.
 Killed. But this computation was the only task this machine had.
 Its sole reason of existence.
 Too bad - zero information out of a week's computation.
 (I think 2.4.0.)

It looks like you didn't have enough resources on that machine
period.  I pretty much trust 2.4.x in this department.  Did that
machine also have its swap misconfigured?

 
 Clearly, Linux cannot be reliable if any process can be killed
 at any moment. I am not happy at all with my recent experiences.

Hmm.  It should definitely not be at any moment.  It should only be
when resources are exhausted.  So putting enough swap on a machine
should be enough, to stop this from ever happening.

Eric



Q: Linux rebooting directly into linux.

2000-11-09 Thread Eric W. Biederman


I have recently developed a patch that allows linux to directly boot
into another linux kernel.  With the code freeze it appears
inappropriate to submit it at this time. 

Linus, in principle do you have any trouble with this kind of
functionality?

The immediate applications of this code, are:
- Clusters can network boot over arbitrary network
  interfaces, and the network driver only needs to be written and
  maintained in one place.
- Multiplatform boot loaders can be written.
- The Linux kernel can be included in a boot ROM and you can still
  boot other linux kernels.
- Kernel developers can have a fast interface for booting into a
  development kernel.

The interface is designed to be simple and inflexible yet very
powerful.  To that end the code just takes an elf binary, and a
command line.  The started image also takes an environment generated
by the kernel of all of the unprobeable hardware details.

ELF was picked for its multiplatform support and the sheer simplicity
of its program header.  Plus you can use standard tools to generate
elf images fairly easily.

The environment passed to a loaded image is designed to expand and
handle new data types without breaking old decoders.  They just break
because they don't support the new hardware :)
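
To show why old decoders keep working, here is a sketch of a
self-describing record stream.  The layout (1-byte tag, 1-byte length,
payload, little-endian values), TAG_MEMSIZE and decode_env() are all
assumptions made up for this example; they are not the format used in
the patches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout: 1-byte tag, 1-byte length, then payload. */
enum { TAG_MEMSIZE = 1 };   /* the only tag this "old" decoder understands */

/*
 * Walk the records in env[0..len).  Returns the number of records
 * recognized, or -1 if a record runs past the end of the buffer.
 * Unknown tags are skipped by their declared length, which is what lets
 * an old decoder survive record types added later.
 */
int decode_env(const uint8_t *env, size_t len, uint32_t *memsize_kb)
{
    size_t pos = 0;
    int known = 0;

    while (pos < len) {
        uint8_t tag, rlen;

        if (len - pos < 2)
            return -1;              /* truncated record header */
        tag = env[pos];
        rlen = env[pos + 1];
        pos += 2;
        if (rlen > len - pos)
            return -1;              /* payload runs off the end */

        if (tag == TAG_MEMSIZE && rlen == 4) {
            /* assumed little-endian 32-bit value */
            *memsize_kb = (uint32_t)env[pos]
                        | ((uint32_t)env[pos + 1] << 8)
                        | ((uint32_t)env[pos + 2] << 16)
                        | ((uint32_t)env[pos + 3] << 24);
            known++;
        }
        pos += rlen;                /* known or not, step over the payload */
    }
    return known;
}
```

A decoder written this way ignores records it has never heard of, so
adding a new record type only affects the code that wants the new data.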

Linus, the path I envision is that this code gets integrated early in
2.5.  This includes cleaning up the boot paths so all our C code has
to deal with is this new format.  Then backporting the functionality
to 2.4 and possibly 2.2.

The kernel patches can be found in:
ftp://ftp.linuxnetworx.com/pub/kexec-patches-1.0.tar.gz
(This is a patchset with 4 patches
 1 Ingo Molnar's improved apic support
 2 My enhancements upon it so we restore the apics to their boot
   state when we shut down.
 3 My 2 line patch to make certain that in smp_send_stop
   the last cpu running is the boot cpu. (Required by the MP spec...)
 4 The code to support execing a new kernel. )

The code to generate an image bootable by this new syscall is in:
ftp://ftp.linuxnetworx.com/pub/mkelfImage-1.0.tar.gz
  (This is a perl script that takes a kernel and possibly a ramdisk
   and a command line and generates an elfimage suitable to be booted
   in this new infrastructure)

Eric

p.s. Linus the code is not included inline because I don't expect it to
be included just yet.



Re: Better testing of hardware (was: Defective Read Hat)

2000-11-21 Thread Eric W. Biederman

"Stephen Gutknecht (linux-kernel)" [EMAIL PROTECTED] writes:

 A Linux Kernel compile test does a really good job of testing the hard disk,
 RAM, and CPU... as it executes all types of instructions and the final
 output depends on all prior steps completing correctly.  On a really fast
 system (> 900MHz) it might make sense to run it twice, once to "warm up" the
 CPU and other components.  Most "benchmarks" just test speed, not the actual
 stability or data integrity (they write results to a device but don't check
 for data corruption, or they test only one device at a time, not all at
 once).

Also note that a Linux Kernel compile stresses memory because
of the very pointer loaded data structures of gcc.  This means that
memory corruption is most likely to flip a bit in a pointer, and cause
a bad pointer.

Eric



Re: Ext2 Performances

2000-11-21 Thread Eric W. Biederman

Aaron Sethman [EMAIL PROTECTED] writes:

 You might want to take a look at using reiserfs on the 130GB partition, as
 it is journalled and doesn't need to be fsck'ed.

No.

All journaling filesystems need to be fsck'ed.
A correctly operating one simply doesn't need to be fsck'ed because
of an unexpected loss of the operating system, which greatly reduces
the probability of needing it.  If an error is detected in the filesystem,
fsck is still what you have to do to correct it.

Eric



Re: LKCD from SGI

2000-11-24 Thread Eric W. Biederman

Peter Samuelson [EMAIL PROTECTED] writes:

 [Matt D. Robinson]
  Any way we can standardize 'make install' in the kernel?  It's
  disturbing to have different install mechanisms per platform ...
  I can make the changes for a few platforms.
 
 2.5 material, already on the todo list.

What is the thought on this?  There is an issue with different
boot loaders needing rather dramatically different formats...

Eric



Re: Booting AMD Elan520 without BIOS

2000-11-24 Thread Eric W. Biederman

Ronald G Minnich [EMAIL PROTECTED] writes:

 On Fri, 24 Nov 2000, I+D wrote:
 
  I'm trying to boot an AMD Elan520 board without bios
  with kernel 2.4.0-test10 configured for i486 and PCI direct access.
  This kernel boots correctly from HD using the bios provided with the 
  evaluation board but kernel 2.4.0-test8 and previous hang
  after "Ok booting the kernel".
 
 well, first I want your code for linuxbios :-)
 
  The last message I see is "Calibrating delay loop"
  (I see this thanks to the Jtag debugger for Elan520 because
  I haven't configured the VGA board yet).
 
 you don't have clock interrupts on. If you are able to single step you'll
 probably see it in the loop spinning on jiffies. This is one of our
 regular problems with a new mainboard.

This can also easily be a misconfiguration of the local apic.
It might need to be put into virtual wire mode.

Eric




Re: PROBLEM: crashing kernels

2000-11-25 Thread Eric W. Biederman

Alan Cox [EMAIL PROTECTED] writes:

  been compiled into the kernel, and not as a module) always gave the
  errors:
  
  eth0: Transmit timed out: status 0050  0090 at 134704418/134704432 
  eth0: Trying to restart the transmitter...
 
 Known problem. This one might be fixed in current 2.2.18pre. Some people
 see it, some don't

I have another data point on this problem.
I have seen it most with 2.4.0-test9.  But I'll look at 2.2.18pre.
I can trigger this bug fairly reliably by warm booting several times
in a row.  My code for warm booting Linux directly into Linux triggers this
one fairly reliably :)  Also putting another NIC in seems to help
trigger it as well.

The 2.4.0-testxxx watchdog seems eventually to handle this case 
but it takes 1/2 hour or so to actually kick in and reset the card.

Eric



Re: SCSI problem on aic7xxx on L440GX+ using LinuxBIOS

2000-12-01 Thread Eric W. Biederman

Ronald G Minnich [EMAIL PROTECTED] writes:

 Eric, here is the ksymoops (end of message) from that earlier failure. I'm
 just wondering if anyone out there has seen anything like this. Also, if
 anyone sees anything odd about the scsi configuration that would help too.
 
 Thanks in advance ...

Ron, vger.rutgers.edu died a couple of months ago.
vger.kernel.org is the new machine the linux kernel mailing list is on.
I'm forwarding this there.  I don't know how much help we can
get on a bug report against 2.4.0-test6 though.

Eric

 
 ron
 On 30 Nov 2000, Eric W. Biederman wrote:
 
  Ronald G Minnich [EMAIL PROTECTED] writes:
  
   This is 2.4.0-test6, on an L440GX, running linuxbios. The node comes up
   and appears to run fine:
   
   (scsi0) Adaptec AIC-7896/7 Ultra2 SCSI host adapter found at PCI 0/12/0
   (scsi0) Wide Channel A, SCSI ID=7, 32/255 SCBs
   (scsi0) Downloading sequencer code... 392 instructions downloaded
   (scsi1) Adaptec AIC-7896/7 Ultra2 SCSI host adapter found at PCI 0/12/1
   (scsi1) Wide Channel B, SCSI ID=7, 32/255 SCBs
   (scsi1) Downloading sequencer code... 392 instructions downloaded
   scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
  Adaptec AIC-7896/7 Ultra2 SCSI host adapter
   scsi1 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
  Adaptec AIC-7896/7 Ultra2 SCSI host adapter
   scsi : 2 hosts.
   (scsi0:0:1:0) Synchronous at 40.0 Mbyte/sec, offset 31.
 Vendor: QUANTUM   Model: ATLAS 10K 9SCARev: UCH0
 Type:   Direct-Access  ANSI SCSI revision: 03
   Detected scsi disk sda at scsi0, channel 0, id 1, lun 0
 Vendor: VA Linux  Model: Fullon 2x2Rev: 1.01
 Type:   Processor  ANSI SCSI revision: 02
   scsi : detected 1 SCSI disk total.
   SCSI device sda: hdwr sector= 512 bytes. Sectors= 17938986 [8759 MB] [8.8
   GB]
   Partition check:
sda: sda1 sda2 sda3
   .
   .
   .
   Welcome to Red Hat Linux
   Press 'I' to enter interactive startup.
   Mounting proc filesystem [  OK  ]
   Configuring kernel parameters [  OK  ]
   hwclock: Can't open /dev/tty1, errno=19: No such device.
   Setting clock  (utc): Thu Nov 30 23:07:43 /etc/localtime 2000 [  OK  ]
   Loading default keymap/etc/rc.d/rc.sysinit: /dev/tty0: No such device
   [FAILED]
   Activating swap partitions [  OK  ]
   Setting hostname rpc4 [  OK  ]
   Checking root filesystem
   /dev/sda1 contains a file system with errors, check forced.
   /dev/sda1: Inode 84024 has illegal block(s).  [/sbin/fsck.ext2 -- /]
   fsck.ext2 -a /dev/sda1
   
   
   /dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
   (i.e., without -a or -p options)
   
   ---
   
But in the middle of an fsck 
   
   
   Anyway, I'm wondering if anyone has seen anything like this at all on the
   aic7xxx driver. I also had a working L440GX that used IDE for /, and when
   i insmod aic7xxx.o I do see this same error. Any suggestions on this
   problem would be appreciated.
  
  Hmm. This looks like a kernel bug, probably triggered by lack of
  bios support.  Could you run the oops through ksymoops so we have a
  clue what is wrong?  If we knew where the kernel was crashing perhaps
  we could fix it. 
  
 
 Sorry, here's the ksymoops
 
 Oops: 
 CPU:0
 EIP:0010:[c012b234]
 Using defaults from ksymoops -t elf32-i386 -a i386
 EFLAGS: 00010206
 eax: c141   ebx: 0002   ecx: 0012c8f2   edx: 08458b00
 esi: 0008   edi: 0801   ebp: 0096   esp: cfb67d08
 ds: 0018   es: 0018   ss: 0018
 Process fsck.ext2 (pid: 49, stackpage=cfb67000)
 Stack: 0021  cfb67e9c cfb67f20 0801 107e c012fd73
 0801
0012c8f2 0400 cfea9c00 ffea  0400 25004400
 cbe56a20
0012c90d  0801  0001 cfb67e9c 4b234000
 
 Call Trace: [c012fd73] [c01299c2] [c0129b5b] [c010ac4f]
 Code: 39 4a 04 75 10 0f b7 42 08 3b 44 24 24 75 06 66 39 7a 0c 74
 
 EIP; c012b234 getblk+7c/124   =
 Trace; c012fd73 block_read+2df/540
 Trace; c01299c2 sys_lseek+5e/94
 Trace; c0129b5b sys_read+8b/a0
 Trace; c010ac4f system_call+33/38
 Code;  c012b234 getblk+7c/124
  _EIP:
 Code;  c012b234 getblk+7c/124   =
    0:   39 4a 04        cmp    %ecx,0x4(%edx)   =
 Code;  c012b237 getblk+7f/124
    3:   75 10           jne    15 _EIP+0x15 c012b249 getblk+91/124
 Code;  c012b239 getblk+81/124
    5:   0f b7 42 08     movzwl 0x8(%edx),%eax
 Code;  c012b23d getblk+85/124
    9:   3b 44 24 24     cmp    0x24(%esp,1),%eax
 Code;  c012b241 getblk+89/124
    d:   75 06           jne    15 _EIP+0x15 c012b249 getblk+91/124
 Code;  c012b243 getblk+8b/124
    f:   66 39 7a 0c     cmp    %di,0xc(%edx)
 Code;  c012b247 getblk+8f/124
   13:   74 00           je     15 _EIP+0x15 c012b249 getblk+91/124
 
 Unable to handle kernel NULL

Re: [patch] O_SYNC patch 3/3, add inode dirty buffer list support to ext2

2000-11-24 Thread Eric W. Biederman

"Jeff V. Merkey" [EMAIL PROTECTED] writes:

 Cool.  ORACLE is going to **SMOKE** on EXT2 with this change.

<Pessimism>

Hmm, I don't see how ORACLE is going to **SMOKE**.
Last I looked, ORACLE would need a query optimizer that always
finds the best possible index, and much less overhead, to **SMOKE**.

Last I looked, table reads were 10x slower than file reads.

</Pessimism>

Eric



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Eric W. Biederman

Linus Torvalds [EMAIL PROTECTED] writes:


 In short, I don't see _those_ kinds of issues. I do see error reporting as
 a major issue, though. If we need to do proper low-level block allocation
 in order to get correct ENOSPC handling, then the win from doing deferred
 writes is not very big.

To get ENOSPC handling 99% correct all we need to do is decrement a counter
that remembers how many disk blocks are free.  If we need a better
estimate than just the data blocks it should not be hard to add an
extra callback to the filesystem.

There look to be some interesting cases to handle when we fill up a
filesystem.  Before actually failing and returning ENOSPC the
filesystem might want to fsync itself and see how correct its
estimates were.  But that is the rare case and shouldn't affect
performance.
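
A minimal sketch of the counter idea, written as plain user-space C
rather than kernel code.  The struct and function names are made up for
illustration and are not an existing kernel interface; a real
implementation would also need locking and an estimate for metadata
blocks.

```c
#include <errno.h>

/* Hypothetical per-filesystem accounting, for illustration only. */
struct fs_space {
    unsigned long unreserved;   /* free blocks not yet promised to anyone */
};

/* Called when data is dirtied: fail early with -ENOSPC instead of
 * discovering the shortage later at writeback time. */
static int reserve_blocks(struct fs_space *fs, unsigned long nr)
{
    if (fs->unreserved < nr)
        return -ENOSPC;
    fs->unreserved -= nr;
    return 0;
}

/* Called when a reservation turns out not to be needed, e.g. the
 * file was truncated or deleted before it was written back. */
static void release_blocks(struct fs_space *fs, unsigned long nr)
{
    fs->unreserved += nr;
}
```

The write path would call reserve_blocks() before accepting dirty data,
and the actual block allocation can then be deferred without the risk
of reporting ENOSPC after write() has already returned success.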

<rant>
In the long term VFS support for deferred writes looks like a major
win.  Knowing how large a file is before we write it to disk allows
very efficient disk organization, and fast file access (esp. combined
with an extent-based fs).  Support for compressing files in real time
falls out naturally.  Support for filesystems that maintain coherency by
never writing the same block back to the same disk location also
appears.
</rant>

One other thing to think about for the VFS/MM layer is limiting the
total number of dirty pages in the system (to what disk pressure shows
the disk can handle), to keep system performance smooth when swapping.
All cases except mmaped files are easy, and they can be handled by a
modified page fault handler that directly puts the dirty bit on the
struct page.  (Except that is buggy with respect to clearing the dirty
bit on the struct page.)  In reality we would have to create a queue
of pointers to dirty pte's from the page fault handler and depending
on a timer or memory pressure move the dirty bits to the actual page.

Combined with the code to make sync and fsync work on the page
cache, would msync be obsolete?

Of course the most important part is that when all of that is
working, the VFS/MM layer would be perfect.  World domination
would be achieved.  For who would be caught using an OS with an
imperfect VFS layer :)

Eric




Re: [RFC] Generic deferred file writing

2000-12-30 Thread Eric W. Biederman

Linus Torvalds [EMAIL PROTECTED] writes:

 On 30 Dec 2000, Eric W. Biederman wrote:
  
  One other thing to think about for the VFS/MM layer is limiting the
  total number of dirty pages in the system (to what disk pressure shows
  the disk can handle), to keep system performance smooth when swapping.
 
 This is a separate issue,  and I think that it is most closely tied in to
 the "RSS limit" kind of patches because of the memory mapping issues. If
 you've seen the RSS rlimit patch (it's been posted a few times this week),
 then you could think of that modified by a "Resident writable pages Set
 Size" approach. 

Building on the RSS limit approach sounds much simpler than the way
I was thinking.

 Not just for shared mappings - this is also an issue with
 limiting swapout.
 
 (I actually don't think that RSS is all that interesting, it's really the
 "potentially dirty RSS" that counts for VM behaviour - everything else can
 be dropped easily enough)

Definitely.

Now the only tricky bit is how do we sense when we are overloading
the swap disks.  Well that is the next step.  I'll take a look
and see what it takes to keep statistics on dirty mapped pages.

Eric



Re: Happy new year^H^H^H^Hkernel..

2001-01-03 Thread Eric W. Biederman

Kai Germaschewski [EMAIL PROTECTED] writes:

 On Tue, 2 Jan 2001, Gerold Jury wrote:
 
  I have reversed the patches part by part, the only thing that makes a
  difference is the diversion services.
  The reason for this remains unknown for me.
 
 I think I found it. Could everybody who was getting the crash on ISDN line
 hangup try if the following patch fixes the problem?
 
 I think the problem was that we relied on divert_if being initialized to
 zero automatically, which didn't happen because it was not declared static
 and therefore not in .bss (*is this true?*).

All variables with static storage (not with static scope) if not explicitly
initialized are placed in the bss segment.  In particular this
means that adding/removing a static changes nothing.
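
A small sketch of that point.  The variable names are invented for
illustration.  All four objects below have static storage duration, so
C guarantees they start out as zero whether or not the `static`
keyword is present.  Whether the non-static ones land in .bss proper or
are emitted as common symbols depends on the toolchain and flags;
either way they start out as zero.

```c
#include <stddef.h>

int global_counter;          /* external linkage, no initializer */
static int local_counter;    /* internal linkage, no initializer */
void *global_ptr;            /* external linkage, no initializer */
static void *local_ptr;      /* internal linkage, no initializer */

/* Returns 1 if every uninitialized object above reads as zero. */
int all_zero(void)
{
    return global_counter == 0 && local_counter == 0 &&
           global_ptr == NULL && local_ptr == NULL;
}
```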

Eric



Re: Happy new year^H^H^H^Hkernel..

2001-01-04 Thread Eric W. Biederman

Russell King [EMAIL PROTECTED] writes:

 Kai Germaschewski writes:
  The patch is right, the explanation was wrong. Sorry, I didn't CC l-k when
  I found what was really going on. Other source files used a global
  initialized variable "divert_if" as well, so this became the same one as
  the one referenced in isdn_common.c.  That's why it wasn't zero, it was
  explicitly initialized elsewhere. However, making divert_if static in
  isdn_common.c fixes the problem, because now it's really local to this
  file and therefore initialized to NULL.
 
 Maybe someone should compile the kernel with everything built in and
 -fno-common to catch stuff like this?  Maybe we should always compile
 the kernel with -fno-common?

Sounds good.

We probably need to wait until after 2.4.0 is released to make the 
change though.

Eric



Re: ramfs problem... (unlink of sparse file in D state)

2001-01-06 Thread Eric W. Biederman

Chris Wedgwood [EMAIL PROTECTED] writes:

 On Sat, Jan 06, 2001 at 03:58:20PM +, Alan Cox wrote:
 
 Ext2 handles large files almost properly. (properly on 2.2 +
 patches) NFSv3 handles large files but might be missing the
 O_LARGEFILE check.  I believe reiserfs went to at least 4Gig.
 
 reiserfs 3.6.x under 2.4.x should go much higher unless i am reading
 something wrong
 
 pause
 
 yup, it does.
 
 
 as for NFS, I'm not sure how to pass O_LARGEFILE via the protocol and
 since NFS isn't really POSIX like anyhow decided we might as well
 just ignore it and have all sys_open calls for NFS look like
 O_LARGEFILE was specified

Umm.  No.  The object of the LFS stuff is so that programs that can't
handle large files don't shoot themselves in the foot.  You don't
need to pass O_LARGEFILE over the protocol and knfsd doesn't need
to handle it.  But without specifying O_LARGEFILE you should
be limited to 2GB on 32bit systems.

Moving some of the LFS checks into the VFS does sound good.  

When I looked at one of the BSDs a while ago, they had
a max file size in (the superblock?) and the VFS did basic
max file size checking.  And I think it handled all of the LFS
API at the VFS layer as well.  Alan, these are two separate
but related issues.

Putting the LFS checks,  max filesize checks into the VFS sounds
right for 2.4.x because it fixes lots of filesystems, with just a
couple of lines of code. 
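
A rough sketch of what such a generic check could look like, in
user-space C.  The function name, its parameters and the idea of
passing the per-filesystem maximum as an argument are assumptions made
for illustration; this is not the actual kernel code.

```c
#include <errno.h>

#define NON_LFS_LIMIT 0x7fffffffLL   /* 2GB - 1, the non-LFS ceiling */

/* Hypothetical helper: how many bytes of a write at `pos` may go ahead.
 * largefile_ok stands in for "the file was opened with O_LARGEFILE".
 * Returns the (possibly shortened) byte count, or -EFBIG. */
static long long clamp_write(int largefile_ok, long long pos,
                             long long count, long long fs_max_bytes)
{
    long long limit = fs_max_bytes;

    /* Programs that did not ask for large files stop at 2GB. */
    if (!largefile_ok && limit > NON_LFS_LIMIT)
        limit = NON_LFS_LIMIT;

    if (pos >= limit)
        return -EFBIG;
    if (count > limit - pos)
        count = limit - pos;         /* short write up to the limit */
    return count;
}
```

With a check like this in one generic place, a filesystem would only
have to report its own maximum file size.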

Eric



Re: 2.4 todo list update

2001-01-07 Thread Eric W. Biederman

Rik van Riel [EMAIL PROTECTED] writes:


 The following bugs _could_ be fixed ... I'm not 100% certain
 but they're probably gone (could somebody confirm/deny?):
 
 * mm-rss is modified in some places without holding the
   page_table_lock

As of linux-2.4.0-test13-pre7 I can confirm that this bug
still exists.  The most obvious location is in zap_page_range;
there may be others as well.

Eric




Re: ramfs problem... (unlink of sparse file in D state)

2001-01-07 Thread Eric W. Biederman

Alan Cox [EMAIL PROTECTED] writes:

  Putting the LFS checks,  max filesize checks into the VFS sounds
  right for 2.4.x because it fixes lots of filesystems, with just a
  couple of lines of code. 
 
 Rather more than that, and it only fixes those using generic_file_*

True.  But it is noticeably fewer lines of code than doing it all
once for each fs.

Eric





Re: Patch (repost): cramfs memory corruption fix

2001-01-07 Thread Eric W. Biederman

Rik van Riel [EMAIL PROTECTED] writes:

 On Sun, 7 Jan 2001, Linus Torvalds wrote:
  On Sun, 7 Jan 2001, Alan Cox wrote:
  
   -ac has the rather extended ramfs with resource limits and stuff. That one
   also has rather more extended bugs 8). AFAIK none of those are in the
 vanilla
 
   ramfs code
 
  This is actually where I agree with whoever it was that said that ramfs as
  it stands now (without the limit checking etc) is much nicer simply
  because it can act as an example of how to do a simple filesystem. 
  
  I wonder what to do about this - the limits are obviously useful, as would
  the "use swap-space as a backing store" thing be. At the same time I'd
  really hate to lose the lean-mean-clean ramfs. 
 
 Sounds like a job for ... drum roll ... tmpfs!!

If you need tmpfs the VFS layer is broken.  For 99% of everything
performance is determined by VFS layer caching.  A fs that
uses swap space as a backing store is not a big win.  You just have 
a fs that doesn't support sync and you can add a mount option to
a normal fs if you want that.

I've written the filesystem and it was a dumb idea.

Ramfs with (maybe) some basic limits has a place.  tmpfs is just
extra code to maintain. 

Eric



Re: Related VIA PCI crazyness?

2001-01-08 Thread Eric W. Biederman

Linus Torvalds [EMAIL PROTECTED] writes:

 On Sun, 7 Jan 2001, Albert Cranford wrote:
   Could anybody with a VIA chip who has the energy please do something for
   me:
- enable DEBUG in arch/i386/kernel/pci-i386.h
- do a "/sbin/lspci -xxvvv" on the interrupt routing chip (it's the
  "ISA bridge" chip - the VIA numbers are 82c586, 82c596, the PCI
  numbers for them are 1106:0586 and 1106:0596, I think)
- do a cat /proc/pci
   
  
  Does this help.
 
 Ahh, no.
 
 A SMP kernel (or one with UP IO-APIC) is not going to be helpful for this,
 actually. SMP will take the irq data from the MP block, not the pirq table
 (that can be considered something of a misfeature right now, but getting
 the mixture of PCI irq redirection from the MP tables and the pirq irq
 routing information right together is probably not worth it - especially
 as I don't think any MS OS has ever done that either, so the BIOS writers
 have never experienced that combination - so it's almost guaranteed to
 result in strange results).

pirq is specific to the legacy i8259 interrupt handler.
MP is specific to some kind of IO-APIC.

Right now when we enable the IO-APIC we disable the legacy i8259
controller.  And I'm not even certain you can have them both enabled
at the same time.

So except for not having an option to disable use of the IO-APIC
I don't see what we could do better.

Eric



Re: Subtle MM bug

2001-01-08 Thread Eric W. Biederman

Zlatko Calusic [EMAIL PROTECTED] writes:

 
 Yes, but a lot more data on the swap also means degraded performance,
 because the disk head has to seek around in the much bigger area. Are
 you sure this is all OK?

I don't think we have more data on the swap, just more data has an
allocated home on the swap.  With the earlier allocation we should
(I haven't verified) allocate contiguous chunks of memory contiguously
on the swap.   And reusing the same swap pages helps out with this.

Eric



Re: Subtle MM bug

2001-01-09 Thread Eric W. Biederman

Linus Torvalds [EMAIL PROTECTED] writes:

 On 8 Jan 2001, Eric W. Biederman wrote:
 
  Zlatko Calusic [EMAIL PROTECTED] writes: 
   
   Yes, but a lot more data on the swap also means degraded performance,
   because the disk head has to seek around in the much bigger area. Are
   you sure this is all OK?
  
  I don't think we have more data on the swap, just more data has an
  allocated home on the swap.
 
 I think Zlatko's point is that because of the extra allocations, we will
 have worse locality (more seeks etc). 
 
 Clearly we should not actually do any more actual IO. But the sticky
 allocation _might_ make the IO we do be more spread out.

The tradeoff when implemented correctly is that writes will tend to be
more spread out and reads should be better clustered together. 

 To offset that, I think the sticky allocation makes us much better able to
 handle things like clustering etc more intelligently, which is why I think
 it's very much worth it.  But let's not close our eyes to potential
 downsides.

Certainly, keeping our eyes open is a good thing.

But it has been apparent for a long time that by doing allocation as
we were doing it, that when it came to heavy swapping we were taking a
performance hit.  So I'm relieved that we are now being more aggressive.

From the sounds of it what we are currently doing actually sucks worse
for some heavy loads.  But it still feels like the right direction.

It's been my impression that work loads where we are actively swapping
are a lot different from work loads where we really don't swap.  To
the extent that it might make sense to make the actively swapping case
a config option to get our attention in the code.  It would be nice
to have a linux kernel for once that handles heavy swapping (below
the level of thrashing) gracefully. :)

Eric



Re: Linux's implementation of poll() not scalable?

2000-10-26 Thread Eric W. Biederman

Dan Kegel [EMAIL PROTECTED] writes:

 It's harder to write correct programs that use edge-triggered events.

Huh? The race between when an event is reported, and when you take action 
on it effectively means all events are edge triggered. 

So making the interface clearly edge triggered seems to be a win for
correctness.

Eric



Re: Updated 2.4 TODO List -- new addition WAS(test9 PCI resource collisions (fwd)

2000-10-24 Thread Eric W. Biederman

"David S. Miller" [EMAIL PROTECTED] writes:

Date:  Tue, 24 Oct 2000 13:50:10 -0700 (PDT)
From: Linus Torvalds [EMAIL PROTECTED]
 
Does the above make it work for you? I don't know if PCI even has
the notion of transparent bridging, and quite frankly I doubt it
does. The above would be nothing but a hack that basically says "I
don't understand the resources of this bridge, so I'll just say it
bridges everything".
 
 I bet PCI allows no such thing, thus to be totally safe I would
 conditionalize this feature on the specific bridge.  Ie. only allow
 it for this bridge type, because I bet it is just some bug in the
 the address comparators which makes the bridge interpret zero ranges
 as "forward and respond to everything".

I'm not certain of the details but I do know that it is legal.
To date I've only heard of it on ISA bridges, in particular the PIIXE.
It's some kind of passive listening mode as opposed to actually claiming
the bus cycles.

 This only would make sense if the bridge snooped config space access
 to devices behind it, so that it knew what addresses to forward and
 respond to.  Just responding to "everything" would not work for
 obvious reasons.

Right but I don't think you actually have to respond.  Not that I think
this is a good idea, but it does appear to be legal.

Eric



Re: guarantee_memory() syscall?

2000-10-29 Thread Eric W. Biederman

Raul Miller [EMAIL PROTECTED] writes:

 Can anyone tell me about the viability of a guarantee_memory() syscall?
 
 [I'm thinking: it would either kill the process, or allocate all virtual
 memory needed for its shared libraries, buffers, allocated memory, etc.
 Furthermore, it would render this process immune to the OOM killer,
 unless it allocated further memory.]

Except for the OOM killer semantics mlockall already exists.
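
For reference, a minimal sketch of using mlockall.  The wrapper name is
made up; mlockall, MCL_CURRENT and MCL_FUTURE are the standard POSIX
interface.  The call typically needs privilege or a sufficient
RLIMIT_MEMLOCK, so the failure path matters.

```c
#include <errno.h>
#include <sys/mman.h>

/* Lock every page the process has mapped now (MCL_CURRENT) and every
 * page it maps later (MCL_FUTURE) into RAM.
 * Returns 0 on success or a negative errno value on failure. */
static int pin_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -errno;
    return 0;
}
```

A process would call this once at startup and treat a failure as
"cannot guarantee memory", which covers everything in the proposal
except the OOM killer part.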

Eric




Re: /proc xml data

2000-10-29 Thread Eric W. Biederman

Joe [EMAIL PROTECTED] writes:

 I remember hearing about various debates about the /proc structure.  I
 was wondering if anyone had ever considered storing some of the data in
 xml format rather than its current format?  Things like /proc/meminfo
 and cpuinfo may work good in this format as then it would be easy to
 write a generic xml parser that could then be used to parse any of the
 data. "MemTotal:  %8lu kB\n"
 
 In the case of the meminfo it would be a matter of changing the lines in
 fs/proc/array.c  function get_meminfo(char * buffer) from
 
 "MemTotal:  %8lu kB\n"
 
 to something like
 
 "<memtotal>%8lu kB</memtotal>\n"

The general consensus is that if we have a major reorganization, in proc
the rule will be one value per file.  And let directories do the grouping.
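
To illustrate the difference for a program reading the data, here is a
hedged sketch in C.  Both function names are invented, and the
one-value-per-file layout is only the proposal described above, not an
existing interface.

```c
#include <stdio.h>
#include <stdlib.h>

/* Current style: the reader has to know the label and the unit.
 * This only looks at a buffer that starts with the MemTotal line. */
static int parse_meminfo_total(const char *buf, unsigned long *kb)
{
    return sscanf(buf, "MemTotal: %lu kB", kb) == 1 ? 0 : -1;
}

/* One value per file: the file body is just the number, and the
 * file name and directory carry the meaning. */
static int parse_single_value(const char *buf, unsigned long *val)
{
    char *end;

    *val = strtoul(buf, &end, 10);
    return end != buf ? 0 : -1;
}
```

With one value per file no format-specific parser is needed at all,
which is the argument against adding XML here.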

Eric



Re: non-gcc linux?

2000-11-05 Thread Eric W. Biederman

Ion Badulescu [EMAIL PROTECTED] writes:

 On Sun, 5 Nov 2000 23:42:25 +0100, Marc Lehmann [EMAIL PROTECTED] wrote:
  On Sun, Nov 05, 2000 at 04:06:37PM -0500, Jakub Jelinek [EMAIL PROTECTED]
 wrote:
 
 
  for SGI, or SGI would have to be willing to assign some code to FSF.
  
  Which is the standard procedure that the FSF requires for all its
  programs to be able to defend them
 
 ... or sell them under a different license. Not that they would, but they
 could, if they really wanted to.

The wording of the standard copyright assignment to the FSF binds the
FSF so that it can only release the code under a free software
license.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Persistent module storage [was Linux 2.4 Status / TODO page]

2000-11-06 Thread Eric W. Biederman

David Woodhouse [EMAIL PROTECTED] writes:

 The current situation is equivalent to stopping forwarding packets each
 time an app on the local machine decides it wants to send its own packets,
 after a period of inactivity.
 
 Defaulting to zero on boot is fine. Defaulting to zero after the module
 has been auto-unloaded and auto-loaded again is less good.

Well we don't have auto unload.
And module persistent data for the second load case causes chaos with 
the goal of having exactly the same code in modules and compiled in
kernel code.

It would probably be better (in this case) to increment the module count
when the mixer settings go above 0, and decrement it when the settings 
go totally to 0.  This prevents an unwanted unload.
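
A user-space sketch of that bookkeeping.  The struct, the channel count
and the use_count field are stand-ins invented for illustration; in a
real driver the last one would be the module use count.

```c
#define NR_CHANNELS 4            /* hypothetical number of mixer channels */

struct mixer {
    int level[NR_CHANNELS];
    int nonzero;                 /* channels with a non-zero level */
    int use_count;               /* stand-in for the module use count */
};

/* Set one channel, pinning the module while any channel is non-zero. */
static void mixer_set(struct mixer *m, int ch, int level)
{
    int was_active = m->nonzero > 0;

    if (m->level[ch] == 0 && level != 0)
        m->nonzero++;
    else if (m->level[ch] != 0 && level == 0)
        m->nonzero--;
    m->level[ch] = level;

    if (!was_active && m->nonzero > 0)
        m->use_count++;          /* first non-zero setting: pin */
    else if (was_active && m->nonzero == 0)
        m->use_count--;          /* everything back at zero: unpin */
}
```

The count only changes on the transitions between "all zero" and "some
non-zero", so repeated volume changes do not leak references.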

But for reliability and code simplicity there does not yet seem to
be a case for persistent module storage.

Eric



Re: Installing kernel 2.4

2000-11-08 Thread Eric W. Biederman

Horst von Brand [EMAIL PROTECTED] writes:

 I'd prefer to be a guinea pig for one of 3 or 4 generic kernels distributed
 in binary than of one of the hundreds of possibilities of patching a kernel
 together at boot, plus the (presumably rather complex and fragile)
 machinery to do so *before* the kernel is booted, thank you very much.
 
 Plus I'm getting pissed off by how long a boot takes as it stands today...

Just for reference, I can boot from power on to login prompt in 12 seconds
with Linux.  The big change is nuking the BIOS.

  They just want it to boot, and run with the same level of ease of use
  and stability they get with NT and NetWare and other stuff they are used
  to.   This is an easy choice from where I'm sitting.
 
 Easy: i386. Or i486 (I very much doubt your customers run on less, and this
 should be generic enough).

It's also possible to do a two stage boot: stage 1 is an i386 kernel, stage 2
is the specific kernel for the machine.  This adds about a second to the
whole boot process.

Eric




Re: test11-pre2 compile error undefined reference to `bust_spinlocks' WHAT?!

2000-11-11 Thread Eric W. Biederman

Andrew Morton [EMAIL PROTECTED] writes:

 George Anzinger wrote:
  
  The notion of releasing a spin lock by initializing it seems IMHO, on
  the face of it, way off.  Firstly the protected area is no longer
  protected which could lead to undefined errors/ crashes and secondly,
  any future use of spinlocks to control preemption could have a lot of
  trouble with this, principally because the locker is unknown.
  
  In the case at hand, it would seem that an unlocked path to the console
  is a more correct answer that gives the system a far better chance of
  actually remaining viable.
  
 
 Does bust_spinlocks() muck up the preemptive kernel's spinlock
 counting?  Would you prefer spin_trylock()/spin_unlock()?
 It doesn't matter - if we call bust_spinlocks() the kernel is
 known to be dead meat and there is a fsck in your near future.
 
 We are still trying to find out why kumon@fujitsu's 8-way is
 crashing on the test10-pre5 sched.c.  Looks like it's fixed
 in test11-pre2 but we want to know _why_ it's fixed.  And at
 present each time he hits the bug, his printk() deadlocks.
 
 So bust_spinlocks() is a RAS feature :)  A very important one -
 it's terrible when your one-in-a-trillion bug happens and there
 are no diagnostics.
 
 It's a work-in-progress.  There are a lot of things which
 can cause printk to deadlock:
 
 - console_lock
 - timerlist_lock
 - global_irq_lock (console code does global_cli)
 - log_wait.lock
 - tasklist_lock (printk does wake_up) (*)
 - runqueue_lock (printk does wake_up)
 
 I'll be proposing a better patch for this in a few days.

Hmm.  I would like to suggest we look at non locking variants of
things. i.e. Data structure version changing with swap.  
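
One possible reading of "version changing with swap", sketched with C11
atomics: readers never take a lock, and an updater builds a complete
new version and publishes it with a single atomic pointer exchange.
The struct and function names are invented for illustration, and the
sketch deliberately leaves out the hard part, which is knowing when the
old version can be freed.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical read-mostly data, replaced as a whole on update. */
struct config {
    int version;
    int loglevel;
};

static _Atomic(struct config *) current_config;

/* Reader side: one atomic load, no lock, never blocks. */
static const struct config *config_get(void)
{
    return atomic_load_explicit(&current_config, memory_order_acquire);
}

/* Writer side: publish a fully built new version.  Returns the old
 * version so the caller can retire it once no reader still uses it. */
static struct config *config_publish(struct config *next)
{
    return atomic_exchange_explicit(&current_config, next,
                                    memory_order_acq_rel);
}
```

A path like printk that only needs to read such data could then run
without taking any of the locks listed above.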


Eric



Re: Q: Linux rebooting directly into linux.

2000-11-11 Thread Eric W. Biederman

Michael Rothwell [EMAIL PROTECTED] writes:

 "Eric W. Biederman" wrote:
  
  I have recently developed a patch that allows linux to directly boot
  into another linux kernel.  
 
 This would rock. One place I can think of using it is with distro
 installers. The installer boots a generic i386 kernel, and then installs
 an optimized (i.e, PIII, etc.) kernel for run-time.

This would rock?  It already does.  Of course the installers need
to actually use this.

Eric





Re: Q: Linux rebooting directly into linux.

2000-11-11 Thread Eric W. Biederman

"H. Peter Anvin" [EMAIL PROTECTED] writes:

 Followup to:  [EMAIL PROTECTED]
 By author:[EMAIL PROTECTED] (Eric W. Biederman)
 In newsgroup: linux.dev.kernel

The interface is designed to be simple and inflexible yet very
powerful.  To that end the code just takes an elf binary, and a
command line.  The started image also takes an environment generated
by the kernel of all of the unprobeable hardware details.
   
   Isn't this what milo does on alpha?
  
  Similar.  Milo uses kernel drivers in its own framework.
  This has proved to be a major maintenance problem.  Milo is nearly
  a kernel fork.
  
  The design is for the long term to get this incorporated into the
  kernel, and even if not, a small kernel patch should be easier to
  maintain than a harness for calling kernel drivers.
  
 
 I'm working on something similar in "Genesis".  It pretty much is (or
 rather, will be) a kernel *port*, not a fork; the port is such that it
 can run on top of a simple BIOS extender and thus access the boot
 media.

Hmm.  You must mean similar to milo.

Have fun.  With linuxBIOS I'm working exactly the other way: killing
off the BIOS, and letting the initial firmware be just a boot loader.
The reduction in complexity should make it more reliable.

Eric




Re: Q: Linux rebooting directly into linux.

2000-11-11 Thread Eric W. Biederman

Adam Lazur [EMAIL PROTECTED] writes:

 Eric W. Biederman ([EMAIL PROTECTED]) said:
  Michael Rothwell [EMAIL PROTECTED] writes:
   This would rock. One place I can think of using it is with distro
   installers. The installer boots a generic i386 kernel, and then installs
   an optimized (i.e, PIII, etc.) kernel for run-time.
  
  This would rock?  It already does.  Of course the installers need
  to actually uses this.
 
 Actually, along the lines of what Scyld uses two kernel monte for with
 their Beowulf2 distribution.
 
 They boot a network enabled kernel which pulls a kernel off of a server
 and then uses two kernel monte to boot with that one.  This allows you
 to centrally admin your cluster with one server. Good stuff...

Yep.  You can also do this with etherboot flashed onto a NIC.

I also intend to use my work for this functionality as well.  
FYI I work for linux networx which builds hardware for linux clusters.

The fact that Scyld is using arp and a fixed network socket is a 
design decision I don't agree with.   

Truly slick will be when linuxBIOS is solid.  Then you even get remote
control of the BIOS, and remote booting all from within the BIOS.  Only
time will tell if it is worth the effort :)

Eric




Re: Q: Linux rebooting directly into linux.

2000-11-11 Thread Eric W. Biederman

Adam Lazur [EMAIL PROTECTED] writes:

 Eric W. Biederman ([EMAIL PROTECTED]) said:
  I have recently developed a patch that allows linux to directly boot
  into another linux kernel.  With the code freeze it appears
  inappropriate to submit it at this time. 
 
 Aside from what looks to be support for SMP, how does this differ from
 the two kernel monte stuff at http://scyld.com/software/monte.html ?

I admit that LOBOS, two kernel monte, and the one by Werner Almesberger
were all related work that I looked at.  And I acknowledge
there were some good ideas I pilfered from all of them.

There are a couple of differences.  
But the big one is I'm trying to do it right.  In particular
this means fixing the problem where the problem is.

Additionally I'm killing backwards compatibility with a lot of
short-sighted things.

And multiplatform support is in the plan.  So long term this should
run on alpha, and x86, and sparc and everything else out there
that linux supports.  This means that you can have a multiplatform
boot loader.  There will have to be glue code out there to get
started from different firmware on different machines but that is it.

Additionally mine is the only one that has a real chance of booting
a non-linux kernel.  Gathering the non-probeable hardware information
is hard.  Currently my implementation is the only one that does not simply
copy the boot parameters page that is given to the linux kernel.

Unlike 2 kernel monte mine deliberately has no reliance upon a BIOS.

There is another major difference as well.  kexec is part of the work
on the linuxBIOS project, where the goal is to have a very minimal
firmware before booting into linux, and to use that initial linux
kernel for the firmware's hardware drivers.  What this means is kexec
is being developed from a point of view that needs it.  If you don't
have a BIOS kexec is a must.

Eric



Re: Q: Linux rebooting directly into linux.

2000-11-11 Thread Eric W. Biederman

"H. Peter Anvin" [EMAIL PROTECTED] writes:

 "Eric W. Biederman" wrote:
  
  Hmm.  You must mean similar to milo.
  
  Have fun.  With linuxBIOS I'm working exactly the other way.  Killing
  off the BIOS.  And letting the initial firmware be just a boot loader.
  The reduction in complexity should make it more reliable.
  
 
 ... except that you have to handle every single motherboard architecture
 out there now.

Agreed that is a bit of a risk.  Mostly you just have to handle
the chipset of the boards and there are a finite number of them.

Only time will tell if this is truly feasible.  I think it is certainly
worth a try.

And I don't have to handle every single one just all of the ones
I need it to run on :)

With my kexec patch I'm just getting the infrastructure ready, and that
is functionality that can be used independently of linuxBIOS.  If
booting linux from linux would help with what you are doing I'd love to
work together on that.

Eric



Re: bzImage ~ 900K with i386 test11-pre2

2000-11-12 Thread Eric W. Biederman

Andrea Arcangeli [EMAIL PROTECTED] writes:

 On Sun, Nov 12, 2000 at 06:14:36AM -0700, Eric W. Biederman wrote:
  x86-64 doesn't load the segment registers at all before use.
 
 Yes, before switching to 64bit long mode we never do any data access. We do a
 stack access to clear eflags only while we still run in legacy mode with paging
 disabled and so we only rely on ss to be valid when the bootloader jumps at
 0x10 for executing the head.S code (and not anymore on the gdt_48 layout).

Nope, you rely on cs & ds as well.  cs is a given: the code is running,
so it must be valid.  But ds is needed for lgdt.

  I can tell you don't have real hardware.  The non obviousness

I need to retract this a bit.  You are still building a compressed image,
and the code in the boot/compressed/head.S remains unchanged and loads
segment registers, so it works by luck.  If you didn't build a
compressed image you would be in trouble.
 
 Current code definitely works fine on the simnow simulator so if current code
 shouldn't work because it's buggy then at least the simulator is sure buggy as
 well (and that isn't going to be the case as its behaviour is in full sync with
 the specs as far I can see).

Add a target for a noncompressed image and then build.  It should be
interesting to watch.
 
  So while you load the gdt before you set a segment register later,
  which is good the more important part was still missed.
 
 Sorry but I don't see the missing part. Are you sure you're not missing this
 part of the x86-64 specs?

Nope because what I was complaining about is in 32 bit mode. :)

   Data and Stack Segments:
 
   In 64-bit mode, the contents of the ES, DS, and SS segment registers
   are ignored. All fields (base, limit, and attribute) in the
   corresponding segment descriptor registers (hidden part) are also
   ignored.

Hmm.  I'll have to look and see if FS & GS are also ignored.

   Address calculations in 64-bit mode that reference the ES, DS, or SS
   segments, are treated as if the segment base is zero.  Rather than
   perform limit checks, the processor instead checks that all
   virtual-address references are in canonical form.

Cool I like this bit.  The segments are finally dead.
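
To make the "canonical form" check in the quoted spec text concrete, here
is a minimal sketch in Python.  It assumes 48 implemented virtual-address
bits, which the quoted passage does not state; the function name and the
va_bits parameter are made up for illustration.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if a 64-bit address is a sign extension of its low va_bits bits.

    va_bits=48 is an assumption for illustration; the quoted spec text
    does not give the number of implemented address bits.
    """
    if not 0 <= addr < 1 << 64:
        raise ValueError("address must fit in 64 bits")
    # Bit (va_bits - 1) and every bit above it must all be equal.
    upper = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


if __name__ == "__main__":
    for a in (0x00007FFFFFFFFFFF, 0x0000800000000000, 0xFFFF800000000000):
        print(hex(a), is_canonical(a))
```

The point is that one comparison on the upper bits stands in for the
per-segment base and limit checks of the legacy modes.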

  O.k. on monday I'll dig up my patch and that clears this up.
 
 Sure, go ahead if you weren't missing that basic part of the long mode specs.
 Thanks.

Nope.  Though I suspect we should do the switch to 64bit mode in
setup.S and not have these issues pollute head.S at all.

Eric




Re: bzImage ~ 900K with i386 test11-pre2

2000-11-12 Thread Eric W. Biederman

Andrea Arcangeli [EMAIL PROTECTED] writes:

 On Sun, Nov 12, 2000 at 06:14:36AM -0700, Eric W. Biederman wrote:
  x86-64 doesn't load the segment registers at all before use.
 
 Yes, before switching to 64bit long mode we never do any data access. We do a
 stack access to clear eflags only while we still run in legacy mode with paging
 disabled and so we only rely on ss to be valid when the bootloader jumps at
 0x10 for executing the head.S code (and not anymore on the gdt_48 layout).

Actually it just occurred to me that this stack access is buggy.  You haven't
set up a stack yet.  Only boot/compressed/head.S did, and that location isn't
safe to use.

Eric



Re: Q: Linux rebooting directly into linux.

2000-11-15 Thread Eric W. Biederman

Erik Andersen [EMAIL PROTECTED] writes:

 On Thu Nov 09, 2000 at 01:18:24AM -0700, Eric W. Biederman wrote:
  
  I have recently developed a patch that allows linux to directly boot
  into another linux kernel.  
 
 Looks very cool.  I'm curious about your decision to use ELF images.  This
 makes it much less convenient to use due to the kernel postprocessing, and
 makes it that the kernel binary from which you initially boot is not
 necessarily the same as the binary that you re-boot into.  

The decision here was that I needed to pass a vector of 
(physical address, length, data) triples.  The elf program header
is dead simple and provides it.  So I either had to invent a
complicated argument passing mechanism for a syscall or have the
kernel parse a file.
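
As a rough illustration of why the program header is enough (this is not
the kexec loader, only a sketch), the following Python reads the PT_LOAD
entries of a little-endian ELF32 image and returns the physical address,
in-memory length, and file data for each.  The helper names and the tiny
hand-built demo image are assumptions made for the example.

```python
import struct

EHDR = struct.Struct("<16sHHIIIIIHHHHHH")  # ELF32 file header, little-endian
PHDR = struct.Struct("<IIIIIIII")          # ELF32 program header entry
PT_LOAD = 1


def load_segments(image: bytes):
    """Return a list of (physical address, memory length, data) triples."""
    fields = EHDR.unpack_from(image, 0)
    ident, phoff, phentsize, phnum = fields[0], fields[5], fields[9], fields[10]
    if ident[:4] != b"\x7fELF" or ident[4] != 1 or ident[5] != 1:
        raise ValueError("not a little-endian ELF32 image")
    segments = []
    for i in range(phnum):
        (p_type, p_offset, _vaddr, p_paddr,
         p_filesz, p_memsz, _flags, _align) = PHDR.unpack_from(
            image, phoff + i * phentsize)
        if p_type != PT_LOAD:
            continue  # only loadable segments describe memory contents
        data = image[p_offset:p_offset + p_filesz]
        segments.append((p_paddr, p_memsz, data))
    return segments


def make_demo_image() -> bytes:
    """Build a hypothetical one-segment image, purely for demonstration."""
    payload = b"kernel"
    ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
    ehdr = EHDR.pack(ident, 2, 3, 1, 0x100000, EHDR.size, 0, 0,
                     EHDR.size, PHDR.size, 1, 0, 0, 0)
    phdr = PHDR.pack(PT_LOAD, EHDR.size + PHDR.size, 0x100000, 0x100000,
                     len(payload), 16, 5, 4096)
    return ehdr + phdr + payload


if __name__ == "__main__":
    for paddr, memsz, data in load_segments(make_demo_image()):
        print(hex(paddr), memsz, data)
```

Where the memory length exceeds the data length, the remainder would be
zero-filled by whoever loads the segment.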

 Wouldn't it be more reasonable to simply try to exec whatever file is provided?
 If the concern is initrds; they can be simply pasted into the kernel binary.

That's exactly what my preprocessing does. 

vmlinux is also an elf binary.  As is arch/i386/boot/bvmlinux but it
is compressed.

All mkelfImage does is the pasting of initrd's, command lines,
and just a touch of argument conversion code.

What I don't do deliberately is allow or need setup.S which does
syscalls to run.  All it does is make BIOS calls and store the results in a
nasty data structure.  I have replaced that data structure with 
something that is maintainable.  

I would like very much to not need mkelfImage.  However that
requires further changes to the kernel, and I cannot boot an unpatched
kernel with that method.  

Eric



Re: Q: Linux rebooting directly into linux.

2000-11-15 Thread Eric W. Biederman

Erik Andersen [EMAIL PROTECTED] writes:

 On Tue Nov 14, 2000 at 07:59:18AM -0700, Eric W. Biederman wrote:
  
  All mkelfImage does is the pasting of initrd's, command lines,
  and just a touch of argument conversion code.
 
 You can link in an initrd using linker magic, i.e.
 $(OBJCOPY) --add-section=image=kernel --add-section=initrd=initrd.gz

Hmm this is certainly possible.
My impression is that this doesn't currently work on x86.
I would love to be wrong.

 This is done in ppc/boot/Makefile for example.  It might be a nice thing
 to add a .config option to optionally specify an initrd to link into
 the kernel image.  Similarly, several architectures have a CONFIG_CMDLINE
 which could also do the job (see arch/ppc/config.in for example).  
 
 Presumably, by doing such things you could avoid needing to use mkelfImage.

Agreed.  And I would like to see that.
With the 2.4 code freeze it is too late to do that today. 
Also mkelfImage gives me backwards compatibility for now.

Eric



Re: Addressing logically the buffer cache

2000-11-16 Thread Eric W. Biederman

Juan [EMAIL PROTECTED] writes:

 Alexander Viro wrote:
  
  On Tue, 14 Nov 2000, Juan wrote:
  
   Hi!.
  
   Is there any patch or project to address logically the buffer cache?.
   Now, you use three parameters to find a buffer in cache: device, block
   number, and block size. But, what about if I want to find a buffer using
   a super block, an inode number, and a block number within the file
   specified by the inode number.
  
  What's wrong with using the pagecache and per-page buffer_heads?
 
 Suppose you are implementing a log-structured file system and a process
 adds a new logical block to a file. Besides, suppose that the segment is
 512 KBytes in size. Usually, you don't want to write the segment before
 it is full. The logical block hasn't got a physical address because you
 don't build the segment until it is written to disk. So, what happens if
 another process wants to access to the new block?.
 
 You can't assign a physical address to the new block because the address
 can change when the buffer is written to disk.

So you don't assign a buffer head until you make the final decision.
There are some interesting issues with how you track that your data
is dirty but otherwise all is well.
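
A toy model of that idea, in Python rather than kernel code: blocks are
looked up by (inode, logical block) while they are dirty, and a disk
address is only chosen when the segment is flushed.  The class and
method names are invented for the sketch and do not correspond to any
kernel interface.

```python
class SegmentWriter:
    """Hold dirty blocks by logical identity until the segment is written."""

    def __init__(self, next_disk_block: int = 0):
        self.pending = {}    # (inode, logical block) -> data, no disk address yet
        self.block_map = {}  # (inode, logical block) -> disk block, set at flush
        self.next_disk_block = next_disk_block

    def write(self, inode: int, lblock: int, data: bytes) -> None:
        self.pending[(inode, lblock)] = data

    def read(self, inode: int, lblock: int):
        # Another process finds the block by logical identity,
        # without needing a physical address.
        return self.pending.get((inode, lblock))

    def flush(self) -> None:
        # The final placement decision happens only here.
        for key in sorted(self.pending):
            self.block_map[key] = self.next_disk_block
            self.next_disk_block += 1
        self.pending.clear()


if __name__ == "__main__":
    w = SegmentWriter(next_disk_block=1000)
    w.write(7, 0, b"hello")
    print(w.read(7, 0), w.block_map)
    w.flush()
    print(w.block_map)
```

Tracking which of these pending blocks are dirty is the "interesting
issue" mentioned above; the sketch sidesteps it by treating everything
in the pending table as dirty.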
 
 Perhaps, I'm wrong, but I think that the implementation of the BSD-LFS
 needs to address logically the buffer cache.

The linux vfs is quite different from the Berkeley one.  The linux page
cache is much closer to the Berkeley block cache than the deprecated
linux block cache is.

Eric



Re: Swapping over NFS in Linux 2.4?

2000-11-16 Thread Eric W. Biederman

Rik van Riel [EMAIL PROTECTED] writes:

 On Wed, 15 Nov 2000, Andreas Osterburg wrote:
 
  Because I set up a diskless Linux-workstation, I want to swap
  over NFS. For this purpose I found only patches for "older"
  Linux-versions (2.0, 2.1, 2.2?).
 
  Does anyone know whether there are patches for 2.4 or does anyone
  know another solution for this problem?
 
 1. you can swap over NBD
 2. if you point me to the swap-over-nfs patches you
have found, I can try to make them work on 2.4 ;)

Rik all we need to do now is convert the swapout code to address space
methods just like the block device was.

This has a number of interesting effects.  One of which is that
brw_page should no longer have any users, which simplifies fs/buffer.c.

Further this is equivalent to mounting a nfs file loop back which
the address space methods now allow, but it is more direct.  

Which means that if this reveals any bugs in nfs/lock ups in nfs they
were already there.

This has been on my want to do list for a while but I'm busy
reinventing booting so I haven't gotten to it.

Eric



Re: Q: Linux rebooting directly into linux.

2000-11-16 Thread Eric W. Biederman

Werner Almesberger [EMAIL PROTECTED] writes:

 Eric W. Biederman wrote:
  There are a couple of differences.  
  But the big one is I'm trying to do it right.
 
 So why do you need a file-based interface then ? ;-)

When possible it is nice to set as much policy as possible,
without removing functionality.

 Since this is a highly privileged operation anyway, you may as
 well trust user space to use the right data format ...

Hmm. I hadn't thought of it from that angle.
I don't think I have much code tied up in format checks
so I'm not too worried.  If something goes wrong it is simply
a question of where it will crash.  Doing some checking simply
allows for better debugging of problems :)

One thing I'm going to have to consider though is if the memory
regions that the new kernel is going into are actually memory.  The
pro argument is that checking for reserved areas of memory catches
changes to an architecture that were unexpected.  The recent issues
with the extended BIOS area growing are a good example of this.
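
The check being weighed here can be sketched in a few lines of Python.
The memory map below is a made-up example (the usable ranges a firmware
might report, as half-open intervals), and the function name is an
assumption, not something from the patch.

```python
def fits_in_ram(start: int, length: int, ram_ranges) -> bool:
    """True if [start, start + length) lies inside one usable RAM range.

    ram_ranges is a list of half-open (low, high) intervals that are
    assumed to be already merged and non-overlapping.
    """
    end = start + length
    return any(low <= start and end <= high for low, high in ram_ranges)


# Hypothetical map: low memory below a reserved BIOS area, then RAM from 1MB.
EXAMPLE_MAP = [(0x00000000, 0x0009F000), (0x00100000, 0x08000000)]

if __name__ == "__main__":
    print(fits_in_ram(0x00100000, 0x00200000, EXAMPLE_MAP))  # segment at 1MB
    print(fits_in_ram(0x0009E000, 0x00002000, EXAMPLE_MAP))  # runs into reserved area
```

A segment that straddles the end of low memory is rejected, which is the
kind of surprise a growing reserved area would otherwise cause silently.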


 I get the impression that you incur quite a lot of overhead just
 to make it fit with the exec interface. I agree that it's
 conceptually nice, and it looks cleanly done, but I don't quite
 see the practical value. (Except, perhaps, that this allowed you
 to pick the rather cute name "kexec" ;-)

Well there is that.  Somehow implementing scatter/gather from 
a user space process seemed like a potential mess, and extra work.

In part I am starting with a network boot loader, so building
a file format that works was needed anyway.  As far as overhead, my
impression is that there is none in speed, and only one or two extra
functions in space. 

  Additionally mine is the only one that has a real chance of booting
  a non-linux kernel.
 
 Hmm, I think all approaches could boot a non-Linux kernel, but ...
bootimg is close.

I was thinking a couple of directions here.
- Mine is the only interface that can boot a non-Linux kernel
  natively.  Bootimg doesn't count because it doesn't do anything
  natively :)

In particular every other boot loader passes the nasty empty zero page
to the new kernel.  Definitely requiring a chain loader.

With an OS neutral format, cataloging the non-probeable hardware
details, and providing those details in an extensible format, I gain a lot 
in easy extensibility.
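
One common way to get that kind of extensibility is a list of tagged,
length-prefixed records, where a reader skips any tag it does not know.
The Python below is only a sketch of that general technique; the record
layout, tag numbers, and function names are hypothetical and are not the
format described in this thread.

```python
import struct

HEADER = struct.Struct("<II")  # hypothetical record header: tag, payload size


def pack_record(tag: int, payload: bytes) -> bytes:
    return HEADER.pack(tag, len(payload)) + payload


def parse_records(blob: bytes, known_tags) -> dict:
    """Collect payloads for known tags and skip everything else."""
    found, offset = {}, 0
    while offset + HEADER.size <= len(blob):
        tag, size = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        if tag in known_tags:
            found[tag] = blob[offset:offset + size]
        offset += size  # unknown records are stepped over, not rejected
    return found


if __name__ == "__main__":
    blob = (pack_record(1, b"mem")
            + pack_record(99, b"added-later")
            + pack_record(2, b"console=ttyS0"))
    print(parse_records(blob, {1, 2}))
```

An old kernel reading a newer table still finds the records it
understands, which is what makes adding fields cheap.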

I need to find time soon and write up all of the file format details
in an RFC like the GRUB multiboot spec.  Possibly even submit it
to the IETF as an RFC for compatible booting and multiple platforms.

And this raises an important point.  Lazy programmers tend to go
with whatever is easiest.  Having a good file format, making this
the easy case, should reduce the number of formats supported
and increase boot interoperability.  Most of what was said
on this score with GRUB I agree with.  I would even be following
the GRUB multiboot spec except it doesn't allow passing of the
unprobeable hardware details and it doesn't allow easy expansion of
what it does pass.  This is the big reason I'm not in favor
of the bootimg approach, that doesn't define anything.


 As far as loading is concerned, bootimg probably has an advantage
 there, because you can put things together in memory (e.g. some
 OS-specific chain loader), without going to secondary storage.

Well, ramfs is hardly secondary storage, though it has
a touch more overhead.  And you only need to do this for the
uncommon case.  Getting images to adapt to a specific bootloader
isn't too hard.  Every other boot loader in the world does it.

 (Proof of concept: bootimg is able to load all currently supported
 kernel image formats on ia32.)

I do concede that bootimg has this ability as well in theory.

I actually have booted multiboot compliant images in an earlier
version of my patch and the cost to support both formats in a kernel
loader is negligible.  My mkelfImage builds linux kernels that
support being booted both ways.

 As far as execution is concerned, you're probably slightly better
 off with an approach that goes back to real mode. (Or use a chain
 loader - this can be transparent to the kernel.) But then, I'm not
 sure if you can re-animate the BIOS in any consistent way, so your
 choice of operating systems may be quite limited, or you have to
 provide your own BIOS substitute.

Agreed, if the goal is to boot code that is designed to start with a single
sector loaded at 0x7c00.  If I really care I might worry about that.
Since linux preserved the first page of memory which includes the
interrupt table reanimating the BIOS might not be so bad. 

My primary non-linux targets are the BSDs, and various experimental
OS's.  And in those cases why go to the pain of dropping out of
protected mode if you are going to just load back into it again.

All of what I do is colored by the fact that in my most important
environment I have no BIOS.  So for me I can't reanimate the BIOS
because it isn't there.  Once this bullet is bitten though this

Re: bzImage ~ 900K with i386 test11-pre2

2000-11-16 Thread Eric W. Biederman

Andi Kleen [EMAIL PROTECTED] writes:

 [This is quite a bizarre discussion, but I'll answer anyways. I am not exactly
 sure what your point is]

Let me step aside a second and explain where I'm coming from.  As a spin
off of the work of the linuxBIOS project I have implemented a system
call that implements exec functionality at the kernel level.  Essentially
allowing you to warm boot linux from linux.  To get this to work no
bios calls are involved, so I'm not using setup.S.  This also has the 
interesting side effect of allowing a boot loader to be written that
will work on all linux platforms.  (I have currently just begun my
port to alpha).

In the process of the above I have learned quite a bit about how
the current boot loader works.  And want eventually to convert linux
to not need wrapper code to use my bootloader.

Booting vmlinux is fun :)
 
 On Sun, Nov 12, 2000 at 11:57:15AM -0700, Eric W. Biederman wrote:
  
I can tell you don't have real hardware.  The non obviousness
  
  I need to retract this a bit.  You are still building a compressed image,
  and the code in the boot/compressed/head.S remains unchanged and loads
  segment registers, so it works by luck.  If you didn't build a
  compressed image you would be in trouble.
 
 boot/compressed/head.S does run in 32bit legacy mode, where you of course
 need segment registers. After you got into long mode segments are only
 needed to jump between 32/64bit code segments and for the data segment
 of the 32bit emulation (+ the iretd bug currently which I hope will be fixed
 in final hardware) 
 
 Also note that boot/compressed/* currently does not even link, because the 
 x86-64 toolchain cannot generate relocated 32bit code ATM (the linker chokes
 on the 32bit relocations) The tests we did so far used a precompiled 
 relocated binary compressed/{head,misc}.o from a IA32 build.
...

   Sure, go ahead if you weren't missing that basic part of the long mode
 specs.
 
   Thanks.
  
  Nope.  Though I suspect we should do the switch to 64bit mode in
  setup.S and not have these issues pollute head.S at all.
 
 I see no advantage in doing it there instead of in head.S

After reading through the long mode specs I now agree.  If you could
be in long mode with the mmu disabled that would be a different story
but you can't and it isn't. 

I was thinking of symmetry with the x86 and how much easier everything
is if you only use one processor mode for the initial boot strap.  No
need for super assemblers etc. Oh well.

On x86 there are some real advantages to moving the segment loads into
setup.S from the various head.S's and they still apply (although to a
lesser extent) to x86-64. This causes less code confusion.

For my kexec stuff I now need to think really hard how I want
to handle x86-64.  What I was thinking would work well in general
is to start the processor in its native/optimal mode with the mmu
disabled.  With x86-64 I can't do this unfortunately :(

Eric



Re: Advanced Linux Kernel/Enterprise Linux Kernel

2000-11-17 Thread Eric W. Biederman

Daniel Phillips [EMAIL PROTECTED] writes:

 Actually, I was planning on putting in a hack to do something
 like that: calculate a checksum after every buffer data update and check
 it after write completion, to make sure nothing scribbled in the buffer
 in the interim.  This would also pick up some bad memory problems.

Be very careful that this just applies to metadata.  For normal data
this is a valid case.  Weird but valid.
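
A small Python model of the proposed check, with the metadata-only
restriction applied.  It assumes the reason ordinary data is exempt is
that it may legitimately change while a write is in flight; the class,
its fields, and the choice of CRC32 are all illustrative and not taken
from any patch.

```python
import zlib


class TrackedBuffer:
    """Toy buffer that remembers a checksum taken at each sanctioned update."""

    def __init__(self, data: bytes, is_metadata: bool):
        self.data = bytearray(data)
        self.is_metadata = is_metadata
        self.checksum = zlib.crc32(self.data)

    def update(self, offset: int, new_bytes: bytes) -> None:
        # A sanctioned update: change the data, then re-checksum.
        self.data[offset:offset + len(new_bytes)] = new_bytes
        self.checksum = zlib.crc32(self.data)

    def verify_after_write(self) -> bool:
        # Only metadata is checked; ordinary data is allowed to have changed.
        if not self.is_metadata:
            return True
        return zlib.crc32(self.data) == self.checksum


if __name__ == "__main__":
    meta = TrackedBuffer(b"inode table", is_metadata=True)
    meta.update(0, b"I")
    print(meta.verify_after_write())
    meta.data[1] ^= 0xFF              # simulate a stray scribble
    print(meta.verify_after_write())
```

Applying the same check to data buffers would report false alarms
whenever the contents were modified during the write.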


Eric



Re: [PATCH] swap=device kernel commandline

2000-11-18 Thread Eric W. Biederman

Werner Almesberger [EMAIL PROTECTED] writes:

 Rik van Riel wrote:
  Did you try to load an initrd on a low-memory machine?
  It shouldn't work and it probably won't ;)
 
 You must be really low on memory ;-)
 
 # zcat initrd.gz | wc -c
  409600
 
 (ash, pwd, chroot, pivot_root, smount, and still about 82 kB free.)

Hmm.  And that's without trying to be small.
I have one that loads a second kernel over the network, using dhcp 
to configure its interface and tftp to fetch the image, and boots it;
that is only 20kb uncompressed.

Compressed, I can fit that plus a kernel plus a minimal
BIOS all in 512K with some room to spare...

Eric



Re: [PATCH] swap=device kernel commandline

2000-11-19 Thread Eric W. Biederman

Werner Almesberger [EMAIL PROTECTED] writes:

 Eric W. Biederman wrote:
  I have one that loads a second kernel over the network using dhcp 
  to configure its interface and tftp to fetch the image and boots
  that is only 20kb uncompressed
 
 Neat ;-) My goal is actually not only size, but also to have a relatively
 normal build environment, e.g. my example is with shared newlib, regular
 ash, and - unfortunately rather wasteful - glibc's ld.so.
 
 But a tftp loader in 20kB is rather good. Now the next challenge is the
 same thing with NFS. Then we can finally kill nfsroot ;-)

Hmm. What does it take to mount an NFS partition?

Anyway.  All I did was write a tiny libc that is just a bunch of
wrappers for syscalls, and some string functions.  Then I just wrote
a straightforward C program to do the job.  Except for my added
kexec call I can compile with glibc :)

Now if glibc wouldn't link in 200k of unused crap when you make a
trivial static binary I'd much prefer to use it...

Though I wish it was possible to have a ramfs preloader instead of
initrd.  An initramfs would allow me to not even compile in the block
device driver layer, and be more efficient.

Eric



Re: Q: Linux rebooting directly into linux.

2000-11-19 Thread Eric W. Biederman

Werner Almesberger [EMAIL PROTECTED] writes:

 Eric W. Biederman wrote:
  Well there is that.  Somehow implementing scatter/gather from 
  a user space process seemed like a potential mess, and extra work.
 
 Did you look at kiobufs ? I think they may just have the right
 functionality. I always wanted bootimg to be able to memory-map things
 to reduce memory pressure, and it seems now all the ingredients are in
 place. Your file-based approach could probably use brw_kiovec.

When I looked kiobufs seemed to do a good gather but not a good scatter.
The code wasn't trivially reusable, and the structures had a lot
of overhead.

 
  I need to find time soon and write up all of the file format details
  in an RFC like the GRUB multiboot spec.  Possibly even submit it
  to the IETF as an RFC for compatible booting and multiple platforms.
 
 Hmm, if you succeed in selling the format as an integral part of your
 network boot protocol, this may even work ;-)

Well I'd sell it to promote interoperability.  What I'm doing protocol
wise has been RFC sanctioned for years.  It's just that every vendor
invents their own format.  So interoperability is a problem.

 
  This is the big reason I'm not in favor
  of the bootimg approach, that doesn't define anything.
 
 Oh, it does - but the policy is implemented in user space. And, of
 course, it's rather simple. But I'm a little confused with your
 UBE. It only seems to copy the e820 information, so you still seem
 to rely on e.g. the SMP tables the BIOS stores in memory. Also, I
 don't quite see where you're using the saved information. What am
 I missing ?

Defining all of the parameters for the UBE is a separate issue.
It comes next in a couple of weeks.

The rebooting is done; the rest is not yet.

As far as where the information is used, look in do_kexec,
right after kimage_get_chunk, which figures out where it is safe
to put the information.  

 However, parameter passing like UBE may solve the following two
 potential problems:
 
  - kernel 1 copies tables marked by "magic" numbers in memory,
then boots kernel 2, which trips over the copy
  - kernel 1 doesn't know about a table and damages it, then boots
kernel 2, which recognizes the table, and trips over it
 
 But I think we don't need to copy or even convert the entire tables for
 this. After all, any OS that boots on i386 already knows how to parse
 the BIOS-provided tables, so I think it's better to directly re-use
 this code than to invent a new format. A few flags or maybe a short
 list should be sufficient for the problems I've described above.

I agree writing the code to understand the table may be a significant
issue.  On the other hand I still think it is worth a look, being
able to unify option parsing for multiple platforms is not a small
gain, nor is getting out from short sighted vendor half standards.

Besides which most tables seem to contain a lot of information that
is probeable.  Which just makes them a waste of BIOS space, and
sources of bugs.

  My primary non-linux targets are the BSDs, and various experimental
  OS's.  And in those cases why go to the pain of dropping out of
  protected mode if you are going to just load back into it again.
 
 Yep, I fully agree.
 
  Compiling the code in its own file and putting it in its own section
  of the kernel for size would probably do it though.
 
 This is exactly what bootimg does :)
 
  Being sure the code is PIC is a little tricky though.
 
 Yes, for now I cheat and depend on gcc to generate code that just
 happens to be PIC.

Hmm. I wonder how hard it would be to add -fPIC to the compilation
line for that file.  But I'm not certain that would do what I want
in this instance...

Eric



Re: neighbour table?

2000-11-19 Thread Eric W. Biederman

David Ford [EMAIL PROTECTED] writes:

 Andrew Park wrote:
 
  I get a message
 
  neighbour table overflow
 
  What does that mean?  It seems that
 
  net/ipv4/route.c
 
  is the place where it prints this.  But under what circumstances
  does this happen?
  Thanks
 
 It means you set the link state of eth0 up before lo.
 
 Be sure lo is established before eth0 and you won't see this message.

Hmm.  How does the interaction work?  I've been meaning to track it down for
a while but haven't yet.  

From the cases I have observed it seems to be connected with arp requests
that aren't answered (i.e. when something is misconfigured and you try to nfsroot off
of the wrong ip on your subnet).
And I keep thinking neighbour table underflow would have been a better message.

Eric




Re: Q: Linux rebooting directly into linux.

2000-11-19 Thread Eric W. Biederman

Werner Almesberger [EMAIL PROTECTED] writes:

 Eric W. Biederman wrote:
  The code wasn't trivially reusable, and the structures had a lot
  of overhead.
 
 There's some overhead, but I think it's not too bad. I'll give it a
 try ...
 
  The rebooting is done the rest is not yet.
 
 Ah, and I already wondered where in all the APIC code you've hidden
 the magic to avoid the config data clobbering issues ;-)

Nope.  That just comes in two parts.
The first chunk is the work on the apic so the deadlock detector
can run on UP kernels, from Ingo Molnar.  The second part is my
cleanups so we leave the apic in a sane state upon reboot.
 
  I agree writing the code to understand the table may be a significant
  issue.  On the other hand I still think it is worth a look, being
  able to unify option parsing for multiple platforms is not a small
  gain, nor is getting out from short sighted vendor half standards.
 
 Well, you certainly have a point where stupid vendors and BIOS nonsense
 are concerned. However, if we ignore LinuxBIOS for a moment, each
 platform already has a set of configuration parameter passing conventions
 imposed by the firmware. So we need to be able to handle this anyway, and
 most of the information is highly platform-specific.
 
 LinuxBIOS is a special case, because you have your own firmware. But
 what you're suggesting is basically yet another parameter format, which
 needs to incorporate and possibly unify much of the information
 contained in all those platform-specific formats. I'm not sure it's worth
 the effort.
 
 And, besides, I think it complicates the kernel, because you either
 have to add a parallel set of functions extracting and processing data
 from the "native" or the UBE environment, or you have to add a converter
 between "native" and UBE for each platform. Or do you have a better
 plan ?

My initial plan was to have two parallel table parsers.  The ones we
have now.  And another based on UBE.  If we find the information we
need via UBE use that.  If not fall back to the old way.
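
In outline, the fallback described above could look like the following
Python sketch.  The function and parameter names are placeholders chosen
for the example, and the dictionaries stand in for whatever the real
table parsers would return.

```python
def lookup_boot_param(name, ube_params, native_lookup):
    """Prefer the UBE-style table; fall back to the existing native parser."""
    if ube_params is not None and name in ube_params:
        return ube_params[name]
    return native_lookup(name)


if __name__ == "__main__":
    # Stand-in for the firmware-specific parser the kernel already has.
    native = {"mem_kb": 65536, "cmdline": "root=/dev/hda1"}.get
    ube = {"cmdline": "root=/dev/nfs"}

    print(lookup_boot_param("cmdline", ube, native))   # found via UBE
    print(lookup_boot_param("mem_kb", ube, native))    # falls back to native
    print(lookup_boot_param("mem_kb", None, native))   # no UBE table at all
```

Because the native path is untouched, a kernel booted by ordinary
firmware behaves exactly as before.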

But the tables are only half of it.  Right now we have all kinds
of weirdness going through the empty_zero_page at boot time.
A lot of that I plan on just gathering in UBE format instead of random
data in random locations.  Since setup.S implements this it should
be transparent to most everything.

But I need to see how well that works first before I'm too committed
either way.

For x86 it isn't too big of a deal.  For other platforms though
where the firmware comes in multiple flavors, converting everything
looks like it could be a real win.

I guess what I'm most after is improving the linux BIOS abstraction layer.
We mostly have one, and only do BIOS calls before really starting the
kernel (except for some stupid BIOS standards like APM).

 When I started with bootimg, I also thought that we'd need some
 parameter passing mechanism, a bit similar to UBE (although I would
 have tried to be more text-based). Then I realized that there are
 actually only a few tables, and we can just keep them in memory. And
 some of them need to be modified before we can re-use them. (Trivial
 example: the boot command line. Video modes are a similar, although
 much more complicated issue.)

I agree with tables that we need to be careful.  A lossy conversion
can be a real problem.  The empty_zero_page is my first candidate,
and I'll see where it goes from there.

One of the more ugly challenges that I've already run into is that
there are multiple tables for specifying how interrupts are routed.
(On a modern PC the irq number is dynamically assigned).  I would
rather have one good table than two that fight each other.

But the point is that looking through the parameters and figuring
out what works and what makes sense will take some doing, and
I'm not promising to do any more than clean up the empty_zero_page.

 
  Besides which most tables seem to contain a lot of information that
  is probeable.  Which just makes them a waste of BIOS space, and
  sources of bugs.
 
 Agreed with BIOS bugs ;-) Where probing is possible, is it reliable ?
 It'd take some baroque BIOS parameter table over yet another mandatory
 boot command line parameter any time ...
 
  Hmm. I wonder how hard it would be to add -fPIC to the compilation
  line for that file.  But I'm not certain that would do what I want
  in this instance...
 
 Are there actually architectures where the compiler generates
 position-dependent code even if you're careful ? (I.e. all functions
 inlined, only auto variables.)

I don't know yet.  And since that part is machine specific, x86 is
really the only case that matters.  I just don't quite trust the compiler.
But next rev I'll make certain to steal this code from bootimg.

Given a normal architecture I believe no references to global data
should be sufficient, to ensure the code is pic.  Inlines are
interesting because they aren't always inlined.  To be really 
certain you can specify -fPIC a

Re: Kernel 2.5 Workshop RealVideo streams -- next time, please get better audio.

2001-04-18 Thread Eric W. Biederman

Miles Lane [EMAIL PROTECTED] writes:

   http://www.osdn.com/conferences/kernel/
 
 Thanks to all responsible for getting these captures 
 of the Kernel 2.5 Workshop presentations put together.
 
 There is one major shortcoming of the recordings.
 Usually, only the comments of the presenter(s)
 can be heard.  This reduces the value of these
 recordings substantially, since the comments, insights
 and give-and-take of the other kernel developers would
 help us get a much more complete understanding of the
 areas being presented -- try listening to Andy Grover's
 Power Management presentation and you'll see what I 
 mean.

I actually managed to get almost all of it by simply pressing my ear
against my speaker, and then pulling back quickly when the main
speaker was talking.

So my question is, what would it take to get some automatic software
volume correction going.  This looks like it would be the easiest fix
of all.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] Longstanding elf fix (2.2.19 fix)

2001-04-22 Thread Eric W. Biederman


A little while ago I was playing with building an elf self extracting
binary.  In doing so I discovered that the linux kernel does not
handle elf program headers with multiple BSS segments.
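
To make the problem concrete, here is a minimal userspace sketch of the page arithmetic involved in mapping one loadable segment whose memory size is larger than its file size (the extra part being bss). It assumes 4 KiB pages. The names seg_layout and seg_compute are made up for illustration and are not kernel interfaces; the patch itself does the equivalent work with ELF_PAGESTART, ELF_PAGEOFFSET and ELF_PAGEALIGN.

```c
#define PAGE_SZ 4096UL  /* assumption: 4 KiB pages */

/* Round down to a page boundary, offset within a page, round up. */
static unsigned long page_start(unsigned long v)  { return v & ~(PAGE_SZ - 1); }
static unsigned long page_offset(unsigned long v) { return v & (PAGE_SZ - 1); }
static unsigned long page_align(unsigned long v)  { return (v + PAGE_SZ - 1) & ~(PAGE_SZ - 1); }

/* Hypothetical summary of how one segment would be mapped. */
struct seg_layout {
	unsigned long start;     /* page-aligned start address */
	unsigned long data_len;  /* bytes mapped from the file, page-rounded */
	unsigned long mem_len;   /* total bytes in memory, page-rounded */
	unsigned long anon_len;  /* zero-filled pages mapped after the file data */
	unsigned long zero_from; /* first byte to clear in the last file page, 0 if none */
};

static struct seg_layout seg_compute(unsigned long vaddr,
				     unsigned long filesz,
				     unsigned long memsz)
{
	struct seg_layout l;
	unsigned long off = page_offset(vaddr);

	l.start = page_start(vaddr);
	l.mem_len = page_align(memsz + off);
	if (filesz) {
		/* File-backed part first, anonymous pages for the rest. */
		l.data_len = page_align(filesz + off);
		l.anon_len = l.mem_len - l.data_len;
		l.zero_from = vaddr + filesz;
	} else {
		/* Pure bss segment: everything is anonymous memory. */
		l.data_len = 0;
		l.anon_len = l.mem_len;
		l.zero_from = 0;
	}
	return l;
}
```

Doing this per program header, rather than once for a single trailing bss, is what lets each segment carry its own bss.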

Eric



Binary files linux-2.2.19/drivers/char/conmakehash and linux-2.2.19.elf-fix/drivers/char/conmakehash differ
Binary files linux-2.2.19/drivers/char/hfmodem/gentbl and linux-2.2.19.elf-fix/drivers/char/hfmodem/gentbl differ
diff -uNrX linux-exclude-files linux-2.2.19/fs/binfmt_elf.c linux-2.2.19.elf-fix/fs/binfmt_elf.c
--- linux-2.2.19/fs/binfmt_elf.c	Fri Apr 20 13:25:11 2001
+++ linux-2.2.19.elf-fix/fs/binfmt_elf.c	Sun Apr 22 17:55:42 2001
@@ -71,18 +71,6 @@
 #endif
 };
 
-static void set_brk(unsigned long start, unsigned long end)
-{
-	start = ELF_PAGEALIGN(start);
-	end = ELF_PAGEALIGN(end);
-	if (end <= start)
-		return;
-	do_mmap(NULL, start, end - start,
-		PROT_READ | PROT_WRITE | PROT_EXEC,
-		MAP_FIXED | MAP_PRIVATE, 0);
-}
-
-
 /* We need to explicitly zero any fractional pages
after the data section (i.e. bss).  This would
contain the junk from the file that should not
@@ -213,6 +201,28 @@
 	return sp;
 }
 
+static inline unsigned long
+elf_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type)
+{
+	unsigned long start, data_len, mem_len, offset;
+	unsigned long map_addr;
+
+	start = ELF_PAGESTART(addr);
+	data_len = ELF_PAGEALIGN(eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	mem_len = ELF_PAGEALIGN(eppnt->p_memsz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	offset = eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr);
+	
+	if (eppnt->p_filesz) {
+		map_addr = do_mmap(filep, start, data_len, prot, type, offset);
+		do_mmap(NULL, map_addr + data_len, mem_len - data_len, prot,
+			MAP_FIXED | MAP_PRIVATE, 0);
+		padzero(map_addr + eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	} else {
+		map_addr = do_mmap(NULL, start, mem_len, prot, MAP_PRIVATE, 0);
+	}
+	return(map_addr);
+}
+
 
 /* This is much more generalized than the library routine read function,
so we keep this separate.  Technically the library read function
@@ -293,12 +303,7 @@
 #endif
 	}
 
-	map_addr = do_mmap(file,
-			load_addr + ELF_PAGESTART(vaddr),
-			eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr),
-			elf_prot,
-			elf_type,
-			eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr));
+	map_addr = elf_map(file, load_addr + vaddr, eppnt, elf_prot, elf_type);
 	if (map_addr > -1024UL) /* Real error */
 		goto out_close;
 
@@ -325,23 +330,6 @@
 	  }
 	}
 
-	/* Now use mmap to map the library into memory. */
-
-	/*
-	 * Now fill out the bss section.  First pad the last page up
-	 * to the page boundary, and then perform a mmap to make sure
-	 * that there are zero-mapped pages up to and including the 
-	 * last bss page.
-	 */
-	padzero(elf_bss);
-	elf_bss = ELF_PAGESTART(elf_bss + ELF_EXEC_PAGESIZE - 1); /* What we have mapped so far */
-
-	/* Map the last of the bss segment */
-	if (last_bss > elf_bss)
-		do_mmap(NULL, elf_bss, last_bss - elf_bss,
-			PROT_READ|PROT_WRITE|PROT_EXEC,
-			MAP_FIXED|MAP_PRIVATE, 0);
-
 	*interp_load_addr = load_addr;
 	/*
 	 * AUDIT: is everything deallocated properly if this happens
@@ -660,12 +648,7 @@
 		if (elf_ex.e_type == ET_EXEC || load_addr_set) {
 			elf_flags |= MAP_FIXED;
 		}
-
-		error = do_mmap(file, ELF_PAGESTART(load_bias + vaddr),
-		(elf_ppnt->p_filesz +
-		ELF_PAGEOFFSET(elf_ppnt->p_vaddr)),
-		elf_prot, elf_flags, (elf_ppnt->p_offset -
-		ELF_PAGEOFFSET(elf_ppnt->p_vaddr)));
+		error = elf_map(file, load_bias + vaddr, elf_ppnt, elf_prot, elf_flags);
 
 		if (!load_addr_set) {
 			load_addr_set = 1;
@@ -760,13 +743,6 @@
 	current->mm->start_code = start_code;
 	current->mm->end_data = end_data;
 	current->mm->start_stack = bprm->p;
-
-	/* Calling set_brk effectively mmaps the pages that we need
-	 * for the bss and break sections
-	 */
-	set_brk(elf_bss, elf_brk);
-
-	padzero(elf_bss);
 
 #if 0
 	printk("(start_brk) %x\n" , current->mm->start_brk);









[PATCH] Longstanding elf fix (2.4.3 fix)

2001-04-22 Thread Eric W. Biederman


A little while ago I was playing with building an elf self extracting
binary.  In doing so I discovered that the linux kernel does not
handle elf program headers with multiple BSS segments.

In building a patch for 2.4.3 I also discovered that we are not taking 
the mmap_sem around do_brk in the exec paths.

Attached is a patch that corrects both of these problems.

Eric



diff -uNrX linux-exclude-files linux-2.4.3/arch/mips/kernel/irixelf.c linux-2.4.3.elf-fix2/arch/mips/kernel/irixelf.c
--- linux-2.4.3/arch/mips/kernel/irixelf.c	Fri Apr 20 12:06:40 2001
+++ linux-2.4.3.elf-fix2/arch/mips/kernel/irixelf.c	Sun Apr 22 17:00:28 2001
@@ -130,7 +130,9 @@
 	end = PAGE_ALIGN(end);
 	if (end <= start) 
 		return;
+	down_write(&current->mm->mmap_sem);
 	do_brk(start, end - start);
+	up_write(&current->mm->mmap_sem);
 }
 
 
@@ -379,7 +381,9 @@
 
 	/* Map the last of the bss segment */
 	if (last_bss > len) {
+		down_write(&current->mm->mmap_sem);
 		do_brk(len, (last_bss - len));
+		up_write(&current->mm->mmap_sem);
 	}
 	kfree(elf_phdata);
 
@@ -567,8 +571,10 @@
 	unsigned long v;
 	struct prda *pp;
 
+	down_write(&current->mm->mmap_sem);
 	v =  do_brk (PRDA_ADDRESS, PAGE_SIZE);
-	
+	up_write(&current->mm->mmap_sem);
+		
 	if (v < 0)
 		return;
 
@@ -858,8 +864,11 @@
 
 	len = (elf_phdata->p_filesz + elf_phdata->p_vaddr+ 0xfff) & 0xf000;
 	bss = elf_phdata->p_memsz + elf_phdata->p_vaddr;
-	if (bss > len)
-	  do_brk(len, bss-len);
+	if (bss > len) {
+		down_write(&current->mm->mmap_sem);
+		do_brk(len, bss-len);
+		up_write(&current->mm->mmap_sem);
+	}
 	kfree(elf_phdata);
 	return 0;
 }
diff -uNrX linux-exclude-files linux-2.4.3/arch/s390x/kernel/binfmt_elf32.c linux-2.4.3.elf-fix2/arch/s390x/kernel/binfmt_elf32.c
--- linux-2.4.3/arch/s390x/kernel/binfmt_elf32.c	Fri Apr 20 12:06:43 2001
+++ linux-2.4.3.elf-fix2/arch/s390x/kernel/binfmt_elf32.c	Sun Apr 22 17:00:28 2001
@@ -188,16 +188,29 @@
 static unsigned long
 elf_map32 (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type)
 {
+	unsigned long start, data_len, mem_len, offset;
 	unsigned long map_addr;
 
 	if(!addr)
 		addr = 0x4000;
 
-	down_write(&current->mm->mmap_sem);
-	map_addr = do_mmap(filep, ELF_PAGESTART(addr),
-			   eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr), prot, type,
-			   eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr));
-	up_write(&current->mm->mmap_sem);
+	start = ELF_PAGESTART(addr);
+	data_len = ELF_PAGEALIGN(eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	mem_len = ELF_PAGEALIGN(eppnt->p_memsz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	offset = eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr);
+
+	if (eppnt->p_filesz) {
+		down_write(&current->mm->mmap_sem);
+		map_addr = do_mmap(filep, start, data_len, prot, type, offset);
+		do_mmap(NULL, map_addr + data_len, mem_len - data_len, prot,
+			MAP_FIXED | MAP_PRIVATE, 0);
+		up_write(&current->mm->mmap_sem);
+		padzero(map_addr + eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	} else {
+		down_write(&current->mm->mmap_sem);
+		map_addr = do_mmap(NULL, start, mem_len, prot, MAP_PRIVATE, 0);
+		up_write(&current->mm->mmap_sem);
+	}
 	return(map_addr);
 }
 
diff -uNrX linux-exclude-files linux-2.4.3/arch/sparc64/kernel/binfmt_aout32.c linux-2.4.3.elf-fix2/arch/sparc64/kernel/binfmt_aout32.c
--- linux-2.4.3/arch/sparc64/kernel/binfmt_aout32.c	Fri Apr 20 12:06:44 2001
+++ linux-2.4.3.elf-fix2/arch/sparc64/kernel/binfmt_aout32.c	Sun Apr 22 17:00:28 2001
@@ -49,7 +49,9 @@
 	end = PAGE_ALIGN(end);
 	if (end <= start)
 		return;
+	down_write(&current->mm->mmap_sem);
 	do_brk(start, end - start);
+	up_write(&current->mm->mmap_sem);
 }
 
 /*
@@ -245,10 +247,17 @@
 	if (N_MAGIC(ex) == NMAGIC) {
 		loff_t pos = fd_offset;
 		/* Fuck me plenty... */
+		down_write(&current->mm->mmap_sem);
 		error = do_brk(N_TXTADDR(ex), ex.a_text);
+		up_write(&current->mm->mmap_sem);
+
 		bprm->file->f_op->read(bprm->file, (char *) N_TXTADDR(ex),
 			  ex.a_text, &pos);
+
+		down_write(&current->mm->mmap_sem);
 		error = do_brk(N_DATADDR(ex), ex.a_data);
+		up_write(&current->mm->mmap_sem);
+
 		bprm->file->f_op->read(bprm->file, (char *) N_DATADDR(ex),
 			  ex.a_data, &pos);
 		goto beyond_if;
@@ -256,8 +265,10 @@
 
 	if (N_MAGIC(ex) == OMAGIC) {
 		loff_t pos = fd_offset;
+		down_write(&current->mm->mmap_sem);
 		do_brk(N_TXTADDR(ex) & PAGE_MASK,
 			ex.a_text+ex.a_data + PAGE_SIZE - 1);
+		up_write(&current->mm->mmap_sem);
 		bprm->file->f_op->read(bprm->file, (char *) N_TXTADDR(ex),
 			  ex.a_text+ex.a_data, &pos);
 	} else {
@@ -271,7 +282,9 @@
 
 		if (!bprm->file->f_op->mmap) {
 			loff_t pos = fd_offset;
+			down_write(&current->mm->mmap_sem);
 			do_brk(0, ex.a_text+ex.a_data);
+			up_write(&current->mm->mmap_sem);
 			bprm->file->f_op->read(bprm->file,(char *)N_TXTADDR(ex),
 				  ex.a_text+ex.a_data, &pos);
 			goto beyond_if;
@@ -382,7 +395,9 @@
 	len = PAGE_ALIGN(ex.a_text + ex.a_data);
 	bss = ex.a_text + ex.a_data + ex.a_bss;
 	if (bss > len) {
+		down_write(&current->mm->mmap_sem);
 		error = do_brk(start_addr + len, bss - len);
+		up_write(&current->mm->mmap_sem);
 		retval = 

[PATCH] Add DHCP to 2.4.x ipconfig support

2001-04-22 Thread Eric W. Biederman


Here is a forward port of the 2.2.x improvements to ipconfig.c.
Especially support for DHCP.

Eric



diff -uNr linux-2.4.3/Documentation/Configure.help linux-2.4.3.ipdhcp/Documentation/Configure.help
--- linux-2.4.3/Documentation/Configure.help	Fri Apr 20 12:06:37 2001
+++ linux-2.4.3.ipdhcp/Documentation/Configure.help	Sun Apr 22 16:03:26 2001
@@ -3961,6 +3961,21 @@
   want to use BOOTP, a BOOTP server must be operating on your network.
   Read Documentation/nfsroot.txt for details.
 
+DHCP support
+CONFIG_IP_PNP_DHCP
+  If you want your Linux box to mount its whole root filesystem (the
+  one containing the directory /) from some other computer over the
+  net via NFS and you want the IP address of your computer to be
+  discovered automatically at boot time using the DHCP protocol (a
+  special protocol designed for doing this job), say Y here. In case
+  the boot ROM of your network card was designed for booting Linux and
+  does DHCP itself, providing all necessary information on the kernel
+  command line, you can say N here.
+
+  If unsure, say Y. Note that if you want to use DHCP, a DHCP server
+  must be operating on your network.  Read Documentation/nfsroot.txt
+  for details.
+
 RARP support
 CONFIG_IP_PNP_RARP
   If you want your Linux box to mount its whole root file system (the
diff -uNr linux-2.4.3/include/net/ipconfig.h linux-2.4.3.ipdhcp/include/net/ipconfig.h
--- linux-2.4.3/include/net/ipconfig.h	Mon Jan  4 16:31:35 1999
+++ linux-2.4.3.ipdhcp/include/net/ipconfig.h	Sun Apr 22 16:03:26 2001
@@ -6,16 +6,33 @@
  *  Automatic IP Layer Configuration
  */
 
-extern __u32 root_server_addr;
-extern u8 root_server_path[];
-extern u32 ic_myaddr;
-extern u32 ic_servaddr;
-extern u32 ic_gateway;
-extern u32 ic_netmask;
-extern int ic_enable;
-extern int ic_host_name_set;
-extern int ic_set_manually;
-extern int ic_proto_enabled;
+/* The following are initdata: */
 
-#define IC_BOOTP 1
-#define IC_RARP 2
+extern int ic_enable;		/* Enable or disable the whole shebang */
+
+extern int ic_proto_enabled;	/* Protocols enabled (see IC_xxx) */
+extern int ic_host_name_set;	/* Host name set by ipconfig? */
+extern int ic_set_manually;	/* IPconfig parameters set manually */
+
+extern u32 ic_myaddr;		/* My IP address */
+extern u32 ic_netmask;		/* Netmask for local subnet */
+extern u32 ic_gateway;		/* Gateway IP address */
+
+extern u32 ic_servaddr;		/* Boot server IP address */
+
+extern u32 root_server_addr;	/* Address of NFS server */
+extern u8 root_server_path[];	/* Path to mount as root */
+
+
+
+/* The following are persistent (not initdata): */
+
+extern int ic_proto_used;	/* Protocol used, if any */
+extern u32 ic_nameserver;	/* DNS server IP address */
+extern u8 ic_domain[];		/* DNS (not NIS) domain name */
+
+/* bits in ic_proto_{enabled,used} */
+#define IC_PROTO	0xFF	/* Protocols mask: */
+#define IC_BOOTP	0x01	/*   BOOTP (or DHCP, see below) */
+#define IC_RARP		0x02	/*   RARP */
+#define IC_USE_DHCP	0x100	/* If on, use DHCP instead of BOOTP */
diff -uNr linux-2.4.3/net/ipv4/Config.in linux-2.4.3.ipdhcp/net/ipv4/Config.in
--- linux-2.4.3/net/ipv4/Config.in	Tue Nov  7 15:12:02 2000
+++ linux-2.4.3.ipdhcp/net/ipv4/Config.in	Sun Apr 22 16:03:26 2001
@@ -20,6 +20,7 @@
 fi
 bool '  IP: kernel level autoconfiguration' CONFIG_IP_PNP
 if [ $CONFIG_IP_PNP = y ]; then
+   bool 'IP: DHCP support' CONFIG_IP_PNP_DHCP
    bool 'IP: BOOTP support' CONFIG_IP_PNP_BOOTP
    bool 'IP: RARP support' CONFIG_IP_PNP_RARP
 # not yet ready..
diff -uNr linux-2.4.3/net/ipv4/ipconfig.c linux-2.4.3.ipdhcp/net/ipv4/ipconfig.c
--- linux-2.4.3/net/ipv4/ipconfig.c	Mon Mar 26 18:20:57 2001
+++ linux-2.4.3.ipdhcp/net/ipv4/ipconfig.c	Sun Apr 22 16:55:36 2001
@@ -1,10 +1,10 @@
 /*
  *  $Id: ipconfig.c,v 1.35 2000/12/30 06:46:36 davem Exp $
  *
- *  Automatic Configuration of IP -- use BOOTP or RARP or user-supplied
- *  information to configure own IP address and routes.
+ *  Automatic Configuration of IP -- use DHCP, BOOTP, RARP, or
+ *  user-supplied information to configure own IP address and routes.
  *
- *  Copyright (C) 1996--1998 Martin Mares [EMAIL PROTECTED]
+ *  Copyright (C) 1996-1998 Martin Mares [EMAIL PROTECTED]
  *
  *  Derived from network configuration code in fs/nfs/nfsroot.c,
  *  originally Copyright (C) 1995, 1996 Gero Kuhlmann and me.
@@ -16,6 +16,16 @@
  *  Fixed ip_auto_config_setup calling at startup in the new Linker Magic
  *  initialization scheme.
  *	- Arnaldo Carvalho de Melo [EMAIL PROTECTED], 08/11/1999
+ *
+ *  DHCP support added.  To users this looks like a whole separate
+ *  protocol, but we know it's just a bag on the side of BOOTP.
+ *		-- Chip Salzenberg [EMAIL PROTECTED], May 2000
+ *
+ *  Ported DHCP support from 2.2.16 to 2.4.0-test4
+ *  -- Eric Biederman [EMAIL PROTECTED], 30 Aug 2000
+ *
+ *  Merged changes from 2.2.19 into 2.4.3
+ *  -- Eric Biederman [EMAIL PROTECTED], 22 April 2001
  */
 
 #include <linux/config.h>
@@ -36,6 +46,7 @@
 

Re: [PATCH] Longstanding elf fix (2.4.3 fix)

2001-04-23 Thread Eric W. Biederman

David S. Miller [EMAIL PROTECTED] writes:

 Eric W. Biederman writes:
   In building a patch for 2.4.3 I also discovered that we are not taking 
   the mmap_sem around do_brk in the exec paths.
 
 Does that really matter?  

In the library loader I can certainly see it making a difference.

 Who else can get at the address space?
  We are a singly referenced address space at that point... perhaps ptrace?

In practice I don't see it being a big deal.  But reliable code is
made by closing all of the little loopholes.

It also improves consistency, as all of the calls to do_mmap are
already protected in the exec paths.

And of course, since much of the code in the kernel is written by
copying a good example, neglecting the locking without a big comment
invites trouble elsewhere, such as in elf_load_library, where we could
have multiple threads running.

Eric



Re: [PATCH] Longstanding elf fix (2.4.3 fix)

2001-04-23 Thread Eric W. Biederman

David S. Miller [EMAIL PROTECTED] writes:

 Eric W. Biederman writes:
   In building a patch for 2.4.3 I also discovered that we are not taking 
   the mmap_sem around do_brk in the exec paths.
 
 Does that really matter?  Who else can get at the address space?  We
 are a singly referenced address space at that point... perhaps ptrace?

Well looking a little more closely than I did last night it looks like
access_process_vm (called from ptrace) can cause what amounts to a
page fault at pretty arbitrary times.  

ptrace is protected by the big kernel lock, but exec isn't so that
doesn't help.  Hmm.  ptrace does require that the process be stopped
in all cases, before it does anything and that probably saves us.  This
is subtle enough I'd rather be locally correct, and not have to
worry about someone enhancing ptrace...
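
Being "locally correct" can be sketched as a wrapper that always takes the semaphore around the address-space update, so no caller can forget it. The following is a single-threaded userspace model and not kernel code: mm_model, sem_down, sem_up, brk_raw and brk_locked are hypothetical names, and the "semaphore" is only a flag that lets the model count updates made without it.

```c
/* Toy stand-in for an address space guarded by a semaphore. */
struct mm_model {
	int sem_held;            /* 1 while the (model) mmap_sem is held */
	unsigned long brk_end;   /* end of the mapped bss/brk area */
	int unlocked_updates;    /* updates made without holding the lock */
};

static void sem_down(struct mm_model *mm) { mm->sem_held = 1; }
static void sem_up(struct mm_model *mm)   { mm->sem_held = 0; }

/* Raw update: records a violation if the caller did not take the lock. */
static void brk_raw(struct mm_model *mm, unsigned long start, unsigned long len)
{
	if (!mm->sem_held)
		mm->unlocked_updates++;
	mm->brk_end = start + len;
}

/* Locked wrapper: the entry point exec-path code would be expected to use. */
static void brk_locked(struct mm_model *mm, unsigned long start, unsigned long len)
{
	sem_down(mm);
	brk_raw(mm, start, len);
	sem_up(mm);
}
```

The point of the wrapper is that correctness no longer depends on reasoning about who else might touch the address space at that moment.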

I'm actually a little curious what the big kernel lock in ptrace buys
us.  I suspect it could be a performance issue with user mode linux,
where you have multiple processes being ptraced at the same time.

Eric



Re: [PATCH] Longstanding elf fix (2.4.3 fix)

2001-04-23 Thread Eric W. Biederman

Linus Torvalds [EMAIL PROTECTED] writes:

 On 23 Apr 2001, Eric W. Biederman wrote:
  
  ptrace is protected by the big kernel lock, but exec isn't so that
  doesn't help.  Hmm.  ptrace does require that the process be stopped
  in all cases
 
 Right. Ptrace definitely cannot access a process at arbitrary times. In
 fact, it is very serialized indeed, in that it can only access a process
 at signal points, ie effectively when it is returning to user space.
 
 With threads, of course, that doesn't help us. But with threads, the other
 threads could have caused the same page faults, so ptrace() isn't actually
 adding any new cases in that sense.
 
 I'd be a lot more worried about /proc accesses.

access_process_vm is also called from /proc to get the environment and
the command line.  I don't know if it has other locks it might
serialize on, probably not.  With execve it's a very small window...

 execve() doesn't really need the mm semaphore, but on the other hand it
 would be cleaner to get it, and it won't really hurt (there can not be any
 real contention on it anyway - the only contention might come through
 /proc, and I haven't looked at what that might imply).
 
 load-library should definitely get it. I thought it did already, but..
 
 Did you have a patch? Maybe I missed it.

I'll include it again.  I had it attached as a plain text attachment,
I don't know if that is a problem or not.

In the case I spotted, we were getting the mm semaphore for do_mmap but
not for do_brk.  So we only get it 50% of the time...  

The other thing my patch does is update elf_map so we now handle ELF
files with multiple bss sections.

Eric

diff -uNrX linux-exclude-files linux-2.4.3/arch/mips/kernel/irixelf.c linux-2.4.3.elf-fix2/arch/mips/kernel/irixelf.c
--- linux-2.4.3/arch/mips/kernel/irixelf.c	Fri Apr 20 12:06:40 2001
+++ linux-2.4.3.elf-fix2/arch/mips/kernel/irixelf.c	Sun Apr 22 17:00:28 2001
@@ -130,7 +130,9 @@
 	end = PAGE_ALIGN(end);
 	if (end <= start) 
 		return;
+	down_write(&current->mm->mmap_sem);
 	do_brk(start, end - start);
+	up_write(&current->mm->mmap_sem);
 }
 
 
@@ -379,7 +381,9 @@
 
 	/* Map the last of the bss segment */
 	if (last_bss > len) {
+		down_write(&current->mm->mmap_sem);
 		do_brk(len, (last_bss - len));
+		up_write(&current->mm->mmap_sem);
 	}
 	kfree(elf_phdata);
 
@@ -567,8 +571,10 @@
 	unsigned long v;
 	struct prda *pp;
 
+	down_write(&current->mm->mmap_sem);
 	v =  do_brk (PRDA_ADDRESS, PAGE_SIZE);
-	
+	up_write(&current->mm->mmap_sem);
+		
 	if (v < 0)
 		return;
 
@@ -858,8 +864,11 @@
 
 	len = (elf_phdata->p_filesz + elf_phdata->p_vaddr+ 0xfff) & 0xf000;
 	bss = elf_phdata->p_memsz + elf_phdata->p_vaddr;
-	if (bss > len)
-	  do_brk(len, bss-len);
+	if (bss > len) {
+		down_write(&current->mm->mmap_sem);
+		do_brk(len, bss-len);
+		up_write(&current->mm->mmap_sem);
+	}
 	kfree(elf_phdata);
 	return 0;
 }
diff -uNrX linux-exclude-files linux-2.4.3/arch/s390x/kernel/binfmt_elf32.c linux-2.4.3.elf-fix2/arch/s390x/kernel/binfmt_elf32.c
--- linux-2.4.3/arch/s390x/kernel/binfmt_elf32.c	Fri Apr 20 12:06:43 2001
+++ linux-2.4.3.elf-fix2/arch/s390x/kernel/binfmt_elf32.c	Sun Apr 22 17:00:28 2001
@@ -188,16 +188,29 @@
 static unsigned long
 elf_map32 (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type)
 {
+	unsigned long start, data_len, mem_len, offset;
 	unsigned long map_addr;
 
 	if(!addr)
 		addr = 0x4000;
 
-	down_write(&current->mm->mmap_sem);
-	map_addr = do_mmap(filep, ELF_PAGESTART(addr),
-			   eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr), prot, type,
-			   eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr));
-	up_write(&current->mm->mmap_sem);
+	start = ELF_PAGESTART(addr);
+	data_len = ELF_PAGEALIGN(eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	mem_len = ELF_PAGEALIGN(eppnt->p_memsz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	offset = eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr);
+
+	if (eppnt->p_filesz) {
+		down_write(&current->mm->mmap_sem);
+		map_addr = do_mmap(filep, start, data_len, prot, type, offset);
+		do_mmap(NULL, map_addr + data_len, mem_len - data_len, prot,
+			MAP_FIXED | MAP_PRIVATE, 0);
+		up_write(&current->mm->mmap_sem);
+		padzero(map_addr + eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr));
+	} else {
+		down_write(&current->mm->mmap_sem);
+		map_addr = do_mmap(NULL, start, mem_len, prot, MAP_PRIVATE, 0);
+		up_write(&current->mm->mmap_sem);
+	}
 	return(map_addr);
 }
 
diff -uNrX linux-exclude-files linux-2.4.3/arch/sparc64

Re: [PATCH] Longstanding elf fix (2.4.3 fix)

2001-04-24 Thread Eric W. Biederman

Manfred Spraul [EMAIL PROTECTED] writes:

  Well looking a little more closely than I did last night it looks like
  access_process_vm (called from ptrace) can cause what amounts to a
  page fault at pretty arbitrary times.
 
 It's also used for several /proc/pid files.
 
 I remember that I got crashes with concurrent exec+cat
 /proc/pid/cmdline until down(mmap_sem) was added into
 setup_arg_pages().

O.k. Then the race I'm catching is real, though because it is confined
to bss sections we are quite unlikely to trigger it.

Eric



Re: PATCH: trident , pci_enable_device moved

2001-04-27 Thread Eric W. Biederman

Jeff Garzik [EMAIL PROTECTED] writes:

 Andres Salomon wrote:
  This is what I was told (it was only needed for secondary video
  devices).  From that, I would expect that all video devices would
  need it, just in case they happened to be the second card.  Am I
  missing some subtlety in some of the video drivers/chipsets that
  wouldn't allow them to be used as a second video device (therefore
  not requiring pci_enable_device)?
 
 They do need pci_enable_device, both primary and secondary displays. 
 For the primary display it's safe to call pci_enable_device.  For
 secondary displays, you have to first disable I/O decoding for all VGA
 devices before you can enable a secondary display.  You don't want more
 than one device decoding the legacy VGA region at any one time.
 
 Some cards have the capability to relocate the VGA region, which is
 nice.  The bigger problem is initializing secondary displays; every
 video card has a proprietary video BIOS initialization sequence that is
 run by main BIOS on startup.  You can either duplicate this sequence
 with C code, which is sometimes difficult due to lack of docs or variety
 of boards, or you can execute the video BIOS with an x86 emulator.

Note:  With linuxBIOS (and some other embedded linux setups) even a
primary display doesn't get initialized until you start linux, so if
you can properly initialize your display, please do it.

Eric



Re: Alpha compile problem solved by Andrea (pte_alloc)

2001-04-29 Thread Eric W. Biederman


Do you know if anyone has fixed the lazy vmalloc code?  I know that
as of early 2.4 it was broken on alpha.  At the time I noticed it I didn't
have time to pursue it, but before I forget to even put in a bug
report I thought I'd ask if you know anything about it.

Eric



Re: Alpha compile problem solved by Andrea (pte_alloc)

2001-04-29 Thread Eric W. Biederman

Andrea Arcangeli [EMAIL PROTECTED] writes:

 On Sun, Apr 29, 2001 at 05:27:10PM -0600, Eric W. Biederman wrote:
  
  Do you know if anyone has fixed the lazy vmalloc code?  I know that
  as of early 2.4 it was broken on alpha.  At the time I noticed it I didn't
  have time to pursue it, but before I forget to even put in a bug
  report I thought I'd ask if you know anything about it.
 
 On alpha it's racy if you set CONFIG_ALPHA_LARGE_VMALLOC y (so don't do
 that as you don't need it). As long as you use only 1 entry of the pgd
 for the whole vmalloc space (CONFIG_ALPHA_LARGE_VMALLOC n) alpha is
 safe.

Hmm. I was having problems reproducible with
CONFIG_ALPHA_LARGE_VMALLOC n.

Enabling the large vmalloc was my workaround, because the large
vmalloc went back to the pre-lazy allocation code.

I was getting repeatable problems inside of an mtd driver.  The
problem I had was entries failed to propagate across different tasks.
I think it was something like the first pgd was lazily allocated and
not propagated.   

I don't have SRM on my 264 alpha (for reference on which
code paths were followed).

 
 OTOH x86 is racy and there's no workaround available at the moment.

GH

Well racy is easier to work with than just plain non-functional. 

Eric




Re: ServerWorks LE and MTRR

2001-04-30 Thread Eric W. Biederman

Steffen Persvold [EMAIL PROTECTED] writes:

 [EMAIL PROTECTED] wrote:
  On Sun, 29 Apr 2001, Steffen Persvold wrote:
  
   I've learned it the hard way, I have two types : Compaq DL360 (rev 5) and a
   Tyan S2510 (rev 6). On the Compaq machine I constantly get data corruption
   on the last double word (4 bytes) in a 64 byte PCI burst when I use write
   combining on the CPU. On the Tyan however the transfer is always ok.
  
  
  Are you sure that is not due to board design differences?
 
 No I can't be 100% certain that the layout of the board isn't the reason since
 I haven't asked ServerWorks about this and it doesn't say anything in their
 docs (yes my company has the NDA, so I shouldn't get too much in detail here),
 but if this was the case it would be totally wrong to disable write combining
 on any LE chipset.
 
 The test case that I have been using to trigger this is sort of special because
 we are using SCI shared memory adapters to write (with PIO) into remote nodes
 memory, and the bandwidth tends to get quite high (approx 170 MByte/sec on LE
 with write combining). I've been able to run this case on 5 different
 motherboards using the LE and HE-SL ServerWorks chipsets, but only two of them
 are LE (the DL360 and the S2510). Everything works fine with write-combining on
 every motherboard except the DL360 (which has rev 5).
 
 One basic test case that I haven't tried, could be to enable write-combining on
 your PCI graphics adapter memory and see if the X display gets screwed up.
 
 I will try to get some information from ServerWorks about this problem, but I'm
 not sure if ServerWorks would be happy if I told you the answer (because of the
 NDA).

I'd like to put my small plug in that this makes me a little nervous.
It could also be a problem with the firmware (aka BIOS) setting
something up wrong.  Working with linuxBIOS I have seen burst writes
(enabled with write-combining or write-back) cause data corruption
when non-burst writes to memory don't cause problems, when the memory
controller is set up wrong.  (This was with Intel 440GX & 440BX
chipsets).

Eric



Re: How can do to disable the L1 cache in linux ?

2001-05-02 Thread Eric W. Biederman

Alex Huang [EMAIL PROTECTED] writes:

 Dear All,
  How can do to disable the L1 cache in linux ?
 Are there some commands or directives to disable it ??

Play with the MTRRs and disable caching on memory.

Stupid but it should get what you want.
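
For reference, a minimal sketch of what that could look like through /proc/mtrr, assuming the kernel was built with MTRR support. The request line format ("base=... size=... type=...") is the one described in Documentation/mtrr.txt; the helper names below are made up for illustration, the region is whatever the caller supplies, and the write needs root. The kernel may refuse a region that overlaps an existing MTRR of a different type. Note that this makes the region uncached at every cache level, not just L1.

```c
#include <stdio.h>

/* Build a /proc/mtrr request line; returns what snprintf returns. */
static int mtrr_format_request(char *buf, size_t len,
			       unsigned long long base,
			       unsigned long long size,
			       const char *type)
{
	return snprintf(buf, len, "base=0x%llx size=0x%llx type=%s\n",
			base, size, type);
}

/* Ask the kernel to mark [base, base + size) uncachable; 0 on success. */
static int mtrr_set_uncachable(unsigned long long base, unsigned long long size)
{
	char req[128];
	FILE *f;
	int n, ok;

	n = mtrr_format_request(req, sizeof(req), base, size, "uncachable");
	if (n < 0 || (size_t)n >= sizeof(req))
		return -1;
	f = fopen("/proc/mtrr", "w");
	if (!f)
		return -1;		/* no MTRR support, or not root */
	ok = (fputs(req, f) >= 0);
	if (fclose(f) != 0)		/* a rejected request can surface at flush time */
		ok = 0;
	return ok ? 0 : -1;
}
```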

Eric



Re: serial console problems with 2.4.4

2001-05-02 Thread Eric W. Biederman

Fabrice Gautier [EMAIL PROTECTED] writes:

 On Wed, 02 May 2001 11:54:11 +0200
 Reto Baettig [EMAIL PROTECTED] wrote:
 
  Hi
  
  I just installed 2.4.4 on our alpha SMP boxes (ES40) and now I have
  problems with the serial console:
 
 I got the same kind of problem when upgrading from 2.4.2 to 2.4.3 and using
 busybox as init/getty.
 
 The problem was a bug in busybox. The console initialisation code was
 not correct.
  
  sulogin does not accept input from the serial line
  mingetty does not accept input from the serial line
  agetty works fine
 
 So this is probably a sulogin/mingetty problem. They should set the
 CREAD flag in your tty c_cflag.
 
 The patch for busybox replaced the line
   tty.c_cflag |= HUPCL|CLOCAL
 by
   tty.c_cflag |= CREAD|HUPCL|CLOCAL
   
 Hope this helps.

This part is correct.  

However the kernel sets CREAD by default.  
sysvinit (and possibly other inits) clears CREAD.
I wish I knew where the breakage actually occurred.

And then sulogin/mingetty need to re-enable it.

It's not too big of a deal except the serial code doesn't accept SAKs
when CREAD is clear.
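
For illustration, a minimal sketch of what a getty-like program could do to make sure the receiver is enabled, using the standard POSIX termios calls. The function names enable_receiver and enable_receiver_on_fd are hypothetical; this is not the actual sulogin, mingetty or busybox code.

```c
#include <termios.h>

/*
 * Turn the receiver on in an already-filled termios structure.
 * CREAD enables input; HUPCL and CLOCAL mirror the busybox patch quoted above.
 */
void enable_receiver(struct termios *tty)
{
    tty->c_cflag |= CREAD | HUPCL | CLOCAL;
}

/*
 * Apply the change to an open terminal. Returns 0 on success, -1 on error
 * (for example when fd is not a tty).
 */
int enable_receiver_on_fd(int fd)
{
    struct termios tty;

    if (tcgetattr(fd, &tty) < 0)
        return -1;
    enable_receiver(&tty);
    return tcsetattr(fd, TCSANOW, &tty);
}
```

Reading the current settings first and OR-ing in the flags keeps whatever else init left in c_cflag, which matters if an earlier program cleared CREAD but set other bits deliberately.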

Eric



Re: serial console problems with 2.4.4

2001-05-03 Thread Eric W. Biederman

Fabrice Gautier [EMAIL PROTECTED] writes:

 On 02 May 2001 10:37:21 -0600
 [EMAIL PROTECTED] (Eric W. Biederman) wrote:
 
  Fabrice Gautier [EMAIL PROTECTED] writes:
   So this is probably a sulogin/mingetty problem. They should set the
   CREAD flag in your tty c_cflag.
   
   The patch for busybox replaced the line
     tty.c_cflag |= HUPCL|CLOCAL
   by
     tty.c_cflag |= CREAD|HUPCL|CLOCAL
 
   Hope this helps.
  
  This part is correct.  
  
  However the kernel sets CREAD by default.  
 
 Are you sure? Wasn't this the behaviour for 2.4.2 but changed in 2.4.3?

init=/bin/bash works fine over a serial console in 2.4.4.  So I am
certain.

I get the impression that something in 2.4.3 fixed CREAD handling, and we
started noticing the buggy user space.

  sysvinit (and possibly other inits) clears CREAD.
 
 In my case I was using busybox as init. So there is no sysvinit or any other
 init called before this line.

The busybox init is also clearing CREAD (as of 0.51 anyway).

  I wish I knew where the breakage actually occured.
 
 Just look at this diff on serial.c between 2.4.2 and 2.4.3:

If it was a real diff between 2.4.2 and 2.4.3 I would agree, however it looks
like your attempt to fix 2.4.3. 

Eric


 --- serial.c  Sat Apr 21 17:22:53 2001
 +++ ../../../linux-2.4.2/drivers/char/serial.c  Sat Feb 17 01:02:36 2001
 @@ -1764,8 +1765,8 @@
   /*
    * !!! ignore all characters if CREAD is not set
    */
 -//   if ((cflag & CREAD) == 0)
 -//       info->ignore_status_mask |= UART_LSR_DR;
 + if ((cflag & CREAD) == 0)
 +     info->ignore_status_mask |= UART_LSR_DR;
   save_flags(flags); cli();
   if (uart_config[info->state->type].flags & UART_STARTECH) {
       serial_outp(info, UART_LCR, 0xBF);
 




Re: Possible PCI subsystem bug in 2.4

2001-05-04 Thread Eric W. Biederman

Alan Cox [EMAIL PROTECTED] writes:

  I suspect it would be safe to round up to the next megabyte, possibly up
  to 64MB or so. But much more would make me nervous.
  Any suggestions? 
 
 I'd go for 1MByte simply because I've not seen an EBDA/NVRAM area that large
 stuck at the top of RAM. 1MB would fix the Dell. (It was only when I saw
 your email it suddenly clicked and I grabbed the bootup log)

There are a couple of options here.
1) Read the MTRRs.  Unless the BIOS is braindead it will set up that area as
   write-back.  At any rate we shouldn't ever try to allocate a PCI region
   that is write-back cached.

2) Read the memory locations from the northbridge.  It's not possible
   on every chipset (lack of documentation) but with the linuxBIOS
   project we have code for a couple of them, and we are working on more
   all of the time.
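
The check in option 1 could be sketched roughly as follows: given the write-back ranges (however they were obtained, from the MTRRs or from the northbridge), refuse any candidate PCI region that overlaps one of them. The struct wb_range type and the function name are hypothetical, not kernel interfaces, and the code assumes base + size does not wrap around.

```c
#include <stddef.h>

/* Hypothetical description of one cached range, e.g. taken from an MTRR. */
struct wb_range {
    unsigned long base;
    unsigned long size;
};

/*
 * Return 1 if [base, base + size) overlaps any write-back range, else 0.
 * Assumes none of the ranges wraps around the end of the address space.
 */
int overlaps_write_back(const struct wb_range *wb, size_t n,
                        unsigned long base, unsigned long size)
{
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned long wb_end = wb[i].base + wb[i].size;

        if (base < wb_end && wb[i].base < base + size)
            return 1;
    }
    return 0;
}
```

A PCI allocator following this idea would skip any candidate window for which the function returns 1, instead of guessing how far past the reported end of RAM it is safe to start.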

Eric






Re: smp_send_stop() and disable_local_APIC()

2001-05-04 Thread Eric W. Biederman

Matt D. Robinson [EMAIL PROTECTED] writes:

 It looks like around 2.3.30 or so, someone added the call
 disable_local_APIC() to smp_send_stop().  I'm not sure what the
 intention was, but I'm getting some strange behavior as a result
 based on some code I'm writing.
 
 Basically, I'm doing the following ...
 
 panic()
 {
 /* do whatever you want, notifier list, etc. */
 smp_send_stop();
 write_system_memory();
 /* then do whatever */
 }
 
 write_system_memory() does a write of all system memory pages to some
 block device.  It uses kiobufs as the way to get the pages to disk,
 doing brw_kiovec() on those pages (using either the IDE or SCSI
 driver to write the data).

IDE being less likely to hang than SCSI as it tends to use legacy ISA
interrupt lines.
 
 The weird behavior I see is that sometimes, smp_send_stop()
 being called causes the system to hang up (not every time). 

Doing event-driven I/O after disabling the interrupt controller...
hmm, I wonder why.

 If we don't call smp_send_stop() on those systems, everything works fine.
 This looks to be directly caused by the disabling of the APIC, which
 we may need to dump pages to local disk.  This only applies to some
 people's systems -- not everyone displays the same behavior.
 
 I'm sure it's good to disable the APIC, but there's no clean way to
 wait on disabling the APIC until after I'm done writing pages out.
 
 My questions are:
 
 1) Why was disable_local_APIC() added to stop_this_cpu()
and smp_send_stop()?  Completeness?
 
 2) Is there a better way around this to disable all the
other CPUs without disabling the APIC?
 

I don't know what a good way is.  Since this is a kernel panic, it
should only be something truly fatal.  Given that, reusing anything
that hasn't been designed to run in that situation is playing with
fire.

Eric



Re: Possible PCI subsystem bug in 2.4

2001-05-04 Thread Eric W. Biederman

Alan Cox [EMAIL PROTECTED] writes:

  There are a couple of options here.
  1) read the MTRRs unless the BIOS is braindead it will set up that area as
 write-back.  At any rate we shouldn't ever try to allocate a pci region
 that is write-back cached.
 
 'unless the BIOS is braindead'. Right. We only got into this problem because
 the BIOS _was_ braindead.

Well I did provide a suggestion so you don't have to second guess...
Usually it's actually easier to read the memory size from the northbridge
than to parse the E820 map.

However since it takes different kinds of braindamage to mess up the MTRRs
and the E820 memory map, it is worth a shot.  Personally I think MTRRs
are much easier to get right, because you don't need to take into
account what the BIOS is going to do, just where your RAM is.

As for braindead BIOSes in general, any comments on totally nuking
them?  

Seriously.  With the general attitude of distrusting BIOSes I have
been amazed at the number of things Linux expects the BIOS to get
right.  In practice Windows seems to trust the BIOS much less than
Linux does.

Eric



Re: Possible PCI subsystem bug in 2.4

2001-05-04 Thread Eric W. Biederman

Alan Cox [EMAIL PROTECTED] writes:

  Seriously.  With the general attitude of distrusting BIOS's I have
  been amazed at the number of things linux expects the BIOS to get
  right.  In practice windows seem to trust the BIOS much less than
  linux does.
 
 It becomes more and more obvious over time exactly why. One problem however
 is that windows gets away with this because many vendors ship random extra
 gunge for their box with the system. We don't yet have that power.

Right.  So we always need to keep heuristics in our toolbox to fall back on,
so we can run on boards with incomplete information.  However there are a lot
of things we can do that we aren't currently doing.

The example that sticks out in my head is that we rely on the MP table to
tell us if the local APIC is in pic_mode or in virtual wire mode,
when all we really have to do is ask it.

Eric




Re: smp_send_stop() and disable_local_APIC()

2001-05-05 Thread Eric W. Biederman

Matt D. Robinson [EMAIL PROTECTED] writes:

 It's an SMP (and only when your system crashes on a CPU other
 than 0) problem.  I did some more checking of this to verify the
 specifics of the behavior.  Thanks for the sarcasm, though. :)

O.k.  That makes perfect sense then.  See below.

 All I wanted was clarification as to why it was added in the first
 place, and whether there was a better way around the scenario.
 I think Ingo added the code, but I never heard back from him.
 Thanks for the response.

Welcome.  Linux attempts to properly shut down the APICs when we are
shutting down, and part of that is returning the APICs to the mode
they were in before we got control.  To do that you need to disable every
CPU but the bootstrap processor, and return the bootstrap processor to
either virtual wire mode or pic_mode.  So of course it will be the
only CPU getting interrupts because we are in legacy mode.

I would say it probably makes sense to add an additional call,
smp_send_panic_stop, that does exactly what you need instead of what is
needed on the normal shutdown path. 

Eric



Re: Break 2.4 VM in five easy steps

2001-06-06 Thread Eric W. Biederman

Jeffrey W. Baker [EMAIL PROTECTED] writes:

 On Tue, 5 Jun 2001, Derek Glidden wrote:
 
 
  After reading the messages to this list for the last couple of weeks and
  playing around on my machine, I'm convinced that the VM system in 2.4 is
  still severely broken.
 
  This isn't trying to test extreme low-memory pressure, just how the
  system handles recovering from going somewhat into swap, which is a real
  day-to-day problem for me, because I often run a couple of apps that
  most of the time live in RAM, but during heavy computation runs, can go
  a couple hundred megs into swap for a few minutes at a time.  Whenever
  that happens, my machine always starts acting up afterwards, so I
  started investigating and found some really strange stuff going on.
 
 I reboot each of my machines every week, to take them offline for
 intrusion detection.  I use 2.4 because I need advanced features of
 iptables that ipchains lacks.  Because the 2.4 VM is so broken, and
 because my machines are frequently deeply swapped, they can sometimes take
 over 30 minutes to shutdown.  They hang of course when the shutdown rc
 script turns off the swap.  The first few times this happened I assumed
 they were dead.

Interesting.  Is it constant disk I/O, or constant CPU utilization?
In any case you should be able to comment that line out of your shutdown
rc script and be in perfectly good shape.

Eric



Re: Break 2.4 VM in five easy steps

2001-06-06 Thread Eric W. Biederman

Andrew Morton [EMAIL PROTECTED] writes:

 Jeffrey W. Baker wrote:
  
  Because the 2.4 VM is so broken, and
  because my machines are frequently deeply swapped,
 
 The swapoff algorithms in 2.2 and 2.4 are basically identical.
 The problem *appears* worse in 2.4 because it uses lots
 more swap.

And 2.4 does delayed swap deallocation.  We don't appear to optimize
the case where a page is only used by the swap cache.  That should
be able to save some cpu overhead if nothing else.

And I do know that in the early 2.2 timeframe, swapoff was used
to generate an artificially high VM load, for testing the VM.  It looks
like that testing procedure has been abandoned :)

  they can sometimes take over 30 minutes to shutdown.
 
 Yes. The sys_swapoff() system call can take many minutes
 of CPU time.  It basically does:
 
   for (each page in swap device) {
   for (each process) {
   for (each page used by this process)
   stuff
 
 It's interesting that you've found a case where this
 actually has an operational impact.

Agreed.
 
 Haven't looked at it closely, but I think the algorithm
 could become something like:
 
   for (each process) {
   for (each page in this process) {
   if (page is on target swap device)
   get_it_off()
   }
   }
 
   for (each page in swap device) {
   if (it is busy)
   complain()
   }

You would need to handle the shared memory case as well.
But otherwise this looks sound.  I would suggest going
through page->address_space->i_mmap_shared to find all of the
potential mappings but the swapper address space is used by all
processes that have pages in swap.

 That's 10^4 to 10^6 times faster.

It looks like it could be.  The bottleneck should be disk I/O; if it
is not, we have a noticeably inefficient algorithm.
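
To put rough numbers on the two loop structures quoted above, here is a toy cost model. It is not kernel code: it only counts loop iterations in the worst case, and the parameter names and the example figures are assumptions chosen for illustration.

```c
/*
 * Toy cost model, not kernel code. All three parameters are illustrative:
 *   swap_pages     - pages in use on the swap device
 *   processes      - number of processes
 *   pages_per_proc - pages mapped by each process
 */

/* Existing shape: for each swap page, walk every page of every process. */
unsigned long long steps_nested(unsigned long long swap_pages,
                                unsigned long long processes,
                                unsigned long long pages_per_proc)
{
    return swap_pages * processes * pages_per_proc;
}

/* Proposed shape: one walk over the processes, then one over the swap device. */
unsigned long long steps_single_pass(unsigned long long swap_pages,
                                     unsigned long long processes,
                                     unsigned long long pages_per_proc)
{
    return processes * pages_per_proc + swap_pages;
}
```

With 10000 swap pages, 100 processes and 1000 pages per process, the nested form costs 1000000000 steps and the single-pass form 110000, a ratio of roughly 9000. That is in the neighbourhood of the lower end of the speedup quoted above, and the gap widens as the swap device fills.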

Eric



Re: Break 2.4 VM in five easy steps

2001-06-06 Thread Eric W. Biederman

Derek Glidden [EMAIL PROTECTED] writes:

 John Alvord wrote:
  
  On Wed, 06 Jun 2001 11:31:28 -0400, Derek Glidden
  [EMAIL PROTECTED] wrote:
  
  
  I'm beginning to be amazed at the Linux VM hackers' attitudes regarding
  this problem.  I expect this sort of behaviour from academics - ignoring
  real actual problems being reported by real actual people really and
  actually experiencing and reporting them because technically or
  theoretically they shouldn't be an issue or because the literature
  [documentation] says otherwise - but not from this group.
  
  There have been multiple comments that a fix for the problem is
  forthcoming. Is there some reason you have to keep talking about it?
 
 Because there have been many more comments that "The rule for 2.4 is
 'swap == 2*RAM' and that's the way it is" and "disk space is cheap -
 just add more" than there have been "this is going to be fixed", which is
 extremely discouraging and doesn't instill me with all sorts of
 confidence that this problem is being taken seriously.

The hard rule will always be that to cover all pathological cases swap
must be greater than RAM, because in the worst case all RAM will be
in the swap cache.  That this is more than just the worst case in 2.4
is problematic.  I.e. in the worst case: 
Virtual Memory = RAM + (swap - RAM).

You can't improve the worst case.  We can improve the worst case that
many people are facing.
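
A small worked example of the formula, as a sketch only. The worst-case function is the expression given above and assumes swap is at least as large as RAM; the best-case function is an assumption added for comparison, describing the situation where no page occupies both RAM and a swap slot. The function names are made up.

```c
/* Sizes in megabytes. Assumes swap_mb >= ram_mb, as the hard rule requires. */

/* Worst case from the message: every page in RAM also holds a swap slot. */
unsigned long vm_worst_case_mb(unsigned long ram_mb, unsigned long swap_mb)
{
    return ram_mb + (swap_mb - ram_mb);   /* simplifies to swap_mb */
}

/* Best case (an assumption for comparison): no page is in both RAM and swap. */
unsigned long vm_best_case_mb(unsigned long ram_mb, unsigned long swap_mb)
{
    return ram_mb + swap_mb;
}
```

For 256MB of RAM and 512MB of swap this gives 512MB in the worst case against 768MB in the best case, so the difference between the two is exactly the amount of RAM.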

 Or are you saying that if someone is unhappy with a particular
 situation, they should just keep their mouth shut and accept it?

It's worth complaining about.  It is also worth digging into and finding
out what the real problem is.  I have a hunch that this whole
conversation on swap sizes being irritating is hiding the real
problem.  

Eric



Re: Requirement: swap = RAM x 2.5 ??

2001-06-06 Thread Eric W. Biederman

Jeff Garzik [EMAIL PROTECTED] writes:

 I'm sorry but this is a regression, plain and simple.
 
 Previous versions of Linux have worked great on diskless workstations
 with NO swap.
 
 Swap is extra space to be used if we have it and nothing else.

Given the slow speed of disks, some additional rules apply if you want
to use them efficiently when you are using swap.

In the worst case when swapping is being used you get:
Virtual Memory = RAM + (swap - RAM).

That cannot be improved.  You can increase the likelihood that that case won't
come up, but that is a different matter entirely.  

I suspect in practice that we are suffering more from lazy reclamation
of swap pages than from a more aggressive swap cache. 

Eric




Re: Break 2.4 VM in five easy steps

2001-06-06 Thread Eric W. Biederman

Derek Glidden [EMAIL PROTECTED] writes:


 The problem I reported is not that 2.4 uses huge amounts of swap but
 that trying to recover that swap off of disk under 2.4 can leave the
 machine in an entirely unresponsive state, while 2.2 handles identical
 situations gracefully.  
 

The interesting thing from other reports is that it appears to be kswapd
using up CPU resources.  Not the swapout code at all.  So it appears
to be a fundamental VM issue.  And calling swapoff is just a good way
to trigger it. 

If you could confirm this by calling swapoff at some time other than
reboot, that might help.  Say while running top on the console.

Eric






Re: Break 2.4 VM in five easy steps

2001-06-06 Thread Eric W. Biederman

Mike Galbraith [EMAIL PROTECTED] writes:

 On 6 Jun 2001, Eric W. Biederman wrote:
 
  Derek Glidden [EMAIL PROTECTED] writes:
 
 
   The problem I reported is not that 2.4 uses huge amounts of swap but
   that trying to recover that swap off of disk under 2.4 can leave the
   machine in an entirely unresponsive state, while 2.2 handles identical
   situations gracefully.
  
 
  The interesting thing from other reports is that it appears to be kswapd
  using up CPU resources.  Not the swapout code at all.  So it appears
  to be a fundamental VM issue.  And calling swapoff is just a good way
  to trigger it.
 
  If you could confirm this by calling swapoff sometime other than at
  reboot time.  That might help.  Say by running top on the console.
 
 The thing goes comatose here too. SCHED_RR vmstat doesn't run, console
 switch is nogo...
 
 After running his memory hog, swapoff took 18 seconds.  I hacked a
 bleeder valve for dead swap pages, and it dropped to 4 seconds.. still
 utterly comatose for those 4 seconds though.

At the top of the while(1) loop in try_to_unuse, what happens if you put in:
if (need_resched) schedule(); 
It should be outside all of the locks.  It might just be a matter of everything
serializing on the SMP locks, and the kernel refusing to preempt itself.
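
As a userspace analogy only (the kernel tests a need_resched flag rather than counting iterations), the idea of giving up the CPU at the top of a long-running loop can be sketched with the POSIX sched_yield() call. The function name, the interval parameter and the callback are hypothetical.

```c
#include <sched.h>
#include <stddef.h>

/*
 * Process n items, giving up the CPU every `interval` items.
 * Returns how many times the loop yielded. `work` may be NULL.
 */
size_t process_with_yield(size_t n, size_t interval, void (*work)(size_t))
{
    size_t i, yields = 0;

    for (i = 0; i < n; i++) {
        /* Yield at the top of the loop, before any lock would be taken. */
        if (interval != 0 && i != 0 && i % interval == 0) {
            sched_yield();
            yields++;
        }
        if (work)
            work(i);
    }
    return yields;
}
```

The placement matters more than the mechanism: yielding while holding a lock would only move the stall onto whoever is waiting for that lock.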

Eric




Re: Break 2.4 VM in five easy steps

2001-06-07 Thread Eric W. Biederman

LA Walsh [EMAIL PROTECTED] writes:

 Eric W. Biederman wrote:
 
  The hard rule will always be that to cover all pathological cases swap
  must be greater than RAM.  Because in the worse case all RAM will be
  in thes swap cache.  That this is more than just the worse case in 2.4
  is problematic.  I.e. In the worst case:
  Virtual Memory = RAM + (swap - RAM).
 
 Hmmm... so my 512M laptop only really has 256M?  Um... I regularly run
 more than 256M of programs.  I don't want it to swap -- it's a special, weird
 condition if I do start swapping.  I don't want to waste 1G of HD (5%) for
 something I never want to use.  IRIX runs just fine with swap<RAM.  In
 IRIX, your Virtual Memory = RAM + swap.  Seems like the Linux kernel requires
 more swap than other old OS's (SunOS3 (virtual mem = min(mem,swap))).
 I *thought* I remember that restriction being lifted in SunOS4 when they
 upgraded the VM.  Even though I worked there for 6 years, that was
 6 years ago...

There are certain scenarios where you can't avoid virtual mem =
min(RAM,swap), which is what I was trying to say (bad formula).  What
happens is that pages get referenced evenly enough and quickly enough
that you simply cannot reuse the on-disk pages.  Basically in the
worst case all of RAM is pretty much in flight doing I/O.  This is
true of all paging systems.

However just because in the worst case virtual mem = min(RAM,swap) is
no reason other cases should use that much swap.  If you are doing a
lot of swapping it is more efficient to plan on mem = min(RAM,swap) as
well, because frequently you can save on I/O operations by simply
reusing the existing swap page.

 
  You can't improve the worst case.  We can improve the worst case that
  many people are facing.
 
 ---
 Other OS's don't have this pathological 'worst case' scenario.  Even
 my Windows [vm]box seems to operate fine with swap<MEM.  On IRIX,
 virtual space closely approximates physical + disk memory.

It's a theoretical worst case and they all have it.  In practice it is
very hard to find a workload where practically every page in the
system is close to the I/O point, however.

Except for removing pages that aren't used, paging with swap < RAM is
not useful.  Simply removing pages that aren't in active use but might
possibly be used someday is a common case, so it is worth supporting.

 
  It's worth complaining about.  It is also worth digging into and find
  out what the real problem is.  I have a hunch that this hole
  conversation on swap sizes being irritating is hiding the real
  problem.
 
 ---
 Okay, admission of ignorance.  When we speak of swap space,
 is this term inclusive of both demand paging space and
 swap-out-entire-programs space or one or another?

Linux has no method to swap out an entire program so when I speak of
swapping I'm actually thinking paging.

Eric



Re: Break 2.4 VM in five easy steps

2001-06-07 Thread Eric W. Biederman

[EMAIL PROTECTED] (Linus Torvalds) writes:
 
 Somebody interested in trying the above add? And looking for other more
 obvious bandaid fixes.  It won't fix swapoff per se, but it might make
 it bearable and bring it to the 2.2.x levels. 

A little bit.  The one really bad behavior of not letting any other
processes run seems to be fixed with an explicit:
if (need_resched) {
    schedule();
}

What I can't figure out is why this is necessary, because we should
be sleeping in alloc_pages if nowhere else.

I suppose if the bulk of our effort really is freeing dead swap cache
pages we can spin without sleeping, and never let another process run
because we are busily recycling dead swap cache pages. Does this sound
right? 

If this is going on I think we need to look at our delayed
deallocation policy a little more carefully.   I suspect we should
have code in kswapd actively removing these dead swap cache pages. 
After we get the latency improvements in exit these pages do
absolutely nothing for us except clog up the whole system, and
generally give the 2.4 VM a bad name.

Anyone care to check my analysis? 

 Is anybody interested in making swapoff() better? Please speak up..

Interested.   But finding the time...

Eric


