Ben Marzinski wrote:

>There is a reproducible memory mapping problem with the s390 SuSE linux
>setup we have.  The bug occurs when two processes have private, read-only
>mappings of the same file and both processes page in the same page at
>the same time. The PTE for that page gets incorrectly marked dirty, which
>causes the page to be marked dirty, and the writepage() address space
>operation to be called. Nothing that the processes have done should have
>caused the page to be written back to the file. The file is modified even
>if the whole filesystem is mounted Read-Only.

There does indeed appear to be a race condition in the s390-specific
memory management backend that could explain the symptom you're seeing.

The problem is that, unlike most platforms, the S/390 does not have a
dirty bit in the PTE that gets set whenever the processor writes to
the page via the mapping defined by this PTE.  Instead we have a
dirty bit in the 'storage key'; there is one storage key associated with
every physical page.  The dirty bit there is set by the memory subsystem
on every write to the page, no matter via which PTE, and even on DMA
writes by hardware devices.
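
To make the difference concrete, here is a toy model in plain C (not
kernel code; struct frame, struct mapping and both helpers are names
made up for illustration).  It only shows that the dirty state hangs
off the physical page rather than off any one mapping:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: one storage key per physical page frame. */
struct frame {
    bool key_dirty;            /* stands in for the storage-key dirty bit */
    unsigned char data[16];
};

/* A mapping refers to a frame but carries no dirty bit of its own. */
struct mapping {
    struct frame *frame;
};

/* CPU store through any mapping: the frame's key is marked dirty. */
static void store_via_mapping(struct mapping *m, size_t off, unsigned char v)
{
    assert(off < sizeof m->frame->data);
    m->frame->data[off] = v;
    m->frame->key_dirty = true;
}

/* A device (DMA) write bypasses all mappings, yet dirties the same key. */
static void dma_write(struct frame *f, size_t off, unsigned char v)
{
    assert(off < sizeof f->data);
    f->data[off] = v;
    f->key_dirty = true;
}
```

In this model a store through one mapping is visible as "dirty" via
every other mapping of the same frame, and so is the read-in I/O.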

This means that pages from memory-mapped files would usually start out
with the dirty bit set (because the page was read in from backing store,
and doing so would access the physical page and set the dirty bit).
So we have to reset the dirty bit, which we do when the page is first
mapped into a user address space.  (Note that we must do so only on the
*first* mapping to user space; subsequent mappings of a page must not
reset a dirty bit that might have been set in the meantime.)

Unfortunately, this logic would appear to exhibit a race condition
in the situation you're describing.  When a process faults in a page
from a memory-mapped file, the following steps happen in sequence:

- the page is looked up in the page cache; if not found, a new page is
  allocated and added to the page cache.  In any case, at this point
  the page reference count is incremented.

- if the page is not uptodate, a read-in from backing store is triggered
  and the process sleeps until the read has completed.

- finally, a page table entry referring to the page is placed into the
  process' page tables.  At this point, our platform-specific hook is
  triggered; we check whether the page count is 1, and if so, reset the
  dirty bit in the storage key.

If the same file is mapped into multiple address spaces, this check is
broken.  Consider two processes faulting in the page at (nearly) the
same time; both increment the page count, so neither of them will see
a page count of 1 when updating the page table, and therefore the dirty
bit will not be reset.
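
The interleaving can be replayed deterministically in a small
user-space model.  This is only a sketch under simplifying assumptions:
struct page and the fault_* helpers are hypothetical stand-ins for the
three steps listed above, not the real kernel functions, and the only
references counted are the ones taken by the faulting processes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the fault path; names are hypothetical, not kernel APIs. */
struct page {
    int  count;      /* page reference count */
    bool uptodate;   /* contents have been read from backing store */
    bool key_dirty;  /* dirty bit in the storage key */
};

/* Step 1: look up (or allocate) the page and take a reference. */
static void fault_lookup(struct page *p)
{
    p->count++;
}

/* Step 2: the read-in writes to the physical page, dirtying the key. */
static void fault_read_in(struct page *p)
{
    if (!p->uptodate) {
        p->key_dirty = true;
        p->uptodate = true;
    }
}

/* Step 3: install the PTE; reset the key only if we look like the first. */
static void fault_map(struct page *p)
{
    assert(p->count >= 1);
    if (p->count == 1)
        p->key_dirty = false;
}

/* One process faulting alone: the key ends up clean. */
static bool dirty_after_single_fault(void)
{
    struct page p = { 0, false, false };
    fault_lookup(&p);
    fault_read_in(&p);
    fault_map(&p);
    return p.key_dirty;
}

/* Processes A and B interleaved: both take a reference before mapping. */
static bool dirty_after_racing_faults(void)
{
    struct page p = { 0, false, false };
    fault_lookup(&p);    /* A: count becomes 1 */
    fault_lookup(&p);    /* B: count becomes 2 */
    fault_read_in(&p);   /* A triggers the read; key becomes dirty */
    fault_read_in(&p);   /* B finds the page already uptodate */
    fault_map(&p);       /* A sees count 2: no reset */
    fault_map(&p);       /* B sees count 2: no reset */
    return p.key_dirty;
}
```

In the model, the single fault leaves the key clean, while the
interleaved pair leaves it dirty although nobody wrote to the page.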

In later kernels (e.g. SLES-8), this problem is less visible because
we completely ignore the storage key on private read-only mappings as
those can never be dirty anyway.  This was intended as a performance
improvement only (we save storage key operations, which can be expensive),
which is why we didn't backport the new logic.  However, it would have
prevented your symptoms.  (The race is still there, but only on
read-write mappings, where it has less obvious effects; some pages may
be written back although there isn't really a need to.)
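
The idea behind that change can be written down as a simple predicate.
This is an illustrative sketch only; struct toy_mapping and
mapping_sees_dirty() are invented names and do not reflect how the
later kernels actually implement it:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a mapping; not a kernel structure. */
struct toy_mapping {
    bool shared;    /* shared rather than private */
    bool writable;  /* mapped with write permission */
};

/*
 * Should the storage-key dirty bit be consulted for this mapping?
 * A private read-only mapping can never have dirtied the page, so the
 * key is ignored and no storage key operation is needed.
 */
static bool mapping_sees_dirty(const struct toy_mapping *m, bool key_dirty)
{
    assert(m != 0);
    if (!m->shared && !m->writable)
        return false;
    return key_dirty;
}
```

With such a check, a stale dirty bit left behind by the race has no
effect on private read-only mappings, but is still seen by read-write
ones.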

A proper fix of the race is probably not possible without changes
to common code; a good place to reset the storage key dirty bit might
be SetPageUptodate, but there's no arch-dependent hook there.  We'll
have to think about this ...
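
As a thought experiment only, this is roughly what resetting the key
at the uptodate transition would look like in a toy model.  The hook
arch_page_uptodate() is purely hypothetical (as said, no such hook
exists), and the other names are illustrative stand-ins as well:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; names are hypothetical, not kernel APIs. */
struct page {
    int  count;      /* page reference count */
    bool uptodate;   /* contents have been read from backing store */
    bool key_dirty;  /* dirty bit in the storage key */
};

/* Hypothetical arch hook: clear the key once the read-in has finished. */
static void arch_page_uptodate(struct page *p)
{
    p->key_dirty = false;
}

/* Stand-in for the point where a page is marked uptodate. */
static void set_page_uptodate(struct page *p)
{
    arch_page_uptodate(p);
    p->uptodate = true;
}

/* The read-in dirties the key, then marks the page uptodate exactly once. */
static void read_in(struct page *p)
{
    if (p->uptodate)
        return;
    p->key_dirty = true;
    set_page_uptodate(p);
}

/* Two racing faults: the key is clean no matter what the count is. */
static bool dirty_after_racing_faults_with_hook(void)
{
    struct page p = { 0, false, false };
    p.count++;           /* process A takes a reference */
    p.count++;           /* process B takes a reference */
    read_in(&p);         /* A triggers the read */
    read_in(&p);         /* B finds the page already uptodate */
    assert(p.count == 2);
    return p.key_dirty;
}

/* A real store after the read-in must survive a later fault. */
static bool dirty_after_later_store(void)
{
    struct page p = { 0, false, false };
    p.count++;
    read_in(&p);
    p.key_dirty = true;  /* a genuine write to the page */
    p.count++;           /* second process faults the page in later */
    read_in(&p);         /* already uptodate: key is left alone */
    return p.key_dirty;
}
```

Because the reset is tied to the one-time uptodate transition instead
of the page count, the model neither leaves a stale dirty bit behind
nor wipes out a genuine one.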

>Has anyone else seen this? Does anyone know of any patches to deal with this?
>If anyone wants to see if they can reproduce this, I can send them a copy
>of the program that we wrote to do Step 4 from above. It's less than
>100 lines of C code.

I'd appreciate it if you could send me this program.


Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
  Dr. Ulrich Weigand
  Linux for S/390 Design & Development
  IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
  Phone: +49-7031/16-3727   ---   Email: [EMAIL PROTECTED]
