Ben Marzinski wrote:

>There is a reproducible memory mapping problem with the s390 SuSE linux
>setup we have. The bug occurs when two processes have private, read-only
>mappings of the same file and both processes page in the same page at
>the same time. The PTE for that page gets incorrectly marked dirty, which
>causes the page to be marked dirty, and the writepage() address space
>operation to be called. Nothing that the processes have done should have
>caused the page to be written back to the file. The file is modified even
>if the whole filesystem is mounted Read-Only.

There does indeed appear to be a race condition in the s390-specific
memory management backend that could explain the symptom you're seeing.

The problem is that the S/390 does not have a dirty bit that resides
in the PTE, as most platforms do, where it gets set whenever the
processor writes to the page via the mapping defined by this PTE.

Instead we have a dirty bit in the 'storage key'; there is one storage
key associated with every physical page. The dirty bit there is set by
the memory subsystem on every write to the page, no matter via which
PTE, and even on DMA writes by hardware devices.

This means that pages from memory-mapped files would usually start out
with the dirty bit set (because the page was read in from backing store,
and doing so would access the physical page and set the dirty bit).

So we have to reset the dirty bit, which we do when the page is first
mapped into a user address space. (Note we must only do so on the
*first* mapping to user space; subsequent mappings of a page must not
reset a dirty bit that might have been set in the meantime.)

Unfortunately, this logic would appear to exhibit a race condition in
the situation you're describing. When a process faults in a page from
a memory-mapped file, the following steps happen in sequence:

 - the page is looked up in the page cache; if not found, a new page
   is allocated and added to the page cache. In any case, at this
   point the page reference count is incremented.

 - if the page is not uptodate, a read-in from backing store is
   triggered and the process sleeps until the read has completed

 - finally, a page table entry referring to the page is placed into
   the process' page tables. At this point, our platform-specific
   hook is triggered; we check whether the page count is 1, and if
   so, reset the dirty bit in the storage key.

If the same file is mapped into multiple address spaces, this is broken.
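
To make the three steps and the count check concrete, here is a small
user-space model in C. It is only a sketch and not kernel code: the
struct and all function names are made up for illustration, and it
assumes the only references on the page are those taken by the
faulting processes (any reference held by the page cache itself is
left out). Each helper corresponds to one of the steps listed above.

```c
#include <stdbool.h>

/* Hypothetical model of one physical page; not a kernel structure. */
struct model_page {
    int  count;      /* page reference count */
    bool uptodate;   /* has the page been read in from backing store? */
    bool key_dirty;  /* dirty bit in the page's storage key */
};

/* Step 1: find the page in the page cache and take a reference. */
static void fault_lookup(struct model_page *pg)
{
    pg->count++;
}

/* Step 2: read-in from backing store. Storing the data into the
 * physical page sets the dirty bit in the storage key. */
static void fault_read_in(struct model_page *pg)
{
    if (!pg->uptodate) {
        pg->key_dirty = true;
        pg->uptodate = true;
    }
}

/* Step 3: install the PTE. The hook resets the storage key dirty bit
 * only if it believes this is the first mapping (count == 1). */
static void fault_map(struct model_page *pg)
{
    if (pg->count == 1)
        pg->key_dirty = false;
}

/* A single process faults the page in on its own.
 * Returns the storage key dirty bit after the fault. */
bool single_fault_leaves_dirty(void)
{
    struct model_page pg = { 0, false, false };

    fault_lookup(&pg);
    fault_read_in(&pg);
    fault_map(&pg);      /* count is 1, dirty bit is reset */
    return pg.key_dirty;
}

/* Processes A and B overlap: both take their reference before either
 * of them installs its PTE. Returns the final dirty bit. */
bool racing_faults_leave_dirty(void)
{
    struct model_page pg = { 0, false, false };

    fault_lookup(&pg);   /* A: count becomes 1 */
    fault_lookup(&pg);   /* B: count becomes 2 */
    fault_read_in(&pg);  /* A: read-in sets the dirty bit */
    fault_read_in(&pg);  /* B: page already uptodate, nothing to do */
    fault_map(&pg);      /* A: count is 2, no reset */
    fault_map(&pg);      /* B: count is 2, no reset */
    return pg.key_dirty;
}
```

In this model, single_fault_leaves_dirty() returns false, while
racing_faults_leave_dirty() returns true even though nothing has
written to the page since the read-in.
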
Consider two processes faulting in the page at (nearly) the same time;
both increment the page count, and neither of them will see a page
count of 1 when updating the page table, therefore the dirty bit will
not be reset.

In later kernels (e.g. SLES-8), this problem is less visible because
we completely ignore the storage key on private read-only mappings,
as those can never be dirty anyway. This was intended as a performance
improvement only (we save storage key operations, which can be
expensive), which is why we didn't backport the new logic. However,
it would have prevented your symptoms. (The race is still there, but
only on read-write mappings where it has less obvious effects; some
pages may be written back although there isn't really a need to.)

A proper fix of the race is probably not possible without changes to
common code; a good place to reset the storage key dirty bit might be
SetPageUptodate, but there's no arch-dependent hook there. We'll have
to think about this ...

>Has anyone else seen this? Does anyone know of any patches to deal with this?
>If anyone wants to see if they can reproduce this, I can send them a copy
>of the program that we wrote to do Step 4 from above. It's less than
>100 lines of C code.

I'd appreciate it if you could send me this program.

Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
  Dr. Ulrich Weigand
  Linux for S/390 Design & Development
  IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
  Phone: +49-7031/16-3727   ---   Email: [EMAIL PROTECTED]