On 10/30/25 19:31, Vlastimil Babka wrote:
> On 10/30/25 17:43, Lorenzo Stoakes wrote:
>> On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
>>> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
>>> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
>>> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
>>> > > > Currently, if a user needs to determine if guard regions are present
>>> > > > in a range, they have to scan all VMAs (or have knowledge of which
>>> > > > ones might have guard regions).
>>> > > >
>>> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
>>> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
>>> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
>>> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
>>> > > > operation at a virtual address level.
>>> > > >
>>> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps
>>> > > > level that guard regions exist in ranges.
>>> > > >
>>> > > > This patch remedies the situation by establishing a new VMA flag,
>>> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it
>>> > > > is uncertain because we cannot reasonably determine whether a
>>> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA,
>>> > > > and additionally VMAs may change across merge/split).
>>> > > >
>>> > > > We utilise 0x800 for this flag, which makes it available to 32-bit
>>> > > > architectures also; this flag was previously used by VM_DENYWRITE,
>>> > > > which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE")
>>> > > > and hasn't been reused yet.
>>> > > >
>>> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
>>> > > > lock (and also VMA write lock) whereas previously it did not, but this
>>> > > > seems a reasonable overhead.
>>> > >
>>> > > Do you though? Could it be possible to simply atomically set the flag
>>> > > with the read lock held? This would make it so we can't split the VMA
>>> > > (and tightly
>>> >
>>> > VMA flags are not accessed atomically, so no, I don't think we can do
>>> > that in any workable way.
>>> >
>>>
>>> FWIW I think you could work it as an atomic flag and treat those races as
>>> benign (this one, at least).
>>
>> It's not benign as we need to ensure that page tables are correctly
>> propagated on fork.
>
> Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> setting? I think the places that could race with us to cause RMW use the
> vma write lock, so that would be excluded. Fork AFAICS unfortunately
> doesn't (for the oldmm) and it probably wouldn't make sense to start doing
> it. Maybe we could think of something to deal with this special case...
During discussion with Pedro off-list I realized fork takes the mmap lock
for write on the old mm, so if we kept taking the mmap sem for read, then
the vma lock for read in addition (which should be cheap enough; also we'd
only need it in case VM_MAYBE_GUARD is not yet set), and set the flag
atomically, perhaps that would cover all non-benign races?
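Very roughly, the scheme I have in mind would look something like the sketch
below. To be clear this is hypothetical and untested: the function name is
invented, vma_start_read()/vma_end_read() signatures are assumed from current
mainline, and atomically setting a bit in vm_flags would need a dedicated
helper since the stock vm_flags_set() asserts the VMA *write* lock:

```c
/*
 * Hypothetical sketch only. Caller holds mmap_read_lock(mm). We then take
 * the per-VMA lock for read as well and set VM_MAYBE_GUARD with an atomic
 * RMW. Fork takes the mmap lock for write on the old mm, so it cannot run
 * concurrently with this path; anything else doing a non-atomic RMW of
 * vm_flags is assumed to hold the VMA write lock and is thus excluded.
 */
static int guard_install_set_flag(struct vm_area_struct *vma)
{
	/* Fast path: flag already set, no VMA lock needed. */
	if (vma->vm_flags & VM_MAYBE_GUARD)
		return 0;

	/* Assumed helper: per-VMA read lock, may fail vs. a writer. */
	if (!vma_start_read(vma))
		return -EAGAIN;

	/*
	 * Atomic bit set so concurrent lockless readers never observe a
	 * torn vm_flags word. vm_flags_set() cannot be used here as it
	 * asserts the VMA write lock, so a new helper would be required.
	 */
	set_bit(ilog2(VM_MAYBE_GUARD), &vma->__vm_flags);

	vma_end_read(vma);
	return 0;
}
```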