Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)

2014-06-16 Thread Andrea Arcangeli
Hello everyone,

On Mon, Jun 16, 2014 at 01:12:41PM -0700, John Stultz wrote:
> On Tue, Jun 3, 2014 at 7:57 AM, Johannes Weiner  wrote:
> > That, however, truly is a separate virtual memory feature.  Would it
> > be possible for you to take MADV_FREE and MADV_REVIVE as a base and
> > implement an madvise op that switches the no-page behavior of a VMA
> > from zero-filling to SIGBUS delivery?
> 
> I'll see if I can look into it if I get some time. However, I suspect
> its more likely I'll just have to admit defeat on this one and let
> someone else champion the effort. Interest and reviews have seemingly
> dropped again here and with other work ramping up, I'm not sure if
> I'll be able to justify further work on this. :(

About adding an madvise op that switches the no-page behavior from
zero-filling to SIGBUS delivery (right now only for anonymous vmas, but
we can evaluate extending it): I've mostly completed the
userfaultfd/madvise(MADV_USERFAULT) work according to the design I
described earlier. As we discussed earlier, that may fit the bill if
extended to tmpfs? The first preliminary tests just passed last week.

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/?h=userfault

If userfaultfd() isn't instantiated by the process, it only sends a
SIGBUS to the thread accessing the unmapped virtual address
(handle_mm_fault returns VM_FAULT_SIGBUS). The address of the fault
is then available in siginfo->si_addr.

You strictly need a memory externalization thread opening the
userfaultfd and speaking the userfaultfd protocol only if you need to
access the memory also through syscalls or drivers doing GUP
calls. This allows memory mapped in a secondary MMU for example to be
externalized without a single change to the secondary MMU code. The
userfault becomes invisible to
handle_mm_fault/gup()/gup_fast/FOLL_NOWAIT etc. The only
requirement is that the memory externalization thread never accesses
any memory in the MADV_USERFAULT marked regions (and if it does
because of a bug, the deadlock should be quite apparent by simply
checking the stack trace of the externalization thread blocked in
handle_userfault(), sigkill will then clear it up :). If you close the
userfaultfd the SIGBUS behavior will immediately return for the
MADV_USERFAULT marked regions and any hung task waiting to be woken
will get an immediate SIGBUS.
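
The SIGBUS-with-si_addr behavior described above can be sketched in
userspace with what already exists today. This is only an illustration
under stated assumptions: MADV_USERFAULT is not used here; instead the
SIGBUS is provoked by touching a shared mapping of an empty temporary
file (an access beyond end-of-file). The helper name
touch_unbacked_page() and the /tmp path are made up for the example.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;
static void *volatile fault_addr;

/* Record the faulting address from siginfo and jump back out. */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    fault_addr = info->si_addr;
    siglongjmp(fault_env, 1);
}

/*
 * Hypothetical helper: maps one page of an empty file and reads from it.
 * Returns 1 if SIGBUS was delivered with si_addr inside the mapping,
 * 0 if the access unexpectedly succeeded, -1 on setup failure or if
 * si_addr was outside the mapping.
 */
int touch_unbacked_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/sigbus-demo-XXXXXX";
    struct sigaction sa, old;
    char *p;
    int fd, result;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    /* The file has length 0, so no page of this mapping is backed. */
    p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    if (sigsetjmp(fault_env, 1) == 0) {
        volatile char c = p[0];   /* expected to fault */
        (void)c;
        result = 0;
    } else {
        char *a = (char *)fault_addr;
        result = (a >= p && a < p + page) ? 1 : -1;
    }

    sigaction(SIGBUS, &old, NULL);
    munmap(p, (size_t)page);
    return result;
}
```

A caller would simply check that touch_unbacked_page() returns 1. A
real userspace fault handler would populate the page and retry the
access instead of jumping away from it.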
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)

2014-06-16 Thread John Stultz
On Tue, Jun 3, 2014 at 7:57 AM, Johannes Weiner  wrote:
> On Thu, May 08, 2014 at 10:12:40AM -0700, John Stultz wrote:
>> On 04/29/2014 02:21 PM, John Stultz wrote:
>> > Another few weeks and another volatile ranges patchset...
>> >
>> > After getting the sense that the a major objection to the earlier
>> > patches was the introduction of a new syscall (and its somewhat
>> > strange dual length/purged-bit return values), I spent some time
>> > trying to rework the vma manipulations so we can be we won't fail
>> > mid-way through changing volatility (basically making it atomic).
>> > I think I have it working, and thus, there is no longer the
>> > need for a new syscall, and we can go back to using madvise()
>> > to set and unset pages as volatile.
>>
>> Johannes: To get some feedback, maybe I'll needle you directly here a
>> bit. :)
>>
>> Does moving this interface to madvise help reduce your objections?  I
>> feel like your cleaning-the-dirty-bit idea didn't work out, but I was
>> hoping that by reworking the vma manipulations to be atomic, we could
>> move to madvise and still avoid the new syscall that you seemed bothered
>> by. But I've not really heard much from you recently so I worry your
>> concerns on this were actually elsewhere, and I'm just churning the
>> patch needlessly.
>
> My objection was not the syscall.
>
> From a reclaim perspective, using the dirty state to denote whether a
> swap-backed page needs writeback before reclaim is quite natural and I
> much prefer Minchan's changes to the reclaim code over yours.
>
> From an interface point of view, I would prefer the simplicity of
> cleaning dirty bits to invalidate pages, and a default of zero-filling
> invalidated pages instead of sending SIGBUS.  This also is quite
> natural when you think of anon/shmem mappings as cache pages on top of
> /dev/zero (see mmap_zero() and shmem_zero_setup()).  And it translates
> well to tmpfs.
>
> At the same time, I acknowledge that there are usecases that want
> SIGBUS delivery for more than just convenience in order to implement
> userspace fault handling, and this is the only place where I see a
> real divergence in actual functionality from Minchan's code.

Thanks for the clarification and feedback. Sorry for my slow response,
as I was on vacation for a week and am just now catching up on this.

So again, SIGBUS for userspace fault handling is really more of a
side-effect of having more userspace-friendly semantics, and isn't
really the primary goal/usage model.

Zerofill semantics are mostly problematic because they make userspace
mistakes harder to find and diagnose. Android's ashmem actually uses
zerofill semantics, so while I see it as less ideal, technically
zerofill would work here.

However, combining zerofill with your preferred overloading of the
dirty state is particularly problematic because it makes any dirtying
of volatile data clear both the volatile state as well as the purged
state for the entire page. The volatile state is surprising, but less
problematic, but the clearing of the purged state means applications
would possibly get a partial zero page (for whatever wasn't written)
and no warning that their data was lost.  This is a very surprising
and unfriendly side-effect from a userspace perspective.

For context,  Android's ashmem preserves both the volatile and purged
state on volatile page dirtying (since the volatility and purged state
are kept in their own range structure independently from the VM).
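
To make the difference concrete, here is a toy userspace model of the
two bookkeeping schemes. It is a sketch only: none of this is kernel or
ashmem code, every name (toy_page, toy_range, a_*, b_*) is invented for
the illustration, and a "page" is just a 16-byte array. Model A
overloads the dirty state, so a write to a purged page erases any
record of the purge. Model B keeps volatile/purged state in a separate
range structure that writes never touch.

```c
#include <stdbool.h>
#include <string.h>

enum { TOY_PAGE_SIZE = 16 };

struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    bool present;   /* false once the page has been discarded */
    bool dirty;     /* stands in for the pte/page dirty state */
};

struct toy_range {  /* side structure, independent of the page state */
    bool is_volatile;
    bool purged;
};

static void toy_discard(struct toy_page *pg)
{
    memset(pg->data, 0, sizeof(pg->data));
    pg->present = false;
    pg->dirty = false;
}

/* A write faults in a zero-filled page if needed, then dirties it. */
void toy_write(struct toy_page *pg, int off, unsigned char v)
{
    if (!pg->present) {
        memset(pg->data, 0, sizeof(pg->data));
        pg->present = true;
    }
    pg->data[off] = v;
    pg->dirty = true;
}

/* Model A: "clean" means volatile; the only purge record is !present. */
void a_mark_volatile(struct toy_page *pg) { pg->dirty = false; }
void a_reclaim(struct toy_page *pg) { if (!pg->dirty) toy_discard(pg); }
bool a_looks_purged(const struct toy_page *pg) { return !pg->present; }

/* Model B: state lives in the range and survives writes to the page. */
void b_mark_volatile(struct toy_range *r)
{
    r->is_volatile = true;
    r->purged = false;
}

void b_reclaim(struct toy_range *r, struct toy_page *pg)
{
    if (r->is_volatile) {
        toy_discard(pg);
        r->purged = true;
    }
}

/* Unmarking reports whether data was lost while the range was volatile. */
bool b_unmark_volatile(struct toy_range *r)
{
    bool was_purged = r->purged;

    r->is_volatile = false;
    r->purged = false;
    return was_purged;
}
```

In model A, marking a page volatile, reclaiming it, and then writing a
single byte leaves a page that no longer looks purged even though every
other byte has silently become zero. In model B the same sequence still
reports the purge when the range is unmarked.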

> That, however, truly is a separate virtual memory feature.  Would it
> be possible for you to take MADV_FREE and MADV_REVIVE as a base and
> implement an madvise op that switches the no-page behavior of a VMA
> from zero-filling to SIGBUS delivery?

I'll see if I can look into it if I get some time. However, I suspect
it's more likely I'll just have to admit defeat on this one and let
someone else champion the effort. Interest and reviews have seemingly
dropped again here and with other work ramping up, I'm not sure if
I'll be able to justify further work on this. :(

thanks
-john


Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)

2014-06-03 Thread Johannes Weiner
On Thu, May 08, 2014 at 10:12:40AM -0700, John Stultz wrote:
> On 04/29/2014 02:21 PM, John Stultz wrote:
> > Another few weeks and another volatile ranges patchset...
> >
> > After getting the sense that the a major objection to the earlier
> > patches was the introduction of a new syscall (and its somewhat
> > strange dual length/purged-bit return values), I spent some time
> > trying to rework the vma manipulations so we can be we won't fail
> > mid-way through changing volatility (basically making it atomic).
> > I think I have it working, and thus, there is no longer the
> > need for a new syscall, and we can go back to using madvise()
> > to set and unset pages as volatile.
> 
> Johannes: To get some feedback, maybe I'll needle you directly here a
> bit. :)
> 
> Does moving this interface to madvise help reduce your objections?  I
> feel like your cleaning-the-dirty-bit idea didn't work out, but I was
> hoping that by reworking the vma manipulations to be atomic, we could
> move to madvise and still avoid the new syscall that you seemed bothered
> by. But I've not really heard much from you recently so I worry your
> concerns on this were actually elsewhere, and I'm just churning the
> patch needlessly.

My objection was not the syscall.

From a reclaim perspective, using the dirty state to denote whether a
swap-backed page needs writeback before reclaim is quite natural and I
much prefer Minchan's changes to the reclaim code over yours.

From an interface point of view, I would prefer the simplicity of
cleaning dirty bits to invalidate pages, and a default of zero-filling
invalidated pages instead of sending SIGBUS.  This also is quite
natural when you think of anon/shmem mappings as cache pages on top of
/dev/zero (see mmap_zero() and shmem_zero_setup()).  And it translates
well to tmpfs.
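
For readers who want to see zero-fill semantics in action, the sketch
below uses the existing MADV_DONTNEED advice rather than the proposed
MADV_FREE/MADV_REVIVE, which are not assumed to be available. On Linux,
a private anonymous page discarded with MADV_DONTNEED is zero-filled on
the next access, which is the same observable behavior as the default
discussed above. The function name discarded_page_reads_as_zero() is
made up for the example.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: fills one anonymous page with a pattern, discards
 * it, and checks what a later read sees.
 * Returns 1 if the page reads back as all zeroes, 0 if old data is still
 * visible, -1 on setup failure.
 */
int discarded_page_reads_as_zero(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    int all_zero = 1;
    long i;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, (size_t)page);          /* dirty the page */

    if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    /* No signal is raised here; the fault maps a fresh zero page. */
    for (i = 0; i < page; i++) {
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }

    munmap(p, (size_t)page);
    return all_zero;
}
```

Note that the caller gets no indication that the 0xAB pattern was lost,
which is the diagnosability concern raised elsewhere in this thread.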

At the same time, I acknowledge that there are usecases that want
SIGBUS delivery for more than just convenience in order to implement
userspace fault handling, and this is the only place where I see a
real divergence in actual functionality from Minchan's code.

That, however, truly is a separate virtual memory feature.  Would it
be possible for you to take MADV_FREE and MADV_REVIVE as a base and
implement an madvise op that switches the no-page behavior of a VMA
from zero-filling to SIGBUS delivery?


Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)

2014-05-08 Thread Minchan Kim
On Thu, May 08, 2014 at 10:04:49AM -0700, John Stultz wrote:
> On 05/07/2014 10:58 PM, Minchan Kim wrote:
> > On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
> >> Another few weeks and another volatile ranges patchset...
> >>
> >> After getting the sense that the a major objection to the earlier
> >> patches was the introduction of a new syscall (and its somewhat
> >> strange dual length/purged-bit return values), I spent some time
> >> trying to rework the vma manipulations so we can be we won't fail
> >> mid-way through changing volatility (basically making it atomic).
> >> I think I have it working, and thus, there is no longer the
> >> need for a new syscall, and we can go back to using madvise()
> >> to set and unset pages as volatile.
> > As I said reply as other patch's reply, I'm ok with this but I'd
> > like to make it clear to support zero-filled page as well as SIGBUS.
> > If we want to use madvise, maybe we need another advise flag like
> > MADV_VOLATILE_SIGBUS.
> 
> I still disagree that zero-fill is more obvious behavior. And again, I
> still support MADV_VOLATILE and MADV_FREE both being added, as they
> really do have different use cases that I'd rather not try to fit into
> one operation.

As I replied in my previous mail, MADV_FREE is a one-shot operation, so
pages faulted in afterwards are not affected and the caller has to call
the syscall again at some point to make the range volatile again.
MADV_FREE is also O(N), which vrange with zero-fill could avoid entirely.

> 
> 
> >>
> >> New changes are:
> >> 
> >> o Reworked vma manipulations to be be atomic
> >> o Converted back to using madvise() as syscall interface
> >> o Integrated fix from Minchan to avoid SIGBUS faulting race
> >> o Caught/fixed subtle use-after-free bug w/ vma merging
> >> o Lots of minor cleanups and comment improvements
> >>
> >>
> >> Still on the TODO list
> >> 
> >> o Sort out how best to do page accounting when the volatility
> >>   is tracked on a per-mm basis.
> > What's is your concern about page accouting?
> > Could you elaborate it more for everybody to understand your concern
> > clearly.
> 
> Basically the issue is that since we keep the volatility in the vma,
> when we mark a page as volatile, its only marking the page for that
> processes, not globally (since the page may be COWed). This makes
> keeping track of the number of actual pages that are volatile accurately
> somewhat difficult, since we can't just add one for each page marked and
> subtract one for each page unmarked (for tmpfs/shm file based
> volatility, where volatility is shared globally, this will be much easier ;)
> 
> It might not be too hard to keep a per-process-pages count of
> volatility, but in that case we could see some strange effects where it
> seems like there are 3x the number of actual volatile pages, and that
> might throw off some of the scanning. So its something I've deferred a
> bit to think about.

Okay. So, why do you want to account volatile pages?
Originally, what I expected was to age the anonymous LRU list until the
count reaches zero, so the aging overhead would be zero once there are no
volatile pages left in the system. The downside of that approach is that
it makes the vrange marking syscall cost O(N). That's why I suggested
counting volatile *vmas* instead of volatile *pages*. It could cause
unnecessary aging of the anon LRU list if there are no physical pages in
the vma, but I think it's a good deal because we move the overhead from
the hot path to the slow path, and that's one of the design goals of the
vrange syscall. We might make an effort to make such aging less
aggressive in the future, which would be another topic.

> 
> 
> 
> >> o Revisit anonymous page aging on swapless systems
> > One idea is that we can age forcefully on swapless system if system
> > has volatile vma or lazyfree pages. If the number of volatile vma or
> > lazyfree pages is zero, we can stop the aging automatically.
> 
> I'll look into this some more.
> 
> 
> >
> >> o Draft up re-adding tmpfs/shm file volatility support
> >>
> >   o One concern from minchan.
> >   I really like O(1) cost of unmarking syscall.
> >
> > Vrange syscall is for others, not itself. I mean if some process calls
> > vrange syscall, it would scacrifice his resource for others when
> > emergency happens so if the syscall is overhead rather expensive,
> > anybody doesn't want to use it.
> 
> So yes. I agree the cost is more expensive then I'd like. However, I'd
> like to get a consensus on the expected behavior established and get
> folks first agreeing to the semantics and the interface. Then we can
> follow up with optimizations.

Oops, I forgot to mention "We could do it with optimization in future".
I absolutely agree with you. I don't want to do that at this stage; I just
want to record one idea for optimizing it, so don't get me wrong. It's not
an objection.

> 
> > One idea is put increasing counter in mm_struct and assign the token
> > to volatile vma. 

Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)

2014-05-08 Thread John Stultz
On 04/29/2014 02:21 PM, John Stultz wrote:
> Another few weeks and another volatile ranges patchset...
>
> After getting the sense that the a major objection to the earlier
> patches was the introduction of a new syscall (and its somewhat
> strange dual length/purged-bit return values), I spent some time
> trying to rework the vma manipulations so we can be we won't fail
> mid-way through changing volatility (basically making it atomic).
> I think I have it working, and thus, there is no longer the
> need for a new syscall, and we can go back to using madvise()
> to set and unset pages as volatile.

Johannes: To get some feedback, maybe I'll needle you directly here a
bit. :)

Does moving this interface to madvise help reduce your objections?  I
feel like your cleaning-the-dirty-bit idea didn't work out, but I was
hoping that by reworking the vma manipulations to be atomic, we could
move to madvise and still avoid the new syscall that you seemed bothered
by. But I've not really heard much from you recently so I worry your
concerns on this were actually elsewhere, and I'm just churning the
patch needlessly.

thanks
-john





Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)

2014-05-08 Thread John Stultz
On 05/07/2014 10:58 PM, Minchan Kim wrote:
> On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
>> Another few weeks and another volatile ranges patchset...
>>
>> After getting the sense that the a major objection to the earlier
>> patches was the introduction of a new syscall (and its somewhat
>> strange dual length/purged-bit return values), I spent some time
>> trying to rework the vma manipulations so we can be we won't fail
>> mid-way through changing volatility (basically making it atomic).
>> I think I have it working, and thus, there is no longer the
>> need for a new syscall, and we can go back to using madvise()
>> to set and unset pages as volatile.
> As I said reply as other patch's reply, I'm ok with this but I'd
> like to make it clear to support zero-filled page as well as SIGBUS.
> If we want to use madvise, maybe we need another advise flag like
> MADV_VOLATILE_SIGBUS.

I still disagree that zero-fill is more obvious behavior. And again, I
still support MADV_VOLATILE and MADV_FREE both being added, as they
really do have different use cases that I'd rather not try to fit into
one operation.


>>
>> New changes are:
>> 
>> o Reworked vma manipulations to be be atomic
>> o Converted back to using madvise() as syscall interface
>> o Integrated fix from Minchan to avoid SIGBUS faulting race
>> o Caught/fixed subtle use-after-free bug w/ vma merging
>> o Lots of minor cleanups and comment improvements
>>
>>
>> Still on the TODO list
>> 
>> o Sort out how best to do page accounting when the volatility
>>   is tracked on a per-mm basis.
> What's is your concern about page accouting?
> Could you elaborate it more for everybody to understand your concern
> clearly.

Basically the issue is that since we keep the volatility in the vma,
when we mark a page as volatile, it's only marking the page for that
process, not globally (since the page may be COWed). This makes
keeping track of the number of actual pages that are volatile accurately
somewhat difficult, since we can't just add one for each page marked and
subtract one for each page unmarked (for tmpfs/shm file based
volatility, where volatility is shared globally, this will be much easier ;)

It might not be too hard to keep a per-process-pages count of
volatility, but in that case we could see some strange effects where it
seems like there are 3x the number of actual volatile pages, and that
might throw off some of the scanning. So it's something I've deferred a
bit to think about.



>> o Revisit anonymous page aging on swapless systems
> One idea is that we can age forcefully on swapless system if system
> has volatile vma or lazyfree pages. If the number of volatile vma or
> lazyfree pages is zero, we can stop the aging automatically.

I'll look into this some more.


>
>> o Draft up re-adding tmpfs/shm file volatility support
>>
>   o One concern from minchan.
>   I really like O(1) cost of unmarking syscall.
>
> Vrange syscall is for others, not itself. I mean if some process calls
> vrange syscall, it would scacrifice his resource for others when
> emergency happens so if the syscall is overhead rather expensive,
> anybody doesn't want to use it.

So yes. I agree the cost is more expensive than I'd like. However, I'd
like to get a consensus on the expected behavior established and get
folks first agreeing to the semantics and the interface. Then we can
follow up with optimizations.

> One idea is put increasing counter in mm_struct and assign the token
> to volatile vma. Maybe we can squeeze it into vma->vm_start's lower
> bits if we don't want to bloat vma size because we always hold mmap_sem
> with write-side lock when we handle vrange syscall.
> And we can use the token and purged mark together to pte when the purge
> happens. With this, we can bail out as soon as we found purged entry in
> unmarking syscall so remained ptes still have purged pte although
> unmarking syscall is done. But it's no problem because if the vma is
> marked as volatile again, the token will be change(ie, increased) and
> doesn't match with pte's token. When the page fault occur, we can compare
> the token to emit SIGBUS. If it doesn't match, we can ignore and just
> map new page to pte.
>
> One problem is overflow of counter. In the case, we can deliver false
> positive to user but it isn't severe, either because use have a preparation
> to handle SIGBUS if he want to use vrange syscall with SIGBUS model.

This sounds like an interesting optimization. But again, I worry that
adding these edge cases (which I honestly really don't see as
problematic) muddies the water and keeps reviewers away. I'd rather wait
until after we have something settled behavior wise, then start
discussing these performance optimizations that may cause
safe-but-false-positives.
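
For reference, the token idea quoted above can be modelled in a few
lines of userspace C. This is a toy sketch under stated assumptions,
not a proposed patch: the structures and function names are invented,
the token is a deliberately tiny 8-bit counter so the overflow case is
easy to reach, and the check that the vma is currently volatile is an
assumption of this sketch rather than something the proposal spells out.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_mm  { uint8_t next_token; };              /* per-mm counter */
struct toy_vma { bool is_volatile; uint8_t token; };
struct toy_pte { bool purged; uint8_t token; };

enum fault_result { FAULT_MAP_NEW_PAGE, FAULT_SIGBUS };

/* Each marking takes a fresh token from the mm-wide counter. */
void mark_volatile(struct toy_mm *mm, struct toy_vma *vma)
{
    vma->is_volatile = true;
    vma->token = ++mm->next_token;   /* wraps around after 255 */
}

/* Purging stamps the pte with the vma's current token. */
void purge(const struct toy_vma *vma, struct toy_pte *pte)
{
    pte->purged = true;
    pte->token = vma->token;
}

/* O(1) unmark: stale purged ptes are left behind untouched. */
void unmark_volatile(struct toy_vma *vma)
{
    vma->is_volatile = false;
}

/* Only a purged pte whose token matches the vma's raises SIGBUS. */
enum fault_result handle_fault(const struct toy_vma *vma,
                               struct toy_pte *pte)
{
    if (pte->purged && vma->is_volatile && pte->token == vma->token)
        return FAULT_SIGBUS;

    pte->purged = false;             /* stale entry: map a new page */
    return FAULT_MAP_NEW_PAGE;
}
```

With this model, a pte purged under one marking is ignored after the
vma is unmarked and marked again, because the tokens differ. After
exactly 256 further markings the 8-bit counter wraps, the stale token
matches again, and the fault reports a false-positive SIGBUS, which is
the overflow case mentioned in the quoted text.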


Thanks so much for your review and guidance here (I was worried I had
lost everyone's attention again). I really appreciate the 

Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)

2014-05-08 Thread John Stultz
On 04/29/2014 02:21 PM, John Stultz wrote:
> Another few weeks and another volatile ranges patchset...
>
> After getting the sense that a major objection to the earlier
> patches was the introduction of a new syscall (and its somewhat
> strange dual length/purged-bit return values), I spent some time
> trying to rework the vma manipulations so we can be sure we won't fail
> mid-way through changing volatility (basically making it atomic).
> I think I have it working, and thus, there is no longer the
> need for a new syscall, and we can go back to using madvise()
> to set and unset pages as volatile.

Johannes: To get some feedback, maybe I'll needle you directly here a
bit. :)

Does moving this interface to madvise help reduce your objections?  I
feel like your cleaning-the-dirty-bit idea didn't work out, but I was
hoping that by reworking the vma manipulations to be atomic, we could
move to madvise and still avoid the new syscall that you seemed bothered
by. But I've not really heard much from you recently so I worry your
concerns on this were actually elsewhere, and I'm just churning the
patch needlessly.

thanks
-john



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)

2014-05-08 Thread Minchan Kim
On Thu, May 08, 2014 at 10:04:49AM -0700, John Stultz wrote:
> On 05/07/2014 10:58 PM, Minchan Kim wrote:
> > On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
> >> Another few weeks and another volatile ranges patchset...
> >>
> >> After getting the sense that a major objection to the earlier
> >> patches was the introduction of a new syscall (and its somewhat
> >> strange dual length/purged-bit return values), I spent some time
> >> trying to rework the vma manipulations so we can be sure we won't fail
> >> mid-way through changing volatility (basically making it atomic).
> >> I think I have it working, and thus, there is no longer the
> >> need for a new syscall, and we can go back to using madvise()
> >> to set and unset pages as volatile.
> > As I said in my reply to the other patch, I'm ok with this, but I'd
> > like to make it clear that we should support zero-filled pages as well
> > as SIGBUS. If we want to use madvise, maybe we need another advice flag
> > like MADV_VOLATILE_SIGBUS.
>
> I still disagree that zero-fill is more obvious behavior. And again, I
> still support MADV_VOLATILE and MADV_FREE both being added, as they
> really do have different use cases that I'd rather not try to fit into
> one operation.

As I replied in a previous mail, MADV_FREE is a one-shot operation, so pages
faulted in afterwards are not affected; the caller has to call the syscall
again at some point to make the range volatile again. MADV_FREE is also O(N),
so vrange with zero-fill could avoid that cost entirely.

 
 
 
> >> New changes are:
> >>
> >> o Reworked vma manipulations to be atomic
> >> o Converted back to using madvise() as syscall interface
> >> o Integrated fix from Minchan to avoid SIGBUS faulting race
> >> o Caught/fixed subtle use-after-free bug w/ vma merging
> >> o Lots of minor cleanups and comment improvements
> >>
> >>
> >> Still on the TODO list
> >>
> >> o Sort out how best to do page accounting when the volatility
> >>   is tracked on a per-mm basis.
> > What is your concern about page accounting?
> > Could you elaborate on it so that everybody can understand your concern
> > clearly?
>
> Basically the issue is that since we keep the volatility in the vma,
> when we mark a page as volatile, it's only marking the page for that
> process, not globally (since the page may be COWed). This makes
> keeping track of the number of actual pages that are volatile accurately
> somewhat difficult, since we can't just add one for each page marked and
> subtract one for each page unmarked (for tmpfs/shm file based
> volatility, where volatility is shared globally, this will be much easier ;)
>
> It might not be too hard to keep a per-process-pages count of
> volatility, but in that case we could see some strange effects where it
> seems like there are 3x the number of actual volatile pages, and that
> might throw off some of the scanning. So it's something I've deferred a
> bit to think about.

Okay. So, why do you want to account volatile pages?
Originally, what I expected was to age the anonymous LRU list until that
count drops to zero, so the aging overhead would be zero once there are no
volatile pages left in the system. The downside of that approach is that it
makes the vrange marking syscall cost O(N). That's why I suggested counting
volatile *vmas* instead of volatile *pages*. It could cause unnecessary aging
of the anon LRU list if there are no physical pages in the vma, but I think
it's a good deal because we move overhead from the hot path to the slow path,
and that's one of the design goals of the vrange syscall. We might make an
effort to make such aging less aggressive in the future, but that would be
another topic.
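
To make the accounting trade-off above concrete, here is a toy user-space model in Python. It is only an illustration, not kernel code: the function names, the `VolatileVmaCounter` class and the frame numbers are all hypothetical. It shows that summing per-process marks counts a frame shared after fork() once per process, while a per-vma counter is an O(1) update that only answers whether anon LRU aging is needed at all.

```python
from typing import Dict, Set


def per_process_page_count(marked: Dict[str, Set[int]]) -> int:
    """Sum of the pages each process marked volatile (shared frames count repeatedly)."""
    return sum(len(frames) for frames in marked.values())


def distinct_page_count(marked: Dict[str, Set[int]]) -> int:
    """Number of distinct physical frames that are volatile in at least one process."""
    distinct: Set[int] = set()
    for frames in marked.values():
        distinct |= frames
    return len(distinct)


class VolatileVmaCounter:
    """Counts volatile vmas rather than pages, so marking and unmarking are O(1)."""

    def __init__(self) -> None:
        self.nr_volatile_vmas = 0

    def mark(self) -> None:
        self.nr_volatile_vmas += 1

    def unmark(self) -> None:
        self.nr_volatile_vmas -= 1

    def should_age_anon_lru(self) -> bool:
        # Aging may be wasted work if the volatile vmas hold no physical pages,
        # but it stops entirely once the count reaches zero.
        return self.nr_volatile_vmas > 0


# Hypothetical scenario: three processes still share the same four frames
# after fork(), and each one marks them volatile in its own vma.
shared = {
    "parent": {10, 11, 12, 13},
    "child1": {10, 11, 12, 13},
    "child2": {10, 11, 12, 13},
}
```

In this scenario the per-process sum reports three times as many volatile pages as there are distinct frames, which is the "3x" effect mentioned in the quoted text.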

 
 
 
> >> o Revisit anonymous page aging on swapless systems
> > One idea is that we can age forcefully on a swapless system if the system
> > has volatile vmas or lazyfree pages. If the number of volatile vmas or
> > lazyfree pages is zero, we can stop the aging automatically.
>
> I'll look into this some more.
>
> >> o Draft up re-adding tmpfs/shm file volatility support
> >>
> >   o One concern from minchan.
> >   I'd really like an O(1) cost for the unmarking syscall.
> >
> > The vrange syscall is for others, not for the caller itself. I mean, if some
> > process calls the vrange syscall, it sacrifices its own resources for others
> > when an emergency happens, so if the syscall's overhead is rather expensive,
> > nobody will want to use it.
>
> So yes. I agree the cost is more expensive than I'd like. However, I'd
> like to get a consensus on the expected behavior established and get
> folks first agreeing to the semantics and the interface. Then we can
> follow up with optimizations.

Oops, I forgot to mention that we could do it as an optimization in the future.
I absolutely agree with you. I don't want to do that at this stage; I just
want to record one idea for optimizing it, so don't get me wrong. It's not
an objection.

 
> > One idea is to put an increasing counter in mm_struct and assign the token
> > to the volatile vma. Maybe we can squeeze it into vma->vm_start's lower
> > bits if we don't want to bloat vma size because we always hold mmap_sem
> > with write-side lock when we handle

Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)

2014-05-07 Thread Minchan Kim
On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
> Another few weeks and another volatile ranges patchset...
> 
> After getting the sense that a major objection to the earlier
> patches was the introduction of a new syscall (and its somewhat
> strange dual length/purged-bit return values), I spent some time
> trying to rework the vma manipulations so we can be sure we won't fail
> mid-way through changing volatility (basically making it atomic).
> I think I have it working, and thus, there is no longer the
> need for a new syscall, and we can go back to using madvise()
> to set and unset pages as volatile.

As I said in my reply to the other patch, I'm ok with this, but I'd
like to make it clear that we should support zero-filled pages as well
as SIGBUS. If we want to use madvise, maybe we need another advice flag
like MADV_VOLATILE_SIGBUS.
> 
> 
> New changes are:
> 
> o Reworked vma manipulations to be atomic
> o Converted back to using madvise() as syscall interface
> o Integrated fix from Minchan to avoid SIGBUS faulting race
> o Caught/fixed subtle use-after-free bug w/ vma merging
> o Lots of minor cleanups and comment improvements
> 
> 
> Still on the TODO list
> 
> o Sort out how best to do page accounting when the volatility
>   is tracked on a per-mm basis.

What is your concern about page accounting?
Could you elaborate on it so that everybody can understand your concern
clearly?

> o Revisit anonymous page aging on swapless systems

One idea is that we can age forcefully on a swapless system if the system
has volatile vmas or lazyfree pages. If the number of volatile vmas or
lazyfree pages is zero, we can stop the aging automatically.

> o Draft up re-adding tmpfs/shm file volatility support
> 
  o One concern from minchan.
  I'd really like an O(1) cost for the unmarking syscall.

The vrange syscall is for others, not for the caller itself. I mean, if some
process calls the vrange syscall, it sacrifices its own resources for others
when an emergency happens, so if the syscall's overhead is rather expensive,
nobody will want to use it.

One idea is to put an increasing counter in mm_struct and assign its value
as a token to each volatile vma. Maybe we can squeeze it into the lower bits
of vma->vm_start if we don't want to bloat the vma, because we always hold
mmap_sem with the write-side lock when we handle the vrange syscall.
We can then store the token together with the purged mark in the pte when
the purge happens. With this, we can bail out as soon as we find a purged
entry in the unmarking syscall, so the remaining ptes still carry the purged
mark after the unmarking syscall is done. But that's no problem, because if
the vma is marked as volatile again, the token will change (i.e., increase)
and won't match the pte's token. When a page fault occurs, we can compare
the tokens to decide whether to emit SIGBUS. If they don't match, we can
ignore the mark and just map a new page to the pte.

One problem is overflow of the counter. In that case, we can deliver a false
positive to the user, but it isn't severe either, because the user must be
prepared to handle SIGBUS anyway if he wants to use the vrange syscall with
the SIGBUS model.
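
The token scheme above can be sketched as a small user-space model. This is only an illustration in Python, not kernel code. The `Vma` and `Mm` classes, the 12-bit token width (picked because vm_start is page aligned, so with 4K pages its low 12 bits are free), and the rule that unmarking only looks for purged entries carrying the current token are all assumptions made for the sketch, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable

TOKEN_BITS = 12                      # assumption: token lives in the low bits of vm_start
TOKEN_MASK = (1 << TOKEN_BITS) - 1


@dataclass
class Vma:
    volatile: bool = False
    token: int = 0


@dataclass
class Mm:
    counter: int = 0                                       # per-mm increasing counter
    purged: Dict[int, int] = field(default_factory=dict)   # page -> token stored in purged pte

    def mark_volatile(self, vma: Vma) -> None:
        # Every marking hands out a fresh token; the counter wraps at TOKEN_BITS.
        self.counter = (self.counter + 1) & TOKEN_MASK
        vma.volatile = True
        vma.token = self.counter

    def purge(self, vma: Vma, page: int) -> None:
        # Reclaim stamps the pte with the purged mark plus the vma's current token.
        if vma.volatile:
            self.purged[page] = vma.token

    def mark_nonvolatile(self, vma: Vma, pages: Iterable[int]) -> bool:
        # any() stops at the first purged entry; stale purged marks stay behind.
        was_purged = any(self.purged.get(p) == vma.token for p in pages)
        vma.volatile = False
        return was_purged

    def fault(self, vma: Vma, page: int) -> str:
        stamp = self.purged.get(page)
        if stamp is not None and vma.volatile and stamp == vma.token:
            return "SIGBUS"
        self.purged.pop(page, None)   # stale or missing mark: just map a new page
        return "map_new_page"
```

With this model, a purged mark left over from an earlier marking is ignored after the vma is marked volatile again, and the false positive described above only shows up once the counter has wrapped all the way around to the stale token.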

> 
> Many thanks again to Minchan, Kosaki-san, Johannes, Jan, Rik,
> Hugh, and others for the great feedback and discussion at
> LSF-MM.
> 
> thanks
> -john
> 
> 
> Volatile ranges provides a method for userland to inform the kernel
> that a range of memory is safe to discard (ie: can be regenerated)
> but userspace may want to try to access it in the future.  It can be
> thought of as similar to MADV_DONTNEED, but that the actual freeing
> of the memory is delayed and only done under memory pressure, and the
> user can try to cancel the action and be able to quickly access any
> unpurged pages. The idea originated from Android's ashmem, but I've
> since learned that other OSes provide similar functionality.
> 
> This functionality allows for a number of interesting uses. One such
> example is: Userland caches that have kernel triggered eviction under
> memory pressure. This allows for the kernel to "rightsize" userspace
> caches for current system-wide workload. Things like image bitmap
> caches, or rendered HTML in a hidden browser tab, where the data is
> not visible and can be regenerated if needed, are good examples.
> 
> Both Chrome and Firefox already make use of volatile range-like
> functionality via the ashmem interface:
> https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34
> 
> https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc
> 
> 
> The basic usage of volatile ranges is as so:
> 1) Userland marks a range of memory that can be regenerated if
> necessary as volatile
> 2) Before accessing the memory again, userland marks the memory as
> nonvolatile, and the kernel will provide notification if any pages in
> the range have been purged.
> 
> If userland accesses memory while it is volatile, it will either
> get the value stored at that memory if there has 

[PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)

2014-04-29 Thread John Stultz
Another few weeks and another volatile ranges patchset...

After getting the sense that a major objection to the earlier
patches was the introduction of a new syscall (and its somewhat
strange dual length/purged-bit return values), I spent some time
trying to rework the vma manipulations so we can be sure we won't fail
mid-way through changing volatility (basically making it atomic).
I think I have it working, and thus, there is no longer the
need for a new syscall, and we can go back to using madvise()
to set and unset pages as volatile.


New changes are:

o Reworked vma manipulations to be atomic
o Converted back to using madvise() as syscall interface
o Integrated fix from Minchan to avoid SIGBUS faulting race
o Caught/fixed subtle use-after-free bug w/ vma merging
o Lots of minor cleanups and comment improvements


Still on the TODO list

o Sort out how best to do page accounting when the volatility
  is tracked on a per-mm basis.
o Revisit anonymous page aging on swapless systems
o Draft up re-adding tmpfs/shm file volatility support


Many thanks again to Minchan, Kosaki-san, Johannes, Jan, Rik,
Hugh, and others for the great feedback and discussion at
LSF-MM.

thanks
-john


Volatile ranges provides a method for userland to inform the kernel
that a range of memory is safe to discard (ie: can be regenerated)
but userspace may want to try to access it in the future.  It can be
thought of as similar to MADV_DONTNEED, but that the actual freeing
of the memory is delayed and only done under memory pressure, and the
user can try to cancel the action and be able to quickly access any
unpurged pages. The idea originated from Android's ashmem, but I've
since learned that other OSes provide similar functionality.

This functionality allows for a number of interesting uses. One such
example is: Userland caches that have kernel triggered eviction under
memory pressure. This allows for the kernel to "rightsize" userspace
caches for current system-wide workload. Things like image bitmap
caches, or rendered HTML in a hidden browser tab, where the data is
not visible and can be regenerated if needed, are good examples.

Both Chrome and Firefox already make use of volatile range-like
functionality via the ashmem interface:
https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34

https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc


The basic usage of volatile ranges is as so:
1) Userland marks a range of memory that can be regenerated if
necessary as volatile
2) Before accessing the memory again, userland marks the memory as
nonvolatile, and the kernel will provide notification if any pages in
the range have been purged.

If userland accesses memory while it is volatile, it will either
get the value stored at that memory if there has been no memory
pressure or the application will get a SIGBUS if the page has been
purged.

Reads or writes to the memory do not affect the volatility state of the
pages.
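
The usage rules above can be summarized in a toy model. This is a user-space sketch in Python, not the madvise() interface itself. The `VolatileRange` class, its method names, the `PurgedPageAccess` exception standing in for SIGBUS, and the choice to hand back an empty page after unmarking are all assumptions made for illustration.

```python
from typing import List, Optional


class PurgedPageAccess(Exception):
    """Stands in for the SIGBUS delivered when a purged page is touched."""


class VolatileRange:
    def __init__(self, npages: int) -> None:
        # None means the page was purged by the (simulated) kernel.
        self.pages: List[Optional[bytes]] = [b"cached"] * npages
        self.volatile = False

    def mark_volatile(self) -> None:
        self.volatile = True

    def mark_nonvolatile(self) -> bool:
        """Return True if any page was purged while the range was volatile."""
        purged = any(p is None for p in self.pages)
        # Assumption: purged pages come back as fresh, empty pages that
        # userland is expected to regenerate.
        self.pages = [b"" if p is None else p for p in self.pages]
        self.volatile = False
        return purged

    def memory_pressure(self, page: int) -> None:
        # Pages may be purged only while the range is volatile.
        if self.volatile:
            self.pages[page] = None

    def read(self, page: int) -> bytes:
        # Reading never changes the volatility state of the range.
        data = self.pages[page]
        if data is None:
            raise PurgedPageAccess(page)
        return data
```

A cache built on such a range would mark it volatile when idle, mark it nonvolatile before reuse, and regenerate its contents whenever the unmarking call reports a purge.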

You can read more about the history of volatile ranges here (~reverse
chronological order):
https://lwn.net/Articles/592042/
https://lwn.net/Articles/590991/
http://permalink.gmane.org/gmane.linux.kernel.mm/98848
http://permalink.gmane.org/gmane.linux.kernel.mm/98676
https://lwn.net/Articles/522135/
https://lwn.net/Kernel/Index/#Volatile_ranges


Continuing from the last few releases, this revision is reduced in
scope when compared to earlier attempts. I've only focused on handling
volatility on anonymous memory, and we're storing the volatility in
the VMA.  This may have performance implications compared with the
earlier approach, but it does simplify the approach. I'm open to
expanding functionality via flags arguments, but for now I'm wanting
to keep focus on what the right default behavior should be and keep
the use cases restricted to help get reviewer interest.

Additionally, since we don't handle volatility on tmpfs files with this
version of the patch, it is not able to be used to implement semantics
similar to Android's ashmem. But since shared volatility on files is
more complex, my hope is to start small and hopefully grow from there.

Again, much of the logic in this patchset is based on Minchan's earlier
efforts, so I do want to make sure the credit goes to him for his major
contribution!

Cc: Andrew Morton 
Cc: Android Kernel Team 
Cc: Johannes Weiner 
Cc: Robert Love 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Rik van Riel 
Cc: Dmitry Adamushko 
Cc: Neil Brown 
Cc: Andrea Arcangeli 
Cc: Mike Hommey 
Cc: Taras Glek 
Cc: Jan Kara 
Cc: KOSAKI Motohiro 
Cc: Michel Lespinasse 
Cc: Minchan Kim 
Cc: Keith Packard 
Cc: linux...@kvack.org 

John Stultz (4):
  swap: Cleanup how special swap file numbers are defined
  MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking
    vmas
  MADV_VOLATILE: Add purged page 
