Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
Hello everyone,

On Mon, Jun 16, 2014 at 01:12:41PM -0700, John Stultz wrote:
> On Tue, Jun 3, 2014 at 7:57 AM, Johannes Weiner wrote:
> > That, however, truly is a separate virtual memory feature.  Would it
> > be possible for you to take MADV_FREE and MADV_REVIVE as a base and
> > implement an madvise op that switches the no-page behavior of a VMA
> > from zero-filling to SIGBUS delivery?
>
> I'll see if I can look into it if I get some time. However, I suspect
> it's more likely I'll just have to admit defeat on this one and let
> someone else champion the effort. Interest and reviews have seemingly
> dropped again here and with other work ramping up, I'm not sure if
> I'll be able to justify further work on this. :(

About adding an madvise op that switches the no-page behavior from
zero-filling to SIGBUS delivery (right now only for anonymous vmas, but
we can evaluate extending it): I've mostly completed the
userfaultfd/madvise(MADV_USERFAULT) work according to the design I
described earlier. As we discussed earlier, that may fit the bill if
extended to tmpfs. The first preliminary tests just passed last week.

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/?h=userfault

If userfaultfd() isn't instantiated by the process, it only sends a
SIGBUS to the thread accessing the unmapped virtual address
(handle_mm_fault() returns VM_FAULT_SIGBUS). The address of the fault
is then available in siginfo->si_addr.

You strictly need a memory externalization thread opening the
userfaultfd and speaking the userfaultfd protocol only if you also need
to access the memory through syscalls or drivers doing GUP calls. This
allows memory mapped in a secondary MMU, for example, to be externalized
without a single change to the secondary MMU code. The userfault becomes
invisible to handle_mm_fault()/gup()/gup_fast()/FOLL_NOWAIT etc.

The only requirement is that the memory externalization thread never
accesses any memory in the MADV_USERFAULT marked regions (and if it does
because of a bug, the deadlock should be quite apparent by simply
checking the stack trace of the externalization thread blocked in
handle_userfault(); sigkill will then clear it up :).

If you close the userfaultfd, the SIGBUS behavior will immediately
return for the MADV_USERFAULT marked regions, and any hung task waiting
to be woken will get an immediate SIGBUS.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
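[Editorial sketch, not part of the original message.] The SIGBUS path described above, where the faulting address arrives in siginfo->si_addr, can be tried from userspace with a plain SA_SIGINFO handler. The sketch below does not use MADV_USERFAULT, which exists only in the tree linked above; as a stand-in it triggers SIGBUS by touching a file mapping beyond end-of-file. The helper names (on_sigbus, probe_byte, demo_sigbus) and the /tmp path are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;
static void *volatile fault_addr;

/* SA_SIGINFO handler: record the faulting address, then unwind. */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    fault_addr = info->si_addr;
    siglongjmp(fault_env, 1);
}

/*
 * Read one byte at p. Returns 0 if the read succeeded, 1 if it raised
 * SIGBUS (then *where holds si_addr), -1 if the handler could not be set.
 */
static int probe_byte(volatile char *p, void **where)
{
    struct sigaction sa, old;
    int faulted = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0)
        return -1;

    if (sigsetjmp(fault_env, 1) == 0) {
        char c = *p;            /* may fault */
        (void)c;
    } else {
        *where = fault_addr;    /* we got here through siglongjmp() */
        faulted = 1;
    }
    sigaction(SIGBUS, &old, NULL);
    return faulted;
}

/*
 * Stand-in trigger (assumption): map two pages of a file, shrink the
 * file to one page, then touch both pages. Returns 1 if only the second
 * page faulted and si_addr matched the address touched.
 */
static int demo_sigbus(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/sigbus-demo-XXXXXX";
    int fd = mkstemp(path);
    char *map;
    void *where = NULL;
    int ok;

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, 2 * pg) != 0) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, 2 * pg, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    if (ftruncate(fd, pg) != 0) {
        munmap(map, 2 * pg);
        close(fd);
        return -1;
    }
    ok = probe_byte(map, &where) == 0 &&
         probe_byte(map + pg, &where) == 1 &&
         where == (void *)(map + pg);
    munmap(map, 2 * pg);
    close(fd);
    return ok;
}
```

Calling demo_sigbus() from main() is enough to exercise it. Jumping out of the handler with siglongjmp() is a choice made to keep the sketch short; a real userspace fault handler would more likely repair the mapping and return.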
Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
On Tue, Jun 3, 2014 at 7:57 AM, Johannes Weiner wrote:
> On Thu, May 08, 2014 at 10:12:40AM -0700, John Stultz wrote:
>> On 04/29/2014 02:21 PM, John Stultz wrote:
>> > Another few weeks and another volatile ranges patchset...
>> >
>> > After getting the sense that a major objection to the earlier
>> > patches was the introduction of a new syscall (and its somewhat
>> > strange dual length/purged-bit return values), I spent some time
>> > trying to rework the vma manipulations so we can be sure we won't
>> > fail mid-way through changing volatility (basically making it
>> > atomic). I think I have it working, and thus, there is no longer
>> > the need for a new syscall, and we can go back to using madvise()
>> > to set and unset pages as volatile.
>>
>> Johannes: To get some feedback, maybe I'll needle you directly here a
>> bit. :)
>>
>> Does moving this interface to madvise help reduce your objections?  I
>> feel like your cleaning-the-dirty-bit idea didn't work out, but I was
>> hoping that by reworking the vma manipulations to be atomic, we could
>> move to madvise and still avoid the new syscall that you seemed bothered
>> by.  But I've not really heard much from you recently so I worry your
>> concerns on this were actually elsewhere, and I'm just churning the
>> patch needlessly.
>
> My objection was not the syscall.
>
> From a reclaim perspective, using the dirty state to denote whether a
> swap-backed page needs writeback before reclaim is quite natural and I
> much prefer Minchan's changes to the reclaim code over yours.
>
> From an interface point of view, I would prefer the simplicity of
> cleaning dirty bits to invalidate pages, and a default of zero-filling
> invalidated pages instead of sending SIGBUS.  This also is quite
> natural when you think of anon/shmem mappings as cache pages on top of
> /dev/zero (see mmap_zero() and shmem_zero_setup()).  And it translates
> well to tmpfs.
>
> At the same time, I acknowledge that there are usecases that want
> SIGBUS delivery for more than just convenience in order to implement
> userspace fault handling, and this is the only place where I see a
> real divergence in actual functionality from Minchan's code.

Thanks for the clarification and feedback. Sorry for my slow response,
as I was on vacation for a week and am just now catching up on this.

So again, SIGBUS for userspace fault handling is really more of a
side-effect of having more userspace-friendly semantics, and isn't
really the primary goal/usage model. Zerofill semantics are mostly
problematic because they make userspace mistakes harder to find and
diagnose. Android's ashmem actually uses zerofill semantics, so while I
see it as less ideal, technically zerofill would work here.

However, combining zerofill with your preferred overloading of the dirty
state is particularly problematic because it makes any dirtying of
volatile data clear both the volatile state and the purged state for the
entire page. Clearing the volatile state is surprising but less
problematic; clearing the purged state, however, means applications
could get a partially zeroed page (for whatever wasn't written) and no
warning that their data was lost. This is a very surprising and
unfriendly side-effect from a userspace perspective. For context,
Android's ashmem preserves both the volatile and purged state on
volatile page dirtying (since the volatility and purged state are kept
in their own range structure, independently from the VM).

> That, however, truly is a separate virtual memory feature.  Would it
> be possible for you to take MADV_FREE and MADV_REVIVE as a base and
> implement an madvise op that switches the no-page behavior of a VMA
> from zero-filling to SIGBUS delivery?

I'll see if I can look into it if I get some time. However, I suspect
it's more likely I'll just have to admit defeat on this one and let
someone else champion the effort. Interest and reviews have seemingly
dropped again here and with other work ramping up, I'm not sure if I'll
be able to justify further work on this. :(

thanks
-john
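[Editorial sketch, not part of the original message.] The difference John describes between overloading the dirty bit and keeping volatile/purged state in a separate range structure can be modelled in a few lines of userspace C. Everything here is a toy model under stated assumptions: the struct and function names are invented for illustration, a "page" is eight bytes, and neither scheme is taken from Minchan's patches or from ashmem.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_BYTES 8            /* tiny stand-in for a real page */

/* Scheme 1 (assumed model): a clean page is volatile, a dirty one is not. */
struct dirty_page {
    bool present;
    bool dirty;
    char data[PAGE_BYTES];
};

/* Scheme 2 (assumed model): state lives in a separate range record. */
struct range_page {
    bool present;
    bool is_volatile;
    bool purged;
    char data[PAGE_BYTES];
};

static void dirty_write(struct dirty_page *p, int off, char c)
{
    if (!p->present) {          /* zero-fill on fault */
        memset(p->data, 0, PAGE_BYTES);
        p->present = true;
    }
    p->data[off] = c;
    p->dirty = true;            /* dirtying silently ends volatility */
}

static void range_write(struct range_page *p, int off, char c)
{
    if (!p->present) {          /* zero-fill on fault */
        memset(p->data, 0, PAGE_BYTES);
        p->present = true;
    }
    p->data[off] = c;           /* is_volatile and purged are untouched */
}

/* Fill, mark volatile, purge, write one byte, unmark; return "purged?". */
static bool dirty_scenario(char *untouched_byte)
{
    struct dirty_page p = { false, false, { 0 } };
    bool reported;

    for (int i = 0; i < PAGE_BYTES; i++)
        dirty_write(&p, i, 'A');
    p.dirty = false;            /* mark volatile by cleaning the dirty bit */
    if (!p.dirty)
        p.present = false;      /* reclaim drops the clean page */
    dirty_write(&p, 0, 'X');    /* application writes one byte */
    reported = !p.present;      /* the only trace left of a purge */
    *untouched_byte = p.data[1];
    return reported;
}

static bool range_scenario(char *untouched_byte)
{
    struct range_page p = { false, false, false, { 0 } };
    bool reported;

    for (int i = 0; i < PAGE_BYTES; i++)
        range_write(&p, i, 'A');
    p.is_volatile = true;       /* mark volatile in the range record */
    if (p.is_volatile) {
        p.present = false;      /* reclaim purges the page ... */
        p.purged = true;        /* ... and remembers that it did */
    }
    range_write(&p, 0, 'X');
    reported = p.purged;        /* unmark reports the purge */
    p.is_volatile = false;
    p.purged = false;
    *untouched_byte = p.data[1];
    return reported;
}
```

In this model both scenarios leave byte 1 zeroed, but only range_scenario() returns true; dirty_scenario() returns false, which corresponds to the "partial zero page and no warning" case described above.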
Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
On Thu, May 08, 2014 at 10:12:40AM -0700, John Stultz wrote:
> On 04/29/2014 02:21 PM, John Stultz wrote:
> > Another few weeks and another volatile ranges patchset...
> >
> > After getting the sense that a major objection to the earlier
> > patches was the introduction of a new syscall (and its somewhat
> > strange dual length/purged-bit return values), I spent some time
> > trying to rework the vma manipulations so we can be sure we won't
> > fail mid-way through changing volatility (basically making it
> > atomic). I think I have it working, and thus, there is no longer
> > the need for a new syscall, and we can go back to using madvise()
> > to set and unset pages as volatile.
>
> Johannes: To get some feedback, maybe I'll needle you directly here a
> bit. :)
>
> Does moving this interface to madvise help reduce your objections?  I
> feel like your cleaning-the-dirty-bit idea didn't work out, but I was
> hoping that by reworking the vma manipulations to be atomic, we could
> move to madvise and still avoid the new syscall that you seemed bothered
> by.  But I've not really heard much from you recently so I worry your
> concerns on this were actually elsewhere, and I'm just churning the
> patch needlessly.

My objection was not the syscall.

From a reclaim perspective, using the dirty state to denote whether a
swap-backed page needs writeback before reclaim is quite natural and I
much prefer Minchan's changes to the reclaim code over yours.

From an interface point of view, I would prefer the simplicity of
cleaning dirty bits to invalidate pages, and a default of zero-filling
invalidated pages instead of sending SIGBUS.  This also is quite
natural when you think of anon/shmem mappings as cache pages on top of
/dev/zero (see mmap_zero() and shmem_zero_setup()).  And it translates
well to tmpfs.

At the same time, I acknowledge that there are usecases that want
SIGBUS delivery for more than just convenience in order to implement
userspace fault handling, and this is the only place where I see a
real divergence in actual functionality from Minchan's code.

That, however, truly is a separate virtual memory feature.  Would it
be possible for you to take MADV_FREE and MADV_REVIVE as a base and
implement an madvise op that switches the no-page behavior of a VMA
from zero-filling to SIGBUS delivery?
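[Editorial sketch, not part of the original message.] The zero-fill default Johannes prefers can be observed today with an existing madvise() operation. The sketch below uses MADV_DONTNEED on a private anonymous mapping as a deterministic stand-in, because on Linux the discarded pages read back as zeros on the next access. It deliberately does not use MADV_FREE or MADV_REVIVE as discussed in this thread: their reclaim is lazy, so the outcome would depend on memory pressure. The function name is an assumption made for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a private anonymous page reads back as all zeros after
 * MADV_DONTNEED, 0 if any byte survived, -1 on setup failure.
 * Linux-specific behaviour; other systems may treat the hint differently.
 */
static int zero_fill_after_discard(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    int all_zero = 1;

    p = mmap(NULL, pg, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, pg);                    /* make the page dirty */

    if (madvise(p, pg, MADV_DONTNEED) != 0) {
        munmap(p, pg);
        return -1;
    }

    /* The next access faults in a fresh zero-filled page: no SIGBUS. */
    for (long i = 0; i < pg; i++) {
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }
    munmap(p, pg);
    return all_zero;
}
```

This shows why zero-fill is "quiet": the program gets no signal and no error, only zeros where its data used to be, which is the behaviour the SIGBUS proponents in this thread object to.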
Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
On Thu, May 08, 2014 at 10:04:49AM -0700, John Stultz wrote:
> On 05/07/2014 10:58 PM, Minchan Kim wrote:
> > On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
> >> Another few weeks and another volatile ranges patchset...
> >>
> >> After getting the sense that a major objection to the earlier
> >> patches was the introduction of a new syscall (and its somewhat
> >> strange dual length/purged-bit return values), I spent some time
> >> trying to rework the vma manipulations so we can be sure we won't
> >> fail mid-way through changing volatility (basically making it
> >> atomic). I think I have it working, and thus, there is no longer
> >> the need for a new syscall, and we can go back to using madvise()
> >> to set and unset pages as volatile.
> > As I said reply as other patch's reply, I'm ok with this but I'd
> > like to make it clear to support zero-filled page as well as SIGBUS.
> > If we want to use madvise, maybe we need another advise flag like
> > MADV_VOLATILE_SIGBUS.
>
> I still disagree that zero-fill is more obvious behavior. And again, I
> still support MADV_VOLATILE and MADV_FREE both being added, as they
> really do have different use cases that I'd rather not try to fit into
> one operation.

As I replied in the previous mail, MADV_FREE is a one-shot operation, so
pages faulted in afterwards are not affected, and the caller has to call
the syscall again at some point to make the range volatile again.
MADV_FREE is also O(N), so vrange with zero-fill could avoid that cost
entirely.

> >>
> >> New changes are:
> >>
> >> o Reworked vma manipulations to be atomic
> >> o Converted back to using madvise() as syscall interface
> >> o Integrated fix from Minchan to avoid SIGBUS faulting race
> >> o Caught/fixed subtle use-after-free bug w/ vma merging
> >> o Lots of minor cleanups and comment improvements
> >>
> >>
> >> Still on the TODO list
> >>
> >> o Sort out how best to do page accounting when the volatility
> >>   is tracked on a per-mm basis.
> > What is your concern about page accounting?
> > Could you elaborate it more for everybody to understand your concern
> > clearly.
>
> Basically the issue is that since we keep the volatility in the vma,
> when we mark a page as volatile, it's only marking the page for that
> process, not globally (since the page may be COWed). This makes
> keeping track of the number of actual pages that are volatile accurately
> somewhat difficult, since we can't just add one for each page marked and
> subtract one for each page unmarked (for tmpfs/shm file based
> volatility, where volatility is shared globally, this will be much easier ;)
>
> It might not be too hard to keep a per-process-pages count of
> volatility, but in that case we could see some strange effects where it
> seems like there are 3x the number of actual volatile pages, and that
> might throw off some of the scanning. So it's something I've deferred a
> bit to think about.

Okay. So, why do you want to account volatile pages?

Originally, what I expected was to age the anonymous LRU list until the
volatile count reached zero, so the aging overhead would be zero once
there were no volatile pages left in the system. The downside of that
approach is that it makes the vrange marking syscall cost O(N).

That's why I suggested counting volatile *vmas* instead of volatile
*pages*. It could cause unnecessary aging of the anon LRU list if there
are no physical pages in the vma, but I think it's a good deal because
we move hot path overhead to the slow path, and that's one of the design
goals of the vrange syscall. We might make an effort to make such aging
less aggressive in the future, which would be another topic.

> >
> >> o Revisit anonymous page aging on swapless systems
> > One idea is that we can age forcefully on swapless system if system
> > has volatile vma or lazyfree pages. If the number of volatile vma or
> > lazyfree pages is zero, we can stop the aging automatically.
>
> I'll look into this some more.
>
> >
> >> o Draft up re-adding tmpfs/shm file volatility support
> >>
> > o One concern from minchan.
> > I really like O(1) cost of unmarking syscall.
> >
> > Vrange syscall is for others, not itself. I mean if some process calls
> > vrange syscall, it would sacrifice his resource for others when
> > emergency happens so if the syscall is overhead rather expensive,
> > anybody doesn't want to use it.
>
> So yes. I agree the cost is more expensive than I'd like. However, I'd
> like to get a consensus on the expected behavior established and get
> folks first agreeing to the semantics and the interface. Then we can
> follow up with optimizations.

Oops, I forgot to mention "We could do it as an optimization in the
future". I absolutely agree with you. I don't want to do that at this
stage; I just want to record one idea for optimizing it, so don't get me
wrong. It's not an objection.

>
> > One idea is put increasing counter in mm_struct and assign the token
> > to volatile vma.
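[Editorial sketch, not part of the original message.] The accounting concern quoted above, where a per-process count can look like "3x the number of actual volatile pages", can be shown with a toy model. The sketch assumes three processes that share the same four physical pages (for example after fork(), before any copy-on-write) and that each marks its own mapping volatile. All names are hypothetical, and counting a frame as volatile when at least one mapper marked it is an assumption of this model, not something the thread settles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPROC  3    /* processes sharing the range */
#define NPAGES 4    /* pages in the shared, not-yet-COWed range */

/*
 * frame[p][i]: id of the physical page behind virtual page i of process p
 *              (ids must be smaller than NPROC * NPAGES)
 * vol[p][i]:   true if process p marked that virtual page volatile
 */

/* Model the fork() case: every process maps the same frames, all marked. */
static void share_all_and_mark(int frame[NPROC][NPAGES],
                               bool vol[NPROC][NPAGES])
{
    for (int p = 0; p < NPROC; p++) {
        for (int i = 0; i < NPAGES; i++) {
            frame[p][i] = i;
            vol[p][i] = true;
        }
    }
}

/* Naive accounting: add one for every page each process marked. */
static size_t per_process_total(bool vol[NPROC][NPAGES])
{
    size_t total = 0;

    for (int p = 0; p < NPROC; p++)
        for (int i = 0; i < NPAGES; i++)
            if (vol[p][i])
                total++;
    return total;
}

/* Count each physical frame once, however many processes marked it. */
static size_t distinct_volatile_frames(int frame[NPROC][NPAGES],
                                       bool vol[NPROC][NPAGES])
{
    bool seen[NPROC * NPAGES] = { false };
    size_t total = 0;

    for (int p = 0; p < NPROC; p++) {
        for (int i = 0; i < NPAGES; i++) {
            if (vol[p][i] && !seen[frame[p][i]]) {
                seen[frame[p][i]] = true;
                total++;
            }
        }
    }
    return total;
}
```

With the shared setup, per_process_total() gives 12 while distinct_volatile_frames() gives 4. The second function has to walk every page, which is the O(N) cost Minchan wants to keep out of the marking syscall; counting volatile vmas instead avoids the walk at the price of precision.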
Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
On 04/29/2014 02:21 PM, John Stultz wrote:
> Another few weeks and another volatile ranges patchset...
>
> After getting the sense that a major objection to the earlier
> patches was the introduction of a new syscall (and its somewhat
> strange dual length/purged-bit return values), I spent some time
> trying to rework the vma manipulations so we can be sure we won't
> fail mid-way through changing volatility (basically making it
> atomic). I think I have it working, and thus, there is no longer
> the need for a new syscall, and we can go back to using madvise()
> to set and unset pages as volatile.

Johannes: To get some feedback, maybe I'll needle you directly here a
bit. :)

Does moving this interface to madvise help reduce your objections? I
feel like your cleaning-the-dirty-bit idea didn't work out, but I was
hoping that by reworking the vma manipulations to be atomic, we could
move to madvise and still avoid the new syscall that you seemed bothered
by. But I've not really heard much from you recently so I worry your
concerns on this were actually elsewhere, and I'm just churning the
patch needlessly.

thanks
-john
Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
On 05/07/2014 10:58 PM, Minchan Kim wrote:
> On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
>> Another few weeks and another volatile ranges patchset...
>>
>> After getting the sense that a major objection to the earlier
>> patches was the introduction of a new syscall (and its somewhat
>> strange dual length/purged-bit return values), I spent some time
>> trying to rework the vma manipulations so we can be sure we won't
>> fail mid-way through changing volatility (basically making it
>> atomic). I think I have it working, and thus, there is no longer
>> the need for a new syscall, and we can go back to using madvise()
>> to set and unset pages as volatile.
> As I said reply as other patch's reply, I'm ok with this but I'd
> like to make it clear to support zero-filled page as well as SIGBUS.
> If we want to use madvise, maybe we need another advise flag like
> MADV_VOLATILE_SIGBUS.

I still disagree that zero-fill is more obvious behavior. And again, I
still support MADV_VOLATILE and MADV_FREE both being added, as they
really do have different use cases that I'd rather not try to fit into
one operation.

>>
>> New changes are:
>>
>> o Reworked vma manipulations to be atomic
>> o Converted back to using madvise() as syscall interface
>> o Integrated fix from Minchan to avoid SIGBUS faulting race
>> o Caught/fixed subtle use-after-free bug w/ vma merging
>> o Lots of minor cleanups and comment improvements
>>
>>
>> Still on the TODO list
>>
>> o Sort out how best to do page accounting when the volatility
>>   is tracked on a per-mm basis.
> What is your concern about page accounting?
> Could you elaborate it more for everybody to understand your concern
> clearly.

Basically the issue is that since we keep the volatility in the vma,
when we mark a page as volatile, it's only marking the page for that
process, not globally (since the page may be COWed). This makes keeping
an accurate count of the pages that are actually volatile somewhat
difficult, since we can't just add one for each page marked and subtract
one for each page unmarked (for tmpfs/shm file based volatility, where
volatility is shared globally, this will be much easier ;)

It might not be too hard to keep a per-process-pages count of
volatility, but in that case we could see some strange effects where it
seems like there are 3x the number of actual volatile pages, and that
might throw off some of the scanning. So it's something I've deferred a
bit to think about.

>> o Revisit anonymous page aging on swapless systems
> One idea is that we can age forcefully on swapless system if system
> has volatile vma or lazyfree pages. If the number of volatile vma or
> lazyfree pages is zero, we can stop the aging automatically.

I'll look into this some more.

>
>> o Draft up re-adding tmpfs/shm file volatility support
>>
> o One concern from minchan.
> I really like O(1) cost of unmarking syscall.
>
> Vrange syscall is for others, not itself. I mean if some process calls
> vrange syscall, it would sacrifice his resource for others when
> emergency happens so if the syscall is overhead rather expensive,
> anybody doesn't want to use it.

So yes. I agree the cost is more expensive than I'd like. However, I'd
like to get a consensus on the expected behavior established and get
folks first agreeing to the semantics and the interface. Then we can
follow up with optimizations.

> One idea is put increasing counter in mm_struct and assign the token
> to volatile vma. Maybe we can squeeze it into vma->vm_start's lower
> bits if we don't want to bloat vma size because we always hold mmap_sem
> with write-side lock when we handle vrange syscall.
> And we can use the token and purged mark together to pte when the purge
> happens. With this, we can bail out as soon as we found purged entry in
> unmarking syscall so remained ptes still have purged pte although
> unmarking syscall is done. But it's no problem because if the vma is
> marked as volatile again, the token will be changed (i.e., increased)
> and doesn't match with pte's token. When the page fault occurs, we can
> compare the token to emit SIGBUS. If it doesn't match, we can ignore
> and just map new page to pte.
>
> One problem is overflow of counter. In the case, we can deliver false
> positive to user but it isn't severe, either because user has a
> preparation to handle SIGBUS if he wants to use vrange syscall with
> SIGBUS model.

This sounds like an interesting optimization. But again, I worry that
adding these edge cases (which I honestly really don't see as
problematic) muddies the water and keeps reviewers away. I'd rather wait
until after we have something settled behavior-wise, then start
discussing these performance optimizations that may cause
safe-but-false-positives.

Thanks so much for your review and guidance here (I was worried I had
lost everyone's attention again). I really appreciate the feedback!

thanks
-john
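[Editorial sketch, not part of the original message.] Minchan's token idea quoted above can be modelled in userspace to show why stale purged entries left behind by an early-exit unmark are harmless. This is a toy model only: mm_model, vma_model, pte_model and the function names are invented for illustration and do not correspond to kernel structures, and the sketch ignores locking as well as the counter-overflow case raised in the quote.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fault_result { FAULT_MAP_NEW_PAGE, FAULT_SIGBUS };

struct mm_model  { unsigned counter; };                 /* per-mm generation counter */
struct vma_model { bool is_volatile; unsigned token; }; /* token of the current marking */
struct pte_model { bool purged; unsigned token; };      /* token stamped at purge time */

/* Marking takes a fresh token, which invalidates purge marks from older rounds. */
static void mark_volatile(struct mm_model *mm, struct vma_model *vma)
{
    vma->is_volatile = true;
    vma->token = ++mm->counter;
}

/* Reclaim stamps the pte with the token of the marking it purged under. */
static void purge(const struct vma_model *vma, struct pte_model *pte)
{
    pte->purged = true;
    pte->token = vma->token;
}

/*
 * Unmark: report whether anything was purged in this round. The scan
 * stops at the first matching entry and leaves the remaining purged
 * ptes in place (stale).
 */
static bool unmark_volatile(struct vma_model *vma,
                            const struct pte_model *ptes, size_t n)
{
    bool purged = false;

    for (size_t i = 0; i < n; i++) {
        if (ptes[i].purged && ptes[i].token == vma->token) {
            purged = true;
            break;
        }
    }
    vma->is_volatile = false;
    return purged;
}

/* Fault: SIGBUS only if the purge mark belongs to the current marking. */
static enum fault_result handle_fault(const struct vma_model *vma,
                                      struct pte_model *pte)
{
    if (pte->purged && vma->is_volatile && pte->token == vma->token)
        return FAULT_SIGBUS;
    pte->purged = false;        /* stale or non-volatile: map a fresh page */
    return FAULT_MAP_NEW_PAGE;
}
```

A possible walk-through: mark the vma, purge two ptes, and a fault on either returns FAULT_SIGBUS. After unmark_volatile() (which reports true and leaves the second pte stale) and a second mark_volatile(), a fault on the stale pte returns FAULT_MAP_NEW_PAGE because its token no longer matches. A wrapped counter could make an old token match again, which is the false-positive SIGBUS mentioned in the quote.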
Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
On 05/07/2014 10:58 PM, Minchan Kim wrote:
> On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
> > Another few weeks and another volatile ranges patchset...
> >
> > After getting the sense that a major objection to the earlier
> > patches was the introduction of a new syscall (and its somewhat
> > strange dual length/purged-bit return values), I spent some time
> > trying to rework the vma manipulations so we can be sure we won't
> > fail mid-way through changing volatility (basically making it
> > atomic). I think I have it working, and thus, there is no longer
> > the need for a new syscall, and we can go back to using madvise()
> > to set and unset pages as volatile.
>
> As I said in my reply to the other patch, I'm OK with this, but I'd
> like to make it clear that we should support zero-filled pages as
> well as SIGBUS. If we want to use madvise, maybe we need another
> advice flag like MADV_VOLATILE_SIGBUS.

I still disagree that zero-fill is more obvious behavior. And again, I
still support MADV_VOLATILE and MADV_FREE both being added, as they
really do have different use cases that I'd rather not try to fit into
one operation.

> > New changes are:
> >
> > o Reworked vma manipulations to be atomic
> > o Converted back to using madvise() as syscall interface
> > o Integrated fix from Minchan to avoid SIGBUS faulting race
> > o Caught/fixed subtle use-after-free bug w/ vma merging
> > o Lots of minor cleanups and comment improvements
> >
> > Still on the TODO list
> >
> > o Sort out how best to do page accounting when the volatility
> >   is tracked on a per-mm basis.
>
> What is your concern about page accounting? Could you elaborate so
> that everybody understands your concern clearly?

Basically the issue is that since we keep the volatility in the vma,
when we mark a page as volatile, it's only marking the page for that
process, not globally (since the page may be COWed). This makes
accurately tracking the number of pages that are actually volatile
somewhat difficult, since we can't just add one for each page marked
and subtract one for each page unmarked (for tmpfs/shm file based
volatility, where volatility is shared globally, this will be much
easier ;)

It might not be too hard to keep a per-process-pages count of
volatility, but in that case we could see some strange effects where
it seems like there are 3x the number of actual volatile pages, and
that might throw off some of the scanning. So it's something I've
deferred a bit to think about.

> > o Revisit anonymous page aging on swapless systems
>
> One idea is that we can force aging on a swapless system if the
> system has volatile vmas or lazyfree pages. If the number of volatile
> vmas or lazyfree pages is zero, we can stop the aging automatically.

I'll look into this some more.

> > o Draft up re-adding tmpfs/shm file volatility support
>
> o One concern from minchan.
>
> I'd really like an O(1) cost for the unmarking syscall. The vrange
> syscall is for the benefit of others, not the caller itself. I mean,
> if a process calls the vrange syscall, it sacrifices its own
> resources for others when an emergency happens, so if the syscall is
> rather expensive, nobody will want to use it.

So yes. I agree the cost is more expensive than I'd like. However, I'd
like to get a consensus on the expected behavior established and get
folks first agreeing to the semantics and the interface. Then we can
follow up with optimizations.

> One idea is to put an increasing counter in mm_struct and assign its
> value as a token to each volatile vma. Maybe we can squeeze it into
> vma->vm_start's lower bits if we don't want to bloat the vma size,
> because we always hold mmap_sem with the write-side lock when we
> handle the vrange syscall. And we can store the token together with
> the purged mark in the pte when the purge happens.
>
> With this, we can bail out of the unmarking syscall as soon as we
> find a purged entry, so the remaining ptes still carry the purged
> mark although the unmarking syscall is done. But that's no problem,
> because if the vma is marked volatile again, the token will change
> (i.e., increase) and no longer match the pte's token. When a page
> fault occurs, we can compare the tokens to decide whether to emit
> SIGBUS. If they don't match, we can ignore the purged mark and just
> map a new page into the pte.
>
> One problem is overflow of the counter. In that case we could deliver
> a false positive to the user, but that isn't severe either, because
> the user must be prepared to handle SIGBUS anyway if he wants to use
> the vrange syscall with the SIGBUS model.

This sounds like an interesting optimization. But again, I worry that
adding these edge cases (which I honestly really don't see as
problematic) muddies the water and keeps reviewers away. I'd rather
wait until after we have something settled behavior-wise, then start
discussing these performance optimizations that may cause
safe-but-false-positives.

Thanks so much for your review and guidance here (I was worried I had
lost everyone's attention again). I really appreciate the feedback!

thanks
-john
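The accounting problem described above (per-process marking of pages
that may be COW-shared) can be illustrated with a small model. This is
only a sketch in Python, not kernel code: the class name
PerMmAccounting, the process names, and the integer page ids are all
hypothetical, and "distinct pages" is just one possible definition of
the actual count.

```python
from collections import defaultdict


class PerMmAccounting:
    """Toy model: volatility is recorded per process (per-mm), while the
    physical pages themselves may be shared between processes via COW."""

    def __init__(self):
        # mm name -> set of physical page ids that this mm marked volatile
        self.volatile = defaultdict(set)

    def mark_volatile(self, mm, pages):
        self.volatile[mm].update(pages)

    def mark_nonvolatile(self, mm, pages):
        self.volatile[mm].difference_update(pages)

    def per_process_total(self):
        # "Add one per page marked": the sum of the per-process counters.
        return sum(len(pages) for pages in self.volatile.values())

    def distinct_pages(self):
        # Number of distinct physical pages behind those counters.
        return len(set().union(*self.volatile.values()))


# Three processes share the same ten physical pages (e.g. after fork)
# and each marks its own mapping volatile.
acct = PerMmAccounting()
shared_pages = range(10)
for mm in ("parent", "child1", "child2"):
    acct.mark_volatile(mm, shared_pages)
```

In this model the summed per-process counters report 30 volatile pages
while only 10 distinct physical pages exist, which is the 3x effect
mentioned above.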
Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
On 04/29/2014 02:21 PM, John Stultz wrote:
> Another few weeks and another volatile ranges patchset...
>
> After getting the sense that a major objection to the earlier
> patches was the introduction of a new syscall (and its somewhat
> strange dual length/purged-bit return values), I spent some time
> trying to rework the vma manipulations so we can be sure we won't
> fail mid-way through changing volatility (basically making it
> atomic). I think I have it working, and thus, there is no longer the
> need for a new syscall, and we can go back to using madvise() to set
> and unset pages as volatile.

Johannes: To get some feedback, maybe I'll needle you directly here a
bit. :)

Does moving this interface to madvise help reduce your objections? I
feel like your cleaning-the-dirty-bit idea didn't work out, but I was
hoping that by reworking the vma manipulations to be atomic, we could
move to madvise and still avoid the new syscall that you seemed
bothered by. But I've not really heard much from you recently, so I
worry your concerns on this were actually elsewhere, and I'm just
churning the patch needlessly.

thanks
-john
Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
On Thu, May 08, 2014 at 10:04:49AM -0700, John Stultz wrote:
> On 05/07/2014 10:58 PM, Minchan Kim wrote:
> > On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
> > > Another few weeks and another volatile ranges patchset...
> > >
> > > After getting the sense that a major objection to the earlier
> > > patches was the introduction of a new syscall (and its somewhat
> > > strange dual length/purged-bit return values), I spent some time
> > > trying to rework the vma manipulations so we can be sure we won't
> > > fail mid-way through changing volatility (basically making it
> > > atomic). I think I have it working, and thus, there is no longer
> > > the need for a new syscall, and we can go back to using madvise()
> > > to set and unset pages as volatile.
> >
> > As I said in my reply to the other patch, I'm OK with this, but I'd
> > like to make it clear that we should support zero-filled pages as
> > well as SIGBUS. If we want to use madvise, maybe we need another
> > advice flag like MADV_VOLATILE_SIGBUS.
>
> I still disagree that zero-fill is more obvious behavior. And again,
> I still support MADV_VOLATILE and MADV_FREE both being added, as they
> really do have different use cases that I'd rather not try to fit
> into one operation.

As I replied in my previous mail, MADV_FREE is a one-shot operation: it
does not affect pages faulted in afterwards, so the caller has to call
the syscall again at some point to make the range volatile again. And
MADV_FREE is O(N), so vrange with zero-fill could avoid that entirely.

> > > New changes are:
> > >
> > > o Reworked vma manipulations to be atomic
> > > o Converted back to using madvise() as syscall interface
> > > o Integrated fix from Minchan to avoid SIGBUS faulting race
> > > o Caught/fixed subtle use-after-free bug w/ vma merging
> > > o Lots of minor cleanups and comment improvements
> > >
> > > Still on the TODO list
> > >
> > > o Sort out how best to do page accounting when the volatility
> > >   is tracked on a per-mm basis.
> >
> > What is your concern about page accounting? Could you elaborate so
> > that everybody understands your concern clearly?
>
> Basically the issue is that since we keep the volatility in the vma,
> when we mark a page as volatile, it's only marking the page for that
> process, not globally (since the page may be COWed). This makes
> accurately tracking the number of pages that are actually volatile
> somewhat difficult, since we can't just add one for each page marked
> and subtract one for each page unmarked (for tmpfs/shm file based
> volatility, where volatility is shared globally, this will be much
> easier ;)
>
> It might not be too hard to keep a per-process-pages count of
> volatility, but in that case we could see some strange effects where
> it seems like there are 3x the number of actual volatile pages, and
> that might throw off some of the scanning. So it's something I've
> deferred a bit to think about.

Okay. So, why do you want to account volatile pages?

Originally, what I expected was to age the anonymous LRU list until the
count drops to zero, so the aging overhead would be zero once there are
no volatile pages left in the system. The downside of that approach is
that it makes the vrange marking syscall cost O(N). That's why I
suggested counting volatile *vmas* instead of volatile *pages*. That
could cause unnecessary aging of the anon LRU list if there are no
physical pages in the vma, but I think it's a good deal because we
move hot path overhead to the slow path, and that's one of the design
goals of the vrange syscall. We might make an effort to make such
aging less aggressive in the future, which would be another topic.

> > > o Revisit anonymous page aging on swapless systems
> >
> > One idea is that we can force aging on a swapless system if the
> > system has volatile vmas or lazyfree pages. If the number of
> > volatile vmas or lazyfree pages is zero, we can stop the aging
> > automatically.
>
> I'll look into this some more.
>
> > > o Draft up re-adding tmpfs/shm file volatility support
> >
> > o One concern from minchan.
> >
> > I'd really like an O(1) cost for the unmarking syscall. The vrange
> > syscall is for the benefit of others, not the caller itself. I
> > mean, if a process calls the vrange syscall, it sacrifices its own
> > resources for others when an emergency happens, so if the syscall
> > is rather expensive, nobody will want to use it.
>
> So yes. I agree the cost is more expensive than I'd like. However,
> I'd like to get a consensus on the expected behavior established and
> get folks first agreeing to the semantics and the interface. Then we
> can follow up with optimizations.

Oops, I forgot to mention that we could do it as an optimization in the
future. I absolutely agree with you. I don't want to do that at this
stage; I just want to record one idea for optimizing it, so don't get
me wrong. It's not an objection.

> > One idea is to put an increasing counter in mm_struct and assign
> > its value as a token to each volatile vma. Maybe we can squeeze it
> > into vma->vm_start's lower bits if we don't want to bloat the vma
> > size, because we always hold mmap_sem with the write-side lock when
> > we handle
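The trade-off argued above (count volatile vmas rather than volatile
pages, and age the anonymous LRU on swapless systems only while that
count is nonzero) might be modeled as below. This is a sketch under
stated assumptions, not the patch's implementation: VolatileStats,
should_age_anon_lru and the counter names are hypothetical.

```python
class VolatileStats:
    """Toy system-wide counters. Marking or unmarking a vma touches one
    counter, so it is O(1) regardless of how many pages the vma spans."""

    def __init__(self):
        self.volatile_vmas = 0
        self.lazyfree_pages = 0

    def vma_marked_volatile(self):
        self.volatile_vmas += 1

    def vma_unmarked(self):
        self.volatile_vmas -= 1


def should_age_anon_lru(stats, has_swap):
    """Decide whether reclaim should bother aging the anonymous LRU."""
    if has_swap:
        # With swap, anonymous pages are reclaimable anyway.
        return True
    # Swapless: aging only pays off if something could be purged.
    return stats.volatile_vmas > 0 or stats.lazyfree_pages > 0
```

The cost of the simplification is the one named above: a volatile vma
with no physical pages behind it still keeps aging switched on.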
Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
> Another few weeks and another volatile ranges patchset...
>
> After getting the sense that a major objection to the earlier
> patches was the introduction of a new syscall (and its somewhat
> strange dual length/purged-bit return values), I spent some time
> trying to rework the vma manipulations so we can be sure we won't fail
> mid-way through changing volatility (basically making it atomic).
> I think I have it working, and thus, there is no longer the
> need for a new syscall, and we can go back to using madvise()
> to set and unset pages as volatile.

As I said in my reply to the other patch, I'm OK with this, but I'd
like to make it clear that we should support zero-filled pages as well
as SIGBUS. If we want to use madvise, maybe we need another advice
flag like MADV_VOLATILE_SIGBUS.

>
> New changes are:
>
> o Reworked vma manipulations to be atomic
> o Converted back to using madvise() as syscall interface
> o Integrated fix from Minchan to avoid SIGBUS faulting race
> o Caught/fixed subtle use-after-free bug w/ vma merging
> o Lots of minor cleanups and comment improvements
>
>
> Still on the TODO list
>
> o Sort out how best to do page accounting when the volatility
>   is tracked on a per-mm basis.

What is your concern about page accounting? Could you elaborate so
that everybody understands your concern clearly?

> o Revisit anonymous page aging on swapless systems

One idea is that we can force aging on a swapless system if the system
has volatile vmas or lazyfree pages. If the number of volatile vmas or
lazyfree pages is zero, we can stop the aging automatically.

> o Draft up re-adding tmpfs/shm file volatility support
>

o One concern from minchan.

I'd really like an O(1) cost for the unmarking syscall. The vrange
syscall is for the benefit of others, not the caller itself. I mean,
if a process calls the vrange syscall, it sacrifices its own resources
for others when an emergency happens, so if the syscall is rather
expensive, nobody will want to use it.

One idea is to put an increasing counter in mm_struct and assign its
value as a token to each volatile vma. Maybe we can squeeze it into
vma->vm_start's lower bits if we don't want to bloat the vma size,
because we always hold mmap_sem with the write-side lock when we
handle the vrange syscall. And we can store the token together with
the purged mark in the pte when the purge happens.

With this, we can bail out of the unmarking syscall as soon as we find
a purged entry, so the remaining ptes still carry the purged mark
although the unmarking syscall is done. But that's no problem, because
if the vma is marked volatile again, the token will change (i.e.,
increase) and no longer match the pte's token. When a page fault
occurs, we can compare the tokens to decide whether to emit SIGBUS. If
they don't match, we can ignore the purged mark and just map a new
page into the pte.

One problem is overflow of the counter. In that case we could deliver
a false positive to the user, but that isn't severe either, because
the user must be prepared to handle SIGBUS anyway if he wants to use
the vrange syscall with the SIGBUS model.

>
> Many thanks again to Minchan, Kosaki-san, Johannes, Jan, Rik,
> Hugh, and others for the great feedback and discussion at
> LSF-MM.
>
> thanks
> -john
>
>
> Volatile ranges provide a method for userland to inform the kernel
> that a range of memory is safe to discard (ie: can be regenerated)
> but userspace may want to try to access it in the future. It can be
> thought of as similar to MADV_DONTNEED, but that the actual freeing
> of the memory is delayed and only done under memory pressure, and the
> user can try to cancel the action and be able to quickly access any
> unpurged pages. The idea originated from Android's ashmem, but I've
> since learned that other OSes provide similar functionality.
>
> This functionality allows for a number of interesting uses. One such
> example is: Userland caches that have kernel triggered eviction under
> memory pressure. This allows for the kernel to "rightsize" userspace
> caches for current system-wide workload. Things like image bitmap
> caches, or rendered HTML in a hidden browser tab, where the data is
> not visible and can be regenerated if needed, are good examples.
>
> Both Chrome and Firefox already make use of volatile range-like
> functionality via the ashmem interface:
> https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34
>
> https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc
>
>
> The basic usage of volatile ranges is as follows:
> 1) Userland marks a range of memory that can be regenerated if
> necessary as volatile
> 2) Before accessing the memory again, userland marks the memory as
> nonvolatile, and the kernel will provide notification if any pages in
> the range have been purged.
>
> If userland accesses memory while it is volatile, it will either
> get the value stored at that memory if there has
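The token idea from this message (a counter in mm_struct, a token per
volatile vma, and the token stored next to the purged mark in the pte)
can be walked through with a small userspace model. This is a sketch
only, written in Python rather than kernel C. Every name in it (Mm,
Vma, mark_volatile, purge, unmark_volatile, fault, token_bits) is
hypothetical, and checking the vma's volatile flag in the fault path
is an assumption the message does not spell out.

```python
SIGBUS = "SIGBUS"
MAPPED_NEW = "mapped new page"
PRESENT = "present"


class Mm:
    def __init__(self, token_bits=8):
        self.counter = 0
        # A narrow token wraps around; that is the overflow case.
        self.mask = (1 << token_bits) - 1


class Vma:
    def __init__(self, mm, npages):
        self.mm = mm
        self.volatile = False
        self.token = 0
        # Each pte is PRESENT or a ("purged", token) tuple.
        self.ptes = [PRESENT] * npages


def mark_volatile(vma):
    # Every marking takes a fresh token from the per-mm counter.
    vma.mm.counter = (vma.mm.counter + 1) & vma.mm.mask
    vma.token = vma.mm.counter
    vma.volatile = True


def purge(vma, idx):
    # Reclaim stores the vma's current token with the purged mark.
    vma.ptes[idx] = ("purged", vma.token)


def unmark_volatile(vma):
    """Report whether anything was purged, bailing out at the first
    purged pte found. Later purged ptes are left in place (stale)."""
    vma.volatile = False
    for pte in vma.ptes:
        if pte != PRESENT:
            return True
    return False


def fault(vma, idx):
    pte = vma.ptes[idx]
    if pte == PRESENT:
        return PRESENT
    if vma.volatile and pte[1] == vma.token:
        return SIGBUS
    # Stale mark from an earlier volatile period: ignore it.
    vma.ptes[idx] = PRESENT
    return MAPPED_NEW
```

With token_bits=1 the counter wraps after two markings, so a stale
purged pte can match the current token again and produce the
false-positive SIGBUS described above.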
[PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
Another few weeks and another volatile ranges patchset...

After getting the sense that a major objection to the earlier
patches was the introduction of a new syscall (and its somewhat
strange dual length/purged-bit return values), I spent some time
trying to rework the vma manipulations so we can be sure we won't fail
mid-way through changing volatility (basically making it atomic).
I think I have it working, and thus, there is no longer the
need for a new syscall, and we can go back to using madvise()
to set and unset pages as volatile.


New changes are:

 o Reworked vma manipulations to be atomic
 o Converted back to using madvise() as syscall interface
 o Integrated fix from Minchan to avoid SIGBUS faulting race
 o Caught/fixed subtle use-after-free bug w/ vma merging
 o Lots of minor cleanups and comment improvements


Still on the TODO list

 o Sort out how best to do page accounting when the volatility
   is tracked on a per-mm basis.
 o Revisit anonymous page aging on swapless systems
 o Draft up re-adding tmpfs/shm file volatility support


Many thanks again to Minchan, Kosaki-san, Johannes, Jan, Rik,
Hugh, and others for the great feedback and discussion at
LSF-MM.

thanks
-john


Volatile ranges provide a method for userland to inform the kernel
that a range of memory is safe to discard (ie: can be regenerated)
but userspace may want to try to access it in the future. It can be
thought of as similar to MADV_DONTNEED, but that the actual freeing
of the memory is delayed and only done under memory pressure, and the
user can try to cancel the action and be able to quickly access any
unpurged pages. The idea originated from Android's ashmem, but I've
since learned that other OSes provide similar functionality.

This functionality allows for a number of interesting uses. One such
example is: Userland caches that have kernel triggered eviction under
memory pressure. This allows for the kernel to "rightsize" userspace
caches for current system-wide workload. Things like image bitmap
caches, or rendered HTML in a hidden browser tab, where the data is
not visible and can be regenerated if needed, are good examples.

Both Chrome and Firefox already make use of volatile range-like
functionality via the ashmem interface:
https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34

https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc


The basic usage of volatile ranges is as follows:
1) Userland marks a range of memory that can be regenerated if
necessary as volatile
2) Before accessing the memory again, userland marks the memory as
nonvolatile, and the kernel will provide notification if any pages in
the range have been purged.

If userland accesses memory while it is volatile, it will either
get the value stored at that memory if there has been no memory
pressure or the application will get a SIGBUS if the page has been
purged.

Reads or writes to the memory do not affect the volatility state of the
pages.

You can read more about the history of volatile ranges here (~reverse
chronological order):
https://lwn.net/Articles/592042/
https://lwn.net/Articles/590991/
http://permalink.gmane.org/gmane.linux.kernel.mm/98848
http://permalink.gmane.org/gmane.linux.kernel.mm/98676
https://lwn.net/Articles/522135/
https://lwn.net/Kernel/Index/#Volatile_ranges


Continuing from the last few releases, this revision is reduced in
scope when compared to earlier attempts. I've only focused on handling
volatility on anonymous memory, and we're storing the volatility in
the VMA. This may have performance implications compared with the
earlier approach, but it does simplify the approach. I'm open to
expanding functionality via flags arguments, but for now I'm wanting
to keep focus on what the right default behavior should be and keep
the use cases restricted to help get reviewer interest.

Additionally, since we don't handle volatility on tmpfs files with this
version of the patch, it is not able to be used to implement semantics
similar to Android's ashmem. But since shared volatility on files is
more complex, my hope is to start small and hopefully grow from there.

Again, much of the logic in this patchset is based on Minchan's earlier
efforts, so I do want to make sure the credit goes to him for his major
contribution!


Cc: Andrew Morton
Cc: Android Kernel Team
Cc: Johannes Weiner
Cc: Robert Love
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Rik van Riel
Cc: Dmitry Adamushko
Cc: Neil Brown
Cc: Andrea Arcangeli
Cc: Mike Hommey
Cc: Taras Glek
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Michel Lespinasse
Cc: Minchan Kim
Cc: Keith Packard
Cc: linux...@kvack.org

John Stultz (4):
  swap: Cleanup how special swap file numbers are defined
  MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
  MADV_VOLATILE: Add purged page
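The usage pattern from the cover letter (mark volatile, mark
nonvolatile before reuse, check the purged notification, regenerate)
can be sketched with a userspace model. Nothing here calls the real
kernel interface: VolatileRange, PurgedAccess, memory_pressure and
get_cached are hypothetical names, the PurgedAccess exception merely
stands in for the SIGBUS a real access would raise, and the content of
a purged page after unmarking (None here) is an assumption.

```python
class PurgedAccess(Exception):
    """Stands in for SIGBUS on touching a purged page while volatile."""


class VolatileRange:
    def __init__(self, npages):
        self.pages = [None] * npages      # None means "contents lost"
        self.purged = [False] * npages
        self.volatile = False

    def _check(self, idx):
        if self.volatile and self.purged[idx]:
            raise PurgedAccess(idx)

    def read(self, idx):
        self._check(idx)
        return self.pages[idx]

    def write(self, idx, data):
        # Reads and writes never change the volatility state.
        self._check(idx)
        self.pages[idx] = data

    def mark_volatile(self):
        self.volatile = True

    def mark_nonvolatile(self):
        """Return True if any page was purged while volatile."""
        self.volatile = False
        was_purged = any(self.purged)
        self.purged = [False] * len(self.pages)
        return was_purged

    def memory_pressure(self, idx):
        # Model of the kernel purging one page; only volatile pages
        # are eligible.
        if self.volatile:
            self.pages[idx] = None
            self.purged[idx] = True


def get_cached(rng, idx, regenerate):
    """Cache lookup following steps 1) and 2) of the basic usage."""
    if rng.mark_nonvolatile():
        for i, page in enumerate(rng.pages):
            if page is None:
                rng.pages[i] = regenerate(i)
    value = rng.read(idx)
    rng.mark_volatile()
    return value
```

A caller that skips the unmark step and reads a purged page directly
gets PurgedAccess in this model, mirroring the SIGBUS case above.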