Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
On Fri, Feb 01, 2019 at 11:46:00AM +, Jean-Philippe Brucker wrote: > On 01/02/2019 03:51, Peter Xu wrote: > > On Thu, Jan 31, 2019 at 12:25:40PM +, Jean-Philippe Brucker wrote: > >> On 31/01/2019 07:59, Peter Xu wrote: > >>> On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote: > Hi Peter, > >>> > >>> Hi, Jean, > >>> > > On 30/01/2019 05:57, Peter Xu wrote: > > AMD IOMMU driver is using the clear_flush_young() to do cache flushing > > but that's actually already covered by invalidate_range(). Remove the > > extra notifier and the chunks. > > Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If > I understood correctly, when doing reclaim the kernel clears the young > bit from the PTE. This requires flushing secondary TLBs (ATCs), so that > new accesses from devices will go through the IOMMU, set the young bit > again and the kernel can accurately track page use. I didn't see > invalidate_range() being called by rmap or vmscan in this case, but > might just be missing it. > > Two MMU notifiers are used for the young bit, clear_flush_young() and > clear_young(). I believe the former should invalidate ATCs so that DMA > accesses participate in PTE aging. Otherwise the kernel can't know that > the device is still using entries marked 'old'. The latter, > clear_young() is exempted from invalidating the secondary TLBs by > mmu_notifier.h and the IOMMU driver doesn't need to implement it. > >>> > >>> Yes I agree most of your analysis, but IMHO the problem is that the > >>> AMD IOMMU is not really participating in the PTE aging after all. > >>> Please have a look at mn_clear_flush_young() below at [1] - it's > >>> always returning zero, no matter whether the page has been accessed by > >>> device or not. A real user of it could be someone like KVM (please > >>> see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the > >>> shadow PTEs and do test-and-clear on that, then return the result to > >>> the core mm. That's why I think currently the AMD driver was only > >>> trying to use that as a way to do an extra flush however IMHO it's > >>> redundant. > >> > >> Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the > >> flush, since the level-1 page table is shared with the CPU. But removing > >> the flush gets you in the following situation: > >> > >> (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE > >> young and the entry is cached in the ATC. > >> > >> (2) The CPU does ptep_clear_flush_young_notify(), clears young, > >> notices that the page is being used. > >> > >> (3) Device accesses $addr again. Given that we didn't invalidate the > >> ATC in (2) it accesses the page directly, without going through the IOMMU. > >> > >> (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't > >> have the young bit, which means the page isn't being used and can be > >> reclaimed. > >> > >> That's not accurate since the page is being used by the device. At step > >> (2) we should invalidate the ATC, so that (3) fetches the PTE again and > >> marks it young. > >> > >> I can agree that the clear_flush_young() notifier is too brutal for this > >> purpose, since we send spurious ATC invalidation even when the PTE > >> wasn't young (and ATC inv is expensive). There should be another MMU > >> notifier "flush_young()" that is called by > >> ptep_clear_flush_young_notify() only when the page was actually young. > >> But for the moment it's the best we have to avoid the situation above. > >> > >> I don't know enough about mm to understand exactly how the > >> page_referenced() information is used, but I believe the problem is only > >> about accuracy and not correctness. So applying this patch might not > >> break anything (after all, intel-svm.c never implemented the notifier) > >> but I think I'll keep the notifier in my SVA rework [1]. > > > > I see your point. I'm not an expert of mm either, but I'd say AFAIU > > this happens in CPU TLB too. Please have a look at > > ptep_clear_flush_young(): > > > > int ptep_clear_flush_young(struct vm_area_struct *vma, > >unsigned long address, pte_t *ptep) > > { > > /* > > * On x86 CPUs, clearing the accessed bit without a TLB flush > > * doesn't cause data corruption. [ It could cause incorrect > > * page aging and the (mistaken) reclaim of hot pages, but the > > * chance of that should be relatively low. ] > > * > > * So as a performance optimization don't flush the TLB when > > * clearing the accessed bit, it will eventually be flushed by > > * a context switch or a VM operation anyway. [ In the rare > > * event of it not getting flushed for a long time the delay > > * shouldn't really matter because there's no real memory > > * pressure for swapout to react to. ] > > */ > > retur
Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
On 01/02/2019 03:51, Peter Xu wrote: > On Thu, Jan 31, 2019 at 12:25:40PM +, Jean-Philippe Brucker wrote: >> On 31/01/2019 07:59, Peter Xu wrote: >>> On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote: Hi Peter, >>> >>> Hi, Jean, >>> On 30/01/2019 05:57, Peter Xu wrote: > AMD IOMMU driver is using the clear_flush_young() to do cache flushing > but that's actually already covered by invalidate_range(). Remove the > extra notifier and the chunks. Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If I understood correctly, when doing reclaim the kernel clears the young bit from the PTE. This requires flushing secondary TLBs (ATCs), so that new accesses from devices will go through the IOMMU, set the young bit again and the kernel can accurately track page use. I didn't see invalidate_range() being called by rmap or vmscan in this case, but might just be missing it. Two MMU notifiers are used for the young bit, clear_flush_young() and clear_young(). I believe the former should invalidate ATCs so that DMA accesses participate in PTE aging. Otherwise the kernel can't know that the device is still using entries marked 'old'. The latter, clear_young() is exempted from invalidating the secondary TLBs by mmu_notifier.h and the IOMMU driver doesn't need to implement it. >>> >>> Yes I agree most of your analysis, but IMHO the problem is that the >>> AMD IOMMU is not really participating in the PTE aging after all. >>> Please have a look at mn_clear_flush_young() below at [1] - it's >>> always returning zero, no matter whether the page has been accessed by >>> device or not. A real user of it could be someone like KVM (please >>> see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the >>> shadow PTEs and do test-and-clear on that, then return the result to >>> the core mm. That's why I think currently the AMD driver was only >>> trying to use that as a way to do an extra flush however IMHO it's >>> redundant. >> >> Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the >> flush, since the level-1 page table is shared with the CPU. But removing >> the flush gets you in the following situation: >> >> (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE >> young and the entry is cached in the ATC. >> >> (2) The CPU does ptep_clear_flush_young_notify(), clears young, >> notices that the page is being used. >> >> (3) Device accesses $addr again. Given that we didn't invalidate the >> ATC in (2) it accesses the page directly, without going through the IOMMU. >> >> (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't >> have the young bit, which means the page isn't being used and can be >> reclaimed. >> >> That's not accurate since the page is being used by the device. At step >> (2) we should invalidate the ATC, so that (3) fetches the PTE again and >> marks it young. >> >> I can agree that the clear_flush_young() notifier is too brutal for this >> purpose, since we send spurious ATC invalidation even when the PTE >> wasn't young (and ATC inv is expensive). There should be another MMU >> notifier "flush_young()" that is called by >> ptep_clear_flush_young_notify() only when the page was actually young. >> But for the moment it's the best we have to avoid the situation above. >> >> I don't know enough about mm to understand exactly how the >> page_referenced() information is used, but I believe the problem is only >> about accuracy and not correctness. So applying this patch might not >> break anything (after all, intel-svm.c never implemented the notifier) >> but I think I'll keep the notifier in my SVA rework [1]. > > I see your point. I'm not an expert of mm either, but I'd say AFAIU > this happens in CPU TLB too. Please have a look at > ptep_clear_flush_young(): > > int ptep_clear_flush_young(struct vm_area_struct *vma, > unsigned long address, pte_t *ptep) > { > /* >* On x86 CPUs, clearing the accessed bit without a TLB flush >* doesn't cause data corruption. [ It could cause incorrect >* page aging and the (mistaken) reclaim of hot pages, but the >* chance of that should be relatively low. ] >* >* So as a performance optimization don't flush the TLB when >* clearing the accessed bit, it will eventually be flushed by >* a context switch or a VM operation anyway. [ In the rare >* event of it not getting flushed for a long time the delay >* shouldn't really matter because there's no real memory >* pressure for swapout to react to. ] >*/ > return ptep_test_and_clear_young(vma, address, ptep); > } Aha I see. The arm64 version of ptep_clear_flush_young() does invalidate the TLB if the PTE was young (perhaps because we don't invalidate the TLB on context switch). For SVA I
Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
On Thu, Jan 31, 2019 at 12:25:40PM +, Jean-Philippe Brucker wrote: > On 31/01/2019 07:59, Peter Xu wrote: > > On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote: > >> Hi Peter, > > > > Hi, Jean, > > > >> > >> On 30/01/2019 05:57, Peter Xu wrote: > >>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing > >>> but that's actually already covered by invalidate_range(). Remove the > >>> extra notifier and the chunks. > >> > >> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If > >> I understood correctly, when doing reclaim the kernel clears the young > >> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that > >> new accesses from devices will go through the IOMMU, set the young bit > >> again and the kernel can accurately track page use. I didn't see > >> invalidate_range() being called by rmap or vmscan in this case, but > >> might just be missing it. > >> > >> Two MMU notifiers are used for the young bit, clear_flush_young() and > >> clear_young(). I believe the former should invalidate ATCs so that DMA > >> accesses participate in PTE aging. Otherwise the kernel can't know that > >> the device is still using entries marked 'old'. The latter, > >> clear_young() is exempted from invalidating the secondary TLBs by > >> mmu_notifier.h and the IOMMU driver doesn't need to implement it. > > > > Yes I agree most of your analysis, but IMHO the problem is that the > > AMD IOMMU is not really participating in the PTE aging after all. > > Please have a look at mn_clear_flush_young() below at [1] - it's > > always returning zero, no matter whether the page has been accessed by > > device or not. A real user of it could be someone like KVM (please > > see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the > > shadow PTEs and do test-and-clear on that, then return the result to > > the core mm. That's why I think currently the AMD driver was only > > trying to use that as a way to do an extra flush however IMHO it's > > redundant. > > Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the > flush, since the level-1 page table is shared with the CPU. But removing > the flush gets you in the following situation: > > (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE > young and the entry is cached in the ATC. > > (2) The CPU does ptep_clear_flush_young_notify(), clears young, > notices that the page is being used. > > (3) Device accesses $addr again. Given that we didn't invalidate the > ATC in (2) it accesses the page directly, without going through the IOMMU. > > (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't > have the young bit, which means the page isn't being used and can be > reclaimed. > > That's not accurate since the page is being used by the device. At step > (2) we should invalidate the ATC, so that (3) fetches the PTE again and > marks it young. > > I can agree that the clear_flush_young() notifier is too brutal for this > purpose, since we send spurious ATC invalidation even when the PTE > wasn't young (and ATC inv is expensive). There should be another MMU > notifier "flush_young()" that is called by > ptep_clear_flush_young_notify() only when the page was actually young. > But for the moment it's the best we have to avoid the situation above. > > I don't know enough about mm to understand exactly how the > page_referenced() information is used, but I believe the problem is only > about accuracy and not correctness. So applying this patch might not > break anything (after all, intel-svm.c never implemented the notifier) > but I think I'll keep the notifier in my SVA rework [1]. I see your point. I'm not an expert of mm either, but I'd say AFAIU this happens in CPU TLB too. Please have a look at ptep_clear_flush_young(): int ptep_clear_flush_young(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) { /* * On x86 CPUs, clearing the accessed bit without a TLB flush * doesn't cause data corruption. [ It could cause incorrect * page aging and the (mistaken) reclaim of hot pages, but the * chance of that should be relatively low. ] * * So as a performance optimization don't flush the TLB when * clearing the accessed bit, it will eventually be flushed by * a context switch or a VM operation anyway. [ In the rare * event of it not getting flushed for a long time the delay * shouldn't really matter because there's no real memory * pressure for swapout to react to. ] */ return ptep_test_and_clear_young(vma, address, ptep); } So maybe it is a tradeoff between performance and accuracy. IMHO the IOMMU cache flushing might affect the performance even more than CPU TLB flushing if the invalidation command takes a long time to run (e.g., amd_iommu_flush_page is far slower than a TL
Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
On 31/01/2019 07:59, Peter Xu wrote: > On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote: >> Hi Peter, > > Hi, Jean, > >> >> On 30/01/2019 05:57, Peter Xu wrote: >>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing >>> but that's actually already covered by invalidate_range(). Remove the >>> extra notifier and the chunks. >> >> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If >> I understood correctly, when doing reclaim the kernel clears the young >> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that >> new accesses from devices will go through the IOMMU, set the young bit >> again and the kernel can accurately track page use. I didn't see >> invalidate_range() being called by rmap or vmscan in this case, but >> might just be missing it. >> >> Two MMU notifiers are used for the young bit, clear_flush_young() and >> clear_young(). I believe the former should invalidate ATCs so that DMA >> accesses participate in PTE aging. Otherwise the kernel can't know that >> the device is still using entries marked 'old'. The latter, >> clear_young() is exempted from invalidating the secondary TLBs by >> mmu_notifier.h and the IOMMU driver doesn't need to implement it. > > Yes I agree most of your analysis, but IMHO the problem is that the > AMD IOMMU is not really participating in the PTE aging after all. > Please have a look at mn_clear_flush_young() below at [1] - it's > always returning zero, no matter whether the page has been accessed by > device or not. A real user of it could be someone like KVM (please > see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the > shadow PTEs and do test-and-clear on that, then return the result to > the core mm. That's why I think currently the AMD driver was only > trying to use that as a way to do an extra flush however IMHO it's > redundant. Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the flush, since the level-1 page table is shared with the CPU. But removing the flush gets you in the following situation: (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE young and the entry is cached in the ATC. (2) The CPU does ptep_clear_flush_young_notify(), clears young, notices that the page is being used. (3) Device accesses $addr again. Given that we didn't invalidate the ATC in (2) it accesses the page directly, without going through the IOMMU. (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't have the young bit, which means the page isn't being used and can be reclaimed. That's not accurate since the page is being used by the device. At step (2) we should invalidate the ATC, so that (3) fetches the PTE again and marks it young. I can agree that the clear_flush_young() notifier is too brutal for this purpose, since we send spurious ATC invalidation even when the PTE wasn't young (and ATC inv is expensive). There should be another MMU notifier "flush_young()" that is called by ptep_clear_flush_young_notify() only when the page was actually young. But for the moment it's the best we have to avoid the situation above. I don't know enough about mm to understand exactly how the page_referenced() information is used, but I believe the problem is only about accuracy and not correctness. So applying this patch might not break anything (after all, intel-svm.c never implemented the notifier) but I think I'll keep the notifier in my SVA rework [1]. Thanks, Jean [1] https://www.spinics.net/lists/iommu/msg30081.html ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote: > Hi Peter, Hi, Jean, > > On 30/01/2019 05:57, Peter Xu wrote: > > AMD IOMMU driver is using the clear_flush_young() to do cache flushing > > but that's actually already covered by invalidate_range(). Remove the > > extra notifier and the chunks. > > Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If > I understood correctly, when doing reclaim the kernel clears the young > bit from the PTE. This requires flushing secondary TLBs (ATCs), so that > new accesses from devices will go through the IOMMU, set the young bit > again and the kernel can accurately track page use. I didn't see > invalidate_range() being called by rmap or vmscan in this case, but > might just be missing it. > > Two MMU notifiers are used for the young bit, clear_flush_young() and > clear_young(). I believe the former should invalidate ATCs so that DMA > accesses participate in PTE aging. Otherwise the kernel can't know that > the device is still using entries marked 'old'. The latter, > clear_young() is exempted from invalidating the secondary TLBs by > mmu_notifier.h and the IOMMU driver doesn't need to implement it. Yes I agree most of your analysis, but IMHO the problem is that the AMD IOMMU is not really participating in the PTE aging after all. Please have a look at mn_clear_flush_young() below at [1] - it's always returning zero, no matter whether the page has been accessed by device or not. A real user of it could be someone like KVM (please see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the shadow PTEs and do test-and-clear on that, then return the result to the core mm. That's why I think currently the AMD driver was only trying to use that as a way to do an extra flush however IMHO it's redundant. Thanks, > > -static int mn_clear_flush_young(struct mmu_notifier *mn, > > - struct mm_struct *mm, > > - unsigned long start, > > - unsigned long end) > > -{ > > - for (; start < end; start += PAGE_SIZE) > > - __mn_flush_page(mn, start); > > - > > - return 0; [1] -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
On Wed, Jan 30, 2019 at 01:57:58PM +0800, Peter Xu wrote: > AMD IOMMU driver is using the clear_flush_young() to do cache flushing > but that's actually already covered by invalidate_range(). Remove the > extra notifier and the chunks. > > Signed-off-by: Peter Xu Applied to x86/amd, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
Hi Peter, On 30/01/2019 05:57, Peter Xu wrote: > AMD IOMMU driver is using the clear_flush_young() to do cache flushing > but that's actually already covered by invalidate_range(). Remove the > extra notifier and the chunks. Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If I understood correctly, when doing reclaim the kernel clears the young bit from the PTE. This requires flushing secondary TLBs (ATCs), so that new accesses from devices will go through the IOMMU, set the young bit again and the kernel can accurately track page use. I didn't see invalidate_range() being called by rmap or vmscan in this case, but might just be missing it. Two MMU notifiers are used for the young bit, clear_flush_young() and clear_young(). I believe the former should invalidate ATCs so that DMA accesses participate in PTE aging. Otherwise the kernel can't know that the device is still using entries marked 'old'. The latter, clear_young() is exempted from invalidating the secondary TLBs by mmu_notifier.h and the IOMMU driver doesn't need to implement it. Thanks, Jean > > Signed-off-by: Peter Xu > --- > drivers/iommu/amd_iommu_v2.c | 24 > 1 file changed, 24 deletions(-) > > diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c > index 23dae9348ace..5d7ef750e4a0 100644 > --- a/drivers/iommu/amd_iommu_v2.c > +++ b/drivers/iommu/amd_iommu_v2.c > @@ -370,29 +370,6 @@ static struct pasid_state *mn_to_state(struct > mmu_notifier *mn) > return container_of(mn, struct pasid_state, mn); > } > > -static void __mn_flush_page(struct mmu_notifier *mn, > - unsigned long address) > -{ > - struct pasid_state *pasid_state; > - struct device_state *dev_state; > - > - pasid_state = mn_to_state(mn); > - dev_state = pasid_state->device_state; > - > - amd_iommu_flush_page(dev_state->domain, pasid_state->pasid, address); > -} > - > -static int mn_clear_flush_young(struct mmu_notifier *mn, > - struct mm_struct *mm, > - unsigned long start, > - unsigned long end) > -{ > - for (; start < end; start += PAGE_SIZE) > - __mn_flush_page(mn, start); > - > - return 0; > -} > - > static void mn_invalidate_range(struct mmu_notifier *mn, > struct mm_struct *mm, > unsigned long start, unsigned long end) > @@ -430,7 +407,6 @@ static void mn_release(struct mmu_notifier *mn, struct > mm_struct *mm) > > static const struct mmu_notifier_ops iommu_mn = { > .release= mn_release, > - .clear_flush_young = mn_clear_flush_young, > .invalidate_range = mn_invalidate_range, > }; > > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
AMD IOMMU driver is using the clear_flush_young() to do cache flushing but that's actually already covered by invalidate_range(). Remove the extra notifier and the chunks. Signed-off-by: Peter Xu --- drivers/iommu/amd_iommu_v2.c | 24 1 file changed, 24 deletions(-) diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c index 23dae9348ace..5d7ef750e4a0 100644 --- a/drivers/iommu/amd_iommu_v2.c +++ b/drivers/iommu/amd_iommu_v2.c @@ -370,29 +370,6 @@ static struct pasid_state *mn_to_state(struct mmu_notifier *mn) return container_of(mn, struct pasid_state, mn); } -static void __mn_flush_page(struct mmu_notifier *mn, - unsigned long address) -{ - struct pasid_state *pasid_state; - struct device_state *dev_state; - - pasid_state = mn_to_state(mn); - dev_state = pasid_state->device_state; - - amd_iommu_flush_page(dev_state->domain, pasid_state->pasid, address); -} - -static int mn_clear_flush_young(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end) -{ - for (; start < end; start += PAGE_SIZE) - __mn_flush_page(mn, start); - - return 0; -} - static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, unsigned long end) @@ -430,7 +407,6 @@ static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm) static const struct mmu_notifier_ops iommu_mn = { .release= mn_release, - .clear_flush_young = mn_clear_flush_young, .invalidate_range = mn_invalidate_range, }; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu