Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
> On May 3, 2020, at 2:39 PM, Joerg Roedel wrote: > > Can I add your Tested-by when I > send them to the mailing list tomorrow? Sure. Feel free to add, Tested-by: Qian Cai ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
Hi Qian, On Sun, May 03, 2020 at 09:04:03AM -0400, Qian Cai wrote: > > On Apr 29, 2020, at 7:20 AM, Joerg Roedel wrote: > > Can you please test this branch: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes > > > > It has the previous fix in it and a couple more to make sure the > > device-table is updated and flushed before increase_address_space() > > updates domain->pt_root. > > I believe this closed the existing races as it had survived for many > days. Great work! Thanks a lot for testing these changes! Can I add your Tested-by when I send them to the mailing list tomorrow? Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
> On Apr 29, 2020, at 7:20 AM, Joerg Roedel wrote: > > On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote: >> No dice. There could be some other races. For example, > > Can you please test this branch: > > > https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes > > It has the previous fix in it and a couple more to make sure the > device-table is updated and flushed before increase_address_space() > updates domain->pt_root. I believe this closed the existing races as it had survived for many days. Great work! ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
> On Apr 29, 2020, at 7:20 AM, Joerg Roedel wrote: > > On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote: >> No dice. There could be some other races. For example, > > Can you please test this branch: > > > https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes > > It has the previous fix in it and a couple more to make sure the > device-table is updated and flushed before increase_address_space() > updates domain->pt_root. It looks promising as it survives for a day’s stress testing. I’ll keep it running for a few days to be sure. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote: > No dice. There could be some other races. For example, Can you please test this branch: https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes It has the previous fix in it and a couple more to make sure the device-table is updated and flushed before increase_address_space() updates domain->pt_root. Thanks, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
Hi Qian, On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote: > > No dice. There could be some other races. For example, Okay, I think I know what is happening. The increase_address_space() function increases the address space, but does not update the DTE and does not flush the old DTE from the caches. But this needs to happen before domain->pt_root is updated, because otherwise another CPU can come along and map something into the increased address-space which is not yet accessible by the device because the DTE is not updated yet. Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
> On Apr 18, 2020, at 2:34 PM, Joerg Roedel wrote: > > On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote: >> Hard to tell without testing further. I’ll leave that optimization in >> the future, and focus on fixing those races first. > > Yeah right, we should fix the existing races first before introducing > new ones ;) > > Btw, THANKS A LOT for tracking down all these race condition bugs, I am > not even remotely able to trigger them with the hardware I have around. > > I did some hacking and the attached diff shows how I think this race > condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1, > but did no further testing. Can you test it please? No dice. There could be some other races. For example, > @@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain > *domain, > unsigned long address, > unsigned long *page_size) ... > amd_iommu_domain_get_pgtable(domain, ); > > if (address > PM_LEVEL_SIZE(pgtable.mode)) > return NULL; > > level = pgtable.mode - 1; > pte= [PM_LEVEL_INDEX(level, address)]; <— increase_address_space() > *page_size = PTE_LEVEL_PAGE_SIZE(level); > while (level > 0) { *page_size = PTE_LEVEL_PAGE_SIZE(level); Then in iommu_unmap_page(), while (unmapped < page_size) { pte = fetch_pte(dom, bus_addr, _size); … bus_addr = (bus_addr & ~(unmap_size - 1)) + unmap_size; bus_addr would be unsync with dom->mode when it enter fetch_pte() again. Could that be a problem? [ 5159.274829][ T4057] LTP: starting oom02 [ 5160.382787][ C52] perf: interrupt took too long (7443 > 6208), lowering kernel.perf_event_max_sample_rate to 26800 [ 5167.951785][ T812] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffc flags=0x0010] [ 5167.964540][ T812] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffc1000 flags=0x0010] [ 5167.977442][ T812] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffc1900 flags=0x0010] [ 5167.989901][ T812] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffc1d00 flags=0x0010] [ 5168.002291][ T812] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffc2000 flags=0x0010] [ 5168.014665][ T812] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffc2400 flags=0x0010] [ 5168.027132][ T812] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffc2800 flags=0x0010] [ 5168.039566][ T812] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffc2c00 flags=0x0010] [ 5168.051956][ T812] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffc3000 flags=0x0010] [ 5168.064310][ T812] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffc3400 flags=0x0010] [ 5168.076652][ T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xfffc3800 flags=0x0010] [ 5168.088290][ T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xfffc3c00 flags=0x0010] [ 5183.695829][C8] smartpqi :23:00.0: controller is offline: status code 0x14803 [ 5183.704390][C8] smartpqi :23:00.0: controller offline [ 5183.756594][ C101] blk_update_request: I/O error, dev sda, sector 22306304 op 0x1:(WRITE) flags 0x800 phys_seg 4 prio class 0 [ 5183.756628][ C34] sd 0:1:0:0: [sda] tag#655 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.756759][ C56] blk_update_request: I/O error, dev sda, sector 58480128 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.756810][ C79] sd 0:1:0:0: [sda] tag#234 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.756816][ C121] sd 0:1:0:0: [sda] tag#104 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.756837][ T53] sd 0:1:0:0: [sda] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.756882][ C121] sd 0:1:0:0: [sda] tag#104 CDB: opcode=0x2a 2a 00 00 4d d4 00 00 02 00 00 [ 5183.756892][ C79] sd 0:1:0:0: [sda] tag#234 CDB: opcode=0x2a 2a 00 02 03 e4 00 00 02 00 00 [ 5183.756909][ C121] blk_update_request: I/O error, dev sda, sector 5100544 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.756920][ C79] blk_update_request: I/O error, dev sda, sector 33809408 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.756939][ T53] sd 0:1:0:0: [sda] tag#4 CDB: opcode=0x2a 2a 00 02 4b f8 00 00 02 00 00 [ 5183.756967][ T53] blk_update_request: I/O error, dev sda, sector 38533120 op 0x1:(WRITE) flags 0x8004000
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
> On Apr 18, 2020, at 2:34 PM, Joerg Roedel wrote: > > On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote: >> Hard to tell without testing further. I’ll leave that optimization in >> the future, and focus on fixing those races first. > > Yeah right, we should fix the existing races first before introducing > new ones ;) > > Btw, THANKS A LOT for tracking down all these race condition bugs, I am > not even remotely able to trigger them with the hardware I have around. > > I did some hacking and the attached diff shows how I think this race > condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1, > but did no further testing. Can you test it please? Sure, give it a few days to see if it could survive. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote: > Hard to tell without testing further. I’ll leave that optimization in > the future, and focus on fixing those races first. Yeah right, we should fix the existing races first before introducing new ones ;) Btw, THANKS A LOT for tracking down all these race condition bugs, I am not even remotely able to trigger them with the hardware I have around. I did some hacking and the attached diff shows how I think this race condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1, but did no further testing. Can you test it please? diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 20cce366e951..28229a38af4d 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -151,6 +151,26 @@ static struct protection_domain *to_pdomain(struct iommu_domain *dom) return container_of(dom, struct protection_domain, domain); } +static void amd_iommu_domain_get_pgtable(struct protection_domain *domain, +struct domain_pgtable *pgtable) +{ + u64 pt_root = atomic64_read(>pt_root); + + pgtable->root = (u64 *)(pt_root & PAGE_MASK); + pgtable->mode = pt_root & 7; /* lowest 3 bits encode pgtable mode */ +} + +static u64 amd_iommu_domain_encode_pgtable(u64 *root, int mode) +{ + u64 pt_root; + + /* lowest 3 bits encode pgtable mode */ + pt_root = mode & 7; + pt_root |= (u64)root; + + return pt_root; +} + static struct iommu_dev_data *alloc_dev_data(u16 devid) { struct iommu_dev_data *dev_data; @@ -1397,13 +1417,18 @@ static struct page *free_sub_pt(unsigned long root, int mode, static void free_pagetable(struct protection_domain *domain) { - unsigned long root = (unsigned long)domain->pt_root; + struct domain_pgtable pgtable; struct page *freelist = NULL; + unsigned long root; + + amd_iommu_domain_get_pgtable(domain, ); + atomic64_set(>pt_root, 0); - BUG_ON(domain->mode < PAGE_MODE_NONE || - domain->mode > PAGE_MODE_6_LEVEL); + BUG_ON(pgtable.mode < PAGE_MODE_NONE || + pgtable.mode > PAGE_MODE_6_LEVEL); - freelist = free_sub_pt(root, domain->mode, freelist); + root = (unsigned long)pgtable.root; + freelist = free_sub_pt(root, pgtable.mode, freelist); free_page_list(freelist); } @@ -1417,24 +1442,28 @@ static bool increase_address_space(struct protection_domain *domain, unsigned long address, gfp_t gfp) { + struct domain_pgtable pgtable; unsigned long flags; bool ret = false; - u64 *pte; + u64 *pte, root; spin_lock_irqsave(>lock, flags); - if (address <= PM_LEVEL_SIZE(domain->mode) || - WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL)) + amd_iommu_domain_get_pgtable(domain, ); + + if (address <= PM_LEVEL_SIZE(pgtable.mode) || + WARN_ON_ONCE(pgtable.mode == PAGE_MODE_6_LEVEL)) goto out; pte = (void *)get_zeroed_page(gfp); if (!pte) goto out; - *pte = PM_LEVEL_PDE(domain->mode, - iommu_virt_to_phys(domain->pt_root)); - domain->pt_root = pte; - domain->mode+= 1; + *pte = PM_LEVEL_PDE(pgtable.mode, iommu_virt_to_phys(pgtable.root)); + + root = amd_iommu_domain_encode_pgtable(pte, pgtable.mode + 1); + + atomic64_set(>pt_root, root); ret = true; @@ -1451,16 +1480,22 @@ static u64 *alloc_pte(struct protection_domain *domain, gfp_t gfp, bool *updated) { + struct domain_pgtable pgtable; int level, end_lvl; u64 *pte, *page; BUG_ON(!is_power_of_2(page_size)); - while (address > PM_LEVEL_SIZE(domain->mode)) + amd_iommu_domain_get_pgtable(domain, ); + + while (address > PM_LEVEL_SIZE(pgtable.mode)) { *updated = increase_address_space(domain, address, gfp) || *updated; + amd_iommu_domain_get_pgtable(domain, ); + } + - level = domain->mode - 1; - pte = >pt_root[PM_LEVEL_INDEX(level, address)]; + level = pgtable.mode - 1; + pte = [PM_LEVEL_INDEX(level, address)]; address = PAGE_SIZE_ALIGN(address, page_size); end_lvl = PAGE_SIZE_LEVEL(page_size); @@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain *domain, unsigned long address, unsigned long *page_size) { + struct domain_pgtable pgtable; int level; u64 *pte; *page_size = 0; - if (address > PM_LEVEL_SIZE(domain->mode)) + amd_iommu_domain_get_pgtable(domain, ); + + if (address > PM_LEVEL_SIZE(pgtable.mode)) return NULL; - level =
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
> On Apr 18, 2020, at 8:10 AM, Joerg Roedel wrote: > > Yes, your patch still looks racy. You need to atomically read > domain->pt_root to a stack variable and derive the pt_root pointer and > the mode from that variable instead of domain->pt_root directly. If you > read the domain->pt_root twice there could still be an update between > the two reads. > Probably the lock in increase_address_space() can also be avoided if > pt_root is updated using cmpxchg()? Hard to tell without testing further. I’ll leave that optimization in the future, and focus on fixing those races first. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
On Thu, Apr 16, 2020 at 09:42:41PM -0400, Qian Cai wrote: > So, this is still not enough that would still trigger storage driver offline > under > memory pressure for a bit longer. It looks to me that in fetch_pte() there are > could still racy? Yes, your patch still looks racy. You need to atomically read domain->pt_root to a stack variable and derive the pt_root pointer and the mode from that variable instead of domain->pt_root directly. If you read the domain->pt_root twice there could still be an update between the two reads. Probably the lock in increase_address_space() can also be avoided if pt_root is updated using cmpxchg()? Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
> On Apr 13, 2020, at 9:36 PM, Qian Cai wrote: > > > >> On Apr 8, 2020, at 10:19 AM, Joerg Roedel wrote: >> >> Hi Qian, >> >> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote: >>> After further testing, the change along is insufficient. What I am chasing >>> right >>> now is the swap device will go offline after heavy memory pressure below. >>> The >>> symptom is similar to what we have in the commit, >>> >>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”) >>> >>> Apparently, it is no possible to take the domain->lock in fetch_pte() >>> because it >>> could sleep. >> >> Thanks a lot for finding and tracking down another race in the AMD IOMMU >> page-table code. The domain->lock is a spin-lock and taking it can't >> sleep. But fetch_pte() is a fast-path and must not take any locks. >> >> I think the best fix is to update the pt_root and mode of the domain >> atomically by storing the mode in the lower 12 bits of pt_root. This way >> they are stored together and can be read/write atomically. > > Like this? So, this is still not enough that would still trigger storage driver offline under memory pressure for a bit longer. It looks to me that in fetch_pte() there are could still racy? level = domain->mode - 1; pte= >pt_root[PM_LEVEL_INDEX(level, address)]; <— increase_address_space(); *page_size = PTE_LEVEL_PAGE_SIZE(level); while (level > 0) { *page_size = PTE_LEVEL_PAGE_SIZE(level); Then in iommu_unmap_page(), while (unmapped < page_size) { pte = fetch_pte(dom, bus_addr, _size); … bus_addr = (bus_addr & ~(unmap_size - 1)) + unmap_size; bus_addr would be unsync with dom->mode when it enter fetch_pte() again. Could that be a problem? > > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c > index 20cce366e951..b36c6b07cbfd 100644 > --- a/drivers/iommu/amd_iommu.c > +++ b/drivers/iommu/amd_iommu.c > @@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, > int mode, > > static void free_pagetable(struct protection_domain *domain) > { > - unsigned long root = (unsigned long)domain->pt_root; > + int level = iommu_get_mode(domain->pt_root); > + unsigned long root = iommu_get_root(domain->pt_root); > struct page *freelist = NULL; > > - BUG_ON(domain->mode < PAGE_MODE_NONE || > -domain->mode > PAGE_MODE_6_LEVEL); > + BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL); > > - freelist = free_sub_pt(root, domain->mode, freelist); > + freelist = free_sub_pt(root, level, freelist); > > free_page_list(freelist); > } > @@ -1417,24 +1417,27 @@ static bool increase_address_space(struct > protection_domain *domain, > unsigned long address, > gfp_t gfp) > { > + int level; > unsigned long flags; > bool ret = false; > u64 *pte; > > spin_lock_irqsave(>lock, flags); > > - if (address <= PM_LEVEL_SIZE(domain->mode) || > - WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL)) > + level = iommu_get_mode(domain->pt_root); > + > + if (address <= PM_LEVEL_SIZE(level) || > + WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL)) > goto out; > > pte = (void *)get_zeroed_page(gfp); > if (!pte) > goto out; > > - *pte = PM_LEVEL_PDE(domain->mode, > - iommu_virt_to_phys(domain->pt_root)); > - domain->pt_root = pte; > - domain->mode+= 1; > + *pte = PM_LEVEL_PDE(level, > + iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root))); > + > + WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1); > > ret = true; > > @@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain > *domain, > bool *updated) > { > int level, end_lvl; > - u64 *pte, *page; > + u64 *pte, *page, *pt_root, *root; > > BUG_ON(!is_power_of_2(page_size)); > > - while (address > PM_LEVEL_SIZE(domain->mode)) > + while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root))) > *updated = increase_address_space(domain, address, gfp) || > *updated; > > - level = domain->mode - 1; > - pte = >pt_root[PM_LEVEL_INDEX(level, address)]; > + pt_root = READ_ONCE(domain->pt_root); > + root= (void *)iommu_get_root(pt_root); > + level = iommu_get_mode(pt_root) - 1; > + pte = [PM_LEVEL_INDEX(level, address)]; > address = PAGE_SIZE_ALIGN(address, page_size); > end_lvl = PAGE_SIZE_LEVEL(page_size); > > @@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain > *domain, > unsigned long address, > unsigned long *page_size) > { > -
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
> On Apr 8, 2020, at 10:19 AM, Joerg Roedel wrote: > > Hi Qian, > > On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote: >> After further testing, the change along is insufficient. What I am chasing >> right >> now is the swap device will go offline after heavy memory pressure below. The >> symptom is similar to what we have in the commit, >> >> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”) >> >> Apparently, it is no possible to take the domain->lock in fetch_pte() >> because it >> could sleep. > > Thanks a lot for finding and tracking down another race in the AMD IOMMU > page-table code. The domain->lock is a spin-lock and taking it can't > sleep. But fetch_pte() is a fast-path and must not take any locks. > > I think the best fix is to update the pt_root and mode of the domain > atomically by storing the mode in the lower 12 bits of pt_root. This way > they are stored together and can be read/write atomically. Like this? diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 20cce366e951..b36c6b07cbfd 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int mode, static void free_pagetable(struct protection_domain *domain) { - unsigned long root = (unsigned long)domain->pt_root; + int level = iommu_get_mode(domain->pt_root); + unsigned long root = iommu_get_root(domain->pt_root); struct page *freelist = NULL; - BUG_ON(domain->mode < PAGE_MODE_NONE || - domain->mode > PAGE_MODE_6_LEVEL); + BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL); - freelist = free_sub_pt(root, domain->mode, freelist); + freelist = free_sub_pt(root, level, freelist); free_page_list(freelist); } @@ -1417,24 +1417,27 @@ static bool increase_address_space(struct protection_domain *domain, unsigned long address, gfp_t gfp) { + int level; unsigned long flags; bool ret = false; u64 *pte; spin_lock_irqsave(>lock, flags); - if (address <= PM_LEVEL_SIZE(domain->mode) || - WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL)) + level = iommu_get_mode(domain->pt_root); + + if (address <= PM_LEVEL_SIZE(level) || + WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL)) goto out; pte = (void *)get_zeroed_page(gfp); if (!pte) goto out; - *pte = PM_LEVEL_PDE(domain->mode, - iommu_virt_to_phys(domain->pt_root)); - domain->pt_root = pte; - domain->mode+= 1; + *pte = PM_LEVEL_PDE(level, + iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root))); + + WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1); ret = true; @@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain, bool *updated) { int level, end_lvl; - u64 *pte, *page; + u64 *pte, *page, *pt_root, *root; BUG_ON(!is_power_of_2(page_size)); - while (address > PM_LEVEL_SIZE(domain->mode)) + while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root))) *updated = increase_address_space(domain, address, gfp) || *updated; - level = domain->mode - 1; - pte = >pt_root[PM_LEVEL_INDEX(level, address)]; + pt_root = READ_ONCE(domain->pt_root); + root= (void *)iommu_get_root(pt_root); + level = iommu_get_mode(pt_root) - 1; + pte = [PM_LEVEL_INDEX(level, address)]; address = PAGE_SIZE_ALIGN(address, page_size); end_lvl = PAGE_SIZE_LEVEL(page_size); @@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain, unsigned long address, unsigned long *page_size) { - int level; u64 *pte; + u64 *pt_root = READ_ONCE(domain->pt_root); + u64 *root= (void *)iommu_get_root(pt_root); + int level= iommu_get_mode(pt_root); *page_size = 0; - if (address > PM_LEVEL_SIZE(domain->mode)) + if (address > PM_LEVEL_SIZE(level)) return NULL; - level = domain->mode - 1; - pte= >pt_root[PM_LEVEL_INDEX(level, address)]; + level--; + pte= [PM_LEVEL_INDEX(level, address)]; *page_size = PTE_LEVEL_PAGE_SIZE(level); while (level > 0) { @@ -1814,12 +1821,13 @@ static struct protection_domain *dma_ops_domain_alloc(void) if (protection_domain_init(domain)) goto free_domain; - domain->mode = PAGE_MODE_3_LEVEL; domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL); domain->flags = PD_DMA_OPS_MASK; if (!domain->pt_root) goto free_domain; +
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
Hi Qian, On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote: > After further testing, the change along is insufficient. What I am chasing > right > now is the swap device will go offline after heavy memory pressure below. The > symptom is similar to what we have in the commit, > > 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”) > > Apparently, it is no possible to take the domain->lock in fetch_pte() because > it > could sleep. Thanks a lot for finding and tracking down another race in the AMD IOMMU page-table code. The domain->lock is a spin-lock and taking it can't sleep. But fetch_pte() is a fast-path and must not take any locks. I think the best fix is to update the pt_root and mode of the domain atomically by storing the mode in the lower 12 bits of pt_root. This way they are stored together and can be read/write atomically. Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
> On Apr 6, 2020, at 10:12 PM, Qian Cai wrote: > > fetch_pte() could race with increase_address_space() because it held no > lock from iommu_unmap_page(). On the CPU that runs fetch_pte() it could > see a stale domain->pt_root and a new increased domain->mode from > increase_address_space(). As the result, it could trigger invalid > accesses later on. Fix it by using a pair of smp_[w|r]mb in those > places. After further testing, the change along is insufficient. What I am chasing right now is the swap device will go offline after heavy memory pressure below. The symptom is similar to what we have in the commit, 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”) Apparently, it is no possible to take the domain->lock in fetch_pte() because it could sleep. Thoughts? [ 7292.727377][ T814] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffd8 flags=0x0010] [ 7292.740571][ T814] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffd81000 flags=0x0010] [ 7292.753268][ T814] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffd81900 flags=0x0010] [ 7292.766078][ T814] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffd81d00 flags=0x0010] [ 7292.778447][ T814] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffd82000 flags=0x0010] [ 7292.790724][ T814] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffd82400 flags=0x0010] [ 7292.803195][ T814] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffd82800 flags=0x0010] [ 7292.815664][ T814] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffd82c00 flags=0x0010] [ 7292.828089][ T814] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffd83000 flags=0x0010] [ 7292.840468][ T814] smartpqi :23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffd83400 flags=0x0010] [ 7292.852795][ T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffd83800 flags=0x0010] [ 7292.864566][ T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffd83c00 flags=0x0010] [ 7326.583908][C8] smartpqi :23:00.0: controller is offline: status code 0x14803 [ 7326.592386][C8] smartpqi :23:00.0: controller offline [ 7326.663728][ C66] sd 0:1:0:0: [sda] tag#467 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=6s [ 7326.664321][ T2738] smartpqi :23:00.0: resetting scsi 0:1:0:0 [ 7326.673849][ C66] sd 0:1:0:0: [sda] tag#467 CDB: opcode=0x28 28 00 02 ee 2e 60 00 00 08 00 [ 7326.680118][ T2738] smartpqi :23:00.0: reset of scsi 0:1:0:0: FAILED [ 7326.688612][ C66] blk_update_request: I/O error, dev sda, sector 49163872 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [ 7326.695456][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery [ 7326.706632][ C66] Read-error on swap-device (254:1:45833824) [ 7326.714030][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery [ 7326.723382][T45523] sd 0:1:0:0: rejecting I/O to offline device [ 7326.727352][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery [ 7326.727379][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery [ 7326.727442][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery > > kernel BUG at drivers/iommu/amd_iommu.c:1704! > BUG_ON(unmapped && !is_power_of_2(unmapped)); > Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 > 07/10/2019 > RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0 > Call Trace: > > __iommu_unmap+0x106/0x320 > iommu_unmap_fast+0xe/0x10 > __iommu_dma_unmap+0xdc/0x1a0 > iommu_dma_unmap_sg+0xae/0xd0 > scsi_dma_unmap+0xe7/0x150 > pqi_raid_io_complete+0x37/0x60 [smartpqi] > pqi_irq_handler+0x1fc/0x13f0 [smartpqi] > __handle_irq_event_percpu+0x78/0x4f0 > handle_irq_event_percpu+0x70/0x100 > handle_irq_event+0x5a/0x8b > handle_edge_irq+0x10c/0x370 > do_IRQ+0x9e/0x1e0 > common_interrupt+0xf/0xf > > > Signed-off-by: Qian Cai > --- > drivers/iommu/amd_iommu.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c > index 20cce366e951..22328a23335f 100644 > --- a/drivers/iommu/amd_iommu.c > +++ b/drivers/iommu/amd_iommu.c > @@ -1434,6 +1434,11 @@ static bool increase_address_space(struct > protection_domain *domain, > *pte = PM_LEVEL_PDE(domain->mode, > iommu_virt_to_phys(domain->pt_root)); > domain->pt_root = pte; > + /* > + * Make sure fetch_pte() will see the new domain->pt_root before it > + * snapshots domain->mode. > +