Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-05-03 Thread Qian Cai



> On May 3, 2020, at 2:39 PM, Joerg Roedel  wrote:
> 
> Can I add your Tested-by when I
> send them to the mailing list tomorrow?

Sure. Feel free to add,

Tested-by: Qian Cai 


Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-05-03 Thread Joerg Roedel
Hi Qian,

On Sun, May 03, 2020 at 09:04:03AM -0400, Qian Cai wrote:
> > On Apr 29, 2020, at 7:20 AM, Joerg Roedel  wrote:
> > Can you please test this branch:
> > 
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
> > 
> > It has the previous fix in it and a couple more to make sure the
> > device-table is updated and flushed before increase_address_space()
> > updates domain->pt_root.
> 
> I believe this closes the existing races, as it has survived for many
> days. Great work!

Thanks a lot for testing these changes! Can I add your Tested-by when I
send them to the mailing list tomorrow?

Regards,

Joerg



Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-05-03 Thread Qian Cai



> On Apr 29, 2020, at 7:20 AM, Joerg Roedel  wrote:
> 
> On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
>> No dice. There could be some other races. For example,
> 
> Can you please test this branch:
> 
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
> 
> It has the previous fix in it and a couple more to make sure the
> device-table is updated and flushed before increase_address_space()
> updates domain->pt_root.

I believe this closes the existing races, as it has survived for many days.
Great work!


Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-29 Thread Qian Cai


> On Apr 29, 2020, at 7:20 AM, Joerg Roedel  wrote:
> 
> On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
>> No dice. There could be some other races. For example,
> 
> Can you please test this branch:
> 
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
> 
> It has the previous fix in it and a couple more to make sure the
> device-table is updated and flushed before increase_address_space()
> updates domain->pt_root.

It looks promising, as it has survived a day’s stress testing. I’ll keep it
running for a few days to be sure.

Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-29 Thread Joerg Roedel
On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
> No dice. There could be some other races. For example,

Can you please test this branch:


https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes

It has the previous fix in it and a couple more to make sure the
device-table is updated and flushed before increase_address_space()
updates domain->pt_root.

Thanks,

Joerg


Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-29 Thread Joerg Roedel
Hi Qian,

On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
> 
> No dice. There could be some other races. For example,

Okay, I think I know what is happening. The increase_address_space()
function increases the address space, but does not update the DTE and
does not flush the old DTE from the caches. Both need to happen before
domain->pt_root is updated, because otherwise another CPU can come along
and map something into the increased address space that is not yet
accessible by the device, whose DTE has not been updated yet.
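
For reference, a sketch of the ordering this implies, reusing the
increase_address_space() context from the earlier diff (domain, pgtable,
pte, gfp, flags); update_and_flush_device_table() is a hypothetical
stand-in for the DTE update, so treat this as an illustration of the
ordering rather than the actual patch:

	spin_lock_irqsave(&domain->lock, flags);

	amd_iommu_domain_get_pgtable(domain, &pgtable);

	pte = (void *)get_zeroed_page(gfp);	/* new top-level table */
	if (!pte)
		goto out;

	*pte = PM_LEVEL_PDE(pgtable.mode, iommu_virt_to_phys(pgtable.root));

	/*
	 * 1) Write the new root/mode into the device-table entries and
	 *    flush the old entries from the IOMMU caches, so the device
	 *    can already walk the grown page-table.
	 */
	update_and_flush_device_table(domain, pte, pgtable.mode + 1);

	/*
	 * 2) Only then publish the new root, so a concurrent mapper
	 *    that reads pt_root cannot install PTEs the device is not
	 *    yet able to reach.
	 */
	atomic64_set(&domain->pt_root,
		     amd_iommu_domain_encode_pgtable(pte, pgtable.mode + 1));

	spin_unlock_irqrestore(&domain->lock, flags);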

Regards,

Joerg


Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-20 Thread Qian Cai


> On Apr 18, 2020, at 2:34 PM, Joerg Roedel  wrote:
> 
> On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
>> Hard to tell without testing further. I’ll leave that optimization for
>> the future, and focus on fixing those races first.
> 
> Yeah right, we should fix the existing races first before introducing
> new ones ;)
> 
> Btw, THANKS A LOT for tracking down all these race condition bugs; I am
> not even remotely able to trigger them with the hardware I have around.
> 
> I did some hacking and the attached diff shows how I think this race
> condition needs to be fixed. I boot-tested this fix on top of v5.7-rc1,
> but did no further testing. Can you test it please?

No dice. There could be some other races. For example,

> @@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain 
> *domain,
> unsigned long address,
> unsigned long *page_size)
...
>   amd_iommu_domain_get_pgtable(domain, &pgtable);
> 
>   if (address > PM_LEVEL_SIZE(pgtable.mode))
>   return NULL;
> 
>   level  =  pgtable.mode - 1;
>   pte= &pgtable.root[PM_LEVEL_INDEX(level, address)];

<— increase_address_space()

>   *page_size =  PTE_LEVEL_PAGE_SIZE(level);
> 

while (level > 0) {
*page_size = PTE_LEVEL_PAGE_SIZE(level);

Then in iommu_unmap_page(),

while (unmapped < page_size) {
pte = fetch_pte(dom, bus_addr, &unmap_size);
…
bus_addr  = (bus_addr & ~(unmap_size - 1)) + unmap_size;

bus_addr would be out of sync with dom->mode when it enters fetch_pte()
again. Could that be a problem?


[ 5159.274829][ T4057] LTP: starting oom02
[ 5160.382787][   C52] perf: interrupt took too long (7443 > 6208), lowering 
kernel.perf_event_max_sample_rate to 26800
[ 5167.951785][  T812] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xfffc flags=0x0010]
[ 5167.964540][  T812] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xfffc1000 flags=0x0010]
[ 5167.977442][  T812] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xfffc1900 flags=0x0010]
[ 5167.989901][  T812] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xfffc1d00 flags=0x0010]
[ 5168.002291][  T812] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xfffc2000 flags=0x0010]
[ 5168.014665][  T812] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xfffc2400 flags=0x0010]
[ 5168.027132][  T812] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xfffc2800 flags=0x0010]
[ 5168.039566][  T812] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xfffc2c00 flags=0x0010]
[ 5168.051956][  T812] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xfffc3000 flags=0x0010]
[ 5168.064310][  T812] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xfffc3400 flags=0x0010]
[ 5168.076652][  T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 
domain=0x0027 address=0xfffc3800 flags=0x0010]
[ 5168.088290][  T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 
domain=0x0027 address=0xfffc3c00 flags=0x0010]
[ 5183.695829][C8] smartpqi :23:00.0: controller is offline: status 
code 0x14803
[ 5183.704390][C8] smartpqi :23:00.0: controller offline
[ 5183.756594][  C101] blk_update_request: I/O error, dev sda, sector 22306304 
op 0x1:(WRITE) flags 0x800 phys_seg 4 prio class 0
[ 5183.756628][   C34] sd 0:1:0:0: [sda] tag#655 UNKNOWN(0x2003) Result: 
hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756759][   C56] blk_update_request: I/O error, dev sda, sector 58480128 
op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756810][   C79] sd 0:1:0:0: [sda] tag#234 UNKNOWN(0x2003) Result: 
hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756816][  C121] sd 0:1:0:0: [sda] tag#104 UNKNOWN(0x2003) Result: 
hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756837][   T53] sd 0:1:0:0: [sda] tag#4 UNKNOWN(0x2003) Result: 
hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756882][  C121] sd 0:1:0:0: [sda] tag#104 CDB: opcode=0x2a 2a 00 00 4d 
d4 00 00 02 00 00
[ 5183.756892][   C79] sd 0:1:0:0: [sda] tag#234 CDB: opcode=0x2a 2a 00 02 03 
e4 00 00 02 00 00
[ 5183.756909][  C121] blk_update_request: I/O error, dev sda, sector 5100544 
op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756920][   C79] blk_update_request: I/O error, dev sda, sector 33809408 
op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756939][   T53] sd 0:1:0:0: [sda] tag#4 CDB: opcode=0x2a 2a 00 02 4b f8 
00 00 02 00 00
[ 5183.756967][   T53] blk_update_request: I/O error, dev sda, sector 38533120 
op 0x1:(WRITE) flags 0x8004000 

Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-19 Thread Qian Cai


> On Apr 18, 2020, at 2:34 PM, Joerg Roedel  wrote:
> 
> On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
>> Hard to tell without testing further. I’ll leave that optimization for
>> the future, and focus on fixing those races first.
> 
> Yeah right, we should fix the existing races first before introducing
> new ones ;)
> 
> Btw, THANKS A LOT for tracking down all these race condition bugs; I am
> not even remotely able to trigger them with the hardware I have around.
> 
> I did some hacking and the attached diff shows how I think this race
> condition needs to be fixed. I boot-tested this fix on top of v5.7-rc1,
> but did no further testing. Can you test it please?

Sure, give it a few days to see if it survives.

Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-18 Thread Joerg Roedel
On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
> Hard to tell without testing further. I’ll leave that optimization for
> the future, and focus on fixing those races first.

Yeah right, we should fix the existing races first before introducing
new ones ;)

Btw, THANKS A LOT for tracking down all these race condition bugs; I am
not even remotely able to trigger them with the hardware I have around.

I did some hacking and the attached diff shows how I think this race
condition needs to be fixed. I boot-tested this fix on top of v5.7-rc1,
but did no further testing. Can you test it please?

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..28229a38af4d 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -151,6 +151,26 @@ static struct protection_domain *to_pdomain(struct 
iommu_domain *dom)
return container_of(dom, struct protection_domain, domain);
 }
 
+static void amd_iommu_domain_get_pgtable(struct protection_domain *domain,
+struct domain_pgtable *pgtable)
+{
+   u64 pt_root = atomic64_read(&domain->pt_root);
+
+   pgtable->root = (u64 *)(pt_root & PAGE_MASK);
+   pgtable->mode = pt_root & 7; /* lowest 3 bits encode pgtable mode */
+}
+
+static u64 amd_iommu_domain_encode_pgtable(u64 *root, int mode)
+{
+   u64 pt_root;
+
+   /* lowest 3 bits encode pgtable mode */
+   pt_root = mode & 7;
+   pt_root |= (u64)root;
+
+   return pt_root;
+}
+
 static struct iommu_dev_data *alloc_dev_data(u16 devid)
 {
struct iommu_dev_data *dev_data;
@@ -1397,13 +1417,18 @@ static struct page *free_sub_pt(unsigned long root, int 
mode,
 
 static void free_pagetable(struct protection_domain *domain)
 {
-   unsigned long root = (unsigned long)domain->pt_root;
+   struct domain_pgtable pgtable;
struct page *freelist = NULL;
+   unsigned long root;
+
+   amd_iommu_domain_get_pgtable(domain, &pgtable);
+   atomic64_set(&domain->pt_root, 0);
 
-   BUG_ON(domain->mode < PAGE_MODE_NONE ||
-  domain->mode > PAGE_MODE_6_LEVEL);
+   BUG_ON(pgtable.mode < PAGE_MODE_NONE ||
+  pgtable.mode > PAGE_MODE_6_LEVEL);
 
-   freelist = free_sub_pt(root, domain->mode, freelist);
+   root = (unsigned long)pgtable.root;
+   freelist = free_sub_pt(root, pgtable.mode, freelist);
 
free_page_list(freelist);
 }
@@ -1417,24 +1442,28 @@ static bool increase_address_space(struct 
protection_domain *domain,
   unsigned long address,
   gfp_t gfp)
 {
+   struct domain_pgtable pgtable;
unsigned long flags;
bool ret = false;
-   u64 *pte;
+   u64 *pte, root;
 
	spin_lock_irqsave(&domain->lock, flags);
 
-   if (address <= PM_LEVEL_SIZE(domain->mode) ||
-   WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
+   amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+   if (address <= PM_LEVEL_SIZE(pgtable.mode) ||
+   WARN_ON_ONCE(pgtable.mode == PAGE_MODE_6_LEVEL))
goto out;
 
pte = (void *)get_zeroed_page(gfp);
if (!pte)
goto out;
 
-   *pte = PM_LEVEL_PDE(domain->mode,
-   iommu_virt_to_phys(domain->pt_root));
-   domain->pt_root  = pte;
-   domain->mode+= 1;
+   *pte = PM_LEVEL_PDE(pgtable.mode, iommu_virt_to_phys(pgtable.root));
+
+   root = amd_iommu_domain_encode_pgtable(pte, pgtable.mode + 1);
+
+   atomic64_set(&domain->pt_root, root);
 
ret = true;
 
@@ -1451,16 +1480,22 @@ static u64 *alloc_pte(struct protection_domain *domain,
  gfp_t gfp,
  bool *updated)
 {
+   struct domain_pgtable pgtable;
int level, end_lvl;
u64 *pte, *page;
 
BUG_ON(!is_power_of_2(page_size));
 
-   while (address > PM_LEVEL_SIZE(domain->mode))
+   amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+   while (address > PM_LEVEL_SIZE(pgtable.mode)) {
*updated = increase_address_space(domain, address, gfp) || 
*updated;
+   amd_iommu_domain_get_pgtable(domain, &pgtable);
+   }
+
 
-   level   = domain->mode - 1;
-   pte     = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+   level   = pgtable.mode - 1;
+   pte     = &pgtable.root[PM_LEVEL_INDEX(level, address)];
address = PAGE_SIZE_ALIGN(address, page_size);
end_lvl = PAGE_SIZE_LEVEL(page_size);
 
@@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain *domain,
  unsigned long address,
  unsigned long *page_size)
 {
+   struct domain_pgtable pgtable;
int level;
u64 *pte;
 
*page_size = 0;
 
-   if (address > PM_LEVEL_SIZE(domain->mode))
+   amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+   if (address > PM_LEVEL_SIZE(pgtable.mode))
return NULL;
 
-   level  =  

Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-18 Thread Qian Cai


> On Apr 18, 2020, at 8:10 AM, Joerg Roedel  wrote:
> 
> Yes, your patch still looks racy. You need to atomically read
> domain->pt_root into a stack variable and derive the pt_root pointer and
> the mode from that variable instead of from domain->pt_root directly. If
> you read domain->pt_root twice, there could still be an update between
> the two reads.
> Probably the lock in increase_address_space() can also be avoided if
> pt_root is updated using cmpxchg()?

Hard to tell without testing further. I’ll leave that optimization for the
future, and focus on fixing those races first.

Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-18 Thread Joerg Roedel
On Thu, Apr 16, 2020 at 09:42:41PM -0400, Qian Cai wrote:
> So, this is still not enough; it would still trigger the storage driver
> going offline under memory pressure, just after a bit longer. It looks
> to me that fetch_pte() could still be racy?

Yes, your patch still looks racy. You need to atomically read
domain->pt_root into a stack variable and derive the pt_root pointer and
the mode from that variable instead of from domain->pt_root directly. If
you read domain->pt_root twice, there could still be an update between
the two reads.
Probably the lock in increase_address_space() can also be avoided if
pt_root is updated using cmpxchg()?
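
For illustration, a minimal sketch of the cmpxchg() idea, assuming
pt_root has become an atomic64_t that packs the root pointer and the
mode (low 3 bits) together as proposed elsewhere in this thread; the
function and its details are hypothetical, not driver code:

	static bool try_increase_address_space(struct protection_domain *domain,
					       gfp_t gfp)
	{
		u64 old_root = atomic64_read(&domain->pt_root);
		int mode     = old_root & 7;
		u64 *root    = (u64 *)(old_root & PAGE_MASK);
		u64 *pte;

		if (WARN_ON_ONCE(mode == PAGE_MODE_6_LEVEL))
			return false;

		pte = (void *)get_zeroed_page(gfp);
		if (!pte)
			return false;

		*pte = PM_LEVEL_PDE(mode, iommu_virt_to_phys(root));

		/* Publish (new root, mode + 1) only if nobody raced us. */
		if (atomic64_cmpxchg(&domain->pt_root, old_root,
				     (u64)pte | (mode + 1)) != old_root) {
			/* Lost the race; another CPU grew the table already. */
			free_page((unsigned long)pte);
			return false;
		}

		return true;
	}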

Regards,

Joerg



Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-16 Thread Qian Cai


> On Apr 13, 2020, at 9:36 PM, Qian Cai  wrote:
> 
> 
> 
>> On Apr 8, 2020, at 10:19 AM, Joerg Roedel  wrote:
>> 
>> Hi Qian,
>> 
>> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
>>> After further testing, the change alone is insufficient. What I am
>>> chasing right now is that the swap device goes offline after heavy
>>> memory pressure, as shown below. The symptom is similar to what we
>>> have in the commit,
>>> 
>>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>>> 
>>> Apparently, it is not possible to take the domain->lock in fetch_pte()
>>> because it could sleep.
>> 
>> Thanks a lot for finding and tracking down another race in the AMD IOMMU
>> page-table code.  The domain->lock is a spin-lock and taking it can't
>> sleep. But fetch_pte() is a fast-path and must not take any locks.
>> 
>> I think the best fix is to update the pt_root and mode of the domain
>> atomically by storing the mode in the lower 12 bits of pt_root. This way
>> they are stored together and can be read and written atomically.
> 
> Like this?

So, this is still not enough; it would still trigger the storage driver going
offline under memory pressure, just after a bit longer. It looks to me that
fetch_pte() could still be racy?

level  =  domain->mode - 1;
pte    = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
<— increase_address_space();
*page_size =  PTE_LEVEL_PAGE_SIZE(level);

while (level > 0) {
*page_size = PTE_LEVEL_PAGE_SIZE(level);

Then in iommu_unmap_page(),

while (unmapped < page_size) {
pte = fetch_pte(dom, bus_addr, &unmap_size);
…
bus_addr  = (bus_addr & ~(unmap_size - 1)) + unmap_size;

bus_addr would be out of sync with dom->mode when it enters fetch_pte()
again. Could that be a problem?

> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 20cce366e951..b36c6b07cbfd 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, 
> int mode,
> 
> static void free_pagetable(struct protection_domain *domain)
> {
> - unsigned long root = (unsigned long)domain->pt_root;
> + int level = iommu_get_mode(domain->pt_root);
> + unsigned long root = iommu_get_root(domain->pt_root);
>   struct page *freelist = NULL;
> 
> - BUG_ON(domain->mode < PAGE_MODE_NONE ||
> -domain->mode > PAGE_MODE_6_LEVEL);
> + BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL);
> 
> - freelist = free_sub_pt(root, domain->mode, freelist);
> + freelist = free_sub_pt(root, level, freelist);
> 
>   free_page_list(freelist);
> }
> @@ -1417,24 +1417,27 @@ static bool increase_address_space(struct 
> protection_domain *domain,
>  unsigned long address,
>  gfp_t gfp)
> {
> + int level;
>   unsigned long flags;
>   bool ret = false;
>   u64 *pte;
> 
>   spin_lock_irqsave(&domain->lock, flags);
> 
> - if (address <= PM_LEVEL_SIZE(domain->mode) ||
> - WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
> + level = iommu_get_mode(domain->pt_root);
> +
> + if (address <= PM_LEVEL_SIZE(level) ||
> + WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL))
>   goto out;
> 
>   pte = (void *)get_zeroed_page(gfp);
>   if (!pte)
>   goto out;
> 
> - *pte = PM_LEVEL_PDE(domain->mode,
> - iommu_virt_to_phys(domain->pt_root));
> - domain->pt_root  = pte;
> - domain->mode+= 1;
> + *pte = PM_LEVEL_PDE(level,
> + iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root)));
> +
> + WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1);
> 
>   ret = true;
> 
> @@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain 
> *domain,
> bool *updated)
> {
>   int level, end_lvl;
> - u64 *pte, *page;
> + u64 *pte, *page, *pt_root, *root;
> 
>   BUG_ON(!is_power_of_2(page_size));
> 
> - while (address > PM_LEVEL_SIZE(domain->mode))
> + while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root)))
>   *updated = increase_address_space(domain, address, gfp) || 
> *updated;
> 
> - level   = domain->mode - 1;
> - pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> + pt_root = READ_ONCE(domain->pt_root);
> + root= (void *)iommu_get_root(pt_root);
> + level   = iommu_get_mode(pt_root) - 1;
> + pte = &root[PM_LEVEL_INDEX(level, address)];
>   address = PAGE_SIZE_ALIGN(address, page_size);
>   end_lvl = PAGE_SIZE_LEVEL(page_size);
> 
> @@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain 
> *domain,
> unsigned long address,
> unsigned long *page_size)
> {
> - 

Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-13 Thread Qian Cai


> On Apr 8, 2020, at 10:19 AM, Joerg Roedel  wrote:
> 
> Hi Qian,
> 
> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
>> After further testing, the change along is insufficient. What I am chasing 
>> right
>> now is the swap device will go offline after heavy memory pressure below. The
>> symptom is similar to what we have in the commit,
>> 
>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>> 
>> Apparently, it is not possible to take the domain->lock in fetch_pte()
>> because it could sleep.
> 
> Thanks a lot for finding and tracking down another race in the AMD IOMMU
> page-table code.  The domain->lock is a spin-lock and taking it can't
> sleep. But fetch_pte() is a fast-path and must not take any locks.
> 
> I think the best fix is to update the pt_root and mode of the domain
> atomically by storing the mode in the lower 12 bits of pt_root. This way
> they are stored together and can be read and written atomically.

Like this?

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..b36c6b07cbfd 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int 
mode,
 
 static void free_pagetable(struct protection_domain *domain)
 {
-   unsigned long root = (unsigned long)domain->pt_root;
+   int level = iommu_get_mode(domain->pt_root);
+   unsigned long root = iommu_get_root(domain->pt_root);
struct page *freelist = NULL;
 
-   BUG_ON(domain->mode < PAGE_MODE_NONE ||
-  domain->mode > PAGE_MODE_6_LEVEL);
+   BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL);
 
-   freelist = free_sub_pt(root, domain->mode, freelist);
+   freelist = free_sub_pt(root, level, freelist);
 
free_page_list(freelist);
 }
@@ -1417,24 +1417,27 @@ static bool increase_address_space(struct 
protection_domain *domain,
   unsigned long address,
   gfp_t gfp)
 {
+   int level;
unsigned long flags;
bool ret = false;
u64 *pte;
 
	spin_lock_irqsave(&domain->lock, flags);
 
-   if (address <= PM_LEVEL_SIZE(domain->mode) ||
-   WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
+   level = iommu_get_mode(domain->pt_root);
+
+   if (address <= PM_LEVEL_SIZE(level) ||
+   WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL))
goto out;
 
pte = (void *)get_zeroed_page(gfp);
if (!pte)
goto out;
 
-   *pte = PM_LEVEL_PDE(domain->mode,
-   iommu_virt_to_phys(domain->pt_root));
-   domain->pt_root  = pte;
-   domain->mode+= 1;
+   *pte = PM_LEVEL_PDE(level,
+   iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root)));
+
+   WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1);
 
ret = true;
 
@@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain,
  bool *updated)
 {
int level, end_lvl;
-   u64 *pte, *page;
+   u64 *pte, *page, *pt_root, *root;
 
BUG_ON(!is_power_of_2(page_size));
 
-   while (address > PM_LEVEL_SIZE(domain->mode))
+   while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root)))
*updated = increase_address_space(domain, address, gfp) || 
*updated;
 
-   level   = domain->mode - 1;
-   pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+   pt_root = READ_ONCE(domain->pt_root);
+   root= (void *)iommu_get_root(pt_root);
+   level   = iommu_get_mode(pt_root) - 1;
+   pte = &root[PM_LEVEL_INDEX(level, address)];
address = PAGE_SIZE_ALIGN(address, page_size);
end_lvl = PAGE_SIZE_LEVEL(page_size);
 
@@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain,
  unsigned long address,
  unsigned long *page_size)
 {
-   int level;
u64 *pte;
+   u64 *pt_root = READ_ONCE(domain->pt_root);
+   u64 *root= (void *)iommu_get_root(pt_root);
+   int level= iommu_get_mode(pt_root);
 
*page_size = 0;
 
-   if (address > PM_LEVEL_SIZE(domain->mode))
+   if (address > PM_LEVEL_SIZE(level))
return NULL;
 
-   level  =  domain->mode - 1;
-   pte= &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+   level--;
+   pte= &root[PM_LEVEL_INDEX(level, address)];
*page_size =  PTE_LEVEL_PAGE_SIZE(level);
 
while (level > 0) {
@@ -1814,12 +1821,13 @@ static struct protection_domain 
*dma_ops_domain_alloc(void)
if (protection_domain_init(domain))
goto free_domain;
 
-   domain->mode = PAGE_MODE_3_LEVEL;
domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
domain->flags = PD_DMA_OPS_MASK;
if (!domain->pt_root)
goto free_domain;
 
+  

Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-08 Thread Joerg Roedel
Hi Qian,

On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
> After further testing, the change alone is insufficient. What I am
> chasing right now is that the swap device goes offline after heavy
> memory pressure, as shown below. The symptom is similar to what we
> have in the commit,
> 
> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
> 
> Apparently, it is not possible to take the domain->lock in fetch_pte()
> because it could sleep.

Thanks a lot for finding and tracking down another race in the AMD IOMMU
page-table code.  The domain->lock is a spin-lock and taking it can't
sleep. But fetch_pte() is a fast-path and must not take any locks.

I think the best fix is to update the pt_root and mode of the domain
atomically by storing the mode in the lower 12 bits of pt_root. This way
they are stored together and can be read and written atomically.
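
For reference, that encoding looks roughly like the sketch below (the
follow-up diff in this thread implements it as
amd_iommu_domain_get_pgtable() and amd_iommu_domain_encode_pgtable()).
pt_root is page-aligned, so its low 12 bits are free, and three of them
are enough for PAGE_MODE_NONE through PAGE_MODE_6_LEVEL:

	struct domain_pgtable {
		int	mode;
		u64	*root;
	};

	/* Snapshot root and mode together with one atomic read. */
	static void get_pgtable(struct protection_domain *domain,
				struct domain_pgtable *pgtable)
	{
		u64 pt_root = atomic64_read(&domain->pt_root);

		pgtable->root = (u64 *)(pt_root & PAGE_MASK);
		pgtable->mode = pt_root & 7; /* low 3 bits hold the mode */
	}

	/* Pack a page-aligned root pointer and a mode into one u64. */
	static u64 encode_pgtable(u64 *root, int mode)
	{
		return (u64)root | (mode & 7);
	}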

Regards,

Joerg


Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-07 Thread Qian Cai


> On Apr 6, 2020, at 10:12 PM, Qian Cai  wrote:
> 
> fetch_pte() could race with increase_address_space() because it holds no
> lock when called from iommu_unmap_page(). The CPU that runs fetch_pte()
> could see a stale domain->pt_root together with a new, increased
> domain->mode from increase_address_space(). As a result, it could
> trigger invalid accesses later on. Fix it by using a pair of
> smp_[w|r]mb() in those places.

After further testing, the change alone is insufficient. What I am chasing
right now is that the swap device goes offline after heavy memory pressure,
as shown below. The symptom is similar to what we have in the commit,

754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)

Apparently, it is not possible to take the domain->lock in fetch_pte() because
it could sleep.

Thoughts?

[ 7292.727377][  T814] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xffd8 flags=0x0010]
[ 7292.740571][  T814] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xffd81000 flags=0x0010]
[ 7292.753268][  T814] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xffd81900 flags=0x0010]
[ 7292.766078][  T814] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xffd81d00 flags=0x0010]
[ 7292.778447][  T814] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xffd82000 flags=0x0010]
[ 7292.790724][  T814] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xffd82400 flags=0x0010]
[ 7292.803195][  T814] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xffd82800 flags=0x0010]
[ 7292.815664][  T814] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xffd82c00 flags=0x0010]
[ 7292.828089][  T814] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xffd83000 flags=0x0010]
[ 7292.840468][  T814] smartpqi :23:00.0: AMD-Vi: Event logged 
[IO_PAGE_FAULT domain=0x0027 address=0xffd83400 flags=0x0010]
[ 7292.852795][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 
domain=0x0027 address=0xffd83800 flags=0x0010]
[ 7292.864566][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 
domain=0x0027 address=0xffd83c00 flags=0x0010]
[ 7326.583908][C8] smartpqi :23:00.0: controller is offline: status 
code 0x14803
[ 7326.592386][C8] smartpqi :23:00.0: controller offline
[ 7326.663728][   C66] sd 0:1:0:0: [sda] tag#467 UNKNOWN(0x2003) Result: 
hostbyte=0x01 driverbyte=0x00 cmd_age=6s
[ 7326.664321][ T2738] smartpqi :23:00.0: resetting scsi 0:1:0:0
[ 7326.673849][   C66] sd 0:1:0:0: [sda] tag#467 CDB: opcode=0x28 28 00 02 ee 
2e 60 00 00 08 00
[ 7326.680118][ T2738] smartpqi :23:00.0: reset of scsi 0:1:0:0: FAILED
[ 7326.688612][   C66] blk_update_request: I/O error, dev sda, sector 49163872 
op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 7326.695456][ T2738] sd 0:1:0:0: Device offlined - not ready after error 
recovery
[ 7326.706632][   C66] Read-error on swap-device (254:1:45833824)
[ 7326.714030][ T2738] sd 0:1:0:0: Device offlined - not ready after error 
recovery
[ 7326.723382][T45523] sd 0:1:0:0: rejecting I/O to offline device
[ 7326.727352][ T2738] sd 0:1:0:0: Device offlined - not ready after error 
recovery
[ 7326.727379][ T2738] sd 0:1:0:0: Device offlined - not ready after error 
recovery
[ 7326.727442][ T2738] sd 0:1:0:0: Device offlined - not ready after error 
recovery

> 
> kernel BUG at drivers/iommu/amd_iommu.c:1704!
> BUG_ON(unmapped && !is_power_of_2(unmapped));
> Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 
> 07/10/2019
> RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0
> Call Trace:
>  
>  __iommu_unmap+0x106/0x320
>  iommu_unmap_fast+0xe/0x10
>  __iommu_dma_unmap+0xdc/0x1a0
>  iommu_dma_unmap_sg+0xae/0xd0
>  scsi_dma_unmap+0xe7/0x150
>  pqi_raid_io_complete+0x37/0x60 [smartpqi]
>  pqi_irq_handler+0x1fc/0x13f0 [smartpqi]
>  __handle_irq_event_percpu+0x78/0x4f0
>  handle_irq_event_percpu+0x70/0x100
>  handle_irq_event+0x5a/0x8b
>  handle_edge_irq+0x10c/0x370
>  do_IRQ+0x9e/0x1e0
>  common_interrupt+0xf/0xf
>  
> 
> Signed-off-by: Qian Cai 
> ---
> drivers/iommu/amd_iommu.c | 9 +
> 1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 20cce366e951..22328a23335f 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -1434,6 +1434,11 @@ static bool increase_address_space(struct 
> protection_domain *domain,
>   *pte = PM_LEVEL_PDE(domain->mode,
>   iommu_virt_to_phys(domain->pt_root));
>   domain->pt_root  = pte;
> + /*
> +  * Make sure fetch_pte() will see the new domain->pt_root before it
> +  * snapshots domain->mode.
> +