Re: [PATCH v3 1/3] scsi: BusLogic remove bus_to_virt

2022-06-24 Thread Khalid Aziz

On 6/24/22 09:52, Arnd Bergmann wrote:

From: Arnd Bergmann 

The BusLogic driver is the last remaining driver that relies on the
deprecated bus_to_virt() function, which in turn only works on a few
architectures, and is incompatible with both swiotlb and iommu support.

Before commit 391e2f25601e ("[SCSI] BusLogic: Port driver to 64-bit."),
the driver had a dependency on x86-32, presumably because of this
problem. However, the change introduced another bug that made it still
impossible to use the driver on any 64-bit machine.

This was in turn fixed in commit 56f396146af2 ("scsi: BusLogic: Fix
64-bit system enumeration error for Buslogic"), 8 years later, which
shows that there are not a lot of users.

Maciej is still using the driver on 32-bit hardware, and Khalid mentioned
that the driver works with the device emulation used in VirtualBox
and VMware. Both of those only emulate it for Windows 2000 and older
operating systems that did not ship with the better LSI logic driver.

Do a minimum fix that searches through the list of descriptors to find
one that matches the bus address. This is clearly as inefficient as
was indicated in the code comment about the lack of a bus_to_virt()
replacement. A better fix would likely involve changing out the entire
descriptor allocation for a simpler one, but that would be much
more invasive.

Cc: Maciej W. Rozycki 
Cc: Matt Wang 
Tested-by: Khalid Aziz 
Reviewed-by: Robin Murphy 
Reviewed-by: Hannes Reinecke 
Signed-off-by: Arnd Bergmann 
---
v3: Address issues pointed out by Khalid Aziz
v2: Attempt to fix the driver instead of removing it
---
  drivers/scsi/BusLogic.c | 35 +++
  drivers/scsi/Kconfig|  2 +-
  2 files changed, 24 insertions(+), 13 deletions(-)



This looks good.

Acked-by: Khalid Aziz 



diff --git a/drivers/scsi/BusLogic.c b/drivers/scsi/BusLogic.c
index a897c8f914cf..f2abffce2659 100644
--- a/drivers/scsi/BusLogic.c
+++ b/drivers/scsi/BusLogic.c
@@ -2515,12 +2515,26 @@ static int blogic_resultcode(struct blogic_adapter 
*adapter,
return (hoststatus << 16) | tgt_status;
  }
  
+/*

+ * turn the dma address from an inbox into a ccb pointer
+ * This is rather inefficient.
+ */
+static struct blogic_ccb *
+blogic_inbox_to_ccb(struct blogic_adapter *adapter, struct blogic_inbox *inbox)
+{
+   struct blogic_ccb *ccb;
+
+   for (ccb = adapter->all_ccbs; ccb; ccb = ccb->next_all)
+   if (inbox->ccb == ccb->dma_handle)
+   break;
+
+   return ccb;
+}
  
  /*

blogic_scan_inbox scans the Incoming Mailboxes saving any
Incoming Mailbox entries for completion processing.
  */
-
  static void blogic_scan_inbox(struct blogic_adapter *adapter)
  {
/*
@@ -2540,17 +2554,14 @@ static void blogic_scan_inbox(struct blogic_adapter 
*adapter)
enum blogic_cmplt_code comp_code;
  
  	while ((comp_code = next_inbox->comp_code) != BLOGIC_INBOX_FREE) {

-   /*
-  We are only allowed to do this because we limit our
-  architectures we run on to machines where bus_to_virt(
-  actually works.  There *needs* to be a dma_addr_to_virt()
-  in the new PCI DMA mapping interface to replace
-  bus_to_virt() or else this code is going to become very
-  innefficient.
-*/
-   struct blogic_ccb *ccb =
-   (struct blogic_ccb *) bus_to_virt(next_inbox->ccb);
-   if (comp_code != BLOGIC_CMD_NOTFOUND) {
+   struct blogic_ccb *ccb = blogic_inbox_to_ccb(adapter, 
next_inbox);
+   if (!ccb) {
+   /*
+* This should never happen, unless the CCB list is
+* corrupted in memory.
+*/
+   blogic_warn("Could not find CCB for dma address %x\n", 
adapter, next_inbox->ccb);
+   } else if (comp_code != BLOGIC_CMD_NOTFOUND) {
if (ccb->status == BLOGIC_CCB_ACTIVE ||
ccb->status == BLOGIC_CCB_RESET) {
/*
diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index 6e3a04107bb6..689186f3a908 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -514,7 +514,7 @@ config SCSI_HPTIOP
  
  config SCSI_BUSLOGIC

tristate "BusLogic SCSI support"
-   depends on PCI && SCSI && VIRT_TO_BUS
+   depends on PCI && SCSI
help
  This is support for BusLogic MultiMaster and FlashPoint SCSI Host
  Adapters. Consult the SCSI-HOWTO, available from




Re: [PATCH v2 2/3] scsi: BusLogic remove bus_to_virt

2022-06-24 Thread Khalid Aziz

On 6/23/22 08:47, Arnd Bergmann wrote:



Can you test it again with this patch on top?

diff --git a/drivers/scsi/BusLogic.c b/drivers/scsi/BusLogic.c
index d057abfcdd5c..9e67f2ee25ee 100644
--- a/drivers/scsi/BusLogic.c
+++ b/drivers/scsi/BusLogic.c
@@ -2554,8 +2554,14 @@ static void blogic_scan_inbox(struct
blogic_adapter *adapter)
 enum blogic_cmplt_code comp_code;

 while ((comp_code = next_inbox->comp_code) != BLOGIC_INBOX_FREE) {
-   struct blogic_ccb *ccb = blogic_inbox_to_ccb(adapter,
adapter->next_inbox);
-   if (comp_code != BLOGIC_CMD_NOTFOUND) {
+   struct blogic_ccb *ccb = blogic_inbox_to_ccb(adapter,
next_inbox);
+   if (!ccb) {
+   /*
+* This should never happen, unless the CCB list is
+* corrupted in memory.
+*/
+   blogic_warn("Could not find CCB for dma
address 0x%x\n", adapter, next_inbox->ccb);
+   } else if (comp_code != BLOGIC_CMD_NOTFOUND) {
 if (ccb->status == BLOGIC_CCB_ACTIVE ||
 ccb->status == BLOGIC_CCB_RESET) {


Hi Arnd,

The driver works with this change. next_inbox is the correct pointer to pass.

Thanks,
Khalid


Re: [PATCH v2 2/3] scsi: BusLogic remove bus_to_virt

2022-06-21 Thread Khalid Aziz

On 6/17/22 06:57, Arnd Bergmann wrote:

From: Arnd Bergmann 

The BusLogic driver is the last remaining driver that relies on the
deprecated bus_to_virt() function, which in turn only works on a few
architectures, and is incompatible with both swiotlb and iommu support.

Before commit 391e2f25601e ("[SCSI] BusLogic: Port driver to 64-bit."),
the driver had a dependency on x86-32, presumably because of this
problem. However, the change introduced another bug that made it still
impossible to use the driver on any 64-bit machine.

This was in turn fixed in commit 56f396146af2 ("scsi: BusLogic: Fix
64-bit system enumeration error for Buslogic"), 8 years later, which
shows that there are not a lot of users.

Maciej is still using the driver on 32-bit hardware, and Khalid mentioned
that the driver works with the device emulation used in VirtualBox
and VMware. Both of those only emulate it for Windows 2000 and older
operating systems that did not ship with the better LSI logic driver.

Do a minimum fix that searches through the list of descriptors to find
one that matches the bus address. This is clearly as inefficient as
was indicated in the code comment about the lack of a bus_to_virt()
replacement. A better fix would likely involve changing out the entire
descriptor allocation for a simpler one, but that would be much
more invasive.

Cc: Maciej W. Rozycki 
Cc: Matt Wang 
Cc: Khalid Aziz 
Signed-off-by: Arnd Bergmann 
---
  drivers/scsi/BusLogic.c | 27 ---
  drivers/scsi/Kconfig|  2 +-
  2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/BusLogic.c b/drivers/scsi/BusLogic.c
index a897c8f914cf..d057abfcdd5c 100644
--- a/drivers/scsi/BusLogic.c
+++ b/drivers/scsi/BusLogic.c
@@ -2515,12 +2515,26 @@ static int blogic_resultcode(struct blogic_adapter 
*adapter,
return (hoststatus << 16) | tgt_status;
  }
  
+/*

+ * turn the dma address from an inbox into a ccb pointer
+ * This is rather inefficient.
+ */
+static struct blogic_ccb *
+blogic_inbox_to_ccb(struct blogic_adapter *adapter, struct blogic_inbox *inbox)
+{
+   struct blogic_ccb *ccb;
+
+   for (ccb = adapter->all_ccbs; ccb; ccb = ccb->next_all)
+   if (inbox->ccb == ccb->dma_handle)
+   break;
+
+   return ccb;
+}
  
  /*

blogic_scan_inbox scans the Incoming Mailboxes saving any
Incoming Mailbox entries for completion processing.
  */
-
  static void blogic_scan_inbox(struct blogic_adapter *adapter)
  {
/*
@@ -2540,16 +2554,7 @@ static void blogic_scan_inbox(struct blogic_adapter 
*adapter)
enum blogic_cmplt_code comp_code;
  
  	while ((comp_code = next_inbox->comp_code) != BLOGIC_INBOX_FREE) {

-   /*
-  We are only allowed to do this because we limit our
-  architectures we run on to machines where bus_to_virt(
-  actually works.  There *needs* to be a dma_addr_to_virt()
-  in the new PCI DMA mapping interface to replace
-  bus_to_virt() or else this code is going to become very
-  innefficient.
-*/
-   struct blogic_ccb *ccb =
-   (struct blogic_ccb *) bus_to_virt(next_inbox->ccb);
+   struct blogic_ccb *ccb = blogic_inbox_to_ccb(adapter, 
adapter->next_inbox);


This change looks good enough as a workaround to stop using bus_to_virt()
for now. There are two problems I see, though. One, I do worry about
blogic_inbox_to_ccb() returning NULL for the ccb, which should not happen
unless the mailbox pointer was corrupted, and that would indicate a bigger
problem. Nevertheless, a NULL pointer causing a kernel panic concerns me.
How about adding a check before we dereference ccb?
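
Something along these lines is what I have in mind (just a sketch, using
the existing blogic_warn() convention of passing the adapter right after
the format string):

	struct blogic_ccb *ccb = blogic_inbox_to_ccb(adapter, next_inbox);

	if (!ccb) {
		/*
		 * Should never happen unless the CCB list in memory is
		 * corrupted, but warn instead of dereferencing NULL.
		 */
		blogic_warn("Could not find CCB for dma address %x\n",
			    adapter, next_inbox->ccb);
	} else if (comp_code != BLOGIC_CMD_NOTFOUND) {
		/* existing completion handling continues unchanged */
	}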


Second, with this patch applied, I am seeing errors from the driver:

=
[ 1623.902685]  sdb: sdb1 sdb2
[ 1623.903245] sd 2:0:0:0: [sdb] Attached SCSI disk
[ 1623.911000] scsi2: Illegal CCB #76 status 2 in Incoming Mailbox
[ 1623.911005] scsi2: Illegal CCB #76 status 2 in Incoming Mailbox
[ 1623.911070] scsi2: Illegal CCB #79 status 2 in Incoming Mailbox
[ 1651.458008] scsi2: Warning: Partition Table appears to have Geometry 
256/63 which is
[ 1651.458013] scsi2: not compatible with current BusLogic Host Adapter 
Geometry 255/63

[ 1658.797609] scsi2: Resetting BusLogic BT-958D Failed
[ 1659.533208] sd 2:0:0:0: Device offlined - not ready after error recovery
[ 1659.51] sd 2:0:0:0: Device offlined - not ready after error recovery
[ 1659.53] sd 2:0:0:0: Device offlined - not ready after error recovery
[ 1659.533342] sd 2:0:0:0: [sdb] tag#101 FAILED Result: 
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
[ 1659.533345] sd 2:0:0:0: [sdb] tag#101 CDB: Read(10) 28 00 00 00 00 28 
00 00 10 00
[ 1659.533346] I/O error, dev sdb, sector 40 op 0x0:(READ) flags 0x80700 
phys_seg 1 prio class 0


=

This i

Re: [PATCH 5/6] scsi: remove stale BusLogic driver

2022-06-06 Thread Khalid Aziz

On 6/6/22 02:41, Arnd Bergmann wrote:

From: Arnd Bergmann

The BusLogic driver is the last remaining driver that relies on the
deprecated bus_to_virt() function, which in turn only works on a few
architectures, and is incompatible with both swiotlb and iommu support.

Before commit 391e2f25601e ("[SCSI] BusLogic: Port driver to 64-bit."),
the driver had a dependency on x86-32, presumably because of this
problem. However, the change introduced another bug that made it still
impossible to use the driver on any 64-bit machine.

This was in turn fixed in commit 56f396146af2 ("scsi: BusLogic: Fix
64-bit system enumeration error for Buslogic"), 8 years later.

The fact that this was found at all is an indication that there are
users, and it seems that Maciej, Matt and Khalid all have access to
this hardware, but if it took eight years to find the problem,
it's likely that nobody actually relies on it.

Remove it as part of the global virt_to_bus()/bus_to_virt() removal.
If anyone is still interested in keeping this driver, the alternative
is to stop it from using bus_to_virt(), possibly along the lines of
how dpt_i2o gets around the same issue.

Cc: Maciej W. Rozycki
Cc: Matt Wang
Cc: Khalid Aziz
Signed-off-by: Arnd Bergmann
---
  Documentation/scsi/BusLogic.rst   |  581 ---
  Documentation/scsi/FlashPoint.rst |  176 -
  MAINTAINERS   |7 -
  drivers/scsi/BusLogic.c   | 3727 --
  drivers/scsi/BusLogic.h   | 1284 -
  drivers/scsi/FlashPoint.c | 7560 -
  drivers/scsi/Kconfig  |   24 -
  7 files changed, 13359 deletions(-)
  delete mode 100644 Documentation/scsi/BusLogic.rst
  delete mode 100644 Documentation/scsi/FlashPoint.rst
  delete mode 100644 drivers/scsi/BusLogic.c
  delete mode 100644 drivers/scsi/BusLogic.h
  delete mode 100644 drivers/scsi/FlashPoint.c


I would say no to removing the BusLogic driver. VirtualBox is another 
consumer of this driver. The driver is very old, but I would rather fix 
the issues than remove it as long as it still has users.


Thanks,
Khalid


Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-05-01 Thread Khalid Aziz
On 5/1/19 8:49 AM, Waiman Long wrote:
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> 
>> index 858b6c0b9a15..9b36da94760e 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -2997,6 +2997,12 @@
>>
>>   nox2apic    [X86-64,APIC] Do not enable x2APIC mode.
>>
>> +    noxpfo    [XPFO] Disable eXclusive Page Frame Ownership (XPFO)
>> +    when CONFIG_XPFO is on. Physical pages mapped into
>> +    user applications will also be mapped in the
>> +    kernel's address space as if CONFIG_XPFO was not
>> +    enabled.
>> +
>>   cpu0_hotplug    [X86] Turn on CPU0 hotplug feature when
>>   CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
>>   Some features depend on CPU0. Known dependencies are:
> 
> Given the big performance impact that XPFO can have. It should be off by
> default when configured. Instead, the xpfo option should be used to
> enable it.

Agreed. I plan to disable it by default in the next version of the
patch. This is likely to end up being a feature for extremely
security-conscious folks only, unless I or someone else comes up with a
further significant performance boost.

Thanks,
Khalid


Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-22 Thread Khalid Aziz
On 4/18/19 8:34 AM, Khalid Aziz wrote:
> On 4/17/19 11:41 PM, Kees Cook wrote:
>> On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski  wrote:
>>> I don't think this type of NX goof was ever the argument for XPFO.
>>> The main argument I've heard is that a malicious user program writes a
>>> ROP payload into user memory (regular anonymous user memory) and then
>>> gets the kernel to erroneously set RSP (*not* RIP) to point there.
>>
>> Well, more than just ROP. Any of the various attack primitives. The NX
>> stuff is about moving RIP: SMEP-bypassing. But there is still basic
>> SMAP-bypassing for putting a malicious structure in userspace and
>> having the kernel access it via the linear mapping, etc.
>>
>>> I find this argument fairly weak for a couple reasons.  First, if
>>> we're worried about this, let's do in-kernel CFI, not XPFO, to
>>
>> CFI is getting much closer. Getting the kernel happy under Clang, LTO,
>> and CFI is under active development. (It's functional for arm64
>> already, and pieces have been getting upstreamed.)
>>
> 
> CFI theoretically offers protection with fairly low overhead. I have not
> played much with CFI in clang. I agree with Linus that probability of
> bugs in XPFO implementation itself is a cause of concern. If CFI in
> Clang can provide us the same level of protection as XPFO does, I
> wouldn't want to push for an expensive change like XPFO.
> 
> If Clang/CFI can't get us there for extended period of time, does it
> make sense to continue to poke at XPFO?

Any feedback on continued effort on XPFO? If it makes sense to have XPFO
available as a solution for the ret2dir issue in case Clang/CFI does not
work out, I will continue to refine it.

--
Khalid



Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-18 Thread Khalid Aziz
On 4/17/19 11:41 PM, Kees Cook wrote:
> On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski  wrote:
>> I don't think this type of NX goof was ever the argument for XPFO.
>> The main argument I've heard is that a malicious user program writes a
>> ROP payload into user memory (regular anonymous user memory) and then
>> gets the kernel to erroneously set RSP (*not* RIP) to point there.
> 
> Well, more than just ROP. Any of the various attack primitives. The NX
> stuff is about moving RIP: SMEP-bypassing. But there is still basic
> SMAP-bypassing for putting a malicious structure in userspace and
> having the kernel access it via the linear mapping, etc.
> 
>> I find this argument fairly weak for a couple reasons.  First, if
>> we're worried about this, let's do in-kernel CFI, not XPFO, to
> 
> CFI is getting much closer. Getting the kernel happy under Clang, LTO,
> and CFI is under active development. (It's functional for arm64
> already, and pieces have been getting upstreamed.)
> 

CFI theoretically offers protection with fairly low overhead. I have not
played much with CFI in clang. I agree with Linus that probability of
bugs in XPFO implementation itself is a cause of concern. If CFI in
Clang can provide us the same level of protection as XPFO does, I
wouldn't want to push for an expensive change like XPFO.

If Clang/CFI can't get us there for extended period of time, does it
make sense to continue to poke at XPFO?

Thanks,
Khalid



Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-17 Thread Khalid Aziz
On 4/17/19 1:49 PM, Andy Lutomirski wrote:
> On Wed, Apr 17, 2019 at 10:33 AM Khalid Aziz  wrote:
>>
>> On 4/17/19 11:09 AM, Ingo Molnar wrote:
>>>
>>> * Khalid Aziz  wrote:
>>>
>>>>> I.e. the original motivation of the XPFO patches was to prevent execution
>>>>> of direct kernel mappings. Is this motivation still present if those
>>>>> mappings are non-executable?
>>>>>
>>>>> (Sorry if this has been asked and answered in previous discussions.)
>>>>
>>>> Hi Ingo,
>>>>
>>>> That is a good question. Because of the cost of XPFO, we have to be very
>>>> sure we need this protection. The paper from Vasileios, Michalis and
>>>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>>>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>>>> and 6.2.
>>>
>>> So it would be nice if you could generally summarize external arguments
>>> when defending a patchset, instead of me having to dig through a PDF
>>> which not only causes me to spend time that you probably already spent
>>> reading that PDF, but I might also interpret it incorrectly. ;-)
>>
>> Sorry, you are right. Even though that paper explains it well, a summary
>> is always useful.
>>
>>>
>>> The PDF you cited says this:
>>>
>>>   "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced
>>>in many platforms, including x86-64.  In our example, the content of
>>>user address 0xBEEF000 is also accessible through kernel address
>>>0x87FF9F08 as plain, executable code."
>>>
>>> Is this actually true of modern x86-64 kernels? We've locked down W^X
>>> protections in general.
>>>
>>> I.e. this conclusion:
>>>
>>>   "Therefore, by simply overwriting kfptr with 0x87FF9F08 and
>>>triggering the kernel to dereference it, an attacker can directly
>>>execute shell code with kernel privileges."
>>>
>>> ... appears to be predicated on imperfect W^X protections on the x86-64
>>> kernel.
>>>
>>> Do such holes exist on the latest x86-64 kernel? If yes, is there a
>>> reason to believe that these W^X holes cannot be fixed, or that any fix
>>> would be more expensive than XPFO?
>>
>> Even if physmap is not executable, return-oriented programming (ROP) can
>> still be used to launch an attack. Instead of placing executable code at
>> user address 0xBEEF000, attacker can place an ROP payload there. kfptr
>> is then overwritten to point to a stack-pivoting gadget. Using the
>> physmap address aliasing, the ROP payload becomes kernel-mode stack. The
>> execution can then be hijacked upon execution of ret instruction. This
>> is a gist of the subsection titled "Non-executable physmap" under
>> section 6.2 and it looked convincing enough to me. If you have a
>> different take on this, I am very interested in your point of view.
> 
> My issue with all this is that XPFO is really very expensive.  I think
> that, if we're going to seriously consider upstreaming expensive
> exploit mitigations like this, we should consider others first, in
> particular CFI techniques.  grsecurity's RAP would be a great start.
> I also proposed using a gcc plugin (or upstream gcc feature) to add
> some instrumentation to any code that pops RSP to verify that the
> resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes.
> This will make ROP quite a bit harder.
> 

Yes, XPFO is expensive. I have been able to reduce the overhead of XPFO
from 2537% to 28% (on large servers), but 28% is still quite significant.
Alternative mitigation techniques with lower impact would easily be more
acceptable as long as they provide the same level of protection. If we
have to go with XPFO, we will continue to look for more performance
improvements to bring that number down further from 28%. Hopefully what
Tycho is working on will yield better results. I am continuing to look
for improvements to XPFO in parallel.

Thanks,
Khalid



Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-17 Thread Khalid Aziz
On 4/17/19 11:09 AM, Ingo Molnar wrote:
> 
> * Khalid Aziz  wrote:
> 
>>> I.e. the original motivation of the XPFO patches was to prevent execution 
>>> of direct kernel mappings. Is this motivation still present if those 
>>> mappings are non-executable?
>>>
>>> (Sorry if this has been asked and answered in previous discussions.)
>>
>> Hi Ingo,
>>
>> That is a good question. Because of the cost of XPFO, we have to be very
>> sure we need this protection. The paper from Vasileios, Michalis and
>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>> and 6.2.
> 
> So it would be nice if you could generally summarize external arguments 
> when defending a patchset, instead of me having to dig through a PDF 
> which not only causes me to spend time that you probably already spent 
> reading that PDF, but I might also interpret it incorrectly. ;-)

Sorry, you are right. Even though that paper explains it well, a summary
is always useful.

> 
> The PDF you cited says this:
> 
>   "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
>in many platforms, including x86-64.  In our example, the content of 
>user address 0xBEEF000 is also accessible through kernel address 
>0x87FF9F08 as plain, executable code."
> 
> Is this actually true of modern x86-64 kernels? We've locked down W^X 
> protections in general.
> 
> I.e. this conclusion:
> 
>   "Therefore, by simply overwriting kfptr with 0x87FF9F08 and 
>triggering the kernel to dereference it, an attacker can directly 
>execute shell code with kernel privileges."
> 
> ... appears to be predicated on imperfect W^X protections on the x86-64 
> kernel.
> 
> Do such holes exist on the latest x86-64 kernel? If yes, is there a 
> reason to believe that these W^X holes cannot be fixed, or that any fix 
> would be more expensive than XPFO?

Even if physmap is not executable, return-oriented programming (ROP) can
still be used to launch an attack. Instead of placing executable code at
user address 0xBEEF000, attacker can place an ROP payload there. kfptr
is then overwritten to point to a stack-pivoting gadget. Using the
physmap address aliasing, the ROP payload becomes kernel-mode stack. The
execution can then be hijacked upon execution of ret instruction. This
is a gist of the subsection titled "Non-executable physmap" under
section 6.2 and it looked convincing enough to me. If you have a
different take on this, I am very interested in your point of view.

Thanks,
Khalid




Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-17 Thread Khalid Aziz
On 4/17/19 10:15 AM, Ingo Molnar wrote:
> 
> [ Sorry, had to trim the Cc: list from hell. Tried to keep all the 
>   mailing lists and all x86 developers. ]
> 
> * Khalid Aziz  wrote:
> 
>> From: Juerg Haefliger 
>>
>> This patch adds basic support infrastructure for XPFO which protects 
>> against 'ret2dir' kernel attacks. The basic idea is to enforce 
>> exclusive ownership of page frames by either the kernel or userspace, 
>> unless explicitly requested by the kernel. Whenever a page destined for 
>> userspace is allocated, it is unmapped from physmap (the kernel's page 
>> table). When such a page is reclaimed from userspace, it is mapped back 
>> to physmap. Individual architectures can enable full XPFO support using 
>> this infrastructure by supplying architecture specific pieces.
> 
> I have a higher level, meta question:
> 
> Is there any updated analysis outlining why this XPFO overhead would be 
> required on x86-64 kernels running on SMAP/SMEP CPUs which should be all 
> recent Intel and AMD CPUs, and with kernel that mark all direct kernel 
> mappings as non-executable - which should be all reasonably modern 
> kernels later than v4.0 or so?
> 
> I.e. the original motivation of the XPFO patches was to prevent execution 
> of direct kernel mappings. Is this motivation still present if those 
> mappings are non-executable?
> 
> (Sorry if this has been asked and answered in previous discussions.)

Hi Ingo,

That is a good question. Because of the cost of XPFO, we have to be very
sure we need this protection. The paper from Vasileios, Michalis and
Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
and 6.2.

Thanks,
Khalid




Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-05 Thread Khalid Aziz
On 4/5/19 10:27 AM, Andy Lutomirski wrote:
> At the risk of asking stupid questions: we already have a mechanism for this: 
> highmem.  Can we enable highmem on x86_64, maybe with some heuristics to make 
> it work well?
> 

I proposed using highmem infrastructure for xpfo in my cover letter as
well as in my earlier replies discussing redesigning
xpfo_kmap/xpfo_kunmap. So that sounds like a reasonable question to me
:) Looks like we might be getting to an agreement that highmem
infrastructure is a good thing to use here.

--
Khalid



Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-05 Thread Khalid Aziz
On 4/5/19 9:24 AM, Andy Lutomirski wrote:
> 
> 
>> On Apr 5, 2019, at 8:44 AM, Dave Hansen  wrote:
>>
>> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
 process. Is that an acceptable trade-off?
>>> You are not seriously asking whether creating a user controllable ret2dir
>>> attack window is a acceptable trade-off? April 1st was a few days ago.
>>
>> Well, let's not forget that this set at least takes us from "always
>> vulnerable to ret2dir" to a choice between:
>>
>> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
>> 2. slow and "mitigated against ret2dir"
>>
>> Sounds like we need a mechanism that will do the deferred XPFO TLB
>> flushes whenever the kernel is entered, and not _just_ at context switch
>> time.  This permits an app to run in userspace with stale kernel TLB
>> entries as long as it wants... that's harmless.
> 
> I don’t think this is good enough. The bad guys can enter the kernel and 
> arrange for the kernel to wait, *in kernel*, for long enough to set up the 
> attack.  userfaultfd is the most obvious way, but there are plenty. I suppose 
> we could do the flush at context switch *and* entry.  I bet that performance 
> still utterly sucks, though — on many workloads, this turns every entry into 
> a full flush, and we already know exactly how much that sucks — it’s 
> identical to KPTI without PCID.  (And yes, if we go this route, we need to 
> merge this logic together — we shouldn’t write CR3 twice on entry).

The performance impact of a flush at kernel entry might not be all that
much. This flush will happen only if there is a pending flush posted to
the processor, and it will be done in lieu of the flush at the next
context switch. So we are not looking at adding more TLB flushes, rather
changing where they happen. That can still result in some performance
impact, and measuring it with real code will be the only way to get that
number.

> 
> I feel like this whole approach is misguided. ret2dir is not such a game 
> changer that fixing it is worth huge slowdowns. I think all this effort 
> should be spent on some kind of sensible CFI. For example, we should be able 
> to mostly squash ret2anything by inserting a check that the high bits of RSP 
> match the value on the top of the stack before any code that pops RSP.  On an 
> FPO build, there aren’t all that many hot POP RSP instructions, I think.
> 
> (Actually, checking the bits is suboptimal. Do:
> 
> unsigned long offset = *rsp - rsp;
> offset >>= THREAD_SHIFT;
> if (unlikely(offset))
> BUG();
> POP RSP;
> 
> This means that it’s also impossible to trick a function to return into a 
> buffer that is on that function’s stack.)
> 
> In other words, I think that ret2dir is an insufficient justification for 
> XPFO.
> 

That is something we may want to explore further. Closing down
/proc/<pid>/pagemap has already helped close off one way to mount a
ret2dir attack. The physmap spraying technique still remains viable. The
XPFO implementation is expensive. Can we do something different to
mitigate physmap spraying?

Thanks,
Khalid



Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-05 Thread Khalid Aziz
On 4/5/19 8:44 AM, Dave Hansen wrote:
> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>> process. Is that an acceptable trade-off?
>> You are not seriously asking whether creating a user controllable ret2dir
>> attack window is a acceptable trade-off? April 1st was a few days ago.
> 
> Well, let's not forget that this set at least takes us from "always
> vulnerable to ret2dir" to a choice between:
> 
> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
> 2. slow and "mitigated against ret2dir"
> 
> Sounds like we need a mechanism that will do the deferred XPFO TLB
> flushes whenever the kernel is entered, and not _just_ at context switch
> time.  This permits an app to run in userspace with stale kernel TLB
> entries as long as it wants... that's harmless.

That sounds like a good idea. This TLB flush only needs to be done at
kernel entry when there is a pending flush posted for that cpu. What
this does is move the cost of the TLB flush from the next context switch
to kernel entry, and it does not add any more flushes than what we are
doing with these xpfo patches. So the overall performance impact might
not change much. It seems worth coding up.
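
Roughly, I am picturing something like this on the entry path (only a
sketch; pending_xpfo_flush is a made-up name for the cpumask that the
xpfo unmap path would set for all other cpus, and where exactly this
hooks into the entry code would need care):

	/* Consume a posted xpfo flush when entering the kernel. */
	static __always_inline void xpfo_flush_on_kernel_entry(void)
	{
		if (cpumask_test_and_clear_cpu(smp_processor_id(),
					       &pending_xpfo_flush))
			__flush_tlb_all();
	}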

Thanks,
Khalid



Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-04 Thread Khalid Aziz
On 4/3/19 10:10 PM, Andy Lutomirski wrote:
> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  wrote:
>>
>> XPFO flushes kernel space TLB entries for pages that are now mapped
>> in userspace on not only the current CPU but also all other CPUs
>> synchronously. Processes on each core allocating pages causes a
>> flood of IPI messages to all other cores to flush TLB entries.
>> Many of these messages are to flush the entire TLB on the core if
>> the number of entries being flushed from local core exceeds
>> tlb_single_page_flush_ceiling. The cost of TLB flush caused by
>> unmapping pages from physmap goes up dramatically on machines with
>> high core count.
>>
>> This patch flushes relevant TLB entries for current process or
>> entire TLB depending upon number of entries for the current CPU
>> and posts a pending TLB flush on all other CPUs when a page is
>> unmapped from kernel space and mapped in userspace. Each core
>> checks the pending TLB flush flag for itself on every context
>> switch, flushes its TLB if the flag is set and clears it.
>> This patch potentially aggregates multiple TLB flushes into one.
>> This has very significant impact especially on machines with large
>> core counts.
> 
> Why is this a reasonable strategy?

Ideally, when pages are unmapped from physmap, all CPUs would be sent an
IPI synchronously to flush the TLB entries for those pages immediately.
That may be ideal from a correctness and consistency point of view, but
it also results in an IPI storm and repeated TLB flushes on all
processors. Any time a page is allocated to userspace, we are going to
go through this, and it is very expensive. On a 96-core server, the
performance degradation is 26x!!

When xpfo unmaps a page from physmap only (after mapping the page in
userspace in response to an allocation request from userspace) on one
processor, there is a small window of opportunity for a ret2dir attack
on other cpus until the physmap TLB entries for the unmapped page are
cleared on those cpus. Forcing that to happen synchronously is the
expensive part. Multiple such requests can come in over a very short
time across multiple processors, resulting in every cpu asking every
other cpu to flush its TLB just to close this small window of
vulnerability in the kernel. If each request is processed synchronously,
each CPU will do multiple TLB flushes in short order. If we could
instead consolidate these TLB flush requests and do one TLB flush on
each cpu at the time of context switch, we can reduce the performance
impact significantly. This bears out in real life when measuring the
system time of a parallel kernel build on a large server. Without this,
system time on a 96-core server doing "make -j60 all" went up 26x.
After this optimization, the impact went down to 1.44x.
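
In rough pseudo-C, the idea looks like this (a sketch only, not the
patch itself; pending_xpfo_flush is illustrative, and the real code
chooses between a range flush and a full flush based on
tlb_single_page_flush_ceiling instead of always doing a full flush):

	/* cpus that still owe a flush for previously unmapped pages */
	static cpumask_t pending_xpfo_flush;

	void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
	{
		struct cpumask tmp_mask;

		/* Flush the local cpu immediately. */
		__flush_tlb_all();

		/* Post a pending flush for every other online cpu. */
		cpumask_copy(&tmp_mask, cpu_online_mask);
		cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
		cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
	}

	/* Called on each cpu from the context switch path. */
	void xpfo_flush_tlb_if_pending(void)
	{
		if (cpumask_test_and_clear_cpu(smp_processor_id(),
					       &pending_xpfo_flush))
			__flush_tlb_all();
	}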

The trade-off with this strategy is that the kernel on a cpu is
vulnerable for a short time if the currently running task on that cpu is
the malicious process. Is that an acceptable trade-off?

I am open to other ideas on reducing the performance impact due to xpfo.

> 
>> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
>> +{
>> +   struct cpumask tmp_mask;
>> +
>> +   /*
>> +* Balance as user space task's flush, a bit conservative.
>> +* Do a local flush immediately and post a pending flush on all
>> +* other CPUs. Local flush can be a range flush or full flush
>> +* depending upon the number of entries to be flushed. Remote
>> +* flushes will be done by individual processors at the time of
>> +* context switch and this allows multiple flush requests from
>> +* other CPUs to be batched together.
>> +*/
> 
> I don't like this function at all.  A core function like this is a
> contract of sorts between the caller and the implementation.  There is
> no such thing as an "xpfo" flush, and this function's behavior isn't
> at all well defined.  For flush_tlb_kernel_range(), I can tell you
> exactly what that function does, and the implementation is either
> correct or incorrect.  With this function, I have no idea what is
> actually required, and I can't possibly tell whether it's correct.
> 
> As far as I can see, xpfo_flush_tlb_kernel_range() actually means
> "flush this range on this CPU right now, and flush it on remote CPUs
> eventually".  It would be valid, but probably silly, to flush locally
> and to never flush at all on remote CPUs.  This makes me wonder what
> the point is.
> 

I would restate that as "flush this range on this cpu right now, and
flush it on remote cpus at their next context switch". A better name for
the routine and a better description are a reasonable change to make.
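
For instance, something like this for the name and the contract (purely
illustrative):

	/**
	 * xpfo_flush_tlb_kernel_range_deferred() - flush a kernel VA range
	 * @start: start of the range
	 * @end: end of the range
	 *
	 * Flush the range on the local cpu immediately and post a flush for
	 * all other cpus, which they will perform no later than their next
	 * context switch.
	 */
	void xpfo_flush_tlb_kernel_range_deferred(unsigned long start,
						  unsigned long end);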

Thanks,
Khalid




Re: [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap

2019-04-04 Thread Khalid Aziz
On 4/4/19 1:56 AM, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:12AM -0600, Khalid Aziz wrote:
>> From: Julian Stecklina 
>>
>> Only the xpfo_kunmap call that needs to actually unmap the page
>> needs to be serialized. We need to be careful to handle the case,
>> where after the atomic decrement of the mapcount, a xpfo_kmap
>> increased the mapcount again. In this case, we can safely skip
>> modifying the page table.
>>
>> Model-checked with up to 4 concurrent callers with Spin.
>>
>> Signed-off-by: Julian Stecklina 
>> Signed-off-by: Khalid Aziz 
>> Cc: Khalid Aziz 
>> Cc: x...@kernel.org
>> Cc: kernel-harden...@lists.openwall.com
>> Cc: Vasileios P. Kemerlis 
>> Cc: Juerg Haefliger 
>> Cc: Tycho Andersen 
>> Cc: Marco Benatto 
>> Cc: David Woodhouse 
>> ---
>>  include/linux/xpfo.h | 24 +++-
>>  1 file changed, 15 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
>> index 2318c7eb5fb7..37e7f52fa6ce 100644
>> --- a/include/linux/xpfo.h
>> +++ b/include/linux/xpfo.h
>> @@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page 
>> *page)
>>  static inline void xpfo_kunmap(void *kaddr, struct page *page)
>>  {
>>  unsigned long flags;
>> +bool flush_tlb = false;
>>  
>>  if (!static_branch_unlikely(&xpfo_inited))
>>  return;
>> @@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page 
>> *page)
>>   * The page is to be allocated back to user space, so unmap it from
>>   * the kernel, flush the TLB and tag it as a user page.
>>   */
>> -spin_lock_irqsave(&page->xpfo_lock, flags);
>> -
>>  if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
>> -#ifdef CONFIG_XPFO_DEBUG
>> -WARN_ON(PageXpfoUnmapped(page));
>> -#endif
>> -SetPageXpfoUnmapped(page);
>> -set_kpte(kaddr, page, __pgprot(0));
>> -xpfo_flush_kernel_tlb(page, 0);
>> +spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +/*
>> + * In the case, where we raced with kmap after the
>> + * atomic_dec_return, we must not nuke the mapping.
>> + */
>> +if (atomic_read(&page->xpfo_mapcount) == 0) {
>> +SetPageXpfoUnmapped(page);
>> +set_kpte(kaddr, page, __pgprot(0));
>> +flush_tlb = true;
>> +}
>> +spin_unlock_irqrestore(&page->xpfo_lock, flags);
>>  }
>>  
>> -spin_unlock_irqrestore(&page->xpfo_lock, flags);
>> +if (flush_tlb)
>> +xpfo_flush_kernel_tlb(page, 0);
>>  }
> 
> This doesn't help with the TLB invalidation issue, AFAICT this is still
> completely buggered. kunmap_atomic() can be called from IRQ context.
> 

OK. xpfo_kmap/xpfo_kunmap need redesign.

--
Khalid



Re: [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64

2019-04-04 Thread Khalid Aziz
[trimmed To: and Cc: It was too large]

On 4/4/19 1:52 AM, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:05AM -0600, Khalid Aziz wrote:
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 2779ace16d23..5c0e1581fa56 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
>>  return boot_cpu_has_bug(X86_BUG_L1TF);
>>  }
>>  
>> +/*
>> + * The current flushing context - we pass it instead of 5 arguments:
>> + */
>> +struct cpa_data {
>> +unsigned long   *vaddr;
>> +pgd_t   *pgd;
>> +pgprot_tmask_set;
>> +pgprot_tmask_clr;
>> +unsigned long   numpages;
>> +unsigned long   curpage;
>> +unsigned long   pfn;
>> +unsigned intflags;
>> +unsigned intforce_split : 1,
>> +force_static_prot   : 1;
>> +struct page **pages;
>> +};
>> +
>> +
>> +int
>> +should_split_large_page(pte_t *kpte, unsigned long address,
>> +struct cpa_data *cpa);
>> +extern spinlock_t cpa_lock;
>> +int
>> +__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
>> +   struct page *base);
>> +
> 
> I really hate exposing all that.

I believe this was done so set_kpte() could split large pages if needed.
I will look into creating a helper function instead so this does not
have to be exposed.
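
For example, something along these lines so that struct cpa_data stays
private to pageattr.c (hypothetical name and signature; locking via
cpa_lock and the mask_clr handling for 2M/1G entries are omitted for
brevity):

	/* In arch/x86/mm/pageattr.c: split the large page backing 'address'. */
	int split_kernel_large_page(pte_t *kpte, unsigned long address,
				    pgprot_t prot)
	{
		struct cpa_data cpa = {
			.vaddr = &address,
			.mask_set = prot,
			.numpages = 1,
		};
		struct page *base;

		if (!should_split_large_page(kpte, address, &cpa))
			return 0;

		base = alloc_pages(GFP_ATOMIC, 0);
		if (!base)
			return -ENOMEM;

		return __split_large_page(&cpa, kpte, address, base);
	}

set_kpte() would then only call this helper for PG_LEVEL_2M/PG_LEVEL_1G
entries and never see cpa_data at all.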

> 
>>  #include 
>>  #endif  /* __ASSEMBLY__ */
>>  
> 
>> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
>> new file mode 100644
>> index ..3045bb7e4659
>> --- /dev/null
>> +++ b/arch/x86/mm/xpfo.c
>> @@ -0,0 +1,123 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + *
>> + * Authors:
>> + *   Juerg Haefliger 
>> + *   Vasileios P. Kemerlis 
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published 
>> by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include 
>> +
>> +#include 
>> +
>> +extern spinlock_t cpa_lock;
>> +
>> +/* Update a single kernel page table entry */
>> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
>> +{
>> +unsigned int level;
>> +pgprot_t msk_clr;
>> +pte_t *pte = lookup_address((unsigned long)kaddr, &level);
>> +
>> +if (unlikely(!pte)) {
>> +WARN(1, "xpfo: invalid address %p\n", kaddr);
>> +return;
>> +}
>> +
>> +switch (level) {
>> +case PG_LEVEL_4K:
>> +set_pte_atomic(pte, pfn_pte(page_to_pfn(page),
>> +   canon_pgprot(prot)));
> 
> (sorry, do we also need a nikon_pgprot() ? :-)

Are we trying to encourage nikon as well to sponsor this patch :)

> 
>> +break;
>> +case PG_LEVEL_2M:
>> +case PG_LEVEL_1G: {
>> +struct cpa_data cpa = { };
>> +int do_split;
>> +
>> +if (level == PG_LEVEL_2M)
>> +msk_clr = pmd_pgprot(*(pmd_t *)pte);
>> +else
>> +msk_clr = pud_pgprot(*(pud_t *)pte);
>> +
>> +cpa.vaddr = kaddr;
>> +cpa.pages = &page;
>> +cpa.mask_set = prot;
>> +cpa.mask_clr = msk_clr;
>> +cpa.numpages = 1;
>> +cpa.flags = 0;
>> +cpa.curpage = 0;
>> +cpa.force_split = 0;
>> +
>> +
>> +do_split = should_split_large_page(pte, (unsigned long)kaddr,
>> +   &cpa);
>> +if (do_split) {
>> +struct page *base;
>> +
>> +base = alloc_pages(GFP_ATOMIC, 0);
>> +if (!base) {
>> +WARN(1, "xpfo: failed to split large page\n");
> 
> You have to be fcking kidding right? A WARN when a GFP_ATOMIC allocation
> fails?!
> 

Not sure what the reasoning was for this WARN in the original patch, but
I think this is trying to warn about the failure to split the large page
in order to unmap this single page, as opposed to warning about the
allocation failure itself.
Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-04 Thread Khalid Aziz
On 4/4/19 1:43 AM, Peter Zijlstra wrote:
> 
> You must be so glad I no longer use kmap_atomic from NMI context :-)
> 
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
>> +static inline void xpfo_kmap(void *kaddr, struct page *page)
>> +{
>> +unsigned long flags;
>> +
>> +if (!static_branch_unlikely(&xpfo_inited))
>> +return;
>> +
>> +if (!PageXpfoUser(page))
>> +return;
>> +
>> +/*
>> + * The page was previously allocated to user space, so
>> + * map it back into the kernel if needed. No TLB flush required.
>> + */
>> +spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
>> +TestClearPageXpfoUnmapped(page))
>> +set_kpte(kaddr, page, PAGE_KERNEL);
>> +
>> +spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
> That's a really sad sequence, not wrong, but sad. _3_ atomic operations,
> 2 on likely the same cacheline. And mostly all pointless.
> 
> This patch makes xpfo_mapcount an atomic, but then all modifications are
> under the spinlock, what gives?
> 
> Anyway, a possibly saner sequence might be:
> 
>   if (atomic_inc_not_zero(&page->xpfo_mapcount))
>   return;
> 
>   spin_lock_irqsave(&page->xpfo_lock, flag);
>   if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
>   TestClearPageXpfoUnmapped(page))
>   set_kpte(kaddr, page, PAGE_KERNEL);
>   spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
>> +}
>> +
>> +static inline void xpfo_kunmap(void *kaddr, struct page *page)
>> +{
>> +unsigned long flags;
>> +
>> +if (!static_branch_unlikely(&xpfo_inited))
>> +return;
>> +
>> +if (!PageXpfoUser(page))
>> +return;
>> +
>> +/*
>> + * The page is to be allocated back to user space, so unmap it from
>> + * the kernel, flush the TLB and tag it as a user page.
>> + */
>> +spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
>> +#ifdef CONFIG_XPFO_DEBUG
>> +WARN_ON(PageXpfoUnmapped(page));
>> +#endif
>> +SetPageXpfoUnmapped(page);
>> +set_kpte(kaddr, page, __pgprot(0));
>> +xpfo_flush_kernel_tlb(page, 0);
> 
> You didn't speak about the TLB invalidation anywhere. But basically this
> is one that x86 cannot do.
> 
>> +}
>> +
>> +spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
> Idem:
> 
>   if (atomic_add_unless(&page->xpfo_mapcount, -1, 1))
>   return;
> 
>   
> 
> 
>> +}
> 
> Also I'm failing to see the point of PG_xpfo_unmapped, afaict it
> is identical to !atomic_read(&page->xpfo_mapcount).
> 

Thanks Peter. I really appreciate your review. Your feedback helps make
this code better and closer to where I can feel comfortable not calling
it RFC any more.

The more I look at the xpfo_kmap()/xpfo_kunmap() code, the more
uncomfortable I get with it. Your point about calling kmap_atomic from
NMI context just makes the kmap_atomic code look even worse. I pointed
out one problem with this code in the cover letter and suggested a
rewrite. I see these problems with this code:

1. When xpfo_kmap maps a page back into physmap, it opens up the ret2dir
attack security hole again, even if just for the duration of the kmap. A
kmap can stay around for some time if the page is being used for I/O.

2. This code uses a spinlock, which leads to problems. If it does not
disable IRQs, it is exposed to a deadlock around xpfo_lock. If it does
disable IRQs, I think it can still deadlock around pgd_lock.

I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map
the page at a new virtual address, similar to what kmap_high() does for
i386. This avoids re-opening the ret2dir security hole. We could also
possibly do away with xpfo_lock, saving bytes in the page frame, and the
not-so-sane code sequence could go away.
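
As a very rough illustration of the direction I mean (hypothetical code,
not a design: vmap() cannot be used from atomic context, so a real
implementation would need kmap_high()/kmap_atomic-style slots, and the
callers would have to use the address returned here):

	/* Map an XPFO user page at a temporary alias instead of physmap. */
	static void *xpfo_kmap_alias(struct page *page)
	{
		if (!PageXpfoUser(page))
			return page_address(page);

		return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
	}

	static void xpfo_kunmap_alias(void *kaddr, struct page *page)
	{
		if (PageXpfoUser(page))
			vunmap(kaddr);
	}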

Good point about the PG_xpfo_unmapped flag. You are right that it can be
replaced with !atomic_read(&page->xpfo_mapcount).

Thanks,
Khalid



[RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

This patch adds support for XPFO for x86-64. It uses the generic
infrastructure in place for XPFO and adds the architecture specific
bits to enable XPFO on x86-64.

CC: x...@kernel.org
Suggested-by: Vasileios P. Kemerlis 
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Marco Benatto 
Signed-off-by: Julian Stecklina 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 .../admin-guide/kernel-parameters.txt |  10 +-
 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/pgtable.h|  26 
 arch/x86/mm/Makefile  |   2 +
 arch/x86/mm/pageattr.c|  23 +---
 arch/x86/mm/xpfo.c| 123 ++
 include/linux/xpfo.h  |   2 +
 7 files changed, 162 insertions(+), 25 deletions(-)
 create mode 100644 arch/x86/mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 9b36da94760e..e65e3bc1efe0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2997,11 +2997,11 @@
 
nox2apic[X86-64,APIC] Do not enable x2APIC mode.
 
-   noxpfo  [XPFO] Disable eXclusive Page Frame Ownership (XPFO)
-   when CONFIG_XPFO is on. Physical pages mapped into
-   user applications will also be mapped in the
-   kernel's address space as if CONFIG_XPFO was not
-   enabled.
+   noxpfo  [XPFO,X86-64] Disable eXclusive Page Frame
+   Ownership (XPFO) when CONFIG_XPFO is on. Physical
+   pages mapped into user applications will also be
+   mapped in the kernel's address space as if
+   CONFIG_XPFO was not enabled.
 
cpu0_hotplug[X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68261430fe6e..122786604252 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -209,6 +209,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMESif PROC_FS
+   select ARCH_SUPPORTS_XPFO   if X86_64
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2779ace16d23..5c0e1581fa56 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+   unsigned long   *vaddr;
+   pgd_t   *pgd;
+   pgprot_tmask_set;
+   pgprot_tmask_clr;
+   unsigned long   numpages;
+   unsigned long   curpage;
+   unsigned long   pfn;
+   unsigned intflags;
+   unsigned intforce_split : 1,
+   force_static_prot   : 1;
+   struct page **pages;
+};
+
+
+int
+should_split_large_page(pte_t *kpte, unsigned long address,
+   struct cpa_data *cpa);
+extern spinlock_t cpa_lock;
+int
+__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
+  struct page *base);
+
 #include 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..93b0fdaf4a99 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION)+= pti.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_boot.o
+
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 14e6119838a6..530b5df0617e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -28,23 +28,6 @@
 
 #include "mm_internal.h"
 
-/*
- * The current flushing context - we pass it instead of 5 arguments:
- */
-struct cpa_data {
-   unsigned long   *vaddr;
-   pgd_t   *pgd;
-   pgprot_tmask_set;
-   pgprot_tmask_clr;
-   unsigned long   numpages;
-   unsigned long   curpage;
-   unsigned long   pfn;
-   unsigned intflags;
-   unsigned intforce_split : 1,
-   force_static_prot   : 1;
-   struct page **pages;
-};
-
 enum cpa_warn {
CPA_CONFLICT,
CPA_PROTECT,
@@ -59,7 +42,7 @@ static const int cpa_warn_level = CPA_PROTECT;
  * entries change the page attribute in parallel to some other cpu
  * splitting a large page entry along with changing the attribute.
  

[RFC PATCH v9 05/13] mm: add a user_virt_to_phys symbol

2019-04-03 Thread Khalid Aziz
From: Tycho Andersen 

We need someting like this for testing XPFO. Since it's architecture
specific, putting it in the test code is slightly awkward, so let's make it
an arch-specific symbol and export it for use in LKDTM.

CC: linux-arm-ker...@lists.infradead.org
CC: x...@kernel.org
Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
v7: * make user_virt_to_phys a GPL symbol

v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case

 arch/x86/mm/xpfo.c   | 57 
 include/linux/xpfo.h |  4 
 2 files changed, 61 insertions(+)

diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index 3045bb7e4659..b42513347865 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -121,3 +121,60 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int 
order)
flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
 }
 EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+   phys_addr_t phys_addr;
+   unsigned long offset;
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   pgd = pgd_offset(current->mm, addr);
+   if (pgd_none(*pgd))
+   return 0;
+
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d))
+   return 0;
+
+   if (p4d_large(*p4d) || !p4d_present(*p4d)) {
+   phys_addr = (unsigned long)p4d_pfn(*p4d) << PAGE_SHIFT;
+   offset = addr & ~P4D_MASK;
+   goto out;
+   }
+
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud))
+   return 0;
+
+   if (pud_large(*pud) || !pud_present(*pud)) {
+   phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+   offset = addr & ~PUD_MASK;
+   goto out;
+   }
+
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return 0;
+
+   if (pmd_large(*pmd) || !pmd_present(*pmd)) {
+   phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+   offset = addr & ~PMD_MASK;
+   goto out;
+   }
+
+   pte =  pte_offset_kernel(pmd, addr);
+   phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+   offset = addr & ~PAGE_MASK;
+
+out:
+   return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(user_virt_to_phys);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index c1d232da7ee0..5d8d06e4b796 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -89,6 +89,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page)
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map);
 void xpfo_free_pages(struct page *page, int order);
 
+phys_addr_t user_virt_to_phys(unsigned long addr);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_init_single_page(struct page *page) { }
@@ -102,6 +104,8 @@ static inline void xpfo_free_pages(struct page *page, int 
order) { }
 static inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) { }
 static inline void xpfo_flush_kernel_tlb(struct page *page, int order) { }
 
+static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
+
 #endif /* CONFIG_XPFO */
 
 #if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP))
-- 
2.17.1



[RFC PATCH v9 01/13] mm: add MAP_HUGETLB support to vm_mmap

2019-04-03 Thread Khalid Aziz
From: Tycho Andersen 

vm_mmap is exported, which means kernel modules can use it. In particular,
for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so
let's support it via vm_mmap.

Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
Tested-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 include/linux/mm.h |  2 ++
 mm/mmap.c  | 19 +--
 mm/util.c  | 32 
 3 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..3e4f6525d06b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2412,6 +2412,8 @@ struct vm_unmapped_area_info {
 extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
 extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags);
+
 /*
  * Search for an unmapped address range.
  *
diff --git a/mm/mmap.c b/mm/mmap.c
index fc1809b1bed6..65382d942598 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1582,24 +1582,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, 
unsigned long len,
if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
-   struct user_struct *user = NULL;
-   struct hstate *hs;
-
-   hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
-   if (!hs)
-   return -EINVAL;
-
-   len = ALIGN(len, huge_page_size(hs));
-   /*
-* VM_NORESERVE is used because the reservations will be
-* taken when vm_ops->mmap() is called
-* A dummy user value is used because we are not locking
-* memory so no accounting is necessary
-*/
-   file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
-   VM_NORESERVE,
-   &user, HUGETLB_ANONHUGE_INODE,
-   (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+   file = map_hugetlb_setup(&len, flags);
if (IS_ERR(file))
return PTR_ERR(file);
}
diff --git a/mm/util.c b/mm/util.c
index 379319b1bcfd..86b763861828 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,6 +357,29 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned 
long addr,
return ret;
 }
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags)
+{
+   struct user_struct *user = NULL;
+   struct hstate *hs;
+
+   hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+   if (!hs)
+   return ERR_PTR(-EINVAL);
+
+   *len = ALIGN(*len, huge_page_size(hs));
+
+   /*
+* VM_NORESERVE is used because the reservations will be
+* taken when vm_ops->mmap() is called
+* A dummy user value is used because we are not locking
+* memory so no accounting is necessary
+*/
+   return hugetlb_file_setup(HUGETLB_ANON_FILE, *len,
+   VM_NORESERVE,
+   &user, HUGETLB_ANONHUGE_INODE,
+   (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+}
+
 unsigned long vm_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
@@ -366,6 +389,15 @@ unsigned long vm_mmap(struct file *file, unsigned long 
addr,
if (unlikely(offset_in_page(offset)))
return -EINVAL;
 
+   if (flag & MAP_HUGETLB) {
+   if (file)
+   return -EINVAL;
+
+   file = map_hugetlb_setup(&len, flag);
+   if (IS_ERR(file))
+   return PTR_ERR(file);
+   }
+
return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
 }
 EXPORT_SYMBOL(vm_mmap);
-- 
2.17.1



[RFC PATCH v9 07/13] arm64/mm: Add support for XPFO

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
provide a hook for updating a single kernel page table entry (which is
required by the generic XPFO code).

XPFO doesn't support section/contiguous mappings yet, so let's disable
them if XPFO is turned on.

Thanks to Laura Abbott for the simplification from v5, and Mark Rutland
for pointing out we need NO_CONT_MAPPINGS too.

CC: linux-arm-ker...@lists.infradead.org
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 .../admin-guide/kernel-parameters.txt |  2 +-
 arch/arm64/Kconfig|  1 +
 arch/arm64/mm/Makefile|  2 +
 arch/arm64/mm/mmu.c   |  2 +-
 arch/arm64/mm/xpfo.c  | 66 +++
 5 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm64/mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index e65e3bc1efe0..9fcf8c83031a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2997,7 +2997,7 @@
 
nox2apic[X86-64,APIC] Do not enable x2APIC mode.
 
-   noxpfo  [XPFO,X86-64] Disable eXclusive Page Frame
+   noxpfo  [XPFO,X86-64,ARM64] Disable eXclusive Page Frame
Ownership (XPFO) when CONFIG_XPFO is on. Physical
pages mapped into user applications will also be
mapped in the kernel's address space as if
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d366127..9a8d8e649cf8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -174,6 +174,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+   select ARCH_SUPPORTS_XPFO
help
  ARM 64-bit (AArch64) Linux support.
 
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..cca3808d9776 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
 
 obj-$(CONFIG_KASAN)+= kasan_init.o
 KASAN_SANITIZE_kasan_init.o:= n
+
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b6f5aa52ac67..1673f7443d62 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -453,7 +453,7 @@ static void __init map_mem(pgd_t *pgdp)
struct memblock_region *reg;
int flags = 0;
 
-   if (rodata_full || debug_pagealloc_enabled())
+   if (rodata_full || debug_pagealloc_enabled() || xpfo_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
/*
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
new file mode 100644
index ..7866c5acfffb
--- /dev/null
+++ b/arch/arm64/mm/xpfo.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger 
+ *   Vasileios P. Kemerlis 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include 
+#include 
+
+#include 
+
+/*
+ * Lookup the page table entry for a virtual address and return a pointer to
+ * the entry. Based on x86 tree.
+ */
+static pte_t *lookup_address(unsigned long addr)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd))
+   return NULL;
+
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+
+   return pte_offset_kernel(pmd, addr);
+}
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+   pte_t *pte = lookup_address((unsigned long)kaddr);
+
+   if (unlikely(!pte)) {
+   WARN(1, "xpfo: invalid address %p\n", kaddr);
+   return;
+   }
+
+   set_pte(pte, pfn_pte(page_to_pfn(page), prot));
+}
+EXPORT_SYMBOL_GPL(set_kpte);
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+   unsigned long kaddr = (unsigned long)page_address(page);
+   unsigned long size = PAGE_SIZE;
+
+   flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
+EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);
-- 
2.17.1



[RFC PATCH v9 13/13] xpfo, mm: Optimize XPFO TLB flushes by batching them together

2019-04-03 Thread Khalid Aziz
When XPFO forces a TLB flush on all cores, the performance impact is
very significant. Batching as many of these TLB updates as possible
can help lower this impact. When userspace allocates a page, the
kernel tries to get that page from the per-cpu free list. This free
list is replenished in bulk when it runs low. Replenishing the free
list for future userspace allocations is a good opportunity to update
TLB entries in batch and reduce the impact of multiple TLB flushes
later. This patch adds new tags for the page so a page can be marked
as available for userspace allocation and unmapped from the kernel
address space. All such pages are removed from the kernel address
space in bulk at the time they are added to the per-cpu free list (a
simplified sketch of the idea follows the numbers below). This patch,
when combined with deferred TLB flushes, improves performance further.
Using the same benchmark as before of building a kernel in parallel,
here are the system times on two differently sized systems:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

5.0                                      913.862s
5.0+XPFO+Deferred flush+Batch update    1165.259s   1.28x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

5.0                                      610.642s
5.0+XPFO+Deferred flush+Batch update     773.075s   1.27x
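
A simplified sketch of the idea follows; xpfo_unmap_batch() is an
illustrative name rather than a function in this patch, but set_kpte(),
SetPageXpfoUnmapped() and xpfo_flush_tlb_all() are the helpers used by
the series. The real changes live in the page allocator and mm/xpfo.c.

#include <linux/mm.h>
#include <linux/xpfo.h>

/*
 * Sketch: when a batch of pages is moved onto the per-cpu free list for
 * future userspace allocations, unmap them from the kernel in one pass
 * and tag them, so no per-page flush is needed at allocation time.
 */
static void xpfo_unmap_batch(struct list_head *batch)
{
        struct page *page;

        list_for_each_entry(page, batch, lru) {
                set_kpte(page_address(page), page, __pgprot(0));
                SetPageXpfoUnmapped(page);
        }

        /* A single (deferrable) kernel TLB flush covers the whole batch. */
        xpfo_flush_tlb_all();
}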

Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
Signed-off-by: Tycho Andersen 
---
v9:
- Do not map a page freed by userspace back into the kernel. Mark
 it as unmapped instead and map it back in only when needed. This
 avoids the cost of the unmap and TLB flush if the page is allocated
 back to userspace.

 arch/x86/include/asm/pgtable.h |  2 +-
 arch/x86/mm/pageattr.c |  9 --
 arch/x86/mm/xpfo.c | 11 +--
 include/linux/xpfo.h   | 11 +++
 mm/page_alloc.c|  9 ++
 mm/xpfo.c  | 54 +++---
 6 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5c0e1581fa56..61f64c6c687c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1461,7 +1461,7 @@ should_split_large_page(pte_t *kpte, unsigned long 
address,
 extern spinlock_t cpa_lock;
 int
 __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
-  struct page *base);
+  struct page *base, bool xpfo_split);
 
 #include 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 530b5df0617e..8fe86ac6bff0 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -911,7 +911,7 @@ static void split_set_pte(struct cpa_data *cpa, pte_t *pte, 
unsigned long pfn,
 
 int
 __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
-  struct page *base)
+  struct page *base, bool xpfo_split)
 {
unsigned long lpaddr, lpinc, ref_pfn, pfn, pfninc = 1;
pte_t *pbase = (pte_t *)page_address(base);
@@ -1008,7 +1008,10 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, 
unsigned long address,
 * page attribute in parallel, that also falls into the
 * just split large page entry.
 */
-   flush_tlb_all();
+   if (xpfo_split)
+   xpfo_flush_tlb_all();
+   else
+   flush_tlb_all();
spin_unlock(&cpa_lock);
 
return 0;
@@ -1027,7 +1030,7 @@ static int split_large_page(struct cpa_data *cpa, pte_t 
*kpte,
if (!base)
return -ENOMEM;
 
-   if (__split_large_page(cpa, kpte, address, base))
+   if (__split_large_page(cpa, kpte, address, base, false))
__free_page(base);
 
return 0;
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index 638eee5b1f09..8c482c7b54f5 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -47,7 +47,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t 
prot)
 
cpa.vaddr = kaddr;
cpa.pages = &page;
-   cpa.mask_set = prot;
+   cpa.mask_set = canon_pgprot(prot);
cpa.mask_clr = msk_clr;
cpa.numpages = 1;
cpa.flags = 0;
@@ -57,7 +57,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t 
prot)
 
do_split = should_split_large_page(pte, (unsigned long)kaddr,
   &cpa);
-   if (do_split) {
+   if (do_split > 0) {
struct page *base;
 
base = alloc_pages(GFP_ATOMIC, 0);
@@ -69,7 +69,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t 
prot)
if (!debug_pagealloc_enabled())
spin_lock(&cpa_lock);
if (__split_large_page(&cpa, pte, (unsigned long)kaddr,
-   b

[RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap

2019-04-03 Thread Khalid Aziz
From: Julian Stecklina 

Only the xpfo_kunmap call that needs to actually unmap the page
needs to be serialized. We need to be careful to handle the case
where, after the atomic decrement of the mapcount, an xpfo_kmap has
increased the mapcount again. In this case, we can safely skip
modifying the page table.

Model-checked with up to 4 concurrent callers with Spin.

Signed-off-by: Julian Stecklina 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
Cc: x...@kernel.org
Cc: kernel-harden...@lists.openwall.com
Cc: Vasileios P. Kemerlis 
Cc: Juerg Haefliger 
Cc: Tycho Andersen 
Cc: Marco Benatto 
Cc: David Woodhouse 
---
 include/linux/xpfo.h | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 2318c7eb5fb7..37e7f52fa6ce 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page *page)
 static inline void xpfo_kunmap(void *kaddr, struct page *page)
 {
unsigned long flags;
+   bool flush_tlb = false;
 
if (!static_branch_unlikely(_inited))
return;
@@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page 
*page)
 * The page is to be allocated back to user space, so unmap it from
 * the kernel, flush the TLB and tag it as a user page.
 */
-   spin_lock_irqsave(&page->xpfo_lock, flags);
-
if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
-#ifdef CONFIG_XPFO_DEBUG
-   WARN_ON(PageXpfoUnmapped(page));
-#endif
-   SetPageXpfoUnmapped(page);
-   set_kpte(kaddr, page, __pgprot(0));
-   xpfo_flush_kernel_tlb(page, 0);
+   spin_lock_irqsave(&page->xpfo_lock, flags);
+
+   /*
+* In the case where we raced with kmap after the
+* atomic_dec_return, we must not nuke the mapping.
+*/
+   if (atomic_read(&page->xpfo_mapcount) == 0) {
+   SetPageXpfoUnmapped(page);
+   set_kpte(kaddr, page, __pgprot(0));
+   flush_tlb = true;
+   }
+   spin_unlock_irqrestore(&page->xpfo_lock, flags);
}
 
-   spin_unlock_irqrestore(&page->xpfo_lock, flags);
+   if (flush_tlb)
+   xpfo_flush_kernel_tlb(page, 0);
 }
 
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map);
-- 
2.17.1



[RFC PATCH v9 09/13] xpfo: add primitives for mapping underlying memory

2019-04-03 Thread Khalid Aziz
From: Tycho Andersen 

In some cases (on arm64, DMA and data cache flushes) we may have unmapped
the underlying pages needed for something via XPFO. Here are some
primitives useful for ensuring the underlying memory is mapped or unmapped
in the face of XPFO.

Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 include/linux/xpfo.h | 21 +
 mm/xpfo.c| 30 ++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 5d8d06e4b796..2318c7eb5fb7 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -91,6 +91,15 @@ void xpfo_free_pages(struct page *page, int order);
 
 phys_addr_t user_virt_to_phys(unsigned long addr);
 
+#define XPFO_NUM_PAGES(addr, size) \
+   (PFN_UP((unsigned long) (addr) + (size)) - \
+   PFN_DOWN((unsigned long) (addr)))
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+  size_t mapping_len);
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+size_t mapping_len);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_init_single_page(struct page *page) { }
@@ -106,6 +115,18 @@ static inline void xpfo_flush_kernel_tlb(struct page 
*page, int order) { }
 
 static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
 
+#define XPFO_NUM_PAGES(addr, size) 0
+
+static inline void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+size_t mapping_len)
+{
+}
+
+static inline void xpfo_temp_unmap(const void *addr, size_t size,
+  void **mapping, size_t mapping_len)
+{
+}
+
 #endif /* CONFIG_XPFO */
 
 #if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP))
diff --git a/mm/xpfo.c b/mm/xpfo.c
index b74fee0479e7..974f1b70ccd9 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -14,6 +14,7 @@
  * the Free Software Foundation.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -104,3 +105,32 @@ void xpfo_free_pages(struct page *page, int order)
}
}
 }
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+  size_t mapping_len)
+{
+   struct page *page = virt_to_page(addr);
+   int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+   memset(mapping, 0, mapping_len);
+
+   for (i = 0; i < num_pages; i++) {
+   if (page_to_virt(page + i) >= addr + size)
+   break;
+
+   if (PageXpfoUnmapped(page + i))
+   mapping[i] = kmap_atomic(page + i);
+   }
+}
+EXPORT_SYMBOL(xpfo_temp_map);
+
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+size_t mapping_len)
+{
+   int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+   for (i = 0; i < num_pages; i++)
+   if (mapping[i])
+   kunmap_atomic(mapping[i]);
+}
+EXPORT_SYMBOL(xpfo_temp_unmap);
-- 
2.17.1



[RFC PATCH v9 08/13] swiotlb: Map the buffer if it was unmapped by XPFO

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

XPFO can unmap a bounce buffer. Check for this and map it back in if
needed.

Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
Reviewed-by: Konrad Rzeszutek Wilk 
---
v9: * Added a generic check for whether a page is mapped or not (suggested
  by Chris Hellwig)

v6: * guard against lookup_xpfo() returning NULL

 include/linux/highmem.h | 7 +++
 kernel/dma/swiotlb.c| 3 ++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 59a1a5fa598d..cf21f023dff4 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -77,6 +77,13 @@ static inline struct page *kmap_to_page(void *addr)
 }
 
 static inline unsigned long totalhigh_pages(void) { return 0UL; }
+static inline bool page_is_unmapped(struct page *page)
+{
+   if (PageHighMem(page) || PageXpfoUnmapped(page))
+   return true;
+   else
+   return false;
+}
 
 #endif /* CONFIG_HIGHMEM */
 
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 1fb6fd68b9c7..90a1a3709b55 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -392,8 +392,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, 
phys_addr_t tlb_addr,
 {
unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = phys_to_virt(tlb_addr);
+   struct page *page = pfn_to_page(pfn);
 
-   if (PageHighMem(pfn_to_page(pfn))) {
+   if (page_is_unmapped(page)) {
/* The buffer does not have a mapping.  Map it in and copy */
unsigned int offset = orig_addr & ~PAGE_MASK;
char *buffer;
-- 
2.17.1



[RFC PATCH v9 10/13] arm64/mm, xpfo: temporarily map dcache regions

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

If the page is unmapped by XPFO, a data cache flush results in a fatal
page fault, so let's temporarily map the region, flush the cache, and then
unmap it.

CC: linux-arm-ker...@lists.infradead.org
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
---
v6: actually flush in the face of xpfo, and temporarily map the underlying
memory so it can be flushed correctly

 arch/arm64/mm/flush.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 5c9073bace83..114e8bc5a3dc 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/xpfo.h>
 
 #include 
 #include 
@@ -28,9 +29,15 @@
 void sync_icache_aliases(void *kaddr, unsigned long len)
 {
unsigned long addr = (unsigned long)kaddr;
+   unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
+   void *mapping[num_pages];
 
if (icache_is_aliasing()) {
+   xpfo_temp_map(kaddr, len, mapping,
+ sizeof(mapping[0]) * num_pages);
__clean_dcache_area_pou(kaddr, len);
+   xpfo_temp_unmap(kaddr, len, mapping,
+   sizeof(mapping[0]) * num_pages);
__flush_icache_all();
} else {
/*
-- 
2.17.1



[RFC PATCH v9 06/13] lkdtm: Add test for XPFO

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

This test simply reads from userspace memory via the kernel's linear
map.
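
Once lkdtm is built with this, the tests can be triggered in the usual
lkdtm way, e.g. by writing XPFO_READ_USER to
/sys/kernel/debug/provoke-crash/DIRECT (assuming debugfs is mounted at
/sys/kernel/debug).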

v6: * drop an #ifdef, just let the test fail if XPFO is not supported
* add XPFO_SMP test to check that when one CPU does an xpfo unmap of an
  address, the address can't be used accidentally by other CPUs.

Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
[jstec...@amazon.de: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina 
Tested-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 drivers/misc/lkdtm/Makefile |   1 +
 drivers/misc/lkdtm/core.c   |   3 +
 drivers/misc/lkdtm/lkdtm.h  |   5 +
 drivers/misc/lkdtm/xpfo.c   | 196 
 4 files changed, 205 insertions(+)
 create mode 100644 drivers/misc/lkdtm/xpfo.c

diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile
index 951c984de61a..97c6b7818cce 100644
--- a/drivers/misc/lkdtm/Makefile
+++ b/drivers/misc/lkdtm/Makefile
@@ -9,6 +9,7 @@ lkdtm-$(CONFIG_LKDTM)   += refcount.o
 lkdtm-$(CONFIG_LKDTM)  += rodata_objcopy.o
 lkdtm-$(CONFIG_LKDTM)  += usercopy.o
 lkdtm-$(CONFIG_LKDTM)  += stackleak.o
+lkdtm-$(CONFIG_LKDTM)  += xpfo.o
 
 KASAN_SANITIZE_stackleak.o := n
 KCOV_INSTRUMENT_rodata.o   := n
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..25f4ab4ebf50 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -185,6 +185,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(USERCOPY_KERNEL),
CRASHTYPE(USERCOPY_KERNEL_DS),
CRASHTYPE(STACKLEAK_ERASING),
+   CRASHTYPE(XPFO_READ_USER),
+   CRASHTYPE(XPFO_READ_USER_HUGE),
+   CRASHTYPE(XPFO_SMP),
 };
 
 
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..6b31ff0c7f8f 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -87,4 +87,9 @@ void lkdtm_USERCOPY_KERNEL_DS(void);
 /* lkdtm_stackleak.c */
 void lkdtm_STACKLEAK_ERASING(void);
 
+/* lkdtm_xpfo.c */
+void lkdtm_XPFO_READ_USER(void);
+void lkdtm_XPFO_READ_USER_HUGE(void);
+void lkdtm_XPFO_SMP(void);
+
 #endif
diff --git a/drivers/misc/lkdtm/xpfo.c b/drivers/misc/lkdtm/xpfo.c
new file mode 100644
index ..8876128f0144
--- /dev/null
+++ b/drivers/misc/lkdtm/xpfo.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This is for all the tests related to XPFO (eXclusive Page Frame Ownership).
+ */
+
+#include "lkdtm.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#define XPFO_DATA 0xdeadbeef
+
+static unsigned long do_map(unsigned long flags)
+{
+   unsigned long user_addr, user_data = XPFO_DATA;
+
+   user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
+   PROT_READ | PROT_WRITE | PROT_EXEC,
+   flags, 0);
+   if (user_addr >= TASK_SIZE) {
+   pr_warn("Failed to allocate user memory\n");
+   return 0;
+   }
+
+   if (copy_to_user((void __user *)user_addr, &user_data,
+sizeof(user_data))) {
+   pr_warn("copy_to_user failed\n");
+   goto free_user;
+   }
+
+   return user_addr;
+
+free_user:
+   vm_munmap(user_addr, PAGE_SIZE);
+   return 0;
+}
+
+static unsigned long *user_to_kernel(unsigned long user_addr)
+{
+   phys_addr_t phys_addr;
+   void *virt_addr;
+
+   phys_addr = user_virt_to_phys(user_addr);
+   if (!phys_addr) {
+   pr_warn("Failed to get physical address of user memory\n");
+   return NULL;
+   }
+
+   virt_addr = phys_to_virt(phys_addr);
+   if (phys_addr != virt_to_phys(virt_addr)) {
+   pr_warn("Physical address of user memory seems incorrect\n");
+   return NULL;
+   }
+
+   return virt_addr;
+}
+
+static void read_map(unsigned long *virt_addr)
+{
+   pr_info("Attempting bad read from kernel address %p\n", virt_addr);
+   if (*(unsigned long *)virt_addr == XPFO_DATA)
+   pr_err("FAIL: Bad read succeeded?!\n");
+   else
+   pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
+}
+
+static void read_user_with_flags(unsigned long flags)
+{
+   unsigned long user_addr, *kernel;
+
+   user_addr = do_map(flags);
+   if (!user_addr) {
+   pr_err("FAIL: map failed\n");
+   return;
+   }
+
+   kernel = user_to_kernel(user_addr);
+   if (!kernel) {
+   pr_err("FAIL: user to kernel conversion failed\n");
+   goto free_user;
+   }
+
+   read_map(kernel);
+
+free_user:
+   vm_munmap(user_addr, PAGE_SIZE);
+}
+
+/* Read from userspace via the kernel's linear map. */
+void lkdtm_XPFO_READ_USER(void)
+{
+   read_user_with_flags(M

[RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-03 Thread Khalid Aziz
XPFO flushes kernel space TLB entries for pages that are now mapped
in userspace on not only the current CPU but also all other CPUs
synchronously. Processes on each core allocating pages cause a
flood of IPI messages to all other cores to flush TLB entries.
Many of these messages are to flush the entire TLB on the core if
the number of entries being flushed from the local core exceeds
tlb_single_page_flush_ceiling. The cost of TLB flushes caused by
unmapping pages from physmap goes up dramatically on machines with
a high core count.

This patch flushes relevant TLB entries for the current process, or
the entire TLB depending upon the number of entries, for the current
CPU and posts a pending TLB flush on all other CPUs when a page is
unmapped from kernel space and mapped in userspace. Each core
checks the pending TLB flush flag for itself on every context
switch, flushes its TLB if the flag is set and clears it.
This patch potentially aggregates multiple TLB flushes into one.
This has a very significant impact, especially on machines with large
core counts. To illustrate this, the kernel was compiled with -j on
two classes of machines - a server with a high core count and a large
amount of memory, and a desktop class machine with more modest
specs. System times from "make -j" with the vanilla 4.20 kernel, with
4.20 plus the XPFO patches before applying this patch, and after
applying this patch are below:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                         950.966s
4.20+XPFO                  25073.169s   26.366x
4.20+XPFO+Deferred flush    1372.874s    1.44x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                         607.671s
4.20+XPFO                   1588.646s    2.614x
4.20+XPFO+Deferred flush     803.989s    1.32x

This same code should be implemented for other architectures as
well once finalized.
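
To illustrate the shape of the deferral, here is a condensed sketch of
the flush side. The actual xpfo_flush_tlb_kernel_range() added by this
patch (see the diff below) also balances between a range flush and a
full flush based on tlb_single_page_flush_ceiling; the sketch just does
a full local flush.

#include <linux/cpumask.h>
#include <linux/smp.h>
#include <asm/tlbflush.h>

/* Sketch: flush the local CPU now, mark all others for a deferred flush. */
void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
        int this_cpu = get_cpu();       /* stay on one CPU while marking */
        int cpu;

        /* Local TLB flush right away. */
        __flush_tlb_all();

        /* Other CPUs flush when they next pass through switch_mm_irqs_off(). */
        for_each_online_cpu(cpu)
                if (cpu != this_cpu)
                        cpumask_set_cpu(cpu, &pending_xpfo_flush);

        put_cpu();
}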

Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c   | 52 +
 arch/x86/mm/xpfo.c  |  2 +-
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f4204bf377fc..92d23629d01d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -561,6 +561,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, 
unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long 
end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 999d6d8f0bef..cc806a01a0eb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -37,6 +37,20 @@
  */
 #define LAST_USER_MM_IBPB  0x1UL
 
+/*
+ * A TLB flush may be needed to flush stale TLB entries
+ * for pages that have been mapped into userspace and unmapped
+ * from kernel space. This TLB flush needs to be propagated to
+ * all CPUs. Asynchronous flush requests to all CPUs can cause
* significant performance impact. Queue a pending flush for
+ * a CPU instead. Multiple of these requests can then be handled
+ * by a CPU at a less disruptive time, like context switch, in
+ * one go and reduce performance impact significantly. Following
+ * data structure is used to keep track of CPUs with pending full
+ * TLB flush forced by xpfo.
+ */
+static cpumask_t pending_xpfo_flush;
+
 /*
  * We get here when we do something requiring a TLB invalidation
  * but could not go invalidate all of the contexts.  We do the
@@ -321,6 +335,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
+
+   /*
+* If there is a pending TLB flush for this CPU due to XPFO
+* flush, do it now.
+*/
+   if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
+   count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+   __flush_tlb_all();
+   }
+
this_cpu_write(cpu_tlbstate.is_lazy, false);
 
/*
@@ -803,6 +827,34 @@ void flush_tlb_kernel_range(unsigned long start, unsigned 
long end)
}
 }
 
+void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+   struct cpumask tmp_mask;
+
+   /*
+* Balance as user space task's flush, a bit conservative.
+* Do a local flush immediately and post a pending flush on all
+* other CPUs. Local flush can be a range flush or full flush
+* depending upon the number of entries to be flushed. Remote
+* flushes will be done by individual processors at the time of
+* context switch and this allows multiple flush req

[RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

This patch adds basic support infrastructure for XPFO which protects
against 'ret2dir' kernel attacks. The basic idea is to enforce exclusive
ownership of page frames by either the kernel or userspace, unless
explicitly requested by the kernel. Whenever a page destined for
userspace is allocated, it is unmapped from physmap (the kernel's page
table). When such a page is reclaimed from userspace, it is mapped back
to physmap. Individual architectures can enable full XPFO support using
this infrastructure by supplying architecture specific pieces.

Additional fields in the page struct are used for XPFO housekeeping,
specifically:
  - two flags to distinguish user vs. kernel pages and to tag unmapped
pages.
  - a reference counter to balance kmap/kunmap operations.
  - a lock to serialize access to the XPFO fields.

This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
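
To illustrate how these fields work together, here is a stripped-down
sketch of the kmap side. The real xpfo_kmap in this patch also checks
the xpfo_inited static key and whether the page is a user page;
ClearPageXpfoUnmapped() is assumed to be among the page-flag helpers
added here. The matching unmap side is shown in patch 11 of this series.

#include <linux/mm_types.h>
#include <linux/spinlock.h>
#include <linux/xpfo.h>

/* Sketch only: the first mapper restores the kernel mapping. */
static inline void xpfo_kmap_sketch(void *kaddr, struct page *page)
{
        unsigned long flags;

        spin_lock_irqsave(&page->xpfo_lock, flags);

        if (atomic_inc_return(&page->xpfo_mapcount) == 1 &&
            PageXpfoUnmapped(page)) {
                ClearPageXpfoUnmapped(page);
                set_kpte(kaddr, page, PAGE_KERNEL);
        }

        spin_unlock_irqrestore(&page->xpfo_lock, flags);
}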

CC: x...@kernel.org
Suggested-by: Vasileios P. Kemerlis 
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Marco Benatto 
[jstec...@amazon.de: encode all XPFO info in struct page]
Signed-off-by: Julian Stecklina 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 v9: * Do not use page extensions. Encode all xpfo information in struct
  page (Julian Stecklina).
* Split architecture specific code into its own separate patch
* Move kmap*() to include/linux/xpfo.h for cleaner code as suggested
  for an earlier version of this patch
* Use irq versions of spin_lock to address possible deadlock around
  xpfo_lock caused by interrupts.
* Incorporated various feedback provided on v6 patch way back.

v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
  the tlb entry on all CPUs when unmapping it in kunmap
* handle lookup_page_ext()/lookup_xpfo() returning NULL
* drop lots of BUG()s in favor of WARN()
* don't disable irqs in xpfo_kmap/xpfo_kunmap, export
  __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
  pass it

 .../admin-guide/kernel-parameters.txt |   6 +
 include/linux/highmem.h   |  31 +---
 include/linux/mm_types.h  |   8 +
 include/linux/page-flags.h|  23 ++-
 include/linux/xpfo.h  | 147 ++
 include/trace/events/mmflags.h|  10 +-
 mm/Makefile   |   1 +
 mm/compaction.c   |   2 +-
 mm/internal.h |   2 +-
 mm/page_alloc.c   |  10 +-
 mm/page_isolation.c   |   2 +-
 mm/xpfo.c | 106 +
 security/Kconfig  |  27 
 13 files changed, 337 insertions(+), 38 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 858b6c0b9a15..9b36da94760e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2997,6 +2997,12 @@
 
nox2apic[X86-64,APIC] Do not enable x2APIC mode.
 
+   noxpfo  [XPFO] Disable eXclusive Page Frame Ownership (XPFO)
+   when CONFIG_XPFO is on. Physical pages mapped into
+   user applications will also be mapped in the
+   kernel's address space as if CONFIG_XPFO was not
+   enabled.
+
cpu0_hotplug[X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
Some features depend on CPU0. Known dependencies are:
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ea5cdbd8c2c3..59a1a5fa598d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/xpfo.h>
 
 #include 
 
@@ -77,36 +78,6 @@ static inline struct page *kmap_to_page(void *addr)
 
 static inline unsigned long totalhigh_pages(void) { return 0UL; }
 
-#ifndef ARCH_HAS_KMAP
-static inline void *kmap(struct page *page)
-{
-   might_sleep();
-   return page_address(page);
-}
-
-static inline void kunmap(struct page *page)
-{
-}
-
-static inline void *kmap_atomic(struct page *page)
-{
-   preempt_disable();
-   pagefault_disable();
-   return page_address(page);
-}
-#define kmap_atomic_prot(page, prot)   kmap_atomic(page)
-
-static inline void __kunmap_atomic(void *addr)
-{
-   pagefault_enable();
-   preempt_enable();
-}
-
-#define kmap_atomic_pfn(pfn)   kmap_atomic(pfn_to_page(pfn))
-
-#define kmap_flush_unused()do {} while(0)
-#endif
-
 #endif /* CONFIG_HIGHMEM */
 
 #if defined(CONFIG_HIGHMEM

[RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership

2019-04-03 Thread Khalid Aziz
This is another update to the work Juerg, Tycho and Julian have
done on XPFO. After the last round of updates, we were seeing very
significant performance penalties when stale TLB entries were
flushed actively after an XPFO TLB update. The benchmark for measuring
performance is a kernel build using parallel make. To get full
protection from ret2dir attacks, we must flush stale TLB entries.
The performance penalty from flushing stale TLB entries goes up as the
number of cores goes up. On a desktop class machine with only 4
cores, enabling TLB flush for stale entries causes system time for
"make -j4" to go up by a factor of 2.61x, but on a larger machine
with 96 cores, system time with "make -j60" goes up by a factor of
26.37x! I have been working on reducing this performance penalty.

I implemented two solutions to reduce the performance penalty, and
they have had a large impact. XPFO code flushes the TLB every time a
page is allocated to userspace. It does so by sending IPIs to all
processors to flush the TLB. Back to back allocations of pages to
userspace on multiple processors result in a storm of IPIs. Each one
of these incoming IPIs is handled by a processor by flushing its TLB.
To reduce this IPI storm, I have added a per-CPU flag that can be set
to tell a processor to flush its TLB. A processor checks this flag
on every context switch. If the flag is set, it flushes its TLB and
clears the flag. This allows multiple TLB flush requests to a
single CPU to be combined into a single request. A kernel TLB entry
for a page that has been allocated to userspace is flushed on all
processors, unlike the previous version of this patch. A processor
could hold a stale kernel TLB entry that was removed on another
processor until the next context switch. A local userspace page
allocation by the currently running process could force the TLB
flush earlier for such entries.

The other solution reduces the number of TLB flushes required by
performing a TLB flush for multiple pages at one time when pages are
refilled on the per-cpu freelist. If the pages being added to the
per-cpu freelist are marked for userspace allocation, TLB entries
for these pages can be flushed upfront and the pages tagged as
currently unmapped. When any such page is allocated to userspace,
there is no need to perform a TLB flush at that time any more. This
batching of TLB flushes reduces the performance impact further.
Similarly, when these user pages are freed by userspace and added
back to the per-cpu free list, they are left unmapped and tagged so.
This further optimization reduced the performance impact from 1.32x
to 1.28x for the 96-core server and from 1.31x to 1.27x for the
4-core desktop.

I measured system time for parallel make with the unmodified kernel,
with the XPFO patches before these two changes, and then again after
applying each of these changes. Here are the results:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

5.0                        913.862s
5.0+this patch series     1165.259s   1.28x


Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

5.0                        610.642s
5.0+this patch series      773.075s   1.27x

Performance with this patch set is good enough to use it as a
starting point for further refinement before we merge it into the
mainline kernel, hence the RFC.

I have restructured the patches in this version to separate out the
architecture-independent code. I folded much of Julian's improvement
to not use page extensions into patch 3.

What remains to be done beyond this patch series:

1. Performance improvements: Ideas to explore - (1) kernel mappings
   private to an mm, (2) Any others??
2. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
   from Juerg. I dropped it for now since swiotlb code for ARM has
   changed a lot since this patch was written. I could use help
   from ARM experts on this.
3. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
   CPUs" to other architectures besides x86.
4. Change kmap to not map the page back to physmap; instead map it
   to a new va similar to what kmap_high does. Mapping the page back
   into physmap re-opens the ret2dir vulnerability for the duration of
   the kmap. All of the kmap_high and related code can be reused for this,
   but that will require restructuring that code so it can be built for
   64-bits as well. Any objections to that?

-

Juerg Haefliger (6):
  mm: Add support for eXclusive Page Frame Ownership (XPFO)
  xpfo, x86: Add support for XPFO for x86-64
  lkdtm: Add test for XPFO
  arm64/mm: Add support for XPFO
  swiotlb: Map the buffer if it was unmapped by XPFO
  arm64/mm, xpfo: temporarily map dcache regions

Julian Stecklina (1):
  xpfo, mm: optimize spinlock usage in xpfo_kunmap

Khalid Aziz (2):
  xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  xpfo, mm: Optimize XPFO TLB flushes by batching them together

[RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-03 Thread Khalid Aziz
From: Tycho Andersen 

Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
that might sleep:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from 
invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, 
pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: 
lkdtm_xpfo_test Tainted: G  D 4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC 
(i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? 
blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 
[lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

To be safe, let's just always enable irqs.

The particular case I'm hitting is:

Aug 23 19:30:27 xpfo kernel: [   38.278615]  __bad_area_nosemaphore+0x1a9/0x1d0
Aug 23 19:30:27 xpfo kernel: [   38.278617]  bad_area_nosemaphore+0xf/0x20
Aug 23 19:30:27 xpfo kernel: [   38.278618]  __do_page_fault+0xd1/0x540
Aug 23 19:30:27 xpfo kernel: [   38.278620]  ? irq_work_queue+0x9b/0xb0
Aug 23 19:30:27 xpfo kernel: [   38.278623]  ? wake_up_klogd+0x36/0x40
Aug 23 19:30:27 xpfo kernel: [   38.278624]  trace_do_page_fault+0x3c/0xf0
Aug 23 19:30:27 xpfo kernel: [   38.278625]  do_async_page_fault+0x14/0x60
Aug 23 19:30:27 xpfo kernel: [   38.278627]  async_page_fault+0x28/0x30

i.e. when a fault in kernel space has been triggered by XPFO.

Signed-off-by: Tycho Andersen 
CC: x...@kernel.org
Tested-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 arch/x86/mm/fault.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9d5c75f02295..7891add0913f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* Executive summary in case the body of the oops scrolled away */
printk(KERN_DEFAULT "CR2: %016lx\n", address);
 
+   /*
+* We're about to oops, which might kill the task. Make sure we're
+* allowed to sleep.
+*/
+   flags |= X86_EFLAGS_IF;
+
oops_end(flags, regs, sig);
 }
 
-- 
2.17.1
