Re: [PATCH v19 6/7] KVM: arm64: Add support for the KVM PTP service

2021-04-17 Thread Zenghui Yu

On 2021/4/17 17:10, Marc Zyngier wrote:

On Sat, 17 Apr 2021 09:59:39 +0100,
Zenghui Yu  wrote:


On 2021/3/30 22:54, Marc Zyngier wrote:

+PTP_KVM support for arm/arm64
+=============================
+
+PTP_KVM is used for high precision time sync between host and guests.
+It relies on transferring the wall clock and counter value from the
+host to the guest using a KVM-specific hypercall.
+
+* ARM_SMCCC_HYP_KVM_PTP_FUNC_ID: 0x8601


Per include/linux/arm-smccc.h, this should be
ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID.


Well spotted. Care to send a patch on top of my kvm-arm64/ptp branch?


Okay. I'll have a patch for that.


Zenghui
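For context: in the merged driver, the hypercall returns the wall clock time and the counter value split across four 32-bit register halves. A userspace sketch of the reassembly a guest driver performs (struct and helper names here are hypothetical, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors struct arm_smccc_res: one slot per returned register. */
struct ptp_hvc_res {
	uint64_t a0;	/* upper 32 bits of wall clock time */
	uint64_t a1;	/* lower 32 bits of wall clock time */
	uint64_t a2;	/* upper 32 bits of counter value */
	uint64_t a3;	/* lower 32 bits of counter value */
};

/* Reassemble a 64-bit quantity from two 32-bit register halves. */
static uint64_t ptp_wall_clock(const struct ptp_hvc_res *res)
{
	return ((uint64_t)res->a0 << 32) | (res->a1 & 0xffffffffULL);
}

static uint64_t ptp_counter(const struct ptp_hvc_res *res)
{
	return ((uint64_t)res->a2 << 32) | (res->a3 & 0xffffffffULL);
}
```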


Re: [PATCH v19 1/7] arm/arm64: Probe for the presence of KVM hypervisor

2021-04-17 Thread Zenghui Yu

On 2021/3/30 22:54, Marc Zyngier wrote:

 #define SMCCC_ARCH_WORKAROUND_RET_UNAFFECTED   1


I think it'd be better to keep this definition together with other
workaround Function IDs. It's only a cosmetic comment anyway.


Zenghui


Re: [PATCH v19 4/7] time: Add mechanism to recognize clocksource in time_get_snapshot

2021-04-17 Thread Zenghui Yu

On 2021/3/30 22:54, Marc Zyngier wrote:

-   u64 cycles;
-   ktime_t real;
-   ktime_t raw;
-   unsigned intclock_was_set_seq;
-   u8  cs_was_changed_seq;
+   u64 cycles;
+   ktime_t real;
+   ktime_t raw;
+   enum clocksource_idscs_id;


nit: worth adding a description for it (on top of the structure)?


+   unsigned intclock_was_set_seq;
+   u8  cs_was_changed_seq;


Re: [PATCH v19 6/7] KVM: arm64: Add support for the KVM PTP service

2021-04-17 Thread Zenghui Yu

On 2021/3/30 22:54, Marc Zyngier wrote:

+PTP_KVM support for arm/arm64
+=============================
+
+PTP_KVM is used for high precision time sync between host and guests.
+It relies on transferring the wall clock and counter value from the
+host to the guest using a KVM-specific hypercall.
+
+* ARM_SMCCC_HYP_KVM_PTP_FUNC_ID: 0x8601


Per include/linux/arm-smccc.h, this should be
ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID.


+
+This hypercall uses the SMC32/HVC32 calling convention:
+
+ARM_SMCCC_HYP_KVM_PTP_FUNC_ID


ditto


+=====
+Function ID: (uint32)  0x8601
+Arguments:   (uint32)  KVM_PTP_VIRT_COUNTER(0)
+   KVM_PTP_PHYS_COUNTER(1)


Re: [PATCH v19 7/7] ptp: arm/arm64: Enable ptp_kvm for arm/arm64

2021-04-17 Thread Zenghui Yu

On 2021/3/30 22:54, Marc Zyngier wrote:

+int kvm_arch_ptp_init(void)
+{
+   int ret;
+
+   ret = kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_PTP);
+   if (ret <= 0)


kvm_arm_hyp_service_available() returns a boolean. Maybe write it as:

bool ret;

ret = kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_PTP);
if (!ret)
return -ENODEV;


+   return -EOPNOTSUPP;
+
+   return 0;
+}
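A runnable sketch of the suggested shape (the helper below is a userspace stand-in for kvm_arm_hyp_service_available(), which we assume returns bool):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for kvm_arm_hyp_service_available(). */
static bool hyp_service_available(bool host_has_ptp)
{
	return host_has_ptp;
}

/* The probe shape suggested above: treat the result as a plain boolean. */
static int ptp_init(bool host_has_ptp)
{
	if (!hyp_service_available(host_has_ptp))
		return -ENODEV;
	return 0;
}
```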


Re: [PATCH v12 01/13] vfio: VFIO_IOMMU_SET_PASID_TABLE

2021-04-07 Thread Zenghui Yu

Hi Eric,

On 2021/2/24 5:06, Eric Auger wrote:

+/*
+ * VFIO_IOMMU_SET_PASID_TABLE - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
+ * struct vfio_iommu_type1_set_pasid_table)
+ *
+ * The SET operation passes a PASID table to the host while the
+ * UNSET operation detaches the one currently programmed. Setting
+ * a table while another is already programmed replaces the old table.


It looks to me that this description doesn't match the IOMMU part.

[v14,05/13] iommu/smmuv3: Implement attach/detach_pasid_table

|   case IOMMU_PASID_CONFIG_TRANSLATE:
|   /* we do not support S1 <-> S1 transitions */
|   if (smmu_domain->s1_cfg.set)
|   goto out;

Maybe I've misread something?


Thanks,
Zenghui


[PATCH] iommu/arm-smmu-v3: Remove the unused fields for PREFETCH_CONFIG command

2021-04-07 Thread Zenghui Yu
Per the SMMUv3 spec, there are no Size and Addr fields in the PREFETCH_CONFIG
command and they're not used by the driver. Remove them.

We can add them back if we're going to use PREFETCH_ADDR in the future.

Signed-off-by: Zenghui Yu 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 --
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 --
 2 files changed, 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8594b4a83043..610c9f4b7789 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -245,8 +245,6 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct 
arm_smmu_cmdq_ent *ent)
break;
case CMDQ_OP_PREFETCH_CFG:
cmd[0] |= FIELD_PREP(CMDQ_PREFETCH_0_SID, ent->prefetch.sid);
-   cmd[1] |= FIELD_PREP(CMDQ_PREFETCH_1_SIZE, ent->prefetch.size);
-   cmd[1] |= ent->prefetch.addr & CMDQ_PREFETCH_1_ADDR_MASK;
break;
case CMDQ_OP_CFGI_CD:
cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SSID, ent->cfgi.ssid);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index f985817c967a..83726b6a1cba 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -410,8 +410,6 @@ struct arm_smmu_cmdq_ent {
#define CMDQ_OP_PREFETCH_CFG0x1
struct {
u32 sid;
-   u8  size;
-   u64 addr;
} prefetch;
 
#define CMDQ_OP_CFGI_STE0x3
-- 
2.19.1



Re: [PATCH v14 08/13] dma-iommu: Implement NESTED_MSI cookie

2021-04-07 Thread Zenghui Yu

Hi Eric,

On 2021/2/24 4:56, Eric Auger wrote:

Up to now, when the type was UNMANAGED, we used to
allocate IOVA pages within a reserved IOVA MSI range.

If both the host and the guest are exposed with SMMUs, each
would allocate an IOVA. The guest allocates an IOVA (gIOVA)
to map onto the guest MSI doorbell (gDB). The Host allocates
another IOVA (hIOVA) to map onto the physical doorbell (hDB).

So we end up with 2 unrelated mappings, at S1 and S2:
  S1 S2
gIOVA-> gDB
hIOVA->hDB

The PCI device would be programmed with hIOVA.
No stage 1 mapping would exist, causing the MSIs to fault.

iommu_dma_bind_guest_msi() allows passing the gIOVA/gDB pair
to the host so that gIOVA can be used by the host instead of
re-allocating a new hIOVA.

  S1   S2
gIOVA->gDB->hDB

this time, the PCI device can be programmed with the gIOVA MSI
doorbell which is correctly mapped through both stages.
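The two-stage chain described above can be modeled as two tiny lookup tables; a toy sketch (all addresses and names are made up) showing that an address with no stage 1 mapping faults, while gIOVA resolves through both stages:

```c
#include <assert.h>
#include <stdint.h>

#define NO_MAPPING ((uint64_t)-1)

/* One translation stage, reduced to a single-entry lookup table. */
struct stage {
	uint64_t in;	/* input address (IOVA at S1, IPA at S2) */
	uint64_t out;	/* output address */
};

static uint64_t translate(const struct stage *s, uint64_t addr)
{
	return (s->in == addr) ? s->out : NO_MAPPING;
}

/* Walk stage 1 then stage 2, as the SMMU does in nested mode. */
static uint64_t walk(const struct stage *s1, const struct stage *s2,
		     uint64_t iova)
{
	uint64_t ipa = translate(s1, iova);

	if (ipa == NO_MAPPING)
		return NO_MAPPING;	/* S1 fault: the MSI never reaches hDB */
	return translate(s2, ipa);
}
```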

Nested mode is not compatible with HW MSI regions as in that
case gDB and hDB should have a 1-1 mapping. This check will
be done when attaching each device to the IOMMU domain.

Signed-off-by: Eric Auger 


[...]


diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f659395e7959..d25eb7cecaa7 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -19,6 +19,7 @@
  #include 
  #include 
  #include 
+#include 


Duplicated include.


  #include 
  #include 
  #include 
@@ -29,12 +30,15 @@
  struct iommu_dma_msi_page {
struct list_headlist;
dma_addr_t  iova;
+   dma_addr_t  gpa;
phys_addr_t phys;
+   size_t  s1_granule;
  };
  
  enum iommu_dma_cookie_type {

IOMMU_DMA_IOVA_COOKIE,
IOMMU_DMA_MSI_COOKIE,
+   IOMMU_DMA_NESTED_MSI_COOKIE,
  };
  
  struct iommu_dma_cookie {

@@ -46,6 +50,7 @@ struct iommu_dma_cookie {
dma_addr_t  msi_iova;


msi_iova is unused in the nested mode, but we still set it to the start
address of the RESV_SW_MSI region (in iommu_get_msi_cookie()), which
looks a bit strange to me.


};
struct list_headmsi_page_list;
+   spinlock_t  msi_lock;


Should msi_lock be grabbed everywhere msi_page_list is populated?
Especially in iommu_dma_get_msi_page(), which can be invoked from the
irqchip driver.

  
  	/* Domain for flush queue callback; NULL if flush queue not in use */

struct iommu_domain *fq_domain;
@@ -87,6 +92,7 @@ static struct iommu_dma_cookie *cookie_alloc(enum 
iommu_dma_cookie_type type)
  
  	cookie = kzalloc(sizeof(*cookie), GFP_KERNEL);

if (cookie) {
+   spin_lock_init(&cookie->msi_lock);
INIT_LIST_HEAD(&cookie->msi_page_list);
cookie->type = type;
}
@@ -120,14 +126,17 @@ EXPORT_SYMBOL(iommu_get_dma_cookie);
   *
   * Users who manage their own IOVA allocation and do not want DMA API support,
   * but would still like to take advantage of automatic MSI remapping, can use
- * this to initialise their own domain appropriately. Users should reserve a
+ * this to initialise their own domain appropriately. Users may reserve a
   * contiguous IOVA region, starting at @base, large enough to accommodate the
   * number of PAGE_SIZE mappings necessary to cover every MSI doorbell address
- * used by the devices attached to @domain.
+ * used by the devices attached to @domain. The other way round is to provide
+ * usable iova pages through the iommu_dma_bind_doorbell API (nested stages


s/iommu_dma_bind_doorbell/iommu_dma_bind_guest_msi/ ?


+ * use case)
   */
  int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
  {
struct iommu_dma_cookie *cookie;
+   int nesting, ret;
  
  	if (domain->type != IOMMU_DOMAIN_UNMANAGED)

return -EINVAL;
@@ -135,7 +144,12 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, 
dma_addr_t base)
if (domain->iova_cookie)
return -EEXIST;
  
-	cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);

+   ret =  iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nesting);


Redundant space.


+   if (!ret && nesting)
+   cookie = cookie_alloc(IOMMU_DMA_NESTED_MSI_COOKIE);
+   else
+   cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
+
if (!cookie)
return -ENOMEM;
  
@@ -156,6 +170,7 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)

  {
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iommu_dma_msi_page *msi, *tmp;
+   bool s2_unmap = false;
  
  	if (!cookie)

return;
@@ -163,7 +178,15 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule)
put_iova_domain(&cookie->iovad);
  
+	if (cookie->type == IOMMU_DMA_NESTED_MSI_COOKIE)

+   s2_unmap = true;
+
 

Re: [PATCH v14 07/13] iommu/smmuv3: Implement cache_invalidate

2021-04-01 Thread Zenghui Yu

Hi Eric,

On 2021/2/24 4:56, Eric Auger wrote:

+static int
+arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
+ struct iommu_cache_invalidate_info *inv_info)
+{
+   struct arm_smmu_cmdq_ent cmd = {.opcode = CMDQ_OP_TLBI_NSNH_ALL};
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -EINVAL;
+
+   if (!smmu)
+   return -EINVAL;
+
+   if (inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+   return -EINVAL;
+
+   if (inv_info->cache & IOMMU_CACHE_INV_TYPE_PASID ||


I didn't find any code where we would emulate the CFGI_CD{_ALL} commands
for the guest and invalidate the stale CD entries on the physical side. Is
the PASID-cache invalidation type designed for that purpose?


+   inv_info->cache & IOMMU_CACHE_INV_TYPE_DEV_IOTLB) {
+   return -ENOENT;
+   }
+
+   if (!(inv_info->cache & IOMMU_CACHE_INV_TYPE_IOTLB))
+   return -EINVAL;
+
+   /* IOTLB invalidation */
+
+   switch (inv_info->granularity) {
+   case IOMMU_INV_GRANU_PASID:
+   {
+   struct iommu_inv_pasid_info *info =
+   &inv_info->granu.pasid_info;
+
+   if (info->flags & IOMMU_INV_ADDR_FLAGS_PASID)
+   return -ENOENT;
+   if (!(info->flags & IOMMU_INV_PASID_FLAGS_ARCHID))
+   return -EINVAL;
+
+   __arm_smmu_tlb_inv_context(smmu_domain, info->archid);
+   return 0;
+   }
+   case IOMMU_INV_GRANU_ADDR:
+   {
+   struct iommu_inv_addr_info *info = &inv_info->granu.addr_info;
+   size_t size = info->nb_granules * info->granule_size;
+   bool leaf = info->flags & IOMMU_INV_ADDR_FLAGS_LEAF;
+
+   if (info->flags & IOMMU_INV_ADDR_FLAGS_PASID)
+   return -ENOENT;
+
+   if (!(info->flags & IOMMU_INV_ADDR_FLAGS_ARCHID))
+   break;
+
+   arm_smmu_tlb_inv_range_domain(info->addr, size,
+ info->granule_size, leaf,
+ info->archid, smmu_domain);
+
+   arm_smmu_cmdq_issue_sync(smmu);


There is no need to issue one more SYNC.


Re: [PATCH v14 13/13] iommu/smmuv3: Accept configs with more than one context descriptor

2021-03-30 Thread Zenghui Yu

Hi Eric,

On 2021/2/24 4:56, Eric Auger wrote:

In preparation for vSVA, let's accept userspace provided configs
with more than one CD. We check the max CD against the host iommu
capability and also the format (linear versus 2 level).

Signed-off-by: Eric Auger 
Signed-off-by: Shameer Kolothum 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 13 -
  1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 332d31c0680f..ab74a0289893 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3038,14 +3038,17 @@ static int arm_smmu_attach_pasid_table(struct 
iommu_domain *domain,
if (smmu_domain->s1_cfg.set)
goto out;
  
-		/*

-* we currently support a single CD so s1fmt and s1dss
-* fields are also ignored
-*/
-   if (cfg->pasid_bits)
+   list_for_each_entry(master, &smmu_domain->devices, domain_head) {
+   if (cfg->pasid_bits > master->ssid_bits)
+   goto out;
+   }
+   if (cfg->vendor_data.smmuv3.s1fmt == STRTAB_STE_0_S1FMT_64K_L2 &&
+   !(smmu->features & ARM_SMMU_FEAT_2_LVL_CDTAB))
goto out;
  
  		smmu_domain->s1_cfg.cdcfg.cdtab_dma = cfg->base_ptr;

+   smmu_domain->s1_cfg.s1cdmax = cfg->pasid_bits;
+   smmu_domain->s1_cfg.s1fmt = cfg->vendor_data.smmuv3.s1fmt;


And what about the S1DSS field?


Re: [PATCH v14 06/13] iommu/smmuv3: Allow stage 1 invalidation with unmanaged ASIDs

2021-03-30 Thread Zenghui Yu

On 2021/2/24 4:56, Eric Auger wrote:

@@ -1936,7 +1950,12 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long 
iova, size_t size,
},
};
  
-	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {

+   if (ext_asid >= 0) {  /* guest stage 1 invalidation */
+   cmd.opcode  = smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
+ CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;


If I understand it correctly, the true nested mode effectively gives us
a *NS-EL1* StreamWorld. We should therefore use CMDQ_OP_TLBI_NH_VA to
invalidate the stage-1 NS-EL1 entries, regardless of whether E2H is selected.


[PATCH net] docs: net: ena: Fix ena_start_xmit() function name typo

2021-03-15 Thread Zenghui Yu
The ena.rst documentation referred to end_start_xmit() when it should refer
to ena_start_xmit(). Fix the typo.

Signed-off-by: Zenghui Yu 
---
 Documentation/networking/device_drivers/ethernet/amazon/ena.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst 
b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index 3561a8a29fd2..f8c6469f2bd2 100644
--- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -267,7 +267,7 @@ DATA PATH
 Tx
 --
 
-end_start_xmit() is called by the stack. This function does the following:
+ena_start_xmit() is called by the stack. This function does the following:
 
 - Maps data buffers (skb->data and frags).
 - Populates ena_buf for the push buffer (if the driver and device are
-- 
2.19.1



[PATCH] iommu/vt-d: Fix status code for Allocate/Free PASID command

2021-02-26 Thread Zenghui Yu
As per Intel vt-d spec, Rev 3.0 (section 10.4.45 "Virtual Command Response
Register"), the status code of "No PASID available" error in response to
the Allocate PASID command is 2, not 1. The same for "Invalid PASID" error
in response to the Free PASID command.

We will otherwise see confusing kernel log under the command failure from
guest side. Fix it.

Fixes: 24f27d32ab6b ("iommu/vt-d: Enlightened PASID allocation")
Signed-off-by: Zenghui Yu 
---
 drivers/iommu/intel/pasid.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 97dfcffbf495..444c0bec221a 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -30,8 +30,8 @@
 #define VCMD_VRSP_IP   0x1
 #define VCMD_VRSP_SC(e)(((e) >> 1) & 0x3)
 #define VCMD_VRSP_SC_SUCCESS   0
-#define VCMD_VRSP_SC_NO_PASID_AVAIL1
-#define VCMD_VRSP_SC_INVALID_PASID 1
+#define VCMD_VRSP_SC_NO_PASID_AVAIL2
+#define VCMD_VRSP_SC_INVALID_PASID 2
 #define VCMD_VRSP_RESULT_PASID(e)  (((e) >> 8) & 0xf)
 #define VCMD_CMD_OPERAND(e)((e) << 8)
 /*
-- 
2.19.1
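To see why the value matters: the status code lives in bits 2:1 of the virtual command response register, so a userspace sketch of the decode (macros copied from the quoted pasid.h; the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Field layout copied from the quoted pasid.h. */
#define VCMD_VRSP_IP			0x1
#define VCMD_VRSP_SC(e)			(((e) >> 1) & 0x3)
#define VCMD_VRSP_SC_SUCCESS		0
#define VCMD_VRSP_SC_NO_PASID_AVAIL	2	/* fixed: was 1 */
#define VCMD_VRSP_SC_INVALID_PASID	2	/* fixed: was 1 */

/* Decode the status code field of a virtual command response. */
static int vrsp_status(uint64_t e)
{
	return VCMD_VRSP_SC(e);
}
```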



[PATCH] dma-debug: Delete outdated comment of the hash function

2020-12-29 Thread Zenghui Yu
We actually use dev_addr[26:13] as the index into dma_entry_hash. Given
that the code itself is clear enough, let's drop the hardcoded comment so
that we won't need to revisit it every time HASH_FN_{SHIFT,MASK} gets
updated.

Signed-off-by: Zenghui Yu 
---
 kernel/dma/debug.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 14de1271463f..d89341162d35 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -235,10 +235,7 @@ static bool driver_filter(struct device *dev)
  */
 static int hash_fn(struct dma_debug_entry *entry)
 {
-   /*
-* Hash function is based on the dma address.
-* We use bits 20-27 here as the index into the hash
-*/
+   /* Hash function is based on the dma address. */
return (entry->dev_addr >> HASH_FN_SHIFT) & HASH_FN_MASK;
 }
 
-- 
2.19.1
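For reference, with the define values current at the time of this patch (HASH_FN_SHIFT = 13, HASH_FN_MASK = 0x3fff), the index indeed comes from dev_addr[26:13]; a small userspace sketch (the defines below are assumptions about the then-current values):

```c
#include <assert.h>
#include <stdint.h>

/* Values from kernel/dma/debug.c at the time of this patch. */
#define HASH_FN_SHIFT	13
#define HASH_FN_MASK	0x3fff		/* 14 bits -> dev_addr[26:13] */

static int hash_fn(uint64_t dev_addr)
{
	return (dev_addr >> HASH_FN_SHIFT) & HASH_FN_MASK;
}
```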



Re: [PATCH] arm64/smp: Remove unused irq variable in arch_show_interrupts()

2020-12-24 Thread Zenghui Yu

On 2020/12/15 18:30, Geert Uytterhoeven wrote:

 arch/arm64/kernel/smp.c: In function ‘arch_show_interrupts’:
 arch/arm64/kernel/smp.c:808:16: warning: unused variable ‘irq’ 
[-Wunused-variable]
   808 |   unsigned int irq = irq_desc_get_irq(ipi_desc[i]);
  |^~~

The removal of the last user forgot to remove the variable.

Fixes: 2f516e34383d0e97 ("arm64/smp: Use irq_desc_kstat_cpu() in 
arch_show_interrupts()")


This is in mainline now and the commit id is 5089bc51f81f.


Signed-off-by: Geert Uytterhoeven 
---
One more issue in irq-core-2020-12-14.

  arch/arm64/kernel/smp.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 3aef3bc22d3250b5..10b39328268687e3 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -805,7 +805,6 @@ int arch_show_interrupts(struct seq_file *p, int prec)
unsigned int cpu, i;
  
  	for (i = 0; i < NR_IPI; i++) {

-   unsigned int irq = irq_desc_get_irq(ipi_desc[i]);
seq_printf(p, "%*s%u:%s", prec - 1, "IPI", i,
   prec >= 4 ? " " : "");
for_each_online_cpu(cpu)


And I guess we have the same problem on 32-bit ARM. 'irq' in
show_ipi_list() is no longer used since commit 88c637748e31
("ARM: smp: Use irq_desc_kstat_cpu() in show_ipi_list()").


Thanks,
Zenghui


Re: [PATCH] genirq/msi: Initialize msi_alloc_info to zero for msi_prepare API

2020-12-20 Thread Zenghui Yu

Hi Marc,

On 2020/12/19 1:38, Marc Zyngier wrote:

Hi Zenghui,

On Fri, 18 Dec 2020 06:00:39 +,
Zenghui Yu  wrote:


Since commit 5fe71d271df8 ("irqchip/gic-v3-its: Tag ITS device as shared if
allocating for a proxy device"), some of the devices are wrongly marked as
"shared" by the ITS driver on systems equipped with the ITS(es). The
problem is that the @info->flags may not be initialized anywhere and we end
up looking at random bits on the stack. That's obviously not good.

The straightforward fix is to properly initialize msi_alloc_info inside the
.prepare callback of affected MSI domains (its-pci-msi, its-platform-msi,
etc). We can also perform the initialization in the IRQ core layer for the
msi_domain_prepare_irqs() API, which looks much neater to me.

Signed-off-by: Zenghui Yu 
---

This was noticed when I was playing with the assigned devices on arm64 and
VFIO failed to enable MSI-X vectors for almost all VFs (CCed kvm list in
case others will hit the same issue). It turned out that these VFs are
marked as "shared" by mistake and have trouble with the following sequence:

pci_alloc_irq_vectors(pdev, 1, 1, flag);
pci_free_irq_vectors(pdev);
pci_alloc_irq_vectors(pdev, 1, 2, flag); --> we can only get
 *one* vector

But besides VFIO, I guess there are already some devices that get into
trouble at probe time and can't work properly.

  kernel/irq/msi.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 2c0c4d6d0f83..dc0e2d7fbdfd 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -402,7 +402,7 @@ int __msi_domain_alloc_irqs(struct irq_domain *domain, 
struct device *dev,
struct msi_domain_ops *ops = info->ops;
struct irq_data *irq_data;
struct msi_desc *desc;
-   msi_alloc_info_t arg;
+   msi_alloc_info_t arg = { };
int i, ret, virq;
bool can_reserve;


Thanks for having investigated this. I guess my only worry with this
is that msi_alloc_info_t is a pretty large structure on x86, and
zeroing it isn't totally free.


It seems that x86 will zero the whole msi_alloc_info_t structure at the
beginning of its .prepare (pci_msi_prepare()/init_irq_alloc_info()). If
this really affects something, I think we can easily address it with
some cleanup (on top of this patch).


But this definitely looks nicer than
some of the alternatives (.prepare isn't a good option, as we do rely
on the flag being set in __platform_msi_create_device_domain(), which
calls itself .prepare).


Indeed, thanks for fixing the commit message.


I'll queue it, and we can always revisit this later if Thomas (or
anyone else) has a better idea.


Thanks!


Zenghui


[irqchip: irq/irqchip-next] genirq/msi: Initialize msi_alloc_info before calling msi_domain_prepare_irqs()

2020-12-18 Thread irqchip-bot for Zenghui Yu
The following commit has been merged into the irq/irqchip-next branch of 
irqchip:

Commit-ID: 06fde695ee76429634c1e8c8c1154035aa61191e
Gitweb:
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms/06fde695ee76429634c1e8c8c1154035aa61191e
Author:Zenghui Yu 
AuthorDate:Fri, 18 Dec 2020 14:00:39 +08:00
Committer: Marc Zyngier 
CommitterDate: Fri, 18 Dec 2020 17:42:18 

genirq/msi: Initialize msi_alloc_info before calling msi_domain_prepare_irqs()

Since commit 5fe71d271df8 ("irqchip/gic-v3-its: Tag ITS device as shared if
allocating for a proxy device"), some of the devices are wrongly marked as
"shared" by the ITS driver on systems equipped with the ITS(es). The
problem is that the @info->flags may not be initialized anywhere and we end
up looking at random bits on the stack. That's obviously not good.

We can perform the initialization in the IRQ core layer before calling
msi_domain_prepare_irqs(), which is neat enough.

Fixes: 5fe71d271df8 ("irqchip/gic-v3-its: Tag ITS device as shared if 
allocating for a proxy device")
Signed-off-by: Zenghui Yu 
Signed-off-by: Marc Zyngier 
Link: https://lore.kernel.org/r/20201218060039.1770-1-yuzeng...@huawei.com
---
 kernel/irq/msi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 2c0c4d6..dc0e2d7 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -402,7 +402,7 @@ int __msi_domain_alloc_irqs(struct irq_domain *domain, 
struct device *dev,
struct msi_domain_ops *ops = info->ops;
struct irq_data *irq_data;
struct msi_desc *desc;
-   msi_alloc_info_t arg;
+   msi_alloc_info_t arg = { };
int i, ret, virq;
bool can_reserve;
 


[PATCH] genirq/msi: Initialize msi_alloc_info to zero for msi_prepare API

2020-12-17 Thread Zenghui Yu
Since commit 5fe71d271df8 ("irqchip/gic-v3-its: Tag ITS device as shared if
allocating for a proxy device"), some of the devices are wrongly marked as
"shared" by the ITS driver on systems equipped with the ITS(es). The
problem is that the @info->flags may not be initialized anywhere and we end
up looking at random bits on the stack. That's obviously not good.

The straightforward fix is to properly initialize msi_alloc_info inside the
.prepare callback of affected MSI domains (its-pci-msi, its-platform-msi,
etc). We can also perform the initialization in the IRQ core layer for the
msi_domain_prepare_irqs() API, which looks much neater to me.

Signed-off-by: Zenghui Yu 
---

This was noticed when I was playing with the assigned devices on arm64 and
VFIO failed to enable MSI-X vectors for almost all VFs (CCed kvm list in
case others will hit the same issue). It turned out that these VFs are
marked as "shared" by mistake and have trouble with the following sequence:

pci_alloc_irq_vectors(pdev, 1, 1, flag);
pci_free_irq_vectors(pdev);
pci_alloc_irq_vectors(pdev, 1, 2, flag); --> we can only get
 *one* vector

But besides VFIO, I guess there are already some devices that get into
trouble at probe time and can't work properly.

 kernel/irq/msi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 2c0c4d6d0f83..dc0e2d7fbdfd 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -402,7 +402,7 @@ int __msi_domain_alloc_irqs(struct irq_domain *domain, 
struct device *dev,
struct msi_domain_ops *ops = info->ops;
struct irq_data *irq_data;
struct msi_desc *desc;
-   msi_alloc_info_t arg;
+   msi_alloc_info_t arg = { };
int i, ret, virq;
bool can_reserve;
 
-- 
2.19.1
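The one-line fix relies on C's zero-initializer semantics: every member of the aggregate is zeroed, so @flags can no longer carry stack garbage into the .prepare callback. A minimal userspace illustration (the struct below is a stand-in, not the real arch-specific msi_alloc_info_t):

```c
#include <assert.h>

/* Stand-in for msi_alloc_info_t; the real layout is arch-specific. */
typedef struct {
	unsigned int	flags;
	void		*scratchpad[2];
} fake_msi_alloc_info_t;

static unsigned int prepare_flags(void)
{
	/* The kernel spells this "= { }" (a GNU extension); same effect. */
	fake_msi_alloc_info_t arg = { 0 };

	/* A .prepare callback that never touches .flags now leaves it 0. */
	return arg.flags;
}
```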



[PATCH] KVM: Documentation: Update description of KVM_{GET,CLEAR}_DIRTY_LOG

2020-12-07 Thread Zenghui Yu
Update various words, including the wrong parameter name and the vague
description of the usage of "slot" field.

Signed-off-by: Zenghui Yu 
---
 Documentation/virt/kvm/api.rst | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 70254eaa5229..0eb236737f80 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -360,10 +360,9 @@ since the last call to this ioctl.  Bit 0 is the first 
page in the
 memory slot.  Ensure the entire structure is cleared to avoid padding
 issues.
 
-If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 specifies
-the address space for which you want to return the dirty bitmap.
-They must be less than the value that KVM_CHECK_EXTENSION returns for
-the KVM_CAP_MULTI_ADDRESS_SPACE capability.
+If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of slot field specifies
+the address space for which you want to return the dirty bitmap.  See
+KVM_SET_USER_MEMORY_REGION for details on the usage of slot field.
 
 The bits in the dirty bitmap are cleared before the ioctl returns, unless
 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled.  For more information,
@@ -4427,7 +4426,7 @@ to I/O ports.
 :Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
 :Architectures: x86, arm, arm64, mips
 :Type: vm ioctl
-:Parameters: struct kvm_dirty_log (in)
+:Parameters: struct kvm_clear_dirty_log (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -4454,10 +4453,9 @@ in KVM's dirty bitmap, and dirty tracking is re-enabled 
for that page
 (for example via write-protection, or by clearing the dirty bit in
 a page table entry).
 
-If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 specifies
-the address space for which you want to return the dirty bitmap.
-They must be less than the value that KVM_CHECK_EXTENSION returns for
-the KVM_CAP_MULTI_ADDRESS_SPACE capability.
+If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of slot field specifies
+the address space for which you want to clear the dirty status.  See
+KVM_SET_USER_MEMORY_REGION for details on the usage of slot field.
 
 This ioctl is mostly useful when KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
 is enabled; for more information, see the description of the capability.
-- 
2.19.1
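The split of the slot field can be sketched as follows (helper names are hypothetical; only the bit layout comes from the documentation above):

```c
#include <assert.h>
#include <stdint.h>

/* Bits 16-31 of the KVM "slot" field select the address space. */
static uint32_t slot_as_id(uint32_t slot)
{
	return slot >> 16;
}

/* Bits 0-15 are the memslot id itself. */
static uint32_t slot_id(uint32_t slot)
{
	return slot & 0xffff;
}
```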



Re: [PATCH] irqchip/gic-v3: Check SRE bit for GICv2 legacy support

2020-11-30 Thread Zenghui Yu

Hi Shameer,

On 2020/11/30 18:26, Shameer Kolothum wrote:

At present, the support for GICv2 backward compatibility on GICv3/v4
hardware is determined based on whether DT/ACPI provides a memory
mapped phys base address for GIC virtual CPU interface register(GICV).
This creates a problem where a QEMU guest booted with the default GIC (GICv2)
hangs when firmware falsely reports this address on systems that don't
have support for legacy mode.


So the problem is that BIOS has provided us a bogus GICC Structure.


As per GICv3/v4 spec, in an implementation that does not support legacy
operation, affinity routing and system register access are permanently
enabled. This means that the associated control bits are RAO/WI. Hence
use the ICC_SRE_EL1.SRE bit to decide whether hardware supports GICv2
mode in addition to the above firmware based check.

Signed-off-by: Shameer Kolothum 
---
On Hisilicon D06, UEFI sets the GIC MADT GICC gicv_base_address but the
GIC implementation on these boards doesn't have the GICv2 legacy support.
This results in, Guest boot hang when Qemu uses the default GIC option.

With this patch, the Qemu Guest with GICv2 now gracefully exits,
  "qemu-system-aarch64: host does not support in-kernel GICv2 emulation"

Not very sure there is a better way to detect this other than checking
the SRE bit as done in this patch(Of course, we will be fixing the UEFI
going forward).


Yes, I had seen the same problem on the D06. But I *do* think it's the
firmware that actually needs to be fixed.


Thanks,
Shameer

---
  drivers/irqchip/irq-gic-v3.c | 33 -
  1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 16fecc0febe8..15fa1eea45e4 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1835,6 +1835,27 @@ static void __init gic_populate_ppi_partitions(struct 
device_node *gic_node)
of_node_put(parts_node);
  }
  
+/* SRE bit being RAO/WI implies no GICv2 legacy mode support */


I'm wondering if this is a mandate of the architecture.


+static bool __init gic_gicv2_compatible(void)
+{
+   u32 org, val;
+
+   org = gic_read_sre();
+   if (!(org & ICC_SRE_EL1_SRE))
+   return true;
+
+   val = org & ~ICC_SRE_EL1_SRE;
+   gic_write_sre(val);
+
+   val = gic_read_sre();
+   gic_write_sre(org);
+
+   if (val & ICC_SRE_EL1_SRE)
+   return false;
+
+   return true;
+}
+
  static void __init gic_of_setup_kvm_info(struct device_node *node)
  {
int ret;
@@ -1851,10 +1872,12 @@ static void __init gic_of_setup_kvm_info(struct 
device_node *node)
   &gicv_idx))
gicv_idx = 1;
  
-	gicv_idx += 3;	/* Also skip GICD, GICC, GICH */

-   ret = of_address_to_resource(node, gicv_idx, &r);
-   if (!ret)
-   gic_v3_kvm_info.vcpu = r;
+   if (gic_gicv2_compatible()) {
+   gicv_idx += 3;  /* Also skip GICD, GICC, GICH */
+   ret = of_address_to_resource(node, gicv_idx, &r);
+   if (!ret)
+   gic_v3_kvm_info.vcpu = r;
+   }
  
  	gic_v3_kvm_info.has_v4 = gic_data.rdists.has_vlpis;

gic_v3_kvm_info.has_v4_1 = gic_data.rdists.has_rvpeid;
@@ -2164,7 +2187,7 @@ static void __init gic_acpi_setup_kvm_info(void)
  
  	gic_v3_kvm_info.maint_irq = irq;
  
-	if (acpi_data.vcpu_base) {

+   if (gic_gicv2_compatible() && acpi_data.vcpu_base) {
struct resource *vcpu = &gic_v3_kvm_info.vcpu;
  
  		vcpu->flags = IORESOURCE_MEM;


Thanks,
Zenghui
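The probe in the quoted patch can be modeled in userspace with a toy register that is either writable or RAO/WI (all names and the model itself are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define ICC_SRE_EL1_SRE	(1U << 0)

struct gic_model {
	uint32_t sre;
	int	 rao_wi;	/* no legacy mode: SRE reads-as-one, write-ignored */
};

static uint32_t read_sre(const struct gic_model *g)
{
	return g->rao_wi ? (g->sre | ICC_SRE_EL1_SRE) : g->sre;
}

static void write_sre(struct gic_model *g, uint32_t val)
{
	if (!g->rao_wi)
		g->sre = val;
}

/* Mirror of gic_gicv2_compatible() from the quoted patch. */
static int gicv2_compatible(struct gic_model *g)
{
	uint32_t org = read_sre(g);

	if (!(org & ICC_SRE_EL1_SRE))
		return 1;

	write_sre(g, org & ~ICC_SRE_EL1_SRE);	/* try to clear SRE */
	if (read_sre(g) & ICC_SRE_EL1_SRE) {
		write_sre(g, org);		/* restore and give up */
		return 0;
	}
	write_sre(g, org);			/* restore */
	return 1;
}
```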


[PATCH] KVM: arm64: vgic-v3: Drop the reporting of GICR_TYPER.Last for userspace

2020-11-17 Thread Zenghui Yu
It was recently reported that if GICR_TYPER is accessed before the RD base
address is set, we'll suffer from the unset @rdreg dereferencing. Oops...

gpa_t last_rdist_typer = rdreg->base + GICR_TYPER +
(rdreg->free_index - 1) * KVM_VGIC_V3_REDIST_SIZE;

It's "expected" that users will access registers in the redistributor if
the RD has been properly configured (e.g., the RD base address is set). But
it hasn't yet been covered by the existing documentation.

Per discussion on the list [1], the reporting of the GICR_TYPER.Last bit
for userspace never actually worked. And it's difficult for us to emulate
it correctly given that userspace has the flexibility to access it any
time. Let's just drop the reporting of the Last bit for userspace for now
(userspace should have full knowledge about it anyway); it at least
prevents the kernel from panicking ;-)

[1] https://lore.kernel.org/kvmarm/c20865a267e44d1e2c0d52ce4e012...@kernel.org/

Fixes: ba7b3f1275fd ("KVM: arm/arm64: Revisit Redistributor TYPER last bit 
computation")
Reported-by: Keqian Zhu 
Signed-off-by: Zenghui Yu 
---

This may be the easiest way to fix the issue and to get the fix backported
to the stable tree. There is still some work that can be done since (at least) we
have code duplicates between the MMIO and uaccess callbacks.

 arch/arm64/kvm/vgic/vgic-mmio-v3.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c 
b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
index 52d6f24f65dc..15a6c98ee92f 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
@@ -273,6 +273,23 @@ static unsigned long vgic_mmio_read_v3r_typer(struct 
kvm_vcpu *vcpu,
return extract_bytes(value, addr & 7, len);
 }
 
+static unsigned long vgic_uaccess_read_v3r_typer(struct kvm_vcpu *vcpu,
+gpa_t addr, unsigned int len)
+{
+   unsigned long mpidr = kvm_vcpu_get_mpidr_aff(vcpu);
+   int target_vcpu_id = vcpu->vcpu_id;
+   u64 value;
+
+   value = (u64)(mpidr & GENMASK(23, 0)) << 32;
+   value |= ((target_vcpu_id & 0xffff) << 8);
+
+   if (vgic_has_its(vcpu->kvm))
+   value |= GICR_TYPER_PLPIS;
+
+   /* reporting of the Last bit is not supported for userspace */
+   return extract_bytes(value, addr & 7, len);
+}
+
 static unsigned long vgic_mmio_read_v3r_iidr(struct kvm_vcpu *vcpu,
 gpa_t addr, unsigned int len)
 {
@@ -593,8 +610,9 @@ static const struct vgic_register_region 
vgic_v3_rd_registers[] = {
REGISTER_DESC_WITH_LENGTH(GICR_IIDR,
vgic_mmio_read_v3r_iidr, vgic_mmio_write_wi, 4,
VGIC_ACCESS_32bit),
-   REGISTER_DESC_WITH_LENGTH(GICR_TYPER,
-   vgic_mmio_read_v3r_typer, vgic_mmio_write_wi, 8,
+   REGISTER_DESC_WITH_LENGTH_UACCESS(GICR_TYPER,
+   vgic_mmio_read_v3r_typer, vgic_mmio_write_wi,
+   vgic_uaccess_read_v3r_typer, vgic_mmio_uaccess_write_wi, 8,
VGIC_ACCESS_64bit | VGIC_ACCESS_32bit),
REGISTER_DESC_WITH_LENGTH(GICR_WAKER,
vgic_mmio_read_raz, vgic_mmio_write_wi, 4,
-- 
2.19.1



Re: [PATCH 1/2] KVM: arm64: vgic: Forbid invalid userspace Redistributor accesses

2020-11-17 Thread Zenghui Yu

On 2020/11/17 16:49, Marc Zyngier wrote:

Hi Zenghui,

On 2020-11-16 14:57, Zenghui Yu wrote:

Hi Marc,

On 2020/11/16 22:10, Marc Zyngier wrote:

My take is that only if the "[Re]Distributor base address" is specified
in the system memory map, will the user-provided kvm_device_attr.offset
make sense. And we can then handle the access to the register which is
defined by "base address + offset".


I'd tend to agree, but it is just that this is a large change at -rc4.
I'd rather have a quick fix for 5.10, and a more invasive change for 5.11,
spanning all the possible vgic devices.


So you prefer fixing it by "return a value that doesn't have the Last
bit set" for v5.10? I'm ok with it and can send v2 for it.


Cool. Thanks for that.


Btw, looking again at the way we handle the user-reading of GICR_TYPER

vgic_mmio_read_v3r_typer(vcpu, addr, len)

it seems that @addr is actually the *offset* of GICR_TYPER (0x0008) and
@addr is unlikely to be equal to last_rdist_typer, which is the *GPA* of
the last RD. Looks like the user-reading of GICR_TYPER.Last is always
broken?


I think you are right. Somehow, we don't seem to track the index of
the RD in the region, so we can never compute the address of the RD
even if the base address is set.

Let's drop the reporting of Last for userspace for now, as it never
worked. If you post a patch addressing that quickly, I'll get it to
Paolo by the end of the week (there's another fix that needs merging).


OK. I'll fix it by providing a uaccess_read callback for GICR_TYPER.


Thanks,
Zenghui



Eric: do we have any test covering the userspace API?

Thanks,

     M.


Re: [PATCH 1/2] KVM: arm64: vgic: Forbid invalid userspace Redistributor accesses

2020-11-16 Thread Zenghui Yu

Hi Marc,

On 2020/11/16 22:10, Marc Zyngier wrote:

My take is that only if the "[Re]Distributor base address" is specified
in the system memory map, will the user-provided kvm_device_attr.offset
make sense. And we can then handle the access to the register which is
defined by "base address + offset".


I'd tend to agree, but it is just that this is a large change at -rc4.
I'd rather have a quick fix for 5.10, and a more invasive change for 5.11,
spanning all the possible vgic devices.


So you prefer fixing it by "return a value that doesn't have the Last
bit set" for v5.10? I'm ok with it and can send v2 for it.

Btw, looking again at the way we handle the user-reading of GICR_TYPER

vgic_mmio_read_v3r_typer(vcpu, addr, len)

it seems that @addr is actually the *offset* of GICR_TYPER (0x0008) and
@addr is unlikely to be equal to last_rdist_typer, which is the *GPA* of
the last RD. Looks like the user-reading of GICR_TYPER.Last is always
broken?


Thanks,
Zenghui


Re: [PATCH 1/2] KVM: arm64: vgic: Forbid invalid userspace Redistributor accesses

2020-11-16 Thread Zenghui Yu

Hi Marc,

On 2020/11/16 1:04, Marc Zyngier wrote:

Hi Zenghui,

On 2020-11-13 14:28, Zenghui Yu wrote:

It's expected that users will access registers in the redistributor *if*
the RD has been initialized properly. Unfortunately userspace can be 
bogus

enough to access registers before setting the RD base address, and KVM
implicitly allows it (we handle the access anyway, regardless of whether
the base address is set).

Bad thing happens when we're handling the user read of GICR_TYPER. We end
up with an oops when dereferencing the unset rdreg...

gpa_t last_rdist_typer = rdreg->base + GICR_TYPER +
    (rdreg->free_index - 1) * KVM_VGIC_V3_REDIST_SIZE;

Fix this issue by informing userspace what had gone wrong (-ENXIO).


I'm worried about the "implicit" aspect of the access that this patch
now forbids.

The problem is that the existing documentation doesn't cover this case, and -ENXIO's
"Getting or setting this register is not yet supported"
is way too vague.


Indeed. How about changing to

-ENXIO  Getting or setting this register is not yet supported
or VGIC not properly configured (e.g., [Re]Distributor base
address is unknown)

There is a precedent with the ITS, but that's 
undocumented

as well. Also, how about v2? If that's the way we are going to fix this,
we also need to beef up the documentation.


Sure, I plan to do so and hope it won't break the existing userspace.


Of course, the other horrible way to address the issue is to return a value
that doesn't have the Last bit set, since we can't synthetise it. It 
doesn't

change the userspace API, and I can even find some (admittedly twisted)
logic to it (since there is no base address, there is no last RD...).


I'm fine with it. But I'm afraid that there might be other issues due to
the "unexpected" accesses since I haven't tested with all registers from
userspace.

My take is that only if the "[Re]Distributor base address" is specified
in the system memory map, will the user-provided kvm_device_attr.offset
make sense. And we can then handle the access to the register which is
defined by "base address + offset".


Thanks,
Zenghui


[PATCH 2/2] KVM: arm64: vgic: Forbid invalid userspace Distributor accesses

2020-11-13 Thread Zenghui Yu
Accessing registers in the Distributor before setting its base address
should be treated as an invalid userspace behaviour. But KVM implicitly
allows it as we handle the access anyway, regardless of whether the base
address is set or not.

Fix this issue by informing userspace what had gone wrong (-ENXIO).

Signed-off-by: Zenghui Yu 
---
 arch/arm64/kvm/vgic/vgic-mmio-v3.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c 
b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
index 30e370585a27..6a9e5eb311f0 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
@@ -1029,11 +1029,15 @@ void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 
reg, bool allow_group1)
 int vgic_v3_dist_uaccess(struct kvm_vcpu *vcpu, bool is_write,
 int offset, u32 *val)
 {
+   struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
struct vgic_io_device dev = {
.regions = vgic_v3_dist_registers,
.nr_regions = ARRAY_SIZE(vgic_v3_dist_registers),
};
 
+   if (IS_VGIC_ADDR_UNDEF(dist->vgic_dist_base))
+   return -ENXIO;
+
	return vgic_uaccess(vcpu, &dev, is_write, offset, val);
 }
 
-- 
2.19.1



[PATCH 1/2] KVM: arm64: vgic: Forbid invalid userspace Redistributor accesses

2020-11-13 Thread Zenghui Yu
It's expected that users will access registers in the redistributor *if*
the RD has been initialized properly. Unfortunately userspace can be bogus
enough to access registers before setting the RD base address, and KVM
implicitly allows it (we handle the access anyway, regardless of whether
the base address is set).

Bad thing happens when we're handling the user read of GICR_TYPER. We end
up with an oops when dereferencing the unset rdreg...

gpa_t last_rdist_typer = rdreg->base + GICR_TYPER +
(rdreg->free_index - 1) * KVM_VGIC_V3_REDIST_SIZE;

Fix this issue by informing userspace what had gone wrong (-ENXIO).

Reported-by: Keqian Zhu 
Signed-off-by: Zenghui Yu 
---
 arch/arm64/kvm/vgic/vgic-mmio-v3.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c 
b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
index 52d6f24f65dc..30e370585a27 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
@@ -1040,11 +1040,15 @@ int vgic_v3_dist_uaccess(struct kvm_vcpu *vcpu, bool 
is_write,
 int vgic_v3_redist_uaccess(struct kvm_vcpu *vcpu, bool is_write,
   int offset, u32 *val)
 {
+   struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
struct vgic_io_device rd_dev = {
.regions = vgic_v3_rd_registers,
.nr_regions = ARRAY_SIZE(vgic_v3_rd_registers),
};
 
+   if (IS_VGIC_ADDR_UNDEF(vgic_cpu->rd_iodev.base_addr))
+   return -ENXIO;
+
	return vgic_uaccess(vcpu, &rd_dev, is_write, offset, val);
 }
 
-- 
2.19.1



[PATCH 0/2] KVM: arm64: vgic: Fix handling of userspace register accesses

2020-11-13 Thread Zenghui Yu
We had recently seen a kernel panic when accidently programming QEMU in an
inappropriate way (in short, accessing RD registers before setting the RD
base address. See patch #1 for details). And it looks like we're missing
some basic checking when handling userspace register access.

I've only tested it with QEMU. It'd be appreciated if others can test it
with other user tools.

Zenghui Yu (2):
  KVM: arm64: vgic: Forbid invalid userspace Redistributor accesses
  KVM: arm64: vgic: Forbid invalid userspace Distributor accesses

 arch/arm64/kvm/vgic/vgic-mmio-v3.c | 8 
 1 file changed, 8 insertions(+)

-- 
2.19.1



Re: [PATCH net] net: hns3: Clear the CMDQ registers before unmapping BAR region

2020-10-23 Thread Zenghui Yu

On 2020/10/23 14:22, Yunsheng Lin wrote:

On 2020/10/23 13:15, Zenghui Yu wrote:

When unbinding the hns3 driver with the HNS3 VF, I got the following
kernel panic:

[  265.709989] Unable to handle kernel paging request at virtual address 
800054627000
[  265.717928] Mem abort info:
[  265.720740]   ESR = 0x9647
[  265.723810]   EC = 0x25: DABT (current EL), IL = 32 bits
[  265.729126]   SET = 0, FnV = 0
[  265.732195]   EA = 0, S1PTW = 0
[  265.735351] Data abort info:
[  265.738227]   ISV = 0, ISS = 0x0047
[  265.742071]   CM = 0, WnR = 1
[  265.745055] swapper pgtable: 4k pages, 48-bit VAs, pgdp=09b54000
[  265.751753] [800054627000] pgd=202ff003, p4d=202ff003, 
pud=2020020eb003, pmd=0020a0dfc003, pte=
[  265.764314] Internal error: Oops: 9647 [#1] SMP
[  265.830357] CPU: 61 PID: 20319 Comm: bash Not tainted 5.9.0+ #206
[  265.836423] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 
09/18/2019
[  265.843873] pstate: 8049 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[  265.843890] pc : hclgevf_cmd_uninit+0xbc/0x300
[  265.861988] lr : hclgevf_cmd_uninit+0xb0/0x300
[  265.861992] sp : 80004c983b50
[  265.881411] pmr_save: 00e0
[  265.884453] x29: 80004c983b50 x28: 20280bbce500
[  265.889744] x27:  x26: 
[  265.895034] x25: 800011a1f000 x24: 800011a1fe90
[  265.900325] x23: 0020ce9b00d8 x22: 0020ce9b0150
[  265.905616] x21: 800010d70e90 x20: 800010d70e90
[  265.910906] x19: 0020ce9b0080 x18: 0004
[  265.916198] x17:  x16: 800011ae32e8
[  265.916201] x15: 0028 x14: 0002
[  265.916204] x13: 800011ae32e8 x12: 00012ad8
[  265.946619] x11: 80004c983b50 x10: 
[  265.951911] x9 : 8000115d0888 x8 : 
[  265.951914] x7 : 800011890b20 x6 : c0007fff
[  265.951917] x5 : 80004c983930 x4 : 0001
[  265.951919] x3 : a027eec1b000 x2 : 2b78ccbbff369100
[  265.964487] x1 :  x0 : 800054627000
[  265.964491] Call trace:
[  265.964494]  hclgevf_cmd_uninit+0xbc/0x300
[  265.964496]  hclgevf_uninit_ae_dev+0x9c/0xe8
[  265.964501]  hnae3_unregister_ae_dev+0xb0/0x130
[  265.964516]  hns3_remove+0x34/0x88 [hns3]
[  266.009683]  pci_device_remove+0x48/0xf0
[  266.009692]  device_release_driver_internal+0x114/0x1e8
[  266.030058]  device_driver_detach+0x28/0x38
[  266.034224]  unbind_store+0xd4/0x108
[  266.037784]  drv_attr_store+0x40/0x58
[  266.041435]  sysfs_kf_write+0x54/0x80
[  266.045081]  kernfs_fop_write+0x12c/0x250
[  266.049076]  vfs_write+0xc4/0x248
[  266.052378]  ksys_write+0x74/0xf8
[  266.055677]  __arm64_sys_write+0x24/0x30
[  266.059584]  el0_svc_common.constprop.3+0x84/0x270
[  266.064354]  do_el0_svc+0x34/0xa0
[  266.067658]  el0_svc+0x38/0x40
[  266.070700]  el0_sync_handler+0x8c/0xb0
[  266.074519]  el0_sync+0x140/0x180

It looks like the BAR memory region had already been unmapped before we
start clearing CMDQ registers in it, which is pretty bad and the kernel
happily kills itself because of a Current EL Data Abort (on arm64).

Moving the CMDQ uninitialization a bit early fixes the issue for me.


Yes, this seems an obvious bug, and the patch below seems the obvious fix too.

We have a testcase to test the loading and unloading process, but it does
not seem to catch this error, and testing it again using the latest
internal version seems ok too:


I don't have access to the internal version ;-) I use mainline instead.


[root@localhost ~]# rmmod hclgevf
[ 8035.010715] hns3 :7d:01.0 eth8: net stop
[ 8035.079917] hns3 :7d:01.0 eth8: link down
[ 8036.402491] hns3 :7d:01.1 eth9: net stop
[ 8036.472946] hns3 :7d:01.1 eth9: link down
[root@localhost ~]# rmmod hns3
[ 8045.938705] hns3 :bd:00.3 eth7: net stop
[ 8046.354308] hns3 :bd:00.2 eth6: net stop
[ 8046.627653] hns3 :bd:00.1 eth5: net stop
[ 8046.632251] hns3 :bd:00.1 eth5: link down
[ 8047.050235] hns3 :bd:00.0 eth4: net stop
[ 8047.054837] hns3 :bd:00.0 eth4: link down
[ 8047.340528] hns3 :7d:00.3 eth3: net stop
[ 8047.347633] hns3 :7d:00.3 eth3: link down
[ 8048.879299] hns3 :7d:00.2 eth2: net stop
[ 8050.271999] hns3 :7d:00.1 eth1: net stop
[ 8050.278755] hns3 :7d:00.1 eth1: link down
[ 8051.650607] pci :7d:01.0: Removing from iommu group 44
[ 8051.657909] pci :7d:01.1: Removing from iommu group 45
[ 8052.794671] hns3 :7d:00.0 eth0: net stop
[ 8052.800385] hns3 :7d:00.0 eth0: link down
[root@localhost ~]#
[root@localhost ~]#


Do you care to provide the testcase for the above calltrace?


I noticed it with VFIO, but it's easy to reproduce it manually. Here you
go:

  # cat /sys/bus/pci/devices/0000\:7d\:00.2/sriov_totalvfs
3
  # echo 3 > /sys/bus/pci/devices/0000\:7d\:00.2/sriov_numvfs
  # lspci | grep "Virtual Function"
7d:01.6 Ethernet controller: Huawei Technologies Co., 

[PATCH net] net: hns3: Clear the CMDQ registers before unmapping BAR region

2020-10-22 Thread Zenghui Yu
When unbinding the hns3 driver with the HNS3 VF, I got the following
kernel panic:

[  265.709989] Unable to handle kernel paging request at virtual address 
800054627000
[  265.717928] Mem abort info:
[  265.720740]   ESR = 0x9647
[  265.723810]   EC = 0x25: DABT (current EL), IL = 32 bits
[  265.729126]   SET = 0, FnV = 0
[  265.732195]   EA = 0, S1PTW = 0
[  265.735351] Data abort info:
[  265.738227]   ISV = 0, ISS = 0x0047
[  265.742071]   CM = 0, WnR = 1
[  265.745055] swapper pgtable: 4k pages, 48-bit VAs, pgdp=09b54000
[  265.751753] [800054627000] pgd=202ff003, p4d=202ff003, 
pud=2020020eb003, pmd=0020a0dfc003, pte=
[  265.764314] Internal error: Oops: 9647 [#1] SMP
[  265.830357] CPU: 61 PID: 20319 Comm: bash Not tainted 5.9.0+ #206
[  265.836423] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 
09/18/2019
[  265.843873] pstate: 8049 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[  265.843890] pc : hclgevf_cmd_uninit+0xbc/0x300
[  265.861988] lr : hclgevf_cmd_uninit+0xb0/0x300
[  265.861992] sp : 80004c983b50
[  265.881411] pmr_save: 00e0
[  265.884453] x29: 80004c983b50 x28: 20280bbce500
[  265.889744] x27:  x26: 
[  265.895034] x25: 800011a1f000 x24: 800011a1fe90
[  265.900325] x23: 0020ce9b00d8 x22: 0020ce9b0150
[  265.905616] x21: 800010d70e90 x20: 800010d70e90
[  265.910906] x19: 0020ce9b0080 x18: 0004
[  265.916198] x17:  x16: 800011ae32e8
[  265.916201] x15: 0028 x14: 0002
[  265.916204] x13: 800011ae32e8 x12: 00012ad8
[  265.946619] x11: 80004c983b50 x10: 
[  265.951911] x9 : 8000115d0888 x8 : 
[  265.951914] x7 : 800011890b20 x6 : c0007fff
[  265.951917] x5 : 80004c983930 x4 : 0001
[  265.951919] x3 : a027eec1b000 x2 : 2b78ccbbff369100
[  265.964487] x1 :  x0 : 800054627000
[  265.964491] Call trace:
[  265.964494]  hclgevf_cmd_uninit+0xbc/0x300
[  265.964496]  hclgevf_uninit_ae_dev+0x9c/0xe8
[  265.964501]  hnae3_unregister_ae_dev+0xb0/0x130
[  265.964516]  hns3_remove+0x34/0x88 [hns3]
[  266.009683]  pci_device_remove+0x48/0xf0
[  266.009692]  device_release_driver_internal+0x114/0x1e8
[  266.030058]  device_driver_detach+0x28/0x38
[  266.034224]  unbind_store+0xd4/0x108
[  266.037784]  drv_attr_store+0x40/0x58
[  266.041435]  sysfs_kf_write+0x54/0x80
[  266.045081]  kernfs_fop_write+0x12c/0x250
[  266.049076]  vfs_write+0xc4/0x248
[  266.052378]  ksys_write+0x74/0xf8
[  266.055677]  __arm64_sys_write+0x24/0x30
[  266.059584]  el0_svc_common.constprop.3+0x84/0x270
[  266.064354]  do_el0_svc+0x34/0xa0
[  266.067658]  el0_svc+0x38/0x40
[  266.070700]  el0_sync_handler+0x8c/0xb0
[  266.074519]  el0_sync+0x140/0x180

It looks like the BAR memory region had already been unmapped before we
start clearing CMDQ registers in it, which is pretty bad and the kernel
happily kills itself because of a Current EL Data Abort (on arm64).

Moving the CMDQ uninitialization a bit early fixes the issue for me.

Signed-off-by: Zenghui Yu 
---

I have almost zero knowledge about the hns3 driver. You can regard this
as a report and make a better fix if possible.

I can't even figure out how we could live with this issue for so long...
It seems to exist since commit 34f81f049e35 ("net: hns3: clear command
queue's registers when unloading VF driver"), where we started writing
something into the unmapped area.

 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 50c84c5e65d2..c8e3fdd5999c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -3262,8 +3262,8 @@ static void hclgevf_uninit_hdev(struct hclgevf_dev *hdev)
hclgevf_uninit_msi(hdev);
}
 
-   hclgevf_pci_uninit(hdev);
hclgevf_cmd_uninit(hdev);
+   hclgevf_pci_uninit(hdev);
hclgevf_uninit_mac_list(hdev);
 }
 
-- 
2.19.1



[PATCH] vfio/type1: Use the new helper to find vfio_group

2020-10-22 Thread Zenghui Yu
When attaching a new group to the container, let's use the new helper
vfio_iommu_find_iommu_group() to check if it's already attached. There
is no functional change.

Also take this chance to add a missing blank line.

Signed-off-by: Zenghui Yu 
---
 drivers/vfio/vfio_iommu_type1.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a05d856ae806..aa586bd684da 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1997,6 +1997,7 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu 
*iommu,
 
list_splice_tail(iova_copy, iova);
 }
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 struct iommu_group *iommu_group)
 {
@@ -2013,18 +2014,10 @@ static int vfio_iommu_type1_attach_group(void 
*iommu_data,
 
	mutex_lock(&iommu->lock);
 
-   list_for_each_entry(d, &iommu->domain_list, next) {
-   if (find_iommu_group(d, iommu_group)) {
-   mutex_unlock(&iommu->lock);
-   return -EINVAL;
-   }
-   }
-
-   if (iommu->external_domain) {
-   if (find_iommu_group(iommu->external_domain, iommu_group)) {
-   mutex_unlock(&iommu->lock);
-   return -EINVAL;
-   }
+   /* Check for duplicates */
+   if (vfio_iommu_find_iommu_group(iommu, iommu_group)) {
+   mutex_unlock(&iommu->lock);
+   return -EINVAL;
}
 
group = kzalloc(sizeof(*group), GFP_KERNEL);
-- 
2.19.1



Re: [PATCH v10 11/11] vfio: Document nested stage control

2020-09-24 Thread Zenghui Yu

Hi Eric,

On 2020/3/21 0:19, Eric Auger wrote:

The VFIO API was enhanced to support nested stage control: a bunch of
new iotcls, one DMA FAULT region and an associated specific IRQ.

Let's document the process to follow to set up nested mode.

Signed-off-by: Eric Auger 


[...]


+The userspace must be prepared to receive faults. The VFIO-PCI device
+exposes one dedicated DMA FAULT region: it contains a ring buffer and
+a header that allows managing the head/tail indices. The region is
+identified by the following index/subindex:
+- VFIO_REGION_TYPE_NESTED/VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT
+
+The DMA FAULT region exposes a VFIO_REGION_INFO_CAP_PRODUCER_FAULT
+region capability that allows the userspace to retrieve the ABI version
+of the fault records filled by the host.


Nit: I don't see this capability in the code.


Thanks,
Zenghui


Re: [PATCH v10 05/11] vfio/pci: Register an iommu fault handler

2020-09-24 Thread Zenghui Yu

Hi Eric,

On 2020/3/21 0:19, Eric Auger wrote:

Register an IOMMU fault handler which records faults in
the DMA FAULT region ring buffer. In a subsequent patch, we
will add the signaling of a specific eventfd to allow the
userspace to be notified whenever a new fault as shown up.

Signed-off-by: Eric Auger 


[...]


diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 586b89debed5..69595c240baf 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -27,6 +27,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include "vfio_pci_private.h"
  
@@ -283,6 +284,38 @@ static const struct vfio_pci_regops vfio_pci_dma_fault_regops = {

.add_capability = vfio_pci_dma_fault_add_capability,
  };
  
+int vfio_pci_iommu_dev_fault_handler(struct iommu_fault *fault, void *data)

+{
+   struct vfio_pci_device *vdev = (struct vfio_pci_device *)data;
+   struct vfio_region_dma_fault *reg =
+   (struct vfio_region_dma_fault *)vdev->fault_pages;
+   struct iommu_fault *new =
+   (struct iommu_fault *)(vdev->fault_pages + reg->offset +
+   reg->head * reg->entry_size);


Shouldn't 'reg->head' be protected under the fault_queue_lock? Otherwise
things may change behind our backs...

We shouldn't take any assumption about how IOMMU driver would report the
fault (serially or in parallel), I think.


+   int head, tail, size;
+   int ret = 0;
+
+   if (fault->type != IOMMU_FAULT_DMA_UNRECOV)
+   return -ENOENT;
+
+   mutex_lock(&vdev->fault_queue_lock);
+
+   head = reg->head;
+   tail = reg->tail;
+   size = reg->nb_entries;
+
+   if (CIRC_SPACE(head, tail, size) < 1) {
+   ret = -ENOSPC;
+   goto unlock;
+   }
+
+   *new = *fault;
+   reg->head = (head + 1) % size;
+unlock:
+   mutex_unlock(&vdev->fault_queue_lock);
+   return ret;
+}
+
  #define DMA_FAULT_RING_LENGTH 512
  
  static int vfio_pci_init_dma_fault_region(struct vfio_pci_device *vdev)

@@ -317,6 +350,13 @@ static int vfio_pci_init_dma_fault_region(struct 
vfio_pci_device *vdev)
header->entry_size = sizeof(struct iommu_fault);
header->nb_entries = DMA_FAULT_RING_LENGTH;
header->offset = sizeof(struct vfio_region_dma_fault);
+
+   ret = iommu_register_device_fault_handler(&vdev->pdev->dev,
+   vfio_pci_iommu_dev_fault_handler,
+   vdev);
+   if (ret)
+   goto out;
+
return 0;
  out:
kfree(vdev->fault_pages);



Thanks,
Zenghui


Re: [PATCH v10 04/11] vfio/pci: Add VFIO_REGION_TYPE_NESTED region type

2020-09-24 Thread Zenghui Yu

Hi Eric,

On 2020/3/21 0:19, Eric Auger wrote:

Add a new specific DMA_FAULT region aiming to expose nested mode
translation faults.

The region has a ring buffer that contains the actual fault
records plus a header allowing to handle it (tail/head indices,
max capacity, entry size). At the moment the region is dimensioned
for 512 fault records.

Signed-off-by: Eric Auger 


[...]


diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 379a02c36e37..586b89debed5 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -260,6 +260,69 @@ int vfio_pci_set_power_state(struct vfio_pci_device *vdev, 
pci_power_t state)
return ret;
  }
  
+static void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev,

+  struct vfio_pci_region *region)
+{
+}
+
+static int vfio_pci_dma_fault_add_capability(struct vfio_pci_device *vdev,
+struct vfio_pci_region *region,
+struct vfio_info_cap *caps)
+{
+   struct vfio_region_info_cap_fault cap = {
+   .header.id = VFIO_REGION_INFO_CAP_DMA_FAULT,
+   .header.version = 1,
+   .version = 1,
+   };
+   return vfio_info_add_capability(caps, &cap, sizeof(cap));
+}
+
+static const struct vfio_pci_regops vfio_pci_dma_fault_regops = {
+   .rw = vfio_pci_dma_fault_rw,
+   .release= vfio_pci_dma_fault_release,
+   .add_capability = vfio_pci_dma_fault_add_capability,
+};
+
+#define DMA_FAULT_RING_LENGTH 512
+
+static int vfio_pci_init_dma_fault_region(struct vfio_pci_device *vdev)
+{
+   struct vfio_region_dma_fault *header;
+   size_t size;
+   int ret;
+
+   mutex_init(&vdev->fault_queue_lock);
+
+   /*
+* We provision 1 page for the header and space for
+* DMA_FAULT_RING_LENGTH fault records in the ring buffer.
+*/
+   size = ALIGN(sizeof(struct iommu_fault) *
+DMA_FAULT_RING_LENGTH, PAGE_SIZE) + PAGE_SIZE;
+
+   vdev->fault_pages = kzalloc(size, GFP_KERNEL);
+   if (!vdev->fault_pages)
+   return -ENOMEM;
+
+   ret = vfio_pci_register_dev_region(vdev,
+   VFIO_REGION_TYPE_NESTED,
+   VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT,
+   &vfio_pci_dma_fault_regops, size,
+   VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
+   vdev->fault_pages);
+   if (ret)
+   goto out;
+
+   header = (struct vfio_region_dma_fault *)vdev->fault_pages;
+   header->entry_size = sizeof(struct iommu_fault);
+   header->nb_entries = DMA_FAULT_RING_LENGTH;
+   header->offset = sizeof(struct vfio_region_dma_fault);
+   return 0;
+out:
+   kfree(vdev->fault_pages);
+   return ret;
+}
+
  static int vfio_pci_enable(struct vfio_pci_device *vdev)
  {
struct pci_dev *pdev = vdev->pdev;
@@ -358,6 +421,10 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
}
}
  
+	ret = vfio_pci_init_dma_fault_region(vdev);

+   if (ret)
+   goto disable_exit;
+
vfio_pci_probe_mmaps(vdev);
  
  	return 0;

@@ -1383,6 +1450,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
  
	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);

kfree(vdev->region);
+   kfree(vdev->fault_pages);
mutex_destroy(>ioeventfds_lock);
  
  	if (!disable_idle_d3)

diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 8a2c7607d513..a392f50e3a99 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -119,6 +119,8 @@ struct vfio_pci_device {
int ioeventfds_nr;
struct eventfd_ctx  *err_trigger;
struct eventfd_ctx  *req_trigger;
+   u8  *fault_pages;


What's the reason to use 'u8'? It doesn't match the type of header, nor
the type of ring buffer.


+   struct mutex		fault_queue_lock;
	struct list_head	dummy_resources_list;
	struct mutex		ioeventfds_lock;
	struct list_head	ioeventfds_list;
@@ -150,6 +152,14 @@ extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device 
*vdev, char __user *buf,
  extern long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
   uint64_t data, int count, int fd);
  
+struct vfio_pci_fault_abi {

+   u32 entry_size;
+};


This is not used by this patch (and the whole series).


+
+extern size_t vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev,
+   char __user *buf, size_t count,
+   loff_t *ppos, bool iswrite);
+
  extern int vfio_pci_init_perm_bits(void);
  extern void vfio_pci_uninit_perm_bits(void);
  
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c

index 

Re: [PATCH v10 01/11] vfio: VFIO_IOMMU_SET_PASID_TABLE

2020-09-23 Thread Zenghui Yu

Hi Eric,

On 2020/3/21 0:19, Eric Auger wrote:

From: "Liu, Yi L" 

This patch adds an VFIO_IOMMU_SET_PASID_TABLE ioctl
which aims to pass the virtual iommu guest configuration
to the host. This latter takes the form of the so-called
PASID table.

Signed-off-by: Jacob Pan 
Signed-off-by: Liu, Yi L 
Signed-off-by: Eric Auger 


[...]


diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a177bf2c6683..bfacbd876ee1 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2172,6 +2172,43 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu 
*iommu,
return ret;
  }
  
+static void

+vfio_detach_pasid_table(struct vfio_iommu *iommu)
+{
+   struct vfio_domain *d;
+
+   mutex_lock(&iommu->lock);
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   iommu_detach_pasid_table(d->domain);
+   }
+   mutex_unlock(&iommu->lock);
+}
+
+static int
+vfio_attach_pasid_table(struct vfio_iommu *iommu,
+   struct vfio_iommu_type1_set_pasid_table *ustruct)
+{
+   struct vfio_domain *d;
+   int ret = 0;
+
+   mutex_lock(&iommu->lock);
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   ret = iommu_attach_pasid_table(d->domain, &ustruct->config);
+   if (ret)
+   goto unwind;
+   }
+   goto unlock;
+unwind:
+   list_for_each_entry_continue_reverse(d, &iommu->domain_list, next) {
+   iommu_detach_pasid_table(d->domain);
+   }
+unlock:
+   mutex_unlock(&iommu->lock);
+   return ret;
+}
+
  static long vfio_iommu_type1_ioctl(void *iommu_data,
   unsigned int cmd, unsigned long arg)
  {
@@ -2276,6 +2313,25 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
  
		return copy_to_user((void __user *)arg, &info, minsz) ?

-EFAULT : 0;
+   } else if (cmd == VFIO_IOMMU_SET_PASID_TABLE) {
+   struct vfio_iommu_type1_set_pasid_table ustruct;
+
+   minsz = offsetofend(struct vfio_iommu_type1_set_pasid_table,
+   config);
+
+   if (copy_from_user(&ustruct, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (ustruct.argsz < minsz)
+   return -EINVAL;
+
+   if (ustruct.flags & VFIO_PASID_TABLE_FLAG_SET)
+   return vfio_attach_pasid_table(iommu, &ustruct);
+   else if (ustruct.flags & VFIO_PASID_TABLE_FLAG_UNSET) {
+   vfio_detach_pasid_table(iommu);
+   return 0;
+   } else
+   return -EINVAL;


Nit:

What if user-space blindly sets both flags? Should we check that only one
flag is allowed to be set at this stage, and return error otherwise?

Besides, before going through the whole series [1][2], I'd like to know
if this is the latest version of your Nested-Stage-Setup work in case I
had missed something.

[1] https://lore.kernel.org/r/20200320161911.27494-1-eric.au...@redhat.com
[2] https://lore.kernel.org/r/20200414150607.28488-1-eric.au...@redhat.com


Thanks,
Zenghui


Re: [PATCH v2 2/2] vfio/pci: Don't regenerate vconfig for all BARs if !bardirty

2020-09-21 Thread Zenghui Yu

Hi Cornelia,

On 2020/9/21 18:21, Cornelia Huck wrote:

On Mon, 21 Sep 2020 12:51:16 +0800
Zenghui Yu  wrote:


Now we regenerate vconfig for all the BARs via vfio_bar_fixup(), every time
any offset of any of them are read. Though BARs aren't re-read regularly,
the regeneration can be avoid if no BARs had been written since they were


s/avoid/avoided/


last read, in which case the vdev->bardirty is false.


s/the//



Let's predicate the vfio_bar_fixup() on the bardirty so that it can return
immediately if !bardirty.


Maybe

"Let's return immediately in vfio_bar_fixup() if bardirty is false." ?


Yes.



Suggested-by: Alex Williamson 
Signed-off-by: Zenghui Yu 
---
* From v1:
   - Per Alex's suggestion, let vfio_bar_fixup() test vdev->bardirty to
 avoid doing work if bardirty is false, instead of removing it entirely.
   - Rewrite the commit message.

  drivers/vfio/pci/vfio_pci_config.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index d98843feddce..5e02ba07e8e8 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -467,6 +467,9 @@ static void vfio_bar_fixup(struct vfio_pci_device *vdev)
__le32 *vbar;
u64 mask;

+   if (!vdev->bardirty)


Finally, bardirty can actually affect something :)


+   return;
+
vbar = (__le32 *)&vdev->vconfig[PCI_BASE_ADDRESS_0];
  
  	for (i = 0; i < PCI_STD_NUM_BARS; i++, vbar++) {


Reviewed-by: Cornelia Huck 


Thanks for your review! I think Alex can help fix the commit message when
applying? Otherwise I can send a v3.


Thanks,
Zenghui


[PATCH v2 2/2] vfio/pci: Don't regenerate vconfig for all BARs if !bardirty

2020-09-20 Thread Zenghui Yu
Now we regenerate vconfig for all the BARs via vfio_bar_fixup(), every time
any offset of any of them are read. Though BARs aren't re-read regularly,
the regeneration can be avoid if no BARs had been written since they were
last read, in which case the vdev->bardirty is false.

Let's predicate the vfio_bar_fixup() on the bardirty so that it can return
immediately if !bardirty.

Suggested-by: Alex Williamson 
Signed-off-by: Zenghui Yu 
---
* From v1:
  - Per Alex's suggestion, let vfio_bar_fixup() test vdev->bardirty to
avoid doing work if bardirty is false, instead of removing it entirely.
  - Rewrite the commit message.

 drivers/vfio/pci/vfio_pci_config.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index d98843feddce..5e02ba07e8e8 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -467,6 +467,9 @@ static void vfio_bar_fixup(struct vfio_pci_device *vdev)
__le32 *vbar;
u64 mask;
 
+   if (!vdev->bardirty)
+   return;
+
vbar = (__le32 *)&vdev->vconfig[PCI_BASE_ADDRESS_0];
 
for (i = 0; i < PCI_STD_NUM_BARS; i++, vbar++) {
-- 
2.19.1



[PATCH v2 1/2] vfio/pci: Remove redundant declaration of vfio_pci_driver

2020-09-20 Thread Zenghui Yu
It was added by commit 137e5531351d ("vfio/pci: Add sriov_configure
support") but duplicates a forward declaration earlier in the file.

Remove it.

Signed-off-by: Zenghui Yu 
Reviewed-by: Cornelia Huck 
---
* From v1:
  - Clarify the commit message per Alex's suggestion.
  - Add Cornelia's R-b tag.

 drivers/vfio/pci/vfio_pci.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 1ab1f5cda4ac..da68e2f86622 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1862,7 +1862,6 @@ static const struct vfio_device_ops vfio_pci_ops = {
 
 static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
 static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
-static struct pci_driver vfio_pci_driver;
 
 static int vfio_pci_bus_notifier(struct notifier_block *nb,
 unsigned long action, void *data)
-- 
2.19.1



Re: [PATCH 2/2] vfio/pci: Remove bardirty from vfio_pci_device

2020-09-18 Thread Zenghui Yu

On 2020/9/19 10:11, Alex Williamson wrote:

On Sat, 19 Sep 2020 09:54:00 +0800
Zenghui Yu  wrote:


Hi Alex,

On 2020/9/18 6:07, Alex Williamson wrote:

On Thu, 17 Sep 2020 13:35:37 +0200
Cornelia Huck  wrote:
   

On Thu, 17 Sep 2020 11:31:28 +0800
Zenghui Yu  wrote:
  

It isn't clear what purpose the @bardirty serves. It might be used to avoid
the unnecessary vfio_bar_fixup() invoking on a user-space BAR read, which
is not required when bardirty is unset.

The variable was introduced in commit 89e1f7d4c66d ("vfio: Add PCI device
driver") but never actually used, so it shouldn't be that important. Remove
it.

Signed-off-by: Zenghui Yu 
---
   drivers/vfio/pci/vfio_pci_config.c  | 7 ---
   drivers/vfio/pci/vfio_pci_private.h | 1 -
   2 files changed, 8 deletions(-)


Yes, it seems to have been write-only all the time.


I suspect the intent was that vfio_bar_fixup() could test
vdev->bardirty to avoid doing work if no BARs had been written since
they were last read.  As it is now we regenerate vconfig for all the
BARs every time any offset of any of them are read.  BARs aren't
re-read regularly and config space is not a performance path,


Yes, it seems that Qemu itself emulates all BAR registers and will read
the BAR from the kernel side only at initialization time.


but maybe
we should instead test if we see any regressions from returning without
doing any work in vfio_bar_fixup() if vdev->bardirty is false.  Thanks,


I will test it with the following diff. Please let me know which way do
you prefer.

diff --git a/drivers/vfio/pci/vfio_pci_config.c
b/drivers/vfio/pci/vfio_pci_config.c
index d98843feddce..77c419d536d0 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -515,7 +515,7 @@ static int vfio_basic_config_read(struct
vfio_pci_device *vdev, int pos,
int count, struct perm_bits *perm,
int offset, __le32 *val)
   {
-   if (is_bar(offset)) /* pos == offset for basic config */
+   if (is_bar(offset) && vdev->bardirty) /* pos == offset for basic
config */
  vfio_bar_fixup(vdev);

  count = vfio_default_config_read(vdev, pos, count, perm,
offset, val);



There's only one caller currently, but I'd think it cleaner to put this
in vfio_bar_fixup(), ie. return immediately if !bardirty.  Thanks,


OK, I'll do that in the v2.


Thanks,
Zenghui


Re: [PATCH 1/2] vfio/pci: Remove redundant declaration of vfio_pci_driver

2020-09-18 Thread Zenghui Yu

On 2020/9/18 6:22, Alex Williamson wrote:

On Thu, 17 Sep 2020 11:31:27 +0800
Zenghui Yu  wrote:


It was added by commit 137e5531351d ("vfio/pci: Add sriov_configure
support") and actually unnecessary. Remove it.


Looks correct, but I might clarify as:

s/unnecessary/duplicates a forward declaration earlier in the file/

I can change on commit if you approve.  Thanks,


Indeed. Please help to change it.


Thanks,
Zenghui


Re: [PATCH 2/2] vfio/pci: Remove bardirty from vfio_pci_device

2020-09-18 Thread Zenghui Yu

Hi Alex,

On 2020/9/18 6:07, Alex Williamson wrote:

On Thu, 17 Sep 2020 13:35:37 +0200
Cornelia Huck  wrote:


On Thu, 17 Sep 2020 11:31:28 +0800
Zenghui Yu  wrote:


It isn't clear what purpose the @bardirty serves. It might be used to avoid
the unnecessary vfio_bar_fixup() invoking on a user-space BAR read, which
is not required when bardirty is unset.

The variable was introduced in commit 89e1f7d4c66d ("vfio: Add PCI device
driver") but never actually used, so it shouldn't be that important. Remove
it.

Signed-off-by: Zenghui Yu 
---
  drivers/vfio/pci/vfio_pci_config.c  | 7 ---
  drivers/vfio/pci/vfio_pci_private.h | 1 -
  2 files changed, 8 deletions(-)


Yes, it seems to have been write-only all the time.


I suspect the intent was that vfio_bar_fixup() could test
vdev->bardirty to avoid doing work if no BARs had been written since
they were last read.  As it is now we regenerate vconfig for all the
BARs every time any offset of any of them are read.  BARs aren't
re-read regularly and config space is not a performance path,


Yes, it seems that Qemu itself emulates all BAR registers and will read
the BAR from the kernel side only at initialization time.


but maybe
we should instead test if we see any regressions from returning without
doing any work in vfio_bar_fixup() if vdev->bardirty is false.  Thanks,


I will test it with the following diff. Please let me know which way do
you prefer.

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c

index d98843feddce..77c419d536d0 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -515,7 +515,7 @@ static int vfio_basic_config_read(struct 
vfio_pci_device *vdev, int pos,

  int count, struct perm_bits *perm,
  int offset, __le32 *val)
 {
-   if (is_bar(offset)) /* pos == offset for basic config */
+   if (is_bar(offset) && vdev->bardirty) /* pos == offset for basic 
config */

vfio_bar_fixup(vdev);

count = vfio_default_config_read(vdev, pos, count, perm, 
offset, val);



Thanks,
Zenghui


[PATCH 2/2] vfio/pci: Remove bardirty from vfio_pci_device

2020-09-16 Thread Zenghui Yu
It isn't clear what purpose the @bardirty serves. It might be used to avoid
the unnecessary vfio_bar_fixup() invoking on a user-space BAR read, which
is not required when bardirty is unset.

The variable was introduced in commit 89e1f7d4c66d ("vfio: Add PCI device
driver") but never actually used, so it shouldn't be that important. Remove
it.

Signed-off-by: Zenghui Yu 
---
 drivers/vfio/pci/vfio_pci_config.c  | 7 ---
 drivers/vfio/pci/vfio_pci_private.h | 1 -
 2 files changed, 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index d98843feddce..e93b287fea02 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -507,8 +507,6 @@ static void vfio_bar_fixup(struct vfio_pci_device *vdev)
*vbar &= cpu_to_le32((u32)mask);
} else
*vbar = 0;
-
-   vdev->bardirty = false;
 }
 
 static int vfio_basic_config_read(struct vfio_pci_device *vdev, int pos,
@@ -633,9 +631,6 @@ static int vfio_basic_config_write(struct vfio_pci_device 
*vdev, int pos,
}
}
 
-   if (is_bar(offset))
-   vdev->bardirty = true;
-
return count;
 }
 
@@ -1697,8 +1692,6 @@ int vfio_config_init(struct vfio_pci_device *vdev)
if (ret)
goto out;
 
-   vdev->bardirty = true;
-
/*
 * XXX can we just pci_load_saved_state/pci_restore_state?
 * may need to rebuild vconfig after that
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 61ca8ab165dc..dc96a0fda548 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -122,7 +122,6 @@ struct vfio_pci_device {
boolvirq_disabled;
boolreset_works;
boolextended_caps;
-   boolbardirty;
boolhas_vga;
boolneeds_reset;
boolnointx;
-- 
2.19.1



[PATCH 1/2] vfio/pci: Remove redundant declaration of vfio_pci_driver

2020-09-16 Thread Zenghui Yu
It was added by commit 137e5531351d ("vfio/pci: Add sriov_configure
support") and actually unnecessary. Remove it.

Signed-off-by: Zenghui Yu 
---
 drivers/vfio/pci/vfio_pci.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 1ab1f5cda4ac..da68e2f86622 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1862,7 +1862,6 @@ static const struct vfio_device_ops vfio_pci_ops = {
 
 static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
 static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
-static struct pci_driver vfio_pci_driver;
 
 static int vfio_pci_bus_notifier(struct notifier_block *nb,
 unsigned long action, void *data)
-- 
2.19.1



[PATCH] vfio: Fix typo of the device_state

2020-09-10 Thread Zenghui Yu
A typo fix ("_RUNNNG" => "_RUNNING") in comment block of the uapi header.

Signed-off-by: Zenghui Yu 
---
 include/uapi/linux/vfio.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 920470502329..d4bd39e124bf 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -462,7 +462,7 @@ struct vfio_region_gfx_edid {
  * 5. Resumed
  *  |->|
  *
- * 0. Default state of VFIO device is _RUNNNG when the user application starts.
+ * 0. Default state of VFIO device is _RUNNING when the user application 
starts.
  * 1. During normal shutdown of the user application, the user application may
  *optionally change the VFIO device state from _RUNNING to _STOP. This
  *transition is optional. The vendor driver must support this transition 
but
-- 
2.19.1



[PATCH] iommu/arm-smmu-v3: Fix l1 stream table size in the error message

2020-08-26 Thread Zenghui Yu
The actual size of the level-1 stream table is l1size. This looks like an
oversight in commit d2e88e7c081ef ("iommu/arm-smmu: Fix LOG2SIZE setting
for 2-level stream tables"), which forgot to update the @size in the error
message as well.

As memory allocation failure is already bad enough, nothing worse would
happen. But let's be careful.

Signed-off-by: Zenghui Yu 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index c192544e874b..bb458d0c7b73 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3280,7 +3280,7 @@ static int arm_smmu_init_strtab_2lvl(struct 
arm_smmu_device *smmu)
if (!strtab) {
dev_err(smmu->dev,
"failed to allocate l1 stream table (%u bytes)\n",
-   size);
+   l1size);
return -ENOMEM;
}
cfg->strtab = strtab;
-- 
2.19.1



[PATCH v2 2/2] ACPI/IORT: Remove the unused inline functions

2020-08-18 Thread Zenghui Yu
Since commit 8212688600ed ("ACPI/IORT: Fix build error when IOMMU_SUPPORT
is disabled"), iort_fwspec_iommu_ops() and iort_add_device_replay() are not
needed anymore when CONFIG_IOMMU_API is not selected. Let's remove them.

Signed-off-by: Zenghui Yu 
---
 drivers/acpi/arm64/iort.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index a0ece0e201b2..e670785a6201 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1086,10 +1086,6 @@ const struct iommu_ops *iort_iommu_configure_id(struct 
device *dev,
 }
 
 #else
-static inline const struct iommu_ops *iort_fwspec_iommu_ops(struct device *dev)
-{ return NULL; }
-static inline int iort_add_device_replay(struct device *dev)
-{ return 0; }
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
 { return 0; }
 const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
-- 
2.19.1



[PATCH v2 0/2] ACPI/IORT: Code cleanups

2020-08-18 Thread Zenghui Yu
* From v1 [1]:
  - As pointed out by Hanjun, remove two now unused inline functions.
Compile-tested with CONFIG_IOMMU_API not selected.

[1] https://lore.kernel.org/r/20200817105946.1511-1-yuzeng...@huawei.com

Zenghui Yu (2):
  ACPI/IORT: Drop the unused @ops of iort_add_device_replay()
  ACPI/IORT: Remove the unused inline functions

 drivers/acpi/arm64/iort.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

-- 
2.19.1



[PATCH v2 1/2] ACPI/IORT: Drop the unused @ops of iort_add_device_replay()

2020-08-18 Thread Zenghui Yu
Since commit d2e1a003af56 ("ACPI/IORT: Don't call iommu_ops->add_device
directly"), we use the IOMMU core API to replace a direct invocation of the
specified callback. The parameter @ops has therefore become unused. Let's
drop it.

Signed-off-by: Zenghui Yu 
---
 drivers/acpi/arm64/iort.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index ec782e4a0fe4..a0ece0e201b2 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -811,8 +811,7 @@ static inline const struct iommu_ops 
*iort_fwspec_iommu_ops(struct device *dev)
return (fwspec && fwspec->ops) ? fwspec->ops : NULL;
 }
 
-static inline int iort_add_device_replay(const struct iommu_ops *ops,
-struct device *dev)
+static inline int iort_add_device_replay(struct device *dev)
 {
int err = 0;
 
@@ -1072,7 +1071,7 @@ const struct iommu_ops *iort_iommu_configure_id(struct 
device *dev,
 */
if (!err) {
ops = iort_fwspec_iommu_ops(dev);
-   err = iort_add_device_replay(ops, dev);
+   err = iort_add_device_replay(dev);
}
 
/* Ignore all other errors apart from EPROBE_DEFER */
@@ -1089,8 +1088,7 @@ const struct iommu_ops *iort_iommu_configure_id(struct 
device *dev,
 #else
 static inline const struct iommu_ops *iort_fwspec_iommu_ops(struct device *dev)
 { return NULL; }
-static inline int iort_add_device_replay(const struct iommu_ops *ops,
-struct device *dev)
+static inline int iort_add_device_replay(struct device *dev)
 { return 0; }
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
 { return 0; }
-- 
2.19.1



Re: [PATCH] ACPI/IORT: Drop the unused @ops of iort_add_device_replay()

2020-08-18 Thread Zenghui Yu

On 2020/8/18 11:49, Hanjun Guo wrote:

On 2020/8/17 18:59, Zenghui Yu wrote:

Since commit d2e1a003af56 ("ACPI/IORT: Don't call iommu_ops->add_device
directly"), we use the IOMMU core API to replace a direct invocation of the
specified callback. The parameter @ops has therefore become unused. Let's
drop it.

Signed-off-by: Zenghui Yu 
---
  drivers/acpi/arm64/iort.c | 8 +++-
  1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index ec782e4a0fe4..a0ece0e201b2 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -811,8 +811,7 @@ static inline const struct iommu_ops 
*iort_fwspec_iommu_ops(struct device *dev)

  return (fwspec && fwspec->ops) ? fwspec->ops : NULL;
  }
-static inline int iort_add_device_replay(const struct iommu_ops *ops,
- struct device *dev)
+static inline int iort_add_device_replay(struct device *dev)
  {
  int err = 0;
@@ -1072,7 +1071,7 @@ const struct iommu_ops 
*iort_iommu_configure_id(struct device *dev,

   */
  if (!err) {
  ops = iort_fwspec_iommu_ops(dev);
-    err = iort_add_device_replay(ops, dev);
+    err = iort_add_device_replay(dev);
  }
  /* Ignore all other errors apart from EPROBE_DEFER */
@@ -1089,8 +1088,7 @@ const struct iommu_ops 
*iort_iommu_configure_id(struct device *dev,

  #else
  static inline const struct iommu_ops *iort_fwspec_iommu_ops(struct 
device *dev)

  { return NULL; }
-static inline int iort_add_device_replay(const struct iommu_ops *ops,
- struct device *dev)
+static inline int iort_add_device_replay(struct device *dev)


inline functions iort_fwspec_iommu_ops() and iort_add_device_replay()
are not needed anymore after commit 8212688600ed ("ACPI/IORT: Fix build
error when IOMMU_SUPPORT is disabled"), could you please add another
patch to remove them as well?


Sure, I will remove them in v2. Thanks for the reminder.


Zenghui


[PATCH] ACPI/IORT: Drop the unused @ops of iort_add_device_replay()

2020-08-17 Thread Zenghui Yu
Since commit d2e1a003af56 ("ACPI/IORT: Don't call iommu_ops->add_device
directly"), we use the IOMMU core API to replace a direct invocation of the
specified callback. The parameter @ops has therefore become unused. Let's
drop it.

Signed-off-by: Zenghui Yu 
---
 drivers/acpi/arm64/iort.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index ec782e4a0fe4..a0ece0e201b2 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -811,8 +811,7 @@ static inline const struct iommu_ops 
*iort_fwspec_iommu_ops(struct device *dev)
return (fwspec && fwspec->ops) ? fwspec->ops : NULL;
 }
 
-static inline int iort_add_device_replay(const struct iommu_ops *ops,
-struct device *dev)
+static inline int iort_add_device_replay(struct device *dev)
 {
int err = 0;
 
@@ -1072,7 +1071,7 @@ const struct iommu_ops *iort_iommu_configure_id(struct 
device *dev,
 */
if (!err) {
ops = iort_fwspec_iommu_ops(dev);
-   err = iort_add_device_replay(ops, dev);
+   err = iort_add_device_replay(dev);
}
 
/* Ignore all other errors apart from EPROBE_DEFER */
@@ -1089,8 +1088,7 @@ const struct iommu_ops *iort_iommu_configure_id(struct 
device *dev,
 #else
 static inline const struct iommu_ops *iort_fwspec_iommu_ops(struct device *dev)
 { return NULL; }
-static inline int iort_add_device_replay(const struct iommu_ops *ops,
-struct device *dev)
+static inline int iort_add_device_replay(struct device *dev)
 { return 0; }
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
 { return 0; }
-- 
2.19.1



[PATCH] driver core: Use the ktime_us_delta() helper

2020-08-02 Thread Zenghui Yu
Use the ktime_us_delta() helper to measure the driver probe time. Given that
the helper already returns an s64 value, let's drop the unnecessary cast to
s64 as well. There is no functional change.

Signed-off-by: Zenghui Yu 
---
 drivers/base/dd.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 3cefe89e94ca..0ef5b2f32932 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -618,15 +618,14 @@ static int really_probe(struct device *dev, struct 
device_driver *drv)
  */
 static int really_probe_debug(struct device *dev, struct device_driver *drv)
 {
-   ktime_t calltime, delta, rettime;
+   ktime_t calltime, rettime;
int ret;
 
calltime = ktime_get();
ret = really_probe(dev, drv);
rettime = ktime_get();
-   delta = ktime_sub(rettime, calltime);
pr_debug("probe of %s returned %d after %lld usecs\n",
-dev_name(dev), ret, (s64) ktime_to_us(delta));
+dev_name(dev), ret, ktime_us_delta(rettime, calltime));
return ret;
 }
 
-- 
2.19.1



Re: [PATCH] irqchip/gic-v4.1: Use GFP_ATOMIC flag in allocate_vpe_l1_table()

2020-07-26 Thread Zenghui Yu

Hi Marc,

On 2020/6/30 21:37, Zenghui Yu wrote:

Booting the latest kernel with DEBUG_ATOMIC_SLEEP=y on a GICv4.1 enabled
box, I get the following kernel splat:

[0.053766] BUG: sleeping function called from invalid context at 
mm/slab.h:567
[0.053767] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, 
name: swapper/1
[0.053769] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.8.0-rc3+ #23
[0.053770] Call trace:
[0.053774]  dump_backtrace+0x0/0x218
[0.053775]  show_stack+0x2c/0x38
[0.053777]  dump_stack+0xc4/0x10c
[0.053779]  ___might_sleep+0xfc/0x140
[0.053780]  __might_sleep+0x58/0x90
[0.053782]  slab_pre_alloc_hook+0x7c/0x90
[0.053783]  kmem_cache_alloc_trace+0x60/0x2f0
[0.053785]  its_cpu_init+0x6f4/0xe40
[0.053786]  gic_starting_cpu+0x24/0x38
[0.053788]  cpuhp_invoke_callback+0xa0/0x710
[0.053789]  notify_cpu_starting+0xcc/0xd8
[0.053790]  secondary_start_kernel+0x148/0x200

 # ./scripts/faddr2line vmlinux its_cpu_init+0x6f4/0xe40
its_cpu_init+0x6f4/0xe40:
allocate_vpe_l1_table at drivers/irqchip/irq-gic-v3-its.c:2818
(inlined by) its_cpu_init_lpis at drivers/irqchip/irq-gic-v3-its.c:3138
(inlined by) its_cpu_init at drivers/irqchip/irq-gic-v3-its.c:5166

It turned out that we're allocating memory using GFP_KERNEL (may sleep)
within the CPU hotplug notifier, which is indeed an atomic context. Bad
things may happen if we're running on a system with more than a single
CommonLPIAff group. Avoid it by turning this into an atomic allocation.

Fixes: 5e5168461c22 ("irqchip/gic-v4.1: VPE table (aka GICR_VPROPBASER) 
allocation")
Signed-off-by: Zenghui Yu 
---
 drivers/irqchip/irq-gic-v3-its.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 6a5a87fc4601..b66eeca442c4 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2814,7 +2814,7 @@ static int allocate_vpe_l1_table(void)
if (val & GICR_VPROPBASER_4_1_VALID)
goto out;
 
-	gic_data_rdist()->vpe_table_mask = kzalloc(sizeof(cpumask_t), GFP_KERNEL);

+   gic_data_rdist()->vpe_table_mask = kzalloc(sizeof(cpumask_t), 
GFP_ATOMIC);
if (!gic_data_rdist()->vpe_table_mask)
return -ENOMEM;
 
@@ -2881,7 +2881,7 @@ static int allocate_vpe_l1_table(void)
 
 	pr_debug("np = %d, npg = %lld, psz = %d, epp = %d, esz = %d\n",

 np, npg, psz, epp, esz);
-   page = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(np * PAGE_SIZE));
+   page = alloc_pages(GFP_ATOMIC | __GFP_ZERO, get_order(np * PAGE_SIZE));
if (!page)
return -ENOMEM;
 


Do you mind taking this patch for v5.9? Or please let me know if you
still have any concerns about it.


Thanks,
Zenghui


[PATCH v2] irqchip/gic-v4.1: Ensure accessing the correct RD when writing INVALLR

2020-07-20 Thread Zenghui Yu
The GICv4.1 spec tells us that it's CONSTRAINED UNPREDICTABLE to issue a
register-based invalidation operation for a vPEID not mapped to that RD,
or another RD within the same CommonLPIAff group.

To follow this rule, commit f3a059219bc7 ("irqchip/gic-v4.1: Ensure mutual
exclusion between vPE affinity change and RD access") tried to address the
race between the RD accesses and the vPE affinity change, but somehow
forgot to take GICR_INVALLR into account. Let's take the vpe_lock before
evaluating vpe->col_idx to fix it.

Fixes: f3a059219bc7 ("irqchip/gic-v4.1: Ensure mutual exclusion between vPE 
affinity change and RD access")
Cc: sta...@vger.kernel.org
Signed-off-by: Zenghui Yu 
---
* From v1:
  - Add a proper Fixes: tag
  - Cc stable

 drivers/irqchip/irq-gic-v3-its.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index beac4caefad9..5bf2263bdbcc 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -4087,18 +4087,22 @@ static void its_vpe_4_1_deschedule(struct its_vpe *vpe,
 static void its_vpe_4_1_invall(struct its_vpe *vpe)
 {
void __iomem *rdbase;
+   unsigned long flags;
u64 val;
+   int cpu;
 
val  = GICR_INVALLR_V;
val |= FIELD_PREP(GICR_INVALLR_VPEID, vpe->vpe_id);
 
/* Target the redistributor this vPE is currently known on */
-   raw_spin_lock(&gic_data_rdist_cpu(vpe->col_idx)->rd_lock);
-   rdbase = per_cpu_ptr(gic_rdists->rdist, vpe->col_idx)->rd_base;
+   cpu = vpe_to_cpuid_lock(vpe, &flags);
+   raw_spin_lock(&gic_data_rdist_cpu(cpu)->rd_lock);
+   rdbase = per_cpu_ptr(gic_rdists->rdist, cpu)->rd_base;
gic_write_lpir(val, rdbase + GICR_INVALLR);
 
wait_for_syncr(rdbase);
-   raw_spin_unlock(&gic_data_rdist_cpu(vpe->col_idx)->rd_lock);
+   raw_spin_unlock(&gic_data_rdist_cpu(cpu)->rd_lock);
+   vpe_to_cpuid_unlock(vpe, flags);
 }
 
 static int its_vpe_4_1_set_vcpu_affinity(struct irq_data *d, void *vcpu_info)
-- 
2.19.1



Re: [PATCH] irqchip/gic-v4.1: Ensure accessing the correct RD when writing INVALLR

2020-07-19 Thread Zenghui Yu

Hi Marc,

On 2020/7/17 19:07, Marc Zyngier wrote:

On Thu, 09 Jul 2020 14:49:59 +0100,
Zenghui Yu  wrote:


The GICv4.1 spec tells us that it's CONSTRAINED UNPREDICTABLE to issue a
register-based invalidation operation for a vPEID not mapped to that RD,
or another RD within the same CommonLPIAff group.

To follow this rule, commit f3a059219bc7 ("irqchip/gic-v4.1: Ensure mutual
exclusion between vPE affinity change and RD access") tried to address the
race between the RD accesses and the vPE affinity change, but somehow
forgot to take GICR_INVALLR into account. Let's take the vpe_lock before
evaluating vpe->col_idx to fix it.

Signed-off-by: Zenghui Yu 


Shouldn't this deserve a Fixes: tag?


Yes, I think a

Fixes: f3a059219bc7 ("irqchip/gic-v4.1: Ensure mutual exclusion between 
vPE affinity change and RD access")


should be enough. Should I resend a version with the tag added?


Thanks,
Zenghui


[PATCH] genirq/irqdomain: Remove redundant NULL pointer check on fwnode

2020-07-16 Thread Zenghui Yu
The is_fwnode_irqchip() helper already checks whether the fwnode_handle is
NULL. There is no need to perform a redundant check outside of it.

Signed-off-by: Zenghui Yu 
---
 kernel/irq/irqdomain.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index a4c2c915511d..155460fc0147 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -142,7 +142,7 @@ struct irq_domain *__irq_domain_add(struct fwnode_handle 
*fwnode, int size,
if (!domain)
return NULL;
 
-   if (fwnode && is_fwnode_irqchip(fwnode)) {
+   if (is_fwnode_irqchip(fwnode)) {
fwid = container_of(fwnode, struct irqchip_fwid, fwnode);
 
switch (fwid->type) {
-- 
2.19.1



[PATCH] irqchip/gic-v4.1: Ensure accessing the correct RD when writing INVALLR

2020-07-09 Thread Zenghui Yu
The GICv4.1 spec tells us that it's CONSTRAINED UNPREDICTABLE to issue a
register-based invalidation operation for a vPEID not mapped to that RD,
or another RD within the same CommonLPIAff group.

To follow this rule, commit f3a059219bc7 ("irqchip/gic-v4.1: Ensure mutual
exclusion between vPE affinity change and RD access") tried to address the
race between the RD accesses and the vPE affinity change, but somehow
forgot to take GICR_INVALLR into account. Let's take the vpe_lock before
evaluating vpe->col_idx to fix it.

Signed-off-by: Zenghui Yu 
---
 drivers/irqchip/irq-gic-v3-its.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index da44bfa48bc2..50a04cca8207 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -4087,18 +4087,22 @@ static void its_vpe_4_1_deschedule(struct its_vpe *vpe,
 static void its_vpe_4_1_invall(struct its_vpe *vpe)
 {
void __iomem *rdbase;
+   unsigned long flags;
u64 val;
+   int cpu;
 
val  = GICR_INVALLR_V;
val |= FIELD_PREP(GICR_INVALLR_VPEID, vpe->vpe_id);
 
/* Target the redistributor this vPE is currently known on */
-   raw_spin_lock(&gic_data_rdist_cpu(vpe->col_idx)->rd_lock);
-   rdbase = per_cpu_ptr(gic_rdists->rdist, vpe->col_idx)->rd_base;
+   cpu = vpe_to_cpuid_lock(vpe, &flags);
+   raw_spin_lock(&gic_data_rdist_cpu(cpu)->rd_lock);
+   rdbase = per_cpu_ptr(gic_rdists->rdist, cpu)->rd_base;
gic_write_lpir(val, rdbase + GICR_INVALLR);
 
wait_for_syncr(rdbase);
-   raw_spin_unlock(&gic_data_rdist_cpu(vpe->col_idx)->rd_lock);
+   raw_spin_unlock(&gic_data_rdist_cpu(cpu)->rd_lock);
+   vpe_to_cpuid_unlock(vpe, flags);
 }
 
 static int its_vpe_4_1_set_vcpu_affinity(struct irq_data *d, void *vcpu_info)
-- 
2.19.1



Re: [REPORT] possible circular locking dependency when booting a VM on arm64 host

2020-07-09 Thread Zenghui Yu

On 2020/7/9 18:54, Salil Mehta wrote:

Hi Yuzenghui,
I will try to reproduce it today on our platform. Just one question: is it easily
reproducible, or is it a rare occurrence?


Salil, it's 100% reproducible once you start a guest. You don't even
need to assign hostdev to the VM.


Thanks,
Zenghui


[REPORT] possible circular locking dependency when booting a VM on arm64 host

2020-07-09 Thread Zenghui Yu

Hi All,

I had seen the following lockdep splat when booting a guest on my
Kunpeng 920 with GICv4 enabled. I can also trigger the same splat
on v5.5 so it should already exist in the kernel for a while. I'm
not sure what the exact problem is and hope someone can have a look!


Thanks,
Zenghui

[  103.855511] ==
[  103.861664] WARNING: possible circular locking dependency detected
[  103.867817] 5.8.0-rc4+ #35 Tainted: GW
[  103.872932] --
[  103.879083] CPU 2/KVM/20515 is trying to acquire lock:
[  103.884200] 202fcd5865b0 (&irq_desc_lock_class){-.-.}-{2:2}, at: 
__irq_get_desc_lock+0x60/0xa0

[  103.893127]
   but task is already holding lock:
[  103.898933] 202fcfd07f58 (&rq->lock){-.-.}-{2:2}, at: 
__schedule+0x114/0x8b8

[  103.906301]
   which lock already depends on the new lock.

[  103.914441]
   the existing dependency chain (in reverse order) is:
[  103.921888]
   -> #3 (&rq->lock){-.-.}-{2:2}:
[  103.927438]_raw_spin_lock+0x54/0x70
[  103.931605]task_fork_fair+0x48/0x150
[  103.935860]sched_fork+0x100/0x268
[  103.939856]copy_process+0x628/0x1868
[  103.944106]_do_fork+0x74/0x710
[  103.947840]kernel_thread+0x78/0xa0
[  103.951917]rest_init+0x30/0x270
[  103.955742]arch_call_rest_init+0x14/0x1c
[  103.960339]start_kernel+0x534/0x568
[  103.964503]
   -> #2 (&p->pi_lock){-.-.}-{2:2}:
[  103.970224]_raw_spin_lock_irqsave+0x70/0x98
[  103.975080]try_to_wake_up+0x5c/0x5b0
[  103.979330]wake_up_process+0x28/0x38
[  103.983581]create_worker+0x128/0x1b8
[  103.987834]workqueue_init+0x308/0x3bc
[  103.992172]kernel_init_freeable+0x180/0x33c
[  103.997027]kernel_init+0x18/0x118
[  104.001020]ret_from_fork+0x10/0x18
[  104.005097]
   -> #1 (&pool->lock){-.-.}-{2:2}:
[  104.010817]_raw_spin_lock+0x54/0x70
[  104.014983]__queue_work+0x120/0x6e8
[  104.019146]queue_work_on+0xa0/0xd8
[  104.023225]irq_set_affinity_locked+0xa8/0x178
[  104.028253]__irq_set_affinity+0x5c/0x90
[  104.032762]irq_set_affinity_hint+0x74/0xb0
[  104.037540]hns3_nic_init_irq+0xe0/0x210 [hns3]
[  104.042655]hns3_client_init+0x2d8/0x4e0 [hns3]
[  104.047779]hclge_init_client_instance+0xf0/0x3a8 [hclge]
[  104.053760]hnae3_init_client_instance.part.3+0x30/0x68 [hnae3]
[  104.060257]hnae3_register_ae_dev+0x100/0x1f0 [hnae3]
[  104.065892]hns3_probe+0x60/0xa8 [hns3]
[  104.070319]local_pci_probe+0x44/0x98
[  104.074573]work_for_cpu_fn+0x20/0x30
[  104.078823]process_one_work+0x258/0x618
[  104.08]worker_thread+0x1c0/0x438
[  104.087585]kthread+0x120/0x128
[  104.091318]ret_from_fork+0x10/0x18
[  104.095394]
   -> #0 (&irq_desc_lock_class){-.-.}-{2:2}:
[  104.101895]__lock_acquire+0x11bc/0x1530
[  104.106406]lock_acquire+0x100/0x3f8
[  104.110570]_raw_spin_lock_irqsave+0x70/0x98
[  104.115426]__irq_get_desc_lock+0x60/0xa0
[  104.120021]irq_set_vcpu_affinity+0x48/0xc8
[  104.124793]its_make_vpe_non_resident+0x6c/0xc0
[  104.129910]vgic_v4_put+0x64/0x70
[  104.133815]vgic_v3_put+0x28/0x100
[  104.137806]kvm_vgic_put+0x3c/0x60
[  104.141801]kvm_arch_vcpu_put+0x38/0x58
[  104.146228]kvm_sched_out+0x38/0x58
[  104.150306]__schedule+0x554/0x8b8
[  104.154298]schedule+0x50/0xe0
[  104.157946]kvm_arch_vcpu_ioctl_run+0x644/0x9e8
[  104.163063]kvm_vcpu_ioctl+0x4b4/0x918
[  104.167403]ksys_ioctl+0xb4/0xd0
[  104.171222]__arm64_sys_ioctl+0x28/0xc8
[  104.175647]el0_svc_common.constprop.2+0x74/0x138
[  104.180935]do_el0_svc+0x34/0xa0
[  104.184755]el0_sync_handler+0xec/0x128
[  104.189179]el0_sync+0x140/0x180
[  104.192997]
   other info that might help us debug this:

[  104.200962] Chain exists of:
 &irq_desc_lock_class --> &p->pi_lock --> &rq->lock

[  104.211261]  Possible unsafe locking scenario:

[  104.217152]CPU0CPU1
[  104.221660]
[  104.226170]   lock(&rq->lock);
[  104.229210]lock(&p->pi_lock);
[  104.234930]lock(&rq->lock);
[  104.240474]   lock(&irq_desc_lock_class);
[  104.244465]
*** DEADLOCK ***

[  104.250356] 2 locks held by CPU 2/KVM/20515:
[  104.254606]  #0: 202fa95680c8 (&vcpu->mutex){+.+.}-{3:3}, at: 
kvm_vcpu_ioctl+0x80/0x918
[  104.262921]  #1: 202fcfd07f58 (&rq->lock){-.-.}-{2:2}, at: 
__schedule+0x114/0x8b8

[  104.270717]
   stack backtrace:
[  104.275057] CPU: 73 PID: 20515 Comm: CPU 2/KVM Kdump: loaded Tainted: 
GW 5.8.0-rc4+ #35
[  

[PATCH] drm/hisilicon/hibmc: Move drm_fbdev_generic_setup() down to avoid the splat

2020-07-06 Thread Zenghui Yu
The HiSilicon hibmc driver triggers a splat at boot time as below

[   14.137806] [ cut here ]
[   14.142405] hibmc-drm :0a:00.0: Device has not been registered.
[   14.148661] WARNING: CPU: 0 PID: 496 at drivers/gpu/drm/drm_fb_helper.c:2233 
drm_fbdev_generic_setup+0x15c/0x1b8
[   14.158787] [...]
[   14.278307] Call trace:
[   14.280742]  drm_fbdev_generic_setup+0x15c/0x1b8
[   14.285337]  hibmc_pci_probe+0x354/0x418
[   14.289242]  local_pci_probe+0x44/0x98
[   14.292974]  work_for_cpu_fn+0x20/0x30
[   14.296708]  process_one_work+0x1c4/0x4e0
[   14.300698]  worker_thread+0x2c8/0x528
[   14.304431]  kthread+0x138/0x140
[   14.307646]  ret_from_fork+0x10/0x18
[   14.311205] ---[ end trace a2000ec2d838af4d ]---

This turned out to be because the fbdev device hasn't been registered when
drm_fbdev_generic_setup() is invoked. Let's fix the splat by moving the
call down after drm_dev_register(), following the "Display driver example"
documented by commit de99f0600a79 ("drm/drv: DOC: Add driver example
code").

Signed-off-by: Zenghui Yu 
---
 drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c 
b/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c
index a6fd0c29e5b8..544b9993c99e 100644
--- a/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c
+++ b/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c
@@ -307,8 +307,6 @@ static int hibmc_load(struct drm_device *dev)
/* reset all the states of crtc/plane/encoder/connector */
drm_mode_config_reset(dev);
 
-   drm_fbdev_generic_setup(dev, dev->mode_config.preferred_depth);
-
return 0;
 
 err:
@@ -355,6 +353,9 @@ static int hibmc_pci_probe(struct pci_dev *pdev,
  ret);
goto err_unload;
}
+
+   drm_fbdev_generic_setup(dev, dev->mode_config.preferred_depth);
+
return 0;
 
 err_unload:
-- 
2.19.1



[PATCH] irqchip/gic-v3: Remove the unused register definition

2020-06-30 Thread Zenghui Yu
As per the GICv3 specification, GIC{D,R}_SEIR are not assigned and the
locations (0x0068) are actually Reserved. GICR_MOV{LPI,ALL}R are two IMP
DEF registers that might be implemented by some specific micro-architecture;
Linux doesn't use them.

As they're not used anywhere in the kernel, just drop all of them.

Signed-off-by: Zenghui Yu 
---

Compile tested on top of mainline.

 include/linux/irqchip/arm-gic-v3.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/include/linux/irqchip/arm-gic-v3.h 
b/include/linux/irqchip/arm-gic-v3.h
index 6c36b6cc3edf..f6d092fdb93d 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -19,7 +19,6 @@
 #define GICD_CLRSPI_NSR0x0048
 #define GICD_SETSPI_SR 0x0050
 #define GICD_CLRSPI_SR 0x0058
-#define GICD_SEIR  0x0068
 #define GICD_IGROUPR   0x0080
 #define GICD_ISENABLER 0x0100
 #define GICD_ICENABLER 0x0180
@@ -119,14 +118,11 @@
 #define GICR_WAKER 0x0014
 #define GICR_SETLPIR   0x0040
 #define GICR_CLRLPIR   0x0048
-#define GICR_SEIR  GICD_SEIR
 #define GICR_PROPBASER 0x0070
 #define GICR_PENDBASER 0x0078
 #define GICR_INVLPIR   0x00A0
 #define GICR_INVALLR   0x00B0
 #define GICR_SYNCR 0x00C0
-#define GICR_MOVLPIR   0x0100
-#define GICR_MOVALLR   0x0110
 #define GICR_IDREGSGICD_IDREGS
 #define GICR_PIDR2 GICD_PIDR2
 
-- 
2.19.1



[PATCH] irqchip/gic-v4.1: Use GFP_ATOMIC flag in allocate_vpe_l1_table()

2020-06-30 Thread Zenghui Yu
Booting the latest kernel with DEBUG_ATOMIC_SLEEP=y on a GICv4.1 enabled
box, I get the following kernel splat:

[0.053766] BUG: sleeping function called from invalid context at 
mm/slab.h:567
[0.053767] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, 
name: swapper/1
[0.053769] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.8.0-rc3+ #23
[0.053770] Call trace:
[0.053774]  dump_backtrace+0x0/0x218
[0.053775]  show_stack+0x2c/0x38
[0.053777]  dump_stack+0xc4/0x10c
[0.053779]  ___might_sleep+0xfc/0x140
[0.053780]  __might_sleep+0x58/0x90
[0.053782]  slab_pre_alloc_hook+0x7c/0x90
[0.053783]  kmem_cache_alloc_trace+0x60/0x2f0
[0.053785]  its_cpu_init+0x6f4/0xe40
[0.053786]  gic_starting_cpu+0x24/0x38
[0.053788]  cpuhp_invoke_callback+0xa0/0x710
[0.053789]  notify_cpu_starting+0xcc/0xd8
[0.053790]  secondary_start_kernel+0x148/0x200

 # ./scripts/faddr2line vmlinux its_cpu_init+0x6f4/0xe40
its_cpu_init+0x6f4/0xe40:
allocate_vpe_l1_table at drivers/irqchip/irq-gic-v3-its.c:2818
(inlined by) its_cpu_init_lpis at drivers/irqchip/irq-gic-v3-its.c:3138
(inlined by) its_cpu_init at drivers/irqchip/irq-gic-v3-its.c:5166

It turned out that we're allocating memory using GFP_KERNEL (may sleep)
within the CPU hotplug notifier, which is indeed an atomic context. Bad
things may happen on a system with more than a single CommonLPIAff
group. Avoid that by turning this into an atomic allocation.

Fixes: 5e5168461c22 ("irqchip/gic-v4.1: VPE table (aka GICR_VPROPBASER) 
allocation")
Signed-off-by: Zenghui Yu 
---
 drivers/irqchip/irq-gic-v3-its.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 6a5a87fc4601..b66eeca442c4 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2814,7 +2814,7 @@ static int allocate_vpe_l1_table(void)
if (val & GICR_VPROPBASER_4_1_VALID)
goto out;
 
-   gic_data_rdist()->vpe_table_mask = kzalloc(sizeof(cpumask_t), 
GFP_KERNEL);
+   gic_data_rdist()->vpe_table_mask = kzalloc(sizeof(cpumask_t), 
GFP_ATOMIC);
if (!gic_data_rdist()->vpe_table_mask)
return -ENOMEM;
 
@@ -2881,7 +2881,7 @@ static int allocate_vpe_l1_table(void)
 
pr_debug("np = %d, npg = %lld, psz = %d, epp = %d, esz = %d\n",
 np, npg, psz, epp, esz);
-   page = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(np * PAGE_SIZE));
+   page = alloc_pages(GFP_ATOMIC | __GFP_ZERO, get_order(np * PAGE_SIZE));
if (!page)
return -ENOMEM;
 
-- 
2.19.1
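[For reference, alloc_pages() takes an allocation order rather than a byte
count, and get_order() does the rounding. Below is a minimal userspace
sketch of that rounding, assuming 4 KiB pages; the kernel's own generic
implementation lives in include/asm-generic/getorder.h.]

```c
#include <assert.h>

/*
 * Userspace sketch of the kernel's get_order(): the smallest order such
 * that (PAGE_SIZE << order) >= size.  Assumes 4 KiB pages.
 */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

static int get_order(unsigned long size)
{
	int order = -1;

	size = (size - 1) >> (PAGE_SHIFT - 1);
	do {
		size >>= 1;
		order++;
	} while (size);

	return order;
}
```

So a five-page VPE table (np = 5) is rounded up to an order-3, i.e.
eight-page, allocation.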



[tip: irq/urgent] irqchip/gic-v4.1: Use readx_poll_timeout_atomic() to fix sleep in atomic

2020-06-30 Thread tip-bot2 for Zenghui Yu
The following commit has been merged into the irq/urgent branch of tip:

Commit-ID: 31dbb6b1d025506b3b8b8b74e9b697df47b9f696
Gitweb:
https://git.kernel.org/tip/31dbb6b1d025506b3b8b8b74e9b697df47b9f696
Author:Zenghui Yu 
AuthorDate:Fri, 05 Jun 2020 13:23:45 +08:00
Committer: Marc Zyngier 
CommitterDate: Sun, 21 Jun 2020 15:13:11 +01:00

irqchip/gic-v4.1: Use readx_poll_timeout_atomic() to fix sleep in atomic

readx_poll_timeout() can sleep if @sleep_us is specified by the caller,
and is therefore unsafe to use in atomic context, which is
the case when we use it to poll the GICR_VPENDBASER.Dirty bit in
the irq_set_vcpu_affinity() callback.

Let's convert to its atomic version instead which helps to get the v4.1
board back to life!

Fixes: 96806229ca03 ("irqchip/gic-v4.1: Add support for VPENDBASER's 
Dirty+Valid signaling")
Signed-off-by: Zenghui Yu 
Signed-off-by: Marc Zyngier 
Link: https://lore.kernel.org/r/20200605052345.1494-1-yuzeng...@huawei.com
---
 drivers/irqchip/irq-gic-v3-its.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index cd685f5..6a5a87f 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -3797,10 +3797,10 @@ static void its_wait_vpt_parse_complete(void)
if (!gic_rdists->has_vpend_valid_dirty)
return;
 
-   WARN_ON_ONCE(readq_relaxed_poll_timeout(vlpi_base + GICR_VPENDBASER,
-   val,
-   !(val & GICR_VPENDBASER_Dirty),
-   10, 500));
+   WARN_ON_ONCE(readq_relaxed_poll_timeout_atomic(vlpi_base + 
GICR_VPENDBASER,
+  val,
+  !(val & 
GICR_VPENDBASER_Dirty),
+  10, 500));
 }
 
 static void its_vpe_schedule(struct its_vpe *vpe)


Re: [BUG] irqchip/gic-v4.1: sleeping function called from invalid context

2020-06-29 Thread Zenghui Yu

Hi Marc,

On 2020/6/29 22:01, Marc Zyngier wrote:

Hi Zenghui,

On 2020-06-29 10:39, Zenghui Yu wrote:

Hi All,

Booting the latest kernel with DEBUG_ATOMIC_SLEEP=y on a GICv4.1 enabled
box, I get the following kernel splat:

[0.053766] BUG: sleeping function called from invalid context at
mm/slab.h:567
[0.053767] in_atomic(): 1, irqs_disabled(): 128, non_block: 0,
pid: 0, name: swapper/1
[0.053769] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.8.0-rc3+ #23
[0.053770] Call trace:
[0.053774]  dump_backtrace+0x0/0x218
[0.053775]  show_stack+0x2c/0x38
[0.053777]  dump_stack+0xc4/0x10c
[0.053779]  ___might_sleep+0xfc/0x140
[0.053780]  __might_sleep+0x58/0x90
[0.053782]  slab_pre_alloc_hook+0x7c/0x90
[0.053783]  kmem_cache_alloc_trace+0x60/0x2f0
[0.053785]  its_cpu_init+0x6f4/0xe40
[0.053786]  gic_starting_cpu+0x24/0x38
[0.053788]  cpuhp_invoke_callback+0xa0/0x710
[0.053789]  notify_cpu_starting+0xcc/0xd8
[0.053790]  secondary_start_kernel+0x148/0x200

# ./scripts/faddr2line vmlinux its_cpu_init+0x6f4/0xe40
its_cpu_init+0x6f4/0xe40:
allocate_vpe_l1_table at drivers/irqchip/irq-gic-v3-its.c:2818
(inlined by) its_cpu_init_lpis at drivers/irqchip/irq-gic-v3-its.c:3138
(inlined by) its_cpu_init at drivers/irqchip/irq-gic-v3-its.c:5166


Let me guess: a system with more than a single CommonLPIAff group?


I *think* you're right. E.g., when we're allocating vpe_table_mask for
the first CPU of the second CommonLPIAff group.

The truth is that all the GICv4.1 boards I have on hand only have a
single CommonLPIAff group. Just to get the above backtrace, I did some
crazy hacking on my 920 and pretended it was v4.1 capable (well, please
ignore me). Hopefully I can get a new GICv4.1 board with more than one
CommonLPIAff group next month and do more tests.


I've tried to replace GFP_KERNEL flag with GFP_ATOMIC to allocate memory
in this atomic context, and the splat disappears. But after a quick look
at [*], it seems not a good idea to allocate memory within the CPU
hotplug notifier. I really don't know much about it, please have a look.

[*]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11e37d357f6ba7a9af850a872396082cc0a0001f 



The allocation of the cpumask is pretty benign, and could either be
allocated upfront for all RDs (and freed on detecting that we share
the same CommonLPIAff group) or made atomic.

The much bigger issue is the alloc_pages call just after. Allocating this
upfront probably is the wrong thing to do, as you are likely to allocate
way too much memory, even if you free it quickly afterwards.

At this stage, I'd rather we turn this into an atomic allocation. A notifier
is just another atomic context, and if this fails at such an early stage,
then the CPU is unlikely to continue booting...


Got it.


Would you like to write a patch for this? Given that you have tested
something, it probably already exists. Or do you want me to do it?


Yes, I had written something like below. I will add a commit message and
send it out today.

diff --git a/drivers/irqchip/irq-gic-v3-its.c 
b/drivers/irqchip/irq-gic-v3-its.c

index 6a5a87fc4601..b66eeca442c4 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2814,7 +2814,7 @@ static int allocate_vpe_l1_table(void)
if (val & GICR_VPROPBASER_4_1_VALID)
goto out;

-   gic_data_rdist()->vpe_table_mask = kzalloc(sizeof(cpumask_t), 
GFP_KERNEL);
+   gic_data_rdist()->vpe_table_mask = kzalloc(sizeof(cpumask_t), 
GFP_ATOMIC);
if (!gic_data_rdist()->vpe_table_mask)
return -ENOMEM;

@@ -2881,7 +2881,7 @@ static int allocate_vpe_l1_table(void)

pr_debug("np = %d, npg = %lld, psz = %d, epp = %d, esz = %d\n",
 np, npg, psz, epp, esz);
-   page = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(np * PAGE_SIZE));
+   page = alloc_pages(GFP_ATOMIC | __GFP_ZERO, get_order(np * PAGE_SIZE));
if (!page)
return -ENOMEM;


Thanks,
Zenghui


[BUG] irqchip/gic-v4.1: sleeping function called from invalid context

2020-06-29 Thread Zenghui Yu

Hi All,

Booting the latest kernel with DEBUG_ATOMIC_SLEEP=y on a GICv4.1 enabled
box, I get the following kernel splat:

[0.053766] BUG: sleeping function called from invalid context at 
mm/slab.h:567
[0.053767] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 
0, name: swapper/1

[0.053769] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.8.0-rc3+ #23
[0.053770] Call trace:
[0.053774]  dump_backtrace+0x0/0x218
[0.053775]  show_stack+0x2c/0x38
[0.053777]  dump_stack+0xc4/0x10c
[0.053779]  ___might_sleep+0xfc/0x140
[0.053780]  __might_sleep+0x58/0x90
[0.053782]  slab_pre_alloc_hook+0x7c/0x90
[0.053783]  kmem_cache_alloc_trace+0x60/0x2f0
[0.053785]  its_cpu_init+0x6f4/0xe40
[0.053786]  gic_starting_cpu+0x24/0x38
[0.053788]  cpuhp_invoke_callback+0xa0/0x710
[0.053789]  notify_cpu_starting+0xcc/0xd8
[0.053790]  secondary_start_kernel+0x148/0x200

# ./scripts/faddr2line vmlinux its_cpu_init+0x6f4/0xe40
its_cpu_init+0x6f4/0xe40:
allocate_vpe_l1_table at drivers/irqchip/irq-gic-v3-its.c:2818
(inlined by) its_cpu_init_lpis at drivers/irqchip/irq-gic-v3-its.c:3138
(inlined by) its_cpu_init at drivers/irqchip/irq-gic-v3-its.c:5166

I've tried to replace GFP_KERNEL flag with GFP_ATOMIC to allocate memory
in this atomic context, and the splat disappears. But after a quick look
at [*], it seems not a good idea to allocate memory within the CPU
hotplug notifier. I really don't know much about it, please have a look.

[*] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11e37d357f6ba7a9af850a872396082cc0a0001f



Thanks,
Zenghui


Re: [PATCH] drm/msm/dpu: Fix usage of ERR_PTR()

2020-06-20 Thread Zenghui Yu

ping for this obvious fix...

On 2020/5/28 21:08, Zenghui Yu wrote:

ERR_PTR() is used in the kernel to encode a usual *negative* errno code
into a pointer.  Passing a positive value (ENOMEM) to it will break the
following IS_ERR() check.

Though memory allocation is unlikely to fail, it's still worth fixing.
And grepping shows that this is the only misuse of ERR_PTR() in the kernel.

Fixes: 25fdd5933e4c ("drm/msm: Add SDM845 DPU support")
Signed-off-by: Zenghui Yu 
---
  drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c 
b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c
index a1b79ee2bd9d..a2f6b688a976 100644
--- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c
+++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c
@@ -2173,7 +2173,7 @@ struct drm_encoder *dpu_encoder_init(struct drm_device 
*dev,
  
  	dpu_enc = devm_kzalloc(dev->dev, sizeof(*dpu_enc), GFP_KERNEL);

if (!dpu_enc)
-   return ERR_PTR(ENOMEM);
+   return ERR_PTR(-ENOMEM);
  
	rc = drm_encoder_init(dev, &dpu_enc->base, &dpu_encoder_funcs,

drm_enc_mode, NULL);



Re: [PATCH] irqchip/gic-v4.1: Use readx_poll_timeout_atomic() to fix sleep in atomic

2020-06-10 Thread Zenghui Yu

Hi Marc,

Sorry to ping you in the merge window, but ...

On 2020/6/5 13:23, Zenghui Yu wrote:

readx_poll_timeout() can sleep if @sleep_us is specified by the caller,
and is therefore unsafe to use in atomic context, which is
the case when we use it to poll the GICR_VPENDBASER.Dirty bit in
the irq_set_vcpu_affinity() callback.


this seems like an urgent thing to me. Without this patch, CPUs easily
get stuck on my board with GICv4.1 enabled. So it'd be good if
you can have a look and take this as a fix (if it is correct).


Thanks,
Zenghui



Let's convert to its atomic version instead which helps to get the v4.1
board back to life!

Fixes: 96806229ca03 ("irqchip/gic-v4.1: Add support for VPENDBASER's Dirty+Valid 
signaling")
Signed-off-by: Zenghui Yu 
---
  drivers/irqchip/irq-gic-v3-its.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index cd685f521c77..6a5a87fc4601 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -3797,10 +3797,10 @@ static void its_wait_vpt_parse_complete(void)
if (!gic_rdists->has_vpend_valid_dirty)
return;
  
-	WARN_ON_ONCE(readq_relaxed_poll_timeout(vlpi_base + GICR_VPENDBASER,

-   val,
-   !(val & GICR_VPENDBASER_Dirty),
-   10, 500));
+   WARN_ON_ONCE(readq_relaxed_poll_timeout_atomic(vlpi_base + 
GICR_VPENDBASER,
+  val,
+  !(val & 
GICR_VPENDBASER_Dirty),
+  10, 500));
  }
  
  static void its_vpe_schedule(struct its_vpe *vpe)




[PATCH] irqchip/gic-v4.1: Use readx_poll_timeout_atomic() to fix sleep in atomic

2020-06-04 Thread Zenghui Yu
readx_poll_timeout() can sleep if @sleep_us is specified by the caller,
and is therefore unsafe to use in atomic context, which is
the case when we use it to poll the GICR_VPENDBASER.Dirty bit in
the irq_set_vcpu_affinity() callback.

Let's convert to its atomic version instead which helps to get the v4.1
board back to life!

Fixes: 96806229ca03 ("irqchip/gic-v4.1: Add support for VPENDBASER's 
Dirty+Valid signaling")
Signed-off-by: Zenghui Yu 
---
 drivers/irqchip/irq-gic-v3-its.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index cd685f521c77..6a5a87fc4601 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -3797,10 +3797,10 @@ static void its_wait_vpt_parse_complete(void)
if (!gic_rdists->has_vpend_valid_dirty)
return;
 
-   WARN_ON_ONCE(readq_relaxed_poll_timeout(vlpi_base + GICR_VPENDBASER,
-   val,
-   !(val & GICR_VPENDBASER_Dirty),
-   10, 500));
+   WARN_ON_ONCE(readq_relaxed_poll_timeout_atomic(vlpi_base + 
GICR_VPENDBASER,
+  val,
+  !(val & 
GICR_VPENDBASER_Dirty),
+  10, 500));
 }
 
 static void its_vpe_schedule(struct its_vpe *vpe)
-- 
2.19.1
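[The only difference between the two flavours is how they wait between
reads: readx_poll_timeout() may sleep between attempts, while the _atomic
variant busy-waits with udelay() and is therefore legal with interrupts
disabled. A hypothetical userspace mock of the retry logic follows; the
mocked register and the bit position used here are illustrative stand-ins,
not the real GICR_VPENDBASER programming model.]

```c
#include <assert.h>

#define VPENDBASER_DIRTY	(1UL << 60)	/* illustrative Dirty bit */

/* Mock register: reads as Dirty a few times, then as clean. */
static int dirty_reads_left = 3;

static unsigned long read_vpendbaser(void)
{
	if (dirty_reads_left > 0) {
		dirty_reads_left--;
		return VPENDBASER_DIRTY;
	}
	return 0;
}

/*
 * Stand-in for the readq_relaxed_poll_timeout_atomic() pattern: keep
 * re-reading until the condition holds or we run out of attempts.  A
 * real implementation would udelay() between reads instead of sleeping,
 * which is what makes it safe in atomic context.
 * Returns 0 on success, -1 on timeout.
 */
static int poll_dirty_clear(int max_reads)
{
	for (int i = 0; i < max_reads; i++) {
		if (!(read_vpendbaser() & VPENDBASER_DIRTY))
			return 0;
		/* udelay(10) would go here in the kernel */
	}
	return -1;
}
```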



Re: [PATCH RFC] KVM: arm64: Sidestep stage2_unmap_vm() on vcpu reset when S2FWB is supported

2020-05-31 Thread Zenghui Yu

Hi Alex,

On 2020/5/30 18:46, Alexandru Elisei wrote:

Hi,

On 4/20/20 5:10 PM, Alexandru Elisei wrote:


[ For some unknown reason, I missed your reply one month ago.
  Sorry, I'm going to fix my email settings ... ]


Hi,

On 4/15/20 8:28 AM, Zenghui Yu wrote:

stage2_unmap_vm() was introduced to unmap user RAM region in the stage2
page table to make the caches coherent. E.g., a guest reboot with stage1
MMU disabled will access memory using non-cacheable attributes. If the
RAM and caches are not coherent at this stage, some evicted dirty cache
line may go and corrupt guest data in RAM.

Since ARMv8.4, S2FWB feature is mandatory and KVM will take advantage
of it to configure the stage2 page table and the attributes of memory
access. So we ensure that guests always access memory using cacheable
attributes and thus, the caches always be coherent.

So on CPUs that support S2FWB, we can safely reset the vcpu without a
heavy stage2 unmapping.

Cc: Marc Zyngier 
Cc: Christoffer Dall 
Cc: James Morse 
Cc: Julien Thierry 
Cc: Suzuki K Poulose 
Signed-off-by: Zenghui Yu 
---

If this is correct, there should be a great performance improvement on
a guest reboot (or reset) on systems support S2FWB. But I'm afraid that
I've missed some points here, so please comment!

The commit 957db105c997 ("arm/arm64: KVM: Introduce stage2_unmap_vm")
was merged about six years ago and I failed to track its history and
intention. Instead of a whole stage2 unmapping, something like
stage2_flush_vm() looks enough to me. But again, I'm unsure...

Thanks for having a look!

I had a chat with Christoffer about stage2_unmap_vm, and as I understood it, the
purpose was to make sure that any changes made by userspace were seen by the 
guest
while the MMU is off. When a stage 2 fault happens, we do clean+inval on the
dcache, or inval on the icache if it was an exec fault. This means that whatever
the host userspace writes while the guest is shut down and is still in the 
cache,
the guest will be able to read/execute.

This can be relevant if the guest relocates the kernel and overwrites the 
original
image location, and userspace copies the original kernel image back in before
restarting the vm.


Yes, I-cache coherency is what I had missed! So without a S2 unmapping
on reboot, if there's any stale and "valid" cache line in the I-cache,
guest may fetch the wrong instructions directly from it, and bad things
will happen... (We will otherwise get a translation fault and a
permission fault and invalidate the I-cache as needed.)




  virt/kvm/arm/arm.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 48d0ec44ad77..e6378162cdef 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -983,8 +983,11 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu 
*vcpu,
/*
 * Ensure a rebooted VM will fault in RAM pages and detect if the
 * guest MMU is turned off and flush the caches as needed.
+*
+* S2FWB enforces all memory accesses to RAM being cacheable, we
+* ensure that the cache is always coherent.
 */
-   if (vcpu->arch.has_run_once)
+   if (vcpu->arch.has_run_once && 
!cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))

I think userspace does not invalidate the icache when loading a new kernel 
image,
and if the guest patched instructions, they could potentially still be in the
icache. Should the icache be invalidated if FWB is present?


I noticed that this was included in the current pull request and I remembered 
that
I wasn't sure about this part. Did some more digging and it turns out that FWB
implies no cache maintenance needed for *data to instruction* coherence. From 
ARM
DDI 0487F.b, page D5-2635:

"When ARMv8.4-S2FWB is implemented, the architecture requires that
CLIDR_EL1.{LOUU, LOIUS} are zero so that no levels of data cache need to be
cleaned in order to manage coherency with instruction fetches".

However, there's no mention that I found for instruction to data coherence,
meaning that the icache would still need to be invalidated on each vcpu in order
to prevent fetching of patched instructions from the icache. Am I missing 
something?


Thanks for the head up and Marc's fix!


Thanks both,
Zenghui


Re: [PATCH] irqchip/gic-v3-its: Don't try to move a disabled irq

2020-05-28 Thread Zenghui Yu

Hi,

On 2020/5/29 9:55, Ali Saidi wrote:

If an interrupt is disabled the ITS driver has sent a discard removing
the DeviceID and EventID from the ITT. After this occurs it can't be
moved to another collection with a MOVI and a command error occurs if
attempted. Before issuing the MOVI command make sure that the IRQ isn't
disabled and change the activate code to try and use the previous
affinity.

Signed-off-by: Ali Saidi 
---
  drivers/irqchip/irq-gic-v3-its.c | 18 +++---
  1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 124251b0ccba..1235dd9a2fb2 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1540,7 +1540,11 @@ static int its_set_affinity(struct irq_data *d, const 
struct cpumask *mask_val,
/* don't set the affinity when the target cpu is same as current one */
if (cpu != its_dev->event_map.col_map[id]) {
target_col = &its_dev->its->collections[cpu];
-   its_send_movi(its_dev, target_col, id);
+
+   /* If the IRQ is disabled a discard was sent so don't move */
+   if (!irqd_irq_disabled(d))
+   its_send_movi(its_dev, target_col, id);


It looks to me that if the IRQ is disabled, we mask the enable bit in
the corresponding LPI configuration table entry, but not sending DISCARD
to remove the DevID/EventID mapping. And moving a disabled LPI is
actually allowed by the GIC architecture, right?


+
its_dev->event_map.col_map[id] = cpu;
irq_data_update_effective_affinity(d, cpumask_of(cpu));
}
@@ -3439,8 +3443,16 @@ static int its_irq_domain_activate(struct irq_domain 
*domain,
if (its_dev->its->numa_node >= 0)
cpu_mask = cpumask_of_node(its_dev->its->numa_node);
  
-	/* Bind the LPI to the first possible CPU */

-   cpu = cpumask_first_and(cpu_mask, cpu_online_mask);
+   /* If the cpu set to a different CPU that is still online use it */
+   cpu = its_dev->event_map.col_map[event];
+
+   cpumask_and(cpu_mask, cpu_mask, cpu_online_mask);
+
+   if (!cpumask_test_cpu(cpu, cpu_mask)) {
+   /* Bind the LPI to the first possible CPU */
+   cpu = cpumask_first(cpu_mask);
+   }


I'd like to know what actual problem you had seen and the way to
reproduce it :-)


Thanks,
Zenghui


[PATCH] drm/msm/dpu: Fix usage of ERR_PTR()

2020-05-28 Thread Zenghui Yu
ERR_PTR() is used in the kernel to encode a usual *negative* errno code
into a pointer.  Passing a positive value (ENOMEM) to it will break the
following IS_ERR() check.

Though memory allocation is unlikely to fail, it's still worth fixing.
And grepping shows that this is the only misuse of ERR_PTR() in the kernel.

Fixes: 25fdd5933e4c ("drm/msm: Add SDM845 DPU support")
Signed-off-by: Zenghui Yu 
---
 drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c 
b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c
index a1b79ee2bd9d..a2f6b688a976 100644
--- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c
+++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c
@@ -2173,7 +2173,7 @@ struct drm_encoder *dpu_encoder_init(struct drm_device 
*dev,
 
dpu_enc = devm_kzalloc(dev->dev, sizeof(*dpu_enc), GFP_KERNEL);
if (!dpu_enc)
-   return ERR_PTR(ENOMEM);
+   return ERR_PTR(-ENOMEM);
 
	rc = drm_encoder_init(dev, &dpu_enc->base, &dpu_encoder_funcs,
drm_enc_mode, NULL);
-- 
2.19.1
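[To see why the sign matters: IS_ERR() only recognizes pointer values in
the last page of the address space, i.e. those produced by casting a
*negative* errno. Below is a simplified userspace copy of the two helpers;
the constants are local stand-ins, not the kernel headers.]

```c
#include <assert.h>

#define MAX_ERRNO	4095
#define ENOMEM_VAL	12	/* numeric value of ENOMEM */

/* Simplified copies of the kernel's ERR_PTR()/IS_ERR() helpers. */
static inline void *err_ptr(long error)
{
	return (void *)error;
}

static inline int is_err(const void *ptr)
{
	/* Only the top MAX_ERRNO values of the address space are errors. */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

Encoding the positive ENOMEM yields the ordinary-looking pointer 0xc,
which sails straight past the IS_ERR() check.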



Re: [PATCH] ACPI/IORT: Remove the unused __get_pci_rid()

2020-05-24 Thread Zenghui Yu

On 2020/5/9 17:56, Hanjun Guo wrote:

On 2020/5/9 17:34, Zenghui Yu wrote:

Since commit bc8648d49a95 ("ACPI/IORT: Handle PCI aliases properly for
IOMMUs"), __get_pci_rid() has become actually unused and can be removed.

Signed-off-by: Zenghui Yu 


Looks good to me,

Acked-by: Hanjun Guo 


Hi Will,

Could you please take this patch [*] into v5.8 (if it's not too late)?
I've tried it and it applies cleanly to for-next/acpi.

[*] 
https://lore.kernel.org/linux-acpi/20200509093430.1983-1-yuzeng...@huawei.com/



Thanks,
Zenghui


[PATCH] ACPI/IORT: Remove the unused __get_pci_rid()

2020-05-09 Thread Zenghui Yu
Since commit bc8648d49a95 ("ACPI/IORT: Handle PCI aliases properly for
IOMMUs"), __get_pci_rid() has become actually unused and can be removed.

Signed-off-by: Zenghui Yu 
---
 drivers/acpi/arm64/iort.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 7d04424189df..ec94dbb60c7a 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -789,15 +789,6 @@ void acpi_configure_pmsi_domain(struct device *dev)
dev_set_msi_domain(dev, msi_domain);
 }
 
-static int __maybe_unused __get_pci_rid(struct pci_dev *pdev, u16 alias,
-   void *data)
-{
-   u32 *rid = data;
-
-   *rid = alias;
-   return 0;
-}
-
 #ifdef CONFIG_IOMMU_API
 static struct acpi_iort_node *iort_get_msi_resv_iommu(struct device *dev)
 {
-- 
2.19.1



[PATCH resend 2/2] KVM: arm64: Unify handling THP backed host memory

2020-05-07 Thread Zenghui Yu
From: Suzuki K Poulose 

We support mapping host memory backed by PMD transparent hugepages
at stage2 as huge pages. However the checks are now spread across
two different places. Let us unify the handling of the THPs to
keep the code cleaner (and future-proof for PUD THP support).
This patch moves transparent_hugepage_adjust() closer to the caller
to avoid a forward declaration for fault_supports_stage2_huge_mappings().

Also, since we already handle the case where the host VA and the guest
PA may not be aligned, the explicit VM_BUG_ON() is not required.

Cc: Marc Zyngier 
Cc: Christoffer Dall 
Signed-off-by: Suzuki K Poulose 
Signed-off-by: Zenghui Yu 
---
 virt/kvm/arm/mmu.c | 115 +++--
 1 file changed, 60 insertions(+), 55 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 557f36866d1c..93a770fd2b5e 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1372,47 +1372,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t 
guest_ipa,
return ret;
 }
 
-static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
-{
-   kvm_pfn_t pfn = *pfnp;
-   gfn_t gfn = *ipap >> PAGE_SHIFT;
-
-   if (kvm_is_transparent_hugepage(pfn)) {
-   unsigned long mask;
-   /*
-* The address we faulted on is backed by a transparent huge
-* page.  However, because we map the compound huge page and
-* not the individual tail page, we need to transfer the
-* refcount to the head page.  We have to be careful that the
-* THP doesn't start to split while we are adjusting the
-* refcounts.
-*
-* We are sure this doesn't happen, because mmu_notifier_retry
-* was successful and we are holding the mmu_lock, so if this
-* THP is trying to split, it will be blocked in the mmu
-* notifier before touching any of the pages, specifically
-* before being able to call __split_huge_page_refcount().
-*
-* We can therefore safely transfer the refcount from PG_tail
-* to PG_head and switch the pfn from a tail page to the head
-* page accordingly.
-*/
-   mask = PTRS_PER_PMD - 1;
-   VM_BUG_ON((gfn & mask) != (pfn & mask));
-   if (pfn & mask) {
-   *ipap &= PMD_MASK;
-   kvm_release_pfn_clean(pfn);
-   pfn &= ~mask;
-   kvm_get_pfn(pfn);
-   *pfnp = pfn;
-   }
-
-   return true;
-   }
-
-   return false;
-}
-
 /**
  * stage2_wp_ptes - write protect PMD range
  * @pmd:   pointer to pmd entry
@@ -1660,6 +1619,59 @@ static bool fault_supports_stage2_huge_mapping(struct 
kvm_memory_slot *memslot,
   (hva & ~(map_size - 1)) + map_size <= uaddr_end;
 }
 
+/*
+ * Check if the given hva is backed by a transparent huge page (THP) and
+ * whether it can be mapped using block mapping in stage2. If so, adjust
+ * the stage2 PFN and IPA accordingly. Only PMD_SIZE THPs are currently
+ * supported. This will need to be updated to support other THP sizes.
+ *
+ * Returns the size of the mapping.
+ */
+static unsigned long
+transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
+   unsigned long hva, kvm_pfn_t *pfnp,
+   phys_addr_t *ipap)
+{
+   kvm_pfn_t pfn = *pfnp;
+
+   /*
+* Make sure the adjustment is done only for THP pages. Also make
+* sure that the HVA and IPA are sufficiently aligned and that the
+* block map is contained within the memslot.
+*/
+   if (kvm_is_transparent_hugepage(pfn) &&
+   fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
+   /*
+* The address we faulted on is backed by a transparent huge
+* page.  However, because we map the compound huge page and
+* not the individual tail page, we need to transfer the
+* refcount to the head page.  We have to be careful that the
+* THP doesn't start to split while we are adjusting the
+* refcounts.
+*
+* We are sure this doesn't happen, because mmu_notifier_retry
+* was successful and we are holding the mmu_lock, so if this
+* THP is trying to split, it will be blocked in the mmu
+* notifier before touching any of the pages, specifically
+* before being able to call __split_huge_page_refcount().
+*
+* We can therefore safely transfer the refcount from PG_tail
+* to PG_head and switch the pfn from a tail page to the

[PATCH resend 1/2] KVM: arm64: Clean up the checking for huge mapping

2020-05-07 Thread Zenghui Yu
From: Suzuki K Poulose 

If we are checking whether the stage2 can map PAGE_SIZE,
we don't have to do the boundary checks, as both the host
VMA and the guest memslots are page aligned. Bail out
early in that case.

While we're at it, fixup a typo in the comment below.

Cc: Christoffer Dall 
Cc: Marc Zyngier 
Signed-off-by: Suzuki K Poulose 
Signed-off-by: Zenghui Yu 
---
 virt/kvm/arm/mmu.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index e3b9ee268823..557f36866d1c 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1607,6 +1607,10 @@ static bool fault_supports_stage2_huge_mapping(struct 
kvm_memory_slot *memslot,
hva_t uaddr_start, uaddr_end;
size_t size;
 
+   /* The memslot and the VMA are guaranteed to be aligned to PAGE_SIZE */
+   if (map_size == PAGE_SIZE)
+   return true;
+
size = memslot->npages * PAGE_SIZE;
 
gpa_start = memslot->base_gfn << PAGE_SHIFT;
@@ -1626,7 +1630,7 @@ static bool fault_supports_stage2_huge_mapping(struct 
kvm_memory_slot *memslot,
 *|abcde|fgh  Stage-1 block  |Stage-1 block tv|xyz|
 *+-+++---+
 *
-*memslot->base_gfn << PAGE_SIZE:
+*memslot->base_gfn << PAGE_SHIFT:
 *  +---+++-+
 *  |abc|def  Stage-2 block  |Stage-2 block   |tvxyz|
 *  +---+++-+
-- 
2.19.1
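The alignment reasoning in the patch above can be sketched in plain C. This is a simplified userspace model, not the kernel function itself: it keeps only the early PAGE_SIZE bail-out and the HVA/IPA congruence check, and omits the memslot boundary containment that the real fault_supports_stage2_huge_mapping() also performs.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/*
 * Simplified model: a stage2 block of map_size can back the fault only
 * if the guest IPA and the host VA are congruent modulo map_size.
 * PAGE_SIZE mappings always qualify because memslots and VMAs are page
 * aligned -- this is the early return the patch adds.
 */
static bool supports_huge_mapping(uint64_t gpa_start, uint64_t uaddr_start,
				  uint64_t map_size)
{
	if (map_size == PAGE_SIZE)
		return true;

	return (gpa_start & (map_size - 1)) == (uaddr_start & (map_size - 1));
}
```

With a 2MiB block size, IPA 0x200000 against HVA 0x400000 passes (both block aligned), while IPA 0x200000 against HVA 0x401000 fails (the offsets within the block differ).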




[PATCH resend 0/2] KVM: arm64: Unify stage2 mapping for THP backed memory

2020-05-07 Thread Zenghui Yu
This series was originally posted by Suzuki K Poulose a year ago [*],
with the aim of cleaning up the handling of the stage2 huge mapping for
THP. I think it still helps to make the current code cleaner, so I have
rebased it on top of kvmarm/master and am reposting it for acceptance.

Thanks!

[*] 
https://lore.kernel.org/kvm/1554909832-7169-1-git-send-email-suzuki.poul...@arm.com/

Suzuki K Poulose (2):
  KVM: arm64: Clean up the checking for huge mapping
  KVM: arm64: Unify handling THP backed host memory

 virt/kvm/arm/mmu.c | 121 -
 1 file changed, 65 insertions(+), 56 deletions(-)

-- 
2.19.1




[tip: irq/urgent] irqchip/gic-v3: Fix GIC_LINE_NR accessor

2019-10-14 Thread tip-bot2 for Zenghui Yu
The following commit has been merged into the irq/urgent branch of tip:

Commit-ID: c107d613f9204ff9c7624c229938153d7492c56e
Gitweb:
https://git.kernel.org/tip/c107d613f9204ff9c7624c229938153d7492c56e
Author:Zenghui Yu 
AuthorDate:Wed, 18 Sep 2019 06:57:30 
Committer: Marc Zyngier 
CommitterDate: Wed, 18 Sep 2019 11:42:23 +01:00

irqchip/gic-v3: Fix GIC_LINE_NR accessor

As per GIC spec, ITLinesNumber indicates the maximum SPI INTID that
the GIC implementation supports. And the maximum SPI INTID an
implementation might support is 1019 (field value 0b11111).

max(GICD_TYPER_SPIS(...), 1020) is not what we actually want for
GIC_LINE_NR. Fix it to min(GICD_TYPER_SPIS(...), 1020).

Signed-off-by: Zenghui Yu 
Signed-off-by: Marc Zyngier 
Link: 
https://lore.kernel.org/r/1568789850-14080-1-git-send-email-yuzeng...@huawei.com
---
 drivers/irqchip/irq-gic-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 422664a..1edc993 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -59,7 +59,7 @@ static struct gic_chip_data gic_data __read_mostly;
 static DEFINE_STATIC_KEY_TRUE(supports_deactivate_key);
 
 #define GIC_ID_NR  (1U << GICD_TYPER_ID_BITS(gic_data.rdists.gicd_typer))
-#define GIC_LINE_NR	max(GICD_TYPER_SPIS(gic_data.rdists.gicd_typer), 1020U)
+#define GIC_LINE_NR	min(GICD_TYPER_SPIS(gic_data.rdists.gicd_typer), 1020U)
 #define GIC_ESPI_NR	GICD_TYPER_ESPIS(gic_data.rdists.gicd_typer)
 
 /*


[PATCH] irqchip/gic-v3: Fix GIC_LINE_NR accessor

2019-09-18 Thread Zenghui Yu
As per GIC spec, ITLinesNumber indicates the maximum SPI INTID that
the GIC implementation supports. And the maximum SPI INTID an
implementation might support is 1019 (field value 0b11111).

max(GICD_TYPER_SPIS(...), 1020) is not what we actually want for
GIC_LINE_NR. Fix it to min(GICD_TYPER_SPIS(...), 1020).

Signed-off-by: Zenghui Yu 
---

Hi Marc,

I still see "GICv3: 992 SPIs implemented" on the host. I went back to
https://patchwork.kernel.org/patch/11078623/ and it seems that we
failed to get GIC_LINE_NR right at that time.
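For what it's worth, the arithmetic behind the fix can be modeled in a few lines of C. GICD_TYPER_SPIS below mirrors the kernel's 32 * (ITLinesNumber + 1) definition; treat the exact macro shape as an assumption of this sketch.

```c
/*
 * ITLinesNumber is a 5-bit field; the implemented interrupt line count
 * is 32 * (N + 1), which reaches 1024 for N = 0b11111 even though SPI
 * INTIDs architecturally stop at 1019.  Clamping therefore needs
 * min(..., 1020), not max(): max() can never return a value below 1020.
 */
#define GICD_TYPER_SPIS(typer)	((((typer) & 0x1f) + 1) * 32)
#define MIN(a, b)		((a) < (b) ? (a) : (b))

static unsigned int gic_line_nr(unsigned int typer)
{
	return MIN(GICD_TYPER_SPIS(typer), 1020U);
}
```

For the maximum field value 0b11111 the raw count is 1024, clamped here to 1020; a field value of 30 yields 992 lines, which min() leaves untouched.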

 drivers/irqchip/irq-gic-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 7b0c96b9e02f..f4a49aef5ca4 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -59,7 +59,7 @@ static struct gic_chip_data gic_data __read_mostly;
 static DEFINE_STATIC_KEY_TRUE(supports_deactivate_key);
 
 #define GIC_ID_NR  (1U << GICD_TYPER_ID_BITS(gic_data.rdists.gicd_typer))
-#define GIC_LINE_NR	max(GICD_TYPER_SPIS(gic_data.rdists.gicd_typer), 1020U)
+#define GIC_LINE_NR	min(GICD_TYPER_SPIS(gic_data.rdists.gicd_typer), 1020U)
 #define GIC_ESPI_NR	GICD_TYPER_ESPIS(gic_data.rdists.gicd_typer)
 
 /*
-- 
2.19.1




[tip: irq/core] irqchip/gic-v3-its: Remove the redundant set_bit for lpi_map

2019-09-06 Thread tip-bot2 for Zenghui Yu
The following commit has been merged into the irq/core branch of tip:

Commit-ID: 342be1068d9b5b1fd364d270b4f731764e23de2b
Gitweb:
https://git.kernel.org/tip/342be1068d9b5b1fd364d270b4f731764e23de2b
Author:Zenghui Yu 
AuthorDate:Sat, 27 Jul 2019 06:14:22 
Committer: Marc Zyngier 
CommitterDate: Tue, 20 Aug 2019 10:34:34 +01:00

irqchip/gic-v3-its: Remove the redundant set_bit for lpi_map

We try to find a free LPI region in device's lpi_map and allocate them
(set them to 1) when we want to allocate LPIs for this device. This is
what bitmap_find_free_region() has done for us. The following set_bit
is redundant and a bit confusing (since we only set_bit against the first
allocated LPI idx). Remove it, and make the set_bit explicit by comment.

Signed-off-by: Zenghui Yu 
Signed-off-by: Marc Zyngier 
---
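To illustrate why the removed set_bit was redundant, here is a toy userspace analog of bitmap_find_free_region(). It is a simplification: it scans bit by bit and ignores the power-of-two sizing and alignment that the real kernel helper enforces via its order argument.

```c
#include <limits.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Find 'count' consecutive clear bits in 'map' (covering 'nbits' bits)
 * and set them all before returning the first index -- the allocation
 * happens inside the find, so a follow-up set_bit() on the first index
 * would change nothing.
 */
static int find_free_region(unsigned long *map, int nbits, int count)
{
	for (int i = 0; i + count <= nbits; i++) {
		int j;

		for (j = 0; j < count; j++)
			if (map[(i + j) / BITS_PER_LONG] &
			    (1UL << ((i + j) % BITS_PER_LONG)))
				break;
		if (j == count) {
			for (j = 0; j < count; j++)
				map[(i + j) / BITS_PER_LONG] |=
					1UL << ((i + j) % BITS_PER_LONG);
			return i;	/* bits i..i+count-1 are now set */
		}
	}
	return -1;
}
```

Two successive 4-bit allocations from an empty 64-bit map return indices 0 and 4, with the corresponding bits already set on return.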
 drivers/irqchip/irq-gic-v3-its.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 8eeb0e2..9380aa4 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2464,6 +2464,7 @@ static int its_alloc_device_irq(struct its_device *dev, 
int nvecs, irq_hw_number
 {
int idx;
 
+   /* Find a free LPI region in lpi_map and allocate them. */
idx = bitmap_find_free_region(dev->event_map.lpi_map,
  dev->event_map.nr_lpis,
  get_count_order(nvecs));
@@ -2471,7 +2472,6 @@ static int its_alloc_device_irq(struct its_device *dev, 
int nvecs, irq_hw_number
return -ENOSPC;
 
*hwirq = dev->event_map.lpi_base + idx;
-   set_bit(idx, dev->event_map.lpi_map);
 
return 0;
 }


Re: [PATCH v4 05/10] KVM: arm64: Support stolen time reporting via shared structure

2019-09-03 Thread Zenghui Yu

On 2019/8/30 16:42, Steven Price wrote:

Implement the service call for configuring a shared structure between a
VCPU and the hypervisor in which the hypervisor can write the time
stolen from the VCPU's execution time by other tasks on the host.

The hypervisor allocates memory which is placed at an IPA chosen by user
space.


It seems that no allocation happens in the hypervisor code.  User space
will do it instead?


The hypervisor then updates the shared structure using
kvm_put_guest() to ensure single copy atomicity of the 64-bit value
reporting the stolen time in nanoseconds.

Whenever stolen time is enabled by the guest, the stolen time counter is
reset.

The stolen time itself is retrieved from the sched_info structure
maintained by the Linux scheduler code. We enable SCHEDSTATS when
selecting KVM Kconfig to ensure this value is meaningful.

Signed-off-by: Steven Price 


Thanks,
zenghui



Re: [PATCH v3 05/10] KVM: arm64: Support stolen time reporting via shared structure

2019-08-23 Thread Zenghui Yu

Hi Steven,

Only one comment, at the bottom.

On 2019/8/21 23:36, Steven Price wrote:

Implement the service call for configuring a shared structure between a
VCPU and the hypervisor in which the hypervisor can write the time
stolen from the VCPU's execution time by other tasks on the host.

The hypervisor allocates memory which is placed at an IPA chosen by user
space. The hypervisor then updates the shared structure using
kvm_put_guest() to ensure single copy atomicity of the 64-bit value
reporting the stolen time in nanoseconds.

Whenever stolen time is enabled by the guest, the stolen time counter is
reset.

The stolen time itself is retrieved from the sched_info structure
maintained by the Linux scheduler code. We enable SCHEDSTATS when
selecting KVM Kconfig to ensure this value is meaningful.

Signed-off-by: Steven Price 
---
  arch/arm/include/asm/kvm_host.h   | 20 +
  arch/arm64/include/asm/kvm_host.h | 25 +++-
  arch/arm64/kvm/Kconfig|  1 +
  include/linux/kvm_types.h |  2 +
  virt/kvm/arm/arm.c| 10 +
  virt/kvm/arm/hypercalls.c |  3 ++
  virt/kvm/arm/pvtime.c | 67 +++
  7 files changed, 127 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 369b5d2d54bf..47d2ced99421 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -39,6 +39,7 @@
KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
  #define KVM_REQ_IRQ_PENDING   KVM_ARCH_REQ(1)
  #define KVM_REQ_VCPU_RESETKVM_ARCH_REQ(2)
+#define KVM_REQ_RECORD_STEAL   KVM_ARCH_REQ(3)
  
  DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
  
@@ -329,6 +330,25 @@ static inline int kvm_hypercall_pv_features(struct kvm_vcpu *vcpu)

return SMCCC_RET_NOT_SUPPORTED;
  }
  
+static inline int kvm_hypercall_stolen_time(struct kvm_vcpu *vcpu)

+{
+   return SMCCC_RET_NOT_SUPPORTED;
+}
+
+static inline int kvm_update_stolen_time(struct kvm_vcpu *vcpu, bool init)
+{
+   return -ENOTSUPP;
+}
+
+static inline void kvm_pvtime_init_vm(struct kvm_arch *kvm_arch)
+{
+}
+
+static inline bool kvm_is_pvtime_enabled(struct kvm_arch *kvm_arch)
+{
+   return false;
+}
+
  void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
  
  struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 583b3639062a..b6fa7beffd8a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -44,6 +44,7 @@
KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
  #define KVM_REQ_IRQ_PENDING   KVM_ARCH_REQ(1)
  #define KVM_REQ_VCPU_RESETKVM_ARCH_REQ(2)
+#define KVM_REQ_RECORD_STEAL   KVM_ARCH_REQ(3)
  
  DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
  
@@ -83,6 +84,11 @@ struct kvm_arch {
  
  	/* Mandated version of PSCI */

u32 psci_version;
+
+   struct kvm_arch_pvtime {
+   gpa_t st_base;
+   u64 st_size;
+   } pvtime;
  };
  
  #define KVM_NR_MEM_OBJS 40

@@ -338,8 +344,13 @@ struct kvm_vcpu_arch {
/* True when deferrable sysregs are loaded on the physical CPU,
 * see kvm_vcpu_load_sysregs and kvm_vcpu_put_sysregs. */
bool sysregs_loaded_on_cpu;
-};
 
+	/* Guest PV state */
+	struct {
+		u64 steal;
+		u64 last_steal;
+	} steal;
+};
  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
  #define vcpu_sve_pffr(vcpu) ((void *)((char *)((vcpu)->arch.sve_state) + \
  sve_ffr_offset((vcpu)->arch.sve_max_vl)))
@@ -479,6 +490,18 @@ int kvm_perf_init(void);
  int kvm_perf_teardown(void);
  
  int kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);

+int kvm_hypercall_stolen_time(struct kvm_vcpu *vcpu);
+int kvm_update_stolen_time(struct kvm_vcpu *vcpu, bool init);
+
+static inline void kvm_pvtime_init_vm(struct kvm_arch *kvm_arch)
+{
+   kvm_arch->pvtime.st_base = GPA_INVALID;
+}
+
+static inline bool kvm_is_pvtime_enabled(struct kvm_arch *kvm_arch)
+{
+   return (kvm_arch->pvtime.st_base != GPA_INVALID);
+}
  
  void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);
  
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig

index a67121d419a2..d8b88e40d223 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -39,6 +39,7 @@ config KVM
select IRQ_BYPASS_MANAGER
select HAVE_KVM_IRQ_BYPASS
select HAVE_KVM_VCPU_RUN_PID_CHANGE
+   select SCHEDSTATS
---help---
  Support hosting virtualized guest machines.
  We don't support KVM with 16K page tables yet, due to the multiple
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index bde5374ae021..1c88e69db3d9 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -35,6 +35,8 @@ typedef 

Re: [PATCH v3 10/10] arm64: Retrieve stolen time as paravirtualized guest

2019-08-23 Thread Zenghui Yu

Hi Steven,

On 2019/8/21 23:36, Steven Price wrote:

Enable paravirtualization features when running under a hypervisor
supporting the PV_TIME_ST hypercall.

For each (v)CPU, we ask the hypervisor for the location of a shared
page which the hypervisor will use to report stolen time to us. We set
pv_time_ops to the stolen time function which simply reads the stolen
value from the shared page for a VCPU. We guarantee single-copy
atomicity using READ_ONCE which means we can also read the stolen
time for another VCPU than the currently running one while it is
potentially being updated by the hypervisor.

Signed-off-by: Steven Price 
---
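As a userspace illustration of the single-copy-atomicity point made above: the struct layout and field names here are assumptions of the sketch (not necessarily the ABI as merged), and the le64_to_cpu() conversion is elided.

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's READ_ONCE() on a naturally
 * aligned u64: the volatile access forces the compiler to emit one
 * full-width load, which arm64 guarantees to be single-copy atomic,
 * so a concurrent writer is never observed mid-update.
 */
#define READ_ONCE_U64(x)	(*(const volatile uint64_t *)&(x))

struct stolen_time_sketch {		/* hypothetical layout */
	uint32_t revision;
	uint32_t attributes;
	uint64_t stolen_time;		/* nanoseconds; little-endian in the ABI */
};

static uint64_t read_stolen(const struct stolen_time_sketch *st)
{
	return READ_ONCE_U64(st->stolen_time);
}
```

Because the read is a single load, reading another VCPU's stolen time while the hypervisor updates it yields either the old or the new value, never a torn mix.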
  arch/arm64/include/asm/paravirt.h |   9 +-
  arch/arm64/kernel/paravirt.c  | 148 ++
  arch/arm64/kernel/time.c  |   3 +
  include/linux/cpuhotplug.h|   1 +
  4 files changed, 160 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/paravirt.h 
b/arch/arm64/include/asm/paravirt.h
index 799d9dd6f7cc..125c26c42902 100644
--- a/arch/arm64/include/asm/paravirt.h
+++ b/arch/arm64/include/asm/paravirt.h
@@ -21,6 +21,13 @@ static inline u64 paravirt_steal_clock(int cpu)
  {
return pv_ops.time.steal_clock(cpu);
  }
-#endif
+
+int __init kvm_guest_init(void);
+
+#else
+
+#define kvm_guest_init()
+
+#endif // CONFIG_PARAVIRT
  
  #endif

diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index 4cfed91fe256..ea8dbbbd3293 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -6,13 +6,161 @@
   * Author: Stefano Stabellini 
   */
  
+#define pr_fmt(fmt) "kvmarm-pv: " fmt

+
+#include 
+#include 
  #include 
+#include 
  #include 
+#include 
+#include 
+#include 
+#include 
  #include 
+
  #include 
+#include 
+#include 
  
  struct static_key paravirt_steal_enabled;

  struct static_key paravirt_steal_rq_enabled;
  
  struct paravirt_patch_template pv_ops;

  EXPORT_SYMBOL_GPL(pv_ops);
+
+struct kvmarm_stolen_time_region {
+   struct pvclock_vcpu_stolen_time *kaddr;
+};
+
+static DEFINE_PER_CPU(struct kvmarm_stolen_time_region, stolen_time_region);
+
+static bool steal_acc = true;
+static int __init parse_no_stealacc(char *arg)
+{
+   steal_acc = false;
+   return 0;
+}
+
+early_param("no-steal-acc", parse_no_stealacc);
+
+/* return stolen time in ns by asking the hypervisor */
+static u64 kvm_steal_clock(int cpu)
+{
+   struct kvmarm_stolen_time_region *reg;
+
+   reg = per_cpu_ptr(&stolen_time_region, cpu);
+   if (!reg->kaddr) {
+   pr_warn_once("stolen time enabled but not configured for cpu 
%d\n",
+cpu);
+   return 0;
+   }
+
+   return le64_to_cpu(READ_ONCE(reg->kaddr->stolen_time));
+}
+
+static int disable_stolen_time_current_cpu(void)
+{
+   struct kvmarm_stolen_time_region *reg;
+
+   reg = this_cpu_ptr(&stolen_time_region);
+   if (!reg->kaddr)
+   return 0;
+
+   memunmap(reg->kaddr);
+   memset(reg, 0, sizeof(*reg));
+
+   return 0;
+}
+
+static int stolen_time_dying_cpu(unsigned int cpu)
+{
+   return disable_stolen_time_current_cpu();
+}
+
+static int init_stolen_time_cpu(unsigned int cpu)
+{
+   struct kvmarm_stolen_time_region *reg;
+   struct arm_smccc_res res;
+
+   reg = this_cpu_ptr(&stolen_time_region);
+
+   arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_TIME_ST, &res);
+
+   if ((long)res.a0 < 0)
+   return -EINVAL;
+
+   reg->kaddr = memremap(res.a0,
+ sizeof(struct pvclock_vcpu_stolen_time),
+ MEMREMAP_WB);


cpuhp callbacks can be invoked in atomic context (see:
secondary_start_kernel ->
notify_cpu_starting ->
invoke callbacks),
but memremap might sleep...

If we run a DEBUG_ATOMIC_SLEEP-enabled PV guest, I guess we will be
greeted by the Sleep-in-Atomic-Context BUG.  Do we need an alternative
here?


+
+   if (!reg->kaddr) {
+   pr_warn("Failed to map stolen time data structure\n");
+   return -ENOMEM;
+   }
+
+   if (le32_to_cpu(reg->kaddr->revision) != 0 ||
+   le32_to_cpu(reg->kaddr->attributes) != 0) {
+   pr_warn("Unexpected revision or attributes in stolen time 
data\n");
+   return -ENXIO;
+   }
+
+   return 0;
+}
+
+static int kvm_arm_init_stolen_time(void)
+{
+   int ret;
+
+   ret = cpuhp_setup_state(CPUHP_AP_ARM_KVMPV_STARTING,
+   "hypervisor/kvmarm/pv:starting",
+   init_stolen_time_cpu, stolen_time_dying_cpu);
+   if (ret < 0)
+   return ret;
+   return 0;
+}
+
+static bool has_kvm_steal_clock(void)
+{
+   struct arm_smccc_res res;
+
+   /* To detect the presence of PV time support we require SMCCC 1.1+ */
+   if (psci_ops.smccc_version < SMCCC_VERSION_1_1)
+   return false;
+
+   arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
+ 

Re: [PATCH v2 05/12] irqchip/gic: Prepare for more than 16 PPIs

2019-08-21 Thread Zenghui Yu

On 2019/8/6 18:01, Marc Zyngier wrote:

GICv3.1 allows up to 80 PPIs (16 legaci PPIs and 64 Extended PPIs),

  ^^
legacy?


Zenghui


meaning we can't just leave the old 16 hardcoded everywhere.

We also need to add the infrastructure to discover the number of PPIs
on a per redistributor basis, although we still pretend there is only
16 of them for now.

No functional change.

Signed-off-by: Marc Zyngier 




Re: [PATCH v2 01/12] irqchip/gic: Rework gic_configure_irq to take the full ICFGR base

2019-08-19 Thread Zenghui Yu

Hi Marc,

On 2019/8/6 18:01, Marc Zyngier wrote:

gic_configure_irq is currently passed the (re)distributor address,
to which it applies an a fixed offset to get to the configuration
registers. This offset is constant across all GICs, or rather it was
until to v3.1...

An easy way out is for the individual drivers to pass the base
address of the configuration register for the considered interrupt.
At the same time, move part of the error handling back to the
individual drivers, as things are about to change on that front.

Signed-off-by: Marc Zyngier 
---
  drivers/irqchip/irq-gic-common.c | 14 +-
  drivers/irqchip/irq-gic-v3.c | 11 ++-
  drivers/irqchip/irq-gic.c| 10 +-
  drivers/irqchip/irq-hip04.c  |  7 ++-
  4 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/irqchip/irq-gic-common.c b/drivers/irqchip/irq-gic-common.c
index b0a8215a13fc..6900b6f0921c 100644
--- a/drivers/irqchip/irq-gic-common.c
+++ b/drivers/irqchip/irq-gic-common.c
@@ -63,7 +63,7 @@ int gic_configure_irq(unsigned int irq, unsigned int type,
 * for "irq", depending on "type".
 */
	raw_spin_lock_irqsave(&irq_controller_lock, flags);
-   val = oldval = readl_relaxed(base + GIC_DIST_CONFIG + confoff);
+   val = oldval = readl_relaxed(base + confoff);
if (type & IRQ_TYPE_LEVEL_MASK)
val &= ~confmask;
else if (type & IRQ_TYPE_EDGE_BOTH)
@@ -83,14 +83,10 @@ int gic_configure_irq(unsigned int irq, unsigned int type,
 * does not allow us to set the configuration or we are in a
 * non-secure mode, and hence it may not be catastrophic.
 */
-   writel_relaxed(val, base + GIC_DIST_CONFIG + confoff);
-   if (readl_relaxed(base + GIC_DIST_CONFIG + confoff) != val) {
-   if (WARN_ON(irq >= 32))
-   ret = -EINVAL;


Since this WARN_ON is dropped, the comment above should also be updated.
But what is the reason for deleting it?  (It may give us some hints
when we fail to set the type for SPIs.)


Thanks,
zenghui


-   else
-   pr_warn("GIC: PPI%d is secure or misconfigured\n",
-   irq - 16);
-   }
+   writel_relaxed(val, base + confoff);
+   if (readl_relaxed(base + confoff) != val)
+   ret = -EINVAL;
+
	raw_spin_unlock_irqrestore(&irq_controller_lock, flags);
  
  	if (sync_access)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 96d927f0f91a..b250e69908f8 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -407,6 +407,7 @@ static int gic_set_type(struct irq_data *d, unsigned int 
type)
unsigned int irq = gic_irq(d);
void (*rwp_wait)(void);
void __iomem *base;
+   int ret;
  
  	/* Interrupt configuration for SGIs can't be changed */

if (irq < 16)
@@ -425,7 +426,15 @@ static int gic_set_type(struct irq_data *d, unsigned int 
type)
rwp_wait = gic_dist_wait_for_rwp;
}
  
-	return gic_configure_irq(irq, type, base, rwp_wait);

+
+   ret = gic_configure_irq(irq, type, base + GICD_ICFGR, rwp_wait);
+   if (ret && irq < 32) {
+   /* Misconfigured PPIs are usually not fatal */
+   pr_warn("GIC: PPI%d is secure or misconfigured\n", irq - 16);
+   ret = 0;
+   }
+
+   return ret;
  }
  
  static int gic_irq_set_vcpu_affinity(struct irq_data *d, void *vcpu)

diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index e45f45e68720..ab48760acabb 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -291,6 +291,7 @@ static int gic_set_type(struct irq_data *d, unsigned int 
type)
  {
void __iomem *base = gic_dist_base(d);
unsigned int gicirq = gic_irq(d);
+   int ret;
  
  	/* Interrupt configuration for SGIs can't be changed */

if (gicirq < 16)
@@ -301,7 +302,14 @@ static int gic_set_type(struct irq_data *d, unsigned int 
type)
type != IRQ_TYPE_EDGE_RISING)
return -EINVAL;
  
-	return gic_configure_irq(gicirq, type, base, NULL);

+   ret = gic_configure_irq(gicirq, type, base + GIC_DIST_CONFIG, NULL);
+   if (ret && gicirq < 32) {
+   /* Misconfigured PPIs are usually not fatal */
+   pr_warn("GIC: PPI%d is secure or misconfigured\n", gicirq - 16);
+   ret = 0;
+   }
+
+   return ret;
  }
  
  static int gic_irq_set_vcpu_affinity(struct irq_data *d, void *vcpu)

diff --git a/drivers/irqchip/irq-hip04.c b/drivers/irqchip/irq-hip04.c
index cf705827599c..1626131834a6 100644
--- a/drivers/irqchip/irq-hip04.c
+++ b/drivers/irqchip/irq-hip04.c
@@ -130,7 +130,12 @@ static int hip04_irq_set_type(struct irq_data *d, unsigned 
int type)
  
	raw_spin_lock(&irq_controller_lock);
  
-	ret = gic_configure_irq(irq, type, base, NULL);

+   ret = gic_configure_irq(irq, type, base + 

Re: [PATCH v2 04/12] irqchip/gic-v3: Add ESPI range support

2019-08-19 Thread Zenghui Yu

Hi Marc,

On 2019/8/6 18:01, Marc Zyngier wrote:

Add the required support for the ESPI range, which behave exactly like
the SPIs of old, only with new funky INTIDs.

Signed-off-by: Marc Zyngier 
---
  drivers/irqchip/irq-gic-v3.c   | 85 --
  include/linux/irqchip/arm-gic-v3.h | 17 +-
  2 files changed, 85 insertions(+), 17 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index db3bdedd7241..1ca4dde32034 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -51,13 +51,16 @@ struct gic_chip_data {
u32 nr_redist_regions;
u64 flags;
boolhas_rss;
-   unsigned intirq_nr;
struct partition_desc   *ppi_descs[16];
  };
  
  static struct gic_chip_data gic_data __read_mostly;

  static DEFINE_STATIC_KEY_TRUE(supports_deactivate_key);
  
+#define GIC_ID_NR	(1U << GICD_TYPER_ID_BITS(gic_data.rdists.gicd_typer))
+#define GIC_LINE_NR	GICD_TYPER_SPIS(gic_data.rdists.gicd_typer)


This indicates the maximum SPI INTID that the GIC implementation
supports, should we restrict it to no more than 1020?

ITLinesNumber can be '1', and I saw the following info on my host:
"GICv3: 992 SPIs implemented"


Thanks,
zenghui



Re: [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest

2019-08-13 Thread Zenghui Yu

On 2019/8/12 18:39, Steven Price wrote:

On 09/08/2019 14:51, Zenghui Yu wrote:
[...]

Hi Steven,

Since userspace is not involved yet (right?), no one will create the
PV_TIME device for guest (and no one will specify the IPA of the shared
stolen time region), and I guess we will get a "not supported" error
here.

So what should we do if we want to test this series now?  Any userspace
tools?  If not, do you have any plans for userspace development? ;-)


At the moment I have the following patch to kvmtool which creates the
PV_TIME device - this isn't in a state to go upstream, and Marc has
asked that I rework the memory allocation, so this will need to change.

It's a little ugly as it simply reserves the first page of RAM to use
for the PV time structures.


Thanks for sharing the code. It's good enough to show what is required
in user-space.

(I'm not familiar with kvmtool. I will first take some time to move the
steal time part to Qemu and see what will happen.)


Thanks,
zenghui


8<
diff --git a/Makefile b/Makefile
index 3862112..a79956b 100644
--- a/Makefile
+++ b/Makefile
@@ -158,7 +158,7 @@ endif
  # ARM
  OBJS_ARM_COMMON   := arm/fdt.o arm/gic.o arm/gicv2m.o 
arm/ioport.o \
   arm/kvm.o arm/kvm-cpu.o arm/pci.o arm/timer.o \
-  arm/pmu.o
+  arm/pmu.o arm/pvtime.o
  HDRS_ARM_COMMON   := arm/include
  ifeq ($(ARCH), arm)
DEFINES += -DCONFIG_ARM
diff --git a/arm/fdt.c b/arm/fdt.c
index c80e6da..19eccbc 100644
--- a/arm/fdt.c
+++ b/arm/fdt.c
@@ -119,6 +119,7 @@ static int setup_fdt(struct kvm *kvm)
  
  	/* Create new tree without a reserve map */

_FDT(fdt_create(fdt, FDT_MAX_SIZE));
+   _FDT(fdt_add_reservemap_entry(fdt, kvm->arch.memory_guest_start, 4096));
_FDT(fdt_finish_reservemap(fdt));
  
  	/* Header */

diff --git a/arm/kvm.c b/arm/kvm.c
index 1f85fc6..8bbfef1 100644
--- a/arm/kvm.c
+++ b/arm/kvm.c
@@ -11,6 +11,8 @@
  #include 
  #include 
  
+int pvtime_create(struct kvm *kvm);

+
  struct kvm_ext kvm_req_ext[] = {
{ DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
{ DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
@@ -86,6 +88,10 @@ void kvm__arch_init(struct kvm *kvm, const char 
*hugetlbfs_path, u64 ram_size)
/* Create the virtual GIC. */
if (gic__create(kvm, kvm->cfg.arch.irqchip))
die("Failed to create virtual GIC");
+
+   /* Setup PV time */
+   if (pvtime_create(kvm))
+   die("Failed to initialise PV time");
  }
  
  #define FDT_ALIGN	SZ_2M

diff --git a/arm/pvtime.c b/arm/pvtime.c
new file mode 100644
index 000..abcaab3
--- /dev/null
+++ b/arm/pvtime.c
@@ -0,0 +1,77 @@
+#include "kvm/kvm.h"
+
+#define KVM_DEV_TYPE_ARM_PV_TIME (KVM_DEV_TYPE_ARM_VGIC_ITS+2)
+
+/* Device Control API: PV_TIME */
+#define KVM_DEV_ARM_PV_TIME_PADDR  0
+#define KVM_DEV_ARM_PV_TIME_FREQUENCY  3
+
+#define KVM_DEV_ARM_PV_TIME_ST 0
+#define KVM_DEV_ARM_PV_TIME_LPT1
+
+static int pvtime_fd;
+
+int pvtime_create(struct kvm *kvm);
+
+int pvtime_create(struct kvm *kvm)
+{
+   int err;
+   u64 lpt_paddr = 0x1000;
+   u64 st_paddr = lpt_paddr + 4096;
+   u32 frequency = 100 * 1000 * 1000;
+
+   printf("lpt_paddr=%llx\n", lpt_paddr);
+
+   struct kvm_create_device pvtime_device = {
+   .type = KVM_DEV_TYPE_ARM_PV_TIME,
+   .flags = 0,
+   };
+
+   err = ioctl(kvm->vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
+   if (err) {
+   printf("Failed to create PV device\n");
+   return 0;
+   }
+
+   pvtime_fd = pvtime_device.fd;
+
+   struct kvm_device_attr lpt_base = {
+   .group = KVM_DEV_ARM_PV_TIME_PADDR,
+   .attr = KVM_DEV_ARM_PV_TIME_LPT,
+   .addr = (u64)(unsigned long)&lpt_paddr
+   };
+   struct kvm_device_attr st_base = {
+   .group = KVM_DEV_ARM_PV_TIME_PADDR,
+   .attr = KVM_DEV_ARM_PV_TIME_ST,
+   .addr = (u64)(unsigned long)&st_paddr
+   };
+
+   struct kvm_device_attr lpt_freq = {
+   .group = KVM_DEV_ARM_PV_TIME_FREQUENCY,
+   .attr = KVM_DEV_ARM_PV_TIME_LPT,
+   .addr = (u64)(unsigned long)&frequency
+   };
+
+   err = ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_base);
+   if (err) {
+   perror("ioctl lpt_base failed");
+   printf("Ignoring LPT...\n");
+   }
+   err = ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
+   if (err) {
+   perror("ioctl st_base failed");
+   goto out_err;
+   }
+   err = ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_freq);
+   if (err) {
+   perror("ioctl lpt_freq failed");
+   printf("Ignoring LPT...\n");
+   }
+
+   printf("PV time setup\n");
+
+   return 0;
+out_err:
+   close(pvtime_fd);
+   return err;
+}




[PATCH] irqchip/gic-v3-its: Remove the redundant set_bit for lpi_map

2019-07-27 Thread Zenghui Yu
We try to find a free LPI region in device's lpi_map and allocate them
(set them to 1) when we want to allocate LPIs for this device. This is
what bitmap_find_free_region() has done for us. The following set_bit
is redundant and a bit confusing (since we only set_bit against the first
allocated LPI idx). Remove it, and make the set_bit explicit by comment.

Signed-off-by: Zenghui Yu 
---
 drivers/irqchip/irq-gic-v3-its.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 472053c..7c69176 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2464,6 +2464,7 @@ static int its_alloc_device_irq(struct its_device *dev, 
int nvecs, irq_hw_number
 {
int idx;
 
+   /* Find a free LPI region in lpi_map and allocate them. */
idx = bitmap_find_free_region(dev->event_map.lpi_map,
  dev->event_map.nr_lpis,
  get_count_order(nvecs));
@@ -2471,7 +2472,6 @@ static int its_alloc_device_irq(struct its_device *dev, 
int nvecs, irq_hw_number
return -ENOSPC;
 
*hwirq = dev->event_map.lpi_base + idx;
-   set_bit(idx, dev->event_map.lpi_map);
 
return 0;
 }
-- 
1.8.3.1




[PATCH v2] KVM: arm/arm64: Introduce kvm_pmu_vcpu_init() to setup PMU counter idx

2019-07-18 Thread Zenghui Yu
We use "pmc->idx" and the "chained" bitmap to determine if the pmc is
chained, in kvm_pmu_pmc_is_chained().  But idx might be uninitialized
(and random) when we make this decision, through a KVM_ARM_VCPU_INIT
ioctl -> kvm_pmu_vcpu_reset(). And the test_bit() against this random
idx will potentially hit a KASAN BUG [1].

In general, idx is the static property of a PMU counter that is not
expected to be modified across resets, as suggested by Julien.  It
looks more reasonable if we can setup the PMU counter idx for a vcpu
in its creation time. Introduce a new function - kvm_pmu_vcpu_init()
for this basic setup. Oh, and the KASAN BUG will get fixed this way.

[1] https://www.spinics.net/lists/kvm-arm/msg36700.html

Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
Suggested-by: Andrew Murray 
Suggested-by: Julien Thierry 
Cc: Marc Zyngier 
Signed-off-by: Zenghui Yu 
---

Changes since v1:
 - Introduce kvm_pmu_vcpu_init() in vcpu's creation time, move the
   assignment of pmc->idx into it.
 - Thus change the subject. The old one is "KVM: arm/arm64: Assign
   pmc->idx before kvm_pmu_stop_counter()".

Julien, I haven't collected your Acked-by into this version. If you're
still happy with the change, please Ack again. Thanks!
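A minimal model of the change (the counter count and struct fields below are simplifications for the sketch, not the kernel's exact definitions): idx is assigned exactly once, at creation, so any later reset path that reads pmc->idx sees a well-defined value rather than uninitialized memory.

```c
#define MAX_COUNTERS	32	/* stand-in for ARMV8_PMU_MAX_COUNTERS */

struct pmc_sketch {
	int idx;		/* static identity of the counter */
};

struct pmu_sketch {
	struct pmc_sketch pmc[MAX_COUNTERS];
};

/* Mirrors kvm_pmu_vcpu_init(): give every counter its index up front. */
static void pmu_vcpu_init_sketch(struct pmu_sketch *pmu)
{
	for (int i = 0; i < MAX_COUNTERS; i++)
		pmu->pmc[i].idx = i;
}
```

After this runs at vcpu creation, a reset that does test_bit(pmc->idx, ...) can no longer index the chained bitmap with a random value.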

 include/kvm/arm_pmu.h |  2 ++
 virt/kvm/arm/arm.c|  2 ++
 virt/kvm/arm/pmu.c| 18 +++---
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 16c769a..6db0304 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -34,6 +34,7 @@ struct kvm_pmu {
 u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx);
 void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val);
 u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu);
+void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu);
 void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu);
 void kvm_pmu_disable_counter_mask(struct kvm_vcpu *vcpu, u64 val);
@@ -71,6 +72,7 @@ static inline u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu)
 {
return 0;
 }
+static inline void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu) {}
 static inline void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu) {}
 static inline void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu) {}
 static inline void kvm_pmu_disable_counter_mask(struct kvm_vcpu *vcpu, u64 val) {}
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index f645c0f..c704fa6 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -340,6 +340,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
/* Set up the timer */
kvm_timer_vcpu_init(vcpu);
 
+   kvm_pmu_vcpu_init(vcpu);
+
kvm_arm_reset_debug_ptr(vcpu);
 
return kvm_vgic_vcpu_init(vcpu);
diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index 3dd8238..362a018 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -215,6 +215,20 @@ static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc)
 }
 
 /**
+ * kvm_pmu_vcpu_init - assign pmu counter idx for cpu
+ * @vcpu: The vcpu pointer
+ *
+ */
+void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
+{
+   int i;
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+   for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
+   pmu->pmc[i].idx = i;
+}
+
+/**
  * kvm_pmu_vcpu_reset - reset pmu state for cpu
  * @vcpu: The vcpu pointer
  *
@@ -224,10 +238,8 @@ void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu)
int i;
	struct kvm_pmu *pmu = &vcpu->arch.pmu;
 
-   for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++) {
+   for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
		kvm_pmu_stop_counter(vcpu, &pmu->pmc[i]);
-   pmu->pmc[i].idx = i;
-   }
 
bitmap_zero(vcpu->arch.pmu.chained, ARMV8_PMU_MAX_COUNTER_PAIRS);
 }
-- 
1.8.3.1




[tip:irq/core] irq/irqdomain: Fix comment typo

2019-07-06 Thread tip-bot for Zenghui Yu
Commit-ID:  3a1d24ca9573fbc74a3d32c972c333b161e0e9dc
Gitweb: https://git.kernel.org/tip/3a1d24ca9573fbc74a3d32c972c333b161e0e9dc
Author: Zenghui Yu 
AuthorDate: Sat, 6 Jul 2019 04:41:12 +
Committer:  Thomas Gleixner 
CommitDate: Sat, 6 Jul 2019 10:40:20 +0200

irq/irqdomain: Fix comment typo

Fix typo in the comment on top of __irq_domain_add().

Signed-off-by: Zenghui Yu 
Signed-off-by: Thomas Gleixner 
Link: https://lkml.kernel.org/r/1562388072-23492-1-git-send-email-yuzeng...@huawei.com

---
 kernel/irq/irqdomain.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index e7d17cc3a3d7..3078d0e48bba 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -123,7 +123,7 @@ EXPORT_SYMBOL_GPL(irq_domain_free_fwnode);
  * @ops: domain callbacks
  * @host_data: Controller private data pointer
  *
- * Allocates and initialize and irq_domain structure.
+ * Allocates and initializes an irq_domain structure.
  * Returns pointer to IRQ domain, or NULL on failure.
  */
 struct irq_domain *__irq_domain_add(struct fwnode_handle *fwnode, int size,


[PATCH] irq/irqdomain: Fix typo in the comment on top of __irq_domain_add()

2019-07-05 Thread Zenghui Yu
Fix typo in the comment on top of __irq_domain_add().

Signed-off-by: Zenghui Yu 
---
 kernel/irq/irqdomain.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index a453e22..db7b713 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -123,7 +123,7 @@ void irq_domain_free_fwnode(struct fwnode_handle *fwnode)
  * @ops: domain callbacks
  * @host_data: Controller private data pointer
  *
- * Allocates and initialize and irq_domain structure.
+ * Allocates and initializes an irq_domain structure.
  * Returns pointer to IRQ domain, or NULL on failure.
  */
 struct irq_domain *__irq_domain_add(struct fwnode_handle *fwnode, int size,
-- 
1.8.3.1




Re: [PATCH v1 1/5] KVM: arm/arm64: Remove kvm_mmio_emulate tracepoint

2019-06-13 Thread Zenghui Yu

Hi James,

On 2019/6/12 20:48, James Morse wrote:

Hi,

On 12/06/2019 10:08, Zenghui Yu wrote:

In current KVM/ARM code, no one will invoke trace_kvm_mmio_emulate().
Remove this TRACE_EVENT definition.


Oooer. We can't just go removing these things, they are visible to user-space.

I recall an article on this: https://lwn.net/Articles/737530/
"Another attempt to address the tracepoint ABI problem"

I agree this is orphaned, it was added by commit 45e96ea6b369 ("KVM: ARM:
Handle I/O aborts"), but there never was a caller.

The problem with removing it is
/sys/kernel/debug/tracing/events/kvm/kvm_mmio_emulate disappears. Any
program relying on that being present (but useless) is now broken.

Thanks for the reminder.

It turned out that I knew little about the tracepoint ABI :( .
I'm OK to just drop this patch in next version.


Thanks,
zenghui


.




[tip:perf/core] perf jevents: Remove unused variable

2019-05-18 Thread tip-bot for Zenghui Yu
Commit-ID:  8e8f515d567f9ec1d960e9fdb117d39753b7504d
Gitweb: https://git.kernel.org/tip/8e8f515d567f9ec1d960e9fdb117d39753b7504d
Author: Zenghui Yu 
AuthorDate: Wed, 15 May 2019 11:19:29 +
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Wed, 15 May 2019 16:36:49 -0300

perf jevents: Remove unused variable

Address gcc warning:

  pmu-events/jevents.c: In function ‘save_arch_std_events’:
  pmu-events/jevents.c:417:15: warning: unused variable ‘sb’ [-Wunused-variable]
struct stat *sb = data;
 ^~

Signed-off-by: Zenghui Yu 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: John Garry 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: wanghaibin.w...@huawei.com
Link: http://lkml.kernel.org/r/1557919169-23972-1-git-send-email-yuzeng...@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/pmu-events/jevents.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/perf/pmu-events/jevents.c b/tools/perf/pmu-events/jevents.c
index daaea5003d4a..58f77fd0f59f 100644
--- a/tools/perf/pmu-events/jevents.c
+++ b/tools/perf/pmu-events/jevents.c
@@ -415,7 +415,6 @@ static int save_arch_std_events(void *data, char *name, char *event,
char *metric_name, char *metric_group)
 {
struct event_struct *es;
-   struct stat *sb = data;
 
es = malloc(sizeof(*es));
if (!es)


[PATCH] perf jevents: Remove unused variables

2019-05-15 Thread Zenghui Yu
Fix gcc warning:

pmu-events/jevents.c: In function ‘save_arch_std_events’:
pmu-events/jevents.c:417:15: warning: unused variable ‘sb’ [-Wunused-variable]
  struct stat *sb = data;
   ^~

Signed-off-by: Zenghui Yu 
---
 tools/perf/pmu-events/jevents.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/perf/pmu-events/jevents.c b/tools/perf/pmu-events/jevents.c
index 68c92bb..92e60fd 100644
--- a/tools/perf/pmu-events/jevents.c
+++ b/tools/perf/pmu-events/jevents.c
@@ -414,7 +414,6 @@ static int save_arch_std_events(void *data, char *name, char *event,
char *metric_name, char *metric_group)
 {
struct event_struct *es;
-   struct stat *sb = data;
 
es = malloc(sizeof(*es));
if (!es)
-- 
1.8.3.1




Re: [RFC PATCH] irqchip/gic-v3: Correct the usage of GICD_CTLR's RWP field

2019-05-13 Thread Zenghui Yu

Hi Andre,

On 2019/5/13 16:37, Andre Przywara wrote:

On Mon, 13 May 2019 04:15:54 +
Zenghui Yu  wrote:

Hi,


As per ARM IHI 0069D, GICD_CTLR's RWP field tracks updates to:

GICD_CTLR's Group Enable bits, for transitions from 1 to 0 only
GICD_CTLR's ARE bits, E1NWF bit and DS bit (if we have)
GICD_ICENABLER

We seem to use this field in an inappropriate way in several places in
the GICv3 driver. For example:

In gic_set_affinity(), if the interrupt was not enabled, we only need
to write GICD_IROUTER with the appropriate mpidr value. Updates to
GICD_IROUTER will not be tracked by RWP field, and we can remove the
unnecessary RWP waiting.


I am not sure this is the proper fix, see below inline.


In gic_dist_init(), we "Enable distributor with ARE, Group1" by writing
to GICD_CTLR, and we should use gic_dist_wait_for_rwp() here.


That looks reasonable, yes.


These two are obvious cases, and there are some other cases where we don't
need to do RWP waiting, such as in gic_configure_irq() and gic_poke_irq().
But too many modifications make me not confident. It's hard for me to say
whether this patch is doing the right thing and how the RWP waiting
affects the system, thus RFC.


So did you actually see a problem, and this patch fixes it? Or was this
just discovered by code inspection and comparing to the spec?


The latter ;-)


Signed-off-by: Zenghui Yu 
---
  drivers/irqchip/irq-gic-v3.c | 8 ++--
  1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 15e55d3..8d63950 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -600,6 +600,7 @@ static void __init gic_dist_init(void)
/* Enable distributor with ARE, Group1 */
writel_relaxed(GICD_CTLR_ARE_NS | GICD_CTLR_ENABLE_G1A |
GICD_CTLR_ENABLE_G1, base + GICD_CTLR);
+   gic_dist_wait_for_rwp();
  
  	/*

 * Set all global interrupts to the boot CPU only. ARE must be
@@ -986,14 +987,9 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
gic_write_irouter(val, reg);
  
-	/*

-* If the interrupt was enabled, enabled it again. Otherwise,
-* just wait for the distributor to have digested our changes.
-*/
+   /* If the interrupt was enabled, enable it again. */
if (enabled)
gic_unmask_irq(d);
-   else
-   gic_dist_wait_for_rwp();


I think you are right in this is not needed here.
But I guess this call belongs further up in this function, after the
gic_mask_irq() call, as this one writes to GICD_ICENABLER. So in case this
IRQ was enabled, we should wait for the distributor to have properly
disabled it, before changing its affinity.


I still think we don't need this call in gic_set_affinity().
Actually, the writes to GICD_ICENABLER happens in gic_poke_irq(). And we
already have a gic_dist_wait_for_rwp() there, named rwp_wait().


thanks,
zenghui



Cheers,
Andre.

  
  	irq_data_update_effective_affinity(d, cpumask_of(cpu));
  



.





[RFC PATCH] irqchip/gic-v3: Correct the usage of GICD_CTLR's RWP field

2019-05-12 Thread Zenghui Yu
As per ARM IHI 0069D, GICD_CTLR's RWP field tracks updates to:

GICD_CTLR's Group Enable bits, for transitions from 1 to 0 only
GICD_CTLR's ARE bits, E1NWF bit and DS bit (if we have)
GICD_ICENABLER

We seem to use this field in an inappropriate way in several places in
the GICv3 driver. For example:

In gic_set_affinity(), if the interrupt was not enabled, we only need
to write GICD_IROUTER with the appropriate mpidr value. Updates to
GICD_IROUTER will not be tracked by RWP field, and we can remove the
unnecessary RWP waiting.

In gic_dist_init(), we "Enable distributor with ARE, Group1" by writing
to GICD_CTLR, and we should use gic_dist_wait_for_rwp() here.

These two are obvious cases, and there are some other cases where we don't
need to do RWP waiting, such as in gic_configure_irq() and gic_poke_irq().
But too many modifications make me not confident. It's hard for me to say
whether this patch is doing the right thing and how the RWP waiting
affects the system, thus RFC.

Signed-off-by: Zenghui Yu 
---
 drivers/irqchip/irq-gic-v3.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 15e55d3..8d63950 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -600,6 +600,7 @@ static void __init gic_dist_init(void)
/* Enable distributor with ARE, Group1 */
	writel_relaxed(GICD_CTLR_ARE_NS | GICD_CTLR_ENABLE_G1A | GICD_CTLR_ENABLE_G1,
   base + GICD_CTLR);
+   gic_dist_wait_for_rwp();
 
/*
 * Set all global interrupts to the boot CPU only. ARE must be
@@ -986,14 +987,9 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 
gic_write_irouter(val, reg);
 
-   /*
-* If the interrupt was enabled, enabled it again. Otherwise,
-* just wait for the distributor to have digested our changes.
-*/
+   /* If the interrupt was enabled, enable it again. */
if (enabled)
gic_unmask_irq(d);
-   else
-   gic_dist_wait_for_rwp();
 
irq_data_update_effective_affinity(d, cpumask_of(cpu));
 
-- 
1.8.3.1




[PATCH] KVM: arm/arm64: vgic: Restrict setting irq->targets only in GICv2

2019-04-04 Thread Zenghui Yu
Commit ad275b8bb1e6 ("KVM: arm/arm64: vgic-new: vgic_init: implement
vgic_init") had set irq->targets in kvm_vgic_vcpu_init(), regardless of
the GIC architecture (v2 or v3). When the number of vcpus reaches 32
(in v3), UBSAN will complain about it.

 

 UBSAN: Undefined behaviour in 
arch/arm64/kvm/../../../virt/kvm/arm/vgic/vgic-init.c:223:21
 shift exponent 32 is too large for 32-bit type 'unsigned int'
 CPU: 13 PID: 833 Comm: CPU 32/KVM Kdump: loaded Not tainted 5.1.0-rc1+ #16
 Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.58 10/24/2018
 Call trace:
  dump_backtrace+0x0/0x190
  show_stack+0x24/0x30
  dump_stack+0xc8/0x114
  ubsan_epilogue+0x14/0x50
  __ubsan_handle_shift_out_of_bounds+0x118/0x188
  kvm_vgic_vcpu_init+0x1d4/0x200
  kvm_arch_vcpu_init+0x3c/0x48
  kvm_vcpu_init+0xa8/0x100
  kvm_arch_vcpu_create+0x94/0x120
  kvm_vm_ioctl+0x57c/0xe58
  do_vfs_ioctl+0xc4/0x7f0
  ksys_ioctl+0x8c/0xa0
  __arm64_sys_ioctl+0x28/0x38
  el0_svc_common+0xa0/0x190
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc
 


This patch restricts setting irq->targets to GICv2, which only supports
a maximum of eight PEs, to keep UBSAN quiet. And since irq->mpidr will
only be used by SPI in GICv3, we decided to set mpidr to 0 for SGI and
PPI.

Like commit ab2d5eb03dbb ("KVM: arm/arm64: vgic: Always initialize the
group of private IRQs"), we should also take the creating order of the
VGIC and VCPUs into consideration.

Cc: Eric Auger 
Cc: Marc Zyngier 
Cc: Andre Przywara 
Cc: Christoffer Dall 
Signed-off-by: Zenghui Yu 
---
 virt/kvm/arm/vgic/vgic-init.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/arm/vgic/vgic-init.c b/virt/kvm/arm/vgic/vgic-init.c
index 3bdb31e..0cba92e 100644
--- a/virt/kvm/arm/vgic/vgic-init.c
+++ b/virt/kvm/arm/vgic/vgic-init.c
@@ -220,7 +220,6 @@ int kvm_vgic_vcpu_init(struct kvm_vcpu *vcpu)
irq->intid = i;
irq->vcpu = NULL;
irq->target_vcpu = vcpu;
-   irq->targets = 1U << vcpu->vcpu_id;
		kref_init(&irq->refcount);
if (vgic_irq_is_sgi(i)) {
/* SGIs */
@@ -231,10 +230,14 @@ int kvm_vgic_vcpu_init(struct kvm_vcpu *vcpu)
irq->config = VGIC_CONFIG_LEVEL;
}
 
-   if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3)
+   if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) {
irq->group = 1;
-   else
+   irq->mpidr = 0;
+   } else {
irq->group = 0;
+   if (vcpu->vcpu_id < VGIC_V2_MAX_CPUS)
+   irq->targets = 1U << vcpu->vcpu_id;
+   }
}
 
if (!irqchip_in_kernel(vcpu->kvm))
@@ -297,10 +300,13 @@ int vgic_init(struct kvm *kvm)
 
for (i = 0; i < VGIC_NR_PRIVATE_IRQS; i++) {
		struct vgic_irq *irq = &vgic_cpu->private_irqs[i];
-   if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3)
+   if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) {
irq->group = 1;
-   else
+   irq->mpidr = 0;
+   } else {
irq->group = 0;
+   irq->targets = 1U << vcpu->vcpu_id;
+   }
}
}
 
-- 
1.8.3.1




Re: [RFC PATCH] irqchip/gic-v3: Make gic_handle_irq() notrace

2019-03-29 Thread Zenghui Yu
On Fri, Mar 29, 2019 at 11:46 PM Steven Rostedt  wrote:
>
> On Fri, 29 Mar 2019 23:35:59 +0800
> Zenghui Yu  wrote:
>
> > Hi Steven,
> >
> > On 2019/3/29 22:53, Steven Rostedt wrote:
> > > On Fri, 29 Mar 2019 13:58:40 +
> > > Marc Zyngier  wrote:
> > >
> > >> On the other hand, if you can generate pseudo-NMIs, you could end-up
> > >> tracing gic_handle_irq whilst being inside the tracing code with
> > >> interrupts being notionally disabled (and that could be pretty bad).
> > >
> > > Actually, that should still be safe. The tracing code is expected to
> > > trace NMIs.
> > >
> > > Now the entry of an NMI can be an issue because of the way tracing is
> > > enabled. But this would also cause function tracing to crash, which was
> > > not stated. Does function tracing have the same issue, or is this just
> > > with function_graph tracing?
> >
> > This is just with function_graph tracing. Function tracing works fine.
>
> Then we can rule out the break point issue.
>
> >
> > >
> > > This is because a breakpoint is added to all the places that are going
> > > to be traced so that the "nops" at the beginning of function calls are
> > > going to be converted to calls to the tracer. Now that means a
> > > breakpoint is being added at the beginning of gic_handle_irq(). I don't
> > > know how this pseudo NMI works, but we have notrace set for do_NMI
> > > because breakpoints at the entry (before it can fix things) causes
> > > issues. But this still doesn't make sense because the gic_handle_irq()
> > > call doesn't fix things up either, so other functions that are traced
> > > by gic_handle_irq() should blow up too, which appears (by the patch)
> > > not to be the case.
> >
> > Thanks for your explanation. I will read the code around further, and
> > get back to you if there's any discovery.
>
> Well, if it only affects function_graph tracing, then we don't need to
> investigate the break point issue. If it was the break point issue, it
> would cause function tracing to break too.

OK.

>
> Also, which architecture is this for? ARM or ARM64?

on ARM64.


thanks,

zenghui


Re: [RFC PATCH] irqchip/gic-v3: Make gic_handle_irq() notrace

2019-03-29 Thread Zenghui Yu

Hi Steven,

On 2019/3/29 22:53, Steven Rostedt wrote:

On Fri, 29 Mar 2019 13:58:40 +
Marc Zyngier  wrote:


On the other hand, if you can generate pseudo-NMIs, you could end-up
tracing gic_handle_irq whilst being inside the tracing code with
interrupts being notionally disabled (and that could be pretty bad).


Actually, that should still be safe. The tracing code is expected to
trace NMIs.

Now the entry of an NMI can be an issue because of the way tracing is
enabled. But this would also cause function tracing to crash, which was
not stated. Does function tracing have the same issue, or is this just
with function_graph tracing?


This is just with function_graph tracing. Function tracing works fine.



This is because a breakpoint is added to all the places that are going
to be traced so that the "nops" at the beginning of function calls are
going to be converted to calls to the tracer. Now that means a
breakpoint is being added at the beginning of gic_handle_irq(). I don't
know how this pseudo NMI works, but we have notrace set for do_NMI
because breakpoints at the entry (before it can fix things) causes
issues. But this still doesn't make sense because the gic_handle_irq()
call doesn't fix things up either, so other functions that are traced
by gic_handle_irq() should blow up too, which appears (by the patch)
not to be the case.


Thanks for your explanation. I will read the code around further, and
get back to you if there's any discovery.


thanks,

zenghui



-- Steve

.





Re: [RFC PATCH] irqchip/gic-v3: Make gic_handle_irq() notrace

2019-03-29 Thread Zenghui Yu

Hi Marc,

On 2019/3/29 21:58, Marc Zyngier wrote:

Hi Zenghui,

On 29/03/2019 13:23, Zenghui Yu wrote:

Enabling pseudo-NMI together with the function_graph tracer will lead
the system to a hang. This is easy to reproduce:

   1) Set "irqchip.gicv3_pseudo_nmi=1" on the kernel command line
   2) echo function_graph > /sys/kernel/debug/tracing/current_tracer

This patch (RFC) sets gic_handle_irq() as notrace and it seems to work
fine now. But I have no idea what the issue is exactly, so
you can regard this patch as a report :)

Can someone take a look at it and provide some explanation?

Thanks!

Cc: Julien Thierry 
Cc: Steven Rostedt 
Signed-off-by: Zenghui Yu 
---
  drivers/irqchip/irq-gic-v3.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 15e55d3..8d0c25f 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -487,7 +487,7 @@ static inline void gic_handle_nmi(u32 irqnr, struct pt_regs *regs)
gic_deactivate_unhandled(irqnr);
  }
  
-static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
+static asmlinkage notrace void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
  {
u32 irqnr;
  




That's interesting. Do you have any out of tree patch that actually
makes use of the pseudo-NMI feature? Without those patches, the
behaviour should stay unchanged.


I am at commit 1a9df9e29c2afecf6e3089442d429b377279ca3c. No more
patches, and this is the most confusing. Just out of curiosity, I
wanted to run Julien's "Use NMI for perf interrupt" patch (posted
on the mailing list), so I have to enable NMI first.

That said, with
  1) Select Kernel Feature -> Support for NMI-like interrupts
  2) Set "irqchip.gicv3_pseudo_nmi=1" on the kernel command line
  3) No pseudo-NMIs have been generated at all
and this issue was hit.


thanks,

zenghui



On the other hand, if you can generate pseudo-NMIs, you could end-up
tracing gic_handle_irq whilst being inside the tracing code with
interrupts being notionally disabled (and that could be pretty bad).

So, patches or no patches?

Thanks,

M.





[RFC PATCH] irqchip/gic-v3: Make gic_handle_irq() notrace

2019-03-29 Thread Zenghui Yu
Enabling pseudo-NMI together with the function_graph tracer will lead
the system to a hang. This is easy to reproduce:

  1) Set "irqchip.gicv3_pseudo_nmi=1" on the kernel command line
  2) echo function_graph > /sys/kernel/debug/tracing/current_tracer

This patch (RFC) sets gic_handle_irq() as notrace and it seems to work
fine now. But I have no idea what the issue is exactly, so
you can regard this patch as a report :)

Can someone take a look at it and provide some explanation?

Thanks!

Cc: Julien Thierry 
Cc: Steven Rostedt 
Signed-off-by: Zenghui Yu 
---
 drivers/irqchip/irq-gic-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 15e55d3..8d0c25f 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -487,7 +487,7 @@ static inline void gic_handle_nmi(u32 irqnr, struct pt_regs *regs)
gic_deactivate_unhandled(irqnr);
 }
 
-static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
+static asmlinkage notrace void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
 {
u32 irqnr;
 
-- 
1.8.3.1




Re: [RFC] Question about TLB flush while set Stage-2 huge pages

2019-03-14 Thread Zenghui Yu

Hi Suzuki,

On 2019/3/14 18:55, Suzuki K Poulose wrote:

Hi Zheng,

On Wed, Mar 13, 2019 at 05:45:31PM +0800, Zheng Xiang wrote:



On 2019/3/13 2:18, Marc Zyngier wrote:

Hi Zheng,

On 12/03/2019 15:30, Zheng Xiang wrote:

Hi Marc,

On 2019/3/12 19:32, Marc Zyngier wrote:

Hi Zheng,

On 11/03/2019 16:31, Zheng Xiang wrote:

Hi all,

While a page is merged into a transparent huge page, KVM will invalidate 
Stage-2 for
the base address of the huge page and the whole of Stage-1.
However, this just only invalidates the first page within the huge page and the 
other
pages are not invalidated, see bellow:

 +---+--+
 |abcde   2MB-Page  |
 +---+--+

 TLB before setting new pmd:
 +---+--+
 |  VA   |PAGESIZE  |
 +---+--+
 |  a|  4KB |
 +---+--+
 |  b|  4KB |
 +---+--+
 |  c|  4KB |
 +---+--+
 |  d|  4KB |
 +---+--+

 TLB after setting new pmd:
 +---+--+
 |  VA   |PAGESIZE  |
 +---+--+
 |  a|  2MB |
 +---+--+
 |  b|  4KB |
 +---+--+
 |  c|  4KB |
 +---+--+
 |  d|  4KB |
 +---+--+

When VM access *b* address, it will hit the TLB and result in TLB conflict 
aborts or other potential exceptions.


That's really bad. I can only imagine two scenarios:

1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), loosing
the PTE table in the process, and place the PMD instead. I can't see
this happening.

2) We fail to invalidate on unmap, and that slightly less bad (but still
quite bad).

Which of the two cases are you seeing?


For example, we need to keep tracking of the VM memory dirty pages when VM is 
in live migration.
KVM will set the memslot READONLY and split the huge pages.
After live migration is canceled and abort, the pages will be merged into THP.
The later access to these pages which are READONLY will cause level-3 
Permission Fault until they are invalidated.

So should we invalidate the tlb entries for all relative pages(e.g a,b,c,d), 
like __flush_tlb_range()?
Or we can call __kvm_tlb_flush_vmid() to invalidate all tlb entries.


We should perform an invalidate on each unmap. unmap_stage2_range seems
to do the right thing. __flush_tlb_range only caters for Stage1
mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
TLBs for the whole VM.

I'd really like to understand what you're seeing, and how to reproduce
it. Do you have a minimal example I could run on my own HW?


When I start the live migration for a VM, qemu then begins to log and count 
dirty pages.
During the live migration, KVM sets the pages READONLY so that we can count
how many pages are written afterwards.

Anything is OK until I cancel the live migration and qemu stops logging.
Later the VM hangs.
The trace log repeatedly shows a level-3 permission fault caused by a write
to the same IPA. After analyzing the source code, I find KVM always returns
from the below *if* statement in stage2_set_pmd_huge() even if we only have
a single VCPU:

 /*
  * Multiple vcpus faulting on the same PMD entry, can
  * lead to them sequentially updating the PMD with the
  * same value. Following the break-before-make
  * (pmd_clear() followed by tlb_flush()) process can
  * hinder forward progress due to refaults generated
  * on missing translations.
  *
  * Skip updating the page table if the entry is
  * unchanged.
  */
 if (pmd_val(old_pmd) == pmd_val(*new_pmd))
 return 0;

The PMD has already set the PMD_S2_RDWR bit. I suspect kvm_tlb_flush_vmid_ipa()
does not invalidate Stage-2 for the subpages of the PMD (except the first PTE
of this PMD). Finally I add some debug code to flush the TLB for all subpages
of the PMD, as shown below:

 /*
  * Mapping in huge pages should only happen through a
  * fault.  If a page is merged into a transparent huge
  * page, the individual subpages of that huge page
  * should be unmapped through MMU notifiers before we
  * get here.
  *
  * Merging of CompoundPages is not supported; they
  * should become splitting first, unmapped, merged,
  * and mapped back in on-demand.
  */
 VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));

 pmd_clear(pmd);
 for (cnt = 0; cnt < 512; cnt++)
 kvm_tlb_flush_vmid_ipa(kvm, addr + cnt*PAGE_SIZE);

Then the 

Re: [RFC PATCH] KVM: arm64: Force a PTE mapping when logging is enabled

2019-03-05 Thread Zenghui Yu




On 2019/3/5 19:13, Suzuki K Poulose wrote:

Hi Zenghui,

On 05/03/2019 11:09, Zenghui Yu wrote:

Hi Marc, Suzuki,

On 2019/3/5 1:34, Marc Zyngier wrote:

Hi Zenghui, Suzuki,

On 04/03/2019 17:13, Suzuki K Poulose wrote:

Hi Zenghui,

On Sun, Mar 03, 2019 at 11:14:38PM +0800, Zenghui Yu wrote:

I think there're still some problems in this patch... Details below.

On Sat, Mar 2, 2019 at 11:39 AM Zenghui Yu  
wrote:


The idea behind this is: we don't want to keep tracking of huge 
pages when
logging_active is true, which will result in performance 
degradation.  We
still need to set vma_pagesize to PAGE_SIZE, so that we can make 
use of it

to force a PTE mapping.


Yes, you're right. We are indeed ignoring the force_pte flag.



Cc: Suzuki K Poulose 
Cc: Punit Agrawal 
Signed-off-by: Zenghui Yu 

---
Atfer looking into https://patchwork.codeaurora.org/patch/647985/ 
, the
"vma_pagesize = PAGE_SIZE" logic was not intended to be deleted. 
As far
as I can tell, we used to have "hugetlb" to force the PTE mapping, 
but
we have "vma_pagesize" currently instead. We should set it 
properly for
performance reasons (e.g, in VM migration). Did I miss something 
important?


---
   virt/kvm/arm/mmu.c | 7 +++
   1 file changed, 7 insertions(+)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 30251e2..7d41b16 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1705,6 +1705,13 @@ static int user_mem_abort(struct kvm_vcpu 
*vcpu, phys_addr_t fault_ipa,
   (vma_pagesize == PUD_SIZE && 
kvm_stage2_has_pmd(kvm))) &&

  !force_pte) {
  gfn = (fault_ipa & 
huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;

+   } else {
+   /*
+    * Fallback to PTE if it's not one of the stage2
+    * supported hugepage sizes or the corresponding 
level

+    * doesn't exist, or logging is enabled.


First, Instead of "logging is enabled", it should be "force_pte is 
true",

since "force_pte" will be true when:

  1) fault_supports_stage2_pmd_mappings() return false; or
  2) "logging is enabled" (e.g, in VM migration).

Second, fallback some unsupported hugepage sizes (e.g, 64K hugepage 
with

4K pages) to PTE is somewhat strange. And it will then _unexpectedly_
reach transparent_hugepage_adjust(), though no real adjustment will 
happen
since commit fd2ef358282c ("KVM: arm/arm64: Ensure only THP is 
candidate
for adjustment"). Keeping "vma_pagesize" there as it is will be 
better,

right?

So I'd just simplify the logic like:


We could fix this right in the beginning. See patch below:



  } else if (force_pte) {
  vma_pagesize = PAGE_SIZE;
  }


Will send a V2 later and waiting for your comments :)



diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 30251e2..529331e 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1693,7 +1693,9 @@ static int user_mem_abort(struct kvm_vcpu 
*vcpu, phys_addr_t fault_ipa,

   return -EFAULT;
   }
-    vma_pagesize = vma_kernel_pagesize(vma);
+    /* If we are forced to map at page granularity, force the pagesize here */
+    vma_pagesize = force_pte ? PAGE_SIZE : vma_kernel_pagesize(vma);
+
   /*
    * The stage2 has a minimum of 2 level table (For arm64 see
    * kvm_arm_setup_stage2()). Hence, we are guaranteed that we can
@@ -1701,11 +1703,10 @@ static int user_mem_abort(struct kvm_vcpu 
*vcpu, phys_addr_t fault_ipa,

    * As for PUD huge maps, we must make sure that we have at least
    * 3 levels, i.e, PMD is not folded.
    */
-    if ((vma_pagesize == PMD_SIZE ||
- (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm))) &&
-    !force_pte) {
+    if (vma_pagesize == PMD_SIZE ||
+    (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm)))
   gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> 
PAGE_SHIFT;

-    }
+
    up_read(&current->mm->mmap_sem);
   /* We need minimum second+third level pages */


A nicer implementation and easier to understand, thanks!


That's pretty interesting, because this is almost what we already have
in the NV code:

https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/tree/virt/kvm/arm/mmu.c?h=kvm-arm64/nv-wip-v5.0-rc7#n1752 



(note that force_pte is gone in that branch).


haha :-) sorry about that. I haven't looked into the NV code yet, so ...

But I'm still wondering: should we fix this wrong mapping size problem
before NV is introduced? This problem has not much to do with NV,
and 5.0 has already been released with it (and 5.1 will be released
without the fix ...).


Yes, we must fix it. I will soon send out a patch copying on it.
Its just that I find some more issues around forcing the PTE
mappings with PUD huge pages. I will send something out soon.


Sounds good!


zenghui



