Re: [Resend RFC PATCH V2 09/12] swiotlb: Add bounce buffer remap address setting function

2021-04-15 Thread Konrad Rzeszutek Wilk
On Wed, Apr 14, 2021 at 10:49:42AM -0400, Tianyu Lan wrote:
> From: Tianyu Lan 
> 
> For a Hyper-V isolation VM with AMD SEV-SNP, the bounce buffer (shared memory)
> needs to be accessed via an extra address space (e.g. addresses above bit 39).
> Hyper-V code may remap the extra address space outside of swiotlb.
> swiotlb_bounce()

May? Isn't this a MUST in this case?

> needs to use the remapped virtual address to copy data from/to the bounce
> buffer. Add a new interface, swiotlb_set_bounce_remap(), to do that.

I am a bit lost - why can't you use swiotlb_init and pass in your
DMA pool that is already above bit 39?

> 
> Signed-off-by: Tianyu Lan 
> ---
>  include/linux/swiotlb.h |  5 +
>  kernel/dma/swiotlb.c| 13 -
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index d9c9fc9ca5d2..3ccd08116683 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -82,8 +82,13 @@ unsigned int swiotlb_max_segment(void);
>  size_t swiotlb_max_mapping_size(struct device *dev);
>  bool is_swiotlb_active(void);
>  void __init swiotlb_adjust_size(unsigned long new_size);
> +void swiotlb_set_bounce_remap(unsigned char *vaddr);
>  #else
>  #define swiotlb_force SWIOTLB_NO_FORCE
> +static inline void swiotlb_set_bounce_remap(unsigned char *vaddr)
> +{
> +}
> +
>  static inline bool is_swiotlb_buffer(phys_addr_t paddr)
>  {
>   return false;
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 7c42df6e6100..5fd2db6aa149 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -94,6 +94,7 @@ static unsigned int io_tlb_index;
>   * not be bounced (unless SWIOTLB_FORCE is set).
>   */
>  static unsigned int max_segment;
> +static unsigned char *swiotlb_bounce_remap_addr;
>  
>  /*
>   * We need to save away the original address corresponding to a mapped entry
> @@ -421,6 +422,11 @@ void __init swiotlb_exit(void)
>   swiotlb_cleanup();
>  }
>  
> +void swiotlb_set_bounce_remap(unsigned char *vaddr)
> +{
> + swiotlb_bounce_remap_addr = vaddr;
> +}
> +
>  /*
>   * Bounce: copy the swiotlb buffer from or back to the original dma location
>   */
> @@ -428,7 +434,12 @@ static void swiotlb_bounce(phys_addr_t orig_addr, 
> phys_addr_t tlb_addr,
>  size_t size, enum dma_data_direction dir)
>  {
>   unsigned long pfn = PFN_DOWN(orig_addr);
> - unsigned char *vaddr = phys_to_virt(tlb_addr);
> + unsigned char *vaddr;
> +
> + if (swiotlb_bounce_remap_addr)
> + vaddr = swiotlb_bounce_remap_addr + tlb_addr - io_tlb_start;
> + else
> + vaddr = phys_to_virt(tlb_addr);
>  
>   if (PageHighMem(pfn_to_page(pfn))) {
>   /* The buffer does not have a mapping.  Map it in and copy */
> -- 
> 2.25.1
> 

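For readers following the thread, here is a minimal sketch of how a caller of
the proposed interface might look. This is illustrative only and not code from
the series: hv_remap_swiotlb(), the use of memremap(), and the swiotlb_size
parameter are assumptions; ms_hyperv.shared_gpa_boundary comes from the rest of
the patch set.

/*
 * Illustrative sketch only - not from the posted series.  Assumes the
 * swiotlb pool starting at tlb_start has already been shared with the
 * host, and that ms_hyperv.shared_gpa_boundary selects the extra
 * address space discussed above.
 */
static int hv_remap_swiotlb(phys_addr_t tlb_start, size_t swiotlb_size)
{
	void *vaddr;

	/* Map the bounce pool through the shared (extra) address space. */
	vaddr = memremap(tlb_start + ms_hyperv.shared_gpa_boundary,
			 swiotlb_size, MEMREMAP_WB);
	if (!vaddr)
		return -ENOMEM;

	/* swiotlb_bounce() will copy through this alias from now on. */
	swiotlb_set_bounce_remap(vaddr);
	return 0;
}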

Re: [Resend RFC PATCH V2 07/12] HV/Vmbus: Initialize VMbus ring buffer for Isolation VM

2021-04-15 Thread Konrad Rzeszutek Wilk
On Wed, Apr 14, 2021 at 10:49:40AM -0400, Tianyu Lan wrote:
> From: Tianyu Lan 
> 
> VMbus ring buffers are shared with the host and need to
> be accessed via the extra address space of an Isolation VM
> with SNP support. This patch maps the ring buffer
> address into the extra address space via ioremap(). The HV host

Why do you need to use ioremap()? Why not just use vmap?


> visibility hvcall smears data in the ring buffer, so
> reset the ring buffer memory to zero after calling the
> visibility hvcall.

So you are exposing these two:
 EXPORT_SYMBOL_GPL(get_vm_area);
 EXPORT_SYMBOL_GPL(ioremap_page_range);

But if you used vmap wouldn't you get the same thing for free?

> 
> Signed-off-by: Tianyu Lan 
> ---
>  drivers/hv/channel.c  | 10 +
>  drivers/hv/hyperv_vmbus.h |  2 +
>  drivers/hv/ring_buffer.c  | 83 +--
>  mm/ioremap.c  |  1 +
>  mm/vmalloc.c  |  1 +
>  5 files changed, 76 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
> index 407b74d72f3f..4a9fb7ad4c72 100644
> --- a/drivers/hv/channel.c
> +++ b/drivers/hv/channel.c
> @@ -634,6 +634,16 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
>   if (err)
>   goto error_clean_ring;
>  
> + err = hv_ringbuffer_post_init(&newchannel->outbound,
> +   page, send_pages);
> + if (err)
> + goto error_free_gpadl;
> +
> + err = hv_ringbuffer_post_init(&newchannel->inbound,
> +   &page[send_pages], recv_pages);
> + if (err)
> + goto error_free_gpadl;
> +
>   /* Create and init the channel open message */
>   open_info = kzalloc(sizeof(*open_info) +
>  sizeof(struct vmbus_channel_open_channel),
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 0778add21a9c..d78a04ad5490 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -172,6 +172,8 @@ extern int hv_synic_cleanup(unsigned int cpu);
>  /* Interface */
>  
>  void hv_ringbuffer_pre_init(struct vmbus_channel *channel);
> +int hv_ringbuffer_post_init(struct hv_ring_buffer_info *ring_info,
> + struct page *pages, u32 page_cnt);
>  
>  int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
>  struct page *pages, u32 pagecnt);
> diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
> index 35833d4d1a1d..c8b0f7b45158 100644
> --- a/drivers/hv/ring_buffer.c
> +++ b/drivers/hv/ring_buffer.c
> @@ -17,6 +17,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include "hyperv_vmbus.h"
>  
> @@ -188,6 +190,44 @@ void hv_ringbuffer_pre_init(struct vmbus_channel 
> *channel)
>   mutex_init(&channel->outbound.ring_buffer_mutex);
>  }
>  
> +int hv_ringbuffer_post_init(struct hv_ring_buffer_info *ring_info,
> +struct page *pages, u32 page_cnt)
> +{
> + struct vm_struct *area;
> + u64 physic_addr = page_to_pfn(pages) << PAGE_SHIFT;
> + unsigned long vaddr;
> + int err = 0;
> +
> + if (!hv_isolation_type_snp())
> + return 0;
> +
> + physic_addr += ms_hyperv.shared_gpa_boundary;
> + area = get_vm_area((2 * page_cnt - 1) * PAGE_SIZE, VM_IOREMAP);
> + if (!area || !area->addr)
> + return -EFAULT;
> +
> + vaddr = (unsigned long)area->addr;
> + err = ioremap_page_range(vaddr, vaddr + page_cnt * PAGE_SIZE,
> +physic_addr, PAGE_KERNEL_IO);
> + err |= ioremap_page_range(vaddr + page_cnt * PAGE_SIZE,
> +   vaddr + (2 * page_cnt - 1) * PAGE_SIZE,
> +   physic_addr + PAGE_SIZE, PAGE_KERNEL_IO);
> + if (err) {
> + vunmap((void *)vaddr);
> + return -EFAULT;
> + }
> +
> + /* Clean memory after setting host visibility. */
> + memset((void *)vaddr, 0x00, page_cnt * PAGE_SIZE);
> +
> + ring_info->ring_buffer = (struct hv_ring_buffer *)vaddr;
> + ring_info->ring_buffer->read_index = 0;
> + ring_info->ring_buffer->write_index = 0;
> + ring_info->ring_buffer->feature_bits.value = 1;
> +
> + return 0;
> +}
> +
>  /* Initialize the ring buffer. */
>  int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
>  struct page *pages, u32 page_cnt)
> @@ -197,33 +237,34 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info 
> *ring_info,
>  
>   BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
>  
> - /*
> -  * First page holds struct hv_ring_buffer, do wraparound mapping for
> -  * the rest.
> -  */
> - pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),
> -GFP_KERNEL);
> - if (!pages_wraparound)
> - return -ENOMEM;
> -
> - pages_wraparound[0] = pages;
> - for (i = 0; i < 2 * (page_cnt - 1); i++)
> - pages_wraparound[i + 1] = &pages[i % 
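
For context, the existing (non-isolated) path quoted above already builds this
wrap-around page array and maps it with vmap(). A sketch of the vmap()-based
alternative being suggested, modelled on hv_ringbuffer_init(), illustrative
only - it deliberately ignores the shared_gpa_boundary aliasing that the posted
patch uses ioremap_page_range() for:

/*
 * Sketch only: map the ring buffer with vmap() instead of
 * get_vm_area() + ioremap_page_range().  The first page holds
 * struct hv_ring_buffer; the data pages are mapped twice so the
 * ring contents wrap around virtually.
 */
static void *hv_ring_buffer_vmap(struct page *pages, u32 page_cnt)
{
	struct page **pages_wraparound;
	void *vaddr;
	int i;

	pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),
				   GFP_KERNEL);
	if (!pages_wraparound)
		return NULL;

	pages_wraparound[0] = pages;
	for (i = 0; i < 2 * (page_cnt - 1); i++)
		pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];

	vaddr = vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, PAGE_KERNEL);
	kfree(pages_wraparound);

	return vaddr;
}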

Re: [Resend RFC PATCH V2 06/12] HV/Vmbus: Add SNP support for VMbus channel initiate message

2021-04-15 Thread Konrad Rzeszutek Wilk
On Wed, Apr 14, 2021 at 10:49:39AM -0400, Tianyu Lan wrote:
> From: Tianyu Lan 
> 
> The physical address of monitor pages in the CHANNELMSG_INITIATE_CONTACT
> msg should be in the extra address space for SNP support and these

What is this 'extra address space'? Is that just normal virtual address
space of the Linux kernel?

> pages also should be accessed via the extra address space inside the Linux
> guest, so remap the extra address space with the ioremap function.

OK, why do you need to use ioremap on them? Why not use vmap for
example? What is it that makes ioremap the right candidate?





> 
> Signed-off-by: Tianyu Lan 
> ---
>  drivers/hv/connection.c   | 62 +++
>  drivers/hv/hyperv_vmbus.h |  1 +
>  2 files changed, 63 insertions(+)
> 
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 79bca653dce9..a0be9c11d737 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -101,6 +101,12 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo 
> *msginfo, u32 version)
>  
>   msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
>   msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
> +
> + if (hv_isolation_type_snp()) {
> + msg->monitor_page1 += ms_hyperv.shared_gpa_boundary;
> + msg->monitor_page2 += ms_hyperv.shared_gpa_boundary;
> + }
> +
>   msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
>  
>   /*
> @@ -145,6 +151,29 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo 
> *msginfo, u32 version)
>   return -ECONNREFUSED;
>   }
>  
> + if (hv_isolation_type_snp()) {
> + vmbus_connection.monitor_pages_va[0]
> + = vmbus_connection.monitor_pages[0];
> + vmbus_connection.monitor_pages[0]
> + = ioremap_cache(msg->monitor_page1, HV_HYP_PAGE_SIZE);
> + if (!vmbus_connection.monitor_pages[0])
> + return -ENOMEM;
> +
> + vmbus_connection.monitor_pages_va[1]
> + = vmbus_connection.monitor_pages[1];
> + vmbus_connection.monitor_pages[1]
> + = ioremap_cache(msg->monitor_page2, HV_HYP_PAGE_SIZE);
> + if (!vmbus_connection.monitor_pages[1]) {
> + vunmap(vmbus_connection.monitor_pages[0]);
> + return -ENOMEM;
> + }
> +
> + memset(vmbus_connection.monitor_pages[0], 0x00,
> +HV_HYP_PAGE_SIZE);
> + memset(vmbus_connection.monitor_pages[1], 0x00,
> +HV_HYP_PAGE_SIZE);
> + }
> +
>   return ret;
>  }
>  
> @@ -156,6 +185,7 @@ int vmbus_connect(void)
>   struct vmbus_channel_msginfo *msginfo = NULL;
>   int i, ret = 0;
>   __u32 version;
> + u64 pfn[2];
>  
>   /* Initialize the vmbus connection */
>   vmbus_connection.conn_state = CONNECTING;
> @@ -213,6 +243,16 @@ int vmbus_connect(void)
>   goto cleanup;
>   }
>  
> + if (hv_isolation_type_snp()) {
> + pfn[0] = virt_to_hvpfn(vmbus_connection.monitor_pages[0]);
> + pfn[1] = virt_to_hvpfn(vmbus_connection.monitor_pages[1]);
> + if (hv_mark_gpa_visibility(2, pfn,
> + VMBUS_PAGE_VISIBLE_READ_WRITE)) {
> + ret = -EFAULT;
> + goto cleanup;
> + }
> + }
> +
>   msginfo = kzalloc(sizeof(*msginfo) +
> sizeof(struct vmbus_channel_initiate_contact),
> GFP_KERNEL);
> @@ -279,6 +319,8 @@ int vmbus_connect(void)
>  
>  void vmbus_disconnect(void)
>  {
> + u64 pfn[2];
> +
>   /*
>* First send the unload request to the host.
>*/
> @@ -298,6 +340,26 @@ void vmbus_disconnect(void)
>   vmbus_connection.int_page = NULL;
>   }
>  
> + if (hv_isolation_type_snp()) {
> + if (vmbus_connection.monitor_pages_va[0]) {
> + vunmap(vmbus_connection.monitor_pages[0]);
> + vmbus_connection.monitor_pages[0]
> + = vmbus_connection.monitor_pages_va[0];
> + vmbus_connection.monitor_pages_va[0] = NULL;
> + }
> +
> + if (vmbus_connection.monitor_pages_va[1]) {
> + vunmap(vmbus_connection.monitor_pages[1]);
> + vmbus_connection.monitor_pages[1]
> + = vmbus_connection.monitor_pages_va[1];
> + vmbus_connection.monitor_pages_va[1] = NULL;
> + }
> +
> + pfn[0] = virt_to_hvpfn(vmbus_connection.monitor_pages[0]);
> + pfn[1] = virt_to_hvpfn(vmbus_connection.monitor_pages[1]);
> + hv_mark_gpa_visibility(2, pfn, VMBUS_PAGE_NOT_VISIBLE);
> + }
> +
>   hv_free_hyperv_page((unsigned 
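
For context on the "extra address space" question above, the aliasing the
series relies on can be summed up in a short hypothetical helper (names are
illustrative; only ms_hyperv.shared_gpa_boundary, ioremap_cache() and
HV_HYP_PAGE_SIZE come from the quoted patch):

/*
 * Reader's summary, not code from the series: in an SNP isolation VM the
 * same monitor page is reachable at two guest-physical addresses - the
 * private (encrypted) one, and a shared alias above shared_gpa_boundary.
 * The host is told about the shared alias, so the guest must access the
 * page through that alias as well, hence the ioremap_cache() above.
 */
static void __iomem *hv_map_monitor_page_shared(void *monitor_page)
{
	phys_addr_t shared_pa = virt_to_phys(monitor_page) +
				ms_hyperv.shared_gpa_boundary;

	return ioremap_cache(shared_pa, HV_HYP_PAGE_SIZE);
}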

Re: [Resend RFC PATCH V2 04/12] HV: Add Write/Read MSR registers via ghcb

2021-04-15 Thread Konrad Rzeszutek Wilk
On Wed, Apr 14, 2021 at 10:49:37AM -0400, Tianyu Lan wrote:
> From: Tianyu Lan 
> 
> Hyper-V provides a GHCB protocol to write Synthetic Interrupt
> Controller MSR registers, and these registers are emulated by
> the hypervisor rather than the paravisor.

What is paravisor? Is that the VMPL0 to borrow AMD speak from
https://www.amd.com/system/files/TechDocs/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf
 
> 
> Hyper-V requires writing SINTx MSR registers twice (once via
> GHCB and once via the wrmsr instruction, including the proxy bit 21).

Why? And what does 'proxy bit 21' mean? Isn't it just setting a bit
on the MSR?

Oh. Are you writing to it twice because you need to let the hypervisor
(Hyper-V) know, which is done via the WRMSR, and then the inner
hypervisor (VMPL0) via the SINTx MSR?


> Guest OS ID MSR also needs to be set via GHCB.
> 
> Signed-off-by: Tianyu Lan 
> ---
>  arch/x86/hyperv/hv_init.c   |  18 +
>  arch/x86/hyperv/ivm.c   | 130 
>  arch/x86/include/asm/mshyperv.h |  87 +
>  arch/x86/kernel/cpu/mshyperv.c  |   3 +
>  drivers/hv/hv.c |  65 +++-
>  include/asm-generic/mshyperv.h  |   4 +-
>  6 files changed, 261 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 90e65fbf4c58..87b1dd9c84d6 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -475,6 +475,9 @@ void __init hyperv_init(void)
>  
>   ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
>   *ghcb_base = ghcb_va;
> +
> + /* Hyper-V requires to write guest os id via ghcb in SNP IVM. */
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);
>   }
>  
>   rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> @@ -561,6 +564,7 @@ void hyperv_cleanup(void)
>  
>   /* Reset our OS id */
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
>  
>   /*
>* Reset hypercall page reference before reset the page,
> @@ -668,17 +672,3 @@ bool hv_is_hibernation_supported(void)
>   return !hv_root_partition && acpi_sleep_state_supported(ACPI_STATE_S4);
>  }
>  EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
> -
> -enum hv_isolation_type hv_get_isolation_type(void)
> -{
> - if (!(ms_hyperv.features_b & HV_ISOLATION))
> - return HV_ISOLATION_TYPE_NONE;
> - return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
> -}
> -EXPORT_SYMBOL_GPL(hv_get_isolation_type);
> -
> -bool hv_is_isolation_supported(void)
> -{
> - return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
> -}
> -EXPORT_SYMBOL_GPL(hv_is_isolation_supported);
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index a5950b7a9214..2ec64b367aaf 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -6,12 +6,139 @@
>   *  Tianyu Lan 
>   */
>  
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  
> +union hv_ghcb {
> + struct ghcb ghcb;
> +} __packed __aligned(PAGE_SIZE);
> +
> +void hv_ghcb_msr_write(u64 msr, u64 value)
> +{
> + union hv_ghcb *hv_ghcb;
> + void **ghcb_base;
> + unsigned long flags;
> +
> + if (!ms_hyperv.ghcb_base)
> + return;
> +
> + local_irq_save(flags);
> + ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
> + hv_ghcb = (union hv_ghcb *)*ghcb_base;
> + if (!hv_ghcb) {
> + local_irq_restore(flags);
> + return;
> + }
> +
> + memset(hv_ghcb, 0x00, HV_HYP_PAGE_SIZE);
> +
> + hv_ghcb->ghcb.protocol_version = 1;
> + hv_ghcb->ghcb.ghcb_usage = 0;
> +
> + ghcb_set_sw_exit_code(&hv_ghcb->ghcb, SVM_EXIT_MSR);
> + ghcb_set_rcx(&hv_ghcb->ghcb, msr);
> + ghcb_set_rax(&hv_ghcb->ghcb, lower_32_bits(value));
> + ghcb_set_rdx(&hv_ghcb->ghcb, value >> 32);
> + ghcb_set_sw_exit_info_1(&hv_ghcb->ghcb, 1);
> + ghcb_set_sw_exit_info_2(&hv_ghcb->ghcb, 0);
> +
> + VMGEXIT();
> +
> + if ((hv_ghcb->ghcb.save.sw_exit_info_1 & 0x) == 1)
> + pr_warn("Fail to write msr via ghcb.\n.");
> +
> + local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(hv_ghcb_msr_write);
> +
> +void hv_ghcb_msr_read(u64 msr, u64 *value)
> +{
> + union hv_ghcb *hv_ghcb;
> + void **ghcb_base;
> + unsigned long flags;
> +
> + if (!ms_hyperv.ghcb_base)
> + return;
> +
> + local_irq_save(flags);
> + ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
> + hv_ghcb = (union hv_ghcb *)*ghcb_base;
> + if (!hv_ghcb) {
> + local_irq_restore(flags);
> + return;
> + }
> +
> + memset(hv_ghcb, 0x00, PAGE_SIZE);
> + hv_ghcb->ghcb.protocol_version = 1;
> + hv_ghcb->ghcb.ghcb_usage = 0;
> +
> + ghcb_set_sw_exit_code(&hv_ghcb->ghcb, SVM_EXIT_MSR);
> + ghcb_set_rcx(&hv_ghcb->ghcb, 

Re: [PATCH v3] swiotlb: Make SWIOTLB_NO_FORCE perform no allocation

2021-04-09 Thread Konrad Rzeszutek Wilk
On Thu, Apr 08, 2021 at 08:13:07PM -0700, Florian Fainelli wrote:
> 
> 
> On 3/24/2021 1:42 AM, Christoph Hellwig wrote:
> > On Mon, Mar 22, 2021 at 06:53:49PM -0700, Florian Fainelli wrote:
> >> When SWIOTLB_NO_FORCE is used, there should really be no allocations of
> >> default_nslabs to occur since we are not going to use those slabs. If a
> >> platform was somehow setting swiotlb_no_force and a later call to
> >> swiotlb_init() was to be made we would still be proceeding with
> >> allocating the default SWIOTLB size (64MB), whereas if swiotlb=noforce
> >> was set on the kernel command line we would have only allocated 2KB.
> >>
> >> This would be inconsistent and the point of initializing default_nslabs
> >> to 1, was intended to allocate the minimum amount of memory possible, so
> >> simply remove that minimal allocation period.
> >>
> >> Signed-off-by: Florian Fainelli 
> > 
> > Looks good,
> > 
> > Reviewed-by: Christoph Hellwig 
> > 
> 
> Thanks! Konrad, can you apply this patch to your for-linus-5.13 branch
> if you are also happy with it?

It should be now visible?
> -- 
> Florian


Re: [PATCH] ARM: Qualify enabling of swiotlb_init()

2021-04-01 Thread Konrad Rzeszutek Wilk
On Tue, Mar 30, 2021 at 07:36:07AM +0200, Christoph Hellwig wrote:
> On Mon, Mar 29, 2021 at 12:30:42PM -0700, Florian Fainelli wrote:
> > Should I toss this in Russell's patch tracker or do you need me to make
> > some changes to the patch?
> 
> Due to all the other changes in this area I don't think anything but
> the swiotlb tree makes much sense here.

I've put them all on 

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git
devel/for-linus-5.13




Re: [PATCH] ARM: Qualify enabling of swiotlb_init()

2021-03-19 Thread Konrad Rzeszutek Wilk
On Fri, Mar 19, 2021 at 02:07:31PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 18, 2021 at 09:03:33PM -0700, Florian Fainelli wrote:
> >  #ifdef CONFIG_ARM_LPAE
> > +   if (swiotlb_force == SWIOTLB_FORCE ||
> > +   max_pfn > arm_dma_pfn_limit)
> 
> Does arm_dma_pfn_limit do the right thing even with the weirdest
> remapping ranges?  Maybe a comment here would be useful.
> 
> > +   swiotlb_init(1);
> > +   else
> > +   swiotlb_force = SWIOTLB_NO_FORCE;
> 
> Konrad: what do you think of setting swiotlb_force to SWIOTLB_NO_FORCE
> and only switching it to SWIOTLB_NORMAL when swiotlb_init* is called?
> That kind makes more sense than forcing the callers to do it.
> 
> While we're at it, I think swiotlb_force should probably be renamed to
> swiotlb_mode or somethng like that.

swiotlb_mode sounds good.

Also it got me thinking - ARM on Xen at some point was a bit strange, so not 
sure how
the logic works here, Stefano?
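
A minimal sketch of the swiotlb_mode idea being agreed on here (hypothetical
names and placement, not existing kernel code):

/*
 * Sketch only: default to "no pool" and flip to SWIOTLB_NORMAL when
 * swiotlb_init*() actually allocates one, instead of making every
 * caller set the flag itself.
 */
enum swiotlb_mode {
	SWIOTLB_NO_FORCE,	/* no pool allocated, bouncing disabled */
	SWIOTLB_NORMAL,		/* pool allocated, bounce when the DMA mask requires it */
	SWIOTLB_FORCE,		/* bounce every mapping through the pool */
};

static enum swiotlb_mode swiotlb_mode = SWIOTLB_NO_FORCE;

void __init swiotlb_init(int verbose)
{
	/* ... allocate the default pool as today ... */
	swiotlb_mode = SWIOTLB_NORMAL;
}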


Re: [PATCH] swiotlb: Make SWIOTLB_NO_FORCE perform no allocation

2021-03-18 Thread Konrad Rzeszutek Wilk
On Thu, Mar 18, 2021 at 09:00:54PM -0700, Florian Fainelli wrote:
> When SWIOTLB_NO_FORCE is used, there should really be no allocations of
> io_tlb_nslabs to occur since we are not going to use those slabs. If a
> platform was somehow setting swiotlb_no_force and a later call to
> swiotlb_init() was to be made we would still be proceeding with
> allocating the default SWIOTLB size (64MB), whereas if swiotlb=noforce
> was set on the kernel command line we would have only allocated 2KB.
> 
> This would be inconsistent and the point of initializing io_tlb_nslabs
> to 1, was to avoid hitting the test for io_tlb_nslabs being 0/not
> initialized.

Could you rebase this on devel/for-linus-5.13 in

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git

please?
> 
> Signed-off-by: Florian Fainelli 
> ---
>  kernel/dma/swiotlb.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c10e855a03bc..526c8321b76f 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -121,12 +121,10 @@ setup_io_tlb_npages(char *str)
>   }
>   if (*str == ',')
>   ++str;
> - if (!strcmp(str, "force")) {
> + if (!strcmp(str, "force"))
>   swiotlb_force = SWIOTLB_FORCE;
> - } else if (!strcmp(str, "noforce")) {
> + else if (!strcmp(str, "noforce"))
>   swiotlb_force = SWIOTLB_NO_FORCE;
> - io_tlb_nslabs = 1;
> - }
>  
>   return 0;
>  }
> @@ -284,6 +282,9 @@ swiotlb_init(int verbose)
>   unsigned char *vstart;
>   unsigned long bytes;
>  
> + if (swiotlb_force == SWIOTLB_NO_FORCE)
> + goto out;
> +
>   if (!io_tlb_nslabs) {
>   io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
>   io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
> @@ -302,6 +303,7 @@ swiotlb_init(int verbose)
>   io_tlb_start = 0;
>   }
>   pr_warn("Cannot allocate buffer");
> +out:
>   no_iotlb_memory = true;
>  }
>  
> -- 
> 2.25.1
> 


Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Konrad Rzeszutek Wilk
> > 
> > In fact I should have looked more closely at that myself - checking
> > debugfs on my 4GB arm64 board actually shows io_tlb_nslabs = 0, and
> > indeed we are bypassing initialisation completely and (ab)using
> > SWIOTLB_NO_FORCE to cover it up, so I guess it probably *is* safe now
> > for the noforce option to do the same for itself and save even that one
> > page.
> 
> OK, I can submit a patch that does that. 5.12-rc3 works correctly for me
> here as well and only allocates SWIOTLB when needed which in our case is
> either:
> 
> - we have DRAM at PA >= 4GB
> - we have limited peripherals (Raspberry Pi 4 derivative) that can only
> address the lower 1GB
> 
> Now let's see if we can get ARM 32-bit to match :)

Whatever patch you come up with, if it is against SWIOTLB please base it on top 
of
devel/for-linus-5.12 in 
https://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git/

Thx
> -- 
> Florian


Re: [PATCH] Documentation: arm64/acpi : clarify arm64 support of IBFT

2021-03-16 Thread Konrad Rzeszutek Wilk
On Tue, Mar 16, 2021 at 12:50:41PM -0600, Tom Saeger wrote:
> In commit 94bccc340710 ("iscsi_ibft: make ISCSI_IBFT depend on ACPI instead
> of ISCSI_IBFT_FIND") Kconfig was disentangled to make ISCSI_IBFT selection
> not depend on x86.
> 
> Update arm64 acpi documentation, changing IBFT support status from
> "Not Supported" to "Optional".
> Opportunistically re-flow paragraph for changed lines.
> 
> Link: 
> https://lore.kernel.org/lkml/1563475054-10680-1-git-send-email-thomas....@oracle.com/
> 
> Signed-off-by: Tom Saeger 

Reviewed-by: Konrad Rzeszutek Wilk 

Thank you!
> ---
>  Documentation/arm64/acpi_object_usage.rst | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/arm64/acpi_object_usage.rst 
> b/Documentation/arm64/acpi_object_usage.rst
> index 377e9d224db0..0609da73970b 100644
> --- a/Documentation/arm64/acpi_object_usage.rst
> +++ b/Documentation/arm64/acpi_object_usage.rst
> @@ -17,12 +17,12 @@ For ACPI on arm64, tables also fall into the following 
> categories:
>  
> -  Recommended: BERT, EINJ, ERST, HEST, PCCT, SSDT
>  
> -   -  Optional: BGRT, CPEP, CSRT, DBG2, DRTM, ECDT, FACS, FPDT, IORT,
> -  MCHI, MPST, MSCT, NFIT, PMTT, RASF, SBST, SLIT, SPMI, SRAT, STAO,
> -   TCPA, TPM2, UEFI, XENV
> +   -  Optional: BGRT, CPEP, CSRT, DBG2, DRTM, ECDT, FACS, FPDT, IBFT,
> +  IORT, MCHI, MPST, MSCT, NFIT, PMTT, RASF, SBST, SLIT, SPMI, SRAT,
> +  STAO, TCPA, TPM2, UEFI, XENV
>  
> -   -  Not supported: BOOT, DBGP, DMAR, ETDT, HPET, IBFT, IVRS, LPIT,
> -  MSDM, OEMx, PSDT, RSDT, SLIC, WAET, WDAT, WDRT, WPBT
> +   -  Not supported: BOOT, DBGP, DMAR, ETDT, HPET, IVRS, LPIT, MSDM, 
> OEMx,
> +  PSDT, RSDT, SLIC, WAET, WDAT, WDRT, WPBT
>  
>  == 
> 
>  Table  Usage for ARMv8 Linux
> -- 
> 2.31.0
> 


Re: [PATCH v2] cxl: Make loop variable be 'i' instead of 'j'

2021-03-03 Thread Konrad Rzeszutek Wilk
..snip..
> > cxl_for_each_cmd(cmd) {
> > const struct cxl_command_info *info = &cmd->info;
> > +   int i = 0;
> > 
> > -   if (copy_to_user(&q->commands[j++], info, sizeof(*info)))
> > +   if (copy_to_user(&q->commands[i++], info, sizeof(*info)))
> > return -EFAULT;
> > 
> > -   if (j == n_commands)
> > +   if (i == n_commands)
> > break;
> 
> 
> Did you test this?
> Looks badly broken to me.

I sent out the v3 which had that fixed. See
https://lore.kernel.org/linux-cxl/20210226222152.48467-1-konrad.w...@oracle.com/T/#u
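
For anyone wondering why the v2 hunk quoted above was broken: declaring the
counter inside the loop body resets it to zero on every iteration, so every
command lands in commands[0] and the early-exit test never fires. A standalone
demonstration (plain C, hypothetical values, not kernel code):

#include <stdio.h>

int main(void)
{
	int out[4] = { 0 };
	int n_commands = 2;

	for (int cmd = 0; cmd < 4; cmd++) {
		int i = 0;			/* v2: reset on every iteration */

		out[i++] = cmd;			/* always overwrites out[0] */
		if (i == n_commands)		/* never true for n_commands > 1 */
			break;
	}

	printf("%d %d\n", out[0], out[1]);	/* prints "3 0", not "0 1" */
	return 0;
}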


Re: [PATCH] swiotlb: Fix type of max_slot

2021-03-02 Thread Konrad Rzeszutek Wilk

On 3/2/21 12:21 PM, Kunihiko Hayashi wrote:

After the refactoring, the type of max_slots has changed from unsigned
long to unsigned int. The return type of the function get_max_slots() and
the 4th argument type of iommu_is_span_boundary() are now different from the
type of max_slots, which finally triggers the BUG_ON() in iommu_is_span_boundary().

Cc: Christoph Hellwig 
Fixes: 567d877f9a7d ("swiotlb: refactor swiotlb_tbl_map_single")
Signed-off-by: Kunihiko Hayashi 


I think this is all good. Looking at Linus's master I see:


537 unsigned long max_slots = get_max_slots(boundary_mask);

?


---
  kernel/dma/swiotlb.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 369e4c3..c10e855 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -534,7 +534,7 @@ static int find_slots(struct device *dev, phys_addr_t 
orig_addr,
unsigned long boundary_mask = dma_get_seg_boundary(dev);
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, io_tlb_start) & boundary_mask;
-   unsigned int max_slots = get_max_slots(boundary_mask);
+   unsigned long max_slots = get_max_slots(boundary_mask);
unsigned int iotlb_align_mask =
dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
unsigned int nslots = nr_slots(alloc_size), stride;

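For readers of the thread, a standalone illustration of the truncation being
fixed (plain C; assumes a 64-bit build and IO_TLB_SHIFT == 11 as in
kernel/dma/swiotlb.c, and the helper below only mirrors the
boundary_mask == ~0UL case of the kernel's get_max_slots()):

#include <stdio.h>

#define IO_TLB_SHIFT	11
#define BITS_PER_LONG	64

static unsigned long get_max_slots(unsigned long boundary_mask)
{
	if (boundary_mask == ~0UL)
		return 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
	return (boundary_mask >> IO_TLB_SHIFT) + 1;
}

int main(void)
{
	unsigned long ok = get_max_slots(~0UL);		/* 1UL << 53 */
	unsigned int truncated = get_max_slots(~0UL);	/* becomes 0 */

	/*
	 * A zero max_slots is later passed to iommu_is_span_boundary() as
	 * the boundary size, tripping its is_power_of_2() BUG_ON - the
	 * oops at iommu-helper.h:30 quoted in the report.
	 */
	printf("%lu vs %u\n", ok, truncated);
	return 0;
}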




[PATCH v3] cxl: Make loop variable be 'i' instead of 'j'

2021-02-26 Thread Konrad Rzeszutek Wilk
.. otherwise people spend extra cycles looking for the
inner loop and wondering 'why j'?

This was an oversight when initial work was rebased
so let's fix it here.

Signed-off-by: Konrad Rzeszutek Wilk 
---
v1: Initial posting
v2: Fix per Dan's request
v3: Duh, don't initialize i in the loop, but do it outside of it.
---
 drivers/cxl/mem.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 244cb7d89678..e7246e585e62 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -698,7 +698,7 @@ static int cxl_query_cmd(struct cxl_memdev *cxlmd,
struct device *dev = &cxlmd->dev;
struct cxl_mem_command *cmd;
u32 n_commands;
-   int j = 0;
+   int i;
 
dev_dbg(dev, "Query IOCTL\n");
 
@@ -713,13 +713,14 @@ static int cxl_query_cmd(struct cxl_memdev *cxlmd,
 * otherwise, return max(n_commands, total commands) cxl_command_info
 * structures.
 */
+   i = 0;
cxl_for_each_cmd(cmd) {
const struct cxl_command_info *info = &cmd->info;
 
-   if (copy_to_user(&q->commands[j++], info, sizeof(*info)))
+   if (copy_to_user(&q->commands[i++], info, sizeof(*info)))
return -EFAULT;
 
-   if (j == n_commands)
+   if (i == n_commands)
break;
}
 
-- 
2.29.2



Re: [PATCH] swiotlb: swiotlb_tbl_map_single() kernel BUG in iommu-helper.h:30

2021-02-26 Thread Konrad Rzeszutek Wilk
On Fri, Feb 26, 2021 at 01:18:14PM -0800, Brad Larson wrote:
> On Fri, Feb 26, 2021 at 12:43:07PM -0800, Brad Larson wrote:
> > Kernel Oops introduced in next-20210222 due to get_max_slots return arg
> size.
> > In the function find_slots() variable max_slots is zero when
> boundary_mask is
> > 0x.
> 
> I am looking at the stable/for-linus-5.12 and what I sent out for
> a GIT PULL and I believe this is already squashed in:
> 
> 531 static int find_slots(struct device *dev, phys_addr_t orig_addr,  
>  
> 
> 532 size_t alloc_size)
>  
> 
> 533 { 
>  
> 534 unsigned long boundary_mask = dma_get_seg_boundary(dev);  
>  
> 
> 535 dma_addr_t tbl_dma_addr = 
>  
> 536 phys_to_dma_unencrypted(dev, io_tlb_start) &
> boundary_mask; 
> 537 unsigned long max_slots = get_max_slots(boundary_mask);
> 
> Could you double-check please?
> 
>  
> Yes this is squashed in the snippet you are showing.  The bug was introduced 
> in
> next-20210222 and is still there when I updated to next-20210226 today. 

Duh! It should be fixed now for the next one. Thank you!


[PATCH v2] cxl: Make loop variable be 'i' instead of 'j'

2021-02-26 Thread Konrad Rzeszutek Wilk
.. otherwise people spend extra cycles looking for the
inner loop and wondering 'why j'?

This was an oversight when initial work was rebased
so let's fix it here.

Signed-off-by: Konrad Rzeszutek Wilk 
---
v1: Initial posting
v2: Fix per Dan's request
---
 drivers/cxl/mem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 244cb7d89678..d43197a193ce 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -698,7 +698,6 @@ static int cxl_query_cmd(struct cxl_memdev *cxlmd,
struct device *dev = &cxlmd->dev;
struct cxl_mem_command *cmd;
u32 n_commands;
-   int j = 0;
 
dev_dbg(dev, "Query IOCTL\n");
 
@@ -715,11 +714,12 @@ static int cxl_query_cmd(struct cxl_memdev *cxlmd,
 */
cxl_for_each_cmd(cmd) {
const struct cxl_command_info *info = &cmd->info;
+   int i = 0;
 
-   if (copy_to_user(&q->commands[j++], info, sizeof(*info)))
+   if (copy_to_user(&q->commands[i++], info, sizeof(*info)))
return -EFAULT;
 
-   if (j == n_commands)
+   if (i == n_commands)
break;
}
 
-- 
2.13.6



Re: [PATCH] swiotlb: swiotlb_tbl_map_single() kernel BUG in iommu-helper.h:30

2021-02-26 Thread Konrad Rzeszutek Wilk
On Fri, Feb 26, 2021 at 12:43:07PM -0800, Brad Larson wrote:
> Kernel Oops introduced in next-20210222 due to get_max_slots return arg size.
> In the function find_slots() variable max_slots is zero when boundary_mask is
> 0x.

I am looking at the stable/for-linus-5.12 and what I sent out for
a GIT PULL and I believe this is already squashed in:

531 static int find_slots(struct device *dev, phys_addr_t orig_addr,

532 size_t alloc_size)  

533 {   

534 unsigned long boundary_mask = dma_get_seg_boundary(dev);

535 dma_addr_t tbl_dma_addr =   

536 phys_to_dma_unencrypted(dev, io_tlb_start) & boundary_mask; 

537 unsigned long max_slots = get_max_slots(boundary_mask);

Could you double-check please?

> 
> [0.242119] kernel BUG at ./include/linux/iommu-helper.h:30!
> [0.247793] Internal error: Oops - BUG: 0 [#1] SMP
> [0.252595] Modules linked in:
> [0.255657] CPU: 0 PID: 93 Comm: kworker/0:1 Not tainted 
> 5.11.0-next-20210224+ #25
> [0.263245] Hardware name: Elba ASIC Board (DT)
> [0.267784] Workqueue: events_freezable mmc_rescan
> [0.272592] pstate: 6085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
> [0.278612] pc : swiotlb_tbl_map_single+0x2b0/0x6a0
> [0.283505] lr : swiotlb_tbl_map_single+0x440/0x6a0
> [0.288395] sp : ffc0122736b0
> [0.291713] x29: ffc0122736b0 x28: ffc010e3
> [0.297039] x27: bbf58000 x26: 
> [0.302364] x25:  x24: 0001
> [0.307689] x23:  x22: 
> [0.313013] x21:  x20: 
> [0.318338] x19: 001241fd4600 x18: ffc010d288c8
> [0.323662] x17: 0007 x16: 0001
> [0.328987] x15: ffc092273367 x14: 3a424c54204f4920
> [0.334311] x13: 6572617774666f73 x12: 20726e2030207865
> [0.339636] x11: 646e692078787820 x10: 3062653737317830
> [0.344960] x9 : 2074666968732031 x8 : ff977cf82368
> [0.350285] x7 : 0001 x6 : c000efff
> [0.355609] x5 : 00017fe8 x4 : 
> [0.360934] x3 :  x2 : 18b0d50da009d000
> [0.366258] x1 :  x0 : 0042
> [0.371583] Call trace:
> [0.374032]  swiotlb_tbl_map_single+0x2b0/0x6a0
> [0.378573]  swiotlb_map+0xa8/0x2b0
> 
> Signed-off-by: Brad Larson 
> ---
>  kernel/dma/swiotlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 369e4c3a0f2b..c10e855a03bc 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -534,7 +534,7 @@ static int find_slots(struct device *dev, phys_addr_t 
> orig_addr,
>   unsigned long boundary_mask = dma_get_seg_boundary(dev);
>   dma_addr_t tbl_dma_addr =
>   phys_to_dma_unencrypted(dev, io_tlb_start) & boundary_mask;
> - unsigned int max_slots = get_max_slots(boundary_mask);
> + unsigned long max_slots = get_max_slots(boundary_mask);
>   unsigned int iotlb_align_mask =
>   dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
>   unsigned int nslots = nr_slots(alloc_size), stride;
> -- 
> 2.17.1
> 


[PATCH] cxl: Make loop variable be 'i' instead of 'j'

2021-02-26 Thread Konrad Rzeszutek Wilk
.. otherwise people spend extra cycles looking for the
inner loop and wondering 'why j'?

This was an oversight when the initial work was rebased,
so let's fix it here.

Fixes: 583fa5e71cae ("cxl/mem: Add basic IOCTL interface")
Signed-off-by: Konrad Rzeszutek Wilk 
---
 drivers/cxl/mem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 244cb7d89678..2b8265b03b0d 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -698,7 +698,7 @@ static int cxl_query_cmd(struct cxl_memdev *cxlmd,
struct device *dev = &cxlmd->dev;
struct cxl_mem_command *cmd;
u32 n_commands;
-   int j = 0;
+   int i = 0;
 
dev_dbg(dev, "Query IOCTL\n");
 
@@ -716,10 +716,10 @@ static int cxl_query_cmd(struct cxl_memdev *cxlmd,
cxl_for_each_cmd(cmd) {
const struct cxl_command_info *info = &cmd->info;
 
-   if (copy_to_user(&q->commands[j++], info, sizeof(*info)))
+   if (copy_to_user(&q->commands[i++], info, sizeof(*info)))
return -EFAULT;
 
-   if (j == n_commands)
+   if (i == n_commands)
break;
}
 
-- 
2.13.6



[PATCH v1] Little fix that missed the merge window.

2021-02-26 Thread Konrad Rzeszutek Wilk
When the last set of patches was posted for review right before the
merge window, this fix was deferred.

Please pick it up at your convenience.

 drivers/cxl/mem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Konrad Rzeszutek Wilk (1):
  cxl: Make loop variable be 'i' instead of 'j'




[GIT PULL] (swiotlb) stable/for-linus-5.12

2021-02-26 Thread Konrad Rzeszutek Wilk
Hey Linus,

Please git pull the following branch:

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git 
stable/for-linus-5.12

which has two memory encryption related patches (SWIOTLB is enabled by
default for AMD-SEV):

 1) Adding support for alignment so that NVME can properly work,
 2) Keeping track of the requested DMA buffer's length, as underlying hardware devices
can trip SWIOTLB into bouncing too much and crashing the kernel.

And a tiny fix to use proper APIs in drivers.

Please note the tree will look a bit fresh - that is due to me squashing
in a fix earlier this week, which caused a conflict later; my fixing up
the conflict caused the authorship to shift to me, which I have just corrected.

Argh!


 drivers/mmc/host/sdhci.c|   9 +-
 drivers/nvme/host/pci.c |   1 +
 include/linux/device.h  |   1 +
 include/linux/dma-mapping.h |  16 +++
 include/linux/swiotlb.h |   1 +
 kernel/dma/swiotlb.c| 310 +++-
 6 files changed, 215 insertions(+), 123 deletions(-)


Christoph Hellwig (8):
  sdhci: stop poking into swiotlb internals
  swiotlb: add a IO_TLB_SIZE define
  swiotlb: factor out an io_tlb_offset helper
  swiotlb: factor out a nr_slots helper
  swiotlb: clean up swiotlb_tbl_unmap_single
  swiotlb: refactor swiotlb_tbl_map_single
  swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single
  swiotlb: respect min_align_mask

Jianxiong Gao (2):
  driver core: add a min_align_mask field to struct device_dma_parameters
  nvme-pci: set min_align_mask

Martin Radev (1):
  swiotlb: Validate bounce size in the sync/unmap path

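From a driver's point of view the min_align_mask addition boils down to one
call in probe; a hedged sketch modelled on the nvme-pci change in this pull
(hypothetical driver and names, illustrative only):

static int mydrv_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/*
	 * Ask the DMA layer (and therefore swiotlb) to preserve the low
	 * 12 address bits when bouncing, so offset-within-page rules such
	 * as NVMe's PRP constraints still hold after the copy.
	 */
	dma_set_min_align_mask(&pdev->dev, SZ_4K - 1);

	return 0;
}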


Re: [PATCH] mm, kasan: don't poison boot memory

2021-02-22 Thread Konrad Rzeszutek Wilk
On Mon, Feb 22, 2021 at 05:39:29PM +0100, David Hildenbrand wrote:
> On 22.02.21 17:13, David Hildenbrand wrote:
> > On 22.02.21 16:13, George Kennedy wrote:
> > > 
> > > 
> > > On 2/22/2021 4:52 AM, David Hildenbrand wrote:
> > > > On 20.02.21 00:04, George Kennedy wrote:
> > > > > 
> > > > > 
> > > > > On 2/19/2021 11:45 AM, George Kennedy wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2/18/2021 7:09 PM, Andrey Konovalov wrote:
> > > > > > > On Fri, Feb 19, 2021 at 1:06 AM George Kennedy
> > > > > > >  wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On 2/18/2021 3:55 AM, David Hildenbrand wrote:
> > > > > > > > > On 17.02.21 21:56, Andrey Konovalov wrote:
> > > > > > > > > > During boot, all non-reserved memblock memory is exposed to 
> > > > > > > > > > the
> > > > > > > > > > buddy
> > > > > > > > > > allocator. Poisoning all that memory with KASAN lengthens 
> > > > > > > > > > boot
> > > > > > > > > > time,
> > > > > > > > > > especially on systems with large amount of RAM. This patch 
> > > > > > > > > > makes
> > > > > > > > > > page_alloc to not call kasan_free_pages() on all new memory.
> > > > > > > > > > 
> > > > > > > > > > __free_pages_core() is used when exposing fresh memory 
> > > > > > > > > > during
> > > > > > > > > > system
> > > > > > > > > > boot and when onlining memory during hotplug. This patch 
> > > > > > > > > > adds a new
> > > > > > > > > > FPI_SKIP_KASAN_POISON flag and passes it to 
> > > > > > > > > > __free_pages_ok()
> > > > > > > > > > through
> > > > > > > > > > free_pages_prepare() from __free_pages_core().
> > > > > > > > > > 
> > > > > > > > > > This has little impact on KASAN memory tracking.
> > > > > > > > > > 
> > > > > > > > > > Assuming that there are no references to newly exposed pages
> > > > > > > > > > before they
> > > > > > > > > > are ever allocated, there won't be any intended (but buggy)
> > > > > > > > > > accesses to
> > > > > > > > > > that memory that KASAN would normally detect.
> > > > > > > > > > 
> > > > > > > > > > However, with this patch, KASAN stops detecting wild and 
> > > > > > > > > > large
> > > > > > > > > > out-of-bounds accesses that happen to land on a fresh 
> > > > > > > > > > memory page
> > > > > > > > > > that
> > > > > > > > > > was never allocated. This is taken as an acceptable 
> > > > > > > > > > trade-off.
> > > > > > > > > > 
> > > > > > > > > > All memory allocated normally when the boot is over keeps 
> > > > > > > > > > getting
> > > > > > > > > > poisoned as usual.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Andrey Konovalov 
> > > > > > > > > > Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d
> > > > > > > > > Not sure this is the right thing to do, see
> > > > > > > > > 
> > > > > > > > > https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529...@oracle.com
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Reversing the order in which memory gets allocated + used 
> > > > > > > > > during
> > > > > > > > > boot
> > > > > > > > > (in a patch by me) might have revealed an invalid memory 
> > > > > > > > > access
> > > > > > > > > during
> > > > > > > > > boot.
> > > > > > > > > 
> > > > > > > > > I suspect that that issue would no longer get detected with 
> > > > > > > > > your
> > > > > > > > > patch, as the invalid memory access would simply not get 
> > > > > > > > > detected.
> > > > > > > > > Now, I cannot prove that :)
> > > > > > > > Since David's patch we're having trouble with the iBFT ACPI 
> > > > > > > > table,
> > > > > > > > which
> > > > > > > > is mapped in via kmap() - see acpi_map() in 
> > > > > > > > "drivers/acpi/osl.c".
> > > > > > > > KASAN
> > > > > > > > detects that it is being used after free when ibft_init() 
> > > > > > > > accesses
> > > > > > > > the
> > > > > > > > iBFT table, but as of yet we can't find where it get's freed 
> > > > > > > > (we've
> > > > > > > > instrumented calls to kunmap()).
> > > > > > > Maybe it doesn't get freed, but what you see is a wild or a large
> > > > > > > out-of-bounds access. Since KASAN marks all memory as freed 
> > > > > > > during the
> > > > > > > memblock->page_alloc transition, such bugs can manifest as
> > > > > > > use-after-frees.
> > > > > > 
> > > > > > It gets freed and re-used. By the time the iBFT table is accessed by
> > > > > > ibft_init() the page has been over-written.
> > > > > > 
> > > > > > Setting page flags like the following before the call to kmap()
> > > > > > prevents the iBFT table page from being freed:
> > > > > 
> > > > > Cleaned up version:
> > > > > 
> > > > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > > > > index 0418feb..8f0a8e7 100644
> > > > > --- a/drivers/acpi/osl.c
> > > > > +++ b/drivers/acpi/osl.c
> > > > > @@ -287,9 +287,12 @@ static void __iomem 
> > > > > *acpi_map(acpi_physical_address
> > > > > pg_off, unsigned long pg_sz)
> > > > > 
> > > > >      pfn = pg_off >> PAGE_SHIFT;
> > > > >      if (should_use_kmap(pfn)) {
> > > 

Re: [PATCH v5 4/9] cxl/mem: Add basic IOCTL interface

2021-02-19 Thread Konrad Rzeszutek Wilk
..snip..
> +static int handle_mailbox_cmd_from_user(struct cxl_mem *cxlm,
> + const struct cxl_mem_command *cmd,
> + u64 in_payload, u64 out_payload,
> + s32 *size_out, u32 *retval)
> +{
> + struct device *dev = &cxlm->pdev->dev;
> + struct mbox_cmd mbox_cmd = {
> + .opcode = cmd->opcode,
> + .size_in = cmd->info.size_in,
> + .size_out = cmd->info.size_out,
> + };
> + int rc;
> +
> + if (cmd->info.size_out) {
> + mbox_cmd.payload_out = kvzalloc(cmd->info.size_out, GFP_KERNEL);
> + if (!mbox_cmd.payload_out)
> + return -ENOMEM;
> + }
> +
> + if (cmd->info.size_in) {
> + mbox_cmd.payload_in = vmemdup_user(u64_to_user_ptr(in_payload),
> +cmd->info.size_in);
> + if (IS_ERR(mbox_cmd.payload_in))
> + return PTR_ERR(mbox_cmd.payload_in);

Not that this should happen, but what if info.size_out was set? Should
you also free mbox_cmd.payload_out?
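
A minimal sketch of the fix being suggested (untested, just for illustration -
free the already-allocated output buffer when copying the input fails):

	if (cmd->info.size_in) {
		mbox_cmd.payload_in = vmemdup_user(u64_to_user_ptr(in_payload),
						   cmd->info.size_in);
		if (IS_ERR(mbox_cmd.payload_in)) {
			kvfree(mbox_cmd.payload_out);
			return PTR_ERR(mbox_cmd.payload_in);
		}
	}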

> + }
> +
> + rc = cxl_mem_mbox_get(cxlm);
> + if (rc)
> + goto out;
> +
> + dev_dbg(dev,
> + "Submitting %s command for user\n"
> + "\topcode: %x\n"
> + "\tsize: %ub\n",
> + cxl_command_names[cmd->info.id].name, mbox_cmd.opcode,
> + cmd->info.size_in);
> +
> + rc = __cxl_mem_mbox_send_cmd(cxlm, &mbox_cmd);
> + cxl_mem_mbox_put(cxlm);
> + if (rc)
> + goto out;
> +
> + /*
> +  * @size_out contains the max size that's allowed to be written back out
> +  * to userspace. While the payload may have written more output than
> +  * this it will have to be ignored.
> +  */
> + if (mbox_cmd.size_out) {
> + dev_WARN_ONCE(dev, mbox_cmd.size_out > *size_out,
> +   "Invalid return size\n");
> + if (copy_to_user(u64_to_user_ptr(out_payload),
> +  mbox_cmd.payload_out, mbox_cmd.size_out)) {
> + rc = -EFAULT;
> + goto out;
> + }
> + }
> +
> + *size_out = mbox_cmd.size_out;
> + *retval = mbox_cmd.return_code;
> +
> +out:
> + kvfree(mbox_cmd.payload_in);
> + kvfree(mbox_cmd.payload_out);
> + return rc;
> +}

..snip..

> +static int cxl_query_cmd(struct cxl_memdev *cxlmd,
> +  struct cxl_mem_query_commands __user *q)
> +{
> + struct device *dev = &cxlmd->dev;
> + struct cxl_mem_command *cmd;
> + u32 n_commands;
> + int j = 0;

How come it is 'j' instead of the usual 'i'?
> +
> + dev_dbg(dev, "Query IOCTL\n");
> +
> + if (get_user(n_commands, &q->n_commands))
> + return -EFAULT;
> +
> + /* returns the total number if 0 elements are requested. */
> + if (n_commands == 0)
> + return put_user(cxl_cmd_count, &q->n_commands);
> +
> + /*
> +  * otherwise, return max(n_commands, total commands) cxl_command_info
> +  * structures.
> +  */
> + cxl_for_each_cmd(cmd) {
> + const struct cxl_command_info *info = &cmd->info;
> +
> + if (copy_to_user(&q->commands[j++], info, sizeof(*info)))
> + return -EFAULT;
> +
> + if (j == n_commands)
> + break;
> + }
> +
> + return 0;
> +}
> +


Re: [PATCH v5 7/9] cxl/mem: Add set of informational commands

2021-02-19 Thread Konrad Rzeszutek Wilk
On Tue, Feb 16, 2021 at 08:09:56PM -0800, Ben Widawsky wrote:
> Add initial set of formal commands beyond basic identify and command
> enumeration.
> 
> Signed-off-by: Ben Widawsky 
> Reviewed-by: Dan Williams 
Reviewed-by: Konrad Rzeszutek Wilk 

> Reviewed-by: Jonathan Cameron  (v2)
> ---
>  drivers/cxl/mem.c| 9 +
>  include/uapi/linux/cxl_mem.h | 5 +
>  2 files changed, 14 insertions(+)
> 
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index e31b3045e231..6d7d3870b5da 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -45,12 +45,16 @@
>  enum opcode {
>   CXL_MBOX_OP_INVALID = 0x,
>   CXL_MBOX_OP_RAW = CXL_MBOX_OP_INVALID,
> + CXL_MBOX_OP_GET_FW_INFO = 0x0200,
>   CXL_MBOX_OP_ACTIVATE_FW = 0x0202,
>   CXL_MBOX_OP_GET_SUPPORTED_LOGS  = 0x0400,
>   CXL_MBOX_OP_GET_LOG = 0x0401,
>   CXL_MBOX_OP_IDENTIFY= 0x4000,
> + CXL_MBOX_OP_GET_PARTITION_INFO  = 0x4100,
>   CXL_MBOX_OP_SET_PARTITION_INFO  = 0x4101,
> + CXL_MBOX_OP_GET_LSA = 0x4102,
>   CXL_MBOX_OP_SET_LSA = 0x4103,
> + CXL_MBOX_OP_GET_HEALTH_INFO = 0x4200,
>   CXL_MBOX_OP_SET_SHUTDOWN_STATE  = 0x4204,
>   CXL_MBOX_OP_SCAN_MEDIA  = 0x4304,
>   CXL_MBOX_OP_GET_SCAN_MEDIA  = 0x4305,
> @@ -171,6 +175,11 @@ static struct cxl_mem_command mem_commands[] = {
>   CXL_CMD(RAW, ~0, ~0, 0),
>  #endif
>   CXL_CMD(GET_SUPPORTED_LOGS, 0, ~0, CXL_CMD_FLAG_FORCE_ENABLE),
> + CXL_CMD(GET_FW_INFO, 0, 0x50, 0),
> + CXL_CMD(GET_PARTITION_INFO, 0, 0x20, 0),
> + CXL_CMD(GET_LSA, 0x8, ~0, 0),
> + CXL_CMD(GET_HEALTH_INFO, 0, 0x12, 0),
> + CXL_CMD(GET_LOG, 0x18, ~0, CXL_CMD_FLAG_FORCE_ENABLE),
>  };
>  
>  /*
> diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
> index 59227f82a4c1..3155382dfc9b 100644
> --- a/include/uapi/linux/cxl_mem.h
> +++ b/include/uapi/linux/cxl_mem.h
> @@ -24,6 +24,11 @@
>   ___C(IDENTIFY, "Identify Command"),   \
>   ___C(RAW, "Raw device command"),  \
>   ___C(GET_SUPPORTED_LOGS, "Get Supported Logs"),   \
> + ___C(GET_FW_INFO, "Get FW Info"), \
> + ___C(GET_PARTITION_INFO, "Get Partition Information"),\
> + ___C(GET_LSA, "Get Label Storage Area"),  \
> + ___C(GET_HEALTH_INFO, "Get Health Info"), \
> + ___C(GET_LOG, "Get Log"), \
>   ___C(MAX, "invalid / last command")
>  
>  #define ___C(a, b) CXL_MEM_COMMAND_ID_##a
> -- 
> 2.30.1
> 


Re: [PATCH v5 6/9] cxl/mem: Enable commands via CEL

2021-02-19 Thread Konrad Rzeszutek Wilk
> +static inline struct cxl_mem_command *cxl_mem_find_command(u16 opcode)
> +{
> + struct cxl_mem_command *c;
> +
> + cxl_for_each_cmd(c)

Would you be amenable to adding {
> + if (c->opcode == opcode)
> + return c;
> +
and } here

(and in the code below as well where cxl_for_each_cmd is used?)
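
Something like this, presumably (a sketch of the braced form being asked for,
assuming the function falls through to returning NULL as in the rest of the
patch):

static inline struct cxl_mem_command *cxl_mem_find_command(u16 opcode)
{
	struct cxl_mem_command *c;

	cxl_for_each_cmd(c) {
		if (c->opcode == opcode)
			return c;
	}

	return NULL;
}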

Regardless of that:

Reviewed-by: Konrad Rzeszutek Wilk 



Re: [PATCH v5 5/9] cxl/mem: Add a "RAW" send command

2021-02-19 Thread Konrad Rzeszutek Wilk
On Tue, Feb 16, 2021 at 08:09:54PM -0800, Ben Widawsky wrote:
> The CXL memory device send interface will have a number of supported
> commands. The raw command is not such a command. Raw commands allow
> userspace to send a specified opcode to the underlying hardware and
> bypass all driver checks on the command. The primary use for this
> command is to [begrudgingly] allow undocumented vendor specific hardware
> commands.
> 
> While not the main motivation, it also allows prototyping new hardware
> commands without a driver patch and rebuild.
> 
> While this all sounds very powerful it comes with a couple of caveats:
> 1. Bug reports using raw commands will not get the same level of
>attention as bug reports using supported commands (via taint).
> 2. Supported commands will be rejected by the RAW command.
> 
> With this comes new debugfs knob to allow full access to your toes with
> your weapon of choice.
> 
> Cc: Ariel Sibley 
> Signed-off-by: Ben Widawsky 
> Reviewed-by: Dan Williams  (v2)
> Reviewed-by: Jonathan Cameron 
Reviewed-by: Konrad Rzeszutek Wilk 


Re: [PATCH v5 1/9] cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints

2021-02-19 Thread Konrad Rzeszutek Wilk
On Tue, Feb 16, 2021 at 08:09:50PM -0800, Ben Widawsky wrote:
> From: Dan Williams 
> 
> The CXL.mem protocol allows a device to act as a provider of "System
> RAM" and/or "Persistent Memory" that is fully coherent as if the memory
> was attached to the typical CPU memory controller.
> 
> With the CXL-2.0 specification a PCI endpoint can implement a "Type-3"
> device interface and give the operating system control over "Host
> Managed Device Memory". See section 2.3 Type 3 CXL Device.
> 
> The memory range exported by the device may optionally be described by
> the platform firmware memory map, or by infrastructure like LIBNVDIMM to
> provision persistent memory capacity from one, or more, CXL.mem devices.
> 
> A pre-requisite for Linux-managed memory-capacity provisioning is this
> cxl_mem driver that can speak the mailbox protocol defined in section
> 8.2.8.4 Mailbox Registers.
> 
> For now just land the initial driver boiler-plate and Documentation/
> infrastructure.
> 
> Link: https://www.computeexpresslink.org/download-the-specification
> Cc: Jonathan Corbet 
> Signed-off-by: Dan Williams 
> Signed-off-by: Ben Widawsky 
> Acked-by: David Rientjes  (v1)
Reviewed-by: Konrad Rzeszutek Wilk 

Albeit you may want to modify 2020 to 2021 in the Copyright sections.


Re: [PATCH RFC v1 5/6] xen-swiotlb: convert variables to arrays

2021-02-19 Thread Konrad Rzeszutek Wilk
On Sun, Feb 07, 2021 at 04:56:01PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 04, 2021 at 09:40:23AM +0100, Christoph Hellwig wrote:
> > So one thing that has been on my mind for a while:  I'd really like
> > to kill the separate dma ops in Xen swiotlb.  If we compare xen-swiotlb
> > to swiotlb the main difference seems to be:
> > 
> >  - additional reasons to bounce I/O vs the plain DMA capable
> >  - the possibility to do a hypercall on arm/arm64
> >  - an extra translation layer before doing the phys_to_dma and vice
> >versa
> >  - an special memory allocator
> > 
> > I wonder if inbetween a few jump labels or other no overhead enablement
> > options and possibly better use of the dma_range_map we could kill
> > off most of swiotlb-xen instead of maintaining all this code duplication?
> 
> So I looked at this a bit more.
> 
> For x86 with XENFEAT_auto_translated_physmap (how common is that?)

Juergen, Boris please correct me if I am wrong, but that 
XENFEAT_auto_translated_physmap
only works for PVH guests?

> pfn_to_gfn is a nop, so plain phys_to_dma/dma_to_phys do work as-is.
> 
> xen_arch_need_swiotlb always returns true for x86, and
> range_straddles_page_boundary should never be true for the
> XENFEAT_auto_translated_physmap case.

Correct. The kernel should have no clue of what the real MFNs are
for PFNs.
> 
> So as far as I can tell the mapping fast path for the
> XENFEAT_auto_translated_physmap can be trivially reused from swiotlb.
> 
> That leaves us with the next more complicated case, x86 or fully cache
> coherent arm{,64} without XENFEAT_auto_translated_physmap.  In that case
> we need to patch in a phys_to_dma/dma_to_phys that performs the MFN
> lookup, which could be done using alternatives or jump labels.
> I think if that is done right we should also be able to let that cover
> the foreign pages in is_xen_swiotlb_buffer/is_swiotlb_buffer, but
> in that worst case that would need another alternative / jump label.
> 
> For non-coherent arm{,64} we'd also need to use alternatives or jump
> labels to for the cache maintainance ops, but that isn't a hard problem
> either.
> 
> 


Re: [PATCH 0/7] x86/seves: Support 32-bit boot path and other updates

2021-02-10 Thread Konrad Rzeszutek Wilk
On Wed, Feb 10, 2021 at 04:12:25PM +0100, Joerg Roedel wrote:
> Hi Konrad,
> 
> On Wed, Feb 10, 2021 at 09:58:35AM -0500, Konrad Rzeszutek Wilk wrote:
> > What GRUB versions are we talking about (CC-ing Daniel Kiper, who owns
> > GRUB).
> 
> I think this was about 32-bit GRUB builds used by distributions. I
> personally tested it with a kernel which has EFI support disabled, in
> this case the OVMF firmware will also boot into the startup_32 boot
> path.

I think I am missing something obvious here - but why would you want
EFI support disabled?

Or is the idea that "legacy" OSes can nicely run under AMD SEV?
But since this requires a kernel patch, it is not a "legacy OS" anymore.

> 
> > By 'some firmware' we talking SeaBIOS?
> 
> No, SeaBIOS is not supported for SEV-ES, only OVMF has handling for #VC
> so far.
> 
> Regards,
> 
>   Joerg


Re: [PATCH 0/7] x86/seves: Support 32-bit boot path and other updates

2021-02-10 Thread Konrad Rzeszutek Wilk
On Wed, Feb 10, 2021 at 11:21:28AM +0100, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> Hi,
> 
> these patches add support for the 32-bit boot in the decompressor
> code. This is needed to boot an SEV-ES guest on some firmware and grub
> versions. The patches also add the necessary CPUID sanity checks and a

Could you expand a bit please?

What GRUB versions are we talking about (CC-ing Daniel Kiper, who owns
GRUB).

By 'some firmware' we talking SeaBIOS?

> 32-bit version of the C-bit check.
> 
> Other updates included here:
> 
>   1. Add code to shut down exception handling in the
>  decompressor code before jumping to the real kernel.
>  Once in the real kernel it is not safe anymore to jump
>  back to the decompressor code via exceptions.
> 
>   2. Replace open-coded hlt loops with proper calls to
>  sev_es_terminate().
> 
> Please review.
> 
> Thanks,
> 
>   Joerg
> 
> Joerg Roedel (7):
>   x86/boot/compressed/64: Cleanup exception handling before booting
> kernel
>   x86/boot/compressed/64: Reload CS in startup_32
>   x86/boot/compressed/64: Setup IDT in startup_32 boot path
>   x86/boot/compressed/64: Add 32-bit boot #VC handler
>   x86/boot/compressed/64: Add CPUID sanity check to 32-bit boot-path
>   x86/boot/compressed/64: Check SEV encryption in 32-bit boot-path
>   x86/sev-es: Replace open-coded hlt-loops with sev_es_terminate()
> 
>  arch/x86/boot/compressed/head_64.S | 168 -
>  arch/x86/boot/compressed/idt_64.c  |  14 +++
>  arch/x86/boot/compressed/mem_encrypt.S | 114 -
>  arch/x86/boot/compressed/misc.c|   7 +-
>  arch/x86/boot/compressed/misc.h|   6 +
>  arch/x86/boot/compressed/sev-es.c  |  12 +-
>  arch/x86/kernel/sev-es-shared.c|  10 +-
>  7 files changed, 307 insertions(+), 24 deletions(-)
> 
> -- 
> 2.30.0
> 


Re: [PATCH RFC 3/7] kvm: x86: XSAVE state and XFD MSRs context switch

2021-02-08 Thread Konrad Rzeszutek Wilk
On Mon, Feb 08, 2021 at 07:12:22PM +0100, Paolo Bonzini wrote:
> On 08/02/21 19:04, Sean Christopherson wrote:
> > > That said, the case where we saw MSR autoload as faster involved EFER, and
> > > we decided that it was due to TLB flushes (commit f6577a5fa15d, "x86, kvm,
> > > vmx: Always use LOAD_IA32_EFER if available", 2014-11-12). Do you know if
> > > RDMSR/WRMSR is always slower than MSR autoload?
> > RDMSR/WRMSR may be marginally slower, but only because the autoload stuff 
> > avoids
> > serializing the pipeline after every MSR.
> 
> That's probably adding up quickly...
> 
> > The autoload paths are effectively
> > just wrappers around the WRMSR ucode, plus some extra VM-Enter specific 
> > checks,
> > as ucode needs to perform all the normal fault checks on the index and 
> > value.
> > On the flip side, if the load lists are dynamically constructed, I suspect 
> > the
> > code overhead of walking the lists negates any advantages of the load lists.
> 
> ... but yeah this is not very encouraging.
> 
> Context switch time is a problem for XFD.  In a VM that uses AMX, most
> threads in the guest will have nonzero XFD but the vCPU thread itself will
> have zero XFD.  So as soon as one thread in the VM forces the vCPU thread to
> clear XFD, you pay a price on all vmexits and vmentries.
> 
> However, running the host with _more_ bits set than necessary in XFD should
> not be a problem as long as the host doesn't use the AMX instructions.  So
> perhaps Jing can look into keeping XFD=0 for as little time as possible, and
> XFD=host_XFD|guest_XFD as much as possible.

This sounds like the lazy-fpu (eagerfpu?) that used to be part of the
kernel? I recall that we had a CVE for that - so it may also be worth
double-checking that we don't reintroduce that one.


Re: [RESEND RFC: timer passthrough 0/9] Support timer passthrough for VM

2021-02-08 Thread Konrad Rzeszutek Wilk
On Fri, Feb 05, 2021 at 06:03:08PM +0800, Zhimin Feng wrote:
> The main motivation for this patch is to improve the performance of VM.
> This patch series introduces how to enable the timer passthrough in
> non-root mode.

Nice! Those are impressive numbers!

> 
> The main idea is to offload the host timer to the preemtion timer in
> non-root mode. Through doing this, guest can write tscdeadline msr directly
> in non-root mode and host timer isn't lost. If CPU is in root mode,
> guest timer is switched to software timer.

I am sorry - but I am having a hard time understanding the sentence
above so let me ask some specific questions.

- How do you protect against the guest DoS-ing the host and mucking with
  the host timer?

- As in, can you explain how the host can still continue scheduling its
  own quanta?

And one more - what happens with Live Migration? I would assume that
becomes a no-go anymore unless you swap in the guest timer back in? So
we end up emulating the MSR again?

Thanks!

> 
> Testing on Intel(R) Xeon(R) Platinum 8260 server.
> 
> The guest OS is Debian(kernel: 4.19.28). The specific configuration is
>  is as follows: 8 cpu, 16GB memory, guest idle=poll
> memcached in guest(memcached -d -t 8 -u root)
> 
> I use the memtier_benchmark tool to test performance
> (memtier_benchmark -P memcache_text -s guest_ip -c 16 -t 32
>  --key-maximum=100 --random-data --data-size-range=64-128 -p 11211
>  --generate-keys --ratio 5:1 --test-time=500)
> 
> Total Ops can be improved 25% and Avg.Latency can be improved 20% when
> the timer-passthrough is enabled.
> 
> ===============================================================
>                | Enable timer-passth | Disable timer-passth |
> ===============================================================
> Totals Ops/sec | 514869.67           | 411766.67            |
> ---------------------------------------------------------------
> Avg.Latency    | 0.99483             | 1.24294              |
> ===============================================================
> 
> 
> Zhimin Feng (9):
>   KVM: vmx: hook set_next_event for getting the host tscd
>   KVM: vmx: enable host lapic timer offload preemtion timer
>   KVM: vmx: enable passthrough timer to guest
>   KVM: vmx: enable passth timer switch to sw timer
>   KVM: vmx: use tsc_adjust to enable tsc_offset timer passthrough
>   KVM: vmx: check enable_timer_passth strictly
>   KVM: vmx: save the initial value of host tscd
>   KVM: vmx: Dynamically open or close the timer-passthrough for pre-vm
>   KVM: vmx: query the state of timer-passth for vm
> 
>  arch/x86/include/asm/kvm_host.h |  27 
>  arch/x86/kvm/lapic.c|   1 +
>  arch/x86/kvm/vmx/vmx.c  | 331 
> +++-
>  arch/x86/kvm/x86.c  |  26 +++-
>  include/linux/kvm_host.h|   1 +
>  include/uapi/linux/kvm.h|   3 +
>  kernel/time/tick-common.c   |   1 +
>  tools/include/uapi/linux/kvm.h  |   3 +
>  virt/kvm/kvm_main.c |   1 +
>  9 files changed, 389 insertions(+), 5 deletions(-)
> 
> -- 
> 2.11.0
> 


Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-02-08 Thread Konrad Rzeszutek Wilk
On Fri, Feb 05, 2021 at 06:58:52PM +0100, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 02:36:38PM -0500, Konrad Rzeszutek Wilk wrote:
> > > So what?  If you guys want to provide a new capability you'll have to do
> > > work.  And designing a new protocol based around the fact that the
> > > hardware/hypervisor is not trusted and a copy is always required makes
> > > a lot of more sense than throwing in band aids all over the place.
> > 
> > If you don't trust the hypervisor, what would this capability be in?
> 
> Well, they don't trust the hypervisor to not attack the guest somehow,
> except through the data read.  I never really understood the concept,
> as it leaves too many holes.
> 
> But the point is that these schemes want to force bounce buffering
> because they think it is more secure.  And if that is what you want
> you better have protocol build around the fact that each I/O needs
> to use bounce buffers, so you make those buffers the actual shared
> memory use for communication, and build the protocol around it.

Right. That is what the SWIOTLB pool ends up being: it is allocated at
boot, where the guest tells the hypervisor that these pages are shared and
clear-text.

> E.g. you don't force the ridiculous NVMe PRP offset rules on the block
> layer, just to make a complicated swiotlb allocation that needs to
> preserve the alignment just do I/O.  But instead you have a trivial

I agree that NVMe is being silly. It could have allocated the coherent
pool, used that, and done its own offsets within it. That would in
essence carve out a static pool within the SWIOTLB static one.

TTM does that - it has its own DMA machinery on top of the DMA API to deal
with "passing" buffers from one application to another and the fun
of keeping track of all that.

> ring buffer or whatever because you know I/O will be copied anyway
> and none of all the hard work higher layers do to make the I/O suitable
> for a normal device apply.

I lost you here, sorry. Are you saying we should have a simple ring protocol
(like NVMe has), where the ring entries (SG or DMA phys) are statically
allocated, and whenever the NVMe driver gets data from user-space it
copies it in there?



Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays

2021-02-04 Thread Konrad Rzeszutek Wilk
On Thu, Feb 04, 2021 at 11:49:23AM +, Robin Murphy wrote:
> On 2021-02-04 07:29, Christoph Hellwig wrote:
> > On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote:
> > > This patch converts several swiotlb related variables to arrays, in
> > > order to maintain stat/status for different swiotlb buffers. Here are
> > > variables involved:
> > > 
> > > - io_tlb_start and io_tlb_end
> > > - io_tlb_nslabs and io_tlb_used
> > > - io_tlb_list
> > > - io_tlb_index
> > > - max_segment
> > > - io_tlb_orig_addr
> > > - no_iotlb_memory
> > > 
> > > There is no functional change and this is to prepare to enable 64-bit
> > > swiotlb.
> > 
> > Claire Chang (on Cc) already posted a patch like this a month ago,
> > which looks much better because it actually uses a struct instead
> > of all the random variables.
> 
> Indeed, I skimmed the cover letter and immediately thought that this whole
> thing is just the restricted DMA pool concept[1] again, only from a slightly
> different angle.


Kind of. Let me lay out how some of these pieces are right now:

+---+  +--+
|   |  |  |
|   |  |  |
|   a)Xen-SWIOTLB   |  | b)SWIOTLB (for !Xen) |
|   |  |  |
+---XX--+  +---X--+
   X
  XX XXX
X   XX

 +--XX---+
 |   |
 |   |
 |   c) SWIOTLB generic  |
 |   |
 +---+

Dongli's patches modify the SWIOTLB generic c), and Xen-SWIOTLB a)
parts.

Also see the IOMMU_INIT logic which lays this out in a bit more depth
(for example how to enable SWIOTLB on AMD boxes, or IBM with the Calgary
IOMMU, etc - see iommu_table.h).

Furthermore it lays the groundwork to allocate AMD SEV SWIOTLB buffers
later, after boot (so that you can stitch different pools together).
All the bits are kind of inside of the SWIOTLB code. It also changes
the Xen-SWIOTLB to do something similar.

The mempool approach did it similarly, by taking the internal parts of
SWIOTLB (aka the various io_tlb variables), exposing them, and having
other code:

+---+  +--+
|   |  |  |
|   |  |  |
| a)Xen-SWIOTLB |  | b)SWIOTLB (for !Xen) |
|   |  |  |
+---XX--+  +---X--+
   X
  XX XXX
X   XX

 +--XX---+ +--+
 |   | | Device tree  |
 |   +<+ enabling SWIOTLB |
 |c) SWIOTLB generic | |  |
 |   | | mempool  |
 +---+ +--+

What I was suggesting to Claire was to follow the Xen model, that is,
do something like this:

+---+  +--+   ++
|   |  |  |   ||
|   |  |  |   ||
| a)Xen-SWIOTLB |  | b)SWIOTLB (for !Xen) |   | e) DT-SWIOTLB  |
|   |  |  |   ||
+---XX--+  +---X--+   +XX-X+
   XXXX X X XX X XX
  XX XXX
X   XX X

 +--XXX--+
 |   |
 |   |
 |c) SWIOTLB generic |
 |   |
 +---+


so using the SWIOTLB generic parts, and then bolting the device-tree
logic on top, along with the mempool logic.



But Christoph has an interesting suggestion, which is to squash all
the existing code (a, b, c) together and pepper it with various
jump-tables.


So:


-+
| SWIOTLB:   |
||
|  a) SWIOTLB (for non-Xen)  |
|  b) Xen-SWIOTLB|
|  c) DT-SWIOTLB |
||
||
-+


with all the various bits (M2P/P2M for Xen, mempool for ARM,
and normal allocation for BM) in one big file.



Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-02-03 Thread Konrad Rzeszutek Wilk
On Wed, Feb 03, 2021 at 01:49:22PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> > Your comment makes sense but then that would require the cooperation
> > of these vendors and the cloud providers to agree on something meaningful.
> > I am also not sure whether the end result would be better than hardening
> > this interface to catch corruption. There is already some validation in
> > unmap path anyway.
> 
> So what?  If you guys want to provide a new capability you'll have to do
> work.  And designing a new protocol based around the fact that the
> hardware/hypervisor is not trusted and a copy is always required makes
> a lot of more sense than throwing in band aids all over the place.

If you don't trust the hypervisor, what would this capability be in?

I suppose you mean this would need to be in the guest kernel, and this
protocol would depend on .. not-the-hypervisor, and most certainly not
virtio or any SR-IOV device. That removes a lot of options.

The one sensible option (since folks will trust OEM vendors like Intel
or AMD to provide the memory encryption, so they will also trust the
IOMMU - I hope?) is the IOMMU - and they do have plans for that with their
IOMMU frameworks, which will remove the need for SWIOTLB (I hope).

But that is not now; that is in the future.


Re: [PATCH 1/1] iscsi_ibft: KASAN false positive failure occurs in ibft_init()

2021-02-03 Thread Konrad Rzeszutek Wilk
Hey Dmitry, Rafael, George, please see below..

On Wed, Jan 27, 2021 at 10:10:07PM +0100, Dmitry Vyukov wrote:
> On Wed, Jan 27, 2021 at 9:01 PM George Kennedy
>  wrote:
> >
> > Hi Dmitry,
> >
> > On 1/27/2021 1:48 PM, Dmitry Vyukov wrote:
> >
> > On Wed, Jan 27, 2021 at 7:44 PM Konrad Rzeszutek Wilk
> >  wrote:
> >
> > On Tue, Jan 26, 2021 at 01:03:21PM -0500, George Kennedy wrote:
> >
> > During boot of kernel with CONFIG_KASAN the following KASAN false
> > positive failure will occur when ibft_init() reads the
> > ACPI iBFT table: BUG: KASAN: use-after-free in ibft_init
> >
> > The ACPI iBFT table is not allocated, and the iscsi driver uses
> > a pointer to it to calculate checksum, etc. KASAN complains
> > about this pointer with use-after-free, which this is not.
> >
> > Andrey, Alexander, Dmitry,
> >
> > I think this is the right way for this, but was wondering if you have
> > other suggestions?
> >
> > Thanks!
> >
> > Hi George, Konrad,
> >
> > Please provide a sample KASAN report and kernel version to match line 
> > numbers.
> >
> > 5.4.17-2102.200.0.0.20210106_
> >
> > [   24.413536] iBFT detected.
> > [   24.414074]
> > ==
> > [   24.407342] BUG: KASAN: use-after-free in ibft_init+0x134/0xb8b
> > [   24.407342] Read of size 4 at addr 8880be452004 by task swapper/0/1
> > [   24.407342]
> > [   24.407342] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 
> > 5.4.17-2102.200.0.0.20210106_.syzk #1
> > [   24.407342] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 0.0.0 02/06/2015
> > [   24.407342] Call Trace:
> > [   24.407342]  dump_stack+0xd4/0x119
> > [   24.407342]  ? ibft_init+0x134/0xb8b
> > [   24.407342]  print_address_description.constprop.6+0x20/0x220
> > [   24.407342]  ? ibft_init+0x134/0xb8b
> > [   24.407342]  ? ibft_init+0x134/0xb8b
> > [   24.407342]  __kasan_report.cold.9+0x37/0x77
> > [   24.407342]  ? ibft_init+0x134/0xb8b
> > [   24.407342]  kasan_report+0x14/0x1b
> > [   24.407342]  __asan_report_load_n_noabort+0xf/0x11
> > [   24.407342]  ibft_init+0x134/0xb8b
> > [   24.407342]  ? dmi_sysfs_init+0x1a5/0x1a5
> > [   24.407342]  ? dmi_walk+0x72/0x89
> > [   24.407342]  ? ibft_check_initiator_for+0x159/0x159
> > [   24.407342]  ? rvt_init_port+0x110/0x101
> > [   24.407342]  ? ibft_check_initiator_for+0x159/0x159
> > [   24.407342]  do_one_initcall+0xc3/0x44d
> > [   24.407342]  ? perf_trace_initcall_level+0x410/0x405
> > [   24.407342]  kernel_init_freeable+0x551/0x673
> > [   24.407342]  ? start_kernel+0x94b/0x94b
> > [   24.407342]  ? __sanitizer_cov_trace_const_cmp1+0x1a/0x1c
> > [   24.407342]  ? __kasan_check_write+0x14/0x16
> > [   24.407342]  ? rest_init+0xe6/0xe6
> > [   24.407342]  kernel_init+0x16/0x1bd
> > [   24.407342]  ? rest_init+0xe6/0xe6
> > [   24.407342]  ret_from_fork+0x2b/0x36
> > [   24.407342]
> > [   24.407342] The buggy address belongs to the page:
> > [   24.407342] page:ea0002f91480 refcount:0 mapcount:0 
> > mapping: index:0x1
> > [   24.407342] flags: 0xfc000()
> > [   24.407342] raw: 000fc000 ea0002fca588 ea0002fb1a88 
> > 
> > [   24.407342] raw: 0001   
> > 
> > [   24.407342] page dumped because: kasan: bad access detected
> > [   24.407342]
> > [   24.407342] Memory state around the buggy address:
> > [   24.407342]  8880be451f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> > ff ff
> > [   24.407342]  8880be451f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> > ff ff
> > [   24.407342] >8880be452000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> > ff ff
> > [   24.407342]^
> > [   24.407342]  8880be452080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> > ff ff
> > [   24.407342]  8880be452100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> > ff ff
> > [   24.407342]
> > ==
> > [   24.407342] Disabling lock debugging due to kernel taint
> > [   24.451021] Kernel panic - not syncing: panic_on_warn set ...
> > [   24.452002] CPU: 1 PID: 1 Comm: swapper/0 Tainted: GB 
> > 5.4.17-2102.200.0.0.20210106_.syzk #1
> > [   24.452002] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 0.0.0 02/06/2015
> > [

Re: [PATCH 13/14] cxl/mem: Add limited Get Log command (0401h)

2021-02-03 Thread Konrad Rzeszutek Wilk
On Wed, Feb 03, 2021 at 09:16:10AM -0800, Ben Widawsky wrote:
> On 21-02-02 15:57:03, Dan Williams wrote:
> > On Tue, Feb 2, 2021 at 3:51 PM Ben Widawsky  wrote:
> > >
> > > On 21-02-01 13:28:48, Konrad Rzeszutek Wilk wrote:
> > > > On Fri, Jan 29, 2021 at 04:24:37PM -0800, Ben Widawsky wrote:
> > > > > The Get Log command returns the actual log entries that are advertised
> > > > > via the Get Supported Logs command (0400h). CXL device logs are 
> > > > > selected
> > > > > by UUID which is part of the CXL spec. Because the driver tries to
> > > > > sanitize what is sent to hardware, there becomes a need to restrict 
> > > > > the
> > > > > types of logs which can be accessed by userspace. For example, the
> > > > > vendor specific log might only be consumable by proprietary, or 
> > > > > offline
> > > > > applications, and therefore a good candidate for userspace.
> > > > >
> > > > > The current driver infrastructure does allow basic validation for all
> > > > > commands, but doesn't inspect any of the payload data. Along with Get
> > > > > Log support comes new infrastructure to add a hook for payload
> > > > > validation. This infrastructure is used to filter out the CEL UUID,
> > > > > which the userspace driver doesn't have business knowing, and taints 
> > > > > on
> > > > > invalid UUIDs being sent to hardware.
> > > >
> > > > Perhaps a better option is to reject invalid UUIDs?
> > > >
> > > > And if you really really want to use invalid UUIDs then:
> > > >
> > > > 1) Make that code wrapped in CONFIG_CXL_DEBUG_THIS_IS_GOING_TO..?
> > > >
> > > > 2) Wrap it with lockdown code so that you can't do this at all
> > > >when in LOCKDOWN_INTEGRITY or such?
> > > >
> > >
> > > The commit message needs update btw as CEL is allowed in the latest rev 
> > > of the
> > > patches.
> > >
> > > We could potentially combine this with the now added (in a branch) 
> > > CONFIG_RAW
> > > config option. Indeed I think that makes sense. Dan, thoughts?
> > 
> > Yeah, unknown UUIDs blocking is the same risk as raw commands as a
> > vendor can trigger any behavior they want. A "CONFIG_RAW depends on
> > !CONFIG_INTEGRITY" policy sounds reasonable as well.
> 
> What about LOCKDOWN_NONE though? I think we need something runtime for this.
> 
> Can we summarize the CONFIG options here?
> 
> CXL_MEM_INSECURE_DEBUG // no change
> CXL_MEM_RAW_COMMANDS // if !security_locked_down(LOCKDOWN_NONE)
> 
> bool cxl_unsafe()

Would it be better if this was inverted? Aka cxl_safe()?
> {
> #ifndef CXL_MEM_RAW_COMMANDS
>   return false;
> #else
>   return !security_locked_down(LOCKDOWN_NONE);

:thumbsup:

(Naturally this would be inverted if this was cxl_safe()).


> #endif
> }
> 
> ---
> 
> Did I get that right?

:nods:
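
Spelled out, the inverted helper discussed above might look something like
this (just a sketch mirroring the snippet quoted earlier - the CONFIG name
and the LOCKDOWN_NONE usage are taken from that sketch, not from an actual
patch):

static bool cxl_safe(void)
{
#ifndef CONFIG_CXL_MEM_RAW_COMMANDS
	return true;
#else
	/* Raw commands are compiled in: only "safe" if lockdown forbids them. */
	return security_locked_down(LOCKDOWN_NONE) != 0;
#endif
}

Callers would then gate the raw/unknown-UUID paths on cxl_safe() being false.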



Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-02-02 Thread Konrad Rzeszutek Wilk
On Tue, Feb 02, 2021 at 04:34:09PM -0600, Tom Lendacky wrote:
> On 2/2/21 10:37 AM, Konrad Rzeszutek Wilk wrote:
> > On Mon, Jan 25, 2021 at 07:33:35PM +0100, Martin Radev wrote:
> >> On Mon, Jan 18, 2021 at 10:14:28AM -0500, Konrad Rzeszutek Wilk wrote:
> >>> On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> >>>> On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> >>>>> On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> >>>>>> The size of the buffer being bounced is not checked if it happens
> >>>>>> to be larger than the size of the mapped buffer. Because the size
> >>>>>> can be controlled by a device, as it's the case with virtio devices,
> >>>>>> this can lead to memory corruption.
> >>>>>>
> >>>>>
> >>>>> I'm really worried about all these hodge podge hacks for not trusted
> >>>>> hypervisors in the I/O stack.  Instead of trying to harden protocols
> >>>>> that are fundamentally not designed for this, how about instead coming
> >>>>> up with a new paravirtualized I/O interface that is specifically
> >>>>> designed for use with an untrusted hypervisor from the start?
> >>>>
> >>>> Your comment makes sense but then that would require the cooperation
> >>>> of these vendors and the cloud providers to agree on something 
> >>>> meaningful.
> >>>> I am also not sure whether the end result would be better than hardening
> >>>> this interface to catch corruption. There is already some validation in
> >>>> unmap path anyway.
> >>>>
> >>>> Another possibility is to move this hardening to the common virtio code,
> >>>> but I think the code may become more complicated there since it would
> >>>> require tracking both the dma_addr and length for each descriptor.
> >>>
> >>> Christoph,
> >>>
> >>> I've been wrestling with the same thing - this is specific to busted
> >>> drivers. And in reality you could do the same thing with a hardware
> >>> virtio device (see example in 
> >>> http://thunderclap.io/) - where the
> >>> mitigation is 'enable the IOMMU to do its job.'.
> >>>
> >>> AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP)..
> >>> and while that is great in the future, SEV without IOMMU is now here.
> >>>
> >>> Doing a full circle here, this issue can be exploited with virtio
> >>> but you could say do that with real hardware too if you hacked the
> >>> firmware, so if you say used Intel SR-IOV NIC that was compromised
> >>> on an AMD SEV machine, and plumbed in the guest - the IOMMU inside
> >>> of the guest would be SWIOTLB code. Last line of defense against
> >>> bad firmware to say.
> >>>
> >>> As such I am leaning towards taking this code, but I am worried
> >>> about the performance hit .. but perhaps I shouldn't as if you
> >>> are using SWIOTLB=force already you are kind of taking a
> >>> performance hit?
> >>>
> >>
> >> I have not measured the performance degradation. This will hit all AMD SEV,
> >> Intel TDX, IBM Protected Virtualization VMs. I don't expect the hit to
> >> be large since there are only few added operations per hundreads of copied
> >> bytes. I could try to measure the performance hit by running some benchmark
> >> with virtio-net/virtio-blk/virtio-rng.
> >>
> >> Earlier I said:
> >>>> Another possibility is to move this hardening to the common virtio code,
> >>>> but I think the code may become more complicated there since it would
> >>>> require tracking both the dma_addr and length for each descriptor.
> >>
> >> Unfortunately, this doesn't make sense. Even if there's validation for
> >> the size in the common virtio layer, there will be some other device
> >> which controls a dma_addr and length passed to dma_unmap* in the
> >> corresponding driver. The device c

Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-02-02 Thread Konrad Rzeszutek Wilk
On Mon, Jan 25, 2021 at 07:33:35PM +0100, Martin Radev wrote:
> On Mon, Jan 18, 2021 at 10:14:28AM -0500, Konrad Rzeszutek Wilk wrote:
> > On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> > > On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> > > > On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> > > > > The size of the buffer being bounced is not checked if it happens
> > > > > to be larger than the size of the mapped buffer. Because the size
> > > > > can be controlled by a device, as it's the case with virtio devices,
> > > > > this can lead to memory corruption.
> > > > > 
> > > > 
> > > > I'm really worried about all these hodge podge hacks for not trusted
> > > > hypervisors in the I/O stack.  Instead of trying to harden protocols
> > > > that are fundamentally not designed for this, how about instead coming
> > > > up with a new paravirtualized I/O interface that is specifically
> > > > designed for use with an untrusted hypervisor from the start?
> > > 
> > > Your comment makes sense but then that would require the cooperation
> > > of these vendors and the cloud providers to agree on something meaningful.
> > > I am also not sure whether the end result would be better than hardening
> > > this interface to catch corruption. There is already some validation in
> > > unmap path anyway.
> > > 
> > > Another possibility is to move this hardening to the common virtio code,
> > > but I think the code may become more complicated there since it would
> > > require tracking both the dma_addr and length for each descriptor.
> > 
> > Christoph,
> > 
> > I've been wrestling with the same thing - this is specific to busted
> > drivers. And in reality you could do the same thing with a hardware
> > virtio device (see example in http://thunderclap.io/) - where the
> > mitigation is 'enable the IOMMU to do its job.'.
> > 
> > AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP)..
> > and while that is great in the future, SEV without IOMMU is now here.
> > 
> > Doing a full circle here, this issue can be exploited with virtio
> > but you could say do that with real hardware too if you hacked the
> > firmware, so if you say used Intel SR-IOV NIC that was compromised
> > on an AMD SEV machine, and plumbed in the guest - the IOMMU inside
> > of the guest would be SWIOTLB code. Last line of defense against
> > bad firmware to say.
> > 
> > As such I am leaning towards taking this code, but I am worried
> > about the performance hit .. but perhaps I shouldn't as if you
> > are using SWIOTLB=force already you are kind of taking a
> > performance hit?
> > 
> 
> I have not measured the performance degradation. This will hit all AMD SEV,
> Intel TDX, IBM Protected Virtualization VMs. I don't expect the hit to
> be large since there are only few added operations per hundreads of copied
> bytes. I could try to measure the performance hit by running some benchmark
> with virtio-net/virtio-blk/virtio-rng.
> 
> Earlier I said:
> > > Another possibility is to move this hardening to the common virtio code,
> > > but I think the code may become more complicated there since it would
> > > require tracking both the dma_addr and length for each descriptor.
> 
> Unfortunately, this doesn't make sense. Even if there's validation for
> the size in the common virtio layer, there will be some other device
> which controls a dma_addr and length passed to dma_unmap* in the
> corresponding driver. The device can target a specific dma-mapped private
> buffer by changing the dma_addr and set a good length to overwrite buffers
> following it.
> 
> So, instead of doing the check in every driver and hitting a performance
> cost even when swiotlb is not used, it's probably better to fix it in
> swiotlb.
> 
> @Tom Lendacky, do you think that it makes sense to harden swiotlb or
> some other approach may be better for the SEV features?

I am not Tom, but this change seems the right way forward regardless of
whether it is TDX, AMD SEV, or any other architecture that encrypts
memory and uses SWIOTLB.

Let me queue it up in the development branch and do some regression testing.
> 
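
For readers following along, the kind of validation being discussed is
roughly the following (sketch only - the actual patch keys off the size
recorded at map time; the function and parameter names here are
illustrative):

static void swiotlb_bounce_checked(phys_addr_t orig_addr, phys_addr_t tlb_addr,
				   size_t size, size_t mapped_size,
				   enum dma_data_direction dir)
{
	if (WARN_ONCE(size > mapped_size,
		      "swiotlb: bounce size %zu exceeds mapped size %zu\n",
		      size, mapped_size))
		size = mapped_size;	/* never copy past the mapped buffer */

	swiotlb_bounce(orig_addr, tlb_addr, size, dir);
}

That way a device-controlled length can at worst truncate the copy instead
of scribbling past the bounce slot.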


Re: [PATCH 08/14] taint: add taint for direct hardware access

2021-02-01 Thread Konrad Rzeszutek Wilk
On Mon, Feb 01, 2021 at 11:01:11AM -0800, Dan Williams wrote:
> On Mon, Feb 1, 2021 at 10:35 AM Ben Widawsky  wrote:
> >
> > On 21-02-01 13:18:45, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Jan 29, 2021 at 04:24:32PM -0800, Ben Widawsky wrote:
> > > > For drivers that moderate access to the underlying hardware it is
> > > > sometimes desirable to allow userspace to bypass restrictions. Once
> > > > userspace has done this, the driver can no longer guarantee the sanctity
> > > > of either the OS or the hardware. When in this state, it is helpful for
> > > > kernel developers to be made aware (via this taint flag) of this fact
> > > > for subsequent bug reports.
> > > >
> > > > Example usage:
> > > > - Hardware xyzzy accepts 2 commands, waldo and fred.
> > > > - The xyzzy driver provides an interface for using waldo, but not fred.
> > > > - quux is convinced they really need the fred command.
> > > > - xyzzy driver allows quux to frob hardware to initiate fred.
> > >
> > > Would it not be easier to _not_ frob the hardware for fred-operation?
> > > Aka not implement it or just disallow in the first place?
> >
> > Yeah. So the idea is you either are in a transient phase of the command and 
> > some
> > future kernel will have real support for fred - or a vendor is being short
> > sighted and not adding support for fred.
> >
> > >
> > >
> > > >   - kernel gets tainted.
> > > > - turns out fred command is borked, and scribbles over memory.
> > > > - developers laugh while closing quux's subsequent bug report.
> > >
> > > Yeah good luck with that theory in-the-field. The customer won't
> > > care about this and will demand a solution for doing fred-operation.
> > >
> > > Just easier to not do fred-operation in the first place,no?
> >
> > The short answer is, in an ideal world you are correct. See nvdimm as an 
> > example
> > of the real world.
> >
> > The longer answer. Unless we want to wait until we have all the hardware 
> > we're
> > ever going to see, it's impossible to have a fully baked, and validated
> > interface. The RAW interface is my admission that I make no guarantees about
> > being able to provide the perfect interface and giving the power back to the
> > hardware vendors and their driver writers.
> >
> > As an example, suppose a vendor shipped a device with their special vendor
> > opcode. They can enable their customers to use that opcode on any driver
> > version. That seems pretty powerful and worthwhile to me.
> >
> 
> Powerful, frightening, and questionably worthwhile when there are
> already examples of commands that need extra coordination for whatever
> reason. However, I still think the decision tilts towards allowing
> this given ongoing spec work.
> 
> NVDIMM ended up allowing unfettered vendor passthrough given the lack
> of an organizing body to unify vendors. CXL on the other hand appears
> to have more gravity to keep vendors honest. A WARN splat with a
> taint, and a debugfs knob for the truly problematic commands seems
> sufficient protection of system integrity while still following the
> Linux ethos of giving system owners enough rope to make their own
> decisions.
> 
> > Or a more realistic example, we ship a driver that adds a command which is
> > totally broken. Customers can utilize the RAW interface until it gets fixed 
> > in a
> > subsequent release which might be quite a ways out.
> >
> > I'll say the RAW interface isn't an encouraged usage, but it's one that I 
> > expect
> > to be needed, and if it's not we can always try to kill it later. If nobody 
> > is
> > actually using it, nobody will complain, right :D
> 
> It might be worthwhile to make RAW support a compile time decision so
> that Linux distros can only ship support for the commands the CXL
> driver-dev community has blessed, but I'll leave it to a distro
> developer to second that approach.

Couple of thoughts here:

 - As a distro developer (well, actually a middle manager of distro
   developers), this raw interface approach is a headache.

   Customers will pick it up and use it since it is there, and the poor
   support folks will have to go through layers of different devices to
   finally find out that (for example) some OEM firmware opcode X is a
   debug facility for inserting corrupted data, while for another vendor
   the same opcode X makes it go super-fast.

   Not that anybody would do that, right? Ha!

 - I will imagine that some of the more vocal folk

Re: [PATCH 09/14] cxl/mem: Add a "RAW" send command

2021-02-01 Thread Konrad Rzeszutek Wilk
On Mon, Feb 01, 2021 at 11:27:08AM -0800, Ben Widawsky wrote:
> On 21-02-01 13:24:00, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jan 29, 2021 at 04:24:33PM -0800, Ben Widawsky wrote:
> > > The CXL memory device send interface will have a number of supported
> > > commands. The raw command is not such a command. Raw commands allow
> > > userspace to send a specified opcode to the underlying hardware and
> > > bypass all driver checks on the command. This is useful for a couple of
> > > usecases, mainly:
> > > 1. Undocumented vendor specific hardware commands
> > > 2. Prototyping new hardware commands not yet supported by the driver
> > 
> > This sounds like a recipe for ..
> > 
> > In case you really really want this may I recommend you do two things:
> > 
> > - Wrap this whole thing with #ifdef
> >   CONFIG_CXL_DEBUG_THIS_WILL_DESTROY_YOUR_LIFE
> > 
> >  (or something equivalant to make it clear this should never be
> >   enabled in production kernels).
> > 
> >  - Add a nice big fat printk in dmesg telling the user that they
> >are creating a unstable parallel universe that will lead to their
> >blood pressure going sky-high, or perhaps something more professional
> >sounding.
> > 
> > - Rethink this. Do you really really want to encourage vendors
> >   to use this raw API instead of them using the proper APIs?
> 
> Again, the ideal is proper APIs. Barring that they get a WARN, and a taint if
> they use the raw commands.

Linux upstream is all about proper APIs. Just don't do this.
> 
> > 
> > > 
> > > While this all sounds very powerful it comes with a couple of caveats:
> > > 1. Bug reports using raw commands will not get the same level of
> > >attention as bug reports using supported commands (via taint).
> > > 2. Supported commands will be rejected by the RAW command.
> > > 
> > > With this comes new debugfs knob to allow full access to your toes with
> > > your weapon of choice.
> > 
> > Problem is that debugfs is no longer "debug" but is enabled in
> > production kernel.
> 
> I don't see this as my problem. Again, they've been WARNed and tainted. If 
> they

Right, not your problem, nice.

But it is going to be the problem of vendor kernel engineers who don't have 
this luxury.

> want to do this, that's their business. They will be asked to reproduce 
> without
> RAW if they file a bug report.


This is not how customers see the world. "If it is there, then it is
there to be used, right? Why else would someone give me the keys to this?"

Just kill this. Or better yet, make it a separate set of patches for
folks developing code, but do not have it as part of this patchset.



> 


Re: [PATCH 13/14] cxl/mem: Add limited Get Log command (0401h)

2021-02-01 Thread Konrad Rzeszutek Wilk
On Fri, Jan 29, 2021 at 04:24:37PM -0800, Ben Widawsky wrote:
> The Get Log command returns the actual log entries that are advertised
> via the Get Supported Logs command (0400h). CXL device logs are selected
> by UUID which is part of the CXL spec. Because the driver tries to
> sanitize what is sent to hardware, there becomes a need to restrict the
> types of logs which can be accessed by userspace. For example, the
> vendor specific log might only be consumable by proprietary, or offline
> applications, and therefore a good candidate for userspace.
> 
> The current driver infrastructure does allow basic validation for all
> commands, but doesn't inspect any of the payload data. Along with Get
> Log support comes new infrastructure to add a hook for payload
> validation. This infrastructure is used to filter out the CEL UUID,
> which the userspace driver doesn't have business knowing, and taints on
> invalid UUIDs being sent to hardware.

Perhaps a better option is to reject invalid UUIDs?

And if you really really want to use invalid UUIDs then:

1) Make that code wrapped in CONFIG_CXL_DEBUG_THIS_IS_GOING_TO..?

2) Wrap it with lockdown code so that you can't do this at all
   when in LOCKDOWN_INTEGRITY or such?

> 
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/mem.c| 42 +++-
>  include/uapi/linux/cxl_mem.h |  1 +
>  2 files changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index b8ca6dff37b5..086268f1dd6c 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -119,6 +119,8 @@ static const uuid_t log_uuid[] = {
>0x07, 0x19, 0x40, 0x3d, 0x86)
>  };
>  
> +static int validate_log_uuid(void __user *payload, size_t size);
> +
>  /**
>   * struct cxl_mem_command - Driver representation of a memory device command
>   * @info: Command information as it exists for the UAPI
> @@ -132,6 +134,10 @@ static const uuid_t log_uuid[] = {
>   *  * %CXL_CMD_INTERNAL_FLAG_PSEUDO: This is a pseudo command which doesn't 
> have
>   *a direct mapping to hardware. They are implicitly always enabled.
>   *
> + * @validate_payload: A function called after the command is validated but
> + * before it's sent to the hardware. The primary purpose is to validate, or
> + * fixup the actual payload.
> + *
>   * The cxl_mem_command is the driver's internal representation of commands 
> that
>   * are supported by the driver. Some of these commands may not be supported 
> by
>   * the hardware. The driver will use @info to validate the fields passed in 
> by
> @@ -147,9 +153,11 @@ struct cxl_mem_command {
>  #define CXL_CMD_INTERNAL_FLAG_HIDDEN BIT(0)
>  #define CXL_CMD_INTERNAL_FLAG_MANDATORY BIT(1)
>  #define CXL_CMD_INTERNAL_FLAG_PSEUDO BIT(2)
> +
> + int (*validate_payload)(void __user *payload, size_t size);
>  };
>  
> -#define CXL_CMD(_id, _flags, sin, sout, f)   
>   \
> +#define CXL_CMD_VALIDATE(_id, _flags, sin, sout, f, v)   
>   \
>   [CXL_MEM_COMMAND_ID_##_id] = { \
>   .info = {  \
>   .id = CXL_MEM_COMMAND_ID_##_id,\
> @@ -159,8 +167,12 @@ struct cxl_mem_command {
>   }, \
>   .flags = CXL_CMD_INTERNAL_FLAG_##f,\
>   .opcode = CXL_MBOX_OP_##_id,   \
> + .validate_payload = v, \
>   }
>  
> +#define CXL_CMD(_id, _flags, sin, sout, f)   
>   \
> + CXL_CMD_VALIDATE(_id, _flags, sin, sout, f, NULL)
> +
>  /*
>   * This table defines the supported mailbox commands for the driver. This 
> table
>   * is made up of a UAPI structure. Non-negative values as parameters in the
> @@ -176,6 +188,8 @@ static struct cxl_mem_command mem_commands[] = {
>   CXL_CMD(GET_PARTITION_INFO, NONE, 0, 0x20, NONE),
>   CXL_CMD(GET_LSA, NONE, 0x8, ~0, MANDATORY),
>   CXL_CMD(GET_HEALTH_INFO, NONE, 0, 0x12, MANDATORY),
> + CXL_CMD_VALIDATE(GET_LOG, MUTEX, 0x18, ~0, MANDATORY,
> +  validate_log_uuid),
>  };
>  
>  /*
> @@ -563,6 +577,13 @@ static int handle_mailbox_cmd_from_user(struct 
> cxl_memdev *cxlmd,
>   kvzalloc(cxlm->mbox.payload_size, GFP_KERNEL);
>  
>   if (cmd->info.size_in) {
> + if (cmd->validate_payload) {
> + rc = cmd->validate_payload(u64_to_user_ptr(in_payload),
> +cmd->info.size_in);
> + if (rc)
> + goto out;
> + }
> +
>   mbox_cmd.payload_in = kvzalloc(cmd->info.size_in, GFP_KERNEL);
>   if (!mbox_cmd.payload_in) {
>  

Re: [PATCH 09/14] cxl/mem: Add a "RAW" send command

2021-02-01 Thread Konrad Rzeszutek Wilk
On Fri, Jan 29, 2021 at 04:24:33PM -0800, Ben Widawsky wrote:
> The CXL memory device send interface will have a number of supported
> commands. The raw command is not such a command. Raw commands allow
> userspace to send a specified opcode to the underlying hardware and
> bypass all driver checks on the command. This is useful for a couple of
> usecases, mainly:
> 1. Undocumented vendor specific hardware commands
> 2. Prototyping new hardware commands not yet supported by the driver

This sounds like a recipe for ..

In case you really really want this may I recommend you do two things:

- Wrap this whole thing with #ifdef
  CONFIG_CXL_DEBUG_THIS_WILL_DESTROY_YOUR_LIFE

 (or something equivalent to make it clear this should never be
  enabled in production kernels).

 - Add a nice big fat printk in dmesg telling the user that they
   are creating an unstable parallel universe that will lead to their
   blood pressure going sky-high, or perhaps something more professional
   sounding.

- Rethink this. Do you really really want to encourage vendors
  to use this raw API instead of them using the proper APIs?

> 
> While this all sounds very powerful it comes with a couple of caveats:
> 1. Bug reports using raw commands will not get the same level of
>attention as bug reports using supported commands (via taint).
> 2. Supported commands will be rejected by the RAW command.
> 
> With this comes new debugfs knob to allow full access to your toes with
> your weapon of choice.

The problem is that debugfs is no longer "debug" but is enabled in
production kernels.


Re: [PATCH 08/14] taint: add taint for direct hardware access

2021-02-01 Thread Konrad Rzeszutek Wilk
On Fri, Jan 29, 2021 at 04:24:32PM -0800, Ben Widawsky wrote:
> For drivers that moderate access to the underlying hardware it is
> sometimes desirable to allow userspace to bypass restrictions. Once
> userspace has done this, the driver can no longer guarantee the sanctity
> of either the OS or the hardware. When in this state, it is helpful for
> kernel developers to be made aware (via this taint flag) of this fact
> for subsequent bug reports.
> 
> Example usage:
> - Hardware xyzzy accepts 2 commands, waldo and fred.
> - The xyzzy driver provides an interface for using waldo, but not fred.
> - quux is convinced they really need the fred command.
> - xyzzy driver allows quux to frob hardware to initiate fred.

Would it not be easier to _not_ frob the hardware for fred-operation?
Aka not implement it or just disallow in the first place?


>   - kernel gets tainted.
> - turns out fred command is borked, and scribbles over memory.
> - developers laugh while closing quux's subsequent bug report.

Yeah, good luck with that theory in the field. The customer won't
care about this and will demand a solution for doing fred-operation.

Just easier to not do fred-operation in the first place, no?


Re: [PATCH 07/14] cxl/mem: Add send command

2021-02-01 Thread Konrad Rzeszutek Wilk
> +/**
> + * struct cxl_send_command - Send a command to a memory device.
> + * @id: The command to send to the memory device. This must be one of the
> + *   commands returned by the query command.
> + * @flags: Flags for the command (input).
> + * @rsvd: Must be zero.
> + * @retval: Return value from the memory device (output).
> + * @size_in: Size of the payload to provide to the device (input).
> + * @size_out: Size of the payload received from the device (input/output). 
> This
> + * field is filled in by userspace to let the driver know how much
> + * space was allocated for output. It is populated by the driver to
> + * let userspace know how large the output payload actually was.
> + * @in_payload: Pointer to memory for payload input (little endian order).
> + * @out_payload: Pointer to memory for payload output (little endian order).
> + *
> + * Mechanism for userspace to send a command to the hardware for processing. 
> The
> + * driver will do basic validation on the command sizes. In some cases even 
> the
> + * payload may be introspected. Userspace is required to allocate large
> + * enough buffers for size_out which can be variable length in certain
> + * situations.
> + */
I think this structure has some gaps in it (running `pahole` on it
would help show them):

> +struct cxl_send_command {
> + __u32 id;
> + __u32 flags;
> + __u32 rsvd;
> + __u32 retval;
> +
> + struct {
> + __s32 size_in;

Here. Maybe just add:

__u32 rsv_2;
> + __u64 in_payload;
> + };
> +
> + struct {
> + __s32 size_out;

And here. Maybe just add:
__u32 rsv_2;
> + __u64 out_payload;
> + };
> +};

Perhaps to prepare for the future where this may need to be expanded, you
could add a size at the start of the structure, and maybe what version of
the structure it is?

Maybe do that for all the new structs you are adding?
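
To make those two suggestions concrete, a padded and versioned layout
could look roughly like this (sketch only - the size/version scheme and
the rsvd_* field names are illustrative, not a proposal for the actual
UAPI):

struct cxl_send_command {
	__u32 size;		/* sizeof(struct cxl_send_command), for future growth */
	__u32 version;		/* layout version, starts at 1 */

	__u32 id;
	__u32 flags;
	__u32 rsvd;
	__u32 retval;

	struct {
		__s32 size_in;
		__u32 rsvd_in;	/* explicit pad so in_payload is 8-byte aligned */
		__u64 in_payload;
	};

	struct {
		__s32 size_out;
		__u32 rsvd_out;	/* explicit pad so out_payload is 8-byte aligned */
		__u64 out_payload;
	};
};

With the explicit pads there are no compiler-inserted holes, and the
size/version pair gives the ioctl room to grow later.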


Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-01 Thread Konrad Rzeszutek Wilk
On Mon, Feb 01, 2021 at 09:50:41AM -0800, Ben Widawsky wrote:
> On 21-02-01 12:41:36, Konrad Rzeszutek Wilk wrote:
> > > +static int cxl_mem_setup_regs(struct cxl_mem *cxlm)
> > > +{
> > > + struct device *dev = &cxlm->pdev->dev;
> > > + int cap, cap_count;
> > > + u64 cap_array;
> > > +
> > > + cap_array = readq(cxlm->regs + CXLDEV_CAP_ARRAY_OFFSET);
> > > + if (CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_ID) != 
> > > CXLDEV_CAP_ARRAY_CAP_ID)
> > > + return -ENODEV;
> > > +
> > > + cap_count = CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_COUNT);
> > > +
> > > + for (cap = 1; cap <= cap_count; cap++) {
> > > + void __iomem *register_block;
> > > + u32 offset;
> > > + u16 cap_id;
> > > +
> > > + cap_id = readl(cxlm->regs + cap * 0x10) & 0xffff;
> > > + offset = readl(cxlm->regs + cap * 0x10 + 0x4);
> > > + register_block = cxlm->regs + offset;
> > > +
> > > + switch (cap_id) {
> > > + case CXLDEV_CAP_CAP_ID_DEVICE_STATUS:
> > > + dev_dbg(dev, "found Status capability (0x%x)\n",
> > > + offset);
> > 
> > That 80 character limit is no longer a requirement. Can you just make
> > this one line? And perhaps change 'found' to 'Found' ?
> > 
> 
> Funny that.
> https://lore.kernel.org/linux-cxl/2020073449.ga16...@infradead.org/

 "If there is a good reason to go against the
style (a line which becomes far less readable if split to fit within the
80-column limit, for example), just do it.
"

I would say that having an offset on its own line is kind of silly.

> 
> > > + cxlm->status.regs = register_block;
> > > + break;
> > > + case CXLDEV_CAP_CAP_ID_PRIMARY_MAILBOX:
> > > + dev_dbg(dev, "found Mailbox capability (0x%x)\n",
> > > + offset);
> > > + cxlm->mbox.regs = register_block;
> > > + break;
> > > + case CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX:
> > > + dev_dbg(dev,
> > > + "found Secondary Mailbox capability (0x%x)\n",
> > > + offset);
> > > + break;
> > > + case CXLDEV_CAP_CAP_ID_MEMDEV:
> > > + dev_dbg(dev, "found Memory Device capability (0x%x)\n",
> > > + offset);
> > > + cxlm->mem.regs = register_block;
> > > + break;
> > > + default:
> > > + dev_warn(dev, "Unknown cap ID: %d (0x%x)\n", cap_id,
> > > +  offset);
> > > + break;
> > > + }
> > > + }
> > > +
> > > + if (!cxlm->status.regs || !cxlm->mbox.regs || !cxlm->mem.regs) {
> > > + dev_err(dev, "registers not found: %s%s%s\n",
> > > + !cxlm->status.regs ? "status " : "",
> > > + !cxlm->mbox.regs ? "mbox " : "",
> > > + !cxlm->mem.regs ? "mem" : "");
> > > + return -ENXIO;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > +{
> > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > +
> > > + cxlm->mbox.payload_size =
> > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > +
> > 
> > I think the static analyzers are not going to be happy that you are not
> > checking the value of `cap` before using it.
> > 
> > Perhaps you should check that first before doing the manipulations?
> > 
> 
> I'm not following the request. CXL_GET_FIELD is just doing the shift and mask 
> on
> cap.
> 
> Can you explain what you're hoping to see?

My thought was that cxl_read_mbox_reg32 could give you -1 (which would be
wacky, but we live in a world where devices do not always give the right
values).

Then your payload_size bit shifting will get bent out of shape: you are
effectively casting cap to unsigned, which means it will end up being
0xffffffff, and your bit shifting ends up with a bunch of zeros at the
end.
> 
> > > + /* 8.2.8.4.3 */
> > > + if (cxlm->mbox.payload_size < 256) {

If this ends up being cast back to signed, this conditional will
catch it (-4096 < 256, true).

But if it does not (so 0xfffff000 < 256, false), then this wacky
number will pass this check and you may reference a payload_size that is
larger than reality and copy the wrong set of values (buffer
overflow).

So what I was thinking is that you want to check `cap` to make sure it
is not negative nor too large?

> > 
> > #define for 256?
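
For what it's worth, the check plus #define being asked for might look
roughly like this (illustrative only - the 256-byte floor is the one the
comment above refers to, the 1 MB soft cap is just a defensive example):

#define CXL_MBOX_PAYLOAD_MIN	256
#define CXL_MBOX_PAYLOAD_MAX	SZ_1M

	cxlm->mbox.payload_size =
		1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);

	if (cxlm->mbox.payload_size > CXL_MBOX_PAYLOAD_MAX)
		cxlm->mbox.payload_size = CXL_MBOX_PAYLOAD_MAX;

	if (cxlm->mbox.payload_size < CXL_MBOX_PAYLOAD_MIN) {
		dev_err(&cxlm->pdev->dev, "invalid mailbox payload size %zu\n",
			cxlm->mbox.payload_size);
		return -ENXIO;
	}

A bogus (e.g. all-ones) capability register then gets clamped or rejected
instead of silently producing a huge payload_size.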


Re: [PATCH 04/14] cxl/mem: Implement polled mode mailbox

2021-02-01 Thread Konrad Rzeszutek Wilk
> +#define cxl_doorbell_busy(cxlm)  
>   \
> + (cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) &\
> +  CXLDEV_MB_CTRL_DOORBELL)
> +
> +#define CXL_MAILBOX_TIMEOUT_US 2000

You have been using the spec for the values. Is that number also from it?

> +
> +enum opcode {
> + CXL_MBOX_OP_IDENTIFY= 0x4000,
> + CXL_MBOX_OP_MAX = 0x10000
> +};
> +
> +/**
> + * struct mbox_cmd - A command to be submitted to hardware.
> + * @opcode: (input) The command set and command submitted to hardware.
> + * @payload_in: (input) Pointer to the input payload.
> + * @payload_out: (output) Pointer to the output payload. Must be allocated by
> + *the caller.
> + * @size_in: (input) Number of bytes to load from @payload.
> + * @size_out: (output) Number of bytes loaded into @payload.
> + * @return_code: (output) Error code returned from hardware.
> + *
> + * This is the primary mechanism used to send commands to the hardware.
> + * All the fields except @payload_* correspond exactly to the fields 
> described in
> + * Command Register section of the CXL 2.0 spec (8.2.8.4.5). @payload_in and
> + * @payload_out are written to, and read from the Command Payload Registers
> + * defined in (8.2.8.4.8).
> + */
> +struct mbox_cmd {
> + u16 opcode;
> + void *payload_in;
> + void *payload_out;

On a 32-bit OS (not that we use those much anymore, but let's assume
someone really wants to), a void pointer is 4 bytes, while on 64-bit it
is 8 bytes.

`pahole` is your friend as I think there is a gap between opcode and
payload_in in the structure.

> + size_t size_in;
> + size_t size_out;

And those can also change depending on 32-bit/64-bit.

> + u16 return_code;
> +#define CXL_MBOX_SUCCESS 0
> +};

Do you want to use __packed to match the spec?

Ah, reading further along, you don't care about it.

In that case may I recommend you move 'return_code' (or perhaps just
call it rc?) to be right after opcode? Fewer gaps in that structure.
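
For illustration, the reordering would be something like this (same
fields as the quoted struct, just regrouped - not the actual patch):

struct mbox_cmd {
	u16 opcode;
	u16 return_code;	/* the two u16s now share one word */
#define CXL_MBOX_SUCCESS 0
	void *payload_in;
	void *payload_out;
	size_t size_in;
	size_t size_out;
};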

> +
> +static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> +{
> + const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
> + const unsigned long start = jiffies;
> + unsigned long end = start;
> +
> + while (cxl_doorbell_busy(cxlm)) {
> + end = jiffies;
> +
> + if (time_after(end, start + timeout)) {
> + /* Check again in case preempted before timeout test */
> + if (!cxl_doorbell_busy(cxlm))
> + break;
> + return -ETIMEDOUT;
> + }
> + cpu_relax();
> + }

Hm, that is not very scheduler-friendly. I mean, we are sitting here for
up to 2000us (2 ms) - that is quite a lot of time spinning.

Should this perhaps be put in a workqueue?
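
For illustration, a sleeping poll could look roughly like this (sketch
only, reusing the register names from the patch; readl_poll_timeout() is
from <linux/iopoll.h> and sleeps between reads instead of spinning - it
still polls rather than deferring to a workqueue):

static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
{
	u32 ctrl;

	return readl_poll_timeout(cxlm->mbox.regs + CXLDEV_MB_CTRL_OFFSET,
				  ctrl, !(ctrl & CXLDEV_MB_CTRL_DOORBELL),
				  100 /* sleep ~100us between polls */,
				  CXL_MAILBOX_TIMEOUT_US);
}
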
> +
> + dev_dbg(&cxlm->pdev->dev, "Doorbell wait took %dms",
> + jiffies_to_msecs(end) - jiffies_to_msecs(start));
> + return 0;
> +}
> +
> +static void cxl_mem_mbox_timeout(struct cxl_mem *cxlm,
> +  struct mbox_cmd *mbox_cmd)
> +{
> + dev_warn(&cxlm->pdev->dev, "Mailbox command timed out\n");
> + dev_info(&cxlm->pdev->dev,
> +  "\topcode: 0x%04x\n"
> +  "\tpayload size: %zub\n",
> +  mbox_cmd->opcode, mbox_cmd->size_in);
> +
> + if (IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
> + print_hex_dump_debug("Payload ", DUMP_PREFIX_OFFSET, 16, 1,
> +  mbox_cmd->payload_in, mbox_cmd->size_in,
> +  true);
> + }
> +
> + /* Here's a good place to figure out if a device reset is needed */
> +}
> +
> +/**
> + * cxl_mem_mbox_send_cmd() - Send a mailbox command to a memory device.
> + * @cxlm: The CXL memory device to communicate with.
> + * @mbox_cmd: Command to send to the memory device.
> + *
> + * Context: Any context. Expects mbox_lock to be held.
> + * Return: -ETIMEDOUT if timeout occurred waiting for completion. 0 on 
> success.
> + * Caller should check the return code in @mbox_cmd to make sure it
> + * succeeded.
> + *
> + * This is a generic form of the CXL mailbox send command, thus the only I/O
> + * operations used are cxl_read_mbox_reg(). Memory devices, and perhaps other
> + * types of CXL devices may have further information available upon error
> + * conditions.
> + *
> + * The CXL spec allows for up to two mailboxes. The intention is for the 
> primary
> + * mailbox to be OS controlled and the secondary mailbox to be used by system
> + * firmware. This allows the OS and firmware to communicate with the device 
> and
> + * not need to coordinate with each other. The driver only uses the primary
> + * mailbox.
> + */
> +static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
> +  struct mbox_cmd *mbox_cmd)
> +{
> + void __iomem *payload = cxlm->mbox.regs + CXLDEV_MB_PAYLOAD_OFFSET;
> +   

Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-01 Thread Konrad Rzeszutek Wilk
> +static int cxl_mem_setup_regs(struct cxl_mem *cxlm)
> +{
> + struct device *dev = &cxlm->pdev->dev;
> + int cap, cap_count;
> + u64 cap_array;
> +
> + cap_array = readq(cxlm->regs + CXLDEV_CAP_ARRAY_OFFSET);
> + if (CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_ID) != 
> CXLDEV_CAP_ARRAY_CAP_ID)
> + return -ENODEV;
> +
> + cap_count = CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_COUNT);
> +
> + for (cap = 1; cap <= cap_count; cap++) {
> + void __iomem *register_block;
> + u32 offset;
> + u16 cap_id;
> +
> + cap_id = readl(cxlm->regs + cap * 0x10) & 0xffff;
> + offset = readl(cxlm->regs + cap * 0x10 + 0x4);
> + register_block = cxlm->regs + offset;
> +
> + switch (cap_id) {
> + case CXLDEV_CAP_CAP_ID_DEVICE_STATUS:
> + dev_dbg(dev, "found Status capability (0x%x)\n",
> + offset);

That 80-character limit is no longer a requirement. Can you just make
this one line? And perhaps change 'found' to 'Found'?

> + cxlm->status.regs = register_block;
> + break;
> + case CXLDEV_CAP_CAP_ID_PRIMARY_MAILBOX:
> + dev_dbg(dev, "found Mailbox capability (0x%x)\n",
> + offset);
> + cxlm->mbox.regs = register_block;
> + break;
> + case CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX:
> + dev_dbg(dev,
> + "found Secondary Mailbox capability (0x%x)\n",
> + offset);
> + break;
> + case CXLDEV_CAP_CAP_ID_MEMDEV:
> + dev_dbg(dev, "found Memory Device capability (0x%x)\n",
> + offset);
> + cxlm->mem.regs = register_block;
> + break;
> + default:
> + dev_warn(dev, "Unknown cap ID: %d (0x%x)\n", cap_id,
> +  offset);
> + break;
> + }
> + }
> +
> + if (!cxlm->status.regs || !cxlm->mbox.regs || !cxlm->mem.regs) {
> + dev_err(dev, "registers not found: %s%s%s\n",
> + !cxlm->status.regs ? "status " : "",
> + !cxlm->mbox.regs ? "mbox " : "",
> + !cxlm->mem.regs ? "mem" : "");
> + return -ENXIO;
> + }
> +
> + return 0;
> +}
> +
> +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> +{
> + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> +
> + cxlm->mbox.payload_size =
> + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> +

I think the static analyzers are not going to be happy that you are not
checking the value of `cap` before using it.

Perhaps you should check that first before doing the manipulations?

> + /* 8.2.8.4.3 */
> + if (cxlm->mbox.payload_size < 256) {

#define for 256?


Re: [PATCH 02/14] cxl/mem: Map memory device registers

2021-02-01 Thread Konrad Rzeszutek Wilk


>  
> - return 0;
> + rc = -ENXIO;
> + for (i = regloc; i < regloc + 0x24; i += 8) {

This 0x24, 8, and
> + u32 reg_lo, reg_hi;
> + u8 reg_type;
> +
> + /* "register low and high" contain other bits */
> + pci_read_config_dword(pdev, i, &reg_lo);
> + pci_read_config_dword(pdev, i + 4, &reg_hi);

4

scream of #define's or sizeof()?
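
For example (the macro names are invented purely for illustration; the
numeric values are the ones from the quoted hunk):

#define CXL_REGLOC_SIZE		0x24	/* size of the register locator block */
#define CXL_REGLOC_ENTRY_SIZE	8	/* each entry is two dwords */

	for (i = regloc; i < regloc + CXL_REGLOC_SIZE; i += CXL_REGLOC_ENTRY_SIZE) {
		u32 reg_lo, reg_hi;

		pci_read_config_dword(pdev, i, &reg_lo);
		pci_read_config_dword(pdev, i + sizeof(reg_lo), &reg_hi);
		/* ... */
	}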


Re: [PATCH 01/14] cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints

2021-02-01 Thread Konrad Rzeszutek Wilk
> +cxl_mem-y := mem.o
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> new file mode 100644
> index ..f4ee9a507ac9
> --- /dev/null
> +++ b/drivers/cxl/mem.c
> @@ -0,0 +1,61 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2020 Intel Corporation. All rights reserved. */

Can those two comments have the same style? As in, either stay with //
or do /*.

Also the year is incorrect.


Re: [PATCH 2/3] Add swiotlb offset preserving mapping when dma_dma_parameters->page_offset_mask is non zero.

2021-01-28 Thread Konrad Rzeszutek Wilk
On Wed, Jan 27, 2021 at 04:38:28PM -0800, Jianxiong Gao wrote:
> For devices that need to preserve address offset on mapping through
> swiotlb, this patch adds offset preserving based on page_offset_mask
> and keeps the offset if the mask is non zero. This is needed for
> device drivers like NVMe.



Didn't you send this patch like a month ago and someone pointed
out that the right fix would be in the NVMe driver?

Is there an issue with fixing the NVMe driver?

> 
> Signed-off-by: Jianxiong Gao 
> ---
>  kernel/dma/swiotlb.c | 25 ++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 7c42df6e6100..4cab35f2c9bc 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -468,7 +468,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, 
> phys_addr_t orig_addr,
>   dma_addr_t tbl_dma_addr = phys_to_dma_unencrypted(hwdev, io_tlb_start);
>   unsigned long flags;
>   phys_addr_t tlb_addr;
> - unsigned int nslots, stride, index, wrap;
> + unsigned int nslots, stride, index, wrap, page_offset_mask, page_offset;
>   int i;
>   unsigned long mask;
>   unsigned long offset_slots;
> @@ -500,12 +500,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device 
> *hwdev, phys_addr_t orig_addr,
>   ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
>   : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
>  
> + page_offset_mask = dma_get_page_offset_mask(hwdev);
> + page_offset = orig_addr & page_offset_mask;
> + alloc_size += page_offset;
> +
>   /*
>* For mappings greater than or equal to a page, we limit the stride
>* (and hence alignment) to a page size.
>*/
>   nslots = ALIGN(alloc_size, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
> - if (alloc_size >= PAGE_SIZE)
> + if ((alloc_size >= PAGE_SIZE) || (page_offset_mask > (1 << 
> IO_TLB_SHIFT)))
>   stride = (1 << (PAGE_SHIFT - IO_TLB_SHIFT));
>   else
>   stride = 1;
> @@ -583,6 +587,11 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, 
> phys_addr_t orig_addr,
>*/
>   for (i = 0; i < nslots; i++)
>   io_tlb_orig_addr[index+i] = orig_addr + (i << IO_TLB_SHIFT);
> + /*
> +  * When keeping the offset of the original data, we need to advance
> +  * the tlb_addr by the offset of orig_addr.
> +  */
> + tlb_addr += page_offset;
>   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
>   (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
>   swiotlb_bounce(orig_addr, tlb_addr, mapping_size, 
> DMA_TO_DEVICE);
> @@ -598,7 +607,9 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, 
> phys_addr_t tlb_addr,
> enum dma_data_direction dir, unsigned long attrs)
>  {
>   unsigned long flags;
> - int i, count, nslots = ALIGN(alloc_size, 1 << IO_TLB_SHIFT) >> 
> IO_TLB_SHIFT;
> + unsigned int num_page_offset_slabs, page_offset_mask = 
> dma_get_page_offset_mask(hwdev);
> + int i, count;
> + int nslots = ALIGN(alloc_size + tlb_addr & page_offset_mask, 1 << 
> IO_TLB_SHIFT) >> IO_TLB_SHIFT;
>   int index = (tlb_addr - io_tlb_start) >> IO_TLB_SHIFT;
>   phys_addr_t orig_addr = io_tlb_orig_addr[index];
>  
> @@ -610,6 +621,14 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, 
> phys_addr_t tlb_addr,
>   ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
>   swiotlb_bounce(orig_addr, tlb_addr, mapping_size, 
> DMA_FROM_DEVICE);
>  
> + /*
> +  * When dma_get_page_offset_mask is used, we may have padded more slabs
> +  * when padding exceeds one slab. We need to move index back to the
> +  * beginning of the padding.
> +  */
> + num_page_offset_slabs =  (tlb_addr & page_offset_mask) / (1 << 
> IO_TLB_SHIFT);
> + index -= num_page_offset_slabs;
> +
>   /*
>* Return the buffer to the free list by setting the corresponding
>* entries to indicate the number of contiguous entries available.
> -- 
> 2.27.0
> 


Re: [PATCH 1/1] iscsi_ibft: KASAN false positive failure occurs in ibft_init()

2021-01-27 Thread Konrad Rzeszutek Wilk
On Tue, Jan 26, 2021 at 01:03:21PM -0500, George Kennedy wrote:
> During boot of kernel with CONFIG_KASAN the following KASAN false
> positive failure will occur when ibft_init() reads the
> ACPI iBFT table: BUG: KASAN: use-after-free in ibft_init
> 
> The ACPI iBFT table is not allocated, and the iscsi driver uses
> a pointer to it to calculate checksum, etc. KASAN complains
> about this pointer with use-after-free, which this is not.
> 

Andrey, Alexander, Dmitry,

I think this is the right way for this, but was wondering if you have
other suggestions?

Thanks!
> Signed-off-by: George Kennedy 
> ---
>  drivers/firmware/Makefile | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
> index 5e013b6..30ddab5 100644
> --- a/drivers/firmware/Makefile
> +++ b/drivers/firmware/Makefile
> @@ -14,6 +14,9 @@ obj-$(CONFIG_INTEL_STRATIX10_SERVICE) += stratix10-svc.o
>  obj-$(CONFIG_INTEL_STRATIX10_RSU) += stratix10-rsu.o
>  obj-$(CONFIG_ISCSI_IBFT_FIND)+= iscsi_ibft_find.o
>  obj-$(CONFIG_ISCSI_IBFT) += iscsi_ibft.o
> +KASAN_SANITIZE_iscsi_ibft.o := n
> +KCOV_INSTRUMENT_iscsi_ibft.o := n
> +
>  obj-$(CONFIG_FIRMWARE_MEMMAP)+= memmap.o
>  obj-$(CONFIG_RASPBERRYPI_FIRMWARE) += raspberrypi.o
>  obj-$(CONFIG_FW_CFG_SYSFS)   += qemu_fw_cfg.o
> -- 
> 1.8.3.1
> 


Re: [PATCH 2/5] kernel/dma: remove unnecessary unmap_kernel_range

2021-01-26 Thread Konrad Rzeszutek Wilk
On Tue, Jan 26, 2021 at 02:54:01PM +1000, Nicholas Piggin wrote:
> vunmap will remove ptes.

Should there be some ASSERT after the vunmap to make sure that is the
case? 
> 
> Cc: Christoph Hellwig 
> Cc: Marek Szyprowski 
> Cc: Robin Murphy 
> Cc: io...@lists.linux-foundation.org
> Signed-off-by: Nicholas Piggin 
> ---
>  kernel/dma/remap.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c
> index 905c3fa005f1..b4526668072e 100644
> --- a/kernel/dma/remap.c
> +++ b/kernel/dma/remap.c
> @@ -66,6 +66,5 @@ void dma_common_free_remap(void *cpu_addr, size_t size)
>   return;
>   }
>  
> - unmap_kernel_range((unsigned long)cpu_addr, PAGE_ALIGN(size));
>   vunmap(cpu_addr);
>  }
> -- 
> 2.23.0
> 
> ___
> iommu mailing list
> io...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
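
A minimal sketch of the kind of post-vunmap() check being asked about above; the
helper name is invented and this is only an illustration, not part of the patch:

        #include <linux/vmalloc.h>
        #include <linux/bug.h>

        /* Debug aid: vmalloc_to_page() walks the kernel page tables and returns
         * NULL once the PTEs for the address are gone, so anything non-NULL
         * after vunmap() would mean the PTEs were not removed. */
        static void checked_vunmap(void *cpu_addr)
        {
                vunmap(cpu_addr);
                WARN_ON_ONCE(vmalloc_to_page(cpu_addr));
        }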


Re: [PATCH] powerpc/mm: Limit allocation of SWIOTLB on server machines

2021-01-21 Thread Konrad Rzeszutek Wilk
On Fri, Jan 08, 2021 at 09:27:01PM -0300, Thiago Jung Bauermann wrote:
> 
> Ram Pai  writes:
> 
> > On Wed, Dec 23, 2020 at 09:06:01PM -0300, Thiago Jung Bauermann wrote:
> >> 
> >> Hi Ram,
> >> 
> >> Thanks for reviewing this patch.
> >> 
> >> Ram Pai  writes:
> >> 
> >> > On Fri, Dec 18, 2020 at 03:21:03AM -0300, Thiago Jung Bauermann wrote:
> >> >> On server-class POWER machines, we don't need the SWIOTLB unless we're a
> >> >> secure VM. Nevertheless, if CONFIG_SWIOTLB is enabled we unconditionally
> >> >> allocate it.
> >> >> 
> >> >> In most cases this is harmless, but on a few machine configurations 
> >> >> (e.g.,
> >> >> POWER9 powernv systems with 4 GB area reserved for crashdump kernel) it 
> >> >> can
> >> >> happen that memblock can't find a 64 MB chunk of memory for the SWIOTLB 
> >> >> and
> >> >> fails with a scary-looking WARN_ONCE:
> >> >> 
> >> >>  [ cut here ]
> >> >>  memblock: bottom-up allocation failed, memory hotremove may be affected
> >> >>  WARNING: CPU: 0 PID: 0 at mm/memblock.c:332 
> >> >> memblock_find_in_range_node+0x328/0x340
> >> >>  Modules linked in:
> >> >>  CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0-rc2-orig+ #6
> >> >>  NIP:  c0442f38 LR: c0442f34 CTR: c01e0080
> >> >>  REGS: c1def900 TRAP: 0700   Not tainted  (5.10.0-rc2-orig+)
> >> >>  MSR:  92021033   CR: 2802  XER: 
> >> >> 2004
> >> >>  CFAR: c014b7b4 IRQMASK: 1
> >> >>  GPR00: c0442f34 c1defba0 c1deff00 
> >> >> 0047
> >> >>  GPR04: 7fff c1def828 c1def820 
> >> >> 
> >> >>  GPR08: 001ffc3e c1b75478 c1b75478 
> >> >> 0001
> >> >>  GPR12: 2000 c203  
> >> >> 
> >> >>  GPR16:    
> >> >> 0203
> >> >>  GPR20:  0001 0001 
> >> >> c1defc10
> >> >>  GPR24: c1defc08 c1c91868 c1defc18 
> >> >> c1c91890
> >> >>  GPR28:   0400 
> >> >> 
> >> >>  NIP [c0442f38] memblock_find_in_range_node+0x328/0x340
> >> >>  LR [c0442f34] memblock_find_in_range_node+0x324/0x340
> >> >>  Call Trace:
> >> >>  [c1defba0] [c0442f34] 
> >> >> memblock_find_in_range_node+0x324/0x340 (unreliable)
> >> >>  [c1defc90] [c15ac088] 
> >> >> memblock_alloc_range_nid+0xec/0x1b0
> >> >>  [c1defd40] [c15ac1f8] 
> >> >> memblock_alloc_internal+0xac/0x110
> >> >>  [c1defda0] [c15ac4d0] memblock_alloc_try_nid+0x94/0xcc
> >> >>  [c1defe30] [c159c3c8] swiotlb_init+0x78/0x104
> >> >>  [c1defea0] [c158378c] mem_init+0x4c/0x98
> >> >>  [c1defec0] [c157457c] start_kernel+0x714/0xac8
> >> >>  [c1deff90] [c000d244] start_here_common+0x1c/0x58
> >> >>  Instruction dump:
> >> >>  2c23 4182ffd4 ea610088 ea810090 4bfffe84 3921 3d42fff4 3c62ff60
> >> >>  3863c560 992a8bfc 4bd0881d 6000 <0fe0> ea610088 4bfffd94 
> >> >> 6000
> >> >>  random: get_random_bytes called from __warn+0x128/0x184 with 
> >> >> crng_init=0
> >> >>  ---[ end trace  ]---
> >> >>  software IO TLB: Cannot allocate buffer
> >> >> 
> >> >> Unless this is a secure VM the message can actually be ignored, because 
> >> >> the
> >> >> SWIOTLB isn't needed. Therefore, let's avoid the SWIOTLB in those cases.
> >> >
> >> > The above warn_on is conveying a genuine warning. Should it be silenced?
> >> 
> >> Not sure I understand your point. This patch doesn't silence the
> >> warning, it avoids the problem it is warning about.
> >
> > Sorry, I should have explained it better. My point is...  
> >
> > If CONFIG_SWIOTLB is enabled, it means that the kernel is
> > promising the bounce buffering capability. I know, currently we
> > do not have any kernel subsystems that use bounce buffers on
> > non-secure-pseries-kernel or powernv-kernel.  But that does not
> > mean, there wont be any. In case there is such a third-party
> > module needing bounce buffering, it wont be able to operate,
> > because of the proposed change in your patch.
> >
> > Is that a good thing or a bad thing, I do not know. I will let
> > the experts opine.
> 
> Ping? Does anyone else has an opinion on this? The other option I can
> think of is changing the crashkernel code to not reserve so much memory
> below 4 GB. Other people are considering this option, but it's not
> planned for the near future.

That seems a more suitable solution regardless, but there is always
the danger of it not being enough or of it being too big.

There were some automatic crashkernel allocation patches going around
for x86 and ARM that perhaps could be re-used?

> 
> Also, there's a patch currently in linux-next which 

Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-01-18 Thread Konrad Rzeszutek Wilk
On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> > > The size of the buffer being bounced is not checked if it happens
> > > to be larger than the size of the mapped buffer. Because the size
> > > can be controlled by a device, as it's the case with virtio devices,
> > > this can lead to memory corruption.
> > > 
> > 
> > I'm really worried about all these hodge podge hacks for not trusted
> > hypervisors in the I/O stack.  Instead of trying to harden protocols
> > that are fundamentally not designed for this, how about instead coming
> > up with a new paravirtualized I/O interface that is specifically
> > designed for use with an untrusted hypervisor from the start?
> 
> Your comment makes sense but then that would require the cooperation
> of these vendors and the cloud providers to agree on something meaningful.
> I am also not sure whether the end result would be better than hardening
> this interface to catch corruption. There is already some validation in
> unmap path anyway.
> 
> Another possibility is to move this hardening to the common virtio code,
> but I think the code may become more complicated there since it would
> require tracking both the dma_addr and length for each descriptor.

Christoph,

I've been wrestling with the same thing - this is specific to busted
drivers. And in reality you could do the same thing with a hardware
virtio device (see example in http://thunderclap.io/) - where the
mitigation is 'enable the IOMMU to do its job.'.

AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP)..
and while that is great in the future, SEV without IOMMU is now here.

Doing a full circle here, this issue can be exploited with virtio
but you could say do that with real hardware too if you hacked the
firmware, so if you say used Intel SR-IOV NIC that was compromised
on an AMD SEV machine, and plumbed in the guest - the IOMMU inside
of the guest would be the SWIOTLB code. The last line of defense against
bad firmware, so to speak.

As such I am leaning towards taking this code, but I am worried
about the performance hit .. but perhaps I shouldn't be, as if you
are using SWIOTLB=force you are already kind of taking a
performance hit anyway?
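
For reference, a rough sketch of the kind of bounce-size check being discussed;
the io_tlb_orig_size[] bookkeeping is an assumption here (the patch under review
adds its own tracking), so treat this purely as an illustration:

        /* Assumed bookkeeping: one entry per slot, filled in at map time with
         * the size that was actually mapped. */
        static size_t *io_tlb_orig_size;

        static bool swiotlb_bounce_size_ok(phys_addr_t tlb_addr, size_t size)
        {
                int index = (tlb_addr - io_tlb_start) >> IO_TLB_SHIFT;

                /* Never copy more than was originally mapped into this slot. */
                return size <= io_tlb_orig_size[index];
        }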



Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-07 Thread Konrad Rzeszutek Wilk
On Thu, Jan 07, 2021 at 10:09:14AM -0800, Florian Fainelli wrote:
> On 1/7/21 9:57 AM, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jan 08, 2021 at 01:39:18AM +0800, Claire Chang wrote:
> >> Hi Greg and Konrad,
> >>
> >> This change is intended to be non-arch specific. Any arch that lacks DMA 
> >> access
> >> control and has devices not behind an IOMMU can make use of it. Could you 
> >> share
> >> why you think this should be arch specific?
> > 
> > The idea behind non-arch specific code is for it to be generic. The devicetree
> > is specific to PowerPC, Sparc, and ARM, and not to x86 - hence it should
> > be in arch specific code.
> 
> In premise the same code could be used with an ACPI enabled system with
> an appropriate service to identify the restricted DMA regions and unlock
> them.

Which this patchset is not.

> 
> More than 1 architecture requiring this function (ARM and ARM64 are the
> two I can think of needing this immediately) sort of calls for making
> the code architecture agnostic since past 2, you need something that scales.

I believe the use-case is for ARM64 at this moment.

> 
> There is already code today under kernel/dma/contiguous.c that is only
> activated on a CONFIG_OF=y && CONFIG_OF_RESERVED_MEM=y system, this is
> no different.
> -- 
> Florian


Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool

2021-01-07 Thread Konrad Rzeszutek Wilk
On Fri, Jan 08, 2021 at 01:39:43AM +0800, Claire Chang wrote:
> On Thu, Jan 7, 2021 at 2:58 AM Konrad Rzeszutek Wilk
>  wrote:
> >
> > On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang wrote:
> > > Introduce the new compatible string, restricted-dma-pool, for restricted
> > > DMA. One can specify the address and length of the restricted DMA memory
> > > region by restricted-dma-pool in the device tree.
> > >
> > > Signed-off-by: Claire Chang 
> > > ---
> > >  .../reserved-memory/reserved-memory.txt   | 24 +++
> > >  1 file changed, 24 insertions(+)
> > >
> > > diff --git 
> > > a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt 
> > > b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
> > > index e8d3096d922c..44975e2a1fd2 100644
> > > --- 
> > > a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
> > > +++ 
> > > b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
> > > @@ -51,6 +51,20 @@ compatible (optional) - standard definition
> > >used as a shared pool of DMA buffers for a set of devices. It 
> > > can
> > >be used by an operating system to instantiate the necessary 
> > > pool
> > >management subsystem if necessary.
> > > +- restricted-dma-pool: This indicates a region of memory meant 
> > > to be
> > > +  used as a pool of restricted DMA buffers for a set of devices. 
> > > The
> > > +  memory region would be the only region accessible to those 
> > > devices.
> > > +  When using this, the no-map and reusable properties must not 
> > > be set,
> > > +  so the operating system can create a virtual mapping that will 
> > > be used
> > > +  for synchronization. The main purpose for restricted DMA is to
> > > +  mitigate the lack of DMA access control on systems without an 
> > > IOMMU,
> > > +  which could result in the DMA accessing the system memory at
> > > +  unexpected times and/or unexpected addresses, possibly leading 
> > > to data
> > > +  leakage or corruption. The feature on its own provides a basic 
> > > level
> > > +  of protection against the DMA overwriting buffer contents at
> > > +  unexpected times. However, to protect against general data 
> > > leakage and
> > > +  system memory corruption, the system needs to provide way to 
> > > restrict
> > > +  the DMA to a predefined memory region.
> >
> > Heya!
> >
> > I think I am missing something obvious here so please bear with my
> > questions:
> >
> >  - This code adds the means of having the SWIOTLB pool tied to a specific
> >memory correct?
> 
> It doesn't affect the existing SWIOTLB. It just utilizes the existing SWIOTLB
> code to create another DMA pool tied to a specific memory region for a given 
> set
> of devices. It bounces the streaming DMA (map/unmap) in and out of that region
> and does the memory allocation (dma_direct_alloc) from the same region.

Right, so why can't it follow the same mechanism that Xen SWIOTLB does - which
had exactly the same problem (needed special handling on the pool) - and use
similar code?

> 
> >
> >
> >  - Nothing stops the physical device from bypassing the SWIOTLB buffer.
> >That is if an errant device screwed up the length or DMA address, the
> >SWIOTLB would gladly do what the device told it to do?
> 
> So the system needs to provide a way to lock down the memory access, e.g. MPU.

OK! Would it be prudent to have this in the description above perhaps?
> 
> >
> >  - This has to be combined with SWIOTLB-force-ish to always use the
> >bounce buffer, otherwise you could still do DMA without using
> >SWIOTLB (by not hitting the criteria for needing to use SWIOTLB)?
> 
> Since restricted DMA is for the devices that are not behind an IOMMU, I change
> the criteria
> `if (unlikely(swiotlb_force == SWIOTLB_FORCE))`
> to
> `if (unlikely(swiotlb_force == SWIOTLB_FORCE) || dev->dma_io_tlb_mem)`
> in dma_direct_map_page().
> 
> Also, even if SWIOTLB=force, the restricted DMA pool is preferred if available
> (get_io_tlb_mem in https://lore.kernel.org/patchwork/patch/1360995/).
> 
> Thanks!
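
For the record, a compressed sketch of the mapping-path check described above;
dev->dma_io_tlb_mem comes from the series under review (not mainline here), so
this is illustrative only:

        /* Bounce when swiotlb is globally forced, or when the device has a
         * restricted per-device pool attached. */
        static inline bool dma_direct_needs_bounce(struct device *dev)
        {
                return unlikely(swiotlb_force == SWIOTLB_FORCE) ||
                       dev->dma_io_tlb_mem;
        }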


Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-07 Thread Konrad Rzeszutek Wilk
On Fri, Jan 08, 2021 at 01:39:18AM +0800, Claire Chang wrote:
> Hi Greg and Konrad,
> 
> This change is intended to be non-arch specific. Any arch that lacks DMA 
> access
> control and has devices not behind an IOMMU can make use of it. Could you 
> share
> why you think this should be arch specific?

The idea behind non-arch specific code is for it to be generic. The devicetree
is specific to PowerPC, Sparc, and ARM, and not to x86 - hence it should
be in arch specific code.

> 
> Thanks!


Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool

2021-01-06 Thread Konrad Rzeszutek Wilk
On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang wrote:
> Introduce the new compatible string, restricted-dma-pool, for restricted
> DMA. One can specify the address and length of the restricted DMA memory
> region by restricted-dma-pool in the device tree.
> 
> Signed-off-by: Claire Chang 
> ---
>  .../reserved-memory/reserved-memory.txt   | 24 +++
>  1 file changed, 24 insertions(+)
> 
> diff --git 
> a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt 
> b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
> index e8d3096d922c..44975e2a1fd2 100644
> --- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
> +++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
> @@ -51,6 +51,20 @@ compatible (optional) - standard definition
>used as a shared pool of DMA buffers for a set of devices. It can
>be used by an operating system to instantiate the necessary pool
>management subsystem if necessary.
> +- restricted-dma-pool: This indicates a region of memory meant to be
> +  used as a pool of restricted DMA buffers for a set of devices. The
> +  memory region would be the only region accessible to those devices.
> +  When using this, the no-map and reusable properties must not be 
> set,
> +  so the operating system can create a virtual mapping that will be 
> used
> +  for synchronization. The main purpose for restricted DMA is to
> +  mitigate the lack of DMA access control on systems without an 
> IOMMU,
> +  which could result in the DMA accessing the system memory at
> +  unexpected times and/or unexpected addresses, possibly leading to 
> data
> +  leakage or corruption. The feature on its own provides a basic 
> level
> +  of protection against the DMA overwriting buffer contents at
> +  unexpected times. However, to protect against general data leakage 
> and
> +  system memory corruption, the system needs to provide way to 
> restrict
> +  the DMA to a predefined memory region.

Heya!

I think I am missing something obvious here so please bear with my
questions:

 - This code adds the means of having the SWIOTLB pool tied to a specific
   memory, correct?

 - Nothing stops the physical device from bypassing the SWIOTLB buffer.
   That is if an errant device screwed up the length or DMA address, the
   SWIOTLB would gladly do what the device told it to do?

 - This has to be combined with SWIOTLB-force-ish to always use the
   bounce buffer, otherwise you could still do DMA without using
   SWIOTLB (by not hitting the criteria for needing to use SWIOTLB)?


Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-06 Thread Konrad Rzeszutek Wilk
Hello!

In this file:

> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index e4368159f88a..7fb2ac087d23 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
..

> +static const struct reserved_mem_ops rmem_swiotlb_ops = {
> + .device_init= rmem_swiotlb_device_init,
> + .device_release = rmem_swiotlb_device_release,
> +};
> +
> +static int __init rmem_swiotlb_setup(struct reserved_mem *rmem)
> +{
> + unsigned long node = rmem->fdt_node;
> +
> + if (of_get_flat_dt_prop(node, "reusable", NULL) ||
> + of_get_flat_dt_prop(node, "linux,cma-default", NULL) ||
> + of_get_flat_dt_prop(node, "linux,dma-default", NULL) ||
> + of_get_flat_dt_prop(node, "no-map", NULL))
> + return -EINVAL;
> +
> + rmem->ops = &rmem_swiotlb_ops;
> + pr_info("Reserved memory: created device swiotlb memory pool at %pa, 
> size %ld MiB\n",
> + &rmem->base, (unsigned long)rmem->size / SZ_1M);
> + return 0;
> +}
> +
> +RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", rmem_swiotlb_setup);

The code should be as arch-agnostic as possible. That is why there
are multiple -swiotlb files scattered in arch directories that own the
architecture specific code.

Would it be possible to move the code there and perhaps have a ARM
specific front-end for this DMA restricted pool there? See for example
the xen-swiotlb code.

Cheers!

Konrad


Re: [PATCH] x86/iommu: Fix two minimal issues in check_iommu_entries()

2021-01-04 Thread Konrad Rzeszutek Wilk
On Wed, Dec 23, 2020 at 02:24:12PM +0800, Zhenzhong Duan wrote:
> check_iommu_entries() checks for cyclic dependency in iommu entries
> and fixes the cyclic dependency by setting x->depend to NULL. But
> this repairing isn't correct if q is in front of p, there will be
> "EXECUTION ORDER INVALID!" report following. Fix it by NULLing
> whichever in the front.
> 
> The second issue is about the report of execution order reversal,
> the order is reversed incorrectly in the report, fix it.

Heya!

When you debugged this, did you by any chance save the
serial logs and the debug logs to double-check it?

Thanks!
> 
> Signed-off-by: Zhenzhong Duan 
> ---
>  arch/x86/kernel/pci-iommu_table.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/pci-iommu_table.c 
> b/arch/x86/kernel/pci-iommu_table.c
> index 2e9006c..40c8249 100644
> --- a/arch/x86/kernel/pci-iommu_table.c
> +++ b/arch/x86/kernel/pci-iommu_table.c
> @@ -60,7 +60,10 @@ void __init check_iommu_entries(struct iommu_table_entry 
> *start,
>   printk(KERN_ERR "CYCLIC DEPENDENCY FOUND! %pS depends 
> on %pS and vice-versa. BREAKING IT.\n",
>  p->detect, q->detect);
>   /* Heavy handed way..*/
> - x->depend = NULL;
> + if (p > q)
> + q->depend = NULL;
> + else
> + p->depend = NULL;
>   }
>   }
>  
> @@ -68,7 +71,7 @@ void __init check_iommu_entries(struct iommu_table_entry 
> *start,
>   q = find_dependents_of(p, finish, p);
>   if (q && q > p) {
>   printk(KERN_ERR "EXECUTION ORDER INVALID! %pS should be 
> called before %pS!\n",
> -p->detect, q->detect);
> +q->detect, p->detect);
>   }
>   }
>  }
> -- 
> 1.8.3.1
> 


[GIT PULL] (swiotlb) stable/for-linus-5.11

2020-12-14 Thread Konrad Rzeszutek Wilk
Heya Linus,

Please git pull the following branch:

git pull git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git 
stable/for-linus-5.11

which has a generic (though right now wired up only for AMD SEV) mechanism to
adjust the SWIOTLB to a larger size based on the total memory of SEV guests,
which currently require the bounce buffer for interacting with the outside
world. The normal knobs (swiotlb=XYZ) still work.
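
For example (not part of the pull request text itself), the existing knob takes
a slab count, and with the default 2 KB slab size that works out to:

        swiotlb=32768           # 32768 slabs * 2 KB = 64 MB (the old default)
        swiotlb=262144,force    # 262144 slabs * 2 KB = 512 MB, plus forced bouncing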

Thank you!!

Ashish Kalra (1):
  x86,swiotlb: Adjust SWIOTLB bounce buffer size for SEV guests

 arch/x86/include/asm/mem_encrypt.h |  2 ++
 arch/x86/kernel/setup.c|  6 ++
 arch/x86/mm/mem_encrypt.c  | 31 +++
 include/linux/swiotlb.h|  8 
 kernel/dma/swiotlb.c   | 20 ++--
 5 files changed, 65 insertions(+), 2 deletions(-)


Re: [PATCH] [PATCH] Keep offset when mapping data via SWIOTLB.

2020-12-11 Thread Konrad Rzeszutek Wilk
On Mon, Dec 07, 2020 at 01:42:04PM -0800, Jianxiong Gao wrote:
> NVMe driver and other applications depend on the data offset
> to operate correctly. Currently when unaligned data is mapped via
> SWIOTLB, the data is mapped as slab aligned with the SWIOTLB. When
> booting with --swiotlb=force option and using NVMe as interface,
> running mkfs.xfs on Rhel fails because of the unalignment issue.
> This patch makes sure the mapped data preserves
> its offset of the orginal address. Tested on latest kernel that
> this patch fixes the issue.
> 
> Signed-off-by: Jianxiong Gao 
> Acked-by: David Rientjes 

This breaks DHCP with upstream kernel (applied this on top v5.10-rc7)
and used swiotlb=262144,force and now the dhclient is not working:

[  119.300502] bnxt_en :3b:00.0 eno2np0: NIC Link is Up, 25000 Mbps full 
duplex, Flow control: ON - receive & transmit
[  119.437573] bnxt_en :3b:00.0 eno2np0: FEC autoneg off encoding: None
[   90.064220] dracut-initqueue[1477]: Warning: dhcp for interface eno2np0 
failed
[  101.155295] dracut-initqueue[1477]: Warning: dhcp for interfa[  142.361359] 
bnxt_en :3b:00.1 eno3np1: NIC Link is Up, 25000 Mbps full duplex, Flow 
control: ON - receive & transmit
ce eno2np0 faile[  142.501860] bnxt_en :3b:00.1 eno3np1: FEC autoneg off 
encoding: None
d
[  113.054108] dracut-initqueue[1477]: Warning: dhcp for interface eno3np1 
failed
[  123.867108] dracut-initqueue[1477]: Warning: dhcp for interface eno3np1 
failed
[  251.888002] dracut-initqueue[1477]: Warning: dracut-initqueue timeout - 
starting timeout scripts

Dropping from linux-next.
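
As an aside, the offset problem the commit message describes can be shown with a
tiny userspace toy; the addresses below are made up and this is not kernel code:

        #include <stdio.h>

        #define TOY_PAGE_SIZE 4096UL

        /* Pick a bounce address that keeps the sub-page offset of the original
         * buffer, which is what offset-sensitive callers (e.g. NVMe PRP setup)
         * expect. */
        static unsigned long bounce(unsigned long slot_base, unsigned long orig)
        {
                return slot_base + (orig & (TOY_PAGE_SIZE - 1));
        }

        int main(void)
        {
                unsigned long orig = 0x12341a00UL;      /* offset 0xa00 in its page */
                unsigned long slot = 0x7f000000UL;      /* made-up bounce slot base */

                printf("bounced to %#lx, offset %#lx preserved\n",
                       bounce(slot, orig), orig & (TOY_PAGE_SIZE - 1));
                return 0;
        }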

> ---
>  kernel/dma/swiotlb.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 781b9dca197c..56a35e71b3fd 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -483,6 +483,12 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, 
> phys_addr_t orig_addr,
>   max_slots = mask + 1
>   ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
>   : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
> + 
> + /*
> +  * We need to keep the offset when mapping, so adding the offset
> +  * to the total set we need to allocate in SWIOTLB
> +  */
> + alloc_size += offset_in_page(orig_addr);
>  
>   /*
>* For mappings greater than or equal to a page, we limit the stride
> @@ -567,6 +573,11 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, 
> phys_addr_t orig_addr,
>*/
>   for (i = 0; i < nslots; i++)
>   io_tlb_orig_addr[index+i] = orig_addr + (i << IO_TLB_SHIFT);
> + /*
> +  * When keeping the offset of the original data, we need to advance
> +  * the tlb_addr by the offset of orig_addr.
> +  */
> + tlb_addr += orig_addr & (PAGE_SIZE - 1);
>   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
>   (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
>   swiotlb_bounce(orig_addr, tlb_addr, mapping_size, 
> DMA_TO_DEVICE);
> -- 
> 2.27.0
> 
> 


Re: [PATCH] x86,swiotlb: Fix build warning after merge of the SWIOTLB bounce buffer adjustment patch

2020-12-10 Thread Konrad Rzeszutek Wilk
On Fri, Dec 11, 2020 at 02:45:33AM +, Ashish Kalra wrote:
> From: Ashish Kalra 
> 
> Need to add "inline" to swiotlb_adjust_size() when
> CONFIG_SWIOTLB is not defined.

I am just going to squash that in. Thanks.
> 
> Signed-off-by: Ashish Kalra 
> ---
>  include/linux/swiotlb.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 075748f367ea..e8074ff1b8c7 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -104,7 +104,7 @@ static inline bool is_swiotlb_active(void)
>   return false;
>  }
>  
> -static void swiotlb_adjust_size(unsigned long new_size)
> +static inline void swiotlb_adjust_size(unsigned long new_size)
>  {
>  }
>  #endif /* CONFIG_SWIOTLB */
> -- 
> 2.17.1
> 


Re: [PATCH v8] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-12-08 Thread Konrad Rzeszutek Wilk
On December 8, 2020 6:01:19 PM EST, Borislav Petkov  wrote:
>On Tue, Dec 08, 2020 at 05:22:20PM -0500, Konrad Rzeszutek Wilk wrote:
>> I will fix it up.
>
>So who's picking this up? If not me then I probably should have a
>detailed look at the x86 bits before it goes in...

I was planning to pick this up (got one more SWIOTLB related patch).

That said, if you have the time to take a peek at the x86 bits - that would be
awesome!




Re: [PATCH v8] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-12-08 Thread Konrad Rzeszutek Wilk
On Mon, Dec 07, 2020 at 11:10:57PM +, Ashish Kalra wrote:
> From: Ashish Kalra 
> 
> For SEV, all DMA to and from guest has to use shared (un-encrypted) pages.
> SEV uses SWIOTLB to make this happen without requiring changes to device
> drivers.  However, depending on workload being run, the default 64MB of
> SWIOTLB might not be enough and SWIOTLB may run out of buffers to use
> for DMA, resulting in I/O errors and/or performance degradation for
> high I/O workloads.
> 
> Adjust the default size of SWIOTLB for SEV guests using a
> percentage of the total memory available to guest for SWIOTLB buffers.
> 
> Using late_initcall() interface to invoke swiotlb_adjust() does not
> work as the size adjustment needs to be done before mem_encrypt_init()
> and reserve_crashkernel() which use the allocated SWIOTLB buffer size,
> hence call it explicitly from setup_arch().
> 
> The SWIOTLB default size adjustment needs to be added as an architecture
> specific interface/callback to allow architectures such as those supporting
> memory encryption to adjust/expand SWIOTLB size for their use.
> 
> v5 fixed build errors and warnings as
> Reported-by: kbuild test robot 
> 
> Signed-off-by: Ashish Kalra 
> ---
>  arch/x86/kernel/setup.c   |  2 ++
>  arch/x86/mm/mem_encrypt.c | 37 +
>  include/linux/swiotlb.h   |  6 ++
>  kernel/dma/swiotlb.c  | 22 ++
>  4 files changed, 67 insertions(+)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 84f581c91db4..31e24e198061 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1149,6 +1149,8 @@ void __init setup_arch(char **cmdline_p)
>   if (boot_cpu_has(X86_FEATURE_GBPAGES))
>   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
>  
> + swiotlb_adjust();
> +
>   /*
>* Reserve memory for crash kernel after SRAT is parsed so that it
>* won't consume hotpluggable memory.
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 1bcfbcd2bfd7..d1b8d60040cf 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -485,7 +485,44 @@ static void print_mem_encrypt_feature_info(void)
>   pr_cont("\n");
>  }
>  
> +/*
> + * The percentage of guest memory used here for SWIOTLB buffers
> + * is more of an approximation of the static adjustment which
> + * is 128M for <1G guests, 256M for 1G-4G guests and 512M for >4G guests.

No?

it is 64MB for <1G, and ~128M to 256M for 1G-to-4G

I will fix it up.
> + */
> +#define SEV_ADJUST_SWIOTLB_SIZE_PERCENT  6
> +
>  /* Architecture __weak replacement functions */
> +unsigned long __init arch_swiotlb_adjust(unsigned long iotlb_default_size)
> +{
> + unsigned long size = iotlb_default_size;
> +
> + /*
> +  * For SEV, all DMA has to occur via shared/unencrypted pages.
> +  * SEV uses SWOTLB to make this happen without changing device
> +  * drivers. However, depending on the workload being run, the
> +  * default 64MB of SWIOTLB may not be enough and`SWIOTLB may
> +  * run out of buffers for DMA, resulting in I/O errors and/or
> +  * performance degradation especially with high I/O workloads.
> +  * Adjust the default size of SWIOTLB for SEV guests using
> +  * a percentage of guest memory for SWIOTLB buffers.
> +  * Also as the SWIOTLB bounce buffer memory is allocated
> +  * from low memory, ensure that the adjusted size is within
> +  * the limits of low available memory.
> +  *
> +  */
> + if (sev_active()) {
> + phys_addr_t total_mem = memblock_phys_mem_size();
> +
> + size = total_mem * SEV_ADJUST_SWIOTLB_SIZE_PERCENT / 100;
> + size = clamp_val(size, iotlb_default_size, SZ_1G);
> + pr_info("SWIOTLB bounce buffer size adjusted to %luMB for SEV",
> + size >> 20);
> + }
> +
> + return size;
> +}
> +
>  void __init mem_encrypt_init(void)
>  {
>   if (!sme_me_mask)
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 3bb72266a75a..b5904fa4b67c 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -33,6 +33,7 @@ extern void swiotlb_init(int verbose);
>  int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose);
>  extern unsigned long swiotlb_nr_tbl(void);
>  unsigned long swiotlb_size_or_default(void);
> +unsigned long __init arch_swiotlb_adjust(unsigned long size);
>  extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
>  extern int swiotlb_late_init_with_default_size(size_t default_size);
>  extern void __init swiotlb_update_mem_attributes(void);
> @@ -77,6 +78,7 @@ void __init swiotlb_exit(void);
>  unsigned int swiotlb_max_segment(void);
>  size_t swiotlb_max_mapping_size(struct device *dev);
>  bool is_swiotlb_active(void);
> +void __init swiotlb_adjust(void);
>  #else
>  #define swiotlb_force SWIOTLB_NO_FORCE
>  static inline bool 

Re: [PATCH] [PATCH] Keep offset when mapping data via SWIOTLB.

2020-12-08 Thread Konrad Rzeszutek Wilk
On Mon, Dec 07, 2020 at 01:42:04PM -0800, Jianxiong Gao wrote:
> NVMe driver and other applications depend on the data offset
> to operate correctly. Currently when unaligned data is mapped via
> SWIOTLB, the data is mapped as slab aligned with the SWIOTLB. When
> booting with --swiotlb=force option and using NVMe as interface,
> running mkfs.xfs on Rhel fails because of the unalignment issue.
> This patch makes sure the mapped data preserves
> its offset of the orginal address. Tested on latest kernel that
> this patch fixes the issue.

Let's reword this comment a bit more since you are not providing
the RHEL bug and are instead focusing on the upstream kernel.

I can do that for you..

> 
> Signed-off-by: Jianxiong Gao 
> Acked-by: David Rientjes 
> ---
>  kernel/dma/swiotlb.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 781b9dca197c..56a35e71b3fd 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -483,6 +483,12 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, 
> phys_addr_t orig_addr,
>   max_slots = mask + 1
>   ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
>   : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
> + 
> + /*
> +  * We need to keep the offset when mapping, so adding the offset
> +  * to the total set we need to allocate in SWIOTLB
> +  */
> + alloc_size += offset_in_page(orig_addr);
>  
>   /*
>* For mappings greater than or equal to a page, we limit the stride
> @@ -567,6 +573,11 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, 
> phys_addr_t orig_addr,
>*/
>   for (i = 0; i < nslots; i++)
>   io_tlb_orig_addr[index+i] = orig_addr + (i << IO_TLB_SHIFT);
> + /*
> +  * When keeping the offset of the original data, we need to advance
> +  * the tlb_addr by the offset of orig_addr.
> +  */
> + tlb_addr += orig_addr & (PAGE_SIZE - 1);
>   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
>   (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
>   swiotlb_bounce(orig_addr, tlb_addr, mapping_size, 
> DMA_TO_DEVICE);
> -- 
> 2.27.0
> 
> 


Re: [PATCH v6] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-12-01 Thread Konrad Rzeszutek Wilk
On Tue, Nov 24, 2020 at 11:46:22PM +, Ashish Kalra wrote:
> Hello Konrad, 
> 
> On Mon, Nov 23, 2020 at 10:56:31PM +, Ashish Kalra wrote:
> > Hello Konrad,
> > 
> > On Mon, Nov 23, 2020 at 12:56:32PM -0500, Konrad Rzeszutek Wilk wrote:
> > > On Mon, Nov 23, 2020 at 06:06:47PM +0100, Borislav Petkov wrote:
> > > > On Thu, Nov 19, 2020 at 09:42:05PM +, Ashish Kalra wrote:
> > > > > From: Ashish Kalra 
> > > > > 
> > > > > For SEV, all DMA to and from guest has to use shared (un-encrypted) 
> > > > > pages.
> > > > > SEV uses SWIOTLB to make this happen without requiring changes to 
> > > > > device
> > > > > drivers.  However, depending on workload being run, the default 64MB 
> > > > > of
> > > > > SWIOTLB might not be enough and SWIOTLB may run out of buffers to use
> > > > > for DMA, resulting in I/O errors and/or performance degradation for
> > > > > high I/O workloads.
> > > > > 
> > > > > Increase the default size of SWIOTLB for SEV guests using a minimum
> > > > > value of 128MB and a maximum value of 512MB, determining on amount
> > > > > of provisioned guest memory.
> > > > 
> > > > That sentence needs massaging.
> > > > 
> > > > > Using late_initcall() interface to invoke swiotlb_adjust() does not
> > > > > work as the size adjustment needs to be done before mem_encrypt_init()
> > > > > and reserve_crashkernel() which use the allocated SWIOTLB buffer size,
> > > > > hence calling it explicitly from setup_arch().
> > > > 
> > > > "hence call it ... "
> > > > 
> > > > > 
> > > > > The SWIOTLB default size adjustment is added as an architecture 
> > > > > specific
> > > > 
> > > > "... is added... " needs to be "Add ..."
> > > > 
> > > > > interface/callback to allow architectures such as those supporting 
> > > > > memory
> > > > > encryption to adjust/expand SWIOTLB size for their use.
> > > > > 
> > > > > v5 fixed build errors and warnings as
> > > > > Reported-by: kbuild test robot 
> > > > > 
> > > > > Signed-off-by: Ashish Kalra 
> > > > > ---
> > > > >  arch/x86/kernel/setup.c   |  2 ++
> > > > >  arch/x86/mm/mem_encrypt.c | 32 
> > > > >  include/linux/swiotlb.h   |  6 ++
> > > > >  kernel/dma/swiotlb.c  | 24 
> > > > >  4 files changed, 64 insertions(+)
> > > > > 
> > > > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > > > > index 3511736fbc74..b073d58dd4a3 100644
> > > > > --- a/arch/x86/kernel/setup.c
> > > > > +++ b/arch/x86/kernel/setup.c
> > > > > @@ -1166,6 +1166,8 @@ void __init setup_arch(char **cmdline_p)
> > > > >   if (boot_cpu_has(X86_FEATURE_GBPAGES))
> > > > >   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> > > > >  
> > > > > + swiotlb_adjust();
> > > > > +
> > > > >   /*
> > > > >* Reserve memory for crash kernel after SRAT is parsed so that 
> > > > > it
> > > > >* won't consume hotpluggable memory.
> > > > > diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> > > > > index 3f248f0d0e07..c79a0d761db5 100644
> > > > > --- a/arch/x86/mm/mem_encrypt.c
> > > > > +++ b/arch/x86/mm/mem_encrypt.c
> > > > > @@ -490,6 +490,38 @@ static void print_mem_encrypt_feature_info(void)
> > > > >  }
> > > > >  
> > > > >  /* Architecture __weak replacement functions */
> > > > > +unsigned long __init arch_swiotlb_adjust(unsigned long 
> > > > > iotlb_default_size)
> > > > > +{
> > > > > + unsigned long size = 0;
> > > > 
> > > > unsigned long size = iotlb_default_size;
> > > > 
> > > > > +
> > > > > + /*
> > > > > +  * For SEV, all DMA has to occur via shared/unencrypted pages.
> > > > > +  * SEV uses SWOTLB to make this happen without changing device
> > > > > +  * drivers. However, dep

Re: [PATCH 1/1] x86/tboot: Don't disable swiotlb when iommu is forced on

2020-11-25 Thread Konrad Rzeszutek Wilk
On Wed, Nov 25, 2020 at 03:51:30PM +, Will Deacon wrote:
> Hi Konrad,
> 
> On Wed, Nov 25, 2020 at 10:41:53AM -0500, Konrad Rzeszutek Wilk wrote:
> > On Wed, Nov 25, 2020 at 02:05:15PM +, Will Deacon wrote:
> > > On Wed, 25 Nov 2020 09:41:24 +0800, Lu Baolu wrote:
> > > > After commit 327d5b2fee91c ("iommu/vt-d: Allow 32bit devices to uses DMA
> > > > domain"), swiotbl could also be used for direct memory access if IOMMU
> > > > is enabled but a device is configured to pass through the DMA 
> > > > translation.
> > > > Keep swiotlb when IOMMU is forced on, otherwise, some devices won't work
> > > > if "iommu=pt" kernel parameter is used.
> > > 
> > > Applied to arm64 (for-next/iommu/fixes), thanks!
> > > 
> > > [1/1] x86/tboot: Don't disable swiotlb when iommu is forced on
> > >   https://git.kernel.org/arm64/c/e2be2a833ab5
> > 
> > But tboot never ran on ARM. It is Intel specific.
> > 
> > I think either Thomas or I should take this patch.
> 
> FWIW, I did check with Thomas before I picked it up. I know it looks weird
> going via arm64, but that's only because I'm temporarily handling the IOMMU
> tree there (including vt-d changes) while Joerg is away. Since this fixes a
> vt-d regression, I thought I'd pick it up along with the other IOMMU fixes I
> have queued for -rc6.
> 

Aah, I missed the memo :-)

> That said, if you insist, then I can revert it. I'm really only trying to
> help here.

Nah. Enjoy picking up patches!

Thanks!
> 
> Will


Re: [PATCH 1/1] x86/tboot: Don't disable swiotlb when iommu is forced on

2020-11-25 Thread Konrad Rzeszutek Wilk
On Wed, Nov 25, 2020 at 02:05:15PM +, Will Deacon wrote:
> On Wed, 25 Nov 2020 09:41:24 +0800, Lu Baolu wrote:
> > After commit 327d5b2fee91c ("iommu/vt-d: Allow 32bit devices to uses DMA
> > domain"), swiotbl could also be used for direct memory access if IOMMU
> > is enabled but a device is configured to pass through the DMA translation.
> > Keep swiotlb when IOMMU is forced on, otherwise, some devices won't work
> > if "iommu=pt" kernel parameter is used.
> 
> Applied to arm64 (for-next/iommu/fixes), thanks!
> 
> [1/1] x86/tboot: Don't disable swiotlb when iommu is forced on
>   https://git.kernel.org/arm64/c/e2be2a833ab5

But tboot never ran on ARM. It is Intel specific.

I think either Thomas or I should take this patch.


Re: [PATCH] [PATCH] Adding offset keeping option when mapping data via SWIOTLB.

2020-11-24 Thread Konrad Rzeszutek Wilk
On Mon, Nov 23, 2020 at 02:18:07PM -0800, Jianxiong Gao wrote:
> NVMe driver and other applications may depend on the data offset
> to operate correctly. Currently when unaligned data is mapped via
> SWIOTLB, the data is mapped as slab aligned with the SWIOTLB. When
> booting with --swiotlb=force option and using NVMe as interface,
> running mkfs.xfs on Rhel fails because of the unalignment issue.

RHEL? So a specific RHEL kernel. Is there a Red Hat bug created
for this that can be linked to this patch to make it easier
for folks to figure this out?

Why would you be using swiotlb=force?
Ah, you are using AMD SEV!

> This patch adds an option to make sure the mapped data preserves
> its offset of the orginal addrss. Tested on latest kernel that

s/addrss/address/
> this patch fixes the issue.
> 
> Signed-off-by: Jianxiong Gao 
> Acked-by: David Rientjes 
> ---
>  drivers/nvme/host/pci.c |  3 ++-
>  include/linux/dma-mapping.h |  8 
>  kernel/dma/swiotlb.c| 13 +
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 0578ff253c47..a366fb8a1ff0 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -833,7 +833,8 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, 
> struct request *req,
>   iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
>   else
>   nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
> -  rq_dma_dir(req), DMA_ATTR_NO_WARN);
> + rq_dma_dir(req),
> + DMA_ATTR_NO_WARN|DMA_ATTR_SWIOTLB_KEEP_OFFSET);
>   if (!nr_mapped)
>   goto out;
>  
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 956151052d45..e46d23d9fa20 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -61,6 +61,14 @@
>   */
>  #define DMA_ATTR_PRIVILEGED  (1UL << 9)
>  
> +/*
> + * DMA_ATTR_SWIOTLB_KEEP_OFFSET: used to indicate that the buffer has to keep
> + * its offset when mapped via SWIOTLB. Some application functionality depends
> + * on the address offset, thus when buffers are mapped via SWIOTLB, the 
> offset
> + * needs to be preserved.
> + */
> +#define DMA_ATTR_SWIOTLB_KEEP_OFFSET (1UL << 10)
> +
>  /*
>   * A dma_addr_t can hold any valid DMA or bus address for the platform.  It 
> can
>   * be given to a device to use as a DMA source or target.  It is specific to 
> a
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 781b9dca197c..f43d7be1342d 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -483,6 +483,13 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, 
> phys_addr_t orig_addr,
>   max_slots = mask + 1
>   ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
>   : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
> + 
> + /*
> +  * If we need to keep the offset when mapping, we need to add the offset
> +  * to the total set we need to allocate in SWIOTLB
> +  */
> + if (attrs & DMA_ATTR_SWIOTLB_KEEP_OFFSET)
> + alloc_size += offset_in_page(orig_addr);
>  
>   /*
>* For mappings greater than or equal to a page, we limit the stride
> @@ -567,6 +574,12 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, 
> phys_addr_t orig_addr,
>*/
>   for (i = 0; i < nslots; i++)
>   io_tlb_orig_addr[index+i] = orig_addr + (i << IO_TLB_SHIFT);
> + /*
> +  * When keeping the offset of the original data, we need to advance
> +  * the tlb_addr by the offset of orig_addr.
> +  */
> + if (attrs & DMA_ATTR_SWIOTLB_KEEP_OFFSET)
> + tlb_addr += orig_addr & (PAGE_SIZE - 1);
>   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
>   (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
>   swiotlb_bounce(orig_addr, tlb_addr, mapping_size, 
> DMA_TO_DEVICE);
> -- 
> 2.27.0
> 
> 


Re: [PATCH v6] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-11-23 Thread Konrad Rzeszutek Wilk
On Mon, Nov 23, 2020 at 07:02:15PM +0100, Borislav Petkov wrote:
> On Mon, Nov 23, 2020 at 12:56:32PM -0500, Konrad Rzeszutek Wilk wrote:
> > This is not going to work for TDX. I think having a registration
> > to SWIOTLB to have this function would be better going forward.
> > 
> > As in there will be a swiotlb_register_adjuster() which AMD SEV
> > code can call at start, also TDX can do it (and other platforms).
> 
> Oh do tell. It doesn't need to adjust size?

I am assuming that TDX is going to have the same exact issue that 
AMD SEV will have.

Are you recommending having a unified x86-specific callback
where we check that:

 - CPUID_AMD_SEV or CPUID_INTEL_TDX is set, and
 - No vIOMMU present, then we adjust the size?
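
Roughly, as a sketch (the two helpers in the condition are invented placeholders
for whatever the SEV/TDX and vIOMMU detection ends up being; the 6%-of-memory
figure and the clamp mirror what later revisions of this patch do):

        unsigned long __init arch_swiotlb_adjust(unsigned long default_size)
        {
                unsigned long size = default_size;

                /* Only grow the bounce buffer for encrypted guests that have
                 * no vIOMMU doing the isolation for them. */
                if (encrypted_guest_active() && !viommu_present()) {
                        size = memblock_phys_mem_size() * 6 / 100;
                        size = clamp_val(size, default_size, SZ_1G);
                }
                return size;
        }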

> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH v6] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-11-23 Thread Konrad Rzeszutek Wilk
On Mon, Nov 23, 2020 at 06:06:47PM +0100, Borislav Petkov wrote:
> On Thu, Nov 19, 2020 at 09:42:05PM +, Ashish Kalra wrote:
> > From: Ashish Kalra 
> > 
> > For SEV, all DMA to and from guest has to use shared (un-encrypted) pages.
> > SEV uses SWIOTLB to make this happen without requiring changes to device
> > drivers.  However, depending on workload being run, the default 64MB of
> > SWIOTLB might not be enough and SWIOTLB may run out of buffers to use
> > for DMA, resulting in I/O errors and/or performance degradation for
> > high I/O workloads.
> > 
> > Increase the default size of SWIOTLB for SEV guests using a minimum
> > value of 128MB and a maximum value of 512MB, determining on amount
> > of provisioned guest memory.
> 
> That sentence needs massaging.
> 
> > Using late_initcall() interface to invoke swiotlb_adjust() does not
> > work as the size adjustment needs to be done before mem_encrypt_init()
> > and reserve_crashkernel() which use the allocated SWIOTLB buffer size,
> > hence calling it explicitly from setup_arch().
> 
> "hence call it ... "
> 
> > 
> > The SWIOTLB default size adjustment is added as an architecture specific
> 
> "... is added... " needs to be "Add ..."
> 
> > interface/callback to allow architectures such as those supporting memory
> > encryption to adjust/expand SWIOTLB size for their use.
> > 
> > v5 fixed build errors and warnings as
> > Reported-by: kbuild test robot 
> > 
> > Signed-off-by: Ashish Kalra 
> > ---
> >  arch/x86/kernel/setup.c   |  2 ++
> >  arch/x86/mm/mem_encrypt.c | 32 
> >  include/linux/swiotlb.h   |  6 ++
> >  kernel/dma/swiotlb.c  | 24 
> >  4 files changed, 64 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 3511736fbc74..b073d58dd4a3 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -1166,6 +1166,8 @@ void __init setup_arch(char **cmdline_p)
> > if (boot_cpu_has(X86_FEATURE_GBPAGES))
> > hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> >  
> > +   swiotlb_adjust();
> > +
> > /*
> >  * Reserve memory for crash kernel after SRAT is parsed so that it
> >  * won't consume hotpluggable memory.
> > diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> > index 3f248f0d0e07..c79a0d761db5 100644
> > --- a/arch/x86/mm/mem_encrypt.c
> > +++ b/arch/x86/mm/mem_encrypt.c
> > @@ -490,6 +490,38 @@ static void print_mem_encrypt_feature_info(void)
> >  }
> >  
> >  /* Architecture __weak replacement functions */
> > +unsigned long __init arch_swiotlb_adjust(unsigned long iotlb_default_size)
> > +{
> > +   unsigned long size = 0;
> 
>   unsigned long size = iotlb_default_size;
> 
> > +
> > +   /*
> > +* For SEV, all DMA has to occur via shared/unencrypted pages.
> > +* SEV uses SWOTLB to make this happen without changing device
> > +* drivers. However, depending on the workload being run, the
> > +* default 64MB of SWIOTLB may not be enough & SWIOTLB may
>^
> 
> Use words pls, not "&".
> 
> 
> > +* run out of buffers for DMA, resulting in I/O errors and/or
> > +* performance degradation especially with high I/O workloads.
> > +* Increase the default size of SWIOTLB for SEV guests using
> > +* a minimum value of 128MB and a maximum value of 512MB,
> > +* depending on amount of provisioned guest memory.
> > +*/
> > +   if (sev_active()) {
> > +   phys_addr_t total_mem = memblock_phys_mem_size();
> > +
> > +   if (total_mem <= SZ_1G)
> > +   size = max(iotlb_default_size, (unsigned long) SZ_128M);
> > +   else if (total_mem <= SZ_4G)
> > +   size = max(iotlb_default_size, (unsigned long) SZ_256M);

That is eating 128MB for 1GB, aka 12% of the guest memory allocated statically 
for this.

And for guests that are 2GB, that is 12% until it gets to 3GB when it is 8%
and then 6% at 4GB.

I would prefer this to be based on your memory count, that is 6% of total
memory. And then going forward we can allocate memory _after_ boot and then 
stitch
the late SWIOTLB pool and allocate on demand.
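
Back-of-the-envelope numbers behind that suggestion (approximate, illustration
only), comparing the static 128M/256M/512M tiers in the patch with a flat 6%:

        guest RAM    tiered size    tiered share    6% of RAM
        1 GB         128 MB         12.5%           ~61 MB
        2 GB         256 MB         12.5%           ~123 MB
        4 GB         256 MB         6.25%           ~246 MB
        8 GB         512 MB         6.25%           ~492 MB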



> > +   else
> > +   size = max(iotlb_default_size, (unsigned long) SZ_512M);
> > +
> > +   pr_info("SWIOTLB bounce buffer size adjusted to %luMB for SEV 
> > platform",
> 
> just "... for SEV" - no need for "platform".
> 
> ...
> 
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index c19379fabd20..3be9a19ea0a5 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -163,6 +163,30 @@ unsigned long swiotlb_size_or_default(void)
> > return size ? size : (IO_TLB_DEFAULT_SIZE);
> >  }
> >  
> > +unsigned long __init __weak arch_swiotlb_adjust(unsigned long size)
> > +{
> > +   return 0;
> 
> That, of course, needs to return size, not 0.

This is not going to work for TDX. I think 

Re: [PATCH v3] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-11-17 Thread Konrad Rzeszutek Wilk
On Tue, Nov 17, 2020 at 07:04:59PM +, Kalra, Ashish wrote:
> Hello Konrad,
> 
> Actually I didn’t get that, do you mean you are taking 1G and <=4G cases out 
> of the patch and only going to apply the >4G case as part of the patch ?

That was the thought, but now I am wondering how TDX is going to work
with this. That is, the __weak won't work on a distro kernel that has to
run on both AMD and Intel. Hmm.

Let me brush off the late-SWIOTLB patches that we developed internally some
time ago.

> 
> Thanks,
> Ashish 
> 
> > On Nov 17, 2020, at 11:38 AM, Kalra, Ashish  wrote:
> > 
> > Hello Konrad, 
> > 
> >> On Tue, Nov 17, 2020 at 12:00:03PM -0500, Konrad Rzeszutek Wilk wrote:
> >> .snip..
> >>>>>> Lets break this down:
> >>>>>> 
> >>>>>> How does the performance improve for one single device if you increase 
> >>>>>> the SWIOTLB?
> >>>>>> Is there a specific device/driver that you can talk about that improve 
> >>>>>> with this patch?
> >>>>>> 
> >>>>>> 
> >>>>> 
> >>>>> Yes, these are mainly for multi-queue devices such as NICs or even
> >>>>> multi-queue virtio. 
> >>>>> 
> >>>>> This basically improves performance with concurrent DMA, hence,
> >>>>> basically multi-queue devices.
> >>>> 
> >>>> OK, and for _1GB_ guest - what are the "internal teams/external 
> >>>> customers" amount 
> >>>> of CPUs they use? Please lets use real use-cases.
> >>> 
> >>>>> I am sure you will understand we cannot share any external customer
> >>>>> data as all that customer information is proprietary.
> >>>>> 
> >>>>> In similar situation if you have to share Oracle data, you will
> >>>>> surely have the same concerns and i don't think you will be able
> >>>>> to share any such information externally, i.e., outside Oracle.
> >>>>> 
> >>>> I am asking for a simple query - what amount of CPUs does a 1GB
> >>>> guest have? The reason for this should be fairly obvious - if
> >>>> it is a 1vCPU, then there is no multi-queue and the existing
> >>>> SWIOTLB pool size as it is OK.
> >>>> 
> >>>> If however there are say 2 and multiqueue is enabled, that
> >>>> gives me an idea of how many you use and I can find out what
> >>>> the maximum pool size usage of virtio there is with that configuration.
> >>> 
> >>> Again we cannot share any customer data.
> >>> 
> >>> Also i don't think there can be a definitive answer to how many vCPUs a
> >>> 1GB guest will have, it will depend on what kind of configuration we are
> >>> testing.
> >>> 
> >>> For example, i usually setup 4-16 vCPUs for as low as 512M configured
> >>> gueest memory.
> >> 
> >> Sure, but you are not the normal user.
> >> 
> >> That is I don't like that for 1GB guests your patch ends up doubling the
> >> SWIOTLB memory pool. That seems to me we are trying to solve a problem
> >> that normal users will not hit. That is why I want 'here is the customer
> >> bug'.
> >> 
> >> Here is what I am going to do - I will take out the 1GB and 4GB case out of
> >> your patch and call it a day. If there are customers who start reporting 
> >> issues
> >> we can revisit that. Nothing wrong with 'Reported-by' XYZ (we often ask the
> >> customer if he or she would like to be recognized on upstream bugs).
> >> 
> > 
> > Ok.
> > 
> >> And in the meantime I am going to look about adding ..
> >>> 
> >>> I have been also testing with 16 vCPUs configuration for 512M-1G guest
> >>> memory with Mellanox SRIOV NICs, and this will be a multi-queue NIC
> >>> device environment.
> >> 
> >> .. late SWIOTLB expansion to stitch the DMA pools together as both
> >> Mellanox and VirtIO are not 32-bit DMA bound.
> >> 
> >>> 
> >>> So we might be having less configured guest memory, but we still might
> >>> be using that configuration with I/O intensive workloads.
> >>> 
> > 
> > I am going to submit v4 of my current patch-set which uses max() instead
> > of clamp() and also replaces constants defined in this patch with the
> > pre-defined ones in sizes.h
> > 
> > Thanks,
> > Ashish


Re: [PATCH v3] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-11-17 Thread Konrad Rzeszutek Wilk
.snip..
> > > > Lets break this down:
> > > > 
> > > > How does the performance improve for one single device if you increase 
> > > > the SWIOTLB?
> > > > Is there a specific device/driver that you can talk about that improve 
> > > > with this patch?
> > > > 
> > > > 
> > > 
> > > Yes, these are mainly for multi-queue devices such as NICs or even
> > > multi-queue virtio. 
> > > 
> > > This basically improves performance with concurrent DMA, hence,
> > > basically multi-queue devices.
> > 
> > OK, and for _1GB_ guest - what are the "internal teams/external customers" 
> > amount 
> > of CPUs they use? Please lets use real use-cases.
> 
> >> I am sure you will understand we cannot share any external customer
> >> data as all that customer information is proprietary.
> >>
> >> In similar situation if you have to share Oracle data, you will
> >> surely have the same concerns and i don't think you will be able
> >> to share any such information externally, i.e., outside Oracle.
> >>
> >I am asking for a simple query - what amount of CPUs does a 1GB
> >guest have? The reason for this should be fairly obvious - if
> >it is a 1vCPU, then there is no multi-queue and the existing
> >SWIOTLB pool size as it is OK.
> >
> >If however there are say 2 and multiqueue is enabled, that
> >gives me an idea of how many you use and I can find out what
> >the maximum pool size usage of virtio there is with that configuration.
> 
> Again we cannot share any customer data.
> 
> Also i don't think there can be a definitive answer to how many vCPUs a
> 1GB guest will have, it will depend on what kind of configuration we are
> testing.
> 
> For example, i usually setup 4-16 vCPUs for as low as 512M configured
> gueest memory.

Sure, but you are not the normal user.

That is I don't like that for 1GB guests your patch ends up doubling the
SWIOTLB memory pool. That seems to me we are trying to solve a problem
that normal users will not hit. That is why I want 'here is the customer
bug'.

Here is what I am going to do - I will take out the 1GB and 4GB case out of
your patch and call it a day. If there are customers who start reporting issues
we can revisit that. Nothing wrong with 'Reported-by' XYZ (we often ask the
customer if he or she would like to be recognized on upstream bugs).

And in the meantime I am going to look about adding ..
> 
> I have been also testing with 16 vCPUs configuration for 512M-1G guest
> memory with Mellanox SRIOV NICs, and this will be a multi-queue NIC
> device environment.

.. late SWIOTLB expansion to stitch the DMA pools together as both
Mellanox and VirtIO are not 32-bit DMA bound.

> 
> So we might be having less configured guest memory, but we still might
> be using that configuration with I/O intensive workloads.
> 
> Thanks,
> Ashish


Re: [PATCH v3] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-11-13 Thread Konrad Rzeszutek Wilk
On Thu, Nov 05, 2020 at 09:20:45PM +, Ashish Kalra wrote:
> On Thu, Nov 05, 2020 at 03:20:07PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 05, 2020 at 07:38:28PM +, Ashish Kalra wrote:
> > > On Thu, Nov 05, 2020 at 02:06:49PM -0500, Konrad Rzeszutek Wilk wrote:
> > > > .
> > > > > > Right, so I am wondering if we can do this better.
> > > > > > 
> > > > > > That is you are never going to get any 32-bit devices with SEV 
> > > > > > right? That
> > > > > > is there is nothing that bounds you to always use the memory below 
> > > > > > 4GB?
> > > > > > 
> > > > > 
> > > > > We do support 32-bit PCIe passthrough devices with SEV.
> > > > 
> > > > Ewww..  Which devices would this be?
> > > 
> > > That will be difficult to predict as customers could be doing
> > > passthrough of all kinds of devices.
> > 
> > But SEV is not on some 1990s hardware. It has PCIe; there are no PCI slots in 
> > there.
> > 
> > Is it really possible to have a PCIe device that can't do more than 32-bit 
> > DMA?
> > 
> > > 
> > > > > 
> > > > > Therefore, we can't just depend on >4G memory for SWIOTLB bounce 
> > > > > buffering
> > > > > when there is I/O pressure, because we do need to support device
> > > > > passthrough of 32-bit devices.
> > > > 
> > > > Presumably there is just a handful of them?
> > > >
> > > Again, it will be incorrect to assume this.
> > > 
> > > > > 
> > > > > Considering this, we believe that this patch needs to adjust/extend
> > > > > boot-allocation of SWIOTLB and we want to keep it simple to do this
> > > > > within a range determined by the amount of allocated guest memory.
> > > > 
> > > > I would prefer to not have to revert this in a year as customers
> > > > complain about "I paid $$$ and I am wasting half a gig on something 
> > > > I am not using" and giving customers knobs to tweak this instead of
> > > > doing the right thing from the start.
> > > 
> > > Currently, we face a lot of situations where we have to tell our
> > > internal teams/external customers to explicitly increase SWIOTLB buffer
> > > via the swiotlb parameter on the kernel command line, especially to
> > > get better I/O performance numbers with SEV. 
> > 
> > Presumably these are 64-bit?
> > 
> > And what devices do you speak of that are actually affected by 
> > this performance? Increasing the SWIOTLB just means we have more
> > memory, which in my mind means you can have _more_ devices in the guest
> > that won't handle the fact that DMA mapping returns an error.
> > 
> > Not necessarily that one device suddenly can go faster.
> > 
> > > 
> > > So having this SWIOTLB size adjustment done implicitly (even using a
> > > static logic) is a great win-win situation. In other words, having even
> > > a simple and static default increase of SWIOTLB buffer size for SEV is
> > > really useful for us.
> > > 
> > > We can always think of adding all kinds of heuristics to this, but that
> > > just adds too much complexity without any predictable performance gain.
> > > 
> > > And to add, the patch extends the SWIOTLB size as an architecture
> > > specific callback, currently it is a simple and static logic for SEV/x86
> > > specific, but there is always an option to tweak/extend it with
> > > additional logic in the future.
> > 
> > Right, and that is what I would like to talk about as I think you
> > are going to disappear (aka, busy with other stuff) after this patch goes 
> > in.
> > 
> > I need to understand this more than "performance" and "internal teams"
> > requirements to come up with a better way going forward as surely other
> > platforms will hit the same issue anyhow.
> > 
> > Lets break this down:
> > 
> > How does the performance improve for one single device if you increase the 
> > SWIOTLB?
> > Is there a specific device/driver that you can talk about that improves with 
> > this patch?
> > 
> > 
> 
> Yes, these are mainly for multi-queue devices such as NICs or even
> multi-queue virtio. 
> 
> This basically improves performance with concurrent DMA, hence,
> basically multi-queue devices.

OK, and for a _1GB_ guest - how many CPUs do the "internal teams/external
customers" use? Please let's use real use-cases.


Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-12 Thread Konrad Rzeszutek Wilk
.monster snip..

> 4. Using CPUID to detect running as guest. But as Thomas pointed out, this
> approach is less reliable as not all hypervisors do this way.

Is that truly true? It is the first time I have seen the argument that extra
steps are needed and that checking for X86_FEATURE_HYPERVISOR is not enough.

Or is it more "Some hypervisor probably forgot about it, so let's make sure we patch
over that possible hole?"


Also is there anything in this spec that precludes this from working
on non-X86 architectures, say ARM systems?


[GIT PULL] (swiotlb) stable/for-linus-5.10-rc2

2020-11-10 Thread Konrad Rzeszutek Wilk
Dear Linus,

Please git pull the following branch:

 git pull git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git 
stable/for-linus-5.10-rc2


which has two tiny fixes for drivers that are unhappy under Xen under certain
conditions.

Thank you!

Christoph Hellwig (1):
  swiotlb: remove the tbl_dma_addr argument to swiotlb_tbl_map_single

Stefano Stabellini (1):
  swiotlb: fix "x86: Don't panic if can not alloc buffer for swiotlb"

 drivers/iommu/intel/iommu.c |  5 ++---
 drivers/xen/swiotlb-xen.c   |  3 +--
 include/linux/swiotlb.h | 10 +++---
 kernel/dma/swiotlb.c| 22 +++---
 4 files changed, 17 insertions(+), 23 deletions(-)


Re: [PATCH v3] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-11-05 Thread Konrad Rzeszutek Wilk
On Thu, Nov 05, 2020 at 07:38:28PM +, Ashish Kalra wrote:
> On Thu, Nov 05, 2020 at 02:06:49PM -0500, Konrad Rzeszutek Wilk wrote:
> > .
> > > > Right, so I am wondering if we can do this better.
> > > > 
> > > > That is you are never going to get any 32-bit devices with SEV right? 
> > > > That
> > > > is there is nothing that bounds you to always use the memory below 4GB?
> > > > 
> > > 
> > > We do support 32-bit PCIe passthrough devices with SEV.
> > 
> > Ewww..  Which devices would this be?
> 
> That will be difficult to predict as customers could be doing
> passthrough of all kinds of devices.

But SEV is not on some 1990s hardware. It has PCIe; there are no PCI slots in 
there.

Is it really possible to have a PCIe device that can't do more than 32-bit DMA?

> 
> > > 
> > > Therefore, we can't just depend on >4G memory for SWIOTLB bounce buffering
> > > when there is I/O pressure, because we do need to support device
> > > passthrough of 32-bit devices.
> > 
> > Presumably there is just a handful of them?
> >
> Again, it will be incorrect to assume this.
> 
> > > 
> > > Considering this, we believe that this patch needs to adjust/extend
> > > boot-allocation of SWIOTLB and we want to keep it simple to do this
> > > within a range determined by the amount of allocated guest memory.
> > 
> > I would prefer to not have to revert this in a year as customers
> > complain about "I paid $$$ and I am wasting half a gig on something 
> > I am not using" and giving customers knobs to tweak this instead of
> > doing the right thing from the start.
> 
> Currently, we face a lot of situations where we have to tell our
> internal teams/external customers to explicitly increase SWIOTLB buffer
> via the swiotlb parameter on the kernel command line, especially to
> get better I/O performance numbers with SEV. 

Presumably these are 64-bit?

And what devices do you speak of that are actually affected by 
this performance? Increasing the SWIOTLB just means we have more
memory, which in my mind means you can have _more_ devices in the guest
that won't handle the fact that DMA mapping returns an error.

Not necessarily that one device suddenly can go faster.

> 
> So having this SWIOTLB size adjustment done implicitly (even using a
> static logic) is a great win-win situation. In other words, having even
> a simple and static default increase of SWIOTLB buffer size for SEV is
> really useful for us.
> 
> We can always think of adding all kinds of heuristics to this, but that
> just adds too much complexity without any predictable performance gain.
> 
> And to add, the patch extends the SWIOTLB size as an architecture
> specific callback, currently it is a simple and static logic for SEV/x86
> specific, but there is always an option to tweak/extend it with
> additional logic in the future.

Right, and that is what I would like to talk about as I think you
are going to disappear (aka, busy with other stuff) after this patch goes in.

I need to understand this more than "performance" and "internal teams"
requirements to come up with a better way going forward as surely other
platforms will hit the same issue anyhow.

Lets break this down:

How does the performance improve for one single device if you increase the 
SWIOTLB?
Is there a specific device/driver that you can talk about that improves with 
this patch?


> 
> Thanks,
> Ashish'
> 
> > 
> > That is the right thing being something less static.
> > 
> > Can you work with me on what that could be please?
> > 
> > > 
> > > Thanks,
> > > Ashish
> > > 
> > > > What I wonder is if we can combine the boot-allocation of the SWIOTLB
> > > > with the post-boot-allocation of SWIOTLB to stitch together
> > > > contiguous physical ranges.
> > > > 
> > > > That way you have the flexibility at the start of using 64MB but if 
> > > > there
> > > > is pressure, we grow to a bigger size?
> > > > 
> > > > > 
> > > > > Thanks,
> > > > > Ashish
> > > > > 
> > > > > > > memory.
> > > > > > > 
> > > > > > > Using late_initcall() interface to invoke
> > > > > > > swiotlb_adjust() does not work as the size
> > > > > > > adjustment needs to be done before mem_encrypt_init()
> > > > > > > and reserve_crashkernel() which use the allocated
> > > > > > > SWIOTLB buffer size, hence callin

Re: [PATCH v3] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-11-05 Thread Konrad Rzeszutek Wilk
.
> > Right, so I am wondering if we can do this better.
> > 
> > That is you are never going to get any 32-bit devices with SEV right? That
> > is there is nothing that bounds you to always use the memory below 4GB?
> > 
> 
> We do support 32-bit PCIe passthrough devices with SEV.

Ewww..  Which devices would this be?
> 
> Therefore, we can't just depend on >4G memory for SWIOTLB bounce buffering
> when there is I/O pressure, because we do need to support device
> passthrough of 32-bit devices.

Presumably there is just a handful of them?

> 
> Considering this, we believe that this patch needs to adjust/extend
> boot-allocation of SWIOTLB and we want to keep it simple to do this
> within a range determined by the amount of allocated guest memory.

I would prefer to not have to revert this in a year as customers
complain about "I paid $$$ and I am wasting half a gig on something 
I am not using" and giving customers knobs to tweak this instead of
doing the right thing from the start.

That is the right thing being something less static.

Can you work with me on what that could be please?

> 
> Thanks,
> Ashish
> 
> > What I wonder is if we can combine the boot-allocation of the SWIOTLB
> > with the post-boot-allocation of SWIOTLB to stitch together
> > contiguous physical ranges.
> > 
> > That way you have the flexibility at the start of using 64MB but if there
> > is pressure, we grow to a bigger size?
> > 
> > > 
> > > Thanks,
> > > Ashish
> > > 
> > > > > memory.
> > > > > 
> > > > > Using late_initcall() interface to invoke
> > > > > swiotlb_adjust() does not work as the size
> > > > > adjustment needs to be done before mem_encrypt_init()
> > > > > and reserve_crashkernel() which use the allocated
> > > > > SWIOTLB buffer size, hence calling it explicitly
> > > > > from setup_arch().
> > > > > 
> > > > > The SWIOTLB default size adjustment is added as an
> > > > > architecture specific interface/callback to allow
> > > > > architectures such as those supporting memory
> > > > > encryption to adjust/expand SWIOTLB size for their
> > > > > use.
> > > > > 
> > > > > Signed-off-by: Ashish Kalra 
> > > > > ---
> > > > >  arch/x86/kernel/setup.c   |  2 ++
> > > > >  arch/x86/mm/mem_encrypt.c | 42 
> > > > > +++
> > > > >  include/linux/swiotlb.h   |  1 +
> > > > >  kernel/dma/swiotlb.c  | 27 +
> > > > >  4 files changed, 72 insertions(+)
> > > > > 
> > > > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > > > > index 3511736fbc74..b073d58dd4a3 100644
> > > > > --- a/arch/x86/kernel/setup.c
> > > > > +++ b/arch/x86/kernel/setup.c
> > > > > @@ -1166,6 +1166,8 @@ void __init setup_arch(char **cmdline_p)
> > > > >   if (boot_cpu_has(X86_FEATURE_GBPAGES))
> > > > >   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> > > > >  
> > > > > + swiotlb_adjust();
> > > > > +
> > > > >   /*
> > > > >* Reserve memory for crash kernel after SRAT is parsed so that 
> > > > > it
> > > > >* won't consume hotpluggable memory.
> > > > > diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> > > > > index 3f248f0d0e07..e0deb157cddd 100644
> > > > > --- a/arch/x86/mm/mem_encrypt.c
> > > > > +++ b/arch/x86/mm/mem_encrypt.c
> > > > > @@ -489,7 +489,49 @@ static void print_mem_encrypt_feature_info(void)
> > > > >   pr_cont("\n");
> > > > >  }
> > > > >  
> > > > > +#define TOTAL_MEM_1G 0x40000000UL
> > > > > +#define TOTAL_MEM_4G 0x100000000UL
> > > > > +
> > > > > +#define SIZE_128M (128UL<<20)
> > > > > +#define SIZE_256M (256UL<<20)
> > > > > +#define SIZE_512M (512UL<<20)
> > > > > +
> > > > >  /* Architecture __weak replacement functions */
> > > > > +unsigned long __init arch_swiotlb_adjust(unsigned long 
> > > > > iotlb_default_size)
> > > > > +{
> > > > > + unsigned long size = 0;
> > > > > +
> > > > > + /*
> > > > > +  * For SEV, all DMA has to occur via shared/unencrypted pages.
> > > > > +  * SEV uses SWIOTLB to make this happen without changing device
> > > > > +  * drivers. However, depending on the workload being run, the
> > > > > +  * default 64MB of SWIOTLB may not be enough & SWIOTLB may
> > > > > +  * run out of buffers for DMA, resulting in I/O errors and/or
> > > > > +  * performance degradation especially with high I/O workloads.
> > > > > +  * Increase the default size of SWIOTLB for SEV guests using
> > > > > +  * a minimum value of 128MB and a maximum value of 512MB,
> > > > > +  * depending on amount of provisioned guest memory.
> > > > > +  */
> > > > > + if (sev_active()) {
> > > > > + phys_addr_t total_mem = memblock_phys_mem_size();
> > > > > +
> > > > > + if (total_mem <= TOTAL_MEM_1G)
> > > > > + size = clamp(iotlb_default_size * 2, SIZE_128M,
> > > > > +  SIZE_128M);
> > > > > + else if (total_mem <= TOTAL_MEM_4G)
> > > > > +

Re: [PATCH v3] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-11-05 Thread Konrad Rzeszutek Wilk
On Wed, Nov 04, 2020 at 10:39:13PM +, Ashish Kalra wrote:
> Hello Konrad,
> 
> On Wed, Nov 04, 2020 at 05:14:52PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Wed, Nov 04, 2020 at 10:08:04PM +, Ashish Kalra wrote:
> > > From: Ashish Kalra 
> > > 
> > > For SEV, all DMA to and from guest has to use shared
> > > (un-encrypted) pages. SEV uses SWIOTLB to make this
> > > happen without requiring changes to device drivers.
> > > However, depending on workload being run, the default
> > > 64MB of SWIOTLB might not be enough and SWIOTLB
> > > may run out of buffers to use for DMA, resulting
> > > in I/O errors and/or performance degradation for
> > > high I/O workloads.
> > > 
> > > Increase the default size of SWIOTLB for SEV guests
> > > using a minimum value of 128MB and a maximum value
> > 
> > 
> > 
> > 64MB for a 1GB VM is not enough?
> > 
> > > of 512MB, determining on amount of provisioned guest
> > 
> > I like the implementation of how this is done.. but
> > the choices of memory and how much seem very much
> > random. Could there be some math behind this?
> >
> 
> Earlier the patch was based on using a % of guest memory, as below:
> 
> +#define SEV_ADJUST_SWIOTLB_SIZE_PERCENT	5
> +#define SEV_ADJUST_SWIOTLB_SIZE_MAX	(1UL << 30)
> ...
> ...
> +   if (sev_active() && !io_tlb_nslabs) {
> +   unsigned long total_mem = get_num_physpages() << PAGE_SHIFT;
> +
> +   default_size = total_mem *
> +   SEV_ADJUST_SWIOTLB_SIZE_PERCENT / 100;
> +
> +   default_size = ALIGN(default_size, 1 << IO_TLB_SHIFT);
> +
> +   default_size = clamp_val(default_size, IO_TLB_DEFAULT_SIZE,
> +   SEV_ADJUST_SWIOTLB_SIZE_MAX);
> +   }
> 
> But, then it is difficult to predict what % of guest memory to use ?
> 
> Then there are other factors to consider, such as vcpu_count or if there
> is going to be high I/O workload, etc.
> 
> But that all makes it very complicated, what we basically want is a
> range from 128M to 512M and that's why the current patch which picks up
> this range from the amount of allocated guest memory keeps it simple. 

Right, so I am wondering if we can do this better.

That is you are never going to get any 32-bit devices with SEV right? That
is there is nothing that bounds you to always use the memory below 4GB?

What I wonder is if we can combine the boot-allocation of the SWIOTLB
with the post-boot-allocation of SWIOTLB to stitch together
contiguous physical ranges.

That way you have the flexibility at the start of using 64MB but if there
is pressure, we grow to a bigger size?
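
Something along these lines perhaps (a very rough, untested sketch;
swiotlb_late_init_with_tbl today simply installs a new table rather than
stitching ranges together, so real stitching would need more plumbing - the
function name here is made up):

/* Sketch only: allocate a bigger bounce buffer after boot and hand it
 * to the existing late-init path. */
static int swiotlb_grow(unsigned long new_nslabs)
{
        unsigned long bytes = new_nslabs << IO_TLB_SHIFT;
        char *vstart;

        vstart = (char *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
                                          get_order(bytes));
        if (!vstart)
                return -ENOMEM;

        return swiotlb_late_init_with_tbl(vstart, new_nslabs);
}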

> 
> Thanks,
> Ashish
> 
> > > memory.
> > > 
> > > Using late_initcall() interface to invoke
> > > swiotlb_adjust() does not work as the size
> > > adjustment needs to be done before mem_encrypt_init()
> > > and reserve_crashkernel() which use the allocated
> > > SWIOTLB buffer size, hence calling it explicitly
> > > from setup_arch().
> > > 
> > > The SWIOTLB default size adjustment is added as an
> > > architecture specific interface/callback to allow
> > > architectures such as those supporting memory
> > > encryption to adjust/expand SWIOTLB size for their
> > > use.
> > > 
> > > Signed-off-by: Ashish Kalra 
> > > ---
> > >  arch/x86/kernel/setup.c   |  2 ++
> > >  arch/x86/mm/mem_encrypt.c | 42 +++
> > >  include/linux/swiotlb.h   |  1 +
> > >  kernel/dma/swiotlb.c  | 27 +
> > >  4 files changed, 72 insertions(+)
> > > 
> > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > > index 3511736fbc74..b073d58dd4a3 100644
> > > --- a/arch/x86/kernel/setup.c
> > > +++ b/arch/x86/kernel/setup.c
> > > @@ -1166,6 +1166,8 @@ void __init setup_arch(char **cmdline_p)
> > >   if (boot_cpu_has(X86_FEATURE_GBPAGES))
> > >   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> > >  
> > > + swiotlb_adjust();
> > > +
> > >   /*
> > >* Reserve memory for crash kernel after SRAT is parsed so that it
> > >* won't consume hotpluggable memory.
> > > diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> > > index 3f248f0d0e07..e0deb157cddd 100644
> > > --- a/arch/x86/mm/mem_encrypt.c
> > > +++ b/arch/x86/mm/mem_encrypt.c
> > > @@ -489,7 +489,49 @@ st

Re: [PATCH v3] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-11-04 Thread Konrad Rzeszutek Wilk
On Wed, Nov 04, 2020 at 10:08:04PM +, Ashish Kalra wrote:
> From: Ashish Kalra 
> 
> For SEV, all DMA to and from guest has to use shared
> (un-encrypted) pages. SEV uses SWIOTLB to make this
> happen without requiring changes to device drivers.
> However, depending on workload being run, the default
> 64MB of SWIOTLB might not be enough and SWIOTLB
> may run out of buffers to use for DMA, resulting
> in I/O errors and/or performance degradation for
> high I/O workloads.
> 
> Increase the default size of SWIOTLB for SEV guests
> using a minimum value of 128MB and a maximum value



64MB for a 1GB VM is not enough?

> of 512MB, determining on amount of provisioned guest

I like the implementation of how this is done.. but
the choices of memory and how much seem very much
random. Could there be some math behind this?

> memory.
> 
> Using late_initcall() interface to invoke
> swiotlb_adjust() does not work as the size
> adjustment needs to be done before mem_encrypt_init()
> and reserve_crashkernel() which use the allocated
> SWIOTLB buffer size, hence calling it explicitly
> from setup_arch().
> 
> The SWIOTLB default size adjustment is added as an
> architecture specific interface/callback to allow
> architectures such as those supporting memory
> encryption to adjust/expand SWIOTLB size for their
> use.
> 
> Signed-off-by: Ashish Kalra 
> ---
>  arch/x86/kernel/setup.c   |  2 ++
>  arch/x86/mm/mem_encrypt.c | 42 +++
>  include/linux/swiotlb.h   |  1 +
>  kernel/dma/swiotlb.c  | 27 +
>  4 files changed, 72 insertions(+)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3511736fbc74..b073d58dd4a3 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1166,6 +1166,8 @@ void __init setup_arch(char **cmdline_p)
>   if (boot_cpu_has(X86_FEATURE_GBPAGES))
>   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
>  
> + swiotlb_adjust();
> +
>   /*
>* Reserve memory for crash kernel after SRAT is parsed so that it
>* won't consume hotpluggable memory.
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 3f248f0d0e07..e0deb157cddd 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -489,7 +489,49 @@ static void print_mem_encrypt_feature_info(void)
>   pr_cont("\n");
>  }
>  
> +#define TOTAL_MEM_1G 0x40000000UL
> +#define TOTAL_MEM_4G 0x100000000UL
> +
> +#define SIZE_128M (128UL<<20)
> +#define SIZE_256M (256UL<<20)
> +#define SIZE_512M (512UL<<20)
> +
>  /* Architecture __weak replacement functions */
> +unsigned long __init arch_swiotlb_adjust(unsigned long iotlb_default_size)
> +{
> + unsigned long size = 0;
> +
> + /*
> +  * For SEV, all DMA has to occur via shared/unencrypted pages.
> +  * SEV uses SWIOTLB to make this happen without changing device
> +  * drivers. However, depending on the workload being run, the
> +  * default 64MB of SWIOTLB may not be enough & SWIOTLB may
> +  * run out of buffers for DMA, resulting in I/O errors and/or
> +  * performance degradation especially with high I/O workloads.
> +  * Increase the default size of SWIOTLB for SEV guests using
> +  * a minimum value of 128MB and a maximum value of 512MB,
> +  * depending on amount of provisioned guest memory.
> +  */
> + if (sev_active()) {
> + phys_addr_t total_mem = memblock_phys_mem_size();
> +
> + if (total_mem <= TOTAL_MEM_1G)
> + size = clamp(iotlb_default_size * 2, SIZE_128M,
> +  SIZE_128M);
> + else if (total_mem <= TOTAL_MEM_4G)
> + size = clamp(iotlb_default_size * 4, SIZE_256M,
> +  SIZE_256M);
> + else
> + size = clamp(iotlb_default_size * 8, SIZE_512M,
> +  SIZE_512M);
> +
> + pr_info("SEV adjusted max SWIOTLB size = %luMB",
> + size >> 20);
> + }
> +
> + return size;
> +}
> +
>  void __init mem_encrypt_init(void)
>  {
>   if (!sme_me_mask)
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 046bb94bd4d6..01ae6d891327 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -33,6 +33,7 @@ extern void swiotlb_init(int verbose);
>  int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose);
>  extern unsigned long swiotlb_nr_tbl(void);
>  unsigned long swiotlb_size_or_default(void);
> +extern void __init swiotlb_adjust(void);
>  extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
>  extern void __init swiotlb_update_mem_attributes(void);
>  
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c19379fabd20..66a9e627bb51 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -163,6 +163,33 @@ unsigned long 

Re: [PATCH] fix swiotlb panic on Xen

2020-10-27 Thread Konrad Rzeszutek Wilk
> As the person who first found this and then confirmed this fixes a bug:
> 
> Tested-by: Elliott Mitchell 

Thank you!!

I changed the title and added the various tags and will put it in
linux-next later this week.

>From a1eb2768bf5954d25aa0f0136b38f0aa5d92d984 Mon Sep 17 00:00:00 2001
From: Stefano Stabellini 
Date: Mon, 26 Oct 2020 17:02:14 -0700
Subject: [PATCH] swiotlb: fix "x86: Don't panic if can not alloc buffer for
 swiotlb"

kernel/dma/swiotlb.c:swiotlb_init gets called first and tries to
allocate a buffer for the swiotlb. It does so by calling

  memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE);

If the allocation fails, no_iotlb_memory is set.

Later during initialization swiotlb-xen comes in
(drivers/xen/swiotlb-xen.c:xen_swiotlb_init) and given that io_tlb_start
is != 0, it thinks the memory is ready to use when actually it is not.

When the swiotlb is actually needed, swiotlb_tbl_map_single gets called
and since no_iotlb_memory is set the kernel panics.

Instead, if swiotlb-xen.c:xen_swiotlb_init knew the swiotlb hadn't been
initialized, it would do the initialization itself, which might still
succeed.

Fix the panic by setting io_tlb_start to 0 on swiotlb initialization
failure, and also by setting no_iotlb_memory to false on swiotlb
initialization success.

Fixes: ac2cbab21f31 ("x86: Don't panic if can not alloc buffer for swiotlb")

Reported-by: Elliott Mitchell 
Tested-by: Elliott Mitchell 
Signed-off-by: Stefano Stabellini 
Reviewed-by: Christoph Hellwig 
CC: sta...@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk 
---
 kernel/dma/swiotlb.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 465a567678d9..e08cac39c0ba 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -229,6 +229,7 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
io_tlb_orig_addr[i] = INVALID_PHYS_ADDR;
}
io_tlb_index = 0;
+   no_iotlb_memory = false;
 
if (verbose)
swiotlb_print_info();
@@ -260,9 +261,11 @@ swiotlb_init(int verbose)
if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose))
return;
 
-   if (io_tlb_start)
+   if (io_tlb_start) {
memblock_free_early(io_tlb_start,
PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
+   io_tlb_start = 0;
+   }
pr_warn("Cannot allocate buffer");
no_iotlb_memory = true;
 }
@@ -360,6 +363,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
io_tlb_orig_addr[i] = INVALID_PHYS_ADDR;
}
io_tlb_index = 0;
+   no_iotlb_memory = false;
 
swiotlb_print_info();
 
-- 
2.13.6



Re: [PATCH] fix swiotlb panic on Xen

2020-10-27 Thread Konrad Rzeszutek Wilk
On Mon, Oct 26, 2020 at 05:02:14PM -0700, Stefano Stabellini wrote:
> From: Stefano Stabellini 
> 
> kernel/dma/swiotlb.c:swiotlb_init gets called first and tries to
> allocate a buffer for the swiotlb. It does so by calling
> 
>   memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE);
> 
> If the allocation fails, no_iotlb_memory is set.
> 
> 
> Later during initialization swiotlb-xen comes in
> (drivers/xen/swiotlb-xen.c:xen_swiotlb_init) and given that io_tlb_start
> is != 0, it thinks the memory is ready to use when actually it is not.
> 
> When the swiotlb is actually needed, swiotlb_tbl_map_single gets called
> and since no_iotlb_memory is set the kernel panics.
> 
> Instead, if swiotlb-xen.c:xen_swiotlb_init knew the swiotlb hadn't been
> initialized, it would do the initialization itself, which might still
> succeed.
> 
> 
> Fix the panic by setting io_tlb_start to 0 on swiotlb initialization
> failure, and also by setting no_iotlb_memory to false on swiotlb
> initialization success.

Should this have a Fixes: flag?

> 
> Signed-off-by: Stefano Stabellini 
> 
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c19379fabd20..9924214df60a 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -231,6 +231,7 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
> nslabs, int verbose)
>   io_tlb_orig_addr[i] = INVALID_PHYS_ADDR;
>   }
>   io_tlb_index = 0;
> + no_iotlb_memory = false;
>  
>   if (verbose)
>   swiotlb_print_info();
> @@ -262,9 +263,11 @@ swiotlb_init(int verbose)
>   if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose))
>   return;
>  
> - if (io_tlb_start)
> + if (io_tlb_start) {
>   memblock_free_early(io_tlb_start,
>   PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
> + io_tlb_start = 0;
> + }
>   pr_warn("Cannot allocate buffer");
>   no_iotlb_memory = true;
>  }
> @@ -362,6 +365,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long 
> nslabs)
>   io_tlb_orig_addr[i] = INVALID_PHYS_ADDR;
>   }
>   io_tlb_index = 0;
> + no_iotlb_memory = false;
>  
>   swiotlb_print_info();
>  


[GIT PULL] (swiotlb) stable/for-linus-5.10

2020-10-13 Thread Konrad Rzeszutek Wilk
Hey Linus,

Please git pull the following branch:

 git pull git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git 
stable/for-linus-5.10 

which has a minor enhancement of using %pa to print phys_addr_t variables and
also fixes compiler warnings.

Thank you!

 arch/x86/pci/sta2x11-fixup.c | 1 -
 include/linux/swiotlb.h  | 1 +
 kernel/dma/swiotlb.c | 6 ++
 3 files changed, 3 insertions(+), 5 deletions(-)

Andy Shevchenko (3):
  swiotlb: Use %pa to print phys_addr_t variables
  swiotlb: Declare swiotlb_late_init_with_default_size() in header
  swiotlb: Mark max_segment with static keyword



signature.asc
Description: PGP signature


Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-25 Thread Konrad Rzeszutek Wilk
On Fri, Sep 25, 2020 at 11:05:58AM +0800, Dave Young wrote:
> Hi,
> 
> On 09/24/20 at 01:16pm, boris.ostrov...@oracle.com wrote:
> > 
> > On 9/24/20 12:43 PM, Michael Kelley wrote:
> > > From: Eric W. Biederman  Sent: Thursday, September 
> > > 24, 2020 9:26 AM
> > >> Michael Kelley  writes:
> > >>
> > > Added Hyper-V people and people who created the param, it is below
> > > commit, I also want to remove it if possible, let's see how people
> > > think, but the least way should be to disable the auto setting in 
> > > both systemd
> > > and kernel:
> > >>> Hyper-V uses a notifier to inform the host system that a Linux VM has
> > >>> panic'ed.  Informing the host is particularly important in a public 
> > >>> cloud
> > >>> such as Azure so that the cloud software can alert the customer, and can
> > >>> track cloud-wide reliability statistics.   Whether a kdump is taken is 
> > >>> controlled
> > >>> entirely by the customer and how he configures the VM, and we want
> > >>> the host to be informed either way.
> > >> Why?
> > >>
> > >> Why does the host care?
> > >> Especially if the VM continues executing into a kdump kernel?
> > > The host itself doesn't care.  But the host is a convenient out-of-band
> > > channel for recording that a panic has occurred and to collect basic data
> > > about the panic.  This out-of-band channel is then used to notify the end
> > > customer that his VM has panic'ed.  Sure, the customer should be running
> > > his own monitoring software, but customers don't always do what they
> > > should.  Equally important, the out-of-band channel allows the cloud
> > > infrastructure software to notice trends, such as that the rate of Linux
> > > panics has increased, and that perhaps there is a cloud problem that
> > > should be investigated.
> > 
> > 
> > In many cases (especially in cloud environment) your dump device is remote 
> > (e.g. iscsi) and kdump sometimes (often?) gets stuck because of 
> > connectivity issues (which could be cause of the panic in the first place). 
> > So it is quite desirable to inform the infrastructure that the VM is on its 
> > way out without waiting for kdump to complete.
> 
> That can probably be done in the kdump kernel if it is really needed.  Say,
> informing the host that a panic happened and a kdump kernel is running.

If the kdump kernel gets to that point. Sometimes (sadly) it ends up being
misconfigured and chokes up - and hence having multiple ways to emit
the crash information before running the kdump kernel is a life-saver.

> 
> But I think to set crash_kexec_post_notifiers by default is still bad. 

Because of the way it is run today, I presume? If there were some
safe/unsafe policy, that should work, right? I would think that the
safe ones that work properly all the time are:

 - HyperV CRASH_MSRs,
 - KVM PVPANIC_[PANIC,CRASHLOAD] push button knob,
 - pstore EFI variables
 - Dumping in memory,

And then some that depend on firmware version (aka BIOS, and vendor) are:
 - ACPI ERST,

And then the unsafe:
 - s390, PowerPC (I don't actually know what they are but that
was Dave's primary motivator).

> 
> > 
> > 
> > >
> > >> Further like I have mentioned everytime something like this has come up
> > >> a call on the kexec on panic code path should be a direct call (That can
> > >> be audited) not something hidden in a notifier call chain (which can 
> > >> not).
> > >>
> > 
> > We btw already have a direct call from panic() to kmsg_dump() which is 
> > indirectly controlled by crash_kexec_post_notifiers, and it would also be 
> > preferable to be able to call it before kdump as well.
> 
> Right, that is the same thing we are talking about.
> 
> Thanks
> Dave
> 


Re: [PATCH] xen-blkback: add a parameter for disabling of persistent grants

2020-09-24 Thread Konrad Rzeszutek Wilk
.snip..
> > For the reason, I'd like to suggest to keep this as is for now and expand it
> > with the 'exceptions list' idea or something better, if a real use case 
> > comes
> > out later.
> 
> I agree. I'm happy to take patches to implement more fine grained
> control, but that shouldn't prevent us from having a global policy if
> that's useful to users.




Re: [PATCH] xen-blkback: add a parameter for disabling of persistent grants

2020-09-23 Thread Konrad Rzeszutek Wilk
On Tue, Sep 22, 2020 at 09:01:25AM +0200, SeongJae Park wrote:
> From: SeongJae Park 
> 
> Persistent grants feature provides high scalability.  On some small
> systems, however, it could incur data copy overhead[1] and thus it is
> required to be disabled.  But, there is no option to disable it.  For
> the reason, this commit adds a module parameter for disabling of the
> feature.

Would it be better to have it per guest?
> 
> [1] https://wiki.xen.org/wiki/Xen_4.3_Block_Protocol_Scalability
> 
> Signed-off-by: Anthony Liguori 
> Signed-off-by: SeongJae Park 
> ---
>  .../ABI/testing/sysfs-driver-xen-blkback|  8 
>  drivers/block/xen-blkback/xenbus.c  | 17 ++---
>  2 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
> b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> index ecb7942ff146..0c42285c75ee 100644
> --- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
> +++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> @@ -35,3 +35,11 @@ Description:
>  controls the duration in milliseconds that blkback will not
>  cache any page not backed by a grant mapping.
>  The default is 10ms.
> +
> +What:   /sys/module/xen_blkback/parameters/feature_persistent
> +Date:   September 2020
> +KernelVersion:  5.10
> +Contact:SeongJae Park 
> +Description:
> +Whether to enable the persistent grants feature or not.
> +The default is 1 (enable).
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index b9aa5d1ac10b..9c03d70469f4 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -879,6 +879,12 @@ static void reclaim_memory(struct xenbus_device *dev)
>  
>  /* ** Connection ** */
>  
> +/* Enable the persistent grants feature. */
> +static unsigned int feature_persistent = 1;
> +module_param_named(feature_persistent, feature_persistent, int, 0644);
> +MODULE_PARM_DESC(feature_persistent,
> + "Enables the persistent grants feature");
> +
>  /*
>   * Write the physical details regarding the block device to the store, and
>   * switch to Connected state.
> @@ -906,7 +912,8 @@ static void connect(struct backend_info *be)
>  
>   xen_blkbk_barrier(xbt, be, be->blkif->vbd.flush_support);
>  
> - err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);
> + err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u",
> + feature_persistent ? 1 : 0);
>   if (err) {
>   xenbus_dev_fatal(dev, err, "writing %s/feature-persistent",
>dev->nodename);
> @@ -1093,8 +1100,12 @@ static int connect_ring(struct backend_info *be)
>   xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
>   return -ENOSYS;
>   }
> - pers_grants = xenbus_read_unsigned(dev->otherend, "feature-persistent",
> -0);
> + if (feature_persistent)
> + pers_grants = xenbus_read_unsigned(dev->otherend,
> + "feature-persistent", 0);
> + else
> + pers_grants = 0;
> +
>   blkif->vbd.feature_gnt_persistent = pers_grants;
>   blkif->vbd.overflow_max_grants = 0;
>  
> -- 
> 2.17.1
> 


Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-23 Thread Konrad Rzeszutek Wilk
On Wed, Sep 23, 2020 at 10:43:29AM +0800, Dave Young wrote:
> + more people who may care about this param 

Paarty time!!

(See below, didn't snip any comments)
> On 09/21/20 at 08:45pm, Eric W. Biederman wrote:
> > Konrad Rzeszutek Wilk  writes:
> > 
> > > On Fri, Sep 18, 2020 at 05:47:43PM -0700, Andrew Morton wrote:
> > >> On Fri, 18 Sep 2020 11:25:46 +0800 Dave Young  wrote:
> > >> 
> > >> > crash_kexec_post_notifiers enables running various panic notifier
> > >> > before kdump kernel booting. This increases risks of kdump failure.
> > >> > It is well documented in kernel-parameters.txt. We do not suggest
> > >> > people to enable it together with kdump unless he/she is really sure.
> > >> > This is also not suggested to be enabled by default when users are
> > >> > not aware in distributions.
> > >> > 
> > >> > But unfortunately it is enabled by default in systemd, see below
> > >> > discussions in a systemd report, we can not convince systemd to change
> > >> > it:
> > >> > https://github.com/systemd/systemd/issues/16661
> > >> > 
> > >> > Actually we have got reports about kdump kernel hangs in both s390x
> > >> > and powerpcle cases caused by the systemd change,  also some x86 cases
> > >> > could also be caused by the same (although that is in Hyper-V code
> > >> > instead of systemd, that need to be addressed separately).
> > >
> > > Perhaps it may be better to fix the issues on s390x and PowerPC as well?
> > >
> > >> > 
> > >> > Thus to avoid the auto enablement here just disable the param writable
> > >> > permission in sysfs.
> > >> > 
> > >> 
> > >> Well.  I don't think this is at all a desirable way of resolving a
> > >> disagreement with the systemd developers
> > >> 
> > >> At the above github address I'm seeing "ryncsn added a commit to
> > >> ryncsn/systemd that referenced this issue 9 days ago", "pstore: don't
> > >> enable crash_kexec_post_notifiers by default".  So didn't that address
> > >> the issue?
> > >
> > > It does in systemd, but there is a strong interest in making this on
> > > by default.
> > 
> > There is also a strong interest in removing this code entirely from the
> > kernel.
> 
> Added Hyper-V people and people who created the param, it is below
> commit, I also want to remove it if possible, let's see how people
> think, but the least way should be to disable the auto setting in both systemd
> and kernel:
> 
> commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45
> Author: Masami Hiramatsu 
> Date:   Fri Jun 6 14:37:07 2014 -0700
> 
> kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump 
> after panic_notifers
> 
> Add a "crash_kexec_post_notifiers" boot option to run kdump after
> running panic_notifiers and dump kmsg.  This can help rare situations
> where kdump fails because of unstable crashed kernel or hardware 
> failure
> (memory corruption on critical data/code), or the 2nd kernel is 
> already
> broken by the 1st kernel (it's a broken behavior, but who can 
> guarantee
> that the "crashed" kernel works correctly?).
> 
> Usage: add "crash_kexec_post_notifiers" to kernel boot option.
> 
> Note that this actually increases risks of the failure of kdump.  This
> option should be set only if you worry about the rare case of kdump
> failure rather than increasing the chance of success.


If this is such a risky knob that leads to bugs where folks are backing away
from it with disgust on their faces - then perhaps the only way to go about
this is - limit the exposure to known working situations on firmware
that we can control?

That is, enable only a subset of post notifiers which determine if they
are OK running if the conditions are blessed?

I think that would satisfy the conditions where you have to deal with unsavory
bugs that end up on your plate - and aren't fun because there is no
way of fixing them - but at the same time allow multiple ways to save the
crash?

Please don't take away something that is quite useful in the field. Can we
hammer out something that will remove your pain points?
> 
> > 
> > This failure is a case in point.
> > 
> > I think I am at my I told you so point.  This is what all of the testing
> > over all the years has said.  Leaving functionality to the peculiarities
> > of firmware when you don't have to, and can actually control what is
> > going on doesn't work.
> > 
> > Eric
> > 
> > 
> 
> Thanks
> Dave
> 


Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-21 Thread Konrad Rzeszutek Wilk
On Fri, Sep 18, 2020 at 05:47:43PM -0700, Andrew Morton wrote:
> On Fri, 18 Sep 2020 11:25:46 +0800 Dave Young  wrote:
> 
> > crash_kexec_post_notifiers enables running various panic notifier
> > before kdump kernel booting. This increases risks of kdump failure.
> > It is well documented in kernel-parameters.txt. We do not suggest
> > people to enable it together with kdump unless he/she is really sure.
> > This is also not suggested to be enabled by default when users are
> > not aware in distributions.
> > 
> > But unfortunately it is enabled by default in systemd, see below
> > discussions in a systemd report, we can not convince systemd to change
> > it:
> > https://github.com/systemd/systemd/issues/16661
> > 
> > Actually we have got reports about kdump kernel hangs in both s390x
> > and powerpcle cases caused by the systemd change,  also some x86 cases
> > could also be caused by the same (although that is in Hyper-V code
> > instead of systemd, that need to be addressed separately).

Perhaps it may be better to fix the issues on s390x and PowerPC as well?

> > 
> > Thus to avoid the auto enablement here just disable the param writable
> > permission in sysfs.
> > 
> 
> Well.  I don't think this is at all a desirable way of resolving a
> disagreement with the systemd developers
> 
> At the above github address I'm seeing "ryncsn added a commit to
> ryncsn/systemd that referenced this issue 9 days ago", "pstore: don't
> enable crash_kexec_post_notifiers by default".  So didn't that address
> the issue?

It does in systemd, but there is a strong interest in making this on by default.


Re: [PATCH V2] dma-direct: Fix potential NULL pointer dereference

2020-09-17 Thread Konrad Rzeszutek Wilk
On Wed, Sep 16, 2020 at 02:51:06PM -0600, Thomas Tai wrote:
> When booting the kernel v5.9-rc4 on a VM, the kernel would panic when
> printing a warning message in swiotlb_map(). The dev->dma_mask must not
> be a NULL pointer when calling the dma mapping layer. A NULL pointer check
> can potentially avoid the panic.
> 
> [drm] Initialized virtio_gpu 0.1.0 0 for virtio0 on minor 0
>  BUG: kernel NULL pointer dereference, address: 
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x) - not-present page
>  PGD 0 P4D 0
>  Oops:  [#1] SMP PTI
>  CPU: 1 PID: 331 Comm: systemd-udevd Not tainted 5.9.0-rc4 #1
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>  BIOS 1.13.0-1ubuntu1 04/01/2014
>  RIP: 0010:swiotlb_map+0x1ac/0x200
>  Code: e8 d9 fc ff ff 80 3d 92 ee 4c 01 00 75 51 49 8b 84 24 48 02 00 00
>  4d 8b 6c 24 50 c6 05 7c ee 4c 01 01 4d 8b bc 24 58 02 00 00 <4c> 8b 30
>  4d 85 ed 75 04 4d 8b 2c 24 4c 89 e7 e8 10 6b 4f 00 4d 89
>  RSP: 0018:9f96801af6f8 EFLAGS: 00010246
>  RAX:  RBX: 1000 RCX: 0080
>  RDX: 007f RSI: 0202 RDI: 0202
>  RBP: 9f96801af748 R08:  R09: 0020
>  R10:  R11: 8fabfffa3000 R12: 8faad02c7810
>  R13:  R14: 0020 R15: 
>  FS:  7fabc63588c0() GS:8fabf7c8()
>  knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2:  CR3: 000151496005 CR4: 00370ee0
>  DR0:  DR1:  DR2: 
>  DR3:  DR6: fffe0ff0 DR7: 0400
>  Call Trace:
>   dma_direct_map_sg+0x124/0x210
>   dma_map_sg_attrs+0x32/0x50
>   drm_gem_shmem_get_pages_sgt+0x6a/0x90 [drm]
>   virtio_gpu_object_create+0x140/0x2f0 [virtio_gpu]
>   ? ww_mutex_unlock+0x26/0x30
>   virtio_gpu_mode_dumb_create+0xab/0x160 [virtio_gpu]
>   drm_mode_create_dumb+0x82/0x90 [drm]
>   drm_client_framebuffer_create+0xaa/0x200 [drm]
>   drm_fb_helper_generic_probe+0x59/0x150 [drm_kms_helper]
>   drm_fb_helper_single_fb_probe+0x29e/0x3e0 [drm_kms_helper]
>   __drm_fb_helper_initial_config_and_unlock+0x41/0xd0 [drm_kms_helper]
>   drm_fbdev_client_hotplug+0xe6/0x1a0 [drm_kms_helper]
>   drm_fbdev_generic_setup+0xaf/0x170 [drm_kms_helper]
>   virtio_gpu_probe+0xea/0x100 [virtio_gpu]
>   virtio_dev_probe+0x14b/0x1e0 [virtio]
>   really_probe+0x1db/0x440
>   driver_probe_device+0xe9/0x160
>   device_driver_attach+0x5d/0x70
>   __driver_attach+0x8f/0x150
>   ? device_driver_attach+0x70/0x70
>   bus_for_each_dev+0x7e/0xc0
>   driver_attach+0x1e/0x20
>   bus_add_driver+0x152/0x1f0
>   driver_register+0x74/0xd0
>   ? 0xc0529000
>   register_virtio_driver+0x20/0x30 [virtio]
>   virtio_gpu_driver_init+0x15/0x1000 [virtio_gpu]
>   do_one_initcall+0x4a/0x1fa
>   ? _cond_resched+0x19/0x30
>   ? kmem_cache_alloc_trace+0x16b/0x2e0
>   do_init_module+0x62/0x240
>   load_module+0xe0e/0x1100
>   ? security_kernel_post_read_file+0x5c/0x70
>   __do_sys_finit_module+0xbe/0x120
>   ? __do_sys_finit_module+0xbe/0x120
>   __x64_sys_finit_module+0x1a/0x20
>   do_syscall_64+0x38/0x50
>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Signed-off-by: Thomas Tai 

Reviewed-by: Konrad Rzeszutek Wilk 

Thank you!
> ---
>  include/linux/dma-direct.h |  3 ---
>  kernel/dma/mapping.c   | 11 +++
>  2 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> index 6e87225..0648708 100644
> --- a/include/linux/dma-direct.h
> +++ b/include/linux/dma-direct.h
> @@ -62,9 +62,6 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size,
>  {
>   dma_addr_t end = addr + size - 1;
>  
> - if (!dev->dma_mask)
> - return false;
> -
>   if (is_ram && !IS_ENABLED(CONFIG_ARCH_DMA_ADDR_T_64BIT) &&
>   min(addr, end) < phys_to_dma(dev, PFN_PHYS(min_low_pfn)))
>   return false;
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index 0d12942..7133d5c 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -144,6 +144,10 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct 
> page *page,
>   dma_addr_t addr;
>  
>   BUG_ON(!valid_dma_direction(dir));
> +
> + if (WARN_ON_ONCE(!dev->dma_mask))
> + return DMA_MAPPING_ERROR;
> +
>   if (dma_map_direct(dev, ops))
>   addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
>   else
> @@ -179,6 +183,10 @@ i

Is: virtio_gpu_object_shmem_init issues? Was:Re: upstream boot error: general protection fault in swiotlb_map

2020-08-24 Thread Konrad Rzeszutek Wilk
On Thu, Aug 06, 2020 at 03:46:23AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:47ec5303 Merge git://git.kernel.org/pub/scm/linux/kernel/g..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=16fe1dea90
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7c06047f622c5724
> dashboard link: https://syzkaller.appspot.com/bug?extid=3f86afd0b1e4bf1cb64c
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+3f86afd0b1e4bf1cb...@syzkaller.appspotmail.com
> 
> ceph: loaded (mds proto 32)
> NET: Registered protocol family 38
> async_tx: api initialized (async)
> Key type asymmetric registered
> Asymmetric key parser 'x509' registered
> Asymmetric key parser 'pkcs8' registered
> Key type pkcs7_test registered
> Asymmetric key parser 'tpm_parser' registered
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 243)
> io scheduler mq-deadline registered
> io scheduler kyber registered
> io scheduler bfq registered
> hgafb: HGA card not detected.
> hgafb: probe of hgafb.0 failed with error -22
> usbcore: registered new interface driver udlfb
> uvesafb: failed to execute /sbin/v86d
> uvesafb: make sure that the v86d helper is installed and executable
> uvesafb: Getting VBE info block failed (eax=0x4f00, err=-2)
> uvesafb: vbe_init() failed with -22
> uvesafb: probe of uvesafb.0 failed with error -22
> vga16fb: mapped to 0x8aac772d
> Console: switching to colour frame buffer device 80x30
> fb0: VGA16 VGA frame buffer device
> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> ACPI: Power Button [PWRF]
> ioatdma: Intel(R) QuickData Technology Driver 5.00
> PCI Interrupt Link [GSIF] enabled at IRQ 21
> PCI Interrupt Link [GSIG] enabled at IRQ 22
> PCI Interrupt Link [GSIH] enabled at IRQ 23
> N_HDLC line discipline registered with maxframe=4096
> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> 00:05: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> Cyclades driver 2.6
> Initializing Nozomi driver 2.1d
> RocketPort device driver module, version 2.09, 12-June-2003
> No rocketport ports found; unloading driver
> Non-volatile memory driver v1.3
> Linux agpgart interface v0.103
> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> usbcore: registered new interface driver udl
> [drm] pci: virtio-vga detected at :00:01.0
> fb0: switching to virtiodrmfb from VGA16 VGA
> Console: switching to colour VGA+ 80x25
> virtio-pci :00:01.0: vgaarb: deactivate vga console
> Console: switching to colour dummy device 80x25
> [drm] features: -virgl +edid
> [drm] number of scanouts: 1
> [drm] number of cap sets: 0
> [drm] Initialized virtio_gpu 0.1.0 0 for virtio0 on minor 2
> general protection fault, probably for non-canonical address 
> 0xdc00:  [#1] PREEMPT SMP KASAN
> KASAN: null-ptr-deref in range [0x-0x0007]
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.8.0-syzkaller #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> RIP: 0010:swiotlb_map+0x5ac/0x700 kernel/dma/swiotlb.c:683
> Code: 28 04 00 00 48 c1 ea 03 80 3c 02 00 0f 85 4d 01 00 00 4c 8b a5 18 04 00 
> 00 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 0f 85 1e 
> 01 00 00 48 8d 7d 50 4d 8b 24 24 48 b8 00 00
> RSP: :c934f3e0 EFLAGS: 00010246
> RAX: dc00 RBX:  RCX: 8162cc1d
> RDX:  RSI: 8162cc98 RDI: 88802971a470
> RBP: 88802971a048 R08: 0001 R09: 8c5dba77
> R10:  R11:  R12: 
> R13: 7ac0 R14: dc00 R15: 1000
> FS:  () GS:88802ce0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2:  CR3: 09a8d000 CR4: 00350ef0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  dma_direct_map_page include/linux/dma-direct.h:170 [inline]
>  dma_direct_map_sg+0x3bb/0x670 kernel/dma/direct.c:368
>  dma_map_sg_attrs+0xd0/0x160 kernel/dma/mapping.c:183
>  drm_gem_shmem_get_pages_sgt drivers/gpu/drm/drm_gem_shmem_helper.c:700 
> [inline]
>  drm_gem_shmem_get_pages_sgt+0x1fc/0x310 
> drivers/gpu/drm/drm_gem_shmem_helper.c:679
>  virtio_gpu_object_shmem_init drivers/gpu/drm/virtio/virtgpu_object.c:153 
> [inline]
>  virtio_gpu_object_create+0x2fd/0xa70 
> drivers/gpu/drm/virtio/virtgpu_object.c:232
>  virtio_gpu_gem_create drivers/gpu/drm/virtio/virtgpu_gem.c:45 [inline]
>  virtio_gpu_mode_dumb_create+0x298/0x530 
> drivers/gpu/drm/virtio/virtgpu_gem.c:85
>  

Re: [PATCH] x86/pci: fix xen.c build error when CONFIG_ACPI is not set

2020-08-20 Thread Konrad Rzeszutek Wilk
On Wed, Aug 19, 2020 at 08:09:11PM -0700, Randy Dunlap wrote:
> Hi Konrad,

Hey Randy,

I believe Juergen is picking this up.
> 
> ping.
> 
> I am still seeing this build error. It looks like this is
> in your territory to merge...
> 
> 
> On 8/13/20 4:00 PM, Randy Dunlap wrote:
> > From: Randy Dunlap 
> > 
> > Fix build error when CONFIG_ACPI is not set/enabled:
> > 
> > ../arch/x86/pci/xen.c: In function ‘pci_xen_init’:
> > ../arch/x86/pci/xen.c:410:2: error: implicit declaration of function 
> > ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? 
> > [-Werror=implicit-function-declaration]
> >   acpi_noirq_set();
> > 
> > Fixes: 88e9ca161c13 ("xen/pci: Use acpi_noirq_set() helper to avoid #ifdef")
> > Signed-off-by: Randy Dunlap 
> > Cc: Andy Shevchenko 
> > Cc: Bjorn Helgaas 
> > Cc: Konrad Rzeszutek Wilk 
> > Cc: xen-de...@lists.xenproject.org
> > Cc: linux-...@vger.kernel.org
> > ---
> >  arch/x86/pci/xen.c |1 +
> >  1 file changed, 1 insertion(+)
> > 
> > --- linux-next-20200813.orig/arch/x86/pci/xen.c
> > +++ linux-next-20200813/arch/x86/pci/xen.c
> > @@ -26,6 +26,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  
> >  static int xen_pcifront_enable_irq(struct pci_dev *dev)
> > 
> 
> 
> thanks.
> -- 
> ~Randy
> 


Re: [PATCH v3] powerpc/pseries/svm: Allocate SWIOTLB buffer anywhere in memory

2020-08-19 Thread Konrad Rzeszutek Wilk
On Tue, Aug 18, 2020 at 07:11:26PM -0300, Thiago Jung Bauermann wrote:
> POWER secure guests (i.e., guests which use the Protection Execution
> Facility) need to use SWIOTLB to be able to do I/O with the hypervisor, but
> they don't need the SWIOTLB memory to be in low addresses since the
> hypervisor doesn't have any addressing limitation.
> 
> This solves a SWIOTLB initialization problem we are seeing in secure guests
> with 128 GB of RAM: they are configured with 4 GB of crashkernel reserved
> memory, which leaves no space for SWIOTLB in low addresses.
> 
> To do this, we use mostly the same code as swiotlb_init(), but allocate the
> buffer using memblock_alloc() instead of memblock_alloc_low().
> 
> Signed-off-by: Thiago Jung Bauermann 

Reviewed-by: Konrad Rzeszutek Wilk 


> ---
>  arch/powerpc/include/asm/svm.h   |  4 
>  arch/powerpc/mm/mem.c|  6 +-
>  arch/powerpc/platforms/pseries/svm.c | 26 ++
>  3 files changed, 35 insertions(+), 1 deletion(-)
> 
> Changes from v2:
> - Panic if unable to allocate buffer, as suggested by Christoph.
> 
> Changes from v1:
> - Open-code swiotlb_init() in arch-specific code, as suggested by
>   Christoph.
> 
> diff --git a/arch/powerpc/include/asm/svm.h b/arch/powerpc/include/asm/svm.h
> index 85580b30aba4..7546402d796a 100644
> --- a/arch/powerpc/include/asm/svm.h
> +++ b/arch/powerpc/include/asm/svm.h
> @@ -15,6 +15,8 @@ static inline bool is_secure_guest(void)
>   return mfmsr() & MSR_S;
>  }
>  
> +void __init svm_swiotlb_init(void);
> +
>  void dtl_cache_ctor(void *addr);
>  #define get_dtl_cache_ctor() (is_secure_guest() ? dtl_cache_ctor : NULL)
>  
> @@ -25,6 +27,8 @@ static inline bool is_secure_guest(void)
>   return false;
>  }
>  
> +static inline void svm_swiotlb_init(void) {}
> +
>  #define get_dtl_cache_ctor() NULL
>  
>  #endif /* CONFIG_PPC_SVM */
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index c2c11eb8dcfc..0f21bcb16405 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -50,6 +50,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -290,7 +291,10 @@ void __init mem_init(void)
>* back to to-down.
>*/
>   memblock_set_bottom_up(true);
> - swiotlb_init(0);
> + if (is_secure_guest())
> + svm_swiotlb_init();
> + else
> + swiotlb_init(0);
>  #endif
>  
>   high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
> diff --git a/arch/powerpc/platforms/pseries/svm.c 
> b/arch/powerpc/platforms/pseries/svm.c
> index 40c0637203d5..81085eb8f225 100644
> --- a/arch/powerpc/platforms/pseries/svm.c
> +++ b/arch/powerpc/platforms/pseries/svm.c
> @@ -7,6 +7,7 @@
>   */
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -34,6 +35,31 @@ static int __init init_svm(void)
>  }
>  machine_early_initcall(pseries, init_svm);
>  
> +/*
> + * Initialize SWIOTLB. Essentially the same as swiotlb_init(), except that it
> + * can allocate the buffer anywhere in memory. Since the hypervisor doesn't 
> have
> + * any addressing limitation, we don't need to allocate it in low addresses.
> + */
> +void __init svm_swiotlb_init(void)
> +{
> + unsigned char *vstart;
> + unsigned long bytes, io_tlb_nslabs;
> +
> + io_tlb_nslabs = (swiotlb_size_or_default() >> IO_TLB_SHIFT);
> + io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
> +
> + bytes = io_tlb_nslabs << IO_TLB_SHIFT;
> +
> + vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
> + if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
> + return;
> +
> + if (io_tlb_start)
> + memblock_free_early(io_tlb_start,
> + PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
> + panic("SVM: Cannot allocate SWIOTLB buffer");
> +}
> +
>  int set_memory_encrypted(unsigned long addr, int numpages)
>  {
>   if (!PAGE_ALIGNED(addr))


Re: [PATCH RFC 0/3] io_uring: add restrictions to support untrusted applications and guests

2020-07-10 Thread Konrad Rzeszutek Wilk
.snip..
> Just to recap the proposal, the idea is to add some restrictions to the
> operations (sqe, register, fixed file) to safely allow untrusted applications
> or guests to use io_uring queues.

Hi!

This is neat and quite cool - but one thing that keeps nagging me is
how much overhead does this cut from the existing setup when you use
virtio (with guests obviously)? That is, from a high-level view, the
beauty of io_uring being passed into the guest is that you don't have the
virtio ring -> io_uring processing, right?

Thanks!


Re: [PATCH v2] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-06-24 Thread Konrad Rzeszutek Wilk
.snip..
> > Actually as per the boot flow :
> > 
> > setup_arch() calls reserve_crashkernel() and pci_iommu_alloc() is
> > invoked through mm_init()/mem_init() and not via initmem_init().
> > 
> > start_kernel:
> > ...
> > setup_arch()
> > reserve_crashkernel
> > reserve_crashkernel_low
> > -> swiotlb_size_or_default
> > 
> > ...
> > ...
> > mm_init()
> > mem_init()
> > pci_iommu_alloc
> > -> pci_swiotlb_detect_4gb
> > -> swiotlb_init
> > 
> > So as per the above boot flow, reserve_crashkernel() can get called
> > before swiotlb_detect/init, and hence, if we don't fixup or adjust
> > the SWIOTLB buffer size in swiotlb_size_or_default() then crash kernel
> > will reserve memory which will conflict/overlap with any SWIOTLB bounce
> > buffer allocated memory (adjusted or fixed up later).
> > 
> > Therefore, we need to adjust/fixup SWIOTLB bounce buffer memory in
> > swiotlb_size_or_default() function itself, before swiotlb detect/init
> > functions get invoked.
> > 
> 
> Also to add here, it looks like swiotlb_size_or_default() is an
> interface function to get the SWIOTLB bounce buffer size for components
> which are initialized before swiotlb_detect/init, so that these 
> components can reserve or allocate their memory requirements with the
> knowledge of how much memory the SWIOTLB bounce buffers are going to use, so
> therefore, any fixups or adjustments to SWIOTLB buffer size will need
> to be made as part of swiotlb_size_or_default(). 

That was never its purpose. It was meant as a way to figure out the segment
size for DMA requests and to be used by runtime components. In fact, it is
meant to be idempotent.

This is why I am disliking this usage and leaning towards something else.

But you pointed out something interesting - you do in fact need to adjust the
size of the swiotlb based on your needs at bootup. Not any different
from, say, the 'swiotlb' parameter.

Why not have a 'swiotlb_adjust' that is an __init function which modifies the
internal swiotlb buffer sizes? Obviously we have to account for 'swiotlb'
parameter parsing as well; swiotlb_adjust would pick the max of the two.
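
To make that concrete, here is a rough sketch of such a hook. The name follows
the suggestion above, but the exact interaction with io_tlb_nslabs and the
message are assumptions for illustration, not a settled interface:

void __init swiotlb_adjust(unsigned long size)
{
	/*
	 * Sketch only: honour an earlier 'swiotlb=' parameter by keeping
	 * whichever request is larger.
	 */
	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);

	if (io_tlb_nslabs >= nslabs)
		return;

	io_tlb_nslabs = nslabs;
	pr_info("software IO TLB: size adjusted to %lu MB\n", size >> 20);
}

A secure guest (or SEV setup code) could then call swiotlb_adjust() from its
early arch code, before swiotlb_init() allocates the buffer.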



Re: [PATCH] ibft: Replace zero-length array with flexible-array

2020-06-23 Thread Konrad Rzeszutek Wilk
On Thu, May 21, 2020 at 01:32:35PM -0500, Gustavo A. R. Silva wrote:
> Hi all,
> 
> Friendly ping: who can take this?

It is in the tree. Let me send out a git pull to Linus in a week or two.

Thanks!
> 
> Thanks
> --
> Gustavo
> 
> On Thu, May 07, 2020 at 01:55:44PM -0500, Gustavo A. R. Silva wrote:
> > The current codebase makes use of the zero-length array language
> > extension to the C90 standard, but the preferred mechanism to declare
> > variable-length types such as these ones is a flexible array member[1][2],
> > introduced in C99:
> > 
> > struct foo {
> > int stuff;
> > struct boo array[];
> > };
> > 
> > By making use of the mechanism above, we will get a compiler warning
> > in case the flexible array does not occur last in the structure, which
> > will help us prevent some kind of undefined behavior bugs from being
> > inadvertently introduced[3] to the codebase from now on.
> > 
> > Also, notice that, dynamic memory allocations won't be affected by
> > this change:
> > 
> > "Flexible array members have incomplete type, and so the sizeof operator
> > may not be applied. As a quirk of the original implementation of
> > zero-length arrays, sizeof evaluates to zero."[1]
> > 
> > sizeof(flexible-array-member) triggers a warning because flexible array
> > members have incomplete type[1]. There are some instances of code in
> > which the sizeof operator is being incorrectly/erroneously applied to
> > zero-length arrays and the result is zero. Such instances may be hiding
> > some bugs. So, this work (flexible-array member conversions) will also
> > help to get completely rid of those sorts of issues.
> > 
> > This issue was found with the help of Coccinelle.
> > 
> > [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> > [2] https://github.com/KSPP/linux/issues/21
> > [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
> > 
> > Signed-off-by: Gustavo A. R. Silva 
> > ---
> >  drivers/firmware/iscsi_ibft.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/firmware/iscsi_ibft.c b/drivers/firmware/iscsi_ibft.c
> > index 96758b71a8db..7127a04bca19 100644
> > --- a/drivers/firmware/iscsi_ibft.c
> > +++ b/drivers/firmware/iscsi_ibft.c
> > @@ -104,7 +104,7 @@ struct ibft_control {
> > u16 tgt0_off;
> > u16 nic1_off;
> > u16 tgt1_off;
> > -   u16 expansion[0];
> > +   u16 expansion[];
> >  } __attribute__((__packed__));
> >  
> >  struct ibft_initiator {
> > 
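
To illustrate the sizeof() behaviour described above, a standalone sketch
(not part of the patch):

struct old_style {
	int stuff;
	u16 expansion[0];	/* GNU zero-length array: sizeof(expansion) is 0 */
};

struct new_style {
	int stuff;
	u16 expansion[];	/* flexible array member: sizeof(expansion) is diagnosed */
};

/* Both structs have the same size; only sizeof applied to the member differs. */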


Re: [PATCH v2] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-06-23 Thread Konrad Rzeszutek Wilk
On Mon, Apr 27, 2020 at 06:53:18PM +, Ashish Kalra wrote:
> Hello Konrad,
> 
> On Mon, Mar 30, 2020 at 10:25:51PM +, Ashish Kalra wrote:
> > Hello Konrad,
> > 
> > On Tue, Mar 03, 2020 at 12:03:53PM -0500, Konrad Rzeszutek Wilk wrote:
> > > On Tue, Feb 04, 2020 at 07:35:00PM +, Ashish Kalra wrote:
> > > > Hello Konrad,
> > > > 
> > > > Looking fwd. to your feedback regarding support of other memory
> > > > encryption architectures such as Power, S390, etc.
> > > > 
> > > > Thanks,
> > > > Ashish
> > > > 
> > > > On Fri, Jan 24, 2020 at 11:00:08PM +, Ashish Kalra wrote:
> > > > > On Tue, Jan 21, 2020 at 03:54:03PM -0500, Konrad Rzeszutek Wilk wrote:
> > > > > > > 
> > > > > > > Additional memory calculations based on # of PCI devices and
> > > > > > > their memory ranges will make it more complicated with so
> > > > > > > many other permutations and combinations to explore, it is
> > > > > > > essential to keep this patch as simple as possible by 
> > > > > > > adjusting the bounce buffer size simply by determining it
> > > > > > > from the amount of provisioned guest memory.
> > > > > >> 
> > > > > >> Please rework the patch to:
> > > > > >> 
> > > > > >>  - Use a log solution instead of the multiplication.
> > > > > >>Feel free to cap it at a sensible value.
> > > > > 
> > > > > Ok.
> > > > > 
> > > > > >> 
> > > > > >>  - Also the code depends on SWIOTLB calling in to the
> > > > > >>adjust_swiotlb_default_size which looks wrong.
> > > > > >> 
> > > > > >>You should not adjust io_tlb_nslabs from swiotlb_size_or_default.
> > > > > 
> > > > > >>That function's purpose is to report a value.
> > > > > >> 
> > > > > >>  - Make io_tlb_nslabs be visible outside of the SWIOTLB code.
> > > > > >> 
> > > > > >>  - Can you utilize the IOMMU_INIT APIs and have your own detect which would
> > > > > >>modify the io_tlb_nslabs (and set swiotlb=1?).
> > > > > 
> > > > > This seems to be a nice option, but then IOMMU_INIT APIs are
> > > > > x86-specific and this swiotlb buffer size adjustment is also needed
> > > > > for other memory encryption architectures like Power, S390, etc.
> > > 
> > > Oh dear. That I hadn't considered.
> > > > > 
> > > > > >> 
> > > > > >>Actually you seem to be piggybacking on pci_swiotlb_detect_4gb - so
> > > > > >>perhaps add in this code? Albeit it really should be in its own
> > > > > >>file, not in arch/x86/kernel/pci-swiotlb.c
> > > > > 
> > > > > Actually, we piggyback on pci_swiotlb_detect_override which sets
> > > > > swiotlb=1 as x86_64_start_kernel() and invocation of sme_early_init()
> > > > > forces swiotlb on, but again this is all x86 architecture specific.
> > > 
> > > Then it looks like the best bet is to do it from within swiotlb_init?
> > > We really can't do it from swiotlb_size_or_default - that function
> > > should just return a value and nothing else.
> > > 
> > 
> > Actually, we need to do it in swiotlb_size_or_default() as this gets called by
> > reserve_crashkernel_low() in arch/x86/kernel/setup.c and used to
> > reserve low crashkernel memory. If we adjust swiotlb size later in
> > swiotlb_init() which gets called later than reserve_crashkernel_low(),
> > then any swiotlb size changes/expansion will conflict/overlap with the
> > low memory reserved for crashkernel.
> > 
> and will also potentially cause SWIOTLB buffer allocation failures.
> 
> Do you have any feedback, comments on the above ?


The init boot chain looks like this:

initmem_init
pci_iommu_alloc
-> pci_swiotlb_detect_4gb
-> swiotlb_init

reserve_crashkernel
reserve_crashkernel_low
-> swiotlb_size_or_default
..


(rootfs code):
pci_iommu_init
-> a bunch of the other 

Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-11 Thread Konrad Rzeszutek Wilk
On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:
> As testing shows no performance change, switch to that now.

What kind of testing? 100GiB? Low latency?



Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-30 Thread Konrad Rzeszutek Wilk
On Wed, Apr 29, 2020 at 06:20:48AM -0400, Michael S. Tsirkin wrote:
> On Wed, Apr 29, 2020 at 03:39:53PM +0530, Srivatsa Vaddagiri wrote:
> > That would still not work I think where swiotlb is used for pass-through devices
> > (when private memory is fine) as well as virtio devices (when shared memory is
> > required).
> 
> So that is a separate question. When there are multiple untrusted
> devices, at the moment it looks like a single bounce buffer is used.
> 
> Which to me seems like a security problem, I think we should protect
> untrusted devices from each other.

There are two DMA pool implementations in Linux already - the TTM one for graphics
and mm/dmapool.c - could those be used instead? Or at least augmented?
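
For what it's worth, a minimal sketch of what a per-device bounce pool on top
of mm/dmapool.c could look like - the helper names and the one-pool-per-device
layout are assumptions for illustration, not an existing interface:

#include <linux/dmapool.h>

/* One private pool per untrusted device, so devices cannot read each
 * other's bounce buffers. */
static struct dma_pool *bounce_pool_create(struct device *dev)
{
	return dma_pool_create("bounce", dev, PAGE_SIZE, PAGE_SIZE, 0);
}

static void *bounce_alloc(struct dma_pool *pool, dma_addr_t *dma_handle)
{
	return dma_pool_alloc(pool, GFP_ATOMIC, dma_handle);
}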


Re: [PATCH] xen/swiotlb: correct the check for xen_destroy_contiguous_region

2020-04-28 Thread Konrad Rzeszutek Wilk
On Tue, Apr 28, 2020 at 12:19:41PM +0200, Jürgen Groß wrote:
> On 28.04.20 10:25, Peng Fan wrote:

Adding Joe Jin.

Joe, didn't you have some ideas on how this could be implemented?

> > > Subject: Re: [PATCH] xen/swiotlb: correct the check for
> > > xen_destroy_contiguous_region
> > > 
> > > On 28.04.20 09:33, peng@nxp.com wrote:
> > > > From: Peng Fan 
> > > > 
> > > > When booting xen on i.MX8QM, met:
> > > > "
> > > > [3.602128] Unable to handle kernel paging request at virtual address
> > > 00272d40
> > > > [3.610804] Mem abort info:
> > > > [3.613905]   ESR = 0x9604
> > > > [3.617332]   EC = 0x25: DABT (current EL), IL = 32 bits
> > > > [3.623211]   SET = 0, FnV = 0
> > > > [3.626628]   EA = 0, S1PTW = 0
> > > > [3.630128] Data abort info:
> > > > [3.633362]   ISV = 0, ISS = 0x0004
> > > > [3.637630]   CM = 0, WnR = 0
> > > > [3.640955] [00272d40] user address but active_mm is
> > > swapper
> > > > [3.647983] Internal error: Oops: 9604 [#1] PREEMPT SMP
> > > > [3.654137] Modules linked in:
> > > > [3.677285] Hardware name: Freescale i.MX8QM MEK (DT)
> > > > [3.677302] Workqueue: events deferred_probe_work_func
> > > > [3.684253] imx6q-pcie 5f00.pcie: PCI host bridge to bus :00
> > > > [3.688297] pstate: 6005 (nZCv daif -PAN -UAO)
> > > > [3.688310] pc : xen_swiotlb_free_coherent+0x180/0x1c0
> > > > [3.693993] pci_bus :00: root bus resource [bus 00-ff]
> > > > [3.701002] lr : xen_swiotlb_free_coherent+0x44/0x1c0
> > > > "
> > > > 
> > > > In xen_swiotlb_alloc_coherent, if !(dev_addr + size - 1 <= dma_mask)
> > > > or range_straddles_page_boundary(phys, size) are true, it will create
> > > > contiguous region. So when free, we need to free contiguous region use
> > > > upper check condition.
> > > 
> > > No, this will break PV guests on x86.
> > 
> > Could you share more details why alloc and free not matching for the check?
> 
> xen_create_contiguous_region() is needed only in case:
> 
> - the bus address is not within dma_mask, or
> - the memory region is not physically contiguous (can happen only for
>   PV guests)
> 
> In any case it should arrange for the memory to be suitable for the
> DMA operation, so to be contiguous and within dma_mask afterwards. So
> xen_destroy_contiguous_region() should only ever called for areas
> which match above criteria, as otherwise we can be sure
> xen_create_contiguous_region() was not used for making the area DMA-able
> in the beginning.
> 
> And this is very important in the PV case, as in those guests the page
> tables are containing the host-PFNs, not the guest-PFNS, and
> xen_create_contiguous_region() will fiddle with host- vs. guest-PFN
> arrangements, and xen_destroy_contiguous_region() is reverting this
> fiddling. Any call of xen_destroy_contiguous_region() for an area it
> was not intended to be called for might swap physical pages beneath
> random virtual addresses, which was the reason for this test to be
> added by me.
> 
> 
> Juergen
> 
> > 
> > Thanks,
> > Peng.
> > 
> > > 
> > > I think there is something wrong with your setup in combination with the ARM
> > > xen_create_contiguous_region() implementation.
> > > 
> > > Stefano?
> > > 
> > > 
> > > Juergen
> > > 
> > > > 
> > > > Signed-off-by: Peng Fan 
> > > > ---
> > > >drivers/xen/swiotlb-xen.c | 4 ++--
> > > >1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> > > > index b6d27762c6f8..ab96e468584f 100644
> > > > --- a/drivers/xen/swiotlb-xen.c
> > > > +++ b/drivers/xen/swiotlb-xen.c
> > > > @@ -346,8 +346,8 @@ xen_swiotlb_free_coherent(struct device *hwdev,
> > > size_t size, void *vaddr,
> > > > /* Convert the size to actually allocated. */
> > > > size = 1UL << (order + XEN_PAGE_SHIFT);
> > > > 
> > > > -   if (!WARN_ON((dev_addr + size - 1 > dma_mask) ||
> > > > -range_straddles_page_boundary(phys, size)) &&
> > > > +   if (((dev_addr + size - 1 > dma_mask) ||
> > > > +   range_straddles_page_boundary(phys, size)) &&
> > > > TestClearPageXenRemapped(virt_to_page(vaddr)))
> > > > xen_destroy_contiguous_region(phys, order);
> > > > 
> > > > 
> > 
> 
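
For reference, the allocation-side check that the free path has to mirror
looks roughly like this (a simplified sketch of xen_swiotlb_alloc_coherent
based on the condition described in the patch, not the exact upstream code):

	if ((dev_addr + size - 1 <= dma_mask) &&
	    !range_straddles_page_boundary(phys, size)) {
		/* Already DMA-able: no exchange with the hypervisor needed. */
		*dma_handle = dev_addr;
	} else {
		/* Swap in machine-contiguous pages below dma_mask ... */
		if (xen_create_contiguous_region(phys, order,
						 fls64(dma_mask), dma_handle))
			return NULL;
		/* ... and mark the page so the free path knows to undo it. */
		SetPageXenRemapped(virt_to_page(vaddr));
	}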


Re: [RFC v2 0/2] kvm: Use host timekeeping in guest.

2019-10-11 Thread Konrad Rzeszutek Wilk
.snip..
> I wonder how feasible it is to map the host's vdso into the guest and
> thus make the guest use the *same* (as opposed to "synchronized") clock
> as the host's userspace?  Another benefit is that it's essentially an
> ABI so is not changed as liberally as internal structures like
> timekeeper, etc.  There is probably certain complication in handling the
> syscall fallback in the vdso when used in the guest kernel, though.
> 
> You'll also need to ensure neither tsc scaling and nor offsetting is
> applied to the VM once this clock is enabled.

This is how Xen does it - you can register with the hypervisor to have it
timestamp your vDSO regions if you want. See xen_setup_vsyscall_time_info.
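
Roughly, that registration looks like this - a trimmed sketch from memory, see
arch/x86/xen/time.c for the real thing:

	struct vcpu_register_time_memory_area t;
	struct pvclock_vsyscall_time_info *ti;

	ti = (struct pvclock_vsyscall_time_info *)get_zeroed_page(GFP_KERNEL);
	if (!ti)
		return;

	t.addr.v = &ti->pvti;
	/* Ask Xen to keep this pvclock page updated; the vDSO then reads it. */
	if (HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t)) {
		free_page((unsigned long)ti);
		return;
	}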

> 
> Roman.

