Re: [RFC PATCH 1/7] vfio/spimdev: Add documents for WarpDrive framework
On Monday, 2018-08-06 at 08:27 PM, Pavel Machek wrote:
> Hi!
>
>> WarpDrive is a common user space accelerator framework. Its main
>> component in Kernel is called spimdev, Share Parent IOMMU Mediated
>> Device. It exposes
>
> spimdev is a really unfortunate name. It looks like it has something to
> do with SPI, but it does not.

Yes. Let me change it to Share (IOMMU) Domain MDev, SDMdev :)

>> +++ b/Documentation/warpdrive/warpdrive.rst
>> @@ -0,0 +1,153 @@
>> +Introduction of WarpDrive
>> +=========================
>> +
>> +*WarpDrive* is a general accelerator framework built on top of vfio.
>> +It can be taken as a lightweight virtual function, which you can use
>> +without an *SR-IOV*-like facility and which can be shared among
>> +multiple processes.
>> +
>> +It can be used as the quick channel for accelerators, network
>> +adaptors or other hardware in user space. It can make some
>> +implementations simpler. E.g. you can reuse most of the *netdev*
>> +driver and just share some ring buffer with the user space driver for
>> +*DPDK* or *ODP*. Or you can combine the RSA accelerator with the
>> +*netdev* in the user space as a Web reverse proxy, etc.
>
> What is DPDK? ODP?

DPDK: https://www.dpdk.org/about/
ODP: https://www.opendataplane.org/

Will add the references in the next RFC.

>> +How does it work
>> +================
>> +
>> +*WarpDrive* takes the hardware accelerator as a heterogeneous
>> +processor which can share some load for the CPU:
>> +
>> +.. image:: wd.svg
>> +        :alt: This is a .svg image, if your browser cannot show it,
>> +              try to download and view it locally
>> +
>> +So it provides the capability to the user application to:
>> +
>> +1. Send requests to the hardware
>> +2. Share memory with the application and other accelerators
>> +
>> +These requirements can be fulfilled by VFIO if the accelerator can
>> +serve each application with a separated Virtual Function. But an
>> +*SR-IOV*-like VF (we will call it *HVF* hereinafter) design is too
>> +heavy for an accelerator which services thousands of processes.
>
> VFIO? VF? HVF?
>
> Also "gup" might be worth spelling out.

But I think the reference [1] has explained this.

>> +References
>> +==========
>> +.. [1] According to the comment in mm/gup.c, *gup* is only safe
>> +       within a syscall, because it can only keep the physical memory
>> +       in place without making sure the VMA will always point to it.
>> +       Maybe we should raise the VM_PINNED patchset (see
>> +       https://lists.gt.net/linux/kernel/1931993) again to solve this
>> +       problem.
>
> I went through the docs, but I still don't know what it does.

Will refine the doc in the next RFC; hope it will help.

> Pavel

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH 3/7] vfio: add spimdev support
On Tuesday, 2018-08-07 at 01:05 AM, Alex Williamson wrote:
> On Mon, 6 Aug 2018 09:34:28 -0700, "Raj, Ashok" wrote:
>> On Mon, Aug 06, 2018 at 09:49:40AM -0600, Alex Williamson wrote:
>>> On Mon, 6 Aug 2018 09:40:04 +0800, Kenneth Lee wrote:
>>>> 1. It supports thousands of processes. Take the zip accelerator as
>>>> an example: any application that needs data
>>>> compression/decompression will need to interact with the
>>>> accelerator. To support that, you have to create tens of thousands
>>>> of mdevs for their usage. I don't think it is a good idea to have
>>>> so many devices in the system.
>>>
>>> Each mdev is a device, regardless of whether there are hardware
>>> resources committed to the device, so I don't understand this
>>> argument.
>>>
>>>> 2. The application does not want to own the mdev for long. It just
>>>> needs an access point for the hardware service. If it has to
>>>> interact with a management agent for allocation and release, this
>>>> makes the problem complex.
>>>
>>> I don't see how the length of the usage plays a role here either.
>>> Are you concerned that the time it takes to create and remove an
>>> mdev is significant compared to the usage time? Userspace is
>>> certainly welcome to create a pool of devices, but why should it be
>>> the kernel's responsibility to dynamically assign resources to an
>>> mdev? What's the usage model when resources are unavailable? It
>>> seems there's complexity in either case, but it's generally
>>> userspace's responsibility to impose a policy.
>>
>> Can vfio devs created representing an mdev be shared between several
>> processes? It doesn't need to be exclusive.
>>
>> The path to hardware is established by the processes binding to SVM
>> and the IOMMU ensuring that the PASID is plumbed properly. One can
>> think of the same hardware being shared between several processes;
>> the hardware knows the isolation is via the PASID.
>>
>> For these cases it isn't required to create a dev per process.
>
> The iommu group is the unit of ownership, a vfio group mirrors an
> iommu group, therefore a vfio group only allows a single open(2). A
> group also represents the minimum isolation set of devices, therefore
> devices within a group are not considered isolated and must share the
> same address space represented by the vfio container.
>
> Beyond that, it is possible to share devices among processes, but (I
> think) it generally implies a hierarchical rather than peer
> relationship between processes. Thanks,
>
> Alex

Actually, this is the key problem we are concerned about. Our logic was:
the PASID refers to the connection between the device and the process,
so the resource should be allocated only when the process "makes use of"
the device. This strategy also brings another advantage: the kernel
driver can make use of the resource as well if no user application has
it open.

We do have another branch that allocates resources to the mdev directly.
It looks not so nice (many mdevs, and a user agent is required for
resource management). If the conclusion here is to keep the mdev's
original semantics, we will send that branch for discussion in the next
RFC.

Cheers,
Kenneth
Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
On Monday, 2018-08-06 at 11:32 PM, Jerome Glisse wrote:
> On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
>> On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
>>> On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
>>>> On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
>>>>> On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
>>>>>> On Thu, Aug 02, 2018 at 02:33:12AM +0000, Tian, Kevin wrote:
>>>>>>> On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:

[...]

>>>>>> But the doorbell is just a notification. Except for DoS (making
>>>>>> the hardware busy) it cannot actually take or change anything
>>>>>> from the kernel space. And the DoS problem can always be taken as
>>>>>> the problem of a group of processes sharing the same kernel
>>>>>> entity. In the coming HIP09 hardware, the doorbell will come with
>>>>>> a random number so only the process that allocated the queue can
>>>>>> knock it correctly.
>>>>>
>>>>> When the doorbell is rung, does the hardware start fetching
>>>>> commands from the queue and executing them? If so, then a rogue
>>>>> process B might ring the doorbell of process A, which would start
>>>>> execution of random commands (i.e. whatever random memory values
>>>>> are left inside the command buffer memory; they could be old
>>>>> commands, I guess). If this is not how this doorbell works then,
>>>>> yes, it can only do a denial of service, I guess.
>>>>>
>>>>> The issue I have with doorbells is that I have seen 10 different
>>>>> implementations in 10 different pieces of hardware, and each is
>>>>> different as to what ringing it or the value written to it does.
>>>>> It is painful to track what is what for each piece of hardware.
>>>>
>>>> In our implementation, the doorbell is simply a notification, just
>>>> like an interrupt to the accelerator. The command is all about
>>>> what's in the queue.
>>>>
>>>> I agree that there is no simple and standard way to track the
>>>> shared IO space. But I think we have to trust the driver in some
>>>> way. If the driver is malicious, even a simple ioctl can become an
>>>> attack.
>>>
>>> Trusting a kernel space driver is fine; trusting a user space driver
>>> is not, in my view. AFAICT every driver developer so far has always
>>> made sure that someone could not abuse their device to do harmful
>>> things to other processes.
>>
>> Fully agree. That is why this driver shares only the doorbell space.
>> Only the doorbell is shared in the whole page, nothing else.
>>
>> Maybe you are concerned that the user driver will give malicious
>> commands to the hardware? But these commands cannot influence the
>> other process. If we can trust the hardware design, the process
>> cannot do any harm.
>
> My question was what happens if a process B rings the doorbell of
> process A. On some hardware the value written to the doorbell is used
> as an index into the command buffer. On other hardware it just wakes
> up the hardware to go look at a structure private to the process.
> There are other variations on those themes. If it is the former, i.e.
> the value is used to advance in the command buffer, then a rogue
> process can force another process to advance its command buffer, and
> what is in the command buffer can be some random old memory values,
> which can be more harmful than just denial of service.

Yes. We have considered that. There is no other information in the
doorbell. The indexes, such as head and tail pointers, are all in the
shared memory between the hardware and the user process. The other
process cannot touch it.

> My more general question is: do we want to grow VFIO to become a more
> generic device driver API? This patchset adds a command queue concept
> to it (I don't think it exists today, but I have not followed VFIO
> closely).
>
>> The thing is, VFIO is the only place to support DMA from user land.
>> If we don't put it here, we have to create another similar facility
>> to support the same.
>
> No it is not; network devices, GPUs, block devices, ... they all do
> support DMA. The point I am trying to make here is that even in

Sorry, wait a minute, are we talking about the same thing? I meant "DMA
from user land", not "DMA from the kernel driver". To do that we have to
manipulate the IOMMU (unit). I think it can only be done by the
default_domain or a vfio domain. Or the user space has to directly
access the IOMMU.

> GPUs do DMA in the sense that you pass the kernel a valid virtual
> address (the kernel driver does all the proper checks) and then you
> can use the GPU to copy from or to that range of virtual addresses.
> Exactly how you want to use this compression engine. It does not rely
> on SVM, but SVM going forward would still be the preferred option.

No, SVM is not the reason why we rely on Jean's SVM (SVA) series. We
rely on Jean's series because of multi-process (PASID or substream ID)
support. But of course, WarpDrive can still benefit from the SVM
feature.

> We are getting sidetracked here. PASID/ID do not require VFIO.

Yes, PASID itself does not require VFIO. But what if:

1. we support DMA from user space,
2. the hardware makes use of the standard IOMMU/SMMU for IO address
   translation,
3. the IOMMU facility is shared by both kernel and user drivers, and
4. we support PASID with the current IOMMU facility?

> your mechanisms the userspace must have a specific
Re: [PATCH 1/3] hexagon: remove the sync_single_for_cpu DMA operation
On Tue, Aug 07, 2018 at 09:01:36AM +0200, Christoph Hellwig wrote:
> On Tue, Jul 31, 2018 at 05:22:29PM +0200, Christoph Hellwig wrote:
>> On Wed, Jul 25, 2018 at 06:39:27AM +0200, Christoph Hellwig wrote:
>>> On Tue, Jul 24, 2018 at 10:29:48PM -0500, Richard Kuo wrote:
>>>> Patch series looks good. Definitely appreciate the cleanup.
>>>>
>>>> I can take it through my tree, or if not:
>>>>
>>>> Acked-by: Richard Kuo
>>>
>>> Please take it through your tree, thanks!
>>
>> I haven't seen it in linux-next yet, do you still plan to take it?
>>
>> Otherwise I'll merge it in the dma-mapping tree.
>
> Given that I haven't seen this in linux-next nor heard back from you,
> I assume you are on your well-deserved summer vacation.
>
> If I don't hear back by tomorrow night I'll merge it through the
> dma-mapping tree, so that I have it in linux-next before heading out
> for my summer vacation.

I am here, and I have the patch queued up, but it's waiting for approval
before I get to push it out through my tree.

That said, I am perfectly fine with this going through a different tree
if expedience is needed.

Thanks,
Richard Kuo

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH 1/2] arm64: dts: stratix10: Add Stratix10 SMMU support
On Wed, Jul 25, 2018 at 01:24:36PM -0500, thor.tha...@linux.intel.com wrote:
> From: Thor Thayer
>
> Add SMMU support to the Stratix10 Device Tree, which includes adding
> the SMMU node and adding IOMMU stream ids to the SMMU peripherals.
> Update bindings.
>
> Signed-off-by: Thor Thayer
> ---
> This patch is dependent on the patch series
> "iommu/arm-smmu: Add runtime pm/sleep support"
> (https://patchwork.ozlabs.org/cover/946160/)

I don't think so.

> ---
>  .../devicetree/bindings/iommu/arm,smmu.txt        | 25 ++
>  arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi | 30 ++
>  2 files changed, 55 insertions(+)

Reviewed-by: Rob Herring
[PATCH v3 10/10] [SCSI] mpt3sas: replace chain_dma_pool
Replace chain_dma_pool with direct calls to dma_alloc_coherent() and
dma_free_coherent().  Since the chain lookup can involve hundreds of
thousands of allocations, it is worthwhile to avoid the overhead of the
dma_pool API.

Signed-off-by: Tony Battersby
---
No changes since v1.

The original code called _base_release_memory_pools() before "goto out"
if dma_pool_alloc() failed, but this was unnecessary because
mpt3sas_base_attach() will call _base_release_memory_pools() after "goto
out_free_resources".  It may have been that way because the out-of-tree
vendor driver (from https://www.broadcom.com/support/download-search)
has a slightly-more-complicated error handler there that adjusts
max_request_credit, calls _base_release_memory_pools() and then does
"goto retry_allocation" under some circumstances, but that is missing
from the in-tree driver.

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 569392d..2cb567a 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -4224,6 +4224,134 @@ void mpt3sas_base_clear_st(struct MPT3SAS_ADAPTER *ioc,
 }
 
 /**
+ * _base_release_chain_lookup - release chain_lookup memory pools
+ * @ioc: per adapter object
+ *
+ * Free memory allocated from _base_allocate_chain_lookup.
+ */
+static void
+_base_release_chain_lookup(struct MPT3SAS_ADAPTER *ioc)
+{
+	unsigned int chains_avail = 0;
+	struct chain_tracker *ct;
+	int i, j;
+
+	if (!ioc->chain_lookup)
+		return;
+
+	/*
+	 * NOTE
+	 *
+	 * To make this code easier to understand and maintain, the for loops
+	 * and the management of the chains_avail value are designed to be
+	 * similar to the _base_allocate_chain_lookup() function.  That way,
+	 * the code for freeing the memory is similar to the code for
+	 * allocating the memory.
+	 */
+	for (i = 0; i < ioc->scsiio_depth; i++) {
+		if (!ioc->chain_lookup[i].chains_per_smid)
+			break;
+
+		for (j = ioc->chains_per_prp_buffer;
+		     j < ioc->chains_needed_per_io; j++) {
+			/*
+			 * If chains_avail is 0, then the chain represents a
+			 * real allocation, so free it.
+			 *
+			 * If chains_avail is nonzero, then the chain was
+			 * initialized at an offset from a previous allocation,
+			 * so don't free it.
+			 */
+			if (chains_avail == 0) {
+				ct = &ioc->chain_lookup[i].chains_per_smid[j];
+				if (ct->chain_buffer)
+					dma_free_coherent(
+						&ioc->pdev->dev,
+						ioc->chain_allocation_sz,
+						ct->chain_buffer,
+						ct->chain_buffer_dma);
+				chains_avail = ioc->chains_per_allocation;
+			}
+			chains_avail--;
+		}
+		kfree(ioc->chain_lookup[i].chains_per_smid);
+	}
+
+	kfree(ioc->chain_lookup);
+	ioc->chain_lookup = NULL;
+}
+
+/**
+ * _base_allocate_chain_lookup - allocate chain_lookup memory pools
+ * @ioc: per adapter object
+ * @total_sz: external value that tracks total amount of memory allocated
+ *
+ * Return: 0 success, anything else error
+ */
+static int
+_base_allocate_chain_lookup(struct MPT3SAS_ADAPTER *ioc, u32 *total_sz)
+{
+	unsigned int aligned_chain_segment_sz;
+	const unsigned int align = 16;
+	unsigned int chains_avail = 0;
+	struct chain_tracker *ct;
+	dma_addr_t dma_addr = 0;
+	void *vaddr = NULL;
+	int i, j;
+
+	/* Round up the allocation size for alignment. */
+	aligned_chain_segment_sz = ioc->chain_segment_sz;
+	if (aligned_chain_segment_sz % align != 0)
+		aligned_chain_segment_sz =
+			ALIGN(aligned_chain_segment_sz, align);
+
+	/* Allocate a page of chain buffers at a time. */
+	ioc->chain_allocation_sz =
+		max_t(unsigned int, aligned_chain_segment_sz, PAGE_SIZE);
+
+	/* Calculate how many chain buffers we can get from one allocation. */
+	ioc->chains_per_allocation =
+		ioc->chain_allocation_sz / aligned_chain_segment_sz;
+
+	for (i = 0; i < ioc->scsiio_depth; i++) {
+		for (j = ioc->chains_per_prp_buffer;
+		     j < ioc->chains_needed_per_io; j++) {
+			/*
+			 * Check if there are any chain buffers left in the
+			 * previously-allocated block.
+			 */
+			if
[PATCH v3 09/10] dmapool: debug: prevent endless loop in case of corruption
Prevent a possible endless loop with DMAPOOL_DEBUG enabled if a buggy
driver corrupts DMA pool memory.

Signed-off-by: Tony Battersby
---
Changes since v2: This is closer to the improved version from "dmapool:
reduce footprint in struct page" in v2, thanks to a previous patch
adding blks_per_alloc.

--- linux/mm/dmapool.c.orig	2018-08-06 17:52:53.000000000 -0400
+++ linux/mm/dmapool.c	2018-08-06 17:53:31.000000000 -0400
@@ -454,17 +454,39 @@ void dma_pool_free(struct dma_pool *pool
 	{
 		void *page_vaddr = vaddr - offset;
 		unsigned int chain = page->dma_free_off;
+		unsigned int free_blks = 0;
+
 		while (chain < pool->allocation) {
-			if (chain != offset) {
-				chain = *(int *)(page_vaddr + chain);
-				continue;
+			if (unlikely(chain == offset)) {
+				spin_unlock_irqrestore(&pool->lock, flags);
+				dev_err(pool->dev,
+					"dma_pool_free %s, dma %pad already free\n",
+					pool->name, &dma);
+				return;
+			}
+
+			/*
+			 * A buggy driver could corrupt the freelist by
+			 * use-after-free, buffer overflow, etc.  Besides
+			 * checking for corruption, this also prevents an
+			 * endless loop in case corruption causes a circular
+			 * loop in the freelist.
+			 */
+			if (unlikely(++free_blks + page->dma_in_use >
+				     pool->blks_per_alloc)) {
+ freelist_corrupt:
+				spin_unlock_irqrestore(&pool->lock, flags);
+				dev_err(pool->dev,
+					"dma_pool_free %s, freelist corrupted\n",
+					pool->name);
+				return;
 			}
-			spin_unlock_irqrestore(&pool->lock, flags);
-			dev_err(pool->dev,
-				"dma_pool_free %s, dma %pad already free\n",
-				pool->name, &dma);
-			return;
+
+			chain = *(int *)(page_vaddr + chain);
 		}
+		if (unlikely(free_blks + page->dma_in_use !=
+			     pool->blks_per_alloc))
+			goto freelist_corrupt;
 	}
 	memset(vaddr, POOL_POISON_FREED, pool->size);
 #endif
[PATCH v3 08/10] dmapool: improve accuracy of debug statistics
The "total number of blocks in pool" debug statistic currently does not
take the boundary value into account, so it diverges from the "total
number of blocks in use" statistic when a boundary is in effect.  Add a
calculation for the number of blocks per allocation that takes the
boundary into account, and use it to replace the inaccurate calculation.

Signed-off-by: Tony Battersby
---
This was split off from "dmapool: reduce footprint in struct page" in
v2.

This depends on patch #1 "dmapool: fix boundary comparison" for the
calculated blks_per_alloc value to be correct.

The added blks_per_alloc value will also be used in the next patch.

--- linux/mm/dmapool.c.orig	2018-08-06 17:48:54.000000000 -0400
+++ linux/mm/dmapool.c	2018-08-06 17:52:53.000000000 -0400
@@ -61,6 +61,7 @@ struct dma_pool {	/* the pool */
 	struct device *dev;
 	unsigned int allocation;
 	unsigned int boundary;
+	unsigned int blks_per_alloc;
 	char name[32];
 	struct list_head pools;
 };
@@ -105,8 +106,7 @@ show_pools(struct device *dev, struct de
 		/* per-pool info, no real statistics yet */
 		temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",
 				 pool->name, blocks,
-				 (size_t) pages *
-				 (pool->allocation / pool->size),
+				 (size_t) pages * pool->blks_per_alloc,
 				 pool->size, pages);
 		size -= temp;
 		next += temp;
@@ -182,6 +182,9 @@ struct dma_pool *dma_pool_create(const c
 	retval->size = size;
 	retval->boundary = boundary;
 	retval->allocation = allocation;
+	retval->blks_per_alloc =
+		(allocation / boundary) * (boundary / size) +
+		(allocation % boundary) / size;
 	INIT_LIST_HEAD(&retval->pools);
[PATCH v3 07/10] dmapool: cleanup integer types
To represent the size of a single allocation, dmapool currently uses
'unsigned int' in some places and 'size_t' in other places.  Standardize
on 'unsigned int' to reduce overhead, but use 'size_t' when counting all
the blocks in the entire pool.

Signed-off-by: Tony Battersby
---
This was split off from "dmapool: reduce footprint in struct page" in
v2.

This puts an upper bound on 'size' of INT_MAX to avoid overflowing the
following comparison in pool_initialize_free_block_list():

	unsigned int offset = 0;
	unsigned int next = offset + pool->size;
	if (unlikely((next + pool->size) > ...

The actual maximum allocation size is probably lower anyway, probably
KMALLOC_MAX_SIZE, but that gets into the implementation details of other
subsystems which don't export a predefined maximum, so I didn't want to
hardcode it here.  The purpose of the added bounds check is to avoid
overflowing integers, not to check the actual
(platform/device/config-specific?) maximum allocation size.

'boundary' is passed in as a size_t but gets stored as an unsigned int.
'boundary' values >= 'allocation' do not have any effect, so clipping
'boundary' to 'allocation' keeps it within the range of unsigned int
without affecting anything else.  A few lines above (not in the diff)
you can see that if 'boundary' is passed in as 0 then it is set to
'allocation', so it is nothing new.

For reference, here is the relevant code after being patched:

	if (!boundary)
		boundary = allocation;
	else if ((boundary < size) || (boundary & (boundary - 1)))
		return NULL;

	boundary = min(boundary, allocation);

--- linux/mm/dmapool.c.orig	2018-08-06 17:48:19.000000000 -0400
+++ linux/mm/dmapool.c	2018-08-06 17:48:54.000000000 -0400
@@ -57,10 +57,10 @@ struct dma_pool {	/* the pool */
 #define POOL_MAX_IDX	2
 	struct list_head page_list[POOL_MAX_IDX];
 	spinlock_t lock;
-	size_t size;
+	unsigned int size;
 	struct device *dev;
-	size_t allocation;
-	size_t boundary;
+	unsigned int allocation;
+	unsigned int boundary;
 	char name[32];
 	struct list_head pools;
 };
@@ -86,7 +86,7 @@ show_pools(struct device *dev, struct de
 	mutex_lock(&pools_lock);
 	list_for_each_entry(pool, &dev->dma_pools, pools) {
 		unsigned pages = 0;
-		unsigned blocks = 0;
+		size_t blocks = 0;
 		int list_idx;
 
 		spin_lock_irq(&pool->lock);
@@ -103,9 +103,10 @@ show_pools(struct device *dev, struct de
 		spin_unlock_irq(&pool->lock);
 
 		/* per-pool info, no real statistics yet */
-		temp = scnprintf(next, size, "%-16s %4u %4zu %4zu %2u\n",
+		temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",
 				 pool->name, blocks,
-				 pages * (pool->allocation / pool->size),
+				 (size_t) pages *
+				 (pool->allocation / pool->size),
 				 pool->size, pages);
 		size -= temp;
 		next += temp;
@@ -150,7 +151,7 @@ struct dma_pool *dma_pool_create(const c
 	else if (align & (align - 1))
 		return NULL;
 
-	if (size == 0)
+	if (size == 0 || size > INT_MAX)
 		return NULL;
 	else if (size < 4)
 		size = 4;
@@ -165,6 +166,8 @@ struct dma_pool *dma_pool_create(const c
 	else if ((boundary < size) || (boundary & (boundary - 1)))
 		return NULL;
 
+	boundary = min(boundary, allocation);
+
 	retval = kmalloc_node(sizeof(*retval), GFP_KERNEL, dev_to_node(dev));
 	if (!retval)
 		return retval;
@@ -344,7 +347,7 @@ void *dma_pool_alloc(struct dma_pool *po
 {
 	unsigned long flags;
 	struct page *page;
-	size_t offset;
+	unsigned int offset;
 	void *retval;
 	void *vaddr;
[PATCH v3 03/10] dmapool: cleanup dma_pool_destroy
Remove a small amount of code duplication between dma_pool_destroy() and
pool_free_page() in preparation for adding more code without having to
duplicate it.  No functional changes.

Signed-off-by: Tony Battersby
---
No changes since v2.

--- linux/mm/dmapool.c.orig	2018-08-02 09:59:15.000000000 -0400
+++ linux/mm/dmapool.c	2018-08-02 10:01:26.000000000 -0400
@@ -249,13 +249,22 @@ static inline bool is_page_busy(struct d
 
 static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
 {
+	void *vaddr = page->vaddr;
 	dma_addr_t dma = page->dma;
 
+	list_del(&page->page_list);
+
+	if (is_page_busy(page)) {
+		dev_err(pool->dev,
+			"dma_pool_destroy %s, %p busy\n",
+			pool->name, vaddr);
+		/* leak the still-in-use consistent memory */
+	} else {
 #ifdef	DMAPOOL_DEBUG
-	memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
+		memset(vaddr, POOL_POISON_FREED, pool->allocation);
 #endif
-	dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
-	list_del(&page->page_list);
+		dma_free_coherent(pool->dev, pool->allocation, vaddr, dma);
+	}
 	kfree(page);
 }
 
@@ -269,6 +278,7 @@ static void pool_free_page(struct dma_po
  */
 void dma_pool_destroy(struct dma_pool *pool)
 {
+	struct dma_page *page;
 	bool empty = false;
 
 	if (unlikely(!pool))
@@ -284,19 +294,10 @@ void dma_pool_destroy(struct dma_pool *p
 		device_remove_file(pool->dev, &dev_attr_pools);
 	mutex_unlock(&pools_reg_lock);
 
-	while (!list_empty(&pool->page_list)) {
-		struct dma_page *page;
-		page = list_entry(pool->page_list.next,
-				  struct dma_page, page_list);
-		if (is_page_busy(page)) {
-			dev_err(pool->dev,
-				"dma_pool_destroy %s, %p busy\n",
-				pool->name, page->vaddr);
-			/* leak the still-in-use consistent memory */
-			list_del(&page->page_list);
-			kfree(page);
-		} else
-			pool_free_page(pool, page);
+	while ((page = list_first_entry_or_null(&pool->page_list,
+						struct dma_page,
+						page_list))) {
+		pool_free_page(pool, page);
 	}
 
 	kfree(pool);
[PATCH v3 06/10] dmapool: improve scalability of dma_pool_free
dma_pool_free() scales poorly when the pool contains many pages because
pool_find_page() does a linear scan of all allocated pages.  Improve its
scalability by replacing the linear scan with virt_to_page() and storing
dmapool private data directly in 'struct page', thereby eliminating
'struct dma_page'.  In big O notation, this improves the algorithm from
O(n^2) to O(n) while also reducing memory usage.

Thanks to Matthew Wilcox for the suggestion to use struct page.

Signed-off-by: Tony Battersby
---
Changes since v2: Just a re-diff after the changes in prior patches.

--- linux/include/linux/mm_types.h.orig	2018-08-01 17:59:46.000000000 -0400
+++ linux/include/linux/mm_types.h	2018-08-01 17:59:56.000000000 -0400
@@ -153,6 +153,12 @@ struct page {
 			unsigned long _zd_pad_1;	/* uses mapping */
 		};
 
+		struct {	/* dma_pool pages */
+			struct list_head dma_list;
+			dma_addr_t dma;
+			unsigned int dma_free_off;
+		};
+
 		/** @rcu_head: You can use this to free a page by RCU. */
 		struct rcu_head rcu_head;
 	};
@@ -174,6 +180,8 @@ struct page {
 		unsigned int active;		/* SLAB */
 		int units;			/* SLOB */
+
+		unsigned int dma_in_use;	/* dma_pool pages */
 	};
 
 	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
--- linux/mm/dmapool.c.orig	2018-08-03 17:47:03.000000000 -0400
+++ linux/mm/dmapool.c	2018-08-03 17:47:22.000000000 -0400
@@ -25,6 +25,10 @@
 * list tracks pages that have one or more free blocks, and the 'full' list
 * tracks pages that have no free blocks.  Pages are moved from one list to
 * the other as their blocks are allocated and freed.
+ *
+ * When allocating DMA pages, we use some available space in 'struct page' to
+ * store data private to dmapool; search 'dma_pool' in the definition of
+ * 'struct page' for details.
 */
 
 #include
@@ -61,14 +65,6 @@ struct dma_pool {	/* the pool */
 	struct list_head pools;
 };
 
-struct dma_page {	/* cacheable header for 'allocation' bytes */
-	struct list_head dma_list;
-	void *vaddr;
-	dma_addr_t dma;
-	unsigned int dma_in_use;
-	unsigned int dma_free_off;
-};
-
 static DEFINE_MUTEX(pools_lock);
 static DEFINE_MUTEX(pools_reg_lock);
 
@@ -95,7 +91,7 @@ show_pools(struct device *dev, struct de
 		spin_lock_irq(&pool->lock);
 		for (list_idx = 0; list_idx < POOL_MAX_IDX; list_idx++) {
-			struct dma_page *page;
+			struct page *page;
 
 			list_for_each_entry(page,
 					    &pool->page_list[list_idx],
@@ -218,7 +214,7 @@ struct dma_pool *dma_pool_create(const c
 }
 EXPORT_SYMBOL(dma_pool_create);
 
-static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
+static void pool_initialize_free_block_list(struct dma_pool *pool, void *vaddr)
 {
 	unsigned int offset = 0;
 	unsigned int next_boundary = pool->boundary;
@@ -229,47 +225,57 @@ static void pool_initialise
 			next = next_boundary;
 			next_boundary += pool->boundary;
 		}
-		*(int *)(page->vaddr + offset) = next;
+		*(int *)(vaddr + offset) = next;
 		offset = next;
 	} while (offset < pool->allocation);
 }
 
-static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
+static struct page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
 {
-	struct dma_page *page;
+	struct page *page;
+	dma_addr_t dma;
+	void *vaddr;
 
-	page = kmalloc(sizeof(*page), mem_flags);
-	if (!page)
+	vaddr = dma_alloc_coherent(pool->dev, pool->allocation, &dma,
+				   mem_flags);
+	if (!vaddr)
 		return NULL;
 
-	page->vaddr = dma_alloc_coherent(pool->dev, pool->allocation,
-					 &page->dma, mem_flags);
-	if (page->vaddr) {
+
 #ifdef	DMAPOOL_DEBUG
-		memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
+	memset(vaddr, POOL_POISON_FREED, pool->allocation);
 #endif
-		pool_initialise_page(pool, page);
-		page->dma_in_use = 0;
-		page->dma_free_off = 0;
-	} else {
-		kfree(page);
-		page = NULL;
-	}
+	pool_initialize_free_block_list(pool, vaddr);
+
+	page = virt_to_page(vaddr);
+	page->dma = dma;
+	page->dma_free_off = 0;
+	page->dma_in_use = 0;
+
 	return page;
 }
 
-static inline bool is_page_busy(struct dma_page *page)
+static inline bool is_page_busy(struct page *page)
 {
 	return page->dma_in_use != 0;
 }
 
-static void
[PATCH v3 05/10] dmapool: rename fields in dma_page
Rename fields in 'struct dma_page' in preparation for moving them into
'struct page'.  No functional changes.

in_use -> dma_in_use
offset -> dma_free_off

Signed-off-by: Tony Battersby
---
Changes since v2: Use dma_free_off instead of dma_free_o.

--- linux/mm/dmapool.c.orig	2018-08-03 17:46:13.000000000 -0400
+++ linux/mm/dmapool.c	2018-08-03 17:46:24.000000000 -0400
@@ -65,8 +65,8 @@ struct dma_page {	/* cacheable header f
 	struct list_head dma_list;
 	void *vaddr;
 	dma_addr_t dma;
-	unsigned int in_use;
-	unsigned int offset;
+	unsigned int dma_in_use;
+	unsigned int dma_free_off;
 };
 
 static DEFINE_MUTEX(pools_lock);
@@ -101,7 +101,7 @@ show_pools(struct device *dev, struct de
 					    &pool->page_list[list_idx],
 					    dma_list) {
 				pages++;
-				blocks += page->in_use;
+				blocks += page->dma_in_use;
 			}
 		}
 		spin_unlock_irq(&pool->lock);
@@ -248,8 +248,8 @@ static struct dma_page *pool_alloc_page(
 		memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
 #endif
 		pool_initialise_page(pool, page);
-		page->in_use = 0;
-		page->offset = 0;
+		page->dma_in_use = 0;
+		page->dma_free_off = 0;
 	} else {
 		kfree(page);
 		page = NULL;
@@ -259,7 +259,7 @@ static struct dma_page *pool_alloc_page(
 
 static inline bool is_page_busy(struct dma_page *page)
 {
-	return page->in_use != 0;
+	return page->dma_in_use != 0;
 }
 
 static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
@@ -362,10 +362,10 @@ void *dma_pool_alloc(struct dma_pool *po
 		list_add(&page->dma_list, &pool->page_list[POOL_AVAIL_IDX]);
  ready:
-	page->in_use++;
-	offset = page->offset;
-	page->offset = *(int *)(page->vaddr + offset);
-	if (page->offset >= pool->allocation)
+	page->dma_in_use++;
+	offset = page->dma_free_off;
+	page->dma_free_off = *(int *)(page->vaddr + offset);
+	if (page->dma_free_off >= pool->allocation)
 		/* Move page from the "available" list to the "full" list. */
 		list_move_tail(&page->dma_list,
 			       &pool->page_list[POOL_FULL_IDX]);
@@ -375,8 +375,8 @@ void *dma_pool_alloc(struct dma_pool *po
 	{
 		int i;
 		u8 *data = retval;
-		/* page->offset is stored in first 4 bytes */
-		for (i = sizeof(page->offset); i < pool->size; i++) {
+		/* page->dma_free_off is stored in first 4 bytes */
+		for (i = sizeof(page->dma_free_off); i < pool->size; i++) {
 			if (data[i] == POOL_POISON_FREED)
 				continue;
 			dev_err(pool->dev,
@@ -458,7 +458,7 @@ void dma_pool_free(struct dma_pool *pool
 		return;
 	}
 	{
-		unsigned int chain = page->offset;
+		unsigned int chain = page->dma_free_off;
 		while (chain < pool->allocation) {
 			if (chain != offset) {
 				chain = *(int *)(page->vaddr + chain);
@@ -474,12 +474,12 @@ void dma_pool_free(struct dma_pool *pool
 	memset(vaddr, POOL_POISON_FREED, pool->size);
 #endif
 
-	page->in_use--;
-	if (page->offset >= pool->allocation)
+	page->dma_in_use--;
+	if (page->dma_free_off >= pool->allocation)
 		/* Move page from the "full" list to the "available" list. */
 		list_move(&page->dma_list, &pool->page_list[POOL_AVAIL_IDX]);
-	*(int *)vaddr = page->offset;
-	page->offset = offset;
+	*(int *)vaddr = page->dma_free_off;
+	page->dma_free_off = offset;
 
 	/*
 	 * Resist a temptation to do
 	 *    if (!is_page_busy(page)) pool_free_page(pool, page);
[PATCH v3 04/10] dmapool: improve scalability of dma_pool_alloc
dma_pool_alloc() scales poorly when allocating a large number of pages because it does a linear scan of all previously-allocated pages before allocating a new one. Improve its scalability by maintaining a separate list of pages that have free blocks ready to (re)allocate. In big O notation, this improves the algorithm from O(n^2) to O(n). Signed-off-by: Tony Battersby --- Changes since v2: *) Use list_move()/list_move_tail() instead of list_del+list_add(). *) Renamed POOL_N_LISTS to POOL_MAX_IDX. *) Use defined names instead of 0/1 indexes for INIT_LIST_HEAD(). --- linux/mm/dmapool.c.orig 2018-08-03 16:16:49.0 -0400 +++ linux/mm/dmapool.c 2018-08-03 16:45:33.0 -0400 @@ -15,11 +15,16 @@ * Many older drivers still have their own code to do this. * * The current design of this allocator is fairly simple. The pool is - * represented by the 'struct dma_pool' which keeps a doubly-linked list of - * allocated pages. Each page in the page_list is split into blocks of at - * least 'size' bytes. Free blocks are tracked in an unsorted singly-linked - * list of free blocks within the page. Used blocks aren't tracked, but we - * keep a count of how many are currently allocated from each page. + * represented by the 'struct dma_pool'. Each allocated page is split into + * blocks of at least 'size' bytes. Free blocks are tracked in an unsorted + * singly-linked list of free blocks within the page. Used blocks aren't + * tracked, but we keep a count of how many are currently allocated from each + * page. + * + * The pool keeps two doubly-linked list of allocated pages. The 'available' + * list tracks pages that have one or more free blocks, and the 'full' list + * tracks pages that have no free blocks. Pages are moved from one list to + * the other as their blocks are allocated and freed. 
*/ #include @@ -43,7 +48,10 @@ #endif struct dma_pool { /* the pool */ - struct list_head page_list; +#define POOL_FULL_IDX 0 +#define POOL_AVAIL_IDX 1 +#define POOL_MAX_IDX 2 + struct list_head page_list[POOL_MAX_IDX]; spinlock_t lock; size_t size; struct device *dev; @@ -54,7 +62,7 @@ struct dma_pool { /* the pool */ }; struct dma_page { /* cacheable header for 'allocation' bytes */ - struct list_head page_list; + struct list_head dma_list; void *vaddr; dma_addr_t dma; unsigned int in_use; @@ -70,7 +78,6 @@ show_pools(struct device *dev, struct de unsigned temp; unsigned size; char *next; - struct dma_page *page; struct dma_pool *pool; next = buf; @@ -84,11 +91,18 @@ show_pools(struct device *dev, struct de list_for_each_entry(pool, &dev->dma_pools, pools) { unsigned pages = 0; unsigned blocks = 0; + int list_idx; spin_lock_irq(&pool->lock); - list_for_each_entry(page, &pool->page_list, page_list) { - pages++; - blocks += page->in_use; + for (list_idx = 0; list_idx < POOL_MAX_IDX; list_idx++) { + struct dma_page *page; + + list_for_each_entry(page, + &pool->page_list[list_idx], + dma_list) { + pages++; + blocks += page->in_use; + } } spin_unlock_irq(&pool->lock); @@ -163,7 +177,8 @@ struct dma_pool *dma_pool_create(const c retval->dev = dev; - INIT_LIST_HEAD(&retval->page_list); + INIT_LIST_HEAD(&retval->page_list[POOL_FULL_IDX]); + INIT_LIST_HEAD(&retval->page_list[POOL_AVAIL_IDX]); spin_lock_init(&retval->lock); retval->size = size; retval->boundary = boundary; @@ -252,7 +267,7 @@ static void pool_free_page(struct dma_po void *vaddr = page->vaddr; dma_addr_t dma = page->dma; - list_del(&page->page_list); + list_del(&page->dma_list); if (is_page_busy(page)) { dev_err(pool->dev, @@ -278,8 +293,8 @@ static void pool_free_page(struct dma_po */ void dma_pool_destroy(struct dma_pool *pool) { - struct dma_page *page; bool empty = false; + int list_idx; if (unlikely(!pool)) return; @@ -294,10 +309,15 @@ void dma_pool_destroy(struct dma_pool *p device_remove_file(pool->dev, &dev_attr_pools); mutex_unlock(&pools_reg_lock); - while ((page =
list_first_entry_or_null(&pool->page_list, - struct dma_page, - page_list))) { - pool_free_page(pool, page); + for (list_idx = 0; list_idx < POOL_MAX_IDX; list_idx++) { + struct dma_page *page; + + while ((page = list_first_entry_or_null( +
[PATCH v3 02/10] dmapool: remove checks for dev == NULL
dmapool originally tried to support pools without a device because dma_alloc_coherent() supports allocations without a device. But nobody ended up using dma pools without a device, so the current checks in dmapool.c for pool->dev == NULL are both insufficient and causing bloat. Remove them. Signed-off-by: Tony Battersby --- Changes since v2: *) This was "dmapool: cleanup error messages" in v2. *) Remove one more check for dev == NULL in dma_pool_destroy() that is unrelated to error messages. --- linux/mm/dmapool.c.orig 2018-08-03 16:12:23.0 -0400 +++ linux/mm/dmapool.c 2018-08-03 16:13:44.0 -0400 @@ -277,7 +277,7 @@ void dma_pool_destroy(struct dma_pool *p mutex_lock(&pools_reg_lock); mutex_lock(&pools_lock); list_del(&pool->pools); - if (pool->dev && list_empty(&pool->dev->dma_pools)) + if (list_empty(&pool->dev->dma_pools)) empty = true; mutex_unlock(&pools_lock); if (empty) @@ -289,13 +289,9 @@ void dma_pool_destroy(struct dma_pool *p page = list_entry(pool->page_list.next, struct dma_page, page_list); if (is_page_busy(page)) { - if (pool->dev) - dev_err(pool->dev, - "dma_pool_destroy %s, %p busy\n", - pool->name, page->vaddr); - else - pr_err("dma_pool_destroy %s, %p busy\n", - pool->name, page->vaddr); + dev_err(pool->dev, + "dma_pool_destroy %s, %p busy\n", + pool->name, page->vaddr); /* leak the still-in-use consistent memory */ list_del(&page->page_list); kfree(page); @@ -357,13 +353,9 @@ void *dma_pool_alloc(struct dma_pool *po for (i = sizeof(page->offset); i < pool->size; i++) { if (data[i] == POOL_POISON_FREED) continue; - if (pool->dev) - dev_err(pool->dev, - "dma_pool_alloc %s, %p (corrupted)\n", - pool->name, retval); - else - pr_err("dma_pool_alloc %s, %p (corrupted)\n", - pool->name, retval); + dev_err(pool->dev, + "dma_pool_alloc %s, %p (corrupted)\n", + pool->name, retval); /* * Dump the first 4 bytes even if they are not @@ -418,13 +410,9 @@ void dma_pool_free(struct dma_pool *pool page = pool_find_page(pool, dma); if (!page) { spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) -
dev_err(pool->dev, - "dma_pool_free %s, %p/%lx (bad dma)\n", - pool->name, vaddr, (unsigned long)dma); - else - pr_err("dma_pool_free %s, %p/%lx (bad dma)\n", - pool->name, vaddr, (unsigned long)dma); + dev_err(pool->dev, + "dma_pool_free %s, %p/%lx (bad dma)\n", + pool->name, vaddr, (unsigned long)dma); return; } @@ -432,13 +420,9 @@ void dma_pool_free(struct dma_pool *pool #ifdef DMAPOOL_DEBUG if ((dma - page->dma) != offset) { spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - dev_err(pool->dev, - "dma_pool_free %s, %p (bad vaddr)/%pad\n", - pool->name, vaddr, &dma); - else - pr_err("dma_pool_free %s, %p (bad vaddr)/%pad\n", - pool->name, vaddr, &dma); + dev_err(pool->dev, + "dma_pool_free %s, %p (bad vaddr)/%pad\n", + pool->name, vaddr, &dma); return; } { @@ -449,12 +433,9 @@ void dma_pool_free(struct dma_pool *pool continue; } spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - dev_err(pool->dev, "dma_pool_free %s, dma %pad already free\n", - pool->name, &dma); - else - pr_err("dma_pool_free %s, dma %pad already free\n", - pool->name, &dma); + dev_err(pool->dev, + "dma_pool_free %s, dma %pad already free\n", + pool->name, &dma);
[PATCH v3 01/10] dmapool: fix boundary comparison
Fix the boundary comparison when constructing the list of free blocks for the case that 'size' is a power of two. Since 'boundary' is also a power of two, that would make 'boundary' a multiple of 'size', in which case a single block would never cross the boundary. This bug would cause some of the allocated memory to be wasted (but not leaked). Example: size = 512 boundary = 2048 allocation = 4096 Address range 0 - 511 512 - 1023 1024 - 1535 1536 - 2047 * 2048 - 2559 2560 - 3071 3072 - 3583 3584 - 4095 * Prior to this fix, the address ranges marked with "*" would not have been used even though they didn't cross the given boundary. Fixes: e34f44b3517f ("pool: Improve memory usage for devices which can't cross boundaries") Signed-off-by: Tony Battersby --- No changes since v2. Even though I described this as a "fix", it does not seem important enough to Cc: stable from a strict reading of the stable kernel rules. IOW, it is not "bothering" anyone. --- linux/mm/dmapool.c.orig 2018-08-01 17:57:04.0 -0400 +++ linux/mm/dmapool.c 2018-08-01 17:57:16.0 -0400 @@ -210,7 +210,7 @@ static void pool_initialise_page(struct do { unsigned int next = offset + pool->size; - if (unlikely((next + pool->size) >= next_boundary)) { + if (unlikely((next + pool->size) > next_boundary)) { next = next_boundary; next_boundary += pool->boundary; } ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v3 00/10] mpt3sas and dmapool scalability
Major changes since v2: *) Addressed review comments. *) Changed the description of patch #2. *) Dropped "dmapool: reduce footprint in struct page", but split off parts of it for merging (patches #7 and #8) and used it to improve patch #9. --- drivers/scsi/mpt3sas is running into a scalability problem with the kernel's DMA pool implementation. With a LSI/Broadcom SAS 9300-8i 12Gb/s HBA and max_sgl_entries=256, during modprobe, mpt3sas does the equivalent of: chain_dma_pool = dma_pool_create(size = 128); for (i = 0; i < 373959; i++) { dma_addr[i] = dma_pool_alloc(chain_dma_pool); } And at rmmod, system shutdown, or system reboot, mpt3sas does the equivalent of: for (i = 0; i < 373959; i++) { dma_pool_free(chain_dma_pool, dma_addr[i]); } dma_pool_destroy(chain_dma_pool); With this usage, both dma_pool_alloc() and dma_pool_free() exhibit O(n^2) complexity, although dma_pool_free() is much worse due to implementation details. On my system, the dma_pool_free() loop above takes about 9 seconds to run. Note that the problem was even worse before commit 74522a92bbf0 ("scsi: mpt3sas: Optimize I/O memory consumption in driver."), where the dma_pool_free() loop could take ~30 seconds. mpt3sas also has some other DMA pools, but chain_dma_pool is the only one with so many allocations: cat /sys/devices/pci:80/:80:07.0/:85:00.0/pools (manually cleaned up column alignment) poolinfo - 0.1 reply_post_free_array pool 1 21 192 1 reply_free pool 1 1 41728 1 reply pool 1 1 1335296 1 sense pool 1 1 970272 1 chain pool 373959 386048 128 12064 reply_post_free pool12 12 166528 12 The first 9 patches in this series improve the scalability of the DMA pool implementation, which significantly reduces the running time of the DMA alloc/free loops. The last patch modifies mpt3sas to replace chain_dma_pool with direct calls to dma_alloc_coherent() and dma_free_coherent(), which reduces its overhead even further. 
The mpt3sas patch is independent of the dmapool patches; it can be used with or without them. If either the dmapool patches or the mpt3sas patch is applied, then "modprobe mpt3sas", "rmmod mpt3sas", and system shutdown/reboot with mpt3sas loaded are significantly faster. Here are some benchmarks (of DMA alloc/free only, not the entire modprobe/rmmod): dma_pool_create() + dma_pool_alloc() loop, size = 128, count = 373959 original:350 ms ( 1x) dmapool patches: 17 ms (21x) mpt3sas patch: 7 ms (51x) dma_pool_free() loop + dma_pool_destroy(), size = 128, count = 373959 original:8901 ms ( 1x) dmapool patches: 15 ms ( 618x) mpt3sas patch: 2 ms (4245x) Considering that LSI/Broadcom offer an out-of-tree vendor driver that works across multiple kernel versions that won't get the dmapool patches, it may be worth it for them to patch mpt3sas to avoid the problem on older kernels. The downside is that the code is a bit more complicated. So I leave it to their judgement whether they think it is worth it to apply the mpt3sas patch. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/2] iommu/rockchip: Handle errors returned from PM framework
On 07/08/18 14:15, Heiko Stuebner wrote: > Am Dienstag, 7. August 2018, 14:31:49 CEST schrieb Marc Zyngier: >> On 07/08/18 13:09, Heiko Stuebner wrote: >>> Hi Marc, >>> >>> Am Dienstag, 7. August 2018, 10:54:05 CEST schrieb Marc Zyngier: pm_runtime_get_if_in_use can fail: either PM has been disabled altogether (-EINVAL), or the device hasn't been enabled yet (0). Sadly, the Rockchip IOMMU driver tends to conflate the two things by considering a non-zero return value as successful. This has the consequence of hiding other bugs, so let's handle this case throughout the driver, with a WARN_ON_ONCE so that we can try and work out what happened. Fixes: 0f181d3cf7d98 ("iommu/rockchip: Add runtime PM support") Signed-off-by: Marc Zyngier >>> >>> I'm still not sure about the !CONFIG_PM case, as it was probably silently >>> working in that case before >> >> Do we agree that this is an orthogonal problem though? > > Nope ;-) I.e. right now the code ignores the -EINVAL from disabled PM > and continues, possibly even handling the irq correctly. Ah, I now see what you mean. Yeah, this is a bit rubbish. It would have been better if the API returned something more sensible in that case, but that's a bit late... > If it actually worked is a different matter, as I guess nobody really tried > with !PM in the past. I don't think anyone noticed. !CONFIG_PM on something like rk3399 probably isn't very popular, and certainly comes for free on a multiplatform kernel. > Now with error-handling we always return IRQ_NONE for !PM. Yup. >>> But on the other hand we're also already running over it in other places >>> like in the iommu-shutdown and I guess if someone _really_ disabled >>> CONFIG_PM, a lot of additional stuff would fail anyway. >>> >>> So should we wrap that in some #ifdef magic, just ignore it or simply >>> select PM similar to what Tegra, Renesas and Vexpress seem to do? >>> >>> I guess I like the 3rd option best ;-) >> >> It probably doesn't hurt. 
At what level do you want it? As a dependency >> to the IOMMU? or to the platform? > > I guess it might be best to go the Tegra, etc way. Whoever in their right > mind would want to drive a mobile platform without any form for power > management ;-) . > > I can do these patches for arm32+arm64 myself ... I just wanted to put > that thought out there - in case that was just a stupid idea of mine :-D . Not stupid at all. Regarding this very patch: where do you want me to take it? M. -- Jazz is not dead. It just smells funny... ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: DMA related cleanups for IA64
Tony, Fenghua, any chance you could look over these patches for 4.19? That would make my life for the next merge window a lot easier. On Wed, Aug 01, 2018 at 06:02:43PM +0200, Christoph Hellwig wrote: > Hi all, > > this is a resend of the last two series plus additional cleanups. The > driver of it were the odd dma barriers in the ia64 sync_single_* methods, > but once I started to look into that area the fallback got bigger and > bigger.. ---end quoted text---
Re: [PATCH 1/2] iommu/rockchip: Handle errors returned from PM framework
Am Dienstag, 7. August 2018, 14:31:49 CEST schrieb Marc Zyngier: > On 07/08/18 13:09, Heiko Stuebner wrote: > > Hi Marc, > > > > Am Dienstag, 7. August 2018, 10:54:05 CEST schrieb Marc Zyngier: > >> pm_runtime_get_if_in_use can fail: either PM has been disabled > >> altogether (-EINVAL), or the device hasn't been enabled yet (0). > >> Sadly, the Rockchip IOMMU driver tends to conflate the two things > >> by considering a non-zero return value as successful. > >> > >> This has the consequence of hiding other bugs, so let's handle this > >> case throughout the driver, with a WARN_ON_ONCE so that we can try > >> and work out what happened. > >> > >> Fixes: 0f181d3cf7d98 ("iommu/rockchip: Add runtime PM support") > >> Signed-off-by: Marc Zyngier > > > > I'm still not sure about the !CONFIG_PM case, as it was probably silently > > working in that case before > > Do we agree that this is an orthogonal problem though? Nope ;-) I.e. right now the code ignores the -EINVAL from disabled PM and continues, possibly even handling the irq correctly. If it actually worked is a different matter, as I guess nobody really tried with !PM in the past. Now with error-handling we always return IRQ_NONE for !PM. > > But on the other hand we're also already running over it in other places > > like in the iommu-shutdown and I guess if someone _really_ disabled > > CONFIG_PM, a lot of additional stuff would fail anyway. > > > > So should we wrap that in some #ifdef magic, just ignore it or simply > > select PM similar to what Tegra, Renesas and Vexpress seem to do? > > > > I guess I like the 3rd option best ;-) > > It probably doesn't hurt. At what level do you want it? As a dependency > to the IOMMU? or to the platform? I guess it might be best to go the Tegra, etc way. Whoever in their right mind would want to drive a mobile platform without any form for power management ;-) . I can do these patches for arm32+arm64 myself ... 
I just wanted to put that thought out there - in case that was just a stupid idea of mine :-D . Heiko ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/2] iommu/rockchip: Handle errors returned from PM framework
On 07/08/18 13:09, Heiko Stuebner wrote: > Hi Marc, > > Am Dienstag, 7. August 2018, 10:54:05 CEST schrieb Marc Zyngier: >> pm_runtime_get_if_in_use can fail: either PM has been disabled >> altogether (-EINVAL), or the device hasn't been enabled yet (0). >> Sadly, the Rockchip IOMMU driver tends to conflate the two things >> by considering a non-zero return value as successful. >> >> This has the consequence of hiding other bugs, so let's handle this >> case throughout the driver, with a WARN_ON_ONCE so that we can try >> and work out what happened. >> >> Fixes: 0f181d3cf7d98 ("iommu/rockchip: Add runtime PM support") >> Signed-off-by: Marc Zyngier > > I'm still not sure about the !CONFIG_PM case, as it was probably silently > working in that case before Do we agree that this is an orthogonal problem though? > > But on the other hand we're also already running over it in other places > like in the iommu-shutdown and I guess if someone _really_ disabled > CONFIG_PM, a lot of additional stuff would fail anyway. > > So should we wrap that in some #ifdef magic, just ignore it or simply > select PM similar to what Tegra, Renesas and Vexpress seem to do? > > I guess I like the 3rd option best ;-) It probably doesn't hurt. At what level do you want it? As a dependency to the IOMMU? or to the platform? Thanks, M. -- Jazz is not dead. It just smells funny... ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/2] iommu/rockchip: Handle errors returned from PM framework
Hi Marc, Am Dienstag, 7. August 2018, 10:54:05 CEST schrieb Marc Zyngier: > pm_runtime_get_if_in_use can fail: either PM has been disabled > altogether (-EINVAL), or the device hasn't been enabled yet (0). > Sadly, the Rockchip IOMMU driver tends to conflate the two things > by considering a non-zero return value as successful. > > This has the consequence of hiding other bugs, so let's handle this > case throughout the driver, with a WARN_ON_ONCE so that we can try > and work out what happened. > > Fixes: 0f181d3cf7d98 ("iommu/rockchip: Add runtime PM support") > Signed-off-by: Marc Zyngier I'm still not sure about the !CONFIG_PM case, as it was probably silently working in that case before. But on the other hand we're also already running over it in other places like in the iommu-shutdown and I guess if someone _really_ disabled CONFIG_PM, a lot of additional stuff would fail anyway. So should we wrap that in some #ifdef magic, just ignore it or simply select PM similar to what Tegra, Renesas and Vexpress seem to do? 
I guess I like the 3rd option best ;-) Heiko > drivers/iommu/rockchip-iommu.c | 21 +++-- > 1 file changed, 15 insertions(+), 6 deletions(-) > > diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c > index 054cd2c8e9c8..4e0f9b61cd7f 100644 > --- a/drivers/iommu/rockchip-iommu.c > +++ b/drivers/iommu/rockchip-iommu.c > @@ -521,10 +521,11 @@ static irqreturn_t rk_iommu_irq(int irq, void *dev_id) > u32 int_status; > dma_addr_t iova; > irqreturn_t ret = IRQ_NONE; > - int i; > + int i, err; > > - if (WARN_ON(!pm_runtime_get_if_in_use(iommu->dev))) > - return 0; > + err = pm_runtime_get_if_in_use(iommu->dev); > + if (WARN_ON_ONCE(err <= 0)) > + return ret; > > if (WARN_ON(clk_bulk_enable(iommu->num_clocks, iommu->clocks))) > goto out; > @@ -620,11 +621,15 @@ static void rk_iommu_zap_iova(struct rk_iommu_domain > *rk_domain, > spin_lock_irqsave(_domain->iommus_lock, flags); > list_for_each(pos, _domain->iommus) { > struct rk_iommu *iommu; > + int ret; > > iommu = list_entry(pos, struct rk_iommu, node); > > /* Only zap TLBs of IOMMUs that are powered on. 
*/ > - if (pm_runtime_get_if_in_use(iommu->dev)) { > + ret = pm_runtime_get_if_in_use(iommu->dev); > + if (WARN_ON_ONCE(ret < 0)) > + continue; > + if (ret) { > WARN_ON(clk_bulk_enable(iommu->num_clocks, > iommu->clocks)); > rk_iommu_zap_lines(iommu, iova, size); > @@ -891,6 +896,7 @@ static void rk_iommu_detach_device(struct iommu_domain > *domain, > struct rk_iommu *iommu; > struct rk_iommu_domain *rk_domain = to_rk_domain(domain); > unsigned long flags; > + int ret; > > /* Allow 'virtual devices' (eg drm) to detach from domain */ > iommu = rk_iommu_from_dev(dev); > @@ -909,7 +915,9 @@ static void rk_iommu_detach_device(struct iommu_domain > *domain, > list_del_init(>node); > spin_unlock_irqrestore(_domain->iommus_lock, flags); > > - if (pm_runtime_get_if_in_use(iommu->dev)) { > + ret = pm_runtime_get_if_in_use(iommu->dev); > + WARN_ON_ONCE(ret < 0); > + if (ret > 0) { > rk_iommu_disable(iommu); > pm_runtime_put(iommu->dev); > } > @@ -946,7 +954,8 @@ static int rk_iommu_attach_device(struct iommu_domain > *domain, > list_add_tail(>node, _domain->iommus); > spin_unlock_irqrestore(_domain->iommus_lock, flags); > > - if (!pm_runtime_get_if_in_use(iommu->dev)) > + ret = pm_runtime_get_if_in_use(iommu->dev); > + if (!ret || WARN_ON_ONCE(ret < 0)) > return 0; > ret = rk_iommu_enable(iommu); > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] iommu/rockchip: Move irq request past pm_runtime_enable
Am Dienstag, 7. August 2018, 10:54:06 CEST schrieb Marc Zyngier: > Enabling the interrupt early, before power has been applied to the > device, can result in an interrupt being delivered too early if: > > - the IOMMU shares an interrupt with a VOP > - the VOP has a pending interrupt (after a kexec, for example) > > In these conditions, we end-up taking the interrupt without > the IOMMU being ready to handle the interrupt (not powered on). > > Moving the interrupt request past the pm_runtime_enable() call > makes sure we can at least access the IOMMU registers. Note that > this is only a partial fix, and that the VOP interrupt will still > be screaming until the VOP driver kicks in, which advocates for > a more synchronized interrupt enabling/disabling approach. > > Fixes: 0f181d3cf7d98 ("iommu/rockchip: Add runtime PM support") > Signed-off-by: Marc Zyngier Reviewed-by: Heiko Stuebner ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range
As an optimisation for PCI devices, the first allocation attempt is always made from the SAC (32-bit) address range. This leads to unnecessary attempts/function calls when there are no free ranges available. This patch optimises that path by adding a flag to track a previously failed attempt and avoids further attempts until a replenish happens. Signed-off-by: Ganapatrao Kulkarni --- This patch is based on comments from Robin Murphy for patch [1] [1] https://lkml.org/lkml/2018/4/19/780 drivers/iommu/iova.c | 11 ++- include/linux/iova.h | 1 + 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 83fe262..d97bb5a 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule, iovad->granule = granule; iovad->start_pfn = start_pfn; iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad)); + iovad->free_32bit_pfns = true; iovad->flush_cb = NULL; iovad->fq = NULL; iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) cached_iova = rb_entry(iovad->cached32_node, struct iova, node); if (free->pfn_hi < iovad->dma_32bit_pfn && - free->pfn_lo >= cached_iova->pfn_lo) + free->pfn_lo >= cached_iova->pfn_lo) { iovad->cached32_node = rb_next(&free->node); + iovad->free_32bit_pfns = true; + } cached_iova = rb_entry(iovad->cached_node, struct iova, node); if (free->pfn_lo >= cached_iova->pfn_lo) @@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long size, struct iova *new_iova; int ret; + if (limit_pfn <= iovad->dma_32bit_pfn && + !iovad->free_32bit_pfns) + return NULL; + new_iova = alloc_iova_mem(); if (!new_iova) return NULL; @@ -299,6 +306,8 @@ alloc_iova(struct iova_domain *iovad, unsigned long size, if (ret) { free_iova_mem(new_iova); + if (limit_pfn <= iovad->dma_32bit_pfn) + iovad->free_32bit_pfns = false; return NULL; } diff --git
a/include/linux/iova.h b/include/linux/iova.h index 928442d..3810ba9 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -96,6 +96,7 @@ struct iova_domain { flush-queues */ atomic_t fq_timer_on; /* 1 when timer is active, 0 when not */ + bool free_32bit_pfns; }; static inline unsigned long iova_size(struct iova *iova) -- 2.9.4
[PATCH 2/2] iommu/rockchip: Move irq request past pm_runtime_enable
Enabling the interrupt early, before power has been applied to the device, can result in an interrupt being delivered too early if: - the IOMMU shares an interrupt with a VOP - the VOP has a pending interrupt (after a kexec, for example) In these conditions, we end-up taking the interrupt without the IOMMU being ready to handle the interrupt (not powered on). Moving the interrupt request past the pm_runtime_enable() call makes sure we can at least access the IOMMU registers. Note that this is only a partial fix, and that the VOP interrupt will still be screaming until the VOP driver kicks in, which advocates for a more synchronized interrupt enabling/disabling approach. Fixes: 0f181d3cf7d98 ("iommu/rockchip: Add runtime PM support") Signed-off-by: Marc Zyngier --- drivers/iommu/rockchip-iommu.c | 24 +--- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c index 4e0f9b61cd7f..2b1724e8d307 100644 --- a/drivers/iommu/rockchip-iommu.c +++ b/drivers/iommu/rockchip-iommu.c @@ -1161,17 +1161,6 @@ static int rk_iommu_probe(struct platform_device *pdev) if (iommu->num_mmu == 0) return PTR_ERR(iommu->bases[0]); - i = 0; - while ((irq = platform_get_irq(pdev, i++)) != -ENXIO) { - if (irq < 0) - return irq; - - err = devm_request_irq(iommu->dev, irq, rk_iommu_irq, - IRQF_SHARED, dev_name(dev), iommu); - if (err) - return err; - } - iommu->reset_disabled = device_property_read_bool(dev, "rockchip,disable-mmu-reset"); @@ -1228,6 +1217,19 @@ static int rk_iommu_probe(struct platform_device *pdev) pm_runtime_enable(dev); + i = 0; + while ((irq = platform_get_irq(pdev, i++)) != -ENXIO) { + if (irq < 0) + return irq; + + err = devm_request_irq(iommu->dev, irq, rk_iommu_irq, + IRQF_SHARED, dev_name(dev), iommu); + if (err) { + pm_runtime_disable(dev); + goto err_remove_sysfs; + } + } + return 0; err_remove_sysfs: iommu_device_sysfs_remove(&iommu->iommu); -- 2.18.0
[PATCH 0/2] iommu/rockchip: Runtime PM fixes
This small series addresses a couple of runtime PM issues I've spotted while running 4.18 on a Chromebook Plus (kevin, rk3399) platform, and specifically doing kexec. Note that even with these two patches, kexec is still fairly broken on rk3399, as the VOP is never turned off (see [1] for a fix). [1] https://www.spinics.net/lists/arm-kernel/msg670229.html Marc Zyngier (2): iommu/rockchip: Handle errors returned from PM framework iommu/rockchip: Move irq request past pm_runtime_enable drivers/iommu/rockchip-iommu.c | 45 +- 1 file changed, 28 insertions(+), 17 deletions(-) -- 2.18.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 1/2] iommu/rockchip: Handle errors returned from PM framework
pm_runtime_get_if_in_use can fail: either PM has been disabled altogether (-EINVAL), or the device hasn't been enabled yet (0). Sadly, the Rockchip IOMMU driver tends to conflate the two things by considering a non-zero return value as successful. This has the consequence of hiding other bugs, so let's handle this case throughout the driver, with a WARN_ON_ONCE so that we can try and work out what happened. Fixes: 0f181d3cf7d98 ("iommu/rockchip: Add runtime PM support") Signed-off-by: Marc Zyngier --- drivers/iommu/rockchip-iommu.c | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c index 054cd2c8e9c8..4e0f9b61cd7f 100644 --- a/drivers/iommu/rockchip-iommu.c +++ b/drivers/iommu/rockchip-iommu.c @@ -521,10 +521,11 @@ static irqreturn_t rk_iommu_irq(int irq, void *dev_id) u32 int_status; dma_addr_t iova; irqreturn_t ret = IRQ_NONE; - int i; + int i, err; - if (WARN_ON(!pm_runtime_get_if_in_use(iommu->dev))) - return 0; + err = pm_runtime_get_if_in_use(iommu->dev); + if (WARN_ON_ONCE(err <= 0)) + return ret; if (WARN_ON(clk_bulk_enable(iommu->num_clocks, iommu->clocks))) goto out; @@ -620,11 +621,15 @@ static void rk_iommu_zap_iova(struct rk_iommu_domain *rk_domain, spin_lock_irqsave(&rk_domain->iommus_lock, flags); list_for_each(pos, &rk_domain->iommus) { struct rk_iommu *iommu; + int ret; iommu = list_entry(pos, struct rk_iommu, node); /* Only zap TLBs of IOMMUs that are powered on.
*/ - if (pm_runtime_get_if_in_use(iommu->dev)) { + ret = pm_runtime_get_if_in_use(iommu->dev); + if (WARN_ON_ONCE(ret < 0)) + continue; + if (ret) { WARN_ON(clk_bulk_enable(iommu->num_clocks, iommu->clocks)); rk_iommu_zap_lines(iommu, iova, size); @@ -891,6 +896,7 @@ static void rk_iommu_detach_device(struct iommu_domain *domain, struct rk_iommu *iommu; struct rk_iommu_domain *rk_domain = to_rk_domain(domain); unsigned long flags; + int ret; /* Allow 'virtual devices' (eg drm) to detach from domain */ iommu = rk_iommu_from_dev(dev); @@ -909,7 +915,9 @@ static void rk_iommu_detach_device(struct iommu_domain *domain, list_del_init(&iommu->node); spin_unlock_irqrestore(&rk_domain->iommus_lock, flags); - if (pm_runtime_get_if_in_use(iommu->dev)) { + ret = pm_runtime_get_if_in_use(iommu->dev); + WARN_ON_ONCE(ret < 0); + if (ret > 0) { rk_iommu_disable(iommu); pm_runtime_put(iommu->dev); } @@ -946,7 +954,8 @@ static int rk_iommu_attach_device(struct iommu_domain *domain, list_add_tail(&iommu->node, &rk_domain->iommus); spin_unlock_irqrestore(&rk_domain->iommus_lock, flags); - if (!pm_runtime_get_if_in_use(iommu->dev)) + ret = pm_runtime_get_if_in_use(iommu->dev); + if (!ret || WARN_ON_ONCE(ret < 0)) return 0; ret = rk_iommu_enable(iommu); -- 2.18.0
Re: [PATCH 1/3] hexagon: remove the sync_single_for_cpu DMA operation
On Tue, Jul 31, 2018 at 05:22:29PM +0200, Christoph Hellwig wrote: > On Wed, Jul 25, 2018 at 06:39:27AM +0200, Christoph Hellwig wrote: > > On Tue, Jul 24, 2018 at 10:29:48PM -0500, Richard Kuo wrote: > > > Patch series looks good. Definitely appreciate the cleanup. > > > > > > I can take it through my tree, or if not: > > > > > > Acked-by: Richard Kuo > > > > Please take it through your tree, thanks! > > I haven't seen it in linux-next yet, do you still plan to take it? > > Otherwise I'll merge it in the dma-mapping tree. Given that I haven't seen this in linux-next nor haven't heard back from you I assume you are on your well deserved summer vacation. If I don't hear back until tomorrow night I'll merge it through the dma-mapping tree, so that I have it in linux-next before heading out to my summer vacation. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu