Re: [PATCH RFC] dma-direct: do not allocate a single page from CMA area

2018-11-01 Thread Christoph Hellwig
On Thu, Nov 01, 2018 at 02:07:55PM +, Robin Murphy wrote:
> On 31/10/2018 20:03, Nicolin Chen wrote:
>> The addresses within a single page are always contiguous, so there is
>> little need to allocate a single page from the CMA area. Since the
>> CMA area has a limited, predefined size, it might run out of space
>> under heavy use, where quite a lot of CMA pages may end up being
>> allocated for single pages.
>>
>> This patch skips CMA allocation for single pages and lets them go
>> through normal page allocation instead. This saves CMA space for
>> further CMA allocations.
>
> In general, this seems to make sense to me. It does represent a theoretical 
> change in behaviour for devices which have their own CMA area somewhere 
> other than kernel memory, and only ever make non-atomic allocations, but 
> I'm not sure whether that's a realistic or common enough case to really 
> worry about.

Yes, I think we should make the decision in dma_alloc_from_contiguous
based on having a per-dev CMA area or not.  There is a lot of cruft in
this area that should be cleaned up while we're at it, like always
falling back to the normal page allocator if there is no CMA area or
nothing suitable found in dma_alloc_from_contiguous instead of
having to duplicate all that in the caller.
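
As a rough illustration of that direction, the per-device-CMA decision and the page-allocator fallback could live together in one helper, something like the sketch below (the helper name, and the dev_get_cma_area() test against the default area, are assumptions for discussion rather than an existing interface):

/*
 * Sketch: decide inside the contiguous-allocation path whether CMA is
 * worth using, and fall back to the normal page allocator otherwise,
 * so callers no longer duplicate the fallback themselves.
 */
static struct page *dma_alloc_contiguous_or_pages(struct device *dev,
						  size_t count,
						  unsigned int order,
						  gfp_t gfp)
{
	struct page *page = NULL;

	/* Only bother with CMA for multi-page requests, unless the
	 * device has a dedicated CMA area of its own. */
	if (gfpflags_allow_blocking(gfp) &&
	    (count > 1 || dev_get_cma_area(dev) != dma_contiguous_default_area))
		page = dma_alloc_from_contiguous(dev, count, order,
						 gfp & __GFP_NOWARN);

	if (!page)
		page = alloc_pages_node(dev_to_node(dev), gfp, order);

	return page;
}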


Re: [PATCH v2 04/14] iommu/io-pgtable-arm-v7s: Add paddr_to_iopte and iopte_to_paddr helpers

2018-11-01 Thread Yong Wu
Hi Robin,

Thanks very much for your review.

On Thu, 2018-11-01 at 16:34 +, Robin Murphy wrote:
> [ apologies if anyone gets this twice, looks like our email's eaten 
> itself again... ]
> 
> On 24/09/2018 09:58, Yong Wu wrote:
> > Add two helper functions: paddr_to_iopte and iopte_to_paddr.
> 
> Nice and tidy! (although it will need rebasing now, but I think that's 
> just context conflicts)
> 
> > Signed-off-by: Yong Wu 
> > ---
> >   drivers/iommu/io-pgtable-arm-v7s.c | 45 
> > --
> >   1 file changed, 33 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/iommu/io-pgtable-arm-v7s.c 
> > b/drivers/iommu/io-pgtable-arm-v7s.c
> > index b5948ba..fa3b9ec 100644
> > --- a/drivers/iommu/io-pgtable-arm-v7s.c
> > +++ b/drivers/iommu/io-pgtable-arm-v7s.c
> > @@ -173,18 +173,38 @@ struct arm_v7s_io_pgtable {
> > spinlock_t  split_lock;
> >   };
> >   
> > +static bool arm_v7s_pte_is_cont(arm_v7s_iopte pte, int lvl);
> > +
> >   static dma_addr_t __arm_v7s_dma_addr(void *pages)
> >   {
> > return (dma_addr_t)virt_to_phys(pages);
> >   }
> >   
> > -static arm_v7s_iopte *iopte_deref(arm_v7s_iopte pte, int lvl)
> > +static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,
> > +   struct arm_v7s_io_pgtable *data)
> 
> Nit: it seems to work out a little cleaner if we pass the io_pgtable_cfg 
> to these helpers directly, rather than re-deriving it from data each time.

I will use "io_pgtable_cfg" instead of "arm_v7s_io_pgtable" in the next
version. This is a minor change, so I will keep your R-b.
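
For reference, the cfg-based signatures would presumably end up along these lines (a sketch of the expected v3 shape, not the posted code):

/* Sketch only: the same helpers, taking the io_pgtable_cfg directly so
 * callers don't re-derive it from the arm_v7s_io_pgtable each time. */
static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,
				    struct io_pgtable_cfg *cfg)
{
	return paddr & ARM_V7S_LVL_MASK(lvl);
}

static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
				  struct io_pgtable_cfg *cfg)
{
	arm_v7s_iopte mask;

	if (ARM_V7S_PTE_IS_TABLE(pte, lvl))
		mask = ARM_V7S_TABLE_MASK;
	else if (arm_v7s_pte_is_cont(pte, lvl))
		mask = ARM_V7S_LVL_MASK(lvl) * ARM_V7S_CONT_PAGES;
	else
		mask = ARM_V7S_LVL_MASK(lvl);

	return pte & mask;
}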

> 
> Otherwise, though,
> 
> Reviewed-by: Robin Murphy 
> 
> >   {
> > +   return paddr & ARM_V7S_LVL_MASK(lvl);
> > +}
> > +
> > +static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
> > + struct arm_v7s_io_pgtable *data)
> > +{
> > +   arm_v7s_iopte mask;
> > +
> > if (ARM_V7S_PTE_IS_TABLE(pte, lvl))
> > -   pte &= ARM_V7S_TABLE_MASK;
> > +   mask = ARM_V7S_TABLE_MASK;
> > +   else if (arm_v7s_pte_is_cont(pte, lvl))
> > +   mask = ARM_V7S_LVL_MASK(lvl) * ARM_V7S_CONT_PAGES;
> > else
> > -   pte &= ARM_V7S_LVL_MASK(lvl);
> > -   return phys_to_virt(pte);
> > +   mask = ARM_V7S_LVL_MASK(lvl);
> > +
> > +   return pte & mask;
> > +}
> > +
> > +static arm_v7s_iopte *iopte_deref(arm_v7s_iopte pte, int lvl,
> > + struct arm_v7s_io_pgtable *data)
> > +{
> > +   return phys_to_virt(iopte_to_paddr(pte, lvl, data));
> >   }
> >   
> >   static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
> > @@ -396,7 +416,7 @@ static int arm_v7s_init_pte(struct arm_v7s_io_pgtable 
> > *data,
> > if (num_entries > 1)
> > pte = arm_v7s_pte_to_cont(pte, lvl);
> >   
> > -   pte |= paddr & ARM_V7S_LVL_MASK(lvl);
> > +   pte |= paddr_to_iopte(paddr, lvl, data);
> >   
> > __arm_v7s_set_pte(ptep, pte, num_entries, cfg);
> > return 0;
> > @@ -462,7 +482,7 @@ static int __arm_v7s_map(struct arm_v7s_io_pgtable 
> > *data, unsigned long iova,
> > }
> >   
> > if (ARM_V7S_PTE_IS_TABLE(pte, lvl)) {
> > -   cptep = iopte_deref(pte, lvl);
> > +   cptep = iopte_deref(pte, lvl, data);
> > } else if (pte) {
> > /* We require an unmap first */
> > WARN_ON(!selftest_running);
> > @@ -512,7 +532,8 @@ static void arm_v7s_free_pgtable(struct io_pgtable *iop)
> > arm_v7s_iopte pte = data->pgd[i];
> >   
> > if (ARM_V7S_PTE_IS_TABLE(pte, 1))
> > -   __arm_v7s_free_table(iopte_deref(pte, 1), 2, data);
> > +   __arm_v7s_free_table(iopte_deref(pte, 1, data),
> > +2, data);
> > }
> > __arm_v7s_free_table(data->pgd, 1, data);
> > kmem_cache_destroy(data->l2_tables);
> > @@ -582,7 +603,7 @@ static size_t arm_v7s_split_blk_unmap(struct 
> > arm_v7s_io_pgtable *data,
> > if (!ARM_V7S_PTE_IS_TABLE(pte, 1))
> > return 0;
> >   
> > -   tablep = iopte_deref(pte, 1);
> > +   tablep = iopte_deref(pte, 1, data);
> > return __arm_v7s_unmap(data, iova, size, 2, tablep);
> > }
> >   
> > @@ -640,7 +661,7 @@ static size_t __arm_v7s_unmap(struct arm_v7s_io_pgtable 
> > *data,
> > io_pgtable_tlb_add_flush(iop, iova, blk_size,
> > ARM_V7S_BLOCK_SIZE(lvl + 1), false);
> > io_pgtable_tlb_sync(iop);
> > -   ptep = iopte_deref(pte[i], lvl);
> > +   ptep = iopte_deref(pte[i], lvl, data);
> > __arm_v7s_free_table(ptep, lvl + 1, data);
> > } else {
> > io_pgtable_tlb_add_flush(iop, iova, blk_size,
> > @@ -658,7 +679,7 @@ static size_t __arm_v7s_unmap(struct arm_v7s_io_pgtable 
> > *data,
> > }
> >   
> > /* Keep on walki

Re: [PATCH v2 05/14] iommu/io-pgtable-arm-v7s: Extend MediaTek 4GB Mode

2018-11-01 Thread Yong Wu
On Thu, 2018-11-01 at 16:35 +, Robin Murphy wrote:
> On 24/09/2018 09:58, Yong Wu wrote:
> > MediaTek extends the ARM v7s descriptor to support DRAM beyond 4GB.
> > 
> > In mt2712 and mt8173 this is called "4GB mode": the physical address
> > runs from 0x4000_ to 0x1_3fff_, but from the EMI's point of view it
> > is remapped to the high range from 0x1__ to 0x1__, so
> > bit 32 is always set. Thus, in the M4U, we always set bit 9 in every
> > PTE, which means enabling bit 32 of the physical address.
> > 
> > But in mt8183, the M4U supports DRAM from 0x4000_ to 0x3__,
> > which isn't remapped. We extend the PTEs: bit 9 represents bit 32 of
> > the PA and bit 4 represents bit 33 of the PA. Meanwhile the IOVA is
> > still 32 bits.
> > 
> > In order to unify the code, in "4GB mode" we add bit 32 to the
> > physical address manually in our driver.
> > 
> > Correspondingly, adding bit 32 and bit 33 to the PA in iova_to_phys
> > has to be moved into the v7s code.
> > 
> > Signed-off-by: Yong Wu 
> > ---
> >   drivers/iommu/io-pgtable-arm-v7s.c | 34 +++---
> >   drivers/iommu/io-pgtable.h |  7 +++
> >   drivers/iommu/mtk_iommu.c  | 16 ++--
> >   drivers/iommu/mtk_iommu.h  |  1 +
> >   4 files changed, 41 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/iommu/io-pgtable-arm-v7s.c 
> > b/drivers/iommu/io-pgtable-arm-v7s.c
> > index fa3b9ec..3679a5f 100644
> > --- a/drivers/iommu/io-pgtable-arm-v7s.c
> > +++ b/drivers/iommu/io-pgtable-arm-v7s.c
> > @@ -124,7 +124,9 @@
> >   #define ARM_V7S_TEX_MASK  0x7
> >   #define ARM_V7S_ATTR_TEX(val) (((val) & ARM_V7S_TEX_MASK) << 
> > ARM_V7S_TEX_SHIFT)
> >   
> > -#define ARM_V7S_ATTR_MTK_4GB   BIT(9) /* MTK extend it for 4GB 
> > mode */
> > +/* MediaTek extend the two bits below for over 4GB mode */
> > +#define ARM_V7S_ATTR_MTK_PA_BIT32  BIT(9)
> > +#define ARM_V7S_ATTR_MTK_PA_BIT33  BIT(4)
> >   
> >   /* *well, except for TEX on level 2 large pages, of course :( */
> >   #define ARM_V7S_CONT_PAGE_TEX_SHIFT   6
> > @@ -183,13 +185,24 @@ static dma_addr_t __arm_v7s_dma_addr(void *pages)
> >   static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,
> > struct arm_v7s_io_pgtable *data)
> >   {
> > -   return paddr & ARM_V7S_LVL_MASK(lvl);
> > +   arm_v7s_iopte pte = paddr & ARM_V7S_LVL_MASK(lvl);
> > +   struct io_pgtable_cfg *cfg = &data->iop.cfg;
> > +
> > +   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB) {
> > +   if (paddr & BIT_ULL(32))
> > +   pte |= ARM_V7S_ATTR_MTK_PA_BIT32;
> > +   if (paddr & BIT_ULL(33))
> > +   pte |= ARM_V7S_ATTR_MTK_PA_BIT33;
> > +   }
> > +   return pte;
> >   }
> >   
> >   static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
> >   struct arm_v7s_io_pgtable *data)
> >   {
> > +   struct io_pgtable_cfg *cfg = &data->iop.cfg;
> > arm_v7s_iopte mask;
> > +   phys_addr_t paddr;
> >   
> > if (ARM_V7S_PTE_IS_TABLE(pte, lvl))
> > mask = ARM_V7S_TABLE_MASK;
> > @@ -198,7 +211,15 @@ static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, 
> > int lvl,
> > else
> > mask = ARM_V7S_LVL_MASK(lvl);
> >   
> > -   return pte & mask;
> > +   paddr = pte & mask;
> > +   if (IS_ENABLED(CONFIG_PHYS_ADDR_T_64BIT) &&
> 
> I don't think we need this IS_ENABLED() check - we don't have one in the 
> equivalent paddr_to_iopte() case above, and what it's protecting is 
> clearly a no-op anyway when paddr has a 32-bit type (a GCC 7 build for 
> non-LPAE arch/arm doesn't complain with the check removed).

OK. I will delete the IS_ENABLED() check here.
Thanks.
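
So the decode path would presumably reduce to something like this, with the check simply dropped (a sketch of the expected change, not the posted v3):

/* Sketch: IS_ENABLED(CONFIG_PHYS_ADDR_T_64BIT) removed - when phys_addr_t
 * is a 32-bit type the BIT_ULL(32)/BIT_ULL(33) ORs are a no-op anyway. */
static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
				  struct arm_v7s_io_pgtable *data)
{
	struct io_pgtable_cfg *cfg = &data->iop.cfg;
	arm_v7s_iopte mask;
	phys_addr_t paddr;

	if (ARM_V7S_PTE_IS_TABLE(pte, lvl))
		mask = ARM_V7S_TABLE_MASK;
	else if (arm_v7s_pte_is_cont(pte, lvl))
		mask = ARM_V7S_LVL_MASK(lvl) * ARM_V7S_CONT_PAGES;
	else
		mask = ARM_V7S_LVL_MASK(lvl);

	paddr = pte & mask;
	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB) {
		if (pte & ARM_V7S_ATTR_MTK_PA_BIT32)
			paddr |= BIT_ULL(32);
		if (pte & ARM_V7S_ATTR_MTK_PA_BIT33)
			paddr |= BIT_ULL(33);
	}
	return paddr;
}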

> 
> Other than that, though, the rest looks good now;
> 
> Reviewed-by: Robin Murphy 
> 
> > +   cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB) {
> > +   if (pte & ARM_V7S_ATTR_MTK_PA_BIT32)
> > +   paddr |= BIT_ULL(32);
> > +   if (pte & ARM_V7S_ATTR_MTK_PA_BIT33)
> > +   paddr |= BIT_ULL(33);
> > +   }
> > +   return paddr;
> >   }
> >   
> >   static arm_v7s_iopte *iopte_deref(arm_v7s_iopte pte, int lvl,
> > @@ -315,9 +336,6 @@ static arm_v7s_iopte arm_v7s_prot_to_pte(int prot, int 
> > lvl,
> > if (lvl == 1 && (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS))
> > pte |= ARM_V7S_ATTR_NS_SECTION;
> >   
> > -   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB)
> > -   pte |= ARM_V7S_ATTR_MTK_4GB;
> > -
> > return pte;
> >   }
> >   
> > @@ -504,7 +522,9 @@ static int arm_v7s_map(struct io_pgtable_ops *ops, 
> > unsigned long iova,
> > if (!(prot & (IOMMU_READ | IOMMU_WRITE)))
> > return 0;
> >   
> > -   if (WARN_ON(upper_32_bits(iova) || upper_32_bits(paddr)))
> > +   if (WARN_ON(upper_32_bits(iova)) ||
> > +   WARN_ON(upper_32_bits(paddr) &&
> > +   !(iop->cfg.quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB)))
> > return -ERANGE;

Re: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-11-01 Thread Lu Baolu

Hi,

On 10/23/18 12:48 AM, Jordan Crouse wrote:

On Fri, Oct 19, 2018 at 07:11:52PM +0100, Jean-Philippe Brucker wrote:


(2) Allocate a domain and attach it to the device.

   dom = iommu_domain_alloc()
   iommu_attach_device(dom, dev)

 I still have concerns about this part, which are highlighted by the
 messy changes of patch 1. I think it would make more sense to
 introduce new attach/detach_dev_aux() functions instead of reusing
 attach/detach_dev()

 Can we reconsider this and avoid unnecessary complications in IOMMU
 core and drivers? Does the VFIO code greatly benefit from using the
 same attach() function? It could as well use a different one for
 devices in AUXD mode, which the mediating driver could tell by
 adding a flag in mdev_set_iommu_device(), for example.

 And I don't think other users of AUXD would benefit from using the
 same attach() function, since they will know whether they want to be
 using main or auxiliary domain when doing attach().


I second this. For the arm-smmu-v2 multiple-pagetable model we would either need
new API functions or do something along the lines of

dom = iommu_domain_alloc()
iommu_set_attr(dom, DOMAIN_ATTR_I_WANT_TO_BE_AUX)
iommu_attach_device(dom, dev)

Either that, or do some dummy device magic that I haven't started to work
out in my head. Having aux-specific functions just for this step would make
the arm-smmu-v2 version work out much cleaner.


Okay! If there are no more comments, I will prepare a new version with
aux specific functions.

Best regards,
Lu Baolu
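
For illustration, aux-specific entry points along the lines discussed above might look roughly like the sketch below; the names and the iommu_ops hooks are assumptions for discussion, not a settled API:

/*
 * Sketch of dedicated auxiliary-domain calls, so the existing
 * attach/detach_dev() paths stay untouched. A mediating driver would
 * call these for devices in AUXD mode instead of iommu_attach_device().
 */
int iommu_aux_attach_device(struct iommu_domain *domain, struct device *dev)
{
	if (!domain->ops->aux_attach_dev)
		return -ENODEV;

	return domain->ops->aux_attach_dev(domain, dev);
}

void iommu_aux_detach_device(struct iommu_domain *domain, struct device *dev)
{
	if (domain->ops->aux_detach_dev)
		domain->ops->aux_detach_dev(domain, dev);
}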


Re: [RFC] iommu/vt-d: Group and domain relationship

2018-11-01 Thread Lu Baolu

Hi,

On 10/30/18 10:18 PM, James Sewart via iommu wrote:

Hey,

I’ve been investigating the relationship between iommu groups and domains
on our systems and have a few questions. Why does the Intel IOMMU code not
allow allocating IOMMU_DOMAIN_DMA? Returning NULL when given this domain
type has the side effect that the default_domain for an iommu group is not
set, which, for e.g. dma_map_ops.map_page, means a domain is allocated per
device.


The Intel VT-d driver doesn't implement default domains and allocates a
domain only on demand. There are lots of things to do before we can allow
the iommu API to allocate domain types other than IOMMU_DOMAIN_UNMANAGED.

Best regards,
Lu Baolu



This seems to be the opposite behaviour to the AMD iommu code which
supports allocating an IOMMU_DOMAIN_DMA and will only look to the iommu
group if a domain is not attached to the device rather than allocating a
new one. On AMD every device in an iommu group will share the same domain.

Appended is what I think a patch implementing domain_alloc for
IOMMU_DOMAIN_DMA and also IOMMU_DOMAIN_IDENTITY would look like. Testing
shows each device in a group will share a domain by default; it also
allows allocating a new DMA domain that can be successfully attached to a
group with iommu_attach_group.

Looking for comment on why the behaviour is how it is currently and if
there are any issues with the solution I’ve been testing.

Cheers,
James.


diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index bff2abd6..3a58389f 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5170,10 +5170,15 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 	struct dmar_domain *dmar_domain;
 	struct iommu_domain *domain;
 
-	if (type != IOMMU_DOMAIN_UNMANAGED)
+	if (type == IOMMU_DOMAIN_UNMANAGED)
+		dmar_domain = alloc_domain(DOMAIN_FLAG_VIRTUAL_MACHINE);
+	else if(type == IOMMU_DOMAIN_DMA)
+		dmar_domain = alloc_domain(0);
+	else if(type == IOMMU_DOMAIN_IDENTITY)
+		dmar_domain = alloc_domain(DOMAIN_FLAG_STATIC_IDENTITY);
+	else
 		return NULL;
 
-	dmar_domain = alloc_domain(DOMAIN_FLAG_VIRTUAL_MACHINE);
 	if (!dmar_domain) {
 		pr_err("Can't allocate dmar_domain\n");
 		return NULL;
@@ -5186,9 +5191,12 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 	domain_update_iommu_cap(dmar_domain);
 
 	domain = &dmar_domain->domain;
-	domain->geometry.aperture_start = 0;
-	domain->geometry.aperture_end   = __DOMAIN_MAX_ADDR(dmar_domain->gaw);
-	domain->geometry.force_aperture = true;
+
+	if (type == IOMMU_DOMAIN_UNMANAGED) {
+		domain->geometry.aperture_start = 0;
+		domain->geometry.aperture_end   = __DOMAIN_MAX_ADDR(dmar_domain->gaw);
+		domain->geometry.force_aperture = true;
+	}
 
 	return domain;
 }
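
For context, the iommu_attach_group() usage mentioned above would look roughly like this from a caller's point of view (a sketch; the wrapper name and error handling are assumptions):

/* Sketch: allocate a domain and attach it to every device in a group. */
static int example_attach_group_domain(struct bus_type *bus,
				       struct iommu_group *group)
{
	struct iommu_domain *dom;
	int ret;

	dom = iommu_domain_alloc(bus);		/* e.g. &pci_bus_type */
	if (!dom)
		return -ENOMEM;

	ret = iommu_attach_group(dom, group);
	if (ret)
		iommu_domain_free(dom);

	return ret;
}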

[PATCH] iommu/dma: Zero pages manually in a length of scatterlist

2018-11-01 Thread Nicolin Chen
The __GFP_ZERO flag is passed down to the generic page allocation
routine, which zeroes everything page by page. This is safe as a
generic approach, but it is not efficient for iommu allocations that
organize contiguous pages using a scatterlist.

So this change drops __GFP_ZERO from the flags and adds a manual
memset after the page/sg allocations, using the scatterlist lengths.

My test of a 2.5MB allocation shows iommu_dma_alloc() takes 46% less
time, reduced from an average of 925 usec to 500 usec.

Signed-off-by: Nicolin Chen 
---
 drivers/iommu/dma-iommu.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b04753b204..e48d995e65c5 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -551,10 +551,13 @@ struct page **iommu_dma_alloc(struct device *dev, size_t 
size, gfp_t gfp,
struct iommu_domain *domain = iommu_get_dma_domain(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iova_domain *iovad = &cookie->iovad;
+   struct scatterlist *s;
struct page **pages;
struct sg_table sgt;
dma_addr_t iova;
unsigned int count, min_size, alloc_sizes = domain->pgsize_bitmap;
+   bool gfp_zero = false;
+   int i;
 
*handle = IOMMU_MAPPING_ERROR;
 
@@ -568,6 +571,15 @@ struct page **iommu_dma_alloc(struct device *dev, size_t 
size, gfp_t gfp,
if (attrs & DMA_ATTR_ALLOC_SINGLE_PAGES)
alloc_sizes = min_size;
 
+   /*
+* The generic zeroing in a length of one page size is slow,
+* so do it manually in a length of scatterlist size instead
+*/
+   if (gfp & __GFP_ZERO) {
+   gfp &= ~__GFP_ZERO;
+   gfp_zero = true;
+   }
+
count = PAGE_ALIGN(size) >> PAGE_SHIFT;
pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
if (!pages)
@@ -581,6 +593,12 @@ struct page **iommu_dma_alloc(struct device *dev, size_t 
size, gfp_t gfp,
if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
goto out_free_iova;
 
+   if (gfp_zero) {
+   /* Now zero all the pages in the scatterlist */
+   for_each_sg(sgt.sgl, s, sgt.orig_nents, i)
+   memset(sg_virt(s), 0, s->length);
+   }
+
if (!(prot & IOMMU_CACHE)) {
struct sg_mapping_iter miter;
/*
-- 
2.17.1



Re: [PATCH RFC] dma-direct: do not allocate a single page from CMA area

2018-11-01 Thread Nicolin Chen
On Thu, Nov 01, 2018 at 07:32:39PM +, Robin Murphy wrote:

> > On Thu, Nov 01, 2018 at 02:07:55PM +, Robin Murphy wrote:
> > > On 31/10/2018 20:03, Nicolin Chen wrote:
> > > > The addresses within a single page are always contiguous, so there
> > > > is little need to allocate a single page from the CMA area. Since
> > > > the CMA area has a limited, predefined size, it might run out of
> > > > space under heavy use, where quite a lot of CMA pages may end up
> > > > being allocated for single pages.
> > > > 
> > > > This patch skips CMA allocation for single pages and lets them go
> > > > through normal page allocation instead. This saves CMA space for
> > > > further CMA allocations.
> > > 
> > > In general, this seems to make sense to me. It does represent a 
> > > theoretical
> > > change in behaviour for devices which have their own CMA area somewhere
> > > other than kernel memory, and only ever make non-atomic allocations, but 
> > > I'm
> > > not sure whether that's a realistic or common enough case to really worry
> > > about.
> > 
> > Hmm... I don't quite understand the concern about whether it's realistic.
> > Would you mind elaborating a bit?
> 
> I only mean the case where a driver previously happened to get single pages
> allocated from a per-device CMA area, would now always get them fulfilled
> from regular kernel memory instead, and actually cares about the difference.

I see. I think that's a good question.

> As I say, that's a contrived case that I doubt is honestly a significant
> concern, but it's not *entirely* inconceivable. I've just been bitten before
> by drivers relying on specific DMA API implementation behaviour which was
> never guaranteed or even necessarily correct by the terms of the API itself,
> so I'm naturally wary of the corner cases ;)

I also have a vague concern that CMA pages might turn out to be
special, so this change could make a difference in stability or
performance for users who actually rely on real CMA pages, though
I'm not sure whether that's true or realistic, as you said.

> On second thought, however, I suppose we could always key this off
> DMA_ATTR_FORCE_CONTIGUOUS as well if we really want - technically it has a
> more general meaning than "only ever allocate from CMA", but in practice if
> that's the behaviour a driver wants, then that flag is already the only way
> it can even hope to get dma_alloc_coherent() to comply anywhere near
> reliably.

That is good input. Would you prefer to have that in the condition
check now, with this patch?
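
For instance, keying it off the attribute might look roughly like this in dma_direct_alloc_pages() (a sketch only, assuming the attrs value is in scope at this point; not a tested change):

	/*
	 * Sketch: still try CMA for a single page when the caller has
	 * explicitly asked for contiguous memory.
	 */
	if (gfpflags_allow_blocking(gfp) &&
	    (count > 1 || (attrs & DMA_ATTR_FORCE_CONTIGUOUS))) {
		page = dma_alloc_from_contiguous(dev, count, page_order,
						 gfp & __GFP_NOWARN);
		if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
			dma_release_from_contiguous(dev, page, count);
			page = NULL;
		}
	}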

> > When I tested this change on a Tegra186
> > board, I saw some single-page allocations being directed to the
> > normal allocator, and the "CmaFree" size reported in /proc/meminfo
> > also increased. Does this mean it's realistic?
> 
> Indeed - I happen to have CMA debug enabled for no good reason in my current
> development config, and on my relatively unexciting Juno board single-page
> allocations turn out to be the majority by number, even if not by total
> consumption:
> 
> [0.519663] cma: cma_alloc(cma (ptrval), count 64, align 6)
> [0.527508] cma: cma_alloc(): returned (ptrval)
> [3.768066] cma: cma_alloc(cma (ptrval), count 1, align 0)
> [3.774566] cma: cma_alloc(): returned (ptrval)
> [3.860097] cma: cma_alloc(cma (ptrval), count 1875, align 8)
> [3.867150] cma: cma_alloc(): returned (ptrval)
> [3.920796] cma: cma_alloc(cma (ptrval), count 31, align 5)
> [3.927093] cma: cma_alloc(): returned (ptrval)
> [3.932326] cma: cma_alloc(cma (ptrval), count 31, align 5)
> [3.938643] cma: cma_alloc(): returned (ptrval)
> [4.022188] cma: cma_alloc(cma (ptrval), count 1, align 0)
> [4.028415] cma: cma_alloc(): returned (ptrval)
> [4.033600] cma: cma_alloc(cma (ptrval), count 1, align 0)
> [4.039786] cma: cma_alloc(): returned (ptrval)
> [4.044968] cma: cma_alloc(cma (ptrval), count 1, align 0)
> [4.051150] cma: cma_alloc(): returned (ptrval)
> [4.113556] cma: cma_alloc(cma (ptrval), count 1, align 0)
> [4.119785] cma: cma_alloc(): returned (ptrval)
> [5.012654] cma: cma_alloc(cma (ptrval), count 1, align 0)
> [5.019047] cma: cma_alloc(): returned (ptrval)
> [   11.485179] cma: cma_alloc(cma 9dd074ee, count 1, align 0)
> [   11.492096] cma: cma_alloc(): returned 9264a86c
> [   12.269355] cma: cma_alloc(cma 9dd074ee, count 1875, align 8)
> [   12.277535] cma: cma_alloc(): returned d7bb9ae5
> [   12.286110] cma: cma_alloc(cma 9dd074ee, count 4, align 2)
> [   12.292507] cma: cma_alloc(): returned 07ba7a39
> 
> I don't have any exciting peripherals to really exercise the coherent
> allocator, but I imagine that fragmentation is probably just as good a
> reason as total CMA usage for avoiding trivial allocations by default.

Re: [PATCH RFC] dma-direct: do not allocate a single page from CMA area

2018-11-01 Thread Robin Murphy

On 01/11/2018 18:04, Nicolin Chen wrote:

Hi Robin,

Thanks for the comments.

On Thu, Nov 01, 2018 at 02:07:55PM +, Robin Murphy wrote:

On 31/10/2018 20:03, Nicolin Chen wrote:

The addresses within a single page are always contiguous, so there is
little need to allocate a single page from the CMA area. Since the
CMA area has a limited, predefined size, it might run out of space
under heavy use, where quite a lot of CMA pages may end up being
allocated for single pages.

This patch skips CMA allocation for single pages and lets them go
through normal page allocation instead. This saves CMA space for
further CMA allocations.


In general, this seems to make sense to me. It does represent a theoretical
change in behaviour for devices which have their own CMA area somewhere
other than kernel memory, and only ever make non-atomic allocations, but I'm
not sure whether that's a realistic or common enough case to really worry
about.


Hmm... I don't quite understand the concern about whether it's realistic.
Would you mind elaborating a bit?


I only mean the case where a driver previously happened to get single 
pages allocated from a per-device CMA area, would now always get them 
fulfilled from regular kernel memory instead, and actually cares about 
the difference. As I say, that's a contrived case that I doubt is 
honestly a significant concern, but it's not *entirely* inconceivable. 
I've just been bitten before by drivers relying on specific DMA API 
implementation behaviour which was never guaranteed or even necessarily 
correct by the terms of the API itself, so I'm naturally wary of the 
corner cases ;)


On second thought, however, I suppose we could always key this off 
DMA_ATTR_FORCE_CONTIGUOUS as well if we really want - technically it has 
a more general meaning than "only ever allocate from CMA", but in 
practice if that's the behaviour a driver wants, then that flag is 
already the only way it can even hope to get dma_alloc_coherent() to 
comply anywhere near reliably.



When I tested this change on a Tegra186
board, I saw some single-page allocations being directed to the
normal allocator, and the "CmaFree" size reported in /proc/meminfo
also increased. Does this mean it's realistic?


Indeed - I happen to have CMA debug enabled for no good reason in my 
current development config, and on my relatively unexciting Juno board 
single-page allocations turn out to be the majority by number, even if 
not by total consumption:


[0.519663] cma: cma_alloc(cma (ptrval), count 64, align 6)
[0.527508] cma: cma_alloc(): returned (ptrval)
[3.768066] cma: cma_alloc(cma (ptrval), count 1, align 0)
[3.774566] cma: cma_alloc(): returned (ptrval)
[3.860097] cma: cma_alloc(cma (ptrval), count 1875, align 8)
[3.867150] cma: cma_alloc(): returned (ptrval)
[3.920796] cma: cma_alloc(cma (ptrval), count 31, align 5)
[3.927093] cma: cma_alloc(): returned (ptrval)
[3.932326] cma: cma_alloc(cma (ptrval), count 31, align 5)
[3.938643] cma: cma_alloc(): returned (ptrval)
[4.022188] cma: cma_alloc(cma (ptrval), count 1, align 0)
[4.028415] cma: cma_alloc(): returned (ptrval)
[4.033600] cma: cma_alloc(cma (ptrval), count 1, align 0)
[4.039786] cma: cma_alloc(): returned (ptrval)
[4.044968] cma: cma_alloc(cma (ptrval), count 1, align 0)
[4.051150] cma: cma_alloc(): returned (ptrval)
[4.113556] cma: cma_alloc(cma (ptrval), count 1, align 0)
[4.119785] cma: cma_alloc(): returned (ptrval)
[5.012654] cma: cma_alloc(cma (ptrval), count 1, align 0)
[5.019047] cma: cma_alloc(): returned (ptrval)
[   11.485179] cma: cma_alloc(cma 9dd074ee, count 1, align 0)
[   11.492096] cma: cma_alloc(): returned 9264a86c
[   12.269355] cma: cma_alloc(cma 9dd074ee, count 1875, align 8)
[   12.277535] cma: cma_alloc(): returned d7bb9ae5
[   12.286110] cma: cma_alloc(cma 9dd074ee, count 4, align 2)
[   12.292507] cma: cma_alloc(): returned 07ba7a39

I don't have any exciting peripherals to really exercise the coherent 
allocator, but I imagine that fragmentation is probably just as good a 
reason as total CMA usage for avoiding trivial allocations by default.


Robin.


Re: [PATCH RFC] dma-direct: do not allocate a single page from CMA area

2018-11-01 Thread Nicolin Chen
Hi Robin,

Thanks for the comments.

On Thu, Nov 01, 2018 at 02:07:55PM +, Robin Murphy wrote:
> On 31/10/2018 20:03, Nicolin Chen wrote:
> > The addresses within a single page are always contiguous, so there is
> > little need to allocate a single page from the CMA area. Since the
> > CMA area has a limited, predefined size, it might run out of space
> > under heavy use, where quite a lot of CMA pages may end up being
> > allocated for single pages.
> > 
> > This patch skips CMA allocation for single pages and lets them go
> > through normal page allocation instead. This saves CMA space for
> > further CMA allocations.
> 
> In general, this seems to make sense to me. It does represent a theoretical
> change in behaviour for devices which have their own CMA area somewhere
> other than kernel memory, and only ever make non-atomic allocations, but I'm
> not sure whether that's a realistic or common enough case to really worry
> about.

Hmm... I don't quite understand the concern about whether it's realistic.
Would you mind elaborating a bit? When I tested this change on a Tegra186
board, I saw some single-page allocations being directed to the normal
allocator, and the "CmaFree" size reported in /proc/meminfo also
increased. Does this mean it's realistic?

Thank you
Nicolin


> 
> Robin.
> 
> > Signed-off-by: Nicolin Chen 
> > ---
> >   kernel/dma/direct.c | 8 ++--
> >   1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> > index 22a12ab5a5e9..14c5d49eded2 100644
> > --- a/kernel/dma/direct.c
> > +++ b/kernel/dma/direct.c
> > @@ -120,8 +120,12 @@ void *dma_direct_alloc_pages(struct device *dev, 
> > size_t size,
> > gfp |= __dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
> > &phys_mask);
> >   again:
> > -   /* CMA can be used only in the context which permits sleeping */
> > -   if (gfpflags_allow_blocking(gfp)) {
> > +   /*
> > +* CMA can be used only in the context which permits sleeping.
> > +* Since addresses within one PAGE are always contiguous, skip
> > +* CMA allocation for a single page to save CMA reserved space
> > +*/
> > +   if (gfpflags_allow_blocking(gfp) && count > 1) {
> > page = dma_alloc_from_contiguous(dev, count, page_order,
> >  gfp & __GFP_NOWARN);
> > if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
> > 


Re: [PATCH v2 05/14] iommu/io-pgtable-arm-v7s: Extend MediaTek 4GB Mode

2018-11-01 Thread Robin Murphy
On 24/09/2018 09:58, Yong Wu wrote:
> MediaTek extends the ARM v7s descriptor to support DRAM beyond 4GB.
> 
> In mt2712 and mt8173 this is called "4GB mode": the physical address
> runs from 0x4000_ to 0x1_3fff_, but from the EMI's point of view it
> is remapped to the high range from 0x1__ to 0x1__, so
> bit 32 is always set. Thus, in the M4U, we always set bit 9 in every
> PTE, which means enabling bit 32 of the physical address.
> 
> But in mt8183, the M4U supports DRAM from 0x4000_ to 0x3__,
> which isn't remapped. We extend the PTEs: bit 9 represents bit 32 of
> the PA and bit 4 represents bit 33 of the PA. Meanwhile the IOVA is
> still 32 bits.
> 
> In order to unify the code, in "4GB mode" we add bit 32 to the
> physical address manually in our driver.
> 
> Correspondingly, adding bit 32 and bit 33 to the PA in iova_to_phys
> has to be moved into the v7s code.
> 
> Signed-off-by: Yong Wu 
> ---
>   drivers/iommu/io-pgtable-arm-v7s.c | 34 +++---
>   drivers/iommu/io-pgtable.h |  7 +++
>   drivers/iommu/mtk_iommu.c  | 16 ++--
>   drivers/iommu/mtk_iommu.h  |  1 +
>   4 files changed, 41 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/iommu/io-pgtable-arm-v7s.c 
> b/drivers/iommu/io-pgtable-arm-v7s.c
> index fa3b9ec..3679a5f 100644
> --- a/drivers/iommu/io-pgtable-arm-v7s.c
> +++ b/drivers/iommu/io-pgtable-arm-v7s.c
> @@ -124,7 +124,9 @@
>   #define ARM_V7S_TEX_MASK0x7
>   #define ARM_V7S_ATTR_TEX(val)   (((val) & ARM_V7S_TEX_MASK) << 
> ARM_V7S_TEX_SHIFT)
>   
> -#define ARM_V7S_ATTR_MTK_4GB BIT(9) /* MTK extend it for 4GB mode */
> +/* MediaTek extend the two bits below for over 4GB mode */
> +#define ARM_V7S_ATTR_MTK_PA_BIT32BIT(9)
> +#define ARM_V7S_ATTR_MTK_PA_BIT33BIT(4)
>   
>   /* *well, except for TEX on level 2 large pages, of course :( */
>   #define ARM_V7S_CONT_PAGE_TEX_SHIFT 6
> @@ -183,13 +185,24 @@ static dma_addr_t __arm_v7s_dma_addr(void *pages)
>   static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,
>   struct arm_v7s_io_pgtable *data)
>   {
> - return paddr & ARM_V7S_LVL_MASK(lvl);
> + arm_v7s_iopte pte = paddr & ARM_V7S_LVL_MASK(lvl);
> + struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +
> + if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB) {
> + if (paddr & BIT_ULL(32))
> + pte |= ARM_V7S_ATTR_MTK_PA_BIT32;
> + if (paddr & BIT_ULL(33))
> + pte |= ARM_V7S_ATTR_MTK_PA_BIT33;
> + }
> + return pte;
>   }
>   
>   static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
> struct arm_v7s_io_pgtable *data)
>   {
> + struct io_pgtable_cfg *cfg = &data->iop.cfg;
>   arm_v7s_iopte mask;
> + phys_addr_t paddr;
>   
>   if (ARM_V7S_PTE_IS_TABLE(pte, lvl))
>   mask = ARM_V7S_TABLE_MASK;
> @@ -198,7 +211,15 @@ static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int 
> lvl,
>   else
>   mask = ARM_V7S_LVL_MASK(lvl);
>   
> - return pte & mask;
> + paddr = pte & mask;
> + if (IS_ENABLED(CONFIG_PHYS_ADDR_T_64BIT) &&

I don't think we need this IS_ENABLED() check - we don't have one in the 
equivalent paddr_to_iopte() case above, and what it's protecting is 
clearly a no-op anyway when paddr has a 32-bit type (a GCC 7 build for 
non-LPAE arch/arm doesn't complain with the check removed).

Other than that, though, the rest looks good now;

Reviewed-by: Robin Murphy 

> + cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB) {
> + if (pte & ARM_V7S_ATTR_MTK_PA_BIT32)
> + paddr |= BIT_ULL(32);
> + if (pte & ARM_V7S_ATTR_MTK_PA_BIT33)
> + paddr |= BIT_ULL(33);
> + }
> + return paddr;
>   }
>   
>   static arm_v7s_iopte *iopte_deref(arm_v7s_iopte pte, int lvl,
> @@ -315,9 +336,6 @@ static arm_v7s_iopte arm_v7s_prot_to_pte(int prot, int 
> lvl,
>   if (lvl == 1 && (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS))
>   pte |= ARM_V7S_ATTR_NS_SECTION;
>   
> - if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB)
> - pte |= ARM_V7S_ATTR_MTK_4GB;
> -
>   return pte;
>   }
>   
> @@ -504,7 +522,9 @@ static int arm_v7s_map(struct io_pgtable_ops *ops, 
> unsigned long iova,
>   if (!(prot & (IOMMU_READ | IOMMU_WRITE)))
>   return 0;
>   
> - if (WARN_ON(upper_32_bits(iova) || upper_32_bits(paddr)))
> + if (WARN_ON(upper_32_bits(iova)) ||
> + WARN_ON(upper_32_bits(paddr) &&
> + !(iop->cfg.quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB)))
>   return -ERANGE;
>   
>   ret = __arm_v7s_map(data, iova, paddr, size, prot, 1, data->pgd);
> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> index 2df7909..ee7beb5 100644
> --- a/drivers/iommu/io-pgtable.h
> +++ b/drivers/iommu/io-pgtable.h
> @@ -62

Re: [PATCH v2 04/14] iommu/io-pgtable-arm-v7s: Add paddr_to_iopte and iopte_to_paddr helpers

2018-11-01 Thread Robin Murphy
[ apologies if anyone gets this twice, looks like our email's eaten 
itself again... ]

On 24/09/2018 09:58, Yong Wu wrote:
> Add two helper functions: paddr_to_iopte and iopte_to_paddr.

Nice and tidy! (although it will need rebasing now, but I think that's 
just context conflicts)

> Signed-off-by: Yong Wu 
> ---
>   drivers/iommu/io-pgtable-arm-v7s.c | 45 
> --
>   1 file changed, 33 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/iommu/io-pgtable-arm-v7s.c 
> b/drivers/iommu/io-pgtable-arm-v7s.c
> index b5948ba..fa3b9ec 100644
> --- a/drivers/iommu/io-pgtable-arm-v7s.c
> +++ b/drivers/iommu/io-pgtable-arm-v7s.c
> @@ -173,18 +173,38 @@ struct arm_v7s_io_pgtable {
>   spinlock_t  split_lock;
>   };
>   
> +static bool arm_v7s_pte_is_cont(arm_v7s_iopte pte, int lvl);
> +
>   static dma_addr_t __arm_v7s_dma_addr(void *pages)
>   {
>   return (dma_addr_t)virt_to_phys(pages);
>   }
>   
> -static arm_v7s_iopte *iopte_deref(arm_v7s_iopte pte, int lvl)
> +static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,
> + struct arm_v7s_io_pgtable *data)

Nit: it seems to work out a little cleaner if we pass the io_pgtable_cfg 
to these helpers directly, rather than re-deriving it from data each time.

Otherwise, though,

Reviewed-by: Robin Murphy 

>   {
> + return paddr & ARM_V7S_LVL_MASK(lvl);
> +}
> +
> +static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
> +   struct arm_v7s_io_pgtable *data)
> +{
> + arm_v7s_iopte mask;
> +
>   if (ARM_V7S_PTE_IS_TABLE(pte, lvl))
> - pte &= ARM_V7S_TABLE_MASK;
> + mask = ARM_V7S_TABLE_MASK;
> + else if (arm_v7s_pte_is_cont(pte, lvl))
> + mask = ARM_V7S_LVL_MASK(lvl) * ARM_V7S_CONT_PAGES;
>   else
> - pte &= ARM_V7S_LVL_MASK(lvl);
> - return phys_to_virt(pte);
> + mask = ARM_V7S_LVL_MASK(lvl);
> +
> + return pte & mask;
> +}
> +
> +static arm_v7s_iopte *iopte_deref(arm_v7s_iopte pte, int lvl,
> +   struct arm_v7s_io_pgtable *data)
> +{
> + return phys_to_virt(iopte_to_paddr(pte, lvl, data));
>   }
>   
>   static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
> @@ -396,7 +416,7 @@ static int arm_v7s_init_pte(struct arm_v7s_io_pgtable 
> *data,
>   if (num_entries > 1)
>   pte = arm_v7s_pte_to_cont(pte, lvl);
>   
> - pte |= paddr & ARM_V7S_LVL_MASK(lvl);
> + pte |= paddr_to_iopte(paddr, lvl, data);
>   
>   __arm_v7s_set_pte(ptep, pte, num_entries, cfg);
>   return 0;
> @@ -462,7 +482,7 @@ static int __arm_v7s_map(struct arm_v7s_io_pgtable *data, 
> unsigned long iova,
>   }
>   
>   if (ARM_V7S_PTE_IS_TABLE(pte, lvl)) {
> - cptep = iopte_deref(pte, lvl);
> + cptep = iopte_deref(pte, lvl, data);
>   } else if (pte) {
>   /* We require an unmap first */
>   WARN_ON(!selftest_running);
> @@ -512,7 +532,8 @@ static void arm_v7s_free_pgtable(struct io_pgtable *iop)
>   arm_v7s_iopte pte = data->pgd[i];
>   
>   if (ARM_V7S_PTE_IS_TABLE(pte, 1))
> - __arm_v7s_free_table(iopte_deref(pte, 1), 2, data);
> + __arm_v7s_free_table(iopte_deref(pte, 1, data),
> +  2, data);
>   }
>   __arm_v7s_free_table(data->pgd, 1, data);
>   kmem_cache_destroy(data->l2_tables);
> @@ -582,7 +603,7 @@ static size_t arm_v7s_split_blk_unmap(struct 
> arm_v7s_io_pgtable *data,
>   if (!ARM_V7S_PTE_IS_TABLE(pte, 1))
>   return 0;
>   
> - tablep = iopte_deref(pte, 1);
> + tablep = iopte_deref(pte, 1, data);
>   return __arm_v7s_unmap(data, iova, size, 2, tablep);
>   }
>   
> @@ -640,7 +661,7 @@ static size_t __arm_v7s_unmap(struct arm_v7s_io_pgtable 
> *data,
>   io_pgtable_tlb_add_flush(iop, iova, blk_size,
>   ARM_V7S_BLOCK_SIZE(lvl + 1), false);
>   io_pgtable_tlb_sync(iop);
> - ptep = iopte_deref(pte[i], lvl);
> + ptep = iopte_deref(pte[i], lvl, data);
>   __arm_v7s_free_table(ptep, lvl + 1, data);
>   } else {
>   io_pgtable_tlb_add_flush(iop, iova, blk_size,
> @@ -658,7 +679,7 @@ static size_t __arm_v7s_unmap(struct arm_v7s_io_pgtable 
> *data,
>   }
>   
>   /* Keep on walkin' */
> - ptep = iopte_deref(pte[0], lvl);
> + ptep = iopte_deref(pte[0], lvl, data);
>   return __arm_v7s_unmap(data, iova, size, lvl + 1, ptep);
>   }
>   
> @@ -684,7 +705,7 @@ static phys_addr_t arm_v7s_iova_to_phys(struct 
> io_pgtable_ops *ops,
>   do {
>   ptep += ARM_V7S_LVL_IDX(iova, ++lvl);
>   pte = READ_ONCE(*ptep);
> -  

Re: [PATCH] MIPS: Fix `dma_alloc_coherent' returning a non-coherent allocation

2018-11-01 Thread Maciej W. Rozycki
On Thu, 1 Nov 2018, Christoph Hellwig wrote:

> Oops, yes this looks good:

 BTW, for anyone missing hardware suitable for serious DMA testing I can 
recommend getting a pair of DEFPA cards, the ubiquitous PCI version of 
this board, cheaply available, which is the same except for a different 
host bus bridge ASIC, developed later.  They can be wired back to back 
similarly to Ethernet adapters (or in a loop if you have 2 or more dual 
attachment versions); no other hardware is required save for patch cords.
Version 3 (DEFPA-xC) boards support universal PCI signalling; older ones 
are 5V-only.

 These devices are a stellar example of fine engineering[1].  Our `defxx' 
driver, which I believe has been adapted from the DEC OSF/1 one referred 
to in said document, has some latency and other issues that I plan to 
address sometime, once I have sorted out higher-priority issues; the 
hardware itself, however, is excellent.

References:

[1] Chran-Ham Chang et al., "High-performance TCP/IP and UDP/IP Networking 
in DEC OSF/1 for Alpha AXP", Digital Technical Journal, vol. 5, no. 1 
(Winter 1993), "Network Adapter Characteristics", p. 7



  Maciej


Re: [PATCH RFC] dma-direct: do not allocate a single page from CMA area

2018-11-01 Thread Robin Murphy

On 31/10/2018 20:03, Nicolin Chen wrote:

The addresses within a single page are always contiguous, so there is
little need to allocate a single page from the CMA area. Since the
CMA area has a limited, predefined size, it might run out of space
under heavy use, where quite a lot of CMA pages may end up being
allocated for single pages.

This patch skips CMA allocation for single pages and lets them go
through normal page allocation instead. This saves CMA space for
further CMA allocations.


In general, this seems to make sense to me. It does represent a 
theoretical change in behaviour for devices which have their own CMA 
area somewhere other than kernel memory, and only ever make non-atomic 
allocations, but I'm not sure whether that's a realistic or common 
enough case to really worry about.


Robin.


Signed-off-by: Nicolin Chen 
---
  kernel/dma/direct.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 22a12ab5a5e9..14c5d49eded2 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -120,8 +120,12 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
gfp |= __dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
&phys_mask);
  again:
-   /* CMA can be used only in the context which permits sleeping */
-   if (gfpflags_allow_blocking(gfp)) {
+   /*
+* CMA can be used only in the context which permits sleeping.
+* Since addresses within one PAGE are always contiguous, skip
+* CMA allocation for a single page to save CMA reserved space
+*/
+   if (gfpflags_allow_blocking(gfp) && count > 1) {
page = dma_alloc_from_contiguous(dev, count, page_order,
 gfp & __GFP_NOWARN);
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {




Re: [PATCH] MIPS: Fix `dma_alloc_coherent' returning a non-coherent allocation

2018-11-01 Thread Christoph Hellwig
On Thu, Nov 01, 2018 at 07:54:24AM +, Maciej W. Rozycki wrote:
> Fix a MIPS `dma_alloc_coherent' regression from commit bc3ec75de545 
> ("dma-mapping: merge direct and noncoherent ops") that causes a cached 
> allocation to be returned on noncoherent cache systems.
> 
> This is due to an inverted check now used in the MIPS implementation of 
> `arch_dma_alloc' on the result from `dma_direct_alloc_pages' before 
> doing the cached-to-uncached mapping of the allocation address obtained.  
> The mapping has to be done for a non-NULL rather than NULL result, 
> because a NULL result means the allocation has failed.
> 
> Invert the check for correct operation then.
> 
> Signed-off-by: Maciej W. Rozycki 
> Fixes: bc3ec75de545 ("dma-mapping: merge direct and noncoherent ops")
> Cc: sta...@vger.kernel.org # 4.19+

Oops, yes this looks good:

Reviewed-by: Christoph Hellwig 


[PATCH] MIPS: Fix `dma_alloc_coherent' returning a non-coherent allocation

2018-11-01 Thread Maciej W. Rozycki
Fix a MIPS `dma_alloc_coherent' regression from commit bc3ec75de545 
("dma-mapping: merge direct and noncoherent ops") that causes a cached 
allocation to be returned on noncoherent cache systems.

This is due to an inverted check now used in the MIPS implementation of 
`arch_dma_alloc' on the result from `dma_direct_alloc_pages' before 
doing the cached-to-uncached mapping of the allocation address obtained.  
The mapping has to be done for a non-NULL rather than NULL result, 
because a NULL result means the allocation has failed.

Invert the check for correct operation then.

Signed-off-by: Maciej W. Rozycki 
Fixes: bc3ec75de545 ("dma-mapping: merge direct and noncoherent ops")
Cc: sta...@vger.kernel.org # 4.19+
---
On Thu, 1 Nov 2018, Christoph Hellwig wrote:

> Fails to compile for me with:
> 
> cc1: error: ‘-march=r3000’ requires ‘-mfp32’
> 
> using the x86_64 corss gcc 8 from Debian testing.

 Hmm, that seems related to the FPXX ABI, which the R3000 does not support 
and which they may have configured as the default for GCC (`-mfpxx').  
That's not relevant to the kernel, so we probably just ought to force 
`-mfp32' and `-mfp64' for 32-bit and 64-bit builds respectively these 
days.

> Either way, the config looks like we have all the required bits for
> non-coherent DMA support.  The only guess is that somehow
> dma_cache_wback_inv isn't hooked up to the right functions for some
> reason, but I can't really think of why.

 Well, `dma_cache_wback_inv' isn't actually called, and with the 64-bit 
configuration I switched to, the address returned is in the XKPHYS cached 
noncoherent space, as proved with a simple diagnostic patch applied to 
`dma_direct_alloc' showing the results of `dev_is_dma_coherent' and the 
actual allocation:

tc1: DEFTA at MMIO addr = 0x1e90, IRQ = 20, Hardware addr = 
08-00-2b-a3-a3-29
defxx tc1: dma_direct_alloc: coherent: 0
defxx tc1: dma_direct_alloc: returned: 980003db8000
tc1: registered as fddi0

(the value of 3 in bits 61:59 of the virtual address denotes the cached 
noncoherent attribute).

 The cause is that commit bc3ec75de545 ("dma-mapping: merge direct and 
noncoherent ops") reversed the interpretation of the `dma_direct_alloc*' 
result in `arch_dma_alloc'.  I guess this change was unlucky in not having 
this part of the API actually verified at run-time by anyone anywhere.

 Fixed thus, with debug output now as expected:

defxx tc1: dma_direct_alloc: coherent: 0
defxx tc1: dma_direct_alloc: returned: 93db8000

showing the address returned in the XKPHYS uncached space (the value of 2 
in bits 61:59 of the virtual address denotes the uncached attribute), and 
the network interface working properly.

 Please apply, and backport as required.

  Maciej
---
 arch/mips/mm/dma-noncoherent.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

linux-mips-arch-dma-alloc-err.patch
Index: linux-20181028-4maxp64-defconfig/arch/mips/mm/dma-noncoherent.c
===
--- linux-20181028-4maxp64-defconfig.orig/arch/mips/mm/dma-noncoherent.c
+++ linux-20181028-4maxp64-defconfig/arch/mips/mm/dma-noncoherent.c
@@ -50,7 +50,7 @@ void *arch_dma_alloc(struct device *dev,
void *ret;
 
ret = dma_direct_alloc_pages(dev, size, dma_handle, gfp, attrs);
-   if (!ret && !(attrs & DMA_ATTR_NON_CONSISTENT)) {
+   if (ret && !(attrs & DMA_ATTR_NON_CONSISTENT)) {
dma_cache_wback_inv((unsigned long) ret, size);
ret = (void *)UNCAC_ADDR(ret);
}