Re: [PATCH 09/27] mm: generalize the pgmap based page_free infrastructure

2022-02-14 Thread Logan Gunthorpe



On 2022-02-10 12:28 a.m., Christoph Hellwig wrote:
> Key off on the existence of ->page_free to prepare for adding support for
> more pgmap types that are device managed and thus need the free callback.
> 
> Signed-off-by: Christoph Hellwig 

Great! This makes my patch simpler.

Reviewed-by: Logan Gunthorpe 


> ---
>  mm/memremap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index fef5734d5e4933..e00ffcdba7b632 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -452,7 +452,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  void free_zone_device_page(struct page *page)
>  {
> - if (WARN_ON_ONCE(!is_device_private_page(page)))
> + if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
>   return;
>  
>   __ClearPageWaiters(page);
> @@ -460,7 +460,7 @@ void free_zone_device_page(struct page *page)
>   mem_cgroup_uncharge(page_folio(page));
>  
>   /*
> -  * When a device_private page is freed, the page->mapping field
> +  * When a device managed page is freed, the page->mapping field
>* may still contain a (stale) mapping value. For example, the
>* lower bits of page->mapping may still identify the page as an
>* anonymous page. Ultimately, this entire field is just stale
> 


Re: start sorting out the ZONE_DEVICE refcount mess

2022-02-08 Thread Logan Gunthorpe



On 2022-02-06 11:32 p.m., Christoph Hellwig wrote:
> Hi all,
> 
> this series removes the offset by one refcount for ZONE_DEVICE pages
> that are freed back to the driver owning them, which is just device
> private ones for now, but also the planned device coherent pages
> and the enhanced p2p ones pending.
> 
> It does not address the fsdax pages yet, which will be attacked in a
> follow on series.
> 
> Diffstat:
>  arch/arm64/mm/mmu.c  |1 
>  arch/powerpc/kvm/book3s_hv_uvmem.c   |1 
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |2 
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1 
>  drivers/gpu/drm/drm_cache.c  |2 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   |3 -
>  drivers/gpu/drm/nouveau/nouveau_svm.c|1 
>  drivers/infiniband/core/rw.c |1 
>  drivers/nvdimm/pmem.h|1 
>  drivers/nvme/host/pci.c  |1 
>  drivers/nvme/target/io-cmd-bdev.c|1 
>  fs/Kconfig   |2 
>  fs/fuse/virtio_fs.c  |1 
>  include/linux/hmm.h  |9 
>  include/linux/memremap.h |   22 +-
>  include/linux/mm.h   |   59 -
>  lib/test_hmm.c   |4 +
>  mm/Kconfig   |4 -
>  mm/internal.h|2 
>  mm/memcontrol.c  |   11 +
>  mm/memremap.c|   63 ---
>  mm/migrate.c |6 --
>  mm/swap.c|   49 ++--
>  23 files changed, 90 insertions(+), 157 deletions(-)

Looks good to me. I was wondering about the location of some of this
code, so it's nice to see it cleaned up. Except for the one minor issue
I noted on patch 6, I have no concerns. I've reviewed all the
patches and tested the series under my p2pdma series.

Reviewed-by: Logan Gunthorpe 

Logan


Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>

2022-02-08 Thread Logan Gunthorpe



On 2022-02-06 11:32 p.m., Christoph Hellwig wrote:
> Move the check for the actual pgmap types that need the free at refcount
> one behavior into the out of line helper, and thus avoid the need to
> pull memremap.h into mm.h.
> 
> Signed-off-by: Christoph Hellwig 

I've noticed mm/memcontrol.c uses is_device_private_page() and also
needs a memremap.h include added to compile with my configuration.
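
Something along these lines is what I needed (a sketch; the context
line is approximate):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@
 #include <linux/mm.h>
+#include <linux/memremap.h>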

Logan


Re: Phyr Starter

2022-01-11 Thread Logan Gunthorpe



On 2022-01-11 4:02 p.m., Jason Gunthorpe wrote:
> On Tue, Jan 11, 2022 at 03:57:07PM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2022-01-11 3:53 p.m., Jason Gunthorpe wrote:
>>> I just want to share the whole API that will have to exist to
>>> reasonably support this flexible array of intervals data structure..
>>
>> Is that really worth it? I feel like type safety justifies replicating a
>> bit of iteration and allocation infrastructure. Then there's no silly
>> mistakes of thinking one array is one thing when it is not.
> 
> If it is a 'a bit' then sure, but I suspect doing a good job here will
> be a lot of code here.
> 
> Look at how big scatterlist is, for instance.

Yeah, but scatterlist has a ton of cruft; numerous ways to allocate,
multiple iterators, developers using it in different ways, etc, etc.
It's a big mess. bvec.h is much smaller (though includes stuff that
wouldn't necessarily be appropriate here).

Also, some things apply to one but not the other; e.g., a memcpy to/from
function might make sense for a phy_range but makes no sense for a
dma_range.

> Maybe we could have a generic 64 bit interval array and then two type
> wrappers that do dma and physaddr casting? IDK.
> 
> Not sure type safety of DMA vs CPU address is critical?

I would argue it is. A DMA address is not a CPU address and should not
be treated the same.

Logan



Re: Phyr Starter

2022-01-11 Thread Logan Gunthorpe



On 2022-01-11 3:57 p.m., Jason Gunthorpe wrote:
> On Tue, Jan 11, 2022 at 03:09:13PM -0700, Logan Gunthorpe wrote:
> 
>> Either that, or we need a wrapper that allocates an appropriately
>> sized SGL to pass to any dma_map implementation that doesn't support
>> the new structures.
> 
> This is what I think we should do. If we start with RDMA then we can
> motivate the 4 main server IOMMU drivers to get updated ASAP, then it
> can acceptably start to spread to other users.

I suspect the preferred path forward is for the IOMMU drivers that don't
use dma-iommu to be converted to use it. Then anything we do to
dma-iommu will be applicable to the IOMMU drivers. Better than expecting
them to implement a bunch of new functionality themselves.

Logan



Re: Phyr Starter

2022-01-11 Thread Logan Gunthorpe



On 2022-01-11 3:53 p.m., Jason Gunthorpe wrote:
> I just want to share the whole API that will have to exist to
> reasonably support this flexible array of intervals data structure..

Is that really worth it? I feel like type safety justifies replicating a
bit of iteration and allocation infrastructure. Then there's no silly
mistakes of thinking one array is one thing when it is not.

Logan



Re: Phyr Starter

2022-01-11 Thread Logan Gunthorpe



On 2022-01-11 2:25 p.m., Matthew Wilcox wrote:
> That's reproducing the bad decision of the scatterlist, only with
> a different encoding.  You end up with something like:
> 
> struct neoscat {
>   dma_addr_t dma_addr;
>   phys_addr_t phys_addr;
>   size_t dma_len;
>   size_t phys_len;
> };
> 
> and the dma_addr and dma_len are unused by all-but-the-first entry when
> you have a competent IOMMU.  We want a different data structure in and
> out, and we may as well keep using the scatterlist for the dma-map-out.

With my P2PDMA patchset, even with a competent IOMMU, we need to support
multiple dma_addr/dma_len pairs (plus the flag bit). This is required to
program IOVAs and multiple bus addresses into a single DMA transaction.

I think using the scatterlist for the DMA-out side is not ideal, given
that we don't need the page pointers or multiple length fields, and we
won't be able to change the sgl substantially given the massive number
of existing use cases that won't go away overnight.

My hope would be along these lines:

struct phy_range {
phys_addr_t phyr_addr;
u32 phyr_len;
u32 phyr_flags;
};

struct dma_range {
dma_addr_t dmar_addr;
u32 dmar_len;
u32 dmar_flags;
};

A new GUP helper would somehow return a list of phy_range structs and
the new dma_map function would take that list and return a list of
dma_range structs. Each element in the list could represent a segment up
to 4GB, so any range longer than that would need multiple items in the
list. (Alternatively, perhaps the length could be a 64bit value and we
steal some of the top bits for flags or some such). The flags would not
only be needed by some of the use cases mentioned (FOLL_PIN or
DMA_BUS_ADDRESS) but could also support chaining these lists like SGLs
so longer lists would not require large contiguous allocations.
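
To make that concrete, the entry points I'm imagining would look roughly
like the following (purely a sketch; all of these names are hypothetical):

/* Pin user memory and fill a phy_range list. Returns the number of
 * entries filled or a negative errno. */
int get_user_phy_ranges(unsigned long start, unsigned long nr_pages,
			unsigned int gup_flags, struct phy_range *phyrs,
			int max_phyrs);

/* Map a phy_range list for DMA, producing dma_range entries (possibly
 * fewer than nr_phyrs if the IOMMU merges segments). */
int dma_map_phy_ranges(struct device *dev, const struct phy_range *phyrs,
		       int nr_phyrs, struct dma_range *dmars, int max_dmars,
		       enum dma_data_direction dir);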

If there's an [phy|dma]_range_list struct (or some such) which contains
these range structs (per some details of Jason's suggestions) that'd be
fine by me too and would just depend on implementation details.

However, the main problem I see is a chicken and egg problem. The new
dma_map function would need to be implemented by every dma_map provider
or any driver trying to use it would need a messy fallback. Either that,
or we need a wrapper that allocates an appropriately sized SGL to pass
to any dma_map implementation that doesn't support the new structures.

Logan



Re: Phyr Starter

2022-01-11 Thread Logan Gunthorpe



On 2022-01-11 1:17 a.m., John Hubbard wrote:
> On 1/10/22 11:34, Matthew Wilcox wrote:
>> TLDR: I want to introduce a new data type:
>>
>> struct phyr {
>>  phys_addr_t addr;
>>  size_t len;
>> };
>>
>> and use it to replace bio_vec as well as using it to replace the array
>> of struct pages used by get_user_pages() and friends.
>>
>> ---
> 
> This would certainly solve quite a few problems at once. Very compelling.

I agree.

> Zooming in on the pinning aspect for a moment: last time I attempted to
> convert O_DIRECT callers from gup to pup, I recall wanting very much to
> record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
> or some non-FOLL_PIN method. Because at the end of the IO, it is not
> easy to disentangle which pages require put_page() and which require
> unpin_user_page*().
> 
> And changing the bio_vec for *that* purpose was not really acceptable.
> 
> But now that you're looking to change it in a big way (and with some
> spare bits available...oohh!), maybe I can go that direction after all.
> 
> Or, are you looking at a design in which any phyr is implicitly FOLL_PIN'd
> if it exists at all?

I'd also second being able to store a handful of flags in each phyr. My
userspace P2PDMA patchset needs to add a flag to each sgl to indicate
whether it was mapped as a bus address or not (which would be necessary
for the DMA mapped side dma_map_phyr).

Though, it's not immediately obvious where to put the flags without
increasing the size of the structure :(
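
One hypothetical option, if we could require the length to keep a couple
of zero low bits, would be to steal those bits -- purely a sketch:

/* Assumes all lengths are a multiple of 4 so the two low bits of
 * len are free to hold flags. */
#define PHYR_FLAG_PINNED	0x1UL	/* acquired via FOLL_PIN */
#define PHYR_FLAG_BUS_ADDR	0x2UL	/* P2PDMA bus address */
#define PHYR_FLAG_MASK		0x3UL

struct phyr {
	phys_addr_t addr;
	size_t len;			/* low bits hold PHYR_FLAG_* */
};

static inline size_t phyr_len(const struct phyr *p)
{
	return p->len & ~PHYR_FLAG_MASK;
}

static inline unsigned long phyr_flags(const struct phyr *p)
{
	return p->len & PHYR_FLAG_MASK;
}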

Logan



Re: [patch 17/30] NTB/msi: Use irq_has_action()

2020-12-11 Thread Logan Gunthorpe



On 2020-12-10 12:25 p.m., Thomas Gleixner wrote:
> Use the proper core function.
> 
> Signed-off-by: Thomas Gleixner 
> Cc: Jon Mason 
> Cc: Dave Jiang 
> Cc: Allen Hubbe 
> Cc: linux-...@googlegroups.com

Looks good to me.

Reviewed-by: Logan Gunthorpe 

> ---
>  drivers/ntb/msi.c |4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> --- a/drivers/ntb/msi.c
> +++ b/drivers/ntb/msi.c
> @@ -282,15 +282,13 @@ int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
>  struct ntb_msi_desc *msi_desc)
>  {
>   struct msi_desc *entry;
> - struct irq_desc *desc;
>   int ret;
>  
>   if (!ntb->msi)
>   return -EINVAL;
>  
>   for_each_pci_msi_entry(entry, ntb->pdev) {
> - desc = irq_to_desc(entry->irq);
> - if (desc->action)
> + if (irq_has_action(entry->irq))
>   continue;
>  
>   ret = devm_request_threaded_irq(&ntb->dev, entry->irq, handler,
> 


Re: [Intel-gfx] [PATCH 0/8] Convert the intel iommu driver to the dma-iommu api

2020-09-09 Thread Logan Gunthorpe


On 2020-09-08 9:28 a.m., Tvrtko Ursulin wrote:
>>
>> diff --git a/drivers/gpu/drm/i915/i915_scatterlist.h
>> b/drivers/gpu/drm/i915/i915_scatterlist.h
>> index b7b59328cb76..9367ac801f0c 100644
>> --- a/drivers/gpu/drm/i915/i915_scatterlist.h
>> +++ b/drivers/gpu/drm/i915/i915_scatterlist.h
>> @@ -27,13 +27,19 @@ static __always_inline struct sgt_iter {
>>   } __sgt_iter(struct scatterlist *sgl, bool dma) {
>>  struct sgt_iter s = { .sgp = sgl };
>>
>> +   if (sgl && !sg_dma_len(s.sgp))
> 
> I'd extend the condition to be, just to be safe:
> if (dma && sgl && !sg_dma_len(s.sgp))
>

Right, good catch, that's definitely necessary.

>> +   s.sgp = NULL;
>> +
>>  if (s.sgp) {
>>  s.max = s.curr = s.sgp->offset;
>> -   s.max += s.sgp->length;
>> -   if (dma)
>> +
>> +   if (dma) {
>> +   s.max += sg_dma_len(s.sgp);
>>  s.dma = sg_dma_address(s.sgp);
>> -   else
>> +   } else {
>> +   s.max += s.sgp->length;
>>  s.pfn = page_to_pfn(sg_page(s.sgp));
>> +   }
> 
> Otherwise has this been tested or alternatively how to test it? (How to
> repro the issue.)

It has not been tested. To test it, you need Tom's patch set without the
last "DO NOT MERGE" patch:

https://lkml.kernel.org/lkml/20200907070035.ga25...@infradead.org/T/

Thanks,

Logan


Re: [PATCH 0/8] Convert the intel iommu driver to the dma-iommu api

2020-08-28 Thread Logan Gunthorpe



On 2020-08-23 6:04 p.m., Tom Murphy wrote:
> I have added a check for the sg_dma_len == 0 :
> """
>  } __sgt_iter(struct scatterlist *sgl, bool dma) {
> struct sgt_iter s = { .sgp = sgl };
> 
> +   if (sgl && sg_dma_len(sgl) == 0)
> +   s.sgp = NULL;
> 
> if (s.sgp) {
> .
> """
> at location [1].
> but it doesn't fix the problem.

Based on my read of the code, it looks like we also need to change usage
of sgl->length... Something like the rough patch below, maybe?

Also, Tom, do you have an updated version of the patchset to convert the
Intel IOMMU to dma-iommu available? The last one I've found doesn't
apply cleanly (I'm assuming parts of it have been merged in slightly
modified forms).

Thanks,

Logan

--

diff --git a/drivers/gpu/drm/i915/i915_scatterlist.h
b/drivers/gpu/drm/i915/i915_scatterlist.h
index b7b59328cb76..9367ac801f0c 100644
--- a/drivers/gpu/drm/i915/i915_scatterlist.h
+++ b/drivers/gpu/drm/i915/i915_scatterlist.h
@@ -27,13 +27,19 @@ static __always_inline struct sgt_iter {
 } __sgt_iter(struct scatterlist *sgl, bool dma) {
struct sgt_iter s = { .sgp = sgl };

+   if (sgl && !sg_dma_len(s.sgp))
+   s.sgp = NULL;
+
if (s.sgp) {
s.max = s.curr = s.sgp->offset;
-   s.max += s.sgp->length;
-   if (dma)
+
+   if (dma) {
+   s.max += sg_dma_len(s.sgp);
s.dma = sg_dma_address(s.sgp);
-   else
+   } else {
+   s.max += s.sgp->length;
s.pfn = page_to_pfn(sg_page(s.sgp));
+   }
}

return s;


Re: [PATCH v2 02/15] PCI: Add shorthand define for message signalled interrupt types

2020-06-04 Thread Logan Gunthorpe



On 2020-06-03 5:45 a.m., Piotr Stankiewicz wrote:
> There are several places in the kernel which check/ask for MSI or MSI-X
> interrupts. It would make sense to have a shorthand constant, similar to
> PCI_IRQ_ALL_TYPES, to use in these situations. So add PCI_IRQ_MSI_TYPES,
> for this purpose.
> 
> Signed-off-by: Piotr Stankiewicz 
> Suggested-by: Andy Shevchenko 
> Reviewed-by: Andy Shevchenko 

Looks good to me,

Reviewed-by: Logan Gunthorpe 

Thanks,

Logan


Re: [PATCH 0/8] Convert the intel iommu driver to the dma-iommu api

2020-05-30 Thread Logan Gunthorpe



On 2020-05-29 3:11 p.m., Marek Szyprowski wrote:
> Patches are pending:
> https://lore.kernel.org/linux-iommu/20200513132114.6046-1-m.szyprow...@samsung.com/T/

Cool, nice! Though, I still don't think that fixes the issue in
i915_scatterlist.h given it still ignores sg_dma_len() and strictly
relies on sg_next()/sg_is_last() to stop iterating -- and I suspect this
is the bug that got in Tom's way.

>> However, as Robin pointed out, there are other ugly tricks like stopping
>> iterating through the SGL when sg_dma_len() is zero. For example, the
>> AMD driver appears to use drm_prime_sg_to_page_addr_arrays() which does
>> this trick and thus likely isn't buggy (otherwise, I'd expect someone to
>> have complained by now, seeing as AMD has already switched to IOMMU-DMA).
> 
> I'm not sure that this is a trick. Stopping at zero sg_dma_len() was
> documented somewhere.

Well whatever you want to call it, it is ugly to have some drivers doing
one thing with the returned value and others assuming there's an extra
zero at the end. It just causes confusion for people reading/copying the
code. It would be better if they are all consistent. However, I concede
stopping at zero should not be broken, presently.

Logan


Re: [PATCH 0/8] Convert the intel iommu driver to the dma-iommu api

2020-05-30 Thread Logan Gunthorpe


On 2020-05-29 6:45 a.m., Christoph Hellwig wrote:
> On Thu, May 28, 2020 at 06:00:44PM -0600, Logan Gunthorpe wrote:
>>> This issue is most likely in the i915 driver and is most likely caused by 
>>> the driver not respecting the return value of the dma_map_ops::map_sg 
>>> function. You can see the driver ignoring the return value here:
>>> https://github.com/torvalds/linux/blob/7e0165b2f1a912a06e381e91f0f4e495f4ac3736/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L51
>>>
>>> Previously this didn’t cause issues because the intel map_sg always 
>>> returned the same number of elements as the input scatter gather list but 
>>> with the change to this dma-iommu api this is no longer the case. I wasn’t 
>>> able to track the bug down to a specific line of code unfortunately.  
> 
> Mark did a big audit into the map_sg API abuse and initially had
> some i915 patches, but then gave up on them with this comment:
> 
> "The biggest TODO is DRM/i915 driver and I don't feel brave enough to fix
>  it fully. The driver creatively uses sg_table->orig_nents to store the
>  size of the allocate scatterlist and ignores the number of the entries
>  returned by dma_map_sg function. In this patchset I only fixed the
>  sg_table objects exported by dmabuf related functions. I hope that I
>  didn't break anything there."
> 
> it would be really nice if the i915 maintainers could help with sorting
> that API abuse out.
> 

I agree completely that the API abuse should be sorted out, but I think
that's much larger than just the i915 driver. Pretty much every dma-buf
map_dma_buf implementation I looked at ignores the nents returned by
dma_map_sg_attrs(). This sucks, but I don't think it's the bug Tom ran into. See:

amdgpu_dma_buf_map
armada_gem_prime_map_dma_buf
drm_gem_map_dma_buf
i915_gem_map_dma_buf
tegra_gem_prime_map_dma_buf

So this should probably be addressed by the whole GPU community.
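
For reference, a correct caller would look something like this (a sketch
with a made-up function name):

/* Respect the segment count returned by dma_map_sg() instead of
 * continuing to use orig_nents for the mapped side. */
static int example_map_sgt(struct device *dev, struct sg_table *sgt,
			   enum dma_data_direction dir)
{
	int nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, dir);

	if (!nents)
		return -ENOMEM;

	sgt->nents = nents;	/* iterate only this many entries for DMA */
	return 0;
}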

However, as Robin pointed out, there are other ugly tricks like stopping
iterating through the SGL when sg_dma_len() is zero. For example, the
AMD driver appears to use drm_prime_sg_to_page_addr_arrays() which does
this trick and thus likely isn't buggy (otherwise, I'd expect someone to
have complained by now, seeing as AMD has already switched to IOMMU-DMA).

As I tried to point out in my previous email, i915 does not do this
trick. In fact, it completely ignores sg_dma_len() and is thus
completely broken. See i915_scatterlist.h and the __sgt_iter() function.
So it doesn't sound to me like Mark's fix would address the issue at
all. Per my previous email, I'd guess that it can be fixed simply by
adjusting the __sgt_iter() function to do something more sensible.
(Better yet, if possible, ditch __sgt_iter() and use the common DRM
function that AMD uses).

This would at least allow us to make progress with Tom's IOMMU-DMA patch
set and once that gets in, it will be harder for other drivers to make
the same mistake.

Logan




Re: [PATCH 0/8] Convert the intel iommu driver to the dma-iommu api

2020-05-29 Thread Logan Gunthorpe
Hi Tom,

On 2019-12-21 8:03 a.m., Tom Murphy wrote:
> This patchset converts the intel iommu driver to the dma-iommu api.

Just wanted to note that I've rebased your series on recent kernels and
have done some testing on my old Sandybridge machine (without the DO NOT
MERGE patch) and have found no issues. I hope this can make progress
soon and get merged soon. If you like you can add:

Tested-By: Logan Gunthorpe 

> While converting the driver I exposed a bug in the intel i915 driver which 
> causes a huge amount of artifacts on the screen of my laptop. You can see a 
> picture of it here:
> https://github.com/pippy360/kernelPatches/blob/master/IMG_20191219_225922.jpg
> 
> This issue is most likely in the i915 driver and is most likely caused by the 
> driver not respecting the return value of the dma_map_ops::map_sg function. 
> You can see the driver ignoring the return value here:
> https://github.com/torvalds/linux/blob/7e0165b2f1a912a06e381e91f0f4e495f4ac3736/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L51
> 
> Previously this didn’t cause issues because the intel map_sg always returned 
> the same number of elements as the input scatter gather list but with the 
> change to this dma-iommu api this is no longer the case. I wasn’t able to 
> track the bug down to a specific line of code unfortunately.  

I did some digging into this myself and while I don't have a full patch, I
think I traced it closer to the problem.

Sadly, ignoring the number of nents returned by map_sg() is endemic to
dma-buf users, but AMD's GPU driver seems to do the same thing,
presumably without issues.

Digging a bit further, I found that the i915 has an "innovative" way of
iterating through SGLs, see [1]. I suspect if __sgt_iter is changed to
increment with sg_dma_len() and return NULL when there is no length
left, it may fix the issue.

But, sorry, I don't really have the means or time to fix and test this
myself.

Thanks,

Logan

[1]
https://elixir.bootlin.com/linux/v5.7-rc7/source/drivers/gpu/drm/i915/i915_scatterlist.h#L76


Re: [PATCH 1/6] lib/scatterlist: add sg_set_dma_addr() function

2020-03-13 Thread Logan Gunthorpe


On 2020-03-12 8:19 a.m., Jason Gunthorpe wrote:
> On Thu, Mar 12, 2020 at 03:47:29AM -0700, Christoph Hellwig wrote:
>> On Thu, Mar 12, 2020 at 11:31:35AM +0100, Christian König wrote:
>>> But how should we then deal with all the existing interfaces which already
>>> take a scatterlist/sg_table ?
>>>
>>> The whole DMA-buf design and a lot of drivers are build around
>>> scatterlist/sg_table and to me that actually makes quite a lot of sense.
>>>
>>
>> Replace them with a saner interface that doesn't take a scatterlist.
>> At very least for new functionality like peer to peer DMA, but
>> especially this code would also benefit from a general move away
>> from the scatterlist.
> 
> If dma buf can do P2P I'd like to see support for consuming a dmabuf
> in RDMA. Looking at how.. there is an existing sgl based path starting
> from get_user_pages through dma map to the drivers. (ib_umem)
> 
> I can replace the driver part with something else (dma_sg), but not
> until we get a way to DMA map pages directly into that something
> else..
> 
> The non-page scatterlist is also a big concern for RDMA as we have
> drivers that want the page list, so even if we did as this series
> contemplates I'd have still have to split the drivers and create the
> notion of a dma-only SGL.
> 
>>> I mean we could come up with a new structure for this, but to me that just
>>> looks like reinventing the wheel. Especially since drivers need to be able
>>> to handle both I/O to system memory and I/O to PCIe BARs.
>>
>> The structure for holding the struct page side of the scatterlist is
>> called struct bio_vec, so far mostly used by the block and networking
>> code.
> 
> I haven't used bio_vecs before, do they support chaining like SGL so
> they can be very big? RDMA dma maps gigabytes of memory

bio_vecs themselves don't support chaining... In the block layer they
are used in a struct bio, which handles chaining, splitting and other
features. Each bio, though, has a limit of 256 segments to avoid
higher-order allocations. Depending on your use case, you could reuse
bios or write your own container to chain bio_vecs.

>> The structure for holding dma addresses doesn't really exist
>> in a generic form, but would be an array of these structures:
>>
>> struct dma_sg {
>>  dma_addr_t  addr;
>>  u32 len;
>> };


> Yes, we easily have ranges of >1GB. So I would certainly say u64 for the len 
> here.

I'd probably avoid the u64 here and leave space for some flags or
something. If you have >1GB to map you can always just have multiple
segments. With 4GB per segment and 256 segments per page, a page of DMA
sgs can easily map 1TB of memory in a single call and with chaining or
larger allocations you can extend that further, if needed.
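
Spelling out the arithmetic (a sketch; the flags field is my addition):

struct dma_sg {
	dma_addr_t addr;	/*  8 bytes */
	u32 len;		/*  4 bytes: segments up to 4GB */
	u32 flags;		/*  4 bytes: bus address, chaining, etc. */
};				/* 16 bytes total */

/* 4096 / 16 = 256 entries per page; 256 * 4GB = 1TB mapped per page */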

Logan



Re: [PATCH 17/25] PCI/P2PDMA: use the dev_pagemap internal refcount

2019-06-27 Thread Logan Gunthorpe


On 2019-06-26 6:27 a.m., Christoph Hellwig wrote:
> The functionality is identical to the one currently open coded in
> p2pdma.c.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Logan Gunthorpe 

Also, for the P2PDMA changes in this series:

Tested-by: Logan Gunthorpe 

I've ran this series through my simple P2PDMA tests.

Logan

> ---
>  drivers/pci/p2pdma.c | 57 
>  1 file changed, 4 insertions(+), 53 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ebd8ce3bba2e..608f84df604a 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -24,12 +24,6 @@ struct pci_p2pdma {
>   bool p2pmem_published;
>  };
>  
> -struct p2pdma_pagemap {
> - struct dev_pagemap pgmap;
> - struct percpu_ref ref;
> - struct completion ref_done;
> -};
> -
>  static ssize_t size_show(struct device *dev, struct device_attribute *attr,
>char *buf)
>  {
> @@ -78,32 +72,6 @@ static const struct attribute_group p2pmem_group = {
>   .name = "p2pmem",
>  };
>  
> -static struct p2pdma_pagemap *to_p2p_pgmap(struct percpu_ref *ref)
> -{
> - return container_of(ref, struct p2pdma_pagemap, ref);
> -}
> -
> -static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
> -{
> - struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
> -
> - complete(&p2p_pgmap->ref_done);
> -}
> -
> -static void pci_p2pdma_percpu_kill(struct dev_pagemap *pgmap)
> -{
> - percpu_ref_kill(pgmap->ref);
> -}
> -
> -static void pci_p2pdma_percpu_cleanup(struct dev_pagemap *pgmap)
> -{
> - struct p2pdma_pagemap *p2p_pgmap =
> - container_of(pgmap, struct p2pdma_pagemap, pgmap);
> -
> - wait_for_completion(&p2p_pgmap->ref_done);
> - percpu_ref_exit(&p2p_pgmap->ref);
> -}
> -
>  static void pci_p2pdma_release(void *data)
>  {
>   struct pci_dev *pdev = data;
> @@ -153,11 +121,6 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
>   return error;
>  }
>  
> -static const struct dev_pagemap_ops pci_p2pdma_pagemap_ops = {
> - .kill   = pci_p2pdma_percpu_kill,
> - .cleanup= pci_p2pdma_percpu_cleanup,
> -};
> -
>  /**
>   * pci_p2pdma_add_resource - add memory for use as p2p memory
>   * @pdev: the device to add the memory to
> @@ -171,7 +134,6 @@ static const struct dev_pagemap_ops 
> pci_p2pdma_pagemap_ops = {
>  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>   u64 offset)
>  {
> - struct p2pdma_pagemap *p2p_pgmap;
>   struct dev_pagemap *pgmap;
>   void *addr;
>   int error;
> @@ -194,26 +156,15 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int 
> bar, size_t size,
>   return error;
>   }
>  
> - p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
> - if (!p2p_pgmap)
> + pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
> + if (!pgmap)
>   return -ENOMEM;
> -
> - init_completion(&p2p_pgmap->ref_done);
> - error = percpu_ref_init(&p2p_pgmap->ref,
> - pci_p2pdma_percpu_release, 0, GFP_KERNEL);
> - if (error)
> - goto pgmap_free;
> -
> - pgmap = &p2p_pgmap->pgmap;
> -
>   pgmap->res.start = pci_resource_start(pdev, bar) + offset;
>   pgmap->res.end = pgmap->res.start + size - 1;
>   pgmap->res.flags = pci_resource_flags(pdev, bar);
> - pgmap->ref = &p2p_pgmap->ref;
>   pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
>   pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
>   pci_resource_start(pdev, bar);
> - pgmap->ops = &pci_p2pdma_pagemap_ops;
>  
>   addr = devm_memremap_pages(&pdev->dev, pgmap);
>   if (IS_ERR(addr)) {
> @@ -224,7 +175,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int 
> bar, size_t size,
>   error = gen_pool_add_owner(pdev->p2pdma->pool, (unsigned long)addr,
>   pci_bus_address(pdev, bar) + offset,
>   resource_size(&pgmap->res), dev_to_node(&pdev->dev),
> - &p2p_pgmap->ref);
> + pgmap->ref);
>   if (error)
>   goto pages_free;
>  
> @@ -236,7 +187,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int 
> bar, size_t size,
>  pages_free:
>   devm_memunmap_pages(&pdev->dev, pgmap);
>  pgmap_free:
> - devm_kfree(&pdev->dev, p2p_pgmap);
> + devm_kfree(&pdev->dev, pgmap);
>   return error;
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
> 

Re: [PATCH 08/25] memremap: move dev_pagemap callbacks into a separate structure

2019-06-18 Thread Logan Gunthorpe


On 2019-06-17 6:27 a.m., Christoph Hellwig wrote:
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index a98126ad9c3a..e083567d26ef 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -100,7 +100,7 @@ static void pci_p2pdma_percpu_cleanup(struct percpu_ref *ref)
>   struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
>  
>   wait_for_completion(&p2p_pgmap->ref_done);
> - percpu_ref_exit(&p2p_pgmap->ref);
> + percpu_ref_exit(ref);
>  }
>  
>  static void pci_p2pdma_release(void *data)
> @@ -152,6 +152,11 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
>   return error;
>  }
>  
> +static const struct dev_pagemap_ops pci_p2pdma_pagemap_ops = {
> + .kill   = pci_p2pdma_percpu_kill,
> + .cleanup= pci_p2pdma_percpu_cleanup,
> +};
> +
>  /**
>   * pci_p2pdma_add_resource - add memory for use as p2p memory
>   * @pdev: the device to add the memory to
> @@ -207,8 +212,6 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int 
> bar, size_t size,
>   pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
>   pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
>   pci_resource_start(pdev, bar);
> - pgmap->kill = pci_p2pdma_percpu_kill;
> - pgmap->cleanup = pci_p2pdma_percpu_cleanup;

I just noticed this is missing a line to set pgmap->ops to
pci_p2pdma_pagemap_ops. I must have gotten confused by the other users
in my original review. Though I'm not sure how this compiles as the new
struct is static and unused. However, it is rendered moot in Patch 16
when this is all removed.

Logan
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH 16/25] PCI/P2PDMA: use the dev_pagemap internal refcount

2019-06-18 Thread Logan Gunthorpe


On 2019-06-17 6:27 a.m., Christoph Hellwig wrote:
> The functionality is identical to the one currently open coded in
> p2pdma.c.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Logan Gunthorpe 

I also did a quick test with the full patch-set to ensure that the setup
and tear down paths for p2pdma still work correctly and it all does.

Thanks,

Logan

> ---
>  drivers/pci/p2pdma.c | 56 
>  1 file changed, 4 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 48a88158e46a..608f84df604a 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -24,12 +24,6 @@ struct pci_p2pdma {
>   bool p2pmem_published;
>  };
>  
> -struct p2pdma_pagemap {
> - struct dev_pagemap pgmap;
> - struct percpu_ref ref;
> - struct completion ref_done;
> -};
> -
>  static ssize_t size_show(struct device *dev, struct device_attribute *attr,
>char *buf)
>  {
> @@ -78,32 +72,6 @@ static const struct attribute_group p2pmem_group = {
>   .name = "p2pmem",
>  };
>  
> -static struct p2pdma_pagemap *to_p2p_pgmap(struct percpu_ref *ref)
> -{
> - return container_of(ref, struct p2pdma_pagemap, ref);
> -}
> -
> -static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
> -{
> - struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
> -
> - complete(_pgmap->ref_done);
> -}
> -
> -static void pci_p2pdma_percpu_kill(struct dev_pagemap *pgmap)
> -{
> - percpu_ref_kill(pgmap->ref);
> -}
> -
> -static void pci_p2pdma_percpu_cleanup(struct dev_pagemap *pgmap)
> -{
> - struct p2pdma_pagemap *p2p_pgmap =
> - container_of(pgmap, struct p2pdma_pagemap, pgmap);
> -
> - wait_for_completion(&p2p_pgmap->ref_done);
> - percpu_ref_exit(&p2p_pgmap->ref);
> -}
> -
>  static void pci_p2pdma_release(void *data)
>  {
>   struct pci_dev *pdev = data;
> @@ -153,11 +121,6 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
>   return error;
>  }
>  
> -static const struct dev_pagemap_ops pci_p2pdma_pagemap_ops = {
> - .kill   = pci_p2pdma_percpu_kill,
> - .cleanup= pci_p2pdma_percpu_cleanup,
> -};
> -
>  /**
>   * pci_p2pdma_add_resource - add memory for use as p2p memory
>   * @pdev: the device to add the memory to
> @@ -171,7 +134,6 @@ static const struct dev_pagemap_ops 
> pci_p2pdma_pagemap_ops = {
>  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>   u64 offset)
>  {
> - struct p2pdma_pagemap *p2p_pgmap;
>   struct dev_pagemap *pgmap;
>   void *addr;
>   int error;
> @@ -194,22 +156,12 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int 
> bar, size_t size,
>   return error;
>   }
>  
> - p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
> - if (!p2p_pgmap)
> + pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
> + if (!pgmap)
>   return -ENOMEM;
> -
> - init_completion(&p2p_pgmap->ref_done);
> - error = percpu_ref_init(&p2p_pgmap->ref,
> - pci_p2pdma_percpu_release, 0, GFP_KERNEL);
> - if (error)
> - goto pgmap_free;
> -
> - pgmap = &p2p_pgmap->pgmap;
> -
>   pgmap->res.start = pci_resource_start(pdev, bar) + offset;
>   pgmap->res.end = pgmap->res.start + size - 1;
>   pgmap->res.flags = pci_resource_flags(pdev, bar);
> - pgmap->ref = &p2p_pgmap->ref;
>   pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
>   pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
>   pci_resource_start(pdev, bar);
> @@ -223,7 +175,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int 
> bar, size_t size,
>   error = gen_pool_add_owner(pdev->p2pdma->pool, (unsigned long)addr,
>   pci_bus_address(pdev, bar) + offset,
>   resource_size(&pgmap->res), dev_to_node(&pdev->dev),
> - &p2p_pgmap->ref);
> + pgmap->ref);
>   if (error)
>   goto pages_free;
>  
> @@ -235,7 +187,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int 
> bar, size_t size,
>  pages_free:
>   devm_memunmap_pages(&pdev->dev, pgmap);
>  pgmap_free:
> - devm_kfree(&pdev->dev, p2p_pgmap);
> + devm_kfree(&pdev->dev, pgmap);
>   return error;
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
> 

Re: dev_pagemap related cleanups

2019-06-14 Thread Logan Gunthorpe


On 2019-06-13 12:27 p.m., Dan Williams wrote:
> On Thu, Jun 13, 2019 at 2:43 AM Christoph Hellwig  wrote:
>>
>> Hi Dan, Jérôme and Jason,
>>
>> below is a series that cleans up the dev_pagemap interface so that
>> it is more easily usable, which removes the need to wrap it in hmm
>> and thus allowing to kill a lot of code
>>
>> Diffstat:
>>
>>  22 files changed, 245 insertions(+), 802 deletions(-)
> 
> Hooray!
> 
>> Git tree:
>>
>> git://git.infradead.org/users/hch/misc.git hmm-devmem-cleanup
> 
> I just realized this collides with the dev_pagemap release rework in
> Andrew's tree (commit ids below are from next.git and are not stable)
> 
> 4422ee8476f0 mm/devm_memremap_pages: fix final page put race
> 771f0714d0dc PCI/P2PDMA: track pgmap references per resource, not globally
> af37085de906 lib/genalloc: introduce chunk owners
> e0047ff8aa77 PCI/P2PDMA: fix the gen_pool_add_virt() failure path
> 0315d47d6ae9 mm/devm_memremap_pages: introduce devm_memunmap_pages
> 216475c7eaa8 drivers/base/devres: introduce devm_release_action()
> 
> CONFLICT (content): Merge conflict in tools/testing/nvdimm/test/iomap.c
> CONFLICT (content): Merge conflict in mm/hmm.c
> CONFLICT (content): Merge conflict in kernel/memremap.c
> CONFLICT (content): Merge conflict in include/linux/memremap.h
> CONFLICT (content): Merge conflict in drivers/pci/p2pdma.c
> CONFLICT (content): Merge conflict in drivers/nvdimm/pmem.c
> CONFLICT (content): Merge conflict in drivers/dax/device.c
> CONFLICT (content): Merge conflict in drivers/dax/dax-private.h
> 
> Perhaps we should pull those out and resend them through hmm.git?

Hmm, I've been waiting for those patches to get in for a little while now ;(

Logan

Re: [PATCH 08/22] memremap: pass a struct dev_pagemap to ->kill

2019-06-14 Thread Logan Gunthorpe


On 2019-06-13 3:43 a.m., Christoph Hellwig wrote:
> Passing the actual typed structure leads to more understandable code
> vs the actual references.

Ha, ok, I originally suggested this to Dan when he introduced the
callback[1].
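
For anyone following along, the change boils down to the ->kill callback
going from taking the untyped percpu ref to the typed pgmap, roughly:

/* Before: */
void (*kill)(struct percpu_ref *ref);

/* After: */
void (*kill)(struct dev_pagemap *pgmap);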

Reviewed-by: Logan Gunthorpe 

Logan

[1]
https://lore.kernel.org/lkml/8f0cae82-130f-8a64-cfbd-fda5fd76b...@deltatee.com/T/#u


Re: dev_pagemap related cleanups

2019-06-14 Thread Logan Gunthorpe


On 2019-06-13 2:21 p.m., Dan Williams wrote:
> On Thu, Jun 13, 2019 at 1:18 PM Logan Gunthorpe  wrote:
>>
>>
>>
>> On 2019-06-13 12:27 p.m., Dan Williams wrote:
>>> On Thu, Jun 13, 2019 at 2:43 AM Christoph Hellwig  wrote:
>>>>
>>>> Hi Dan, Jérôme and Jason,
>>>>
>>>> below is a series that cleans up the dev_pagemap interface so that
>>>> it is more easily usable, which removes the need to wrap it in hmm
>>>> and thus allowing to kill a lot of code
>>>>
>>>> Diffstat:
>>>>
>>>>  22 files changed, 245 insertions(+), 802 deletions(-)
>>>
>>> Hooray!
>>>
>>>> Git tree:
>>>>
>>>> git://git.infradead.org/users/hch/misc.git hmm-devmem-cleanup
>>>
>>> I just realized this collides with the dev_pagemap release rework in
>>> Andrew's tree (commit ids below are from next.git and are not stable)
>>>
>>> 4422ee8476f0 mm/devm_memremap_pages: fix final page put race
>>> 771f0714d0dc PCI/P2PDMA: track pgmap references per resource, not globally
>>> af37085de906 lib/genalloc: introduce chunk owners
>>> e0047ff8aa77 PCI/P2PDMA: fix the gen_pool_add_virt() failure path
>>> 0315d47d6ae9 mm/devm_memremap_pages: introduce devm_memunmap_pages
>>> 216475c7eaa8 drivers/base/devres: introduce devm_release_action()
>>>
>>> CONFLICT (content): Merge conflict in tools/testing/nvdimm/test/iomap.c
>>> CONFLICT (content): Merge conflict in mm/hmm.c
>>> CONFLICT (content): Merge conflict in kernel/memremap.c
>>> CONFLICT (content): Merge conflict in include/linux/memremap.h
>>> CONFLICT (content): Merge conflict in drivers/pci/p2pdma.c
>>> CONFLICT (content): Merge conflict in drivers/nvdimm/pmem.c
>>> CONFLICT (content): Merge conflict in drivers/dax/device.c
>>> CONFLICT (content): Merge conflict in drivers/dax/dax-private.h
>>>
>>> Perhaps we should pull those out and resend them through hmm.git?
>>
>> Hmm, I've been waiting for those patches to get in for a little while now ;(
> 
> Unless Andrew was going to submit as v5.2-rc fixes I think I should
> rebase / submit them on current hmm.git and then throw these cleanups
> from Christoph on top?

Whatever you feel is best. I'm just hoping they get in sooner rather
than later. They do fix a bug after all. Let me know if you want me to
retest the P2PDMA stuff after the rebase.

Thanks,

Logan

Re: [PATCH 07/22] memremap: move dev_pagemap callbacks into a separate structure

2019-06-14 Thread Logan Gunthorpe


On 2019-06-13 3:43 a.m., Christoph Hellwig wrote:
> The dev_pagemap is growing too many callbacks.  Move them into a
> separate ops structure so that they are not duplicated for multiple
> instances, and an attacker can't easily overwrite them.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/device.c  |  6 +-
>  drivers/nvdimm/pmem.c | 18 +-
>  drivers/pci/p2pdma.c  |  5 -
>  include/linux/memremap.h  | 29 +++--
>  kernel/memremap.c | 12 ++--
>  mm/hmm.c  |  8 ++--
>  tools/testing/nvdimm/test/iomap.c |  2 +-
>  7 files changed, 50 insertions(+), 30 deletions(-)

Looks good to me,

Reviewed-by: Logan Gunthorpe 

Logan

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-15 Thread Logan Gunthorpe


On 2019-05-14 6:14 p.m., Frank Rowand wrote:
> The high level issue is to provide reviewers with enough context to be
> able to evaluate the patch series.  That is probably not very obvious
> at this point in the thread.  At this point I was responding to Logan's
> response to me that I should be reading Documentation to get a better
> description of KUnit features.  I _think_ that Logan thought that I
> did not understand KUnit features and was trying to be helpful by
> pointing out where I could get more information.  If so, he was missing
> my intended point had been that patch 0 should provide more information
> to justify adding this feature.

Honestly, I lost track of what exactly your point was. And, in my
opinion, Brendan has provided over and above the information required to
justify KUnit's inclusion.

> One thing that has become very apparent in the discussion of this patch
> series is that some people do not understand that kselftest includes
> in-kernel tests, not just userspace tests.  As such, KUnit is an
> additional implementation of "the same feature".  (One can debate
> exactly which in-kernel test features kselftest and KUnit provide,
> and how much overlap exists or does not exist.  So don't take "the
> same feature" as my final opinion of how much overlap exists.)  So
> that is a key element to be noted and explained.

From my perspective, once we were actually pointed to the in-kernel
kselftest code and took a look at it, it was clear there was no
over-arching framework to those tests and that KUnit could be used to
significantly improve them with a common structure. Based on my
reading of the thread, Ted came to the same conclusion.

I don't think we should block this feature from being merged, and for
future work, someone can update the in-kernel kselftests to use the new
framework.

Logan

Re: [PATCH v2 08/17] kunit: test: add support for test abort

2019-05-06 Thread Logan Gunthorpe


On 2019-05-03 12:48 a.m., Brendan Higgins wrote:
> On Thu, May 2, 2019 at 8:15 PM Logan Gunthorpe  wrote:
>> On 2019-05-01 5:01 p.m., Brendan Higgins wrote:
>>> +/*
>>> + * struct kunit_try_catch - provides a generic way to run code which might 
>>> fail.
>>> + * @context: used to pass user data to the try and catch functions.
>>> + *
>>> + * kunit_try_catch provides a generic, architecture independent way to 
>>> execute
>>> + * an arbitrary function of type kunit_try_catch_func_t which may bail out 
>>> by
>>> + * calling kunit_try_catch_throw(). If kunit_try_catch_throw() is called, 
>>> @try
>>> + * is stopped at the site of invocation and @catch is called.
>>
>> I found some of the C++ comparisons in this series a bit distasteful but
>> wasn't going to say anything until I saw the try catch. But looking
>> into the implementation it's just a thread that can exit early, which
>> seems fine to me. Just a poor choice of name I guess...
> 
> Guilty as charged (I have a long history with C++, sorry). Would you
> prefer I changed the name? I just figured that try-catch is a commonly
> understood pattern that describes exactly what I am doing.

It is a commonly understood pattern, but I don't think it's what the
code is doing. Try-catch cleans up an entire stack and allows each level
of the stack to apply local cleanup. This implementation simply exits a
thread and has none of that complexity. To me, it seems like an odd
abstraction here as it's really just a test runner that can exit early
(though I haven't seen the follow-up UML implementation).

I would prefer to see this cleaned up such that the abstraction matches
more what's going on but I don't feel that strongly about it so I'll
leave it up to you to figure out what's best unless other reviewers have
stronger opinions.

>>
>> [snip]
>>
>>> +static void __noreturn kunit_abort(struct kunit *test)
>>> +{
>>> + kunit_set_death_test(test, true);
>>> +
>>> + kunit_try_catch_throw(&test->try_catch);
>>> +
>>> + /*
>>> +  * Throw could not abort from test.
>>> +  *
>>> +  * XXX: we should never reach this line! As kunit_try_catch_throw is
>>> +  * marked __noreturn.
>>> +  */
>>> + WARN_ONCE(true, "Throw could not abort from test!\n");
>>> +}
>>> +
>>>  int kunit_init_test(struct kunit *test, const char *name)
>>>  {
>>>   spin_lock_init(&test->lock);
>>> @@ -77,6 +103,7 @@ int kunit_init_test(struct kunit *test, const char *name)
>>>   test->name = name;
>>>   test->vprintk = kunit_vprintk;
>>>   test->fail = kunit_fail;
>>> + test->abort = kunit_abort;
>>
>> There are a number of these function pointers which seem to be pointless
>> to me as you only ever set them to one function. Just call the function
>> directly. As it is, it is an unnecessary indirection for someone reading
>> the code. If and when you have multiple implementations of the function
>> then add the pointer. Don't assume you're going to need it later on and
>> add all this maintenance burden if you never use it.
> 
> Ah, yes, Frank (and probably others) previously asked me to remove
> unnecessary method pointers; I removed all the totally unused ones. As
> for these, I don't use them in this patchset, but I use them in my
> patchsets that will follow up this one. These in particular are
> present so that they can be mocked out for testing.

Adding indirection and function pointers solely for the purpose of
mocking out while testing doesn't sit well with me and I don't think it
should be a pattern that's encouraged. Adding extra complexity like this
to a design to make it unit-testable doesn't seem like something that
makes sense in kernel code. Especially given that indirect calls are
more expensive in the age of spectre.

Also, mocking these particular functions seems like it's an artifact of
how you've designed the try/catch abstraction. If the abstraction was
more around an abort-able test runner then it doesn't make sense to need
to mock out the abort/fail functions as you will be testing overly
generic features of something that don't seem necessary to the
implementation.

>>
>> [snip]
>>
>>> +void kunit_generic_try_catch_init(struct kunit_try_catch *try_catch)
>>> +{
>>> + try_catch->run = kunit_generic_run_try_catch;
>>> + try_catch->throw = kunit_generic_throw;
>>> +}
>>
>> Same here. There's only one implementation of try_catch and I can't
>> really see any sensible justification for another implementation.

Re: [PATCH v2 08/17] kunit: test: add support for test abort

2019-05-03 Thread Logan Gunthorpe


On 2019-05-01 5:01 p.m., Brendan Higgins wrote:
> +/*
> + * struct kunit_try_catch - provides a generic way to run code which might 
> fail.
> + * @context: used to pass user data to the try and catch functions.
> + *
> + * kunit_try_catch provides a generic, architecture independent way to 
> execute
> + * an arbitrary function of type kunit_try_catch_func_t which may bail out by
> + * calling kunit_try_catch_throw(). If kunit_try_catch_throw() is called, 
> @try
> + * is stopped at the site of invocation and @catch is called.

I found some of the C++ comparisons in this series a bit distasteful but
wasn't going to say anything until I saw the try catch. But looking
into the implementation it's just a thread that can exit early, which
seems fine to me. Just a poor choice of name I guess...

[snip]

> +static void __noreturn kunit_abort(struct kunit *test)
> +{
> + kunit_set_death_test(test, true);
> +
> + kunit_try_catch_throw(&test->try_catch);
> +
> + /*
> +  * Throw could not abort from test.
> +  *
> +  * XXX: we should never reach this line! As kunit_try_catch_throw is
> +  * marked __noreturn.
> +  */
> + WARN_ONCE(true, "Throw could not abort from test!\n");
> +}
> +
>  int kunit_init_test(struct kunit *test, const char *name)
>  {
>   spin_lock_init(&test->lock);
> @@ -77,6 +103,7 @@ int kunit_init_test(struct kunit *test, const char *name)
>   test->name = name;
>   test->vprintk = kunit_vprintk;
>   test->fail = kunit_fail;
> + test->abort = kunit_abort;

There are a number of these function pointers which seem to be pointless
to me as you only ever set them to one function. Just call the function
directly. As it is, it is an unnecessary indirection for someone reading
the code. If and when you have multiple implementations of the function,
then add the pointer. Don't assume you're going to need it later on and
add all this maintenance burden if you never use it.
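
i.e., instead of the indirection (a sketch of what I mean):

/* Storing the single implementation in a pointer ... */
test->abort = kunit_abort;      /* in kunit_init_test() */
test->abort(test);              /* at the call site */

/* ... could simply become a direct call: */
kunit_abort(test);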

[snip]

> +void kunit_generic_try_catch_init(struct kunit_try_catch *try_catch)
> +{
> + try_catch->run = kunit_generic_run_try_catch;
> + try_catch->throw = kunit_generic_throw;
> +}

Same here. There's only one implementation of try_catch and I can't
really see any sensible justification for another implementation. Even
if there is, add the indirection when the second implementation is
added. This isn't C++ and we don't need to make everything a "method".

Thanks,

Logan

Re: [PATCH v2 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-05-03 Thread Logan Gunthorpe


On 2019-05-01 5:01 p.m., Brendan Higgins wrote:
> ## TLDR
> 
> I rebased the last patchset on 5.1-rc7 in hopes that we can get this in
> 5.2.
> 

As I said on the last posting, I like this and would like to see it move
forward. I still have the same concerns over the downsides of using UML
(ie. not being able to compile large swaths of the tree due to features
that don't exist in that arch) but these are concerns for later.

I'd prefer to see the unnecessary indirection that I pointed out in
patch 8 cleaned up but, besides that, the code looks good to me.

Reviewed-by: Logan Gunthorpe 

Thanks!

Logan

Re: [RFC v4 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-03-24 Thread Logan Gunthorpe


On 2019-03-21 4:07 p.m., Brendan Higgins wrote:
> A couple of points, as for needing CONFIG_PCI; my plan to deal with
> that type of thing has been that we would add support for a KUnit/UML
> version that is just for KUnit. It would mock out the necessary bits
> to provide a fake hardware implementation for anything that might
> depend on it. I wrote a prototype for mocking/faking MMIO that I
> presented to the list here[1]; it is not part of the current patchset
> because we decided it would be best to focus on getting an MVP in, but
> I plan on bringing it back up at some point. Anyway, what do you
> generally think of this approach?

Yes, I was wondering if that might be possible. I think that's a great
approach but it will unfortunately take a lot of work before larger
swaths of the kernel are testable in KUnit with UML. Having more common
mocked infrastructure will be a great by-product of it, though.

> Awesome, I looked at the code you posted and it doesn't look like you
> have had too many troubles. One thing that stood out to me, why did
> you need to put it in the kunit/ dir?

Yeah, writing the code was super easy. Only afterward did I realize I
couldn't get it to build easily.

Putting it in the kunit directory was necessary because nothing in the
NTB tree builds unless CONFIG_NTB is set (see drivers/Makefile) and
CONFIG_NTB depends on CONFIG_PCI. I didn't experiment to see how hard it
would be to set CONFIG_NTB without CONFIG_PCI; I assumed it would be tricky.

> I am looking forward to see what you think!

Generally, I'm impressed and want to see this work in upstream as soon
as possible so I can start to make use of it!

Logan

Re: [RFC v4 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-03-24 Thread Logan Gunthorpe


On 2019-03-20 11:23 p.m., Knut Omang wrote:
> Testing drivers, hardware and firmware within production kernels was the use
> case that inspired KTF (Kernel Test Framework). Currently KTF is available
> as a standalone git repository. That's been the most efficient form for us so far,
> as we typically want tests to be developed once but deployed on many different
> kernel versions simultaneously, as part of continuous integration.

Interesting. It seems like it's really in direct competition with KUnit.
I didn't really go into it in too much detail but these are my thoughts:

From a developer perspective I think KTF not being in the kernel tree is
a huge negative. I want minimal effort to include my tests in a patch
series and minimal effort for other developers to be able to use them.
Needing to submit these tests to another project or use another project
to run them is too much friction.

Also I think the goal of having tests that run on any kernel version is
a pipe dream. You'd absolutely need a way to encode which kernel
versions a test is expected to pass on because the tests might not make
sense until a feature is finished in upstream. And this makes it even
harder to develop these tests because, when we write them, we might not
even know which kernel version the feature will be added to. Similarly,
if a feature is removed or substantially changed, someone will have to
do a patch to disable the test for subsequent kernel versions and create
a new test for changed features. So, IMO, tests absolutely have to be
part of the kernel tree so they can be changed with the respective
features they test.

KUnit's ability to run without having to build and run the entire kernel
is also a huge plus (assuming there's a way to get around the build
dependency issues). Because of this, it can be very quick to run these
tests, which makes development a *lot* easier since you don't have to
reboot a machine every time you want to test a fix.

Logan

Re: [RFC v4 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-03-24 Thread Logan Gunthorpe


On 2019-03-21 1:13 p.m., Knut Omang wrote:
>> Nevertheless, I don't really see KTF as a real unit testing framework
>> for a number of different reasons; you pointed out some below, but I
>> think the main one being that it requires booting a real kernel on
>> actual hardware; 
> 
> That depends on what you want to test. If you need hardware (or simulated or
> emulated hardware) for the test, of course you would need to have that 
> hardware,
> but if, lets say, you just wanted to run tests like the skbuff example tests
> (see link above) you wouldn't need anything more than what you need to run 
> KUnit
> tests.

I'm starting to get the same impression: KTF isn't unit testing. When we
are saying "unit tests" we are specifying exactly what we want to test:
small sections of code in isolation. So by definition you should not
need hardware for this.

> I have fulfilled that dream, so I know it is possible (Infiniband driver,
> kernels from 2.6.39 to 4.8.x or so..) I know a lot of projects would benefit
> from support for such workflows, but that's not really the point here - we 
> want
> to achieve both goals!

This is what makes me think we are not talking about testing the same
things. We are not talking about end-to-end testing of entire drivers
but smaller sections of code. A unit test is far more granular and,
despite an Infiniband driver existing for 2.6.39 through 4.8, the
internal implementation could be drastically different. But unit tests
would be testing internal details which could be very different version
to version and has to evolve with the implementation.

> If your target component under test can be built as a kernel module, or set of
> modules, with KTF your workflow would not involve booting at all (unless you
> happened to crash the system with one of your tests, that is :) )

> You would just unload your module under test and the test module, recompile 
> the
> two and insmod again. My work current work cycle on this is just a few 
> seconds.

Yes, I'm sure we've all done that many a time, but it's really beside
the point. KUnit offers a much nicer method for running a lot of unit
tests on existing code.

Logan

Re: [RFC v4 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-03-21 Thread Logan Gunthorpe
Hi,

On 2019-02-14 2:37 p.m., Brendan Higgins wrote:
> This patch set proposes KUnit, a lightweight unit testing and mocking
> framework for the Linux kernel.

I haven't followed the entire conversation but I saw the KUnit write-up
on LWN and ended up, as an exercise, giving it a try.

I really like the idea of having a fast unit testing infrastructure in
the kernel. Occasionally, I write userspace tests for tricky functions
by essentially copying the code over to a throwaway C file and
exercising it as I need. I think it would be great to be able to keep
these tests around in a way that they can be run by anyone who wants to
touch the code.

I was just dealing with some functions that required some mocked-up
tests, so I thought I'd give KUnit a try. I found writing the code very
easy, and it was quite simple to mock out the hardware for the
infrastructure I was testing.

However, I got a bit hung up on one issue: I was writing unit tests for
code in the NTB tree, which itself depends on CONFIG_PCI, which cannot
be enabled in UML (for what should be obvious reasons). I managed to
work around this because, as luck would have it, all the functions I
cared about testing were actually static inline functions in headers. So
I placed my test code in the kunit folder (so it would compile) and
hacked around a couple of functions I didn't care about that would not
be compiled.
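
To give a flavour of it, mocking the hardware for one of those static
inlines looks roughly like this (a sketch, not the actual tests from
[1]; ntb_spad_count() just dispatches through ntb_dev_ops, so a fake
ops table is all that's needed):

	#include <kunit/test.h>
	#include <linux/ntb.h>

	static int fake_spad_count(struct ntb_dev *ntb)
	{
		return 16;	/* pretend the hardware has 16 scratchpads */
	}

	static const struct ntb_dev_ops fake_ops = {
		.spad_count = fake_spad_count,
	};

	static void ntb_spad_count_test(struct kunit *test)
	{
		/* No real device needed: the inline only touches ntb->ops */
		struct ntb_dev ntb = { .ops = &fake_ops };

		KUNIT_EXPECT_EQ(test, 16, ntb_spad_count(&ntb));
	}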

In the end I got it to work acceptably, but I get the impression that
KUnit will not be usable for wide swaths of kernel code that can't be
compiled in UML. Have there been any discussions or ideas on how to work
around this so it can be more generally useful? Or will this feature be
restricted roughly to non-drivers and functions in headers that don't
have #ifdefs around them?

If you're interested in seeing the unit tests I ended up writing you can
find the commits here[1].

Thanks,

Logan

[1] https://github.com/sbates130272/linux-p2pmem/ ntb_kunit

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-02-01 Thread Logan Gunthorpe


On 2019-01-31 12:35 p.m., Jerome Glisse wrote:
> So what is this O_DIRECT thing that keep coming again and again here :)
> What is the use case ? Note that bio will always have valid struct page
> of regular memory as using PCIE BAR for filesystem is crazy (you do not
> have atomic or cache coherence and many CPU instruction have _undefined_
> effect so what ever the userspace would do might do nothing.

The point is to be able to use a BAR as the source of data to write/read
from a file system. So as a simple example, if an NVMe drive had a CMB,
and you could map that CMB to userspace, you could do an O_DIRECT read
to the BAR on one drive and an O_DIRECT write from the BAR on another
drive. Thus you could bypass the upstream port of a switch (and
therefore all CPU resources) altogether.
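
In userspace terms the flow would look something like this (a sketch
only: the p2pmem device node is hypothetical, and error handling and
alignment checks are omitted):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define LEN (2 * 1024 * 1024)

	int main(void)
	{
		/* /dev/p2pmem0: a made-up char device exposing an NVMe CMB */
		int cmb = open("/dev/p2pmem0", O_RDWR);
		void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				 MAP_SHARED, cmb, 0);

		int src = open("/mnt/nvme_a/data", O_RDONLY | O_DIRECT);
		int dst = open("/mnt/nvme_b/data",
			       O_WRONLY | O_CREAT | O_DIRECT, 0644);

		/* Drive A DMAs the data directly into the BAR... */
		read(src, buf, LEN);
		/* ...and drive B DMAs it directly back out; no system RAM,
		 * no CPU copies, no traffic through the upstream port. */
		write(dst, buf, LEN);

		return 0;
	}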

For the most part nobody would want to put a filesystem on a BAR.
(Though there have been some crazy ideas to put persistent memory behind
a CMB...)

> Now if you want to use BAR address as destination or source of directIO
> then let just update the directIO code to handle this. There is no need
> to go hack every single place in the kernel that might deal with struct
> page or sgl. Just update the place that need to understand this. We can
> even update directIO to work on weird platform. The change to directIO
> will be small, couple hundred line of code at best.

Well, if you want to figure out how to remove struct page from the
entire block layer, that would help everybody. But until then, it's
pretty much impossible to use the block layer (and therefore O_DIRECT)
without struct page.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-02-01 Thread Logan Gunthorpe


On 2019-01-31 12:02 p.m., Jason Gunthorpe wrote:
> I still think the right direction is to build on what Logan has done -
> realize that he created a DMA-only SGL - make that a formal type of
> the kernel and provide the right set of APIs to work with this type,
> without being forced to expose struct page.

> Basically invert the API flow - the DMA map would be done close to
> GUP, not buried in the driver. This absolutely doesn't work for every
> flow we have, but it does enable the ones that people seem to care
> about when talking about P2P.
> It also does present a path to solve some cases of the O_DIRECT
> problems if the block stack can develop some way to know if an IO will
> go down a DMA-only IO path or not... This seems less challenging that
> auditing every SGL user for iomem safety??


The DMA-only SGL will work for some use cases, but I think it's going to
be a challenge for others. We care most about NVMe and, therefore, the
block layer.

Given my understanding of the block layer and its queuing
infrastructure, I don't think having a DMA-only IO path makes sense. I
think it has to be the same path, but with a special DMA-only bio; and
endpoints would have to indicate support for that bio. I can't say I
have a deep enough understanding of the block layer to know how possible
that would be.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe


On 2019-01-30 10:26 a.m., Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 10:55:43AM -0500, Jerome Glisse wrote:
>> Even outside GPU driver, device driver like RDMA just want to share their
>> doorbell to other device and they do not want to see those doorbell page
>> use in direct I/O or anything similar AFAICT.
> 
> At least Mellanox HCA support and inline data feature where you
> can copy data directly into the BAR.  For something like a usrspace
> NVMe target it might be very useful to do direct I/O straight into
> the BAR for that.

Yup, these are things we definitely want to be able to do, and have done
with hacky garbage code: Direct I/O from NVMe to P2P BAR, then we could
Direct I/O to another drive or map it as an MR and send it over an RNIC.

We'd definitely like to move in that direction. And a world where such
userspace mappings are gimped by the fact that they are only some
special feature of userspace VMAs, usable only through specialized
userspace interfaces, is not useful to us.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe


On 2019-01-30 12:38 p.m., Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 02:22:34PM -0500, Jerome Glisse wrote:
> 
>> For GPU it would not work, GPU might want to use main memory (because
>> it is running out of BAR space) it is a lot easier if the p2p_map
>> callback calls the right dma map function (for page or io) rather than
>> having to define some format that would pass down the information.
> 
> This is already sort of built into the sgl, you are supposed to use
> is_pci_p2pdma_page() and pci_p2pdma_map_sg() and somehow it is supposed
> to work out - but I think this is also fairly incomplete.


> ie the current APIs seem to assume the SGL is homogeneous :(

We never changed SGLs. We still use them to pass p2pdma pages, only we
need to be a bit careful where we send the entire SGL. I see no reason
why we can't continue to be careful once they're in userspace if there's
something in GUP to deny them.

It would be nice to have heterogeneous SGLs and it is something we
should work toward but in practice they aren't really necessary at the
moment.

>>> Worry about optimizing away the struct page overhead later?
>>
>> Struct page do not fit well for GPU as the BAR address can be reprogram
>> to point to any page inside the device memory (think 256M BAR versus
>> 16GB device memory).
> 
> The struct page only points to the BAR - it is not related to the
> actual GPU memory in any way. The struct page is just an alternative
> way to specify the physical address of the BAR page.

That doesn't even necessarily need to be the case. For HMM, as I
understand it, struct pages may not point to any accessible memory, and
the memory that backs them (or not) may change over their lifetime. So
they don't have to be strictly tied to BAR addresses. p2pdma pages are
strictly tied to BAR addresses, though.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe


On 2019-01-30 12:59 p.m., Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 12:45:46PM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2019-01-30 12:06 p.m., Jason Gunthorpe wrote:
>>>> Way less problems than not having struct page for doing anything
>>>> non-trivial.  If you map the BAR to userspace with remap_pfn_range
>>>> and friends the mapping is indeed very simple.  But any operation
>>>> that expects a page structure, which is at least everything using
>>>> get_user_pages won't work.
>>>
>>> GUP doesn't work anyhow today, and won't work with BAR struct pages in
>>> the forseeable future (Logan has sent attempts on this before).
>>
>> I don't recall ever attempting that... But patching GUP for special
>> pages or VMAS; or working around by not calling it in some cases seems
>> like the thing that's going to need to be done one way or another.
> 
> Remember, the long discussion we had about how to get the IOMEM
> annotation into SGL? That is a necessary pre-condition to doing
> anything with GUP in DMA using drivers as GUP -> SGL -> DMA map is
> pretty much the standard flow.

Yes, but that was unrelated to GUP even if that might have been the
eventual direction.

And I feel the GUP->SGL->DMA flow should still be what we are aiming
for. Even if we need a special GUP for special pages, and a special DMA
map; and the SGL still has to be homogeneous.

> So, I see Jerome solving the GUP problem by replacing GUP entirely
> using an API that is more suited to what these sorts of drivers
> actually need.

Yes, this is what I'm expecting and what I want. Not bypassing the whole
thing by doing special things with VMAs.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe


On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
> Every attempt to give BAR memory to struct page has run into major
> trouble, IMHO, so I like that this approach avoids that.
> 
> And if you don't have struct page then the only kernel object left to
> hang meta data off is the VMA itself.
> 
> It seems very similar to the existing P2P work between in-kernel
> consumers, just that VMA is now mediating a general user space driven
> discovery process instead of being hard wired into a driver.

But the kernel now has P2P BARs backed by struct pages and it works
well. And that's what we are doing in-kernel. We even have a hacky
out-of-tree module which exposes these pages, and it also works (but
would need Jerome's solution for denying those pages in GUP, etc.). So
why do something completely different in userspace so they can't share
any of the DMA map infrastructure?

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe


On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
> I don't see why a special case with a VMA is really that different.

Well, one *really* big difference is that the VMA changes necessarily
expose specialized new functionality to userspace, which has to be
supported forever and may be difficult to change. The p2pdma code is
largely in-kernel, and we can rework and change the interfaces all we
want as we improve our struct page infrastructure.

I'd also argue that p2pdma isn't nearly as specialized as this VMA thing
and can be used pretty generically to do other things. Though, the other
ideas we've talked about doing are pretty far off and may have other
challenges.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe


On 2019-01-30 2:50 p.m., Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 02:01:35PM -0700, Logan Gunthorpe wrote:
> 
>> And I feel the GUP->SGL->DMA flow should still be what we are aiming
>> for. Even if we need a special GUP for special pages, and a special DMA
>> map; and the SGL still has to be homogenous
> 
> *shrug* so what if the special GUP called a VMA op instead of
> traversing the VMA PTEs today? Why does it really matter? It could
> easily change to a struct page flow tomorrow..

Well, it's so that it's composable. We want the SGL->DMA side to work
for APIs from kernel space and not have to run a completely different
flow for kernel drivers than for userspace memory.

For GUP to do a special VMA traversal, it would now need to return
something besides struct pages, which means no SGL, and it means a
completely different DMA mapping call.
> Would you feel better if this also came along with a:
> 
>   struct dma_sg_table *sgl_dma_map_user(struct device *dma_device, 
>  void __user *prt, size_t len)

That seems like a nice API. But certainly the implementation would need
to use existing dma_map or pci_p2pdma_map calls, or whatever as part of
it...

> flow which returns a *DMA MAPPED* sgl that does not have struct page
> pointers as another interface?
> 
> We can certainly call an API like this from RDMA for non-ODP MRs.
> 
> Eliminating the page pointers also eliminates the __iomem
> problem. However this sgl object is not copyable or accessible from
> the CPU, so the caller must be sure it doesn't need CPU access when
> using this API. 

We actually stopped caring about the __iomem problem. We are working
under the assumption that pages returned by devm_memremap_pages() can be
accessed as normal RAM and do not need the __iomem designation. The
main problem now is that code paths need to know whether to use
pci_p2pdma_map or not. In theory this could be pushed into the regular
dma_map implementations, but we'd have to get it into all of them, which
is a pain.
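
For what it's worth, such a helper could presumably compose GUP with
the existing mapping calls roughly like this (a sketch only: the helper
itself is hypothetical, unpinning and cleanup are elided, and it
assumes the pinned range is homogeneous):

	static int sgl_dma_map_user(struct device *dev, struct sg_table *sgt,
				    void __user *ptr, size_t len)
	{
		unsigned long nr_pages = DIV_ROUND_UP(offset_in_page(ptr) + len,
						      PAGE_SIZE);
		struct page **pages;
		int ret;

		pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
		if (!pages)
			return -ENOMEM;

		/* Pin the user range (write access; exact signature varies
		 * by kernel version) */
		if (get_user_pages_fast((unsigned long)ptr, nr_pages, 1,
					pages) != nr_pages) {
			ret = -EFAULT;	/* real code must release partial pins */
			goto out;
		}

		ret = sg_alloc_table_from_pages(sgt, pages, nr_pages,
						offset_in_page(ptr), len,
						GFP_KERNEL);
		if (ret)
			goto out;

		/* Pick the mapping path based on what GUP returned */
		if (is_pci_p2pdma_page(pages[0]))
			ret = pci_p2pdma_map_sg(dev, sgt->sgl, sgt->nents,
						DMA_BIDIRECTIONAL);
		else
			ret = dma_map_sg(dev, sgt->sgl, sgt->nents,
					 DMA_BIDIRECTIONAL);
	out:
		kvfree(pages);
		return ret;
	}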

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe


On 2019-01-30 12:22 p.m., Jerome Glisse wrote:
> On Wed, Jan 30, 2019 at 06:56:59PM +, Jason Gunthorpe wrote:
>> On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
>>>
>>>
>>> On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
>>>> Every attempt to give BAR memory to struct page has run into major
>>>> trouble, IMHO, so I like that this approach avoids that.
>>>>
>>>> And if you don't have struct page then the only kernel object left to
>>>> hang meta data off is the VMA itself.
>>>>
>>>> It seems very similar to the existing P2P work between in-kernel
>>>> consumers, just that VMA is now mediating a general user space driven
>>>> discovery process instead of being hard wired into a driver.
>>>
>>> But the kernel now has P2P bars backed by struct pages and it works
>>> well. 
>>
>> I don't think it works that well..
>>
>> We ended up with a 'sgl' that is not really a sgl, and doesn't work
>> with many of the common SGL patterns. sg_copy_buffer doesn't work,
>> dma_map, doesn't work, sg_page doesn't work quite right, etc.
>>
>> Only nvme and rdma got the special hacks to make them understand these
>> p2p-sgls, and I'm still not convinced some of the RDMA drivers that
>> want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't
>> break in this scenario.
>>
>> Since the SGLs become broken, it pretty much means there is no path to
>> make GUP work generically, we have to go through and make everything
>> safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds
>> impossible with all the competing objections.
>>
>> But GPU seems to have a problem unrelated to this - what Jerome wants
>> is to have two faulting domains for VMA's - visible-to-cpu and
>> visible-to-dma. The new op is essentially faulting the pages into the
>> visible-to-dma category and leaving them invisible-to-cpu.
>>
>> So that duality would still have to exists, and I think p2p_map/unmap
>> is a much simpler implementation than trying to create some kind of
>> special PTE in the VMA..
>>
>> At least for RDMA, struct page or not doesn't really matter. 
>>
>> We can make struct pages for the BAR the same way NVMe does.  GPU is
>> probably the same, just with more mememory at stake?  
>>
>> And maybe this should be the first implementation. The p2p_map VMA
>> operation should return a SGL and the caller should do the existing
>> pci_p2pdma_map_sg() flow.. 
> 
> For GPU it would not work, GPU might want to use main memory (because
> it is running out of BAR space) it is a lot easier if the p2p_map
> callback calls the right dma map function (for page or io) rather than
> having to define some format that would pass down the information.

>>
>> Worry about optimizing away the struct page overhead later?
> 
> Struct page do not fit well for GPU as the BAR address can be reprogram
> to point to any page inside the device memory (think 256M BAR versus
> 16GB device memory). Forcing struct page on GPU driver would require
> major surgery to the GPU driver inner working and there is no benefit
> to have from the struct page. So it is hard to justify this.

I think we have to consider struct pages as tracking the address space,
not what backs it (essentially what HMM is doing). If we need to add
operations for the driver to map the address space/struct pages back to
physical memory, then do that. Creating a whole new idea that's tied to
userspace VMAs still seems wrong to me.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe


On 2019-01-30 12:06 p.m., Jason Gunthorpe wrote:
>> Way less problems than not having struct page for doing anything
>> non-trivial.  If you map the BAR to userspace with remap_pfn_range
>> and friends the mapping is indeed very simple.  But any operation
>> that expects a page structure, which is at least everything using
>> get_user_pages won't work.
> 
> GUP doesn't work anyhow today, and won't work with BAR struct pages in
> the forseeable future (Logan has sent attempts on this before).

I don't recall ever attempting that... But patching GUP for special
pages or VMAs, or working around it by not calling it in some cases,
seems like the thing that's going to need to be done one way or another.

> Jerome made the HMM mirror API use this flow, so afer his patch to
> switch the ODP MR to use HMM, and to switch GPU drivers, it will work
> for those cases. Which is more than the zero cases than we have today
> :)

But we're getting the same bait and switch here... If you are using HMM
you are using struct pages, but we're told we need this special VMA hack
for cases that don't use HMM and thus don't have struct pages...

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe


On 2019-01-30 12:19 p.m., Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 11:13:11AM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
>>> I don't see why a special case with a VMA is really that different.
>>
>> Well one *really* big difference is the VMA changes necessarily expose
>> specialized new functionality to userspace which has to be supported
>> forever and may be difficult to change. 
> 
> The only user change here is that more things will succeed when
> creating RDMA MRs (and vice versa to GPU). I don't think this
> restricts the kernel implementation at all, unless we intend to
> remove P2P entirely..

Well, for MRs I'd expect you are using struct pages to track the memory
somehow. VMAs that aren't backed by pages and use this special
interface must therefore be creating new special interfaces that can
call p2p_[un]map...

I'd much rather see special cases around struct page so we can find ways
to generalize it in the future than create special cases tied to random
userspace interfaces.

Logan


Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:
> +bool pci_test_p2p(struct device *devA, struct device *devB)
> +{
> + struct pci_dev *pciA, *pciB;
> + bool ret;
> + int tmp;
> +
> + /*
> +  * For now we only support PCIE peer to peer but other inter-connect
> +  * can be added.
> +  */
> + pciA = find_parent_pci_dev(devA);
> + pciB = find_parent_pci_dev(devB);
> + if (pciA == NULL || pciB == NULL) {
> + ret = false;
> + goto out;
> + }
> +
> + tmp = upstream_bridge_distance(pciA, pciB, NULL);
> + ret = tmp < 0 ? false : true;
> +
> +out:
> + pci_dev_put(pciB);
> + pci_dev_put(pciA);
> + return false;
> +}
> +EXPORT_SYMBOL_GPL(pci_test_p2p);

This function only ever returns false; presumably the final statement
should be 'return ret;'.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 12:32 p.m., Jason Gunthorpe wrote:
> Jerome, I think it would be nice to have a helper scheme - I think the
> simple case would be simple remapping of PCI BAR memory, so if we
> could have, say something like:
> 
> static const struct vm_operations_struct my_ops {
>   .p2p_map = p2p_ioremap_map_op,
>   .p2p_unmap = p2p_ioremap_unmap_op,
> }
> 
> struct ioremap_data {
>   [..]
> }
> 
> fops_mmap() {
>vma->private_data = _priv->ioremap_data;
>return p2p_ioremap_device_memory(vma, exporting_device, [..]);
> }

This is roughly what I was expecting, except I don't see exactly what
the p2p_map and p2p_unmap callbacks are for. The importing driver should
see p2pdma/HMM struct pages and use the appropriate function to map
them. It shouldn't be the responsibility of the exporting driver to
implement the mapping. And I don't think we should have 'special' VMAs
for this (though we may need something to ensure we don't get mapping
requests mixed with different types of pages...).

I also figured there'd be a fault version of p2p_ioremap_device_memory()
for when you are mapping P2P memory and you want to assign the pages
lazily. Though, this can come later when someone wants to implement that.

Logan


Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 12:56 p.m., Alex Deucher wrote:
> On Tue, Jan 29, 2019 at 12:47 PM  wrote:
>>
>> From: Jérôme Glisse 
>>
>> device_test_p2p() return true if two devices can peer to peer to
>> each other. We add a generic function as different inter-connect
>> can support peer to peer and we want to genericaly test this no
>> matter what the inter-connect might be. However this version only
>> support PCIE for now.
>>
> 
> What about something like these patches:
> https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p=4fab9ff69cb968183f717551441b475fabce6c1c
> https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p=f90b12d41c277335d08c9dab62433f27c0fadbe5
> They are a bit more thorough.

Those new functions seem to have a lot of overlap with the code that is
already upstream in p2pdma. Perhaps you should improve the p2pdma
functions if they aren't suitable for what you want already, instead of
creating new ones.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 2:50 p.m., Jerome Glisse wrote:
> No this is the non HMM case i am talking about here. Fully ignore HMM
> in this frame. A GPU driver that do not support or use HMM in anyway
> has all the properties and requirement i do list above. So all the points
> i was making are without HMM in the picture whatsoever. I should have
> posted this a separate patches to avoid this confusion.
> 
> Regarding your HMM question. You can not map HMM pages, all code path
> that would try that would trigger a migration back to regular memory
> and will use the regular memory for CPU access.
> 

I thought this was the whole point of HMM... And eventually it would
support being able to map the pages through the BAR in cooperation with
the driver. If not, what's that whole layer for? Why not just have HMM
handle this situation?

And what struct pages are actually going to be backing these VMAs if
it's not using HMM?


> Again HMM has nothing to do here, ignore HMM it does not play any role
> and it is not involve in anyway here. GPU want to control what object
> they allow other device to access and object they do not allow. GPU driver
> _constantly_ invalidate the CPU page table and in fact the CPU page table
> do not have any valid pte for a vma that is an mmap of GPU device file
> for most of the vma lifetime. Changing that would highly disrupt and
> break GPU drivers. They need to control that, they need to control what
> to do if another device tries to peer map some of their memory. Hence
> why they need to implement the callback and decide on wether or not they
> allow the peer mapping or use device memory for it (they can decide to
> fallback to main memory).

But mapping is an operation on the memory/struct pages behind the VMA,
not on the VMA itself. I think that's evident in the code: the only way
the VMA layer is involved is that you're abusing vm_ops by adding new
ops there and having other layers call them.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:

> + /*
> +  * Optional for device driver that want to allow peer to peer (p2p)
> +  * mapping of their vma (which can be back by some device memory) to
> +  * another device.
> +  *
> +  * Note that the exporting device driver might not have map anything
> +  * inside the vma for the CPU but might still want to allow a peer
> +  * device to access the range of memory corresponding to a range in
> +  * that vma.
> +  *
> +  * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> +  * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> +  * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
> +  * device to map once during setup and report any failure at that time
> +  * to the userspace. Further mapping of the same range might happen
> +  * after mmu notifier invalidation over the range. The exporting device
> +  * can use this to move things around (defrag BAR space for instance)
> +  * or do other similar task.
> +  *
> +  * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> +  * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> +  * POINT IN TIME WITH NO LOCK HELD.
> +  *
> +  * In below function, the device argument is the importing device,
> +  * the exporting device is the device to which the vma belongs.
> +  */
> + long (*p2p_map)(struct vm_area_struct *vma,
> + struct device *device,
> + unsigned long start,
> + unsigned long end,
> + dma_addr_t *pa,
> + bool write);
> + long (*p2p_unmap)(struct vm_area_struct *vma,
> +   struct device *device,
> +   unsigned long start,
> +   unsigned long end,
> +   dma_addr_t *pa);

I don't understand why we need new p2p_[un]map function pointers for
this. In subsequent patches, they never appear to be set anywhere and
are only called by the HMM code. I'd have expected it to be called by
some core VMA code and set by HMM as that's what vm_operations_struct is
for.

But the code is all very confusing, hard to follow, and seems to be
missing significant chunks. So I'm not really sure what is going on.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 12:11 p.m., Jerome Glisse wrote:
> On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:
>>
>>> +   /*
>>> +* Optional for device driver that want to allow peer to peer (p2p)
>>> +* mapping of their vma (which can be back by some device memory) to
>>> +* another device.
>>> +*
>>> +* Note that the exporting device driver might not have map anything
>>> +* inside the vma for the CPU but might still want to allow a peer
>>> +* device to access the range of memory corresponding to a range in
>>> +* that vma.
>>> +*
>>> +* FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
>>> +* DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
>>> +* SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
>>> +* device to map once during setup and report any failure at that time
>>> +* to the userspace. Further mapping of the same range might happen
>>> +* after mmu notifier invalidation over the range. The exporting device
>>> +* can use this to move things around (defrag BAR space for instance)
>>> +* or do other similar task.
>>> +*
>>> +* IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
>>> +* WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
>>> +* POINT IN TIME WITH NO LOCK HELD.
>>> +*
>>> +* In below function, the device argument is the importing device,
>>> +* the exporting device is the device to which the vma belongs.
>>> +*/
>>> +   long (*p2p_map)(struct vm_area_struct *vma,
>>> +   struct device *device,
>>> +   unsigned long start,
>>> +   unsigned long end,
>>> +   dma_addr_t *pa,
>>> +   bool write);
>>> +   long (*p2p_unmap)(struct vm_area_struct *vma,
>>> + struct device *device,
>>> + unsigned long start,
>>> + unsigned long end,
>>> + dma_addr_t *pa);
>>
>> I don't understand why we need new p2p_[un]map function pointers for
>> this. In subsequent patches, they never appear to be set anywhere and
>> are only called by the HMM code. I'd have expected it to be called by
>> some core VMA code and set by HMM as that's what vm_operations_struct is
>> for.
>>
>> But the code as all very confusing, hard to follow and seems to be
>> missing significant chunks. So I'm not really sure what is going on.
> 
> It is set by device driver when userspace do mmap(fd) where fd comes
> from open("/dev/somedevicefile"). So it is set by device driver. HMM
> has nothing to do with this. It must be set by device driver mmap
> call back (mmap callback of struct file_operations). For this patch
> you can completely ignore all the HMM patches. Maybe posting this as
> 2 separate patchset would make it clearer.
> 
> For instance see [1] for how a non HMM driver can export its memory
> by just setting those callback. Note that a proper implementation of
> this should also include some kind of driver policy on what to allow
> to map and what to not allow ... All this is driver specific in any
> way.

I'd suggest [1] should be a part of the patchset so we can actually see
a user of the stuff you're adding.

But it still doesn't explain everything: without the HMM code, nothing
calls the new vm_ops. And there are still no callers for the p2p_test
functions you added. And I still don't understand why we need the new
vm_ops, or who calls them and when. Why can't drivers use the existing
'fault' vm_op and call a new helper function to map p2p when
appropriate, or a different helper function to map a large range in
their mmap operation? Just like regular mmap code...
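
Something along these lines is what I have in mind (purely a
hypothetical sketch, not working code; allocation tracking and freeing
are elided):

	/* An exporting driver lazily faulting its own p2pmem into a
	 * userspace VMA with an ordinary fault handler -- no new
	 * vm_ops required. Assumes the driver's mmap() stashed the
	 * pci_dev in vm_private_data. */
	static vm_fault_t p2pmem_fault(struct vm_fault *vmf)
	{
		struct pci_dev *pdev = vmf->vma->vm_private_data;
		void *addr = pci_alloc_p2pmem(pdev, PAGE_SIZE);

		if (!addr)
			return VM_FAULT_SIGBUS;

		if (vm_insert_page(vmf->vma, vmf->address,
				   virt_to_page(addr)))
			return VM_FAULT_SIGBUS;

		return VM_FAULT_NOPAGE;
	}

	static const struct vm_operations_struct p2pmem_vm_ops = {
		.fault = p2pmem_fault,
	};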

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 12:44 p.m., Jerome Glisse wrote:
>> I'd suggest [1] should be a part of the patchset so we can actually see
>> a user of the stuff you're adding.
> 
> I did not wanted to clutter patchset with device driver specific usage
> of this. As the API can be reason about in abstract way.

It's hard to reason about an interface when you can't see what all the
layers want to do with it. Most maintainers (I'd hope) would certainly
never merge code that has no callers, and for much the same reason, I'd
rather not review patches that don't have real use case examples.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 1:57 p.m., Jerome Glisse wrote:
> GPU driver must be in control and must be call to. Here there is 2 cases
> in this patchset and i should have instead posted 2 separate patchset as
> it seems that it is confusing things.
> 
> For the HMM page, the physical address of the page ie the pfn does not
> correspond to anything ie there is nothing behind it. So the importing
> device has no idea how to get a valid physical address from an HMM page
> only the device driver exporting its memory with HMM device memory knows
> that.
> 
> 
> For the special vma ie mmap of a device file. GPU driver do manage their
> BAR ie the GPU have a page table that map BAR page to GPU memory and the
> driver _constantly_ update this page table, it is reflected by invalidating
> the CPU mapping. In fact most of the time the CPU mapping of GPU object are
> invalid they are valid only a small fraction of their lifetime. So you
> _must_ have some call to inform the exporting device driver that another
> device would like to map one of its vma. The exporting device can then
> try to avoid as much churn as possible for the importing device. But this
> has consequence and the exporting device driver must be allow to apply
> policy and make decission on wether or not it authorize the other device
> to peer map its memory. For GPU the userspace application have to call
> specific API that translate into specific ioctl which themself set flags
> on object (in the kernel struct tracking the user space object). The only
> way to allow program predictability is if the application can ask and know
> if it can peer export an object (ie is there enough BAR space left).

This all seems like it's an HMM problem and not related to mapping
BARs/"potential BARs" to userspace. If some code wants to DMA map HMM
pages, it calls an HMM function to map them. If HMM needs to consult
with the driver on aspects of how that's mapped, then that's between HMM
and the driver and not something I really care about. But making the
entire mapping stuff tied to userspace VMAs does not make sense to me.
What if somebody wants to map some HMM pages in the same way but from
kernel space and they therefore don't have a VMA?


>> I also figured there'd be a fault version of p2p_ioremap_device_memory()
>> for when you are mapping P2P memory and you want to assign the pages
>> lazily. Though, this can come later when someone wants to implement that.
> 
> For GPU the BAR address space is manage page by page and thus you do not
> want to map a range of BAR addresses but you want to allow mapping of
> multiple page of BAR address that are not adjacent to each other nor
> ordered in anyway. But providing helper for simpler device does make sense.

Well, this has little to do with the backing device, but rather with how
the memory is mapped into userspace. With p2p_ioremap_device_memory()
the entire range is mapped into the userspace VMA immediately during the
call to mmap(). With p2p_fault_device_memory(), mmap() would not
actually map anything, and a page in the VMA would be mapped only when
userspace accesses it (using fault()). It seems to me like GPUs would
prefer the latter, but if HMM takes care of the mapping from userspace
potential pages to actual GPU pages through the BAR, then that may not
be true.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 4:47 p.m., Jerome Glisse wrote:
> The whole point is to allow to use device memory for range of virtual
> address of a process when it does make sense to use device memory for
> that range. So they are multiple cases where it does make sense:
> [1] - Only the device is accessing the range and they are no CPU access
>   For instance the program is executing/running a big function on
>   the GPU and they are not concurrent CPU access, this is very
>   common in all the existing GPGPU code. In fact AFAICT It is the
>   most common pattern. So here you can use HMM private or public
>   memory.
> [2] - Both device and CPU access a common range of virtul address
>   concurrently. In that case if you are on a platform with cache
>   coherent inter-connect like OpenCAPI or CCIX then you can use
>   HMM public device memory and have both access the same memory.
>   You can not use HMM private memory.
> 
> So far on x86 we only have PCIE and thus so far on x86 we only have
> private HMM device memory that is not accessible by the CPU in any
> way.

I feel like you're just pulling the rug out from under us... Before, you
said to ignore HMM, and I was asking about the use case that wasn't
using HMM and how it works without HMM. In response, you just give me
*way* too much information describing HMM. And still, as best as I can
see, managing DMA mappings (which is different from the userspace
mappings) for GPU P2P should be handled by HMM, and the userspace
mappings should *just* link VMAs to HMM pages using the standard
infrastructure we already have.

>> And what struct pages are actually going to be backing these VMAs if
>> it's not using HMM?
> 
> When you have some range of virtual address migrated to HMM private
> memory then the CPU pte are special swap entry and they behave just
> as if the memory was swapped to disk. So CPU access to those will
> fault and trigger a migration back to main memory.

This isn't answering my question at all... I specifically asked what is
backing the VMA when we are *not* using HMM.

Logan


Re: [RFC PATCH 2/5] drivers/base: add a function to test peer to peer capability

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> device_test_p2p() return true if two devices can peer to peer to
> each other. We add a generic function as different inter-connect
> can support peer to peer and we want to genericaly test this no
> matter what the inter-connect might be. However this version only
> support PCIE for now.

This doesn't appear to be used in any of the further patches, so it's
very confusing.

I'm not sure a struct device wrapper is really necessary...

Logan


Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability

2019-01-30 Thread Logan Gunthorpe


On 2019-01-29 12:44 p.m., Greg Kroah-Hartman wrote:
> On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:
>>> +bool pci_test_p2p(struct device *devA, struct device *devB)
>>> +{
>>> +   struct pci_dev *pciA, *pciB;
>>> +   bool ret;
>>> +   int tmp;
>>> +
>>> +   /*
>>> +* For now we only support PCIE peer to peer but other inter-connect
>>> +* can be added.
>>> +*/
>>> +   pciA = find_parent_pci_dev(devA);
>>> +   pciB = find_parent_pci_dev(devB);
>>> +   if (pciA == NULL || pciB == NULL) {
>>> +   ret = false;
>>> +   goto out;
>>> +   }
>>> +
>>> +   tmp = upstream_bridge_distance(pciA, pciB, NULL);
>>> +   ret = tmp < 0 ? false : true;
>>> +
>>> +out:
>>> +   pci_dev_put(pciB);
>>> +   pci_dev_put(pciA);
>>> +   return false;
>>> +}
>>> +EXPORT_SYMBOL_GPL(pci_test_p2p);
>>
>> This function only ever returns false
> 
> I guess it was nevr actually tested :(
> 
> I feel really worried about passing random 'struct device' pointers into
> the PCI layer.  Are we _sure_ it can handle this properly?

Yes, there are a couple of pci_p2pdma functions that take struct devices
directly simply because it's way more convenient for the caller. That's
what find_parent_pci_dev() takes care of (it returns NULL if the device
is not a PCI device). Whether that's appropriate here is hard to say,
seeing as we haven't seen any caller code.

Logan




Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling

2018-12-02 Thread Logan Gunthorpe


On 2018-11-30 3:28 p.m., Dan Williams wrote:
> On Fri, Nov 30, 2018 at 2:19 PM Logan Gunthorpe  wrote:
>>
>> Hey,
>>
>> On 2018-11-29 11:51 a.m., Dan Williams wrote:
>>> Got it, let me see how bad moving arch_remove_memory() turns out,
>>> sounds like a decent approach to coordinate multiple users of a single
>>> ref.
>>
>> I've put together a patch set[1] that fixes all the users of
>> devm_memremap_pages() without moving arch_remove_memory(). It's pretty
>> clean except for the p2pdma case which is fairly tricky but I don't
>> think there's an easy way around that.
> 
> The solution I'm trying is to introduce a devm_memremap_pages_remove()
> that each user can call after they have called percpu_ref_exit(), it's
> just crashing for me currently...

Ok, that's probably less of a cleanup for other users, but it sounds
like it would be less tricky for p2pdma. I'd have to create a list of
all pgmaps, but that's not so hard and doesn't create any nasty races to
consider like my current solution does.

>> If you come up with a better solution that's great, otherwise let me
>> know and I'll do some clean up and more testing and send this set to the
>> lists. Though, we might need to wait for your patch to land before we
>> can properly send the fix to it (the first patch in my series)...
> 
> I'd say go ahead and send it. We can fix p2pdma as a follow-on. Send
> it to Andrew as a patch relative to the current -next tree.

Ok, though, how do I reference the current patch in Andrew's tree? Or
does it matter?

Logan


Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling

2018-12-02 Thread Logan Gunthorpe
Hey,

On 2018-11-29 11:51 a.m., Dan Williams wrote:
> Got it, let me see how bad moving arch_remove_memory() turns out,
> sounds like a decent approach to coordinate multiple users of a single
> ref.

I've put together a patch set[1] that fixes all the users of
devm_memremap_pages() without moving arch_remove_memory(). It's pretty
clean except for the p2pdma case which is fairly tricky but I don't
think there's an easy way around that.

If you come up with a better solution that's great, otherwise let me
know and I'll do some clean up and more testing and send this set to the
lists. Though, we might need to wait for your patch to land before we
can properly send the fix to it (the first patch in my series)...

Logan

[1] https://github.com/sbates130272/linux-p2pmem/ memremap_fix



Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling

2018-11-30 Thread Logan Gunthorpe


On 2018-11-28 8:10 p.m., Dan Williams wrote:
> Yes, please send a proper patch. 

Ok, I'll send one shortly.

> Although, I'm still not sure I see
> the problem with the order of the percpu-ref kill. It's likely more
> efficient to put the kill after the put_page() loop because the
> percpu-ref will still be in "fast" per-cpu mode, but the kernel panic
> should not be possible as long as their is a wait_for_completion()
> before the exit, unless something else is wrong.

The series of events looks something like this:

1) Some p2pdma user calls pci_alloc_p2pmem() to get some memory to DMA
to, taking a reference to the pgmap.
2) Another process unbinds the underlying p2pdma driver and the devm
chain starts to unwind.
3) devm_memremap_pages_release() is called; it kills the percpu_ref and
drops its last reference.
4) arch_remove_memory() is called, which removes all the struct pages.
5) We eventually get to pci_p2pdma_release(), where we wait for the
completion indicating all the pages have been freed.
6) The user in (1) tries to use a page that has been removed, typically
by calling pci_p2pdma_map_sg(), but the page doesn't exist, so the
kernel panics.

So we really need the wait in (5) to occur before (4) but after (3) so
that the pages continue to exist until the last reference is dropped.

> Certainly you can't move the wait_for_completion() into your ->kill()
> callback without switching the ordering, but I'm not on board with
> that change until I understand a bit more about why you think
> device-dax might be broken?
> 
> I took a look at the p2pdma shutdown path and the:
> 
> if (percpu_ref_is_dying(ref))
> return;
> ...looks fishy. If multiple agents can overlap their requests for the
> same range why not track that simply as additional refs? Could it be
> the crash that you are seeing is a result of mis-accounting when it is
> safe to assume the page allocation can be freed?

Yeah, someone else mentioned the same thing during review, but if I
remove it, there can be a double kill() on a hypothetical driver that
might call pci_p2pdma_add_resource() twice. The issue is we only have
one percpu_ref per device, not one per range/BAR.

Though, now that I look at it, the current change in question will be
wrong if there are two devm_memremap_pages_release()s to call. Both need
to drop their references before we can wait_for_completion() ;(. I guess
I need multiple percpu_refs or more complex changes to
devm_memremap_pages_release().

Thanks

Logan



Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling

2018-11-30 Thread Logan Gunthorpe


On 2018-11-29 10:30 a.m., Dan Williams wrote:
> Oh! Yes, nice find. We need to wait for the percpu-ref to be dead and
> all outstanding references dropped before we can proceed to
> arch_remove_memory(), and I think this problem has been there since
> day one because the final exit was always after devm_memremap_pages()
> release which means arch_remove_memory() was always racing any final
> put_page(). I'll take a look, it seems the arch_remove_pages() call
> needs to be moved out-of-line to its own context and wait for the
> final exit of the percpu-ref.

Ok, well I thought moving the wait_for_completion() into the kill() call
was a pretty good solution to this. Though, if we move the
arch_remove_pages() into a different context, it *may* help with the
problem below...

>> Though, now that I look at it, the current change in question will be
>> wrong if there are two devm_memremap_pages_release()s to call. Both need
>> to drop their references before we can wait_for_completion() ;(. I guess
>> I need multiple percpu_refs or more complex changes to
>> devm_memremap_pages_release().
> 
> Can you just have a normal device-level kref for this case? On final
> device-level kref_put then kill the percpu_ref? I guess the problem is
> devm semantics where p2pdma only gets one callback on a driver
> ->remove() event. I'm not sure how to support multiple references of
> the same pages without creating a non-devm version of
> devm_memremap_pages(). I'm not opposed to that, but afaiu I don't
> think p2pdma is compatible with devm as long as it supports N>1:1
> mappings of the same range.

Hmm, no, I think you misunderstood what I said. I'm saying I need to
have exactly one percpu_ref per call to devm_memremap_pages(), and this
is doable, just slightly annoying. Right now I have one percpu_ref for
multiple calls to devm_memremap_pages(), which doesn't work with the
above fix because there will always be a wait_for_completion() before
the last references are dropped, in this way:

1) The first devm_memremap_pages_release() is called, which drops its
reference and calls wait_for_completion().

2) The second devm_memremap_pages_release() needs to be called to drop
its reference, but can't since the first is waiting, and therefore the
percpu_ref never goes to zero and the wait_for_completion() never
returns.

Logan


Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling

2018-11-28 Thread Logan Gunthorpe
Hey Dan,

On 2018-11-20 4:13 p.m., Dan Williams wrote:
> The last step before devm_memremap_pages() returns success is to
> allocate a release action, devm_memremap_pages_release(), to tear the
> entire setup down. However, the result from devm_add_action() is not
> checked.
> 
> Checking the error from devm_add_action() is not enough. The api
> currently relies on the fact that the percpu_ref it is using is killed
> by the time the devm_memremap_pages_release() is run. Rather than
> continue this awkward situation, offload the responsibility of killing
> the percpu_ref to devm_memremap_pages_release() directly. This allows
> devm_memremap_pages() to do the right thing  relative to init failures
> and shutdown.
> 
> Without this change we could fail to register the teardown of
> devm_memremap_pages(). The likelihood of hitting this failure is tiny as
> small memory allocations almost always succeed. However, the impact of
> the failure is large given any future reconfiguration, or
> disable/enable, of an nvdimm namespace will fail forever as subsequent
> calls to devm_memremap_pages() will fail to setup the pgmap_radix since
> there will be stale entries for the physical address range.
> 
> An argument could be made to require that the ->kill() operation be set
> in the @pgmap arg rather than passed in separately. However, it helps
> code readability, tracking the lifetime of a given instance, to be able
> to grep the kill routine directly at the devm_memremap_pages() call
> site.
> 
> Cc: 
> Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
> Reviewed-by: "Jérôme Glisse" 
> Reported-by: Logan Gunthorpe 
> Reviewed-by: Logan Gunthorpe 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Dan Williams 

I recently realized this patch, which was just added to the mm tree,
will break p2pdma. This is largely because the patch was written and
reviewed before p2pdma was merged (in 4.20). Originally, I think we both
expected this patch would be merged before p2pdma, but that's not what
happened.

Also, while testing this, I found the teardown is still not quite
correct. In p2pdma, the struct pages will be removed before all of the
percpu references have been released, and if the device is unbound while
pages are in use, there will be a kernel panic. This is because we wait
on the completion that indicates all references have been freed after
devm_memremap_pages_release() is called and the pages are removed. This
is fairly easily fixed by waiting for the completion in the kill
function and moving the call after the last put_page(). I suspect device
DAX also has this problem, but I'm not entirely certain if something
else might be preventing us from hitting this bug.

Ideally, as part of this patch we need to update the p2pdma call site
for devm_memremap_pages() and fix the completion issue. The diff for all
this is below, but if you'd like I can send a proper patch.

Thanks,

Logan

--


diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ae3c5b25dcc7..1df7bdb45eab 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -82,9 +82,10 @@ static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
 	complete_all(&p2p->devmap_ref_done);
 }
 
-static void pci_p2pdma_percpu_kill(void *data)
+static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
 {
-	struct percpu_ref *ref = data;
+	struct pci_p2pdma *p2p =
+		container_of(ref, struct pci_p2pdma, devmap_ref);
 
 	/*
 	 * pci_p2pdma_add_resource() may be called multiple times
@@ -96,6 +97,7 @@ static void pci_p2pdma_percpu_kill(void *data)
 		return;
 
 	percpu_ref_kill(ref);
+	wait_for_completion(&p2p->devmap_ref_done);
 }
 
 static void pci_p2pdma_release(void *data)
@@ -105,7 +107,6 @@ static void pci_p2pdma_release(void *data)
 	if (!pdev->p2pdma)
 		return;
 
-	wait_for_completion(&pdev->p2pdma->devmap_ref_done);
 	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
 
 	gen_pool_destroy(pdev->p2pdma->pool);
@@ -198,6 +199,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
 	pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
 		pci_resource_start(pdev, bar);
+	pgmap->kill = pci_p2pdma_percpu_kill;
 
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr)) {
@@ -211,11 +213,6 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	if (error)
 		goto pgmap_free;
 
-	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
-			&pdev->p2pdma->devmap_ref);
-	if (error)
-		goto pgmap_free;
-
 	pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
 		 &pgmap->res);

Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-04-03 Thread Logan Gunthorpe


On 02/04/18 11:20 AM, Jerome Glisse wrote:
> The point i have been trying to get accross is that you do have this
> information with dma_map_resource() you know the device to which you
> are trying to map (dev argument to dma_map_resource()) and you can
> easily get the device to which the memory belongs because you have the
> CPU physical address of the memory hence you can lookup the resource
> and get the device from that.

How do you go from a physical address to a struct device generally and
in a performant manner?

> IIRC CAPI make P2P mandatory but maybe this is with NVLink. We can ask
> the PowerPC folks to confirm. Note CAPI is Power8 and newer AFAICT.

PowerPC folks recently told us specifically that Power9 does not support
P2P between PCI root ports. I've said this many times. CAPI has nothing
to do with it.

> Mapping to userspace have nothing to do here. I am talking at hardware
> level. How thing are expose to userspace is a completely different
> problems that do not have one solution fit all. For GPU you want this
> to be under total control of GPU drivers. For storage like persistent
> memory, you might want to expose it userspace more directly ...

My understanding (and I worked on this a while ago) is that CAPI
hardware manages memory maps, typically for userspace memory. When a
userspace program changes its mapping, the CAPI hardware is updated so
that the hardware is coherent with the user address space and it is safe
to DMA to any address without having to pin memory. (This is very similar
to ODP in RNICs.) This is *really* nice but doesn't solve *any* of the
problems we've been discussing. Moreover, many developers want to keep
P2P in-kernel, for the time being, where the problem of pinning memory
does not exist.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-04-03 Thread Logan Gunthorpe


On 30/03/18 01:45 PM, Jerome Glisse wrote:
> Looking at upstream code it seems that the x86 bits never made it upstream
> and thus what is now upstream is only for ARM. See [1] for x86 code. Dunno
> what happen, i was convince it got merge. So yes current code is broken on
> x86. ccing Joerg maybe he remembers what happened there.
> 
> [1] https://lwn.net/Articles/646605/

That looks like a significant improvement over what's upstream. But it's
three years old and looks to have been abandoned. I think I agree with
Bjorn's comments in that if it's going to be a general P2P API it will
need the device the resource belongs to in addition to the device that's
initiating the DMA request.

> Given it is currently only used by ARM folks it appear to at least work
> for them (tm) :) Note that Christian is doing this in PCIE only context
> and again dma_map_resource can easily figure that out if the address is
> a PCIE or something else. Note that the exporter export the CPU bus
> address. So again dma_map_resource has all the informations it will ever
> need, if the peer to peer is fundamentaly un-doable it can return dma
> error and it is up to the caller to handle this, just like GPU code do.
> 
> Do you claim that dma_map_resource is missing any information ?

Yes, that's what I said. All the existing ARM code appears to use it for
platform devices only. I suspect platform P2P is relatively trivial to
support on ARM. I think it's a big difference from using it for PCI P2P
or general P2P on any bus.

> I agree and understand that but for platform where such feature make sense
> this will work. For me it is PowerPC and x86 and given PowerPC has CAPI
> which has far more advance feature when it comes to peer to peer, i don't
> see something more basic not working. On x86, Intel is a bit of lone wolf,
> dunno if they gonna support this usecase pro-actively. AMD definitly will.

Well PowerPC doesn't even support P2P between root ports. And I fail to
see how CAPI applies unless/until we get this memory mapped into
userspace and the mappings need to be dynamically managed. Seems like
that's a long way away.

Logan



Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-04-03 Thread Logan Gunthorpe


On 02/04/18 01:16 PM, Jerome Glisse wrote:
> There isn't good API at the moment AFAIK, closest thing would either be
> lookup_resource() or region_intersects(), but a more appropriate one can
> easily be added, code to walk down the tree is readily available. More-
> over this can be optimize like vma lookup are, even more as resource are
> seldomly added so read side (finding a resource) can be heavily favor
> over write side (adding|registering a new resource).

So someone needs to create a highly optimized tree that registers all
physical addresses on the system and maps them to devices? That seems a
long way from being realized. I'd hardly characterize that as "easily".
If we can pass both devices to the API I'd suspect it would be preferred
over the complicated tree. This, of course, depends on what users of the
API need.
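
For what it's worth, the closest existing primitive only classifies the
range; it doesn't name the owner. A minimal sketch (phys and size are
assumed to describe the candidate region) of where the gap sits:

    /*
     * region_intersects() can say whether a range overlaps System
     * RAM, but not which struct device owns it -- and the latter is
     * exactly the lookup that doesn't exist today.
     */
    if (region_intersects(phys, size, IORESOURCE_SYSTEM_RAM,
                          IORES_DESC_NONE) == REGION_INTERSECTS) {
            /* regular memory: the usual dma_map_page() path */
    } else {
            /* device memory: no generic way to find the owner */
    }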

> cache coherency protocol (bit further than PCIE snoop). But also the
> other direction the CPU access to device memory can also be cache coherent,
> which is not the case in PCIE.

I was not aware that CAPI allows PCI device memory to be cache coherent.
That sounds like it would be very tricky...

> Note that with mmu_notifier there isn't any need to pin stuff (even
> without any special hardware capabilities), as long as you can preempt
> what is happening on your hardware to update its page table.

I've been told there's a lot of dislike of the mmu_notifier interface.
And being able to preempt what's happening on hardware, generally, is
not trivial. But, yes, this is essentially how ODP works.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-04-03 Thread Logan Gunthorpe


On 29/03/18 07:58 PM, Jerome Glisse wrote:
> On Thu, Mar 29, 2018 at 10:25:52AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 29/03/18 10:10 AM, Christian König wrote:
>>> Why not? I mean the dma_map_resource() function is for P2P while other 
>>> dma_map_* functions are only for system memory.
>>
>> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
>> P2P. Though it's a bit odd seeing we've been working under the
>> assumption that PCI P2P is different as it has to translate the PCI bus
>> address. Where as P2P for devices on other buses is a big unknown.
> 
> dma_map_resource() is the right API (thought its current implementation
> is fill with x86 assumptions). So i would argue that arch can decide to
> implement it or simply return dma error address which trigger fallback
> path into the caller (at least for GPU drivers). SG variant can be added
> on top.
> 
> dma_map_resource() is the right API because it has all the necessary
> informations. It use the CPU physical address as the common address
> "language" with CPU physical address of PCIE bar to map to another
> device you can find the corresponding bus address from the IOMMU code
> (NOP on x86). So dma_map_resource() knows both the source device which
> export its PCIE bar and the destination devices.

Well, as it is today, it doesn't look very sane to me. The default is to
just return the physical address if the architecture doesn't support it.
So if someone tries this on an arch that hasn't added itself to return
an error they're very likely going to just end up DMAing to an invalid
address and losing the data or causing a machine check.
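
To make that failure mode concrete, here is a sketch of the fallback
behaviour being criticised (illustrative only, not the code of any
particular arch):

    /*
     * With no arch-specific handling, the CPU physical address is
     * simply handed back as the dma_addr_t. That happens to be
     * correct on x86 and silently wrong anywhere bus and CPU
     * addresses differ.
     */
    static dma_addr_t sketch_map_resource(struct device *dev,
                    phys_addr_t phys, size_t size,
                    enum dma_data_direction dir, unsigned long attrs)
    {
            return (dma_addr_t)phys;
    }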

Furthermore, the API does not have all the information it needs to do
sane things. A phys_addr_t doesn't really tell you anything about the
memory behind it or what needs to be done with it. For example, on some
arm64 implementations if the physical address points to a PCI BAR and
that BAR is behind a switch with the DMA device then the address must be
converted to the PCI BUS address. On the other hand, if it's a physical
address of a device in an SOC it might need to  be handled in a
completely different way. And right now all the implementations I can
find seem to just assume that phys_addr_t points to regular memory and
can be treated as such.

This is one of the reasons that, based on feedback, our work went from
being general P2P with any device to being restricted to only P2P with
PCI devices. The dream that we can just grab the physical address of any
device and use it in a DMA request is simply not realistic.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-30 Thread Logan Gunthorpe


On 29/03/18 05:44 AM, Christian König wrote:
> Am 28.03.2018 um 21:53 schrieb Logan Gunthorpe:
>>
>> On 28/03/18 01:44 PM, Christian König wrote:
>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>> I can see it makes sure IOMMU is aware of the access route and
>>> translates a CPU address into a PCI Bus address.
>>> I'm using that with the AMD IOMMU driver and at least there it works
>>> perfectly fine.
>> Yes, it would be nice, but no arch has implemented this yet. We are just
>> lucky in the x86 case because that arch is simple and doesn't need to do
>> anything for P2P (partially due to the Bus and CPU addresses being the
>> same). But in the general case, you can't rely on it.
> 
> Well, that an arch hasn't implemented it doesn't mean that we don't have 
> the right interface to do it.

Yes, but right now we don't have a performant way to check if we are
doing P2P or not in the dma_map_X() wrappers. And this is necessary to
check if the DMA ops in use support it or not. We can't have the
dma_map_X() functions do the wrong thing because they don't support it yet.
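
For a sense of what such a check could look like: a sketch keyed off
the ZONE_DEVICE pgmap type (MEMORY_DEVICE_PCI_P2PDMA is what our
patchset introduces; the helper name here is hypothetical):

    /*
     * Hypothetical constant-time test the dma_map_X() wrappers
     * would need before they could handle P2P pages correctly.
     */
    static inline bool page_is_pci_p2pdma(const struct page *page)
    {
            return is_zone_device_page(page) &&
                   page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
    }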

> Devices integrated in the CPU usually only "claim" to be PCIe devices. 
> In reality their memory request path go directly through the integrated 
> north bridge. The reason for this is simple better throughput/latency.

These are just more reasons why our patchset restricts to devices behind
a switch. And more mess for someone to deal with if they need to relax
that restriction.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-30 Thread Logan Gunthorpe


On 28/03/18 12:28 PM, Christian König wrote:
> I'm just using amdgpu as blueprint because I'm the co-maintainer of it 
> and know it mostly inside out.

Ah, I see.

> The resource addresses are translated using dma_map_resource(). As far 
> as I know that should be sufficient to offload all the architecture 
> specific stuff to the DMA subsystem.

It's not. The dma_map infrastructure currently has no concept of
peer-to-peer mappings and is designed for system memory only. No
architecture I'm aware of will translate PCI CPU addresses into PCI Bus
addresses which is necessary for any transfer that doesn't go through
the root complex (though on arches like x86 the CPU and Bus address
happen to be the same). There's a lot of people that would like to see
this change but it's likely going to be a long road before it does.

Furthermore, one of the reasons our patch-set avoids going through the
root complex at all is that IOMMU drivers will need to be made aware
that it is operating on P2P memory and do arch-specific things
accordingly. There will also need to be flags that indicate whether a
given IOMMU driver supports this. None of this work is done or easy.

> Yeah, but not for ours. See if you want to do real peer 2 peer you need 
> to keep both the operation as well as the direction into account.

Not sure what you are saying here... I'm pretty sure we are doing "real"
peer 2 peer...

> For example when you can do writes between A and B that doesn't mean 
> that writes between B and A work. And reads are generally less likely to 
> work than writes. etc...

If both devices are behind a switch then the PCI spec guarantees that A
can both read and write B and vice versa. Only once you involve root
complexes do you have this problem. Ie. you have unknown support which
may be no support, or partial support (stores but not loads); or
sometimes bad performance; or a combination of both... and you need some
way to figure out all this mess and that is hard. Whoever tries to
implement a white list will have to sort all this out.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-30 Thread Logan Gunthorpe


On 28/03/18 10:02 AM, Christian König wrote:
> Yeah, that looks very similar to what I picked up from the older 
> patches, going to read up on that after my vacation.

Yeah, I was just reading through your patchset and there are a lot of
similarities. Though, I'm not sure what you're trying to accomplish as I
could not find a cover letter and it seems to only enable one driver. Is
it meant to enable DMA transactions only between two AMD GPUs?

I also don't see where you've taken into account the PCI bus address. On
some architectures this is not the same as the CPU physical address.

> Just in general why are you interested in the "distance" of the devices?

We've taken a general approach where some drivers may provide p2p memory
(ie. an NVMe card or an RDMA NIC) and other drivers make use of it (ie.
the NVMe-of driver). The orchestrator driver needs to find the most
applicable provider device for a transaction in a situation that may
have multiple providers and multiple clients. So the most applicable
provider is the one that's closest ("distance"-wise) to all the clients
for the P2P transaction.

> And BTW: At least for writes that Peer 2 Peer transactions between 
> different root complexes work is actually more common than the other way 
> around.

Maybe on x86 with hardware made in the last few years. But on PowerPC,
ARM64, and likely a lot more the chance of support is *much* less. Also,
hardware that only supports P2P stores is hardly full support and is
insufficient for our needs.

> So I'm a bit torn between using a blacklist or a whitelist. A whitelist 
> is certainly more conservative approach, but that could get a bit long.

I think a whitelist approach is correct. Given old hardware and other
architectures, a black list is going to be too long and too difficult to
comprehensively populate.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-30 Thread Logan Gunthorpe


On 28/03/18 09:07 AM, Christian König wrote:
> Am 28.03.2018 um 14:38 schrieb Christoph Hellwig:
>> On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
>>> From: "wda...@nvidia.com" 
>>>
>>> Add an interface to find the first device which is upstream of both
>>> devices.
>> Please work with Logan and base this on top of the outstanding peer
>> to peer patchset.
> 
> Can you point me to that? The last code I could find about that was from 
> 2015.

The latest posted series is here:

https://lkml.org/lkml/2018/3/12/830

However, we've made some significant changes to the area that's similar
to what you are doing. You can find the latest un-posted code here:

https://github.com/sbates130272/linux-p2pmem/tree/pci-p2p-v4-pre2

Specifically this function would be of interest to you:

https://github.com/sbates130272/linux-p2pmem/blob/0e9468ae2a5a5198513dd12990151e09105f0351/drivers/pci/p2pdma.c#L239

However, the difference between what we are doing is that we are
interested in the distance through the common upstream device and you
appear to be finding the actual common device.
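
Roughly speaking, the distance walks each device's chain of upstream
bridges until it hits the first bridge common to both, and sums the
hops. A simplified sketch of the idea (not the exact code in the tree
above):

    static int upstream_distance_sketch(struct pci_dev *a, struct pci_dev *b)
    {
            struct pci_dev *up_a, *up_b;
            int dist_a = 0, dist_b;

            for (up_a = a; up_a; up_a = pci_upstream_bridge(up_a), dist_a++) {
                    dist_b = 0;
                    for (up_b = b; up_b;
                         up_b = pci_upstream_bridge(up_b), dist_b++)
                            if (up_a == up_b)
                                    return dist_a + dist_b;
            }

            return -1; /* no common upstream device */
    }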

Thanks,

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-30 Thread Logan Gunthorpe


On 29/03/18 10:10 AM, Christian König wrote:
> Why not? I mean the dma_map_resource() function is for P2P while other 
> dma_map_* functions are only for system memory.

Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
P2P. Though it's a bit odd seeing we've been working under the
assumption that PCI P2P is different as it has to translate the PCI bus
address. Where as P2P for devices on other buses is a big unknown.

>> And this is necessary to
>> check if the DMA ops in use support it or not. We can't have the
>> dma_map_X() functions do the wrong thing because they don't support it yet.
> 
> Well that sounds like we should just return an error from 
> dma_map_resources() when an architecture doesn't support P2P yet as Alex 
> suggested.

Yes, well except in our patch-set we can't easily use
dma_map_resources() as we either have SGLs to deal with or we need to
create whole new interfaces to a number of subsystems.
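
To illustrate the SGL side of it: even with universal arch support,
using dma_map_resource() would mean open-coding something like the
following everywhere an SGL can carry P2P memory (a sketch only --
error unwinding omitted, and not what our patchset actually does):

    static int map_p2p_sgl_sketch(struct device *dev,
                                  struct scatterlist *sgl, int nents,
                                  enum dma_data_direction dir)
    {
            struct scatterlist *sg;
            int i;

            for_each_sg(sgl, sg, nents, i) {
                    sg->dma_address = dma_map_resource(dev, sg_phys(sg),
                                                       sg->length, dir, 0);
                    if (dma_mapping_error(dev, sg->dma_address))
                            return -EIO;
                    sg_dma_len(sg) = sg->length;
            }

            return nents;
    }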

> You don't seem to understand the implications: The devices do have a 
> common upstream bridge! In other words your code would currently claim 
> that P2P is supported, but in practice it doesn't work.

Do they? They don't on any of the Intel machines I'm looking at. The
previous version of the patchset not only required a common upstream
bridge but two layers of upstream bridges on both devices which would
effectively limit transfers to PCIe switches only. But Bjorn did not
like this.

> You need to include both drivers which participate in the P2P 
> transaction to make sure that both supports this and give them 
> opportunity to chicken out and in the case of AMD APUs even redirect the 
> request to another location (e.g. participate in the DMA translation).

I don't think it's the driver's responsibility to reject P2P. The
topology is what governs support or not. The discussions we had with
Bjorn settled on if the devices are all behind the same bridge they can
communicate with each other. This is essentially guaranteed by the PCI spec.

> DMA-buf fortunately seems to handle all this already, that's why we 
> choose it as base for our implementation.

Well, unfortunately DMA-buf doesn't help for the drivers we are working
with as neither the block layer nor the RDMA subsystem have any
interfaces for it.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-30 Thread Logan Gunthorpe


On 28/03/18 01:44 PM, Christian König wrote:
> Well, isn't that exactly what dma_map_resource() is good for? As far as 
> I can see it makes sure IOMMU is aware of the access route and 
> translates a CPU address into a PCI Bus address.

> I'm using that with the AMD IOMMU driver and at least there it works 
> perfectly fine.

Yes, it would be nice, but no arch has implemented this yet. We are just
lucky in the x86 case because that arch is simple and doesn't need to do
anything for P2P (partially due to the Bus and CPU addresses being the
same). But in the general case, you can't rely on it.

>>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>>> to keep both the operation as well as the direction into account.
>> Not sure what you are saying here... I'm pretty sure we are doing "real"
>> peer 2 peer...
>>
>>> For example when you can do writes between A and B that doesn't mean
>>> that writes between B and A work. And reads are generally less likely to
>>> work than writes. etc...
>> If both devices are behind a switch then the PCI spec guarantees that A
>> can both read and write B and vice versa.
> 
> Sorry to say that, but I know a whole bunch of PCI devices which 
> horrible ignores that.

Can you elaborate? As far as the device is concerned it shouldn't know
whether a request comes from a peer or from the host. If it does do
crazy stuff like that it's well out of spec. It's up to the switch (or
root complex if good support exists) to route the request to the device
and it's the root complex that tends to be what drops the load requests
which causes the asymmetries.

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2017-10-24 Thread Logan Gunthorpe



On 23/10/17 10:08 AM, David Laight wrote:

It is also worth checking that the hardware actually supports p2p transfers.
Writes are more likely to be supported than reads.
ISTR that some intel cpus support some p2p writes, but there could easily
be errata against them.


Ludwig mentioned a PCIe switch. The few switches I'm aware of support 
P2P transfers. So if everything is setup correctly, the TLPs shouldn't 
even touch the CPU.


But, yes, generally it's a good idea to start with writes and see if 
they work first.


Logan


Re: Enabling peer to peer device transactions for PCIe devices

2017-10-22 Thread Logan Gunthorpe

On 22/10/17 12:13 AM, Petrosyan, Ludwig wrote:
> But at first sight it has to be simple:
> The PCIe Write transactions are address routed, so if in the packet header 
> the other endpoint address is written the TLP has to be routed (by PCIe 
> Switch to the endpoint), the DMA reading from the end point is really write 
> transactions from the endpoint, usually (Xilinx core) to start DMA one has to 
> write to the DMA control register of the endpoint the destination address. So 
> I have change the device driver to set in this register the physical address 
> of the other endpoint (get_resource start called to other endpoint, and it is 
> the same address which I could see in lspci - -s bus-address of the 
> switch port, memories behind bridge), so now the endpoint has to start send 
> writes TLP with the other endpoint address in the TLP header.
> But this is not working (I want to understand why ...), but I could see the 
> first address of the destination endpoint is changed (with the wrong value 
> 0xFF),
> now I want to try prepare in the driver of one endpoint the DMA buffer , but 
> using physical address of the other endpoint,
> Could be it will never work, but I want to understand why, there is my error 
> ...

Hmm, well if I understand you correctly it sounds like, in theory, it
should work. But there could be any number of reasons why it does not.
You may need to get a hold of a PCIe analyzer to figure out what's
actually going on.

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2017-10-22 Thread Logan Gunthorpe

Hi Ludwig,

P2P transactions are still *very* experimental at the moment and take a 
lot of expertise to get working in a general setup. It will definitely 
require changes to the kernel, including the drivers of all the devices 
you are trying to make talk to each other. If you're up for it you can 
take a look at:


https://github.com/sbates130272/linux-p2pmem/

Which has our current rough work making NVMe fabrics use p2p transactions.

Logan

On 10/20/2017 6:36 AM, Ludwig Petrosyan wrote:

Dear Linux kernel group

my name is Ludwig Petrosyan I am working in DESY (Germany)

we are responsible for the control system of  all accelerators in DESY.

For a 7-8 years we have switched to MTCA.4 systems and using PCIe as a 
central Bus.


I am mostly responsible for the Linux drivers of the AMC Cards (PCIe 
endpoints).


The idea is start to use peer to peer transaction for PCIe endpoint (DMA 
and/or usual Read/Write)


Could You please advise me where to start, is there some Documentation 
how to do it.



with best regards


Ludwig


On 11/21/2016 09:36 PM, Deucher, Alexander wrote:
This is certainly not the first time this has been brought up, but I'd 
like to try and get some consensus on the best way to move this 
forward.  Allowing devices to talk directly improves performance and 
reduces latency by avoiding the use of staging buffers in system 
memory.  Also in cases where both devices are behind a switch, it 
avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, 
CUDA, HSA) that deal with this are pointer based.  Ideally we'd be 
able to take a CPU virtual address and be able to get to a physical 
address taking into account IOMMUs, etc.  Having struct pages for the 
memory would allow it to work more generally and wouldn't require as 
much explicit support in drivers that wanted to use it.

Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
Here is a relatively simple example of how this could work for 
testing.  This is obviously not a complete solution.
- Device memory will be registered with Linux memory sub-system by 
creating corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when 
CPU address points to the device memory

- put_page() will deal with struct pages for device memory
Previously proposed solutions and related proposals:
1. P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer 
(http://www.spinics.net/lists/linux-pci/msg44560.html)

Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support 
it, doesn't handle S/G in device memory.

2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory 
(https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. 
(https://patchwork.kernel.org/patch/8583221/)

Pro: Doesn't waste system memory for ZONE metadata
Cons: CPU access to ZONE metadata slow, may be lost, corrupted on 
device reset.

3. DMA-BUF
RDMA subsystem DMA-BUF support 
(http://www.spinics.net/lists/linux-rdma/msg38748.html)

Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in 
drivers.


4. iopmem
iopmem : A block device for PCIe memory 
(https://lwn.net/Articles/703895/)

5. HMM
Heterogeneous Memory Management 
(http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)


6. Some new mmap-like interface that takes a userptr and a length and 
returns a dma-buf and offset?

Alex



Re: [PATCH 6/7] drm/tilcdc: clean up ifdef hacks around iowrite64

2017-06-26 Thread Logan Gunthorpe
Hi Jyri,

Thanks for the ack. However, I'm reworking this patch set to use the
include/linux/io-64-nonatomic* headers which will explicitly devolve
into two 32-bit transfers. It's not clear whether this is appropriate
for the tilcdc driver as it was never setup to use 32-bit transfers
(unlike the others I had patched).

If you think it's ok, I can still patch this driver to use the
non-atomic headers. Otherwise I can leave it out. Please let me know.

Thanks,

Logan


On 26/06/17 02:55 AM, Jyri Sarha wrote:
> Acked-by: Jyri Sarha 
> 
> And thanks!
> 
>> ---
>>  drivers/gpu/drm/tilcdc/tilcdc_regs.h | 6 --
>>  1 file changed, 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/tilcdc/tilcdc_regs.h 
>> b/drivers/gpu/drm/tilcdc/tilcdc_regs.h
>> index e9ce725698a9..0b901405f30a 100644
>> --- a/drivers/gpu/drm/tilcdc/tilcdc_regs.h
>> +++ b/drivers/gpu/drm/tilcdc/tilcdc_regs.h
>> @@ -133,13 +133,7 @@ static inline void tilcdc_write64(struct drm_device 
>> *dev, u32 reg, u64 data)
>>  struct tilcdc_drm_private *priv = dev->dev_private;
>>  void __iomem *addr = priv->mmio + reg;
>>  
>> -#ifdef iowrite64
>>  iowrite64(data, addr);
>> -#else
>> -__iowmb();
>> -/* This compiles to strd (=64-bit write) on ARM7 */
>> -*(u64 __force *)addr = __cpu_to_le64(data);
>> -#endif
>>  }
>>  
>>  static inline u32 tilcdc_read(struct drm_device *dev, u32 reg)
>>
> 


Re: [PATCH 2/7] iomap: implement ioread64 and iowrite64

2017-06-26 Thread Logan Gunthorpe

On 6/26/2017 2:43 PM, Arnd Bergmann wrote:

This hardcodes the behavior of include/linux/io-64-nonatomic-hi-lo.h, which
I find rather confusing, as only about one in five drivers wants this
behavior.

I'd suggest you don't add it in lib/iomap.c at all for 32-bit architectures,
but rather use the same logic that we have for readq/writeq in
io-64-nonatomic-hi-lo.h and io-64-nonatomic-lo-hi.h, adding
{lo_hi,hi_lo}_{ioread,iowrite}{,be} to the same files, and provide
the {ioread,iowrite}{,be} macros only if they have not been defined
at that point.


Thanks Arnd. Yes, I'm already reworking this patchset to do exactly that.
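
For reference, the lo-hi flavour of those headers is built around
helpers shaped roughly like this (paraphrased from
include/linux/io-64-nonatomic-lo-hi.h, not a verbatim copy):

    static inline u64 lo_hi_readq(const volatile void __iomem *addr)
    {
            const volatile u32 __iomem *p = addr;
            u32 low, high;

            low = readl(p);             /* least-significant half first */
            high = readl(p + 1);

            return low + ((u64)high << 32);
    }

    static inline void lo_hi_writeq(u64 val, volatile void __iomem *addr)
    {
            writel(val, addr);
            writel(val >> 32, addr + 4);
    }

readq/writeq then get #defined to these only when the architecture
doesn't provide native ones.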

Logan



Re: [PATCH 7/7] crypto: caam: cleanup CONFIG_64BIT ifdefs when using io{read|write}64

2017-06-25 Thread Logan Gunthorpe
Thanks Horia.

I'm inclined to just use your patch verbatim. I can set you as author,
but no matter how I do it, I'll need your Signed-off-by.

Logan

On 23/06/17 12:51 AM, Horia Geantă wrote:
> On 6/22/2017 7:49 PM, Logan Gunthorpe wrote:
>> Now that ioread64 and iowrite64 are always available we don't
>> need the ugly ifdefs to change their implementation when they
>> are not.
>>
> Thanks Logan.
> 
> Note however this is not equivalent - it changes the behaviour, since
> CAAM engine on i.MX6S/SL/D/Q platforms is broken in terms of 64-bit
> register endianness - see CONFIG_CRYPTO_DEV_FSL_CAAM_IMX usage in code
> you are removing.
> 
> [Yes, current code has its problems, as it does not differentiate b/w
> i.MX platforms with and without the (unofficial) erratum, but this
> should be fixed separately.]
> 
> Below is the change that would keep current logic - still forcing i.MX
> to write CAAM 64-bit registers in BE even if the engine is LE (yes, diff
> is doing a poor job).
> 
> Horia
> 
> diff --git a/drivers/crypto/caam/regs.h b/drivers/crypto/caam/regs.h
> index 84d2f838a063..b893ebb24e65 100644
> --- a/drivers/crypto/caam/regs.h
> +++ b/drivers/crypto/caam/regs.h
> @@ -134,50 +134,25 @@ static inline void clrsetbits_32(void __iomem
> *reg, u32 clear, u32 set)
>   *base + 0x : least-significant 32 bits
>   *base + 0x0004 : most-significant 32 bits
>   */
> -#ifdef CONFIG_64BIT
>  static inline void wr_reg64(void __iomem *reg, u64 data)
>  {
> +#ifndef CONFIG_CRYPTO_DEV_FSL_CAAM_IMX
> if (caam_little_end)
> iowrite64(data, reg);
> else
> -   iowrite64be(data, reg);
> -}
> -
> -static inline u64 rd_reg64(void __iomem *reg)
> -{
> -   if (caam_little_end)
> -   return ioread64(reg);
> -   else
> -   return ioread64be(reg);
> -}
> -
> -#else /* CONFIG_64BIT */
> -static inline void wr_reg64(void __iomem *reg, u64 data)
> -{
> -#ifndef CONFIG_CRYPTO_DEV_FSL_CAAM_IMX
> -   if (caam_little_end) {
> -   wr_reg32((u32 __iomem *)(reg) + 1, data >> 32);
> -   wr_reg32((u32 __iomem *)(reg), data);
> -   } else
>  #endif
> -   {
> -   wr_reg32((u32 __iomem *)(reg), data >> 32);
> -   wr_reg32((u32 __iomem *)(reg) + 1, data);
> -   }
> +   iowrite64be(data, reg);
>  }
> 
>  static inline u64 rd_reg64(void __iomem *reg)
>  {
>  #ifndef CONFIG_CRYPTO_DEV_FSL_CAAM_IMX
> if (caam_little_end)
> -   return ((u64)rd_reg32((u32 __iomem *)(reg) + 1) << 32 |
> -   (u64)rd_reg32((u32 __iomem *)(reg)));
> +   return ioread64(reg);
> else
>  #endif
> -   return ((u64)rd_reg32((u32 __iomem *)(reg)) << 32 |
> -   (u64)rd_reg32((u32 __iomem *)(reg) + 1));
> +   return ioread64be(reg);
>  }
> -#endif /* CONFIG_64BIT  */
> 
>  #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>  #ifdef CONFIG_SOC_IMX7D
> 
> 
>> Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
>> Cc: "Horia Geantă" <horia.gea...@nxp.com>
>> Cc: Dan Douglass <dan.dougl...@nxp.com>
>> Cc: Herbert Xu <herb...@gondor.apana.org.au>
>> Cc: "David S. Miller" <da...@davemloft.net>
>> ---
>>  drivers/crypto/caam/regs.h | 29 -
>>  1 file changed, 29 deletions(-)
>>
>> diff --git a/drivers/crypto/caam/regs.h b/drivers/crypto/caam/regs.h
>> index 84d2f838a063..26fc19dd0c39 100644
>> --- a/drivers/crypto/caam/regs.h
>> +++ b/drivers/crypto/caam/regs.h
>> @@ -134,7 +134,6 @@ static inline void clrsetbits_32(void __iomem *reg, u32 
>> clear, u32 set)
>>   *base + 0x : least-significant 32 bits
>>   *base + 0x0004 : most-significant 32 bits
>>   */
>> -#ifdef CONFIG_64BIT
>>  static inline void wr_reg64(void __iomem *reg, u64 data)
>>  {
>>  if (caam_little_end)
>> @@ -151,34 +150,6 @@ static inline u64 rd_reg64(void __iomem *reg)
>>  return ioread64be(reg);
>>  }
>>  
>> -#else /* CONFIG_64BIT */
>> -static inline void wr_reg64(void __iomem *reg, u64 data)
>> -{
>> -#ifndef CONFIG_CRYPTO_DEV_FSL_CAAM_IMX
>> -if (caam_little_end) {
>> -wr_reg32((u32 __iomem *)(reg) + 1, data >> 32);
>> -wr_reg32((u32 __iomem *)(reg), data);
>> -} else
>> -#endif
>> -{
>> -wr_reg32((u32 __iomem *)(reg), data >> 32);
>> -wr_reg32((u32 __iomem *)(reg) + 1, data);
>>

[PATCH 0/7] cleanup issues with io{read|write}64

2017-06-22 Thread Logan Gunthorpe
Hi,

Presently, the 64bit IO functions are not very usable in drivers because
they are not universally available on all architectures. This leads to
a bunch of hacks in the kernel to work around this. (See the last 3
patches in this series.) As part of my switchtec_ntb submission which
added another one of these warts, Greg asked me to look into fixing
it[1].

So this patchset attempts to solve this issue by filling in the missing
implementations in iomap.c and io.h. After that, the alpha architecture is
the only one I found that also needed a fix for this. Finally, this
patchset removes the hacks that have accumulated in the kernel,
thus far, for working around this.

This set is based off of v4.12-rc6.

Thanks,

Logan

[1] https://marc.info/?l=linux-kernel&m=149774601910663&w=2

Logan Gunthorpe (7):
  drm/tilcdc: don't use volatile with iowrite64
  iomap: implement ioread64 and iowrite64
  asm-generic/io.h: make ioread64 and iowrite64 universally available
  alpha: provide ioread64 and iowrite64 implementations
  ntb: ntb_hw_intel: remove ioread64 and iowrite64 hacks
  drm/tilcdc: clean up ifdef hacks around iowrite64
  crypto: caam: cleanup CONFIG_64BIT ifdefs when using io{read|write}64

 arch/alpha/include/asm/io.h  |  2 ++
 arch/alpha/kernel/io.c   | 18 +++
 arch/powerpc/include/asm/io.h|  2 ++
 drivers/crypto/caam/regs.h   | 29 -
 drivers/gpu/drm/tilcdc/tilcdc_regs.h |  8 +
 drivers/ntb/hw/intel/ntb_hw_intel.c  | 30 -
 include/asm-generic/io.h | 54 ---
 include/asm-generic/iomap.h  |  4 ---
 lib/iomap.c  | 62 
 9 files changed, 127 insertions(+), 82 deletions(-)

--
2.11.0


[PATCH 7/7] crypto: caam: cleanup CONFIG_64BIT ifdefs when using io{read|write}64

2017-06-22 Thread Logan Gunthorpe
Now that ioread64 and iowrite64 are always available we don't
need the ugly ifdefs to change their implementation when they
are not.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: "Horia Geantă" <horia.gea...@nxp.com>
Cc: Dan Douglass <dan.dougl...@nxp.com>
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: "David S. Miller" <da...@davemloft.net>
---
 drivers/crypto/caam/regs.h | 29 -
 1 file changed, 29 deletions(-)

diff --git a/drivers/crypto/caam/regs.h b/drivers/crypto/caam/regs.h
index 84d2f838a063..26fc19dd0c39 100644
--- a/drivers/crypto/caam/regs.h
+++ b/drivers/crypto/caam/regs.h
@@ -134,7 +134,6 @@ static inline void clrsetbits_32(void __iomem *reg, u32 
clear, u32 set)
  *base + 0x : least-significant 32 bits
  *base + 0x0004 : most-significant 32 bits
  */
-#ifdef CONFIG_64BIT
 static inline void wr_reg64(void __iomem *reg, u64 data)
 {
if (caam_little_end)
@@ -151,34 +150,6 @@ static inline u64 rd_reg64(void __iomem *reg)
return ioread64be(reg);
 }
 
-#else /* CONFIG_64BIT */
-static inline void wr_reg64(void __iomem *reg, u64 data)
-{
-#ifndef CONFIG_CRYPTO_DEV_FSL_CAAM_IMX
-   if (caam_little_end) {
-   wr_reg32((u32 __iomem *)(reg) + 1, data >> 32);
-   wr_reg32((u32 __iomem *)(reg), data);
-   } else
-#endif
-   {
-   wr_reg32((u32 __iomem *)(reg), data >> 32);
-   wr_reg32((u32 __iomem *)(reg) + 1, data);
-   }
-}
-
-static inline u64 rd_reg64(void __iomem *reg)
-{
-#ifndef CONFIG_CRYPTO_DEV_FSL_CAAM_IMX
-   if (caam_little_end)
-   return ((u64)rd_reg32((u32 __iomem *)(reg) + 1) << 32 |
-   (u64)rd_reg32((u32 __iomem *)(reg)));
-   else
-#endif
-   return ((u64)rd_reg32((u32 __iomem *)(reg)) << 32 |
-   (u64)rd_reg32((u32 __iomem *)(reg) + 1));
-}
-#endif /* CONFIG_64BIT  */
-
 #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
 #ifdef CONFIG_SOC_IMX7D
 #define cpu_to_caam_dma(value) \
-- 
2.11.0



Re: [PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available

2017-06-22 Thread Logan Gunthorpe

On 6/22/2017 2:36 PM, Alan Cox wrote:

I think that makes sense for the platforms with that problem. I'm not
sure there are many that can't do it for mmio at least. 486SX can't do it
and I guess some ARM32 but I think almost everyone else can including
most 32bit x86.

What's more of a problem is a lot of platforms can do 64bit MMIO via
ioread/write64 but not 64bit port I/O, and it's not clear how you
represent that via an ioread/write API that abstracts it away.


In Patch 2, we call bad_io_access for anyone trying to do 64bit accesses 
on port I/O.


Logan



[PATCH 2/7] iomap: implement ioread64 and iowrite64

2017-06-22 Thread Logan Gunthorpe
Currently, ioread64 and iowrite64 are not implemented in the generic
iomap implementation. The prototypes are defined if CONFIG_64BIT is set
but there is no actual implementation.

Seeing the functions are not universally available, they are unusable
for driver developers. This leads to ugly hacks such as those at
the top of

drivers/ntb/hw/intel/ntb_hw_intel.c

This patch adds generic implementations for these functions. We add
the obvious version if readq/writeq are implemented and fall back
to using two io32 calls in cases that don't provide direct 64bit
accesses. Thus making the functions universally available to
configurations with CONFIG_GENERIC_IOMAP=y.

For any pio accesses, the 64bit operations remain unsupported and
simply call bad_io_access in cases readq would be called.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: Arnd Bergmann <a...@arndb.de>
Cc: Suresh Warrier <warr...@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npig...@gmail.com>
---
 arch/powerpc/include/asm/io.h |  2 ++
 include/asm-generic/iomap.h   |  4 ---
 lib/iomap.c   | 62 +++
 3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 422f99cf9924..11a83667d2c3 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -788,8 +788,10 @@ extern void __iounmap_at(void *ea, unsigned long size);
 
 #define mmio_read16be(addr)readw_be(addr)
 #define mmio_read32be(addr)readl_be(addr)
+#define mmio_read64be(addr)readq_be(addr)
 #define mmio_write16be(val, addr)  writew_be(val, addr)
 #define mmio_write32be(val, addr)  writel_be(val, addr)
+#define mmio_write64be(val, addr)  writeq_be(val, addr)
 #define mmio_insb(addr, dst, count)readsb(addr, dst, count)
 #define mmio_insw(addr, dst, count)readsw(addr, dst, count)
 #define mmio_insl(addr, dst, count)readsl(addr, dst, count)
diff --git a/include/asm-generic/iomap.h b/include/asm-generic/iomap.h
index 650fede33c25..43ec4ea9f6f9 100644
--- a/include/asm-generic/iomap.h
+++ b/include/asm-generic/iomap.h
@@ -30,20 +30,16 @@ extern unsigned int ioread16(void __iomem *);
 extern unsigned int ioread16be(void __iomem *);
 extern unsigned int ioread32(void __iomem *);
 extern unsigned int ioread32be(void __iomem *);
-#ifdef CONFIG_64BIT
 extern u64 ioread64(void __iomem *);
 extern u64 ioread64be(void __iomem *);
-#endif
 
 extern void iowrite8(u8, void __iomem *);
 extern void iowrite16(u16, void __iomem *);
 extern void iowrite16be(u16, void __iomem *);
 extern void iowrite32(u32, void __iomem *);
 extern void iowrite32be(u32, void __iomem *);
-#ifdef CONFIG_64BIT
 extern void iowrite64(u64, void __iomem *);
 extern void iowrite64be(u64, void __iomem *);
-#endif
 
 /*
  * "string" versions of the above. Note that they
diff --git a/lib/iomap.c b/lib/iomap.c
index fc3dcb4b238e..e38e036cb52f 100644
--- a/lib/iomap.c
+++ b/lib/iomap.c
@@ -66,6 +66,7 @@ static void bad_io_access(unsigned long port, const char 
*access)
 #ifndef mmio_read16be
 #define mmio_read16be(addr) be16_to_cpu(__raw_readw(addr))
 #define mmio_read32be(addr) be32_to_cpu(__raw_readl(addr))
+#define mmio_read64be(addr) be64_to_cpu(__raw_readq(addr))
 #endif
 
 unsigned int ioread8(void __iomem *addr)
@@ -93,11 +94,45 @@ unsigned int ioread32be(void __iomem *addr)
IO_COND(addr, return pio_read32be(port), return mmio_read32be(addr));
return 0x;
 }
+
+#ifdef readq
+u64 ioread64(void __iomem *addr)
+{
+   IO_COND(addr, bad_io_access(port, "ioread64"), return readq(addr));
+   return 0xLL;
+}
+u64 ioread64be(void __iomem *addr)
+{
+   IO_COND(addr, bad_io_access(port, "ioread64be"),
+   return mmio_read64be(addr));
+   return 0xLL;
+}
+#else
+u64 ioread64(void __iomem *addr)
+{
+   u64 low, high;
+
+   low = ioread32(addr);
+   high = ioread32(addr + sizeof(u32));
+   return low | (high << 32);
+}
+u64 ioread64be(void __iomem *addr)
+{
+   u64 low, high;
+
+   low = ioread32be(addr + sizeof(u32));
+   high = ioread32be(addr);
+   return low | (high << 32);
+}
+#endif
+
 EXPORT_SYMBOL(ioread8);
 EXPORT_SYMBOL(ioread16);
 EXPORT_SYMBOL(ioread16be);
 EXPORT_SYMBOL(ioread32);
 EXPORT_SYMBOL(ioread32be);
+EXPORT_SYMBOL(ioread64);
+EXPORT_SYMBOL(ioread64be);
 
 #ifndef pio_write16be
 #define pio_write16be(val,port) outw(swab16(val),port)
@@ -107,6 +142,7 @@ EXPORT_SYMBOL(ioread32be);
 #ifndef mmio_write16be
 #define mmio_write16be(val,port) __raw_writew(be16_to_cpu(val),port)
 #define mmio_write32be(val,port) __raw_writel(be32_to_cpu(val),port)
#define mmio_write64be(val,port) __raw_writeq(be64_to_cpu(val),port)

[PATCH 1/7] drm/tilcdc: don't use volatile with iowrite64

2017-06-22 Thread Logan Gunthorpe
This is a prep patch for adding a universal iowrite64.

The patch prevents compiler warnings that would occur when we add
iowrite64, caused by an unnecessary volatile in this driver.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: Jyri Sarha <jsa...@ti.com>
Cc: Tomi Valkeinen <tomi.valkei...@ti.com>
Cc: David Airlie <airl...@linux.ie>
---
 drivers/gpu/drm/tilcdc/tilcdc_regs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/tilcdc/tilcdc_regs.h 
b/drivers/gpu/drm/tilcdc/tilcdc_regs.h
index 9d528c0a67a4..e9ce725698a9 100644
--- a/drivers/gpu/drm/tilcdc/tilcdc_regs.h
+++ b/drivers/gpu/drm/tilcdc/tilcdc_regs.h
@@ -131,14 +131,14 @@ static inline void tilcdc_write(struct drm_device *dev, 
u32 reg, u32 data)
 static inline void tilcdc_write64(struct drm_device *dev, u32 reg, u64 data)
 {
struct tilcdc_drm_private *priv = dev->dev_private;
-   volatile void __iomem *addr = priv->mmio + reg;
+   void __iomem *addr = priv->mmio + reg;
 
 #ifdef iowrite64
iowrite64(data, addr);
 #else
__iowmb();
/* This compiles to strd (=64-bit write) on ARM7 */
-   *(volatile u64 __force *)addr = __cpu_to_le64(data);
+   *(u64 __force *)addr = __cpu_to_le64(data);
 #endif
 }
 
-- 
2.11.0



[PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available

2017-06-22 Thread Logan Gunthorpe
Currently, ioread64 and iowrite64 are only available if CONFIG_64BIT=y
and CONFIG_GENERIC_IOMAP=n. Thus, since the functions are not
universally available, they are unusable for driver developers.
This leads to ugly hacks such as those at the top of

drivers/ntb/hw/intel/ntb_hw_intel.c

This patch adds fallback implementations for when CONFIG_64BIT and
CONFIG_GENERIC_IOMAP are not set. These functions use two io32 based
calls to complete the operation.

Note that, unlike the other functions in this file, we do not use the
volatile keyword in these functions. This is necessary to avoid a
compiler warning on arm.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: Arnd Bergmann <a...@arndb.de>
---
 include/asm-generic/io.h | 54 +---
 1 file changed, 42 insertions(+), 12 deletions(-)

diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index 7ef015eb3403..817edaa3da78 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -585,15 +585,24 @@ static inline u32 ioread32(const volatile void __iomem 
*addr)
 }
 #endif
 
-#ifdef CONFIG_64BIT
 #ifndef ioread64
 #define ioread64 ioread64
-static inline u64 ioread64(const volatile void __iomem *addr)
+#ifdef readq
+static inline u64 ioread64(const void __iomem *addr)
 {
return readq(addr);
 }
+#else
+static inline u64 ioread64(const void __iomem *addr)
+{
+   u64 low, high;
+
+   low = ioread32(addr);
+   high = ioread32(addr + sizeof(u32));
+   return low | (high << 32);
+}
+#endif
 #endif
-#endif /* CONFIG_64BIT */
 
 #ifndef iowrite8
 #define iowrite8 iowrite8
@@ -619,15 +628,21 @@ static inline void iowrite32(u32 value, volatile void 
__iomem *addr)
 }
 #endif
 
-#ifdef CONFIG_64BIT
 #ifndef iowrite64
 #define iowrite64 iowrite64
-static inline void iowrite64(u64 value, volatile void __iomem *addr)
+#ifdef writeq
+static inline void iowrite64(u64 value, void __iomem *addr)
 {
writeq(value, addr);
 }
+#else
+static inline void iowrite64(u64 value, void __iomem *addr)
+{
+   iowrite32(value, addr);
+   iowrite32(value >> 32, addr + sizeof(u32));
+}
+#endif
 #endif
-#endif /* CONFIG_64BIT */
 
 #ifndef ioread16be
 #define ioread16be ioread16be
@@ -645,15 +660,24 @@ static inline u32 ioread32be(const volatile void __iomem 
*addr)
 }
 #endif
 
-#ifdef CONFIG_64BIT
 #ifndef ioread64be
 #define ioread64be ioread64be
-static inline u64 ioread64be(const volatile void __iomem *addr)
+#ifdef readq
+static inline u64 ioread64be(const void __iomem *addr)
 {
return swab64(readq(addr));
 }
+#else
+static inline u64 ioread64be(const void __iomem *addr)
+{
+   u64 low, high;
+
+   low = ioread32be(addr + sizeof(u32));
+   high = ioread32be(addr);
+   return low | (high << 32);
+}
+#endif
 #endif
-#endif /* CONFIG_64BIT */
 
 #ifndef iowrite16be
 #define iowrite16be iowrite16be
@@ -671,15 +695,21 @@ static inline void iowrite32be(u32 value, volatile void 
__iomem *addr)
 }
 #endif
 
-#ifdef CONFIG_64BIT
 #ifndef iowrite64be
 #define iowrite64be iowrite64be
-static inline void iowrite64be(u64 value, volatile void __iomem *addr)
+#ifdef writeq
+static inline void iowrite64be(u64 value, void __iomem *addr)
 {
writeq(swab64(value), addr);
 }
+#else
+static inline void iowrite64be(u64 value, void __iomem *addr)
+{
+   iowrite32be(value >> 32, addr);
+   iowrite32be(value, addr + sizeof(u32));
+}
+#endif
 #endif
-#endif /* CONFIG_64BIT */
 
 #ifndef ioread8_rep
 #define ioread8_rep ioread8_rep
-- 
2.11.0



Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Logan Gunthorpe

On 6/22/2017 11:29 AM, Stephen  Bates wrote:

+#define iowrite64be(v,p) iowrite32(cpu_to_be64(v), (p))


Logan, thanks for taking this cleanup on. I think this should be iowrite64 not 
iowrite32?


Yup, good catch. Thanks. I'll fix it in a v2 of this series.

Logan



[PATCH 5/7] ntb: ntb_hw_intel: remove ioread64 and iowrite64 hacks

2017-06-22 Thread Logan Gunthorpe
Now that ioread64 and iowrite64 are available generically we can
remove the hack at the top of ntb_hw_intel.c that patches them in
when they are not available.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: Jon Mason <jdma...@kudzu.us>
Cc: Dave Jiang <dave.ji...@intel.com>
Cc: Allen Hubbe <allen.hu...@emc.com>
---
 drivers/ntb/hw/intel/ntb_hw_intel.c | 30 --
 1 file changed, 30 deletions(-)

diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c 
b/drivers/ntb/hw/intel/ntb_hw_intel.c
index c00238491673..56221d540c2b 100644
--- a/drivers/ntb/hw/intel/ntb_hw_intel.c
+++ b/drivers/ntb/hw/intel/ntb_hw_intel.c
@@ -153,35 +153,6 @@ MODULE_PARM_DESC(xeon_b2b_dsd_bar5_addr32,
 static inline enum ntb_topo xeon_ppd_topo(struct intel_ntb_dev *ndev, u8 ppd);
 static int xeon_init_isr(struct intel_ntb_dev *ndev);
 
-#ifndef ioread64
-#ifdef readq
-#define ioread64 readq
-#else
-#define ioread64 _ioread64
-static inline u64 _ioread64(void __iomem *mmio)
-{
-   u64 low, high;
-
-   low = ioread32(mmio);
-   high = ioread32(mmio + sizeof(u32));
-   return low | (high << 32);
-}
-#endif
-#endif
-
-#ifndef iowrite64
-#ifdef writeq
-#define iowrite64 writeq
-#else
-#define iowrite64 _iowrite64
-static inline void _iowrite64(u64 val, void __iomem *mmio)
-{
-   iowrite32(val, mmio);
-   iowrite32(val >> 32, mmio + sizeof(u32));
-}
-#endif
-#endif
-
 static inline int pdev_is_atom(struct pci_dev *pdev)
 {
switch (pdev->device) {
@@ -3008,4 +2979,3 @@ static void __exit intel_ntb_pci_driver_exit(void)
debugfs_remove_recursive(debugfs_dir);
 }
 module_exit(intel_ntb_pci_driver_exit);
-
-- 
2.11.0



[PATCH 6/7] drm/tilcdc: clean up ifdef hacks around iowrite64

2017-06-22 Thread Logan Gunthorpe
Now that we can expect iowrite64 to always exist, the hack is no longer
necessary, so we just call iowrite64 directly.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: Jyri Sarha <jsa...@ti.com>
Cc: Tomi Valkeinen <tomi.valkei...@ti.com>
Cc: David Airlie <airl...@linux.ie>
---
 drivers/gpu/drm/tilcdc/tilcdc_regs.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/gpu/drm/tilcdc/tilcdc_regs.h 
b/drivers/gpu/drm/tilcdc/tilcdc_regs.h
index e9ce725698a9..0b901405f30a 100644
--- a/drivers/gpu/drm/tilcdc/tilcdc_regs.h
+++ b/drivers/gpu/drm/tilcdc/tilcdc_regs.h
@@ -133,13 +133,7 @@ static inline void tilcdc_write64(struct drm_device *dev, 
u32 reg, u64 data)
struct tilcdc_drm_private *priv = dev->dev_private;
void __iomem *addr = priv->mmio + reg;
 
-#ifdef iowrite64
iowrite64(data, addr);
-#else
-   __iowmb();
-   /* This compiles to strd (=64-bit write) on ARM7 */
-   *(u64 __force *)addr = __cpu_to_le64(data);
-#endif
 }
 
 static inline u32 tilcdc_read(struct drm_device *dev, u32 reg)
-- 
2.11.0



[PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Logan Gunthorpe
Alpha implements its own I/O operations and doesn't use the
common library. Thus to make ioread64 and iowrite64 globally
available we need to add implementations for alpha.

For this, we simply use calls that chain two 32-bit operations.
(mostly because I don't really understand the alpha architecture.)

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: Richard Henderson <r...@twiddle.net>
Cc: Ivan Kokshaysky <i...@jurassic.park.msu.ru>
Cc: Matt Turner <matts...@gmail.com>
---
 arch/alpha/include/asm/io.h |  2 ++
 arch/alpha/kernel/io.c  | 18 ++
 2 files changed, 20 insertions(+)

diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h
index ff4049155c84..15588092c062 100644
--- a/arch/alpha/include/asm/io.h
+++ b/arch/alpha/include/asm/io.h
@@ -493,8 +493,10 @@ extern inline void writeq(u64 b, volatile void __iomem 
*addr)
 
 #define ioread16be(p) be16_to_cpu(ioread16(p))
 #define ioread32be(p) be32_to_cpu(ioread32(p))
+#define ioread64be(p) be64_to_cpu(ioread64(p))
 #define iowrite16be(v,p) iowrite16(cpu_to_be16(v), (p))
 #define iowrite32be(v,p) iowrite32(cpu_to_be32(v), (p))
+#define iowrite64be(v,p) iowrite32(cpu_to_be64(v), (p))
 
 #define inb_p  inb
 #define inw_p  inw
diff --git a/arch/alpha/kernel/io.c b/arch/alpha/kernel/io.c
index 19c5875ab398..8c28026f7849 100644
--- a/arch/alpha/kernel/io.c
+++ b/arch/alpha/kernel/io.c
@@ -59,6 +59,24 @@ EXPORT_SYMBOL(iowrite8);
 EXPORT_SYMBOL(iowrite16);
 EXPORT_SYMBOL(iowrite32);
 
+u64 ioread64(void __iomem *addr)
+{
+   u64 low, high;
+
+   low = ioread32(addr);
+   high = ioread32(addr + sizeof(u32));
+   return low | (high << 32);
+}
+
+void iowrite64(u64 val, void __iomem *addr)
+{
+   iowrite32(val, addr);
+   iowrite32(val >> 32, addr + sizeof(u32));
+}
+
+EXPORT_SYMBOL(ioread64);
+EXPORT_SYMBOL(iowrite64);
+
 u8 inb(unsigned long port)
 {
return ioread8(ioport_map(port, 1));
-- 
2.11.0



Re: [PATCH 3/7] asm-generic/io.h: make ioread64 and iowrite64 universally available

2017-06-22 Thread Logan Gunthorpe

On 6/22/2017 2:14 PM, Alan Cox wrote:

If a platform doesn't support 64bit I/O operations from the CPU then you
either need to use some kind of platform/architecture specific interface
if present or accept you don't have one.


Yes, I understand that.

The thing is that every user of these functions right now is 
patching in their own version that splits it on non-64bit systems.



It's not safe to split it. Possibly for some use cases you could add an
ioread64_maysplit()


I'm open to doing something like that.


What btw is the actual ARM compiler warning ? Is the compiler also trying
to tell you it's a bad idea ?


It's just the compiler noting that you are mixing volatile and 
non-volatile pointers. Strangely some io{read|write}XX use volatile but 
most do not. But it's nothing crazy.


Logan




Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Logan Gunthorpe

On 6/22/2017 2:08 PM, Alan Cox wrote:

But this does not do the same thing as an ioread64 with regards to
atomicity or side effects on the device. The same is true of the other
hacks. You either have a real 64bit single read/write from MMIO space or
you don't. You can't fake it.


Yes, I know. But is it not better than having every driver that wants to 
use these functions fake it themselves?


Logan



Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations

2017-06-22 Thread Logan Gunthorpe

On 6/22/2017 3:03 PM, Arnd Bergmann wrote:

Drivers that want a non-atomic variant should include either
include/linux/io-64-nonatomic-hi-lo.h or include/linux/io-64-nonatomic-lo-hi.h
depending on what they need. Drivers that require 64-bit I/O should
probably just depend on CONFIG_64BIT and maybe use readq/writeq.


Ok, I will work something like that up.

We'll still need a patch similar to patch 2 (less the non-atomic 
versions) seeing even CONFIG_GENERIC_IOMAP arches don't actually have a 
working ioread64/iowrite64 implementation.


Thanks,

Logan


[PATCH] drm: Fix boot panic when register_chrdev fails

2017-06-14 Thread Logan Gunthorpe
This is a bug found by the 0day kernel test robot. When drm is
compiled into the kernel and register_chrdev fails (due, in this case,
to exhausting the chardev dynamic major numbers), a kernel panic
occurs on boot:

  BUG: unable to handle kernel NULL pointer dereference at
  00a8
  IP: down_write+0x2a/0x53
  Call Trace:
   start_creating+0x67/0x12e
   debugfs_create_dir+0x12/0x189
   drm_debugfs_init+0x7c/0x1b7
   ? ___might_sleep+0x172/0x192
   ? __might_sleep+0x6b/0xef
   drm_minor_register+0x6b/0x141
   drm_dev_register+0xcd/0x315
   ? pci_enable_device_flags+0x117/0x177
   drm_get_pci_dev+0x106/0x27a
   cirrus_pci_probe+0xfb/0x125
   pci_device_probe+0x11d/0x185
   ...

This is because when register_chrdev fails, it removes drm_debugfs_root.
However, since drm is not a module, nothing prevents other code from
calling drm_minor_register after drm_core_init has failed and thus using an
invalid drm_debugfs_root.

This commit fixes this issue by setting drm_debugfs_root to NULL after
removal and checking that it's not NULL before using it.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
Cc: Daniel Vetter <daniel.vet...@intel.com>
Cc: Jani Nikula <jani.nik...@linux.intel.com>
Cc: Sean Paul <seanp...@chromium.org>
Cc: David Airlie <airl...@linux.ie>
Link: https://lkml.org/lkml/2017/6/4/107
---
 drivers/gpu/drm/drm_drv.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 37b8ad3e30d8..904420304b75 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -214,6 +214,9 @@ static int drm_minor_register(struct drm_device *dev, unsigned int type)
 
 	DRM_DEBUG("\n");
 
+	if (!drm_debugfs_root)
+		return -ENODEV;
+
 	minor = *drm_minor_get_slot(dev, type);
 	if (!minor)
 		return 0;
@@ -935,6 +938,7 @@ static void drm_core_exit(void)
 {
 	unregister_chrdev(DRM_MAJOR, "drm");
 	debugfs_remove(drm_debugfs_root);
+	drm_debugfs_root = NULL;
 	drm_sysfs_destroy();
 	idr_destroy(&drm_minors_idr);
 	drm_connector_ida_destroy();
-- 
2.11.0



Re: [PATCH v2 07/21] crypto: shash, caam: Make use of the new sg_map helper function

2017-04-28 Thread Logan Gunthorpe


On 28/04/17 11:51 AM, Herbert Xu wrote:
> On Fri, Apr 28, 2017 at 10:53:45AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 28/04/17 12:30 AM, Herbert Xu wrote:
>>> You are right.  Indeed the existing code looks buggy as they
>>> don't take sg->offset into account when doing the kmap.  Could
>>> you send me some patches that fix these problems first so that
>>> they can be easily backported?
>>
>> Ok, I think the only buggy one in crypto is hifn_795x. Shash and caam
>> both do have the sg->offset accounted for. I'll send a patch for the
>> buggy one shortly.
> 
> I think they're all buggy when sg->offset is greater than PAGE_SIZE.

Yes, technically. But that's a _very_ common mistake: nearly every
case I looked at failed to take it into account. I don't think sg
entries that point to more than one contiguous page are all that
common.

Fixing all those cases without making a common function is a waste of
time, IMO.
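
For what it's worth, the correct pattern isn't much longer than the
buggy one. Something like this (a sketch only; the helper name is made
up for illustration):

static void *sg_kmap_atomic_checked(struct scatterlist *sg)
{
	/* kmap_atomic(sg_page(sg)) + sg->offset walks off the mapped
	 * page whenever sg->offset >= PAGE_SIZE (a real bug with
	 * CONFIG_HIGHMEM). Advance to the correct page first. */
	struct page *page = nth_page(sg_page(sg), sg->offset >> PAGE_SHIFT);

	return kmap_atomic(page) + (sg->offset & ~PAGE_MASK);
}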

Logan


Re: [PATCH v2 07/21] crypto: shash, caam: Make use of the new sg_map helper function

2017-04-28 Thread Logan Gunthorpe


On 28/04/17 12:30 AM, Herbert Xu wrote:
> You are right.  Indeed the existing code looks buggy as they
> don't take sg->offset into account when doing the kmap.  Could
> you send me some patches that fix these problems first so that
> they can be easily backported?

Ok, I think the only buggy one in crypto is hifn_795x. Shash and caam
both do have the sg->offset accounted for. I'll send a patch for the
buggy one shortly.

Logan


Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions

2017-04-27 Thread Logan Gunthorpe


On 27/04/17 12:53 AM, Christoph Hellwig wrote:
> I think you'll need to follow the existing kmap semantics and never
> fail the iomem version either.  Otherwise you'll have a special case
> that's almost never used that has a different error path.
>
> Again, wrong way.  Suddenly making things fail for your special case
> that normally don't fail is a receipe for bugs.

I don't disagree, but don't these restrictions make the problem
impossible to solve? If there is iomem behind a page in an SGL and
someone tries to map it, we either have to fail or we break iomem
safety, which was your original concern.

Logan



Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function

2017-04-27 Thread Logan Gunthorpe


On 27/04/17 02:53 PM, Jason Gunthorpe wrote:
> blkfront is one of the drivers I looked at, and it appears to only be
> memcpying with the bvec_data pointer, so I wonder why it does not use
> sg_copy_X_buffer instead..

Yes, sort of...

But you'd potentially end up calling sg_copy_to_buffer multiple times
per page within the sg (given that gnttab_foreach_grant_in_range might
call blkif_copy_from_grant/blkif_setup_rw_req_grant multiple times).
Even calling sg_copy_to_buffer once per page seems rather inefficient as
it uses sg_miter internally.

Switching the for_each_sg to sg_miter is probably the nicer solution as
it takes care of the mapping and the offset/length accounting for you
and will have similar performance.
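
For reference, the sg_miter version of such a copy looks roughly like
this (an untested sketch, not a drop-in replacement for blkfront):

static void copy_from_sgl(struct scatterlist *sgl, unsigned int nents,
			  void *buf, size_t len)
{
	struct sg_mapping_iter miter;
	size_t copied = 0;

	/* miter does the kmap_atomic and the per-page offset/length
	 * accounting internally */
	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
	while (copied < len && sg_miter_next(&miter)) {
		size_t cur = min(miter.length, len - copied);

		memcpy(buf + copied, miter.addr, cur);
		copied += cur;
	}
	sg_miter_stop(&miter);
}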

But, yes, if performance is not an issue, switching it to
sg_copy_to_buffer would be a less invasive change than sg_miter, and
the same might be said about a lot of these cases.

Unfortunately, changing from kmap_atomic (which is a null operation in a
lot of cases) to sg_copy_X_buffer is a pretty big performance hit.

Logan


Re: [PATCH v2 07/21] crypto: shash, caam: Make use of the new sg_map helper function

2017-04-27 Thread Logan Gunthorpe


On 26/04/17 09:56 PM, Herbert Xu wrote:
> On Tue, Apr 25, 2017 at 12:20:54PM -0600, Logan Gunthorpe wrote:
>> Very straightforward conversion to the new function in the caam driver
>> and shash library.
>>
>> Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
>> Cc: Herbert Xu <herb...@gondor.apana.org.au>
>> Cc: "David S. Miller" <da...@davemloft.net>
>> ---
>>  crypto/shash.c| 9 ++---
>>  drivers/crypto/caam/caamalg.c | 8 +++-
>>  2 files changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/crypto/shash.c b/crypto/shash.c
>> index 5e31c8d..5914881 100644
>> --- a/crypto/shash.c
>> +++ b/crypto/shash.c
>> @@ -283,10 +283,13 @@ int shash_ahash_digest(struct ahash_request *req, struct shash_desc *desc)
>>  	if (nbytes < min(sg->length, ((unsigned int)(PAGE_SIZE)) - offset)) {
>>  		void *data;
>>  
>> -		data = kmap_atomic(sg_page(sg));
>> -		err = crypto_shash_digest(desc, data + offset, nbytes,
>> +		data = sg_map(sg, 0, SG_KMAP_ATOMIC);
>> +		if (IS_ERR(data))
>> +			return PTR_ERR(data);
>> +
>> +		err = crypto_shash_digest(desc, data, nbytes,
>>  					  req->result);
>> -		kunmap_atomic(data);
>> +		sg_unmap(sg, data, 0, SG_KMAP_ATOMIC);
>>  		crypto_yield(desc->flags);
>>  	} else
>>  		err = crypto_shash_init(desc) ?:
> 
> Nack.  This is an optimisation for the special case of a single
> SG list entry.  In fact in the common case the kmap_atomic should
> disappear altogether in the no-highmem case.  So replacing it
> with sg_map is not acceptable.

What you seem to have missed is that sg_map is just a thin wrapper
around kmap_atomic (perhaps with a future check for a mappable page).
This change should have zero impact on performance.
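
To illustrate, the heart of it is something along these lines (a
simplified sketch, not the literal patch; see the actual series for
the details):

static inline void *sg_map(struct scatterlist *sg, size_t offset, int flags)
{
	unsigned int pg_off = sg->offset + offset;
	struct page *pg = nth_page(sg_page(sg), pg_off >> PAGE_SHIFT);

	/* a future check for an unmappable (iomem) page would return
	 * ERR_PTR(-EFAULT) here */

	if (flags & SG_KMAP_ATOMIC)
		return kmap_atomic(pg) + (pg_off & ~PAGE_MASK);

	return kmap(pg) + (pg_off & ~PAGE_MASK);
}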

Logan


Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions

2017-04-27 Thread Logan Gunthorpe


On 27/04/17 09:27 AM, Jason Gunthorpe wrote:
> On Thu, Apr 27, 2017 at 08:53:38AM +0200, Christoph Hellwig wrote:
> How about first switching as many call sites as possible to use
> sg_copy_X_buffer instead of kmap?

Yeah, I could look at doing that first.

One problem is that we might get more Naks like Herbert Xu's, from
people concerned about the performance implications.

These are definitely a bit more invasive changes than thin wrappers
around kmap calls.

> A random audit of Logan's series suggests this is actually a fairly
> common thing.

It's not _that_ common, but a significant fraction do. One of my
patches actually did this in two places that seemed to be
reimplementing the sg_copy_X_buffer logic.
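
A typical conversion looks something like this (illustrative only; the
variable names are placeholders):

	/* before: open-coded map and copy, broken for iomem and for
	 * sg->offset >= PAGE_SIZE */
	addr = kmap_atomic(sg_page(sg));
	memcpy(buf, addr + sg->offset, sg->length);
	kunmap_atomic(addr);

	/* after: one call that does the mapping and the offset/length
	 * accounting itself */
	sg_copy_to_buffer(sgl, nents, buf, buflen);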

Thanks,

Logan


Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function

2017-04-27 Thread Logan Gunthorpe


On 26/04/17 01:37 AM, Roger Pau Monné wrote:
> On Tue, Apr 25, 2017 at 12:21:02PM -0600, Logan Gunthorpe wrote:
>> Straightforward conversion to the new helper, except due to the lack
>> of error path, we have to use SG_MAP_MUST_NOT_FAIL which may BUG_ON in
>> certain cases in the future.
>>
>> Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
>> Cc: Boris Ostrovsky <boris.ostrov...@oracle.com>
>> Cc: Juergen Gross <jgr...@suse.com>
>> Cc: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
>> Cc: "Roger Pau Monné" <roger@citrix.com>
>> ---
>>  drivers/block/xen-blkfront.c | 20 +++++++++++---------
>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 3945963..ed62175 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -816,8 +816,9 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
>>  		BUG_ON(sg->offset + sg->length > PAGE_SIZE);
>>  
>>  		if (setup.need_copy) {
>> -			setup.bvec_off = sg->offset;
>> -			setup.bvec_data = kmap_atomic(sg_page(sg));
>> +			setup.bvec_off = 0;
>> +			setup.bvec_data = sg_map(sg, 0, SG_KMAP_ATOMIC |
>> +						 SG_MAP_MUST_NOT_FAIL);
> 
> I assume that sg_map already adds sg->offset to the address?

Correct.

> Also wondering whether we can get rid of bvec_off and just increment 
> bvec_data,
> adding Julien who IIRC added this code.

bvec_off is used to keep track of the offset within the current
mapping, so incrementing bvec_data is not a great idea given that you'd
want to kunmap_atomic the original address, not one with an offset
added. It would be nice if this could be converted to use the sg_miter
interface, but that's a much more invasive change that would require
someone who knows this code and can properly test it. I'd be very
grateful if someone actually took that on.

Logan



Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions

2017-04-27 Thread Logan Gunthorpe

On 26/04/17 01:44 AM, Christoph Hellwig wrote:
> I think we'll at least need a draft of those to make sense of these
> patches.  Otherwise they just look very clumsy.

Ok, what follows is a draft patch attempting to show where I'm thinking
of going with this. Obviously it will not compile because it assumes
the users throughout the kernel are a bit different than they are today.
Notably, there is no sg_page anymore.

There are also likely a ton of issues and arguments to have over a
bunch of the specifics below, and I'd expect the concept to evolve more
as cleanup occurs. This itself is an evolution of the draft I posted
replying to you in my last RFC thread.

Also, before any of this is truly useful to us, pfn_t would have to
infect a few other places in the kernel.

Thanks,

Logan


diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index fad170b..85ef928 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -6,13 +6,14 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/bug.h>
+#include <linux/pfn_t.h>
 #include <linux/mm.h>

 struct scatterlist {
 #ifdef CONFIG_DEBUG_SG
unsigned long   sg_magic;
 #endif
-   unsigned long   page_link;
+   pfn_t   pfn;
unsigned intoffset;
unsigned intlength;
dma_addr_t  dma_address;
@@ -60,15 +61,68 @@ struct sg_table {

 #define SG_MAGIC   0x87654321

-/*
- * We overload the LSB of the page pointer to indicate whether it's
- * a valid sg entry, or whether it points to the start of a new scatterlist.
- * Those low bits are there for everyone! (thanks mason :-)
- */
-#define sg_is_chain(sg)((sg)->page_link & 0x01)
-#define sg_is_last(sg) ((sg)->page_link & 0x02)
-#define sg_chain_ptr(sg)   \
-   ((struct scatterlist *) ((sg)->page_link & ~0x03))
+static inline bool sg_is_chain(struct scatterlist *sg)
+{
+   return sg->pfn.val & PFN_SG_CHAIN;
+}
+
+static inline bool sg_is_last(struct scatterlist *sg)
+{
+   return sg->pfn.val & PFN_SG_LAST;
+}
+
+static inline struct scatterlist *sg_chain_ptr(struct scatterlist *sg)
+{
+   unsigned long sgl = pfn_t_to_pfn(sg->pfn);
+   return (struct scatterlist *)(sgl << PAGE_SHIFT);
+}
+
+static inline bool sg_is_iomem(struct scatterlist *sg)
+{
+   return pfn_t_is_iomem(sg->pfn);
+}
+
+/**
+ * sg_assign_pfn - Assign a given pfn_t to an SG entry
+ * @sg:SG entry
+ * @pfn:   The pfn
+ *
+ * Description:
+ *   Assign a pfn to sg entry. Also see sg_set_pfn(), the most commonly used
+ *   variant.
+ *
+ **/
+static inline void sg_assign_pfn(struct scatterlist *sg, pfn_t pfn)
+{
+#ifdef CONFIG_DEBUG_SG
+   BUG_ON(sg->sg_magic != SG_MAGIC);
+   BUG_ON(sg_is_chain(sg));
+   BUG_ON(pfn.val & (PFN_SG_CHAIN | PFN_SG_LAST));
+#endif
+
+   sg->pfn = pfn;
+}
+
+/**
+ * sg_set_pfn - Set sg entry to point at given pfn
+ * @sg: SG entry
+ * @pfn:The page
+ * @len:Length of data
+ * @offset: Offset into page
+ *
+ * Description:
+ *   Use this function to set an sg entry pointing at a pfn, never assign
+ *   the page directly. We encode sg table information in the lower bits
+ *   of the page pointer. See sg_pfn_t for looking up the pfn_t belonging
+ *   to an sg entry.
+ **/
+static inline void sg_set_pfn(struct scatterlist *sg, pfn_t pfn,
+ unsigned int len, unsigned int offset)
+{
+   sg_assign_pfn(sg, pfn);
+   sg->offset = offset;
+   sg->length = len;
+}

 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -82,18 +136,13 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
-   unsigned long page_link = sg->page_link & 0x3;
+   if (!page) {
+   pfn_t null_pfn = {0};
+   sg_assign_pfn(sg, null_pfn);
+   return;
+   }

-   /*
-* In order for the low bit stealing approach to work, pages
-* must be aligned at a 32-bit boundary as a minimum.
-*/
-   BUG_ON((unsigned long) page & 0x03);
-#ifdef CONFIG_DEBUG_SG
-   BUG_ON(sg->sg_magic != SG_MAGIC);
-   BUG_ON(sg_is_chain(sg));
-#endif
-   sg->page_link = page_link | (unsigned long) page;
+   sg_assign_pfn(sg, page_to_pfn_t(page));
 }

 /**
@@ -106,8 +155,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
  * Description:
  *   Use this function to set an sg entry pointing at a page, never assign
  *   the page directly. We encode sg table information in the lower bits
- *   of the page pointer. See sg_page() for looking up the page belonging
- *   to an sg entry.
+ *   of the page pointer.
  *
  **/
 static inline void sg_set_page(struct scatterlist *sg, struct page *page,
@@ -118,13 +166,53 @@ static inline void sg_set_page(struct scatterlist *sg, struct page *page,
sg->length = len;
 }

-static inline struct page *sg_page(struct scatterlist *sg)
+/**
+ * sg_pfn_t - Return the 

Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function

2017-04-27 Thread Logan Gunthorpe


On 27/04/17 05:20 PM, Jason Gunthorpe wrote:
> It seems the most robust: test for iomem, and jump to a slow path
> copy, otherwise inline the kmap and memcpy
> 
> Every place doing memcpy from sgl will need that pattern to be
> correct.

Ok, sounds like a good place to start to me. I'll see what I can do for
a v3 of this set. Though, I probably won't send anything until after the
merge window.

>>> sg_miter will still fail when the sg contains __iomem, however I would
>>> expect that the sg_copy will work with iomem, by using the __iomem
>>> memcpy variant.
>>
>> Yes, that's true. Any sg_miters that ever see iomem will need to be
>> converted to support it. This isn't much different than the other
>> kmap(sg_page()) users I was converting that will also fail if they see
>> iomem. Though, I suspect an sg_miter user would be easier to convert to
>> iomem than a random kmap user.
> 
> How? sg_miter seems like the next nightmare down this path, what is
> sg_miter_next supposed to do when something hits an iomem sgl?

My proposal is roughly included in the draft I sent upthread. We add an
sg_miter flag indicating that the iteratee supports iomem; if miter
finds iomem (and the support flag is set), it sets ioaddr, which is
__iomem. The iteratee then just needs to NULL-check addr and ioaddr and
perform the appropriate action.
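
In code, the pattern would be something like this (SG_MITER_IOMEM and
miter.ioaddr are hypothetical names from the draft, not an existing
API, and buf is a placeholder):

	sg_miter_start(&miter, sgl, nents, SG_MITER_FROM_SG | SG_MITER_IOMEM);
	while (sg_miter_next(&miter)) {
		if (miter.addr)
			memcpy(buf, miter.addr, miter.length);
		else if (miter.ioaddr)
			memcpy_fromio(buf, miter.ioaddr, miter.length);
		buf += miter.length;
	}
	sg_miter_stop(&miter);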

Logan



Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function

2017-04-27 Thread Logan Gunthorpe


On 27/04/17 04:11 PM, Jason Gunthorpe wrote:
> On Thu, Apr 27, 2017 at 03:53:37PM -0600, Logan Gunthorpe wrote:
> Well, that is in the current form, with more users it would make sense
> to optimize for the single page case, eg by providing the existing
> call, providing a faster single-page-only variant of the copy, perhaps
> even one that is inlined.

Ok, does it make sense then to have an sg_copy_page_to_buffer (or some
such; I'm having trouble thinking of a sane name that isn't too long)
that just does k(un)map_atomic and memcpy? I could try that if it makes
sense to people.
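
Roughly like this (the helper and its name are just my tentative
suggestion, nothing that exists today):

static size_t sg_copy_page_to_buffer(struct scatterlist *sg, void *buf,
				     size_t buflen)
{
	size_t len = min_t(size_t, sg->length, buflen);
	/* single-entry fast path: one map, one memcpy; assumes
	 * sg->offset < PAGE_SIZE, like the callers being converted */
	void *addr = kmap_atomic(sg_page(sg));

	memcpy(buf, addr + sg->offset, len);
	kunmap_atomic(addr);

	return len;
}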

>> Switching the for_each_sg to sg_miter is probably the nicer solution as
>> it takes care of the mapping and the offset/length accounting for you
>> and will have similar performance.
> 
> sg_miter will still fail when the sg contains __iomem, however I would
> expect that the sg_copy will work with iomem, by using the __iomem
> memcpy variant.

Yes, that's true. Any sg_miters that ever see iomem will need to be
converted to support it. This isn't much different than the other
kmap(sg_page()) users I was converting that will also fail if they see
iomem. Though, I suspect an sg_miter user would be easier to convert to
iomem than a random kmap user.

Logan


Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions

2017-04-26 Thread Logan Gunthorpe


On 26/04/17 02:59 AM, Christian wrote:
> Good to know that somebody is working on this. Those problems troubled
> us as well.

Thanks Christian. It's a daunting problem and there's a lot of work to
do before we'll ever be where we need to be, so any help, even an ack,
is greatly appreciated.

Logan



Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions

2017-04-26 Thread Logan Gunthorpe


On 26/04/17 01:44 AM, Christoph Hellwig wrote:
> I think we'll at least need a draft of those to make sense of these
> patches.  Otherwise they just look very clumsy.

Ok, I'll work up a draft proposal and send it in a couple of days. But
without a lot of cleanup such as this series, it's not even going to be
able to compile.

> I'm sorry but this API is just a trainwreck.  Right now we have the
> nice little kmap_atomic API, which never fails and has a very nice
> calling convention where we just pass back the return address, but does
> not support sleeping inside the critical section.
> 
> And kmap, which may fail and requires the original page to be passed
> back.  Anything that mixes these two concepts up is simply a non-starter.

Ok, well for starters I think you are mistaken about kmap being able to
fail. I'm having a hard time finding many users of that function that
bother to check for an error when calling it. The main difficulty we
have now is that neither of those functions is expected to fail, and we
need them to be able to in cases where the page doesn't map to system
RAM. This patch series is trying to address that for users of scatterlist.
I'm certainly open to other suggestions.

I also have to disagree that kmap and kmap_atomic are all that "nice".
Except for the sleeping restriction and performance, they effectively do
the same thing. And it was necessary to write a macro wrapper around
kunmap_atomic to ensure that users of that function don't screw it up.
(See 597781f3e5.) I'd say the kmap/kmap_atomic functions are the
trainwreck, and I'm doing my best to clean up a few cases.

There are a fair number of cases in the kernel that do something like:

	if (something)
		x = kmap(page);
	else
		x = kmap_atomic(page);
	...
	if (something)
		kunmap(page);
	else
		kunmap_atomic(x);

Which just seems cumbersome to me.
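
With a single flags-based entry point, that collapses to something like
this (using the flags from this series; SG_KMAP here is the non-atomic
counterpart to SG_KMAP_ATOMIC in the proposal, not an existing API):

	int flags = something ? SG_KMAP : SG_KMAP_ATOMIC;

	x = sg_map(sg, 0, flags | SG_MAP_MUST_NOT_FAIL);
	...
	sg_unmap(sg, x, 0, flags);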

In any case, if you can accept an sg_kmap and sg_kmap_atomic API, just
say so and I'll make the change. But I'll still need a flags variable
for SG_MAP_MUST_NOT_FAIL to support legacy cases that have no fail
path, and both of those functions would end up being near replicas of
each other.

Logan



