On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
> On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
> > On Tue, May 20, 2014 at 12:08:58PM +0100, Arnd Bergmann wrote:
> > > On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
> > > > On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
> > > > > On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
> > > > > > On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
> > > > > > > On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
> > > > > We probably want to default to 32-bit for arm32 in the absence
> > > > > of dma-ranges. For arm64, I'd prefer if we could always mandate
> > > > > dma-ranges to be present for each bus, just like we mandate
> > > > > ranges to be present. I hope it's not too late for that.
> > > > > 
> > > > > dma_set_mask should definitely look at the dma-ranges properties,
> > > > > and the helper that Santosh just introduced should give us all
> > > > > the information we need. We just need to decide on the correct
> > > > > behavior.
> > > > 
> > > > Last time I looked at Santosh's patches I thought the dma-ranges is per
> > > > device rather than per bus. We could make it per bus only and let the
> > > > device call dma_set_mask() explicitly if it wants to restrict it
> > > > further.
> > > 
> > > Can you check again? I've read the code again yesterday to check this,
> > > and I concluded it was correctly doing this per bus.
> > 
> > You are right, I missed the fact that of_dma_get_range() checks the
> > dma-ranges property in the node's parent.
> > 
> > So what we need is to set the default dma mask based on the size
> > information in dma-ranges, something like this:
> > 
> >     mask = rounddown_pow_of_two(size) - 1;
> >     dma_set_mask(dev, mask);        /* or dma_set_mask_and_coherent() */
> 
> I don't think we should be calling dma_set_mask here, since that would
> just go and parse the same property again once we fix the implementation.
> 
> Since this is a low-level helper, we can probably just assign the dma mask
> directly.

I was thinking of calling it in of_dma_configure() as that's where we
already set the default dma_mask and coherent_dma_mask. Default to
32-bit if no dma-ranges are present.
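
Something along these lines in of_dma_configure() (untested sketch using
Santosh's of_dma_get_range() helper, error handling kept to a minimum):

	static void of_dma_configure(struct device *dev)
	{
		u64 dma_addr, paddr, size;

		/* default to 32-bit if no dma-ranges property is present */
		dev->coherent_dma_mask = DMA_BIT_MASK(32);
		if (!dev->dma_mask)
			dev->dma_mask = &dev->coherent_dma_mask;

		if (of_dma_get_range(dev->of_node, &dma_addr, &paddr, &size))
			return;

		/* limit the masks to the size of the bus dma-ranges entry */
		*dev->dma_mask = rounddown_pow_of_two(size) - 1;
		dev->coherent_dma_mask = *dev->dma_mask;
	}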

> > > > > > > While we currently don't have a set of swiotlb DMA ops on
> > > > > > > ARM32, we do have it on ARM64, and I think we should be using
> > > > > > > them properly. It should really not be hard to implement a
> > > > > > > proper dma_set_mask() function for ARM64 that is able to set
> > > > > > > up the swiotlb based on the dma-ranges properties and always
> > > > > > > returns success but leaves the mask unchanged.
> > > > > > 
> > > > > > The swiotlb bounce buffer needs to be pre-allocated at boot,
> > > > > > otherwise we don't have any guarantees. Since we can't honour
> > > > > > random masks anyway, we stick to ZONE_DMA which is currently
> > > > > > within the 4G limit. But the driver calls dma_set_mask() too
> > > > > > late for any further swiotlb setup.
> > > > > > 
> > > > > > With an IOMMU we can be more flexible around dma_set_mask(); it
> > > > > > can be done at run-time.
> > > > > 
> > > > > What we can do with swiotlb is to check if the mask is smaller
> > > > > than ZONE_DMA. If it ever is, we have to fail dma_set_mask and
> > > > > hope the driver can fall back to PIO mode, or it will have to
> > > > > fail its probe() function.
> > > > 
> > > > dma_set_(coherent_)mask() checks swiotlb_dma_supported(), which
> > > > returns false if io_tlb_end goes beyond the device mask. So we just
> > > > need to ensure that the io_tlb buffer is allocated within ZONE_DMA.
> > > 
> > > Makes sense for dma_set_mask. Why do you do the same thing for
> > > coherent_mask? Shouldn't that check against ZONE_DMA instead?
> > 
> > It depends on the meaning of coherent_dma_mask. As I understand it,
> > that's the dma mask used by dma_alloc_coherent() to return a memory
> > buffer that the device is able to access. I don't see it much different
> > from the dma_mask used by the streaming API. I guess some hardware could
> > have different masks here if they have cache coherency only for certain
> > memory ranges (and the coherent_dma_mask would be smaller than dma_mask).
> 
> The point I was trying to make is a different one: For the streaming mapping,
> you can allow any mask as long as the swiotlb is able to create a bounce
> buffer.

dma_mask is a hardware parameter and is used by the streaming DMA API
implementation (which could be swiotlb) to decide whether to bounce or
not. The streaming DMA API can take any CPU address as input,
independent of the dma_mask, and bounce accordingly. If it doesn't have
a bounce buffer matching the dma_mask, dma_supported() would return
false.
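
Simplified from lib/swiotlb.c (not the exact code; map_single() is
swiotlb's internal bounce helper), the two checks look roughly like:

	static dma_addr_t map_or_bounce(struct device *dev, phys_addr_t paddr,
					size_t size, enum dma_data_direction dir)
	{
		dma_addr_t dev_addr = phys_to_dma(dev, paddr);

		/* CPU address already within dma_mask: no bouncing needed */
		if (dma_capable(dev, dev_addr, size))
			return dev_addr;

		/* otherwise copy through the pre-allocated io_tlb buffer */
		return phys_to_dma(dev, map_single(dev, paddr, size, dir));
	}

	int swiotlb_dma_supported(struct device *hwdev, u64 mask)
	{
		/* fail if the end of the bounce buffer is out of reach */
		return phys_to_dma(hwdev, io_tlb_end - 1) <= mask;
	}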

> This does not work for the coherent mapping, because that by definition
> cannot use bounce buffers.

Yes, but most of the time these masks have the same value.

> > > > > For dma_set_coherent_mask(), we also have to fail any call that
> > > > > tries to set a mask larger than what the device hardware can do.
> > > > > Unlike that, dma_set_mask() can succeed with any mask, we just
> > > > > have to enable swiotlb if the mask that the driver wants is
> > > > > larger than what the hardware can do.
> > > > 
> > > > Currently we can't satisfy any arbitrarily small dma mask even with
> > > > swiotlb since the bounce buffer is just guaranteed to be in ZONE_DMA.
> > > > Swiotlb allows for smaller masks but we need to reserve the io_tlb
> > > > buffer early during boot and at smaller addresses. For example,
> > > > swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA) and if
> > > > the coherent_dma_mask isn't matched, it frees the pages and falls back
> > > > to the io_tlb buffer. However, I don't think it's worth going for masks
> > > > smaller than 32-bit on arm64.
> > > 
> > > Is that safe for noncoherent systems? I'd expect the io_tlb buffer
> > > to be cached there, which means we can't use it for coherent allocations.
> > 
> > For non-coherent systems, the current arm64 approach is to take the
> > address returned by swiotlb_alloc_coherent() and remap it as Normal
> > NonCacheable (see the arm64 __dma_alloc_noncoherent()). The CPU should
> > only access the dma buffer via the non-cacheable mapping.
> 
> Doesn't that require the allocated page to be mapped using small pages
> in the linear mapping?

Not in the linear mapping but in the vmalloc space (for arm64).

> As far as I understand it, you are not allowed
> to create a noncacheable mapping for a page that also has a cacheable
> mapping, or did that change between armv7 and armv8?

No change between ARMv7 and ARMv8. In fact, ARMv7 has clarified what
"unpredictable" means when you mix different attributes. Basically, we
can mix the same memory type (Normal Cacheable vs Normal NonCacheable)
but if the cacheability differs, we need to do some cache maintenance
before accessing the non-cacheable mapping for the first time (clean
and invalidate potentially dirty lines that could otherwise be evicted
randomly on top of the data). Any
speculatively loaded cache lines via the cacheable mapping would not be
hit when accessed via the non-cacheable one.
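
On arm64 that's the flush of the linear alias in __dma_alloc_noncoherent()
before the buffer is handed out; condensed (and from memory):

	/* ptr is the cacheable (linear map) alias of the buffer */
	void *ptr = page_address(page);

	/*
	 * Clean+invalidate any potentially dirty lines covering the buffer
	 * so nothing can be evicted on top of data later written via the
	 * non-cacheable alias.
	 */
	__dma_flush_range(ptr, ptr + size);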

> For CMA, we use the trick to change the mapping in place, but that
> doesn't work for normal pages, which are mapped using huge tlbs.

That's true for arm32 and we could do the same for arm64, though I
don't think it's worth it because we would break very large mappings
(e.g. 1GB). We have plenty of VA space, so we just create a new
non-coherent mapping (this latter part could be optimised, even
generically, to use huge pages where possible).
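
Condensed from the current arm64 __dma_alloc_noncoherent() (error
handling dropped):

	struct page *page = virt_to_page(ptr);
	struct page **map;
	void *coherent_ptr;
	int i;

	map = kmalloc(sizeof(struct page *) << get_order(size), flags);
	for (i = 0; i < (size >> PAGE_SHIFT); i++)
		map[i] = page + i;

	/*
	 * New Normal NonCacheable alias in the vmalloc area; the linear
	 * mapping and its huge page entries are left untouched.
	 */
	coherent_ptr = vmap(map, size >> PAGE_SHIFT, VM_MAP,
			    pgprot_dmacoherent(pgprot_default));
	kfree(map);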

> > Of course, another approach would be to change the cacheability
> > attributes of the io_tlb buffer (or the CMA one) but current approach
> > has the advantage that the io_tlb buffer can be used for both coherent
> > and non-coherent devices.
> 
> That definitely shouldn't be done without performance testing. However,
> I wonder if any such tests were done in the first place for the swiotlb:
> it's possible that it's cheaper to have the bounce buffer for noncoherent
> devices mapped noncacheable, so we do a cache-bypassing copy into and out
> of the bounce buffer and save the extra flushes.

It doesn't do this AFAICT, but it's an interesting idea. Well, something
to be optimised when we can benchmark it on hardware.

One issue though is when some (or all) devices are coherent. In this
case, even if you use a bounce buffer, it needs to be mapped coherently
at the CPU level (e.g. the device has its own cache that can be snooped,
so CPU accesses need to be cacheable).

-- 
Catalin