On Wed, May 21, 2014 at 03:43:39PM +0100, Arnd Bergmann wrote:
> On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
> > On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
> > > On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
> > > > So what we need is setting the default dma mask based on the size
> > > > information in dma-ranges to something like this:
> > > > 
> > > >         mask = rounddown_pow_of_two(size) - 1;
> > > >         dma_set_mask(dev, mask); /* or dma_set_mask_and_coherent() */
> > > 
> > > I don't think we should be calling dma_set_mask here, since that would
> > > just go and parse the same property again once we fix the implementation.
> > > 
> > > Since this is a low-level helper, we can probably just assign the dma mask
> > > directly.
> > 
> > I was thinking of calling it in of_dma_configure() as that's where we
> > already set the default dma_mask and coherent_dma_mask. Default to
> > 32-bit if no dma-ranges are present.
> 
> Right. Actually it should also be capped to 32-bit, to allow compatibility
> with drivers that don't call dma_set_mask and can't do 64-bit DMA. This
> is the normal behavior for PCI drivers. They need to set a 64-bit mask
> and check the result.

OK.
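
Something like the sketch below in of_dma_configure() (a rough sketch
only, not actual kernel code; it assumes "size" has already been parsed
from dma-ranges via of_dma_get_range() and that dev->dma_mask already
points at some storage):

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/log2.h>

/* rough sketch; size comes from dma-ranges, 0 if the property is absent */
static void of_dma_set_default_masks(struct device *dev, u64 size)
{
	u64 mask = DMA_BIT_MASK(32);	/* default when no dma-ranges */

	if (size)
		mask = rounddown_pow_of_two(size) - 1;

	/*
	 * Cap at 32-bit for compatibility with drivers that never call
	 * dma_set_mask(); 64-bit capable drivers must raise it (and
	 * check the result) themselves.
	 */
	mask = min_t(u64, mask, DMA_BIT_MASK(32));

	/* assign directly rather than going through dma_set_mask() */
	if (dev->dma_mask)
		*dev->dma_mask = mask;
	dev->coherent_dma_mask = mask;
}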

> > > The point I was trying to make is a different one: For the streaming 
> > > mapping,
> > > you can allow any mask as long as the swiotlb is able to create a bounce
> > > buffer.
> > 
> > dma_mask is a hardware parameter and is used by the streaming DMA API
> > implementation (which could be swiotlb) to decide whether to bounce or
> > not. The streaming DMA API can take any CPU address as input,
> > independent of the dma_mask, and bounce accordingly. If it doesn't have
> > a bounce buffer matching the dma_mask, dma_supported() would return
> > false.
> >
> > > This does not work for the coherent mapping, because that by definition
> > > cannot use bounce buffers.
> > 
> > Yes but most of the time these masks have the same value.
> 
> Let me start again with an example:
> 
> io_tlb_end =    0x1000000; // 16 MB
> dma_mask   =   0x10000000; // 256 MB
> ZONE_DMA   =  0x100000000; // 4 GB
> max_pfn    = 0x1000000000; // 64 GB
> 
> The device driver has set dma_mask and coherent_dma_mask to 256 MB because
> of a limitation in the hardware. This succeeded because io_tlb_end is
> well within the dma mask.
> 
> Since the coherent_dma_mask is smaller than max_pfn, the swiotlb version
> of dma_alloc_coherent then allocates the buffer using GFP_DMA. However,
> that likely returns an address above dma_mask/coherent_dma_mask.

swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA). If the
returned address is not within coherent_dma_mask, it frees the allocated
pages and tries to allocate from its bounce buffer (io_tlb).
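
Heavily simplified, the lib/swiotlb.c logic is something like the
sketch below (alloc_from_io_tlb() is just a hypothetical stand-in for
the internal map_single()/swiotlb_tbl_map_single() fallback path):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/io.h>

/* hypothetical stand-in for swiotlb's internal io_tlb fallback */
static void *alloc_from_io_tlb(struct device *hwdev, size_t size);

static void *swiotlb_alloc_coherent_sketch(struct device *hwdev,
					   size_t size,
					   dma_addr_t *dma_handle,
					   gfp_t flags)
{
	u64 mask = hwdev->coherent_dma_mask;
	int order = get_order(size);
	void *ret = (void *)__get_free_pages(flags | GFP_DMA, order);

	if (ret && phys_to_dma(hwdev, virt_to_phys(ret)) + size - 1 > mask) {
		/* allocation landed above coherent_dma_mask: give it back */
		free_pages((unsigned long)ret, order);
		ret = NULL;
	}

	if (!ret)
		/* second attempt: carve the buffer out of the io_tlb pool */
		ret = alloc_from_io_tlb(hwdev, size);

	if (ret)
		*dma_handle = phys_to_dma(hwdev, virt_to_phys(ret));
	return ret;
}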

> My point was that to fix this case, you must not compare the
> coherent_dma_mask requested by the device against io_tlb_end but against
> ZONE_DMA.

See above, I think swiotlb does the right thing.

We have a problem, however, with CMA, as we assume that the returned
address is within coherent_dma_mask (we need a contiguous_dma_supported
function, similar to swiotlb_dma_supported). We also don't limit the CMA
buffer to ZONE_DMA on arm64.
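
Something modelled on swiotlb_dma_supported() would do, i.e. check that
the top of the device's CMA area is addressable under the mask. A
hypothetical sketch (cma_get_base()/cma_get_size() accessors are
assumed here; the struct cma fields are currently private to
dma-contiguous.c):

#include <linux/dma-contiguous.h>
#include <linux/dma-mapping.h>

static int contiguous_dma_supported(struct device *dev, u64 mask)
{
	struct cma *cma = dev_get_cma_area(dev);
	phys_addr_t end;

	if (!cma)
		return 0;

	end = cma_get_base(cma) + cma_get_size(cma);

	/* same idea as swiotlb_dma_supported(): highest bus address vs mask */
	return phys_to_dma(dev, end - 1) <= mask;
}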

> > > As far as I understand it, you are not allowed
> > > to create a noncacheable mapping for a page that also has a cacheable
> > > mapping, or did that change between armv7 and armv8?
> > 
> > No change between ARMv7 and ARMv8. In fact, ARMv7 has clarified what
> > "unpredictable" means when you mix different attributes. Basically, we
> > can mix the same memory type (Normal Cacheable vs Normal NonCacheable)
> > but if the cacheability is different, we need to do some cache
> > maintenance before accessing the non-cacheable mapping the first time
> > (clean potentially dirty lines that could be evicted randomly). Any
> > speculatively loaded cache lines via the cacheable mapping would not be
> > hit when accessed via the non-cacheable one.
> 
> Ah, only for the first access? I had remembered this differently, thanks
> for the clarification.

Only the first time, and assuming that you no longer write to the
cacheable alias and dirty it again (it's basically like the streaming
DMA API: if you want to access the cacheable alias afterwards, you need
cache maintenance).
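
In code, the rule looks roughly like the sketch below (not what we
currently do verbatim; __flush_dcache_area() is the arm64-internal
clean+invalidate by VA, and vmap() with pgprot_writecombine() gives a
Normal NonCacheable alias on arm64):

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <asm/cacheflush.h>

/* sketch: create a Normal NonCacheable alias of an already-mapped page */
static void *map_noncacheable_alias(struct page *page)
{
	void *cached = page_address(page);

	/*
	 * Clean (and invalidate) the cacheable alias first so that no
	 * dirty line can later be evicted on top of data written via
	 * the non-cacheable mapping.
	 */
	__flush_dcache_area(cached, PAGE_SIZE);

	/* mismatched cacheability but same (Normal) memory type: allowed */
	return vmap(&page, 1, VM_MAP, pgprot_writecombine(PAGE_KERNEL));
}

Writing through the cacheable alias after this point would dirty the
lines again and require the same maintenance before the non-cacheable
mapping can be trusted.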

> > > For CMA, we use the trick to change the mapping in place, but that
> > > doesn't work for normal pages, which are mapped using huge tlbs.
> > 
> > That's true for arm32 and we could do the same for arm64, though I
> > don't think it's worth it because we would have to split very large
> > block mappings (e.g. 1GB). We have plenty of VA space and we just
> > create a new non-coherent mapping (this latter part could be
> > optimised, even generically, to use huge pages where possible).
> 
> I thought we specifically did this on arm32 to avoid the duplicate
> mapping because we had concluded that it would be broken even with the
> updated ARMv7 definition.

We did it because it was vaguely defined as "unpredictable" in the ARM
ARM for a long time until the hw guys realised it's not easy to change
Linux and decided to clarify what could actually happen (see A3.5.7 in
the ARMv7 ARM, "Mismatched memory attributes").

But in the past we used to have coherent DMA buffers mapped as Strongly
Ordered, which is a different memory type from Normal (NonCacheable).
Even in this case, the ARM ARM now states that such a mapping is
permitted, but the Strongly Ordered properties are not guaranteed (it
could simply behave like Normal NonCacheable).

> > > > Of course, another approach would be to change the cacheability
> > > > attributes of the io_tlb buffer (or the CMA one) but current approach
> > > > has the advantage that the io_tlb buffer can be used for both coherent
> > > > and non-coherent devices.
> > > 
> > > That definitely shouldn't be done without performance testing. However
> > > I wonder if such measurements were done in the first place for the
> > > swiotlb: it's possible that it's cheaper to have the bounce buffer for
> > > noncoherent devices mapped noncacheable, so we do cache-bypassing
> > > copies into and out of the bounce buffer and save the extra flushes.
> > 
> > It doesn't do this AFAICT but it's an interesting idea. Well, to be
> > optimised when we can benchmark this on hardware.
> > 
> > One issue though is when some (or all) devices are coherent. In this
> > case, even if you use a bounce buffer, it needs to be mapped coherently
> > at the CPU level (e.g. the device has its own cache that can be snooped,
> > so CPU accesses need to be cacheable).
> 
> Do you mean there are cases where an uncached buffer is not coherent
> with a device cache?

Yes. If, for example, you have a device (e.g. GPU, micro-controller) on
a coherent interconnect, non-cacheable CPU accesses are not guaranteed
to (and probably should not) hit in the other device's cache (it could
be seen as a side-effect of allowing cacheable vs non-cacheable
aliases).

-- 
Catalin