On Mon, Nov 2, 2015 at 10:11 PM, Daniel Kurtz <djku...@chromium.org> wrote:
>
> +Tomasz, so he can reply to the thread
> +Marek and Russell as recommended by Tomasz
>
> On Oct 30, 2015 22:27, "Robin Murphy" <robin.mur...@arm.com> wrote:
> >
> > Hi Dan,
> >
> > On 30/10/15 01:17, Daniel Kurtz wrote:
> >>
> >> +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
> >> v4l2-contig's usage of the DMA API.
> >>
> >> Hi Robin,
> >>
> >> On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.mur...@arm.com> wrote:
> >>>
> >>> On 26/10/15 13:44, Yong Wu wrote:
> >>>>
> >>>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
> >>>> [...]
> >>>>>
> >>>>> +/*
> >>>>> + * The DMA API client is passing in a scatterlist which could describe
> >>>>> + * any old buffer layout, but the IOMMU API requires everything to be
> >>>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of
> >>>>> + * impedance-matching, to be able to hand off a suitably-aligned list,
> >>>>> + * but still preserve the original offsets and sizes for the caller.
> >>>>> + */
> >>>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >>>>> +		int nents, int prot)
> >>>>> +{
> >>>>> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> >>>>> +	struct iova_domain *iovad = domain->iova_cookie;
> >>>>> +	struct iova *iova;
> >>>>> +	struct scatterlist *s, *prev = NULL;
> >>>>> +	dma_addr_t dma_addr;
> >>>>> +	size_t iova_len = 0;
> >>>>> +	int i;
> >>>>> +
> >>>>> +	/*
> >>>>> +	 * Work out how much IOVA space we need, and align the segments to
> >>>>> +	 * IOVA granules for the IOMMU driver to handle. With some clever
> >>>>> +	 * trickery we can modify the list in-place, but reversibly, by
> >>>>> +	 * hiding the original data in the as-yet-unused DMA fields.
> >>>>> +	 */
> >>>>> +	for_each_sg(sg, s, nents, i) {
> >>>>> +		size_t s_offset = iova_offset(iovad, s->offset);
> >>>>> +		size_t s_length = s->length;
> >>>>> +
> >>>>> +		sg_dma_address(s) = s->offset;
> >>>>> +		sg_dma_len(s) = s_length;
> >>>>> +		s->offset -= s_offset;
> >>>>> +		s_length = iova_align(iovad, s_length + s_offset);
> >>>>> +		s->length = s_length;
> >>>>> +
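For concreteness, here is a minimal standalone sketch of the per-segment
munging above (not the kernel code itself; the 4K granule and the segment
offset/length values are made up for illustration). The caller's real
offset/length get stashed in the DMA fields, while s->offset and s->length
are aligned down/up to IOVA granules for the IOMMU layer:

#include <stdio.h>
#include <stddef.h>

#define GRANULE 4096UL /* assumed IOVA granule size */

struct fake_sg {
	size_t offset, length;       /* what the IOMMU layer will see */
	size_t dma_address, dma_len; /* where the real values are stashed */
};

int main(void)
{
	struct fake_sg s = { .offset = 0x1200, .length = 0x800 };
	size_t s_offset = s.offset & (GRANULE - 1); /* iova_offset() */

	/* Stash the caller's real offset/length in the DMA fields... */
	s.dma_address = s.offset;
	s.dma_len = s.length;
	/* ...and hand the IOMMU layer a granule-aligned view instead. */
	s.offset -= s_offset;
	s.length = (s.length + s_offset + GRANULE - 1) & ~(GRANULE - 1); /* iova_align() */

	/* Prints offset 0x1000, length 0x1000: one whole granule covering
	 * the caller's 0x800 bytes at offset 0x1200. */
	printf("IOMMU view: offset 0x%zx, length 0x%zx\n", s.offset, s.length);
	printf("stashed:    offset 0x%zx, length 0x%zx\n", s.dma_address, s.dma_len);
	return 0;
}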
> >>>>> +		/*
> >>>>> +		 * The simple way to avoid the rare case of a segment
> >>>>> +		 * crossing the boundary mask is to pad the previous one
> >>>>> +		 * to end at a naturally-aligned IOVA for this one's size,
> >>>>> +		 * at the cost of potentially over-allocating a little.

I'd like to know what the boundary mask is and what hardware imposes
requirements like this. The cost here is not only over-allocating a little:
it is making many, many buffers that are contiguously mappable on the CPU
impossible to map contiguously in the IOMMU. That just defeats the purpose
of having an IOMMU, which I believe should be there precisely so that simple
IP blocks that take one DMA address can view the buffer the same way as the
CPU does.

> >>>>> +		 */
> >>>>> +		if (prev) {
> >>>>> +			size_t pad_len = roundup_pow_of_two(s_length);
> >>>>> +
> >>>>> +			pad_len = (pad_len - iova_len) & (pad_len - 1);
> >>>>> +			prev->length += pad_len;
> >>>>
> >>>> Hi Robin,
> >>>>      While running our v4l2 tests, it seems that we hit a problem here.
> >>>> We update prev->length again here; do we need to update
> >>>> sg_dma_len(prev) too?
> >>>>
> >>>> Some functions, like vb2_dc_get_contiguous_size()[1], always use
> >>>> sg_dma_len(s) for comparisons instead of s->length, so they may break
> >>>> unexpectedly when sg_dma_len(s) is not the same as s->length.
> >>>
> >>> This is just tweaking the faked-up length that we hand off to
> >>> iommu_map_sg() (see also the iova_align() above), to trick it into
> >>> bumping this segment up to a suitable starting IOVA. The real length
> >>> at this point is stashed in sg_dma_len(s), and will be copied back
> >>> into s->length in __finalise_sg(), so both will hold the same true
> >>> length once we return to the caller.
> >>>
> >>> Yes, it does mean that if you have a list where the segment lengths
> >>> are page aligned but not monotonically decreasing, e.g. {64k, 16k, 64k},
> >>> then you'll still end up with a gap between the second and third
> >>> segments, but that's fine because the DMA API offers no guarantees
> >>> about what the resulting DMA addresses will be (consider the no-IOMMU
> >>> case where they would each just be "mapped" to their physical address).
> >>> If that breaks v4l, then it's probably v4l's DMA API use that needs
> >>> looking at (again).
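To make that {64k, 16k, 64k} example concrete, here is a standalone
simulation of the pad_len arithmetic (not kernel code; it also assumes the
pad is accounted into the running iova_len, which the quoted hunk is cut
off before showing):

#include <stdio.h>
#include <stddef.h>

static size_t roundup_pow_of_two(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	size_t lengths[] = { 0x10000, 0x4000, 0x10000 }; /* 64K, 16K, 64K */
	size_t iova_len = 0;
	int i;

	for (i = 0; i < 3; i++) {
		if (i > 0) {
			/* Pad the previous segment so this one starts at an
			 * IOVA naturally aligned to its own (power-of-two
			 * rounded) size, so it cannot cross a boundary of
			 * that size. */
			size_t pad_len = roundup_pow_of_two(lengths[i]);

			pad_len = (pad_len - iova_len) & (pad_len - 1);
			iova_len += pad_len;
			printf("segment %d: previous padded by 0x%zx\n", i, pad_len);
		}
		iova_len += lengths[i];
	}
	/* Prints pads of 0x0 and 0xc000: the 16K segment gets padded out to
	 * 64K, leaving the 48K gap between the second and third segments
	 * described above, and a total IOVA span of 0x30000 (3 x 64K). */
	printf("total IOVA span: 0x%zx\n", iova_len);
	return 0;
}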
> >> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
> >> memory pages into a contiguous block in device memory address space.
> >> This would allow passing a dma mapped buffer to device dma using just
> >> a device address and length.
> >
> > Not at all. The streaming DMA API (dma_map_* and friends) has two
> > responsibilities: performing any necessary cache maintenance to ensure
> > the device will correctly see data from the CPU, and the CPU will
> > correctly see data from the device; and working out an address for that
> > buffer from the device's point of view to actually hand off to the
> > hardware (which is perfectly well allowed to fail).

Agreed. The dma_map_*() API is not guaranteed to return a single contiguous
part of the virtual address space for any given SG list. However, it was
understood to be able to map buffers that are contiguously mappable by the
CPU into a single segment, and users, videobuf2-dma-contig in particular,
relied on this.

> > Consider SWIOTLB's implementation - segments which already lie at
> > physical addresses within the device's DMA mask just get passed through,
> > while those that lie outside it get mapped into the bounce buffer, but
> > still as individual allocations (arch code just handles cache
> > maintenance on the resulting physical addresses and can apply any
> > hard-wired DMA offset for the device concerned).

And this is fine for vb2-dma-contig, which was made for devices that
require buffers contiguous in their address space. Without an IOMMU it will
allow only physically contiguous buffers and fail otherwise, which is fine,
because that is a hardware requirement.

> >> IIUC, the change above breaks this model by inserting gaps in how the
> >> buffer is mapped to device memory, such that the buffer is no longer
> >> contiguous in dma address space.
> >
> > Even the existing arch/arm IOMMU DMA code which I guess this implicitly
> > relies on doesn't guarantee that behaviour - if the mapping happens to
> > reach one of the segment length/boundary limits it won't just leave a
> > gap, it'll start an entirely new IOVA allocation which could well start
> > at a wildly different address[0].

Could you explain the segment length/boundary limits and when buffers can
reach them? Sorry, I haven't been following all the discussions, but I'm
not aware of any similar requirements of the IOMMU hardware I have worked
with.

> >> Here is the code in question from
> >> drivers/media/v4l2-core/videobuf2-dma-contig.c :
> >>
> >> static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt)
> >> {
> >>         struct scatterlist *s;
> >>         dma_addr_t expected = sg_dma_address(sgt->sgl);
> >>         unsigned int i;
> >>         unsigned long size = 0;
> >>
> >>         for_each_sg(sgt->sgl, s, sgt->nents, i) {
> >>                 if (sg_dma_address(s) != expected)
> >>                         break;
> >>                 expected = sg_dma_address(s) + sg_dma_len(s);
> >>                 size += sg_dma_len(s);
> >>         }
> >>         return size;
> >> }
> >>
> >> static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
> >>         unsigned long size, enum dma_data_direction dma_dir)
> >> {
> >>         struct vb2_dc_conf *conf = alloc_ctx;
> >>         struct vb2_dc_buf *buf;
> >>         struct frame_vector *vec;
> >>         unsigned long offset;
> >>         int n_pages, i;
> >>         int ret = 0;
> >>         struct sg_table *sgt;
> >>         unsigned long contig_size;
> >>         unsigned long dma_align = dma_get_cache_alignment();
> >>         DEFINE_DMA_ATTRS(attrs);
> >>
> >>         dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs);
> >>
> >>         buf = kzalloc(sizeof *buf, GFP_KERNEL);
> >>         buf->dma_dir = dma_dir;
> >>
> >>         offset = vaddr & ~PAGE_MASK;
> >>         vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE);
> >>         buf->vec = vec;
> >>         n_pages = frame_vector_count(vec);
> >>
> >>         sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
> >>
> >>         ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec),
> >>                 n_pages, offset, size, GFP_KERNEL);
> >>
> >>         sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents,
> >>                 buf->dma_dir, &attrs);
> >>
> >>         contig_size = vb2_dc_get_contiguous_size(sgt);
> >
> > (as an aside, it's rather unintuitive that the handling of the
> > dma_map_sg call actually failing is entirely implicit here)

I'm not sure what you mean, please elaborate. The code considers only the
case of contiguously mapping at least the requested size a success, because
anything else is useless with the hardware.

> >>         if (contig_size < size) {
> >>
> >> <<<=== if the original buffer had sg entries that were not
> >> aligned on the "natural" alignment for their size, the new arm64 iommu
> >> core code inserts a 'gap' in the iommu mapping, which causes
> >> vb2_dc_get_contiguous_size() to exit early (and return a smaller size
> >> than expected).
> >>
> >>                 pr_err("contiguous mapping is too small %lu/%lu\n",
> >>                         contig_size, size);
> >>                 ret = -EFAULT;
> >>                 goto fail_map_sg;
> >>         }
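Feeding such a gapped layout into the walk above shows the failure mode
directly. A standalone sketch (the DMA addresses are made up, not kernel
code), using the earlier {64K, 16K, 64K} list:

#include <stdio.h>

struct seg {
	unsigned long dma_addr;
	unsigned long dma_len;
};

int main(void)
{
	/* The 16K segment was padded to 64K in IOVA space, so the third
	 * segment starts 48K beyond where a contiguous layout needs it. */
	struct seg segs[] = {
		{ 0x100000, 0x10000 }, /* 64K */
		{ 0x110000, 0x4000 },  /* 16K of data, then a 48K gap */
		{ 0x130000, 0x10000 }, /* 64K */
	};
	unsigned long expected = segs[0].dma_addr;
	unsigned long size = 0;
	int i;

	for (i = 0; i < 3; i++) {
		if (segs[i].dma_addr != expected)
			break;
		expected = segs[i].dma_addr + segs[i].dma_len;
		size += segs[i].dma_len;
	}
	/* Prints 0x14000 (80K) although the buffer is 144K long, so the
	 * contig_size < size check above takes the fail_map_sg path. */
	printf("contiguous size: 0x%lx\n", size);
	return 0;
}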
> >>
> >> So, is the videobuf2-dma-contig.c based on an incorrect assumption
> >> about how the DMA API is supposed to work?
> >> Is it even possible to map a "contiguous-in-iova-range" mapping for a
> >> buffer given as an sg_table with an arbitrary set of pages?
> >
> > From the Streaming DMA mappings section of Documentation/DMA-API.txt:
> >
> >   Note also that the above constraints on physical contiguity and
> >   dma_mask may not apply if the platform has an IOMMU (a device which
> >   maps an I/O DMA address to a physical memory address). However, to be
> >   portable, device driver writers may *not* assume that such an IOMMU
> >   exists.
> >
> > There's not strictly any harm in using the DMA API this way and *hoping*
> > you get what you want, as long as you're happy for it to fail pretty
> > much 100% of the time on some systems, and still in a minority of
> > corner cases on any system.

Could you please elaborate? I'd like to see examples, because I can't
really imagine buffers that are mappable contiguously by the CPU but not by
the IOMMU. Also, as I said, the hardware I have worked with didn't suffer
from problems like this.

> > However, if there's a real dependency on IOMMUs and tight control of
> > IOVA allocation here, then the DMA API isn't really the right tool for
> > the job, and maybe it's time to start looking at how to better fit these
> > multimedia-subsystem-type use cases into the IOMMU API - as far as I
> > understand it there's at least some conceptual overlap with the HSA
> > PASID stuff being prototyped in PCI/x86-land at the moment, so it could
> > be an apposite time to try and bang out some common requirements.

The DMA API is actually the only good tool to use here, because it keeps
the videobuf2-dma-contig code away from platform-specific details, e.g. the
presence of an IOMMU. The only thing the code knows is that the target
hardware requires a single contiguous buffer, and it relies on the fact
that in correct cases the buffer given to it will meet this requirement
(i.e. physically contiguous without an IOMMU; CPU-mappable with an IOMMU).

Best regards,
Tomasz