On 5/22/25 10:02, wangtao wrote:
>> -----Original Message-----
>> From: Christian König <christian.koe...@amd.com>
>> Sent: Wednesday, May 21, 2025 7:57 PM
>> To: wangtao <tao.wang...@honor.com>; T.J. Mercier <tjmerc...@google.com>
>> Cc: sumit.sem...@linaro.org; benjamin.gaign...@collabora.com;
>> brian.star...@arm.com; jstu...@google.com; linux-me...@vger.kernel.org;
>> dri-devel@lists.freedesktop.org; linaro-mm-...@lists.linaro.org;
>> linux-ker...@vger.kernel.org; wangbintian(BintianWang) <bintian.w...@honor.com>;
>> yipengxiang <yipengxi...@honor.com>; liulu 00013167 <liulu....@honor.com>;
>> hanfeng 00012985 <feng....@honor.com>; amir7...@gmail.com
>> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE
>> for system_heap
>>
>> On 5/21/25 12:25, wangtao wrote:
>>> [wangtao] I previously explained that read/sendfile/splice/copy_file_range
>>> syscalls can't achieve dmabuf direct IO zero-copy.
>>
>> And why can't you work on improving those syscalls instead of creating a
>> new IOCTL?
>>
> [wangtao] As I mentioned in previous emails, these syscalls cannot
> achieve dmabuf zero-copy due to technical constraints.

Yeah, and why can't you work on removing those technical constraints?

What is blocking you from improving the sendfile system call or proposing a
patch to remove the copy_file_range restrictions?

Regards,
Christian.

> Could you specify the technical points, code, or principles that need
> optimization?
>
> Let me explain again why these syscalls can't work:
>
> 1. read() syscall
>    - dmabuf fops lacks a read callback implementation. Even if one were
>      implemented, the file_fd info cannot be transferred.
>    - read(file_fd, dmabuf_ptr, len) with a remap_pfn_range-based mmap
>      cannot access the dmabuf pages, forcing buffer-mode reads.
>
> 2. sendfile() syscall
>    - Requires a CPU copy from the page cache to the memory file (tmpfs/shmem):
>      [DISK] --DMA--> [page cache] --CPU copy--> [MEMORY file]
>    - CPU overhead (both buffer and direct modes involve copies):
>      55.08% do_sendfile
>      |- 55.08% do_splice_direct
>      |-|- 55.08% splice_direct_to_actor
>      |-|-|- 22.51% copy_splice_read
>      |-|-|-|- 16.57% f2fs_file_read_iter
>      |-|-|-|-|- 15.12% __iomap_dio_rw
>      |-|-|- 32.33% direct_splice_actor
>      |-|-|-|- 32.11% iter_file_splice_write
>      |-|-|-|-|- 28.42% vfs_iter_write
>      |-|-|-|-|-|- 28.42% do_iter_write
>      |-|-|-|-|-|-|- 28.39% shmem_file_write_iter
>      |-|-|-|-|-|-|-|- 24.62% generic_perform_write
>      |-|-|-|-|-|-|-|-|- 18.75% __pi_memmove
>
> 3. splice() requires one end to be a pipe, which is incompatible with
>    regular files or dmabuf.
>
> 4. copy_file_range()
>    - Blocked by cross-FS restrictions (Amir's commit 868f9f2f8e00).
>    - Even without this restriction, implementing the copy_file_range
>      callback in dmabuf fops would only allow dmabuf reads from regular
>      files. This is because copy_file_range dispatches through
>      file_out->f_op->copy_file_range, which cannot support dmabuf
>      writes to regular files.
>
> Test results confirm these limitations:
>
> T.J. Mercier's 1G from ext4 on 6.12.20 | read/sendfile (ms) w/ 3 drop_caches
> ----------------------------------------|------------------------------------
> udmabuf buffer read                     | 1210
> udmabuf direct read                     |  671
> udmabuf buffer sendfile                 | 1096
> udmabuf direct sendfile                 | 2340
>
> My 3GHz CPU tests (cache cleared):
> Method                  | alloc | read | vs. (%)
> -----------------------------------------------
> udmabuf buffer read     |   135 |  546 | 180%
> udmabuf direct read     |   159 |  300 |  99%
> udmabuf buffer sendfile |   134 |  303 | 100%
> udmabuf direct sendfile |   141 |  912 | 301%
> dmabuf buffer read      |    22 |  362 | 119%
> my patch direct read    |    29 |  265 |  87%
>
> My 1GHz CPU tests (cache cleared):
> Method                  | alloc | read | vs. (%)
> -----------------------------------------------
> udmabuf buffer read     |   552 | 2067 | 198%
> udmabuf direct read     |   540 |  627 |  60%
> udmabuf buffer sendfile |   497 | 1045 | 100%
> udmabuf direct sendfile |   527 | 2330 | 223%
> dmabuf buffer read      |    40 | 1111 | 106%
> patch direct read       |    44 |  310 |  30%
>
> Test observations align with expectations:
> 1. dmabuf buffer read requires slow CPU copies.
> 2. udmabuf direct read achieves zero-copy but has page-retrieval
>    latency from the vaddr.
> 3. udmabuf buffer sendfile suffers CPU copy overhead.
> 4. udmabuf direct sendfile combines CPU copies with frequent DMA
>    operations due to small pipe buffers.
> 5. dmabuf buffer read also requires CPU copies.
> 6. My direct read patch enables zero-copy with better performance
>    on low-power CPUs.
> 7. udmabuf creation time remains problematic (as you've noted).
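
To make limitation 4 above concrete: the cross-FS restriction can be observed
directly from userspace. A minimal sketch, assuming a source file on a disk
filesystem (the path is illustrative) and a tmpfs-backed memfd as the
destination; on kernels carrying commit 868f9f2f8e00 the generic cross-fs
fallback is rejected with EXDEV:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        /* Source file on a disk filesystem (e.g. ext4/f2fs); path is illustrative. */
        const char *path = argc > 1 ? argv[1] : "data.bin";
        int src = open(path, O_RDONLY);
        /* Destination backed by tmpfs/shmem, i.e. a different filesystem. */
        int dst = memfd_create("bounce", 0);

        if (src < 0 || dst < 0) {
                perror("open/memfd_create");
                return 1;
        }

        /* Cross-filesystem copy: with the restriction in place this fails. */
        ssize_t n = copy_file_range(src, NULL, dst, NULL, 1 << 20, 0);
        if (n < 0 && errno == EXDEV)
                printf("copy_file_range across filesystems rejected: EXDEV\n");
        else if (n < 0)
                perror("copy_file_range");
        else
                printf("copied %zd bytes\n", n);

        close(src);
        close(dst);
        return 0;
}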

>>> My focus is enabling dmabuf direct I/O for [regular file] <--DMA-->
>>> [dmabuf] zero-copy.
>>
>> Yeah, and that focus is wrong. You need to work on a general solution
>> to the issue and not something specific to your problem.
>>
>>> Any API achieving this would work. Are there other uAPIs you think
>>> could help? Could you recommend experts who might offer suggestions?
>>
>> Well once more: Either work on sendfile or copy_file_range or eventually
>> splice to make it do what you want.
>>
>> When that is done we can discuss with the VFS people whether that
>> approach is feasible.
>>
>> But just bypassing the VFS review by implementing a DMA-buf specific
>> IOCTL is a NO-GO. That is clearly not something you can do in any way.
>
> [wangtao] The issue is that only dmabuf lacks Direct I/O zero-copy
> support. Tmpfs/shmem already work with Direct I/O zero-copy. As
> explained, existing syscalls or generic methods can't enable dmabuf
> direct I/O zero-copy, which is why I propose adding an IOCTL command.
>
> I respect your perspective. Could you clarify the specific technical
> aspects, code requirements, or implementation principles for modifying
> sendfile() or copy_file_range()? This would help advance our discussion.
>
> Thank you for engaging in this dialogue.
>
>> Regards,
>> Christian.
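
As a rough illustration of the interface under discussion, userspace usage of
such an ioctl might look like the sketch below. Only the ioctl name comes from
the patch subject; the request struct, its field names, the ioctl number, and
the direction flag are assumptions made for illustration and are not taken
from the actual patch. The dma-heap allocation part is the existing uAPI.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Hypothetical request layout; the real patch defines its own struct. */
struct dma_buf_rw_file_req {
        int32_t  file_fd;      /* regular file to read from (or write to) */
        uint32_t flags;        /* assumed: 0 = read file into dmabuf */
        uint64_t file_offset;  /* offset within the regular file */
        uint64_t buf_offset;   /* offset within the dmabuf */
        uint64_t len;          /* bytes to transfer */
};

/* Placeholder encoding ('b' is the dma-buf ioctl magic); the real number
 * is whatever the patch assigns. */
#define DMA_BUF_IOCTL_RW_FILE _IOWR('b', 0x80, struct dma_buf_rw_file_req)

int main(void)
{
        /* Allocate a dmabuf from the system heap (existing dma-heap uAPI). */
        int heap = open("/dev/dma_heap/system", O_RDONLY);
        struct dma_heap_allocation_data alloc = {
                .len = 64UL << 20,              /* 64 MiB, illustrative */
                .fd_flags = O_RDWR | O_CLOEXEC,
        };
        if (heap < 0 || ioctl(heap, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0) {
                perror("dma-heap alloc");
                return 1;
        }

        /* O_DIRECT so a direct-I/O read could DMA straight into the dmabuf
         * pages rather than through the page cache (the behaviour the
         * proposal aims for). */
        int src = open("data.bin", O_RDONLY | O_DIRECT);
        if (src < 0) {
                perror("open source file");
                return 1;
        }

        struct dma_buf_rw_file_req req = {
                .file_fd = src,
                .flags = 0,
                .file_offset = 0,
                .buf_offset = 0,
                .len = alloc.len,
        };

        /* On a kernel without the patch this fails with ENOTTY. */
        if (ioctl(alloc.fd, DMA_BUF_IOCTL_RW_FILE, &req) < 0)
                perror("DMA_BUF_IOCTL_RW_FILE");

        close(src);
        close(alloc.fd);
        close(heap);
        return 0;
}

The point of contention in the thread is exactly this last step: whether such
a transfer should be expressed through a dmabuf-specific ioctl or through the
existing VFS paths (sendfile, splice, copy_file_range).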