Apologies for interrupting the filesystem/memory experts. Because of dmabuf's attachment/map/fence model, its mmap callback is implemented with remap_pfn_range(), so read(file_fd, dmabuf_ptr, len) only supports buffered I/O and cannot perform Direct I/O zero-copy. Embedded and mobile devices urgently need dmabuf Direct I/O for large-file operations, and prior patches have attempted to provide it.
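To make the limitation concrete, here is a minimal userspace sketch of the failing path (assuming the standard dma-heap UAPI from <linux/dma-heap.h>; the file path is a placeholder and error handling is omitted). Since the dmabuf mapping is created by remap_pfn_range(), the O_DIRECT read below is expected to fail (typically -EFAULT) because the direct I/O path cannot pin PFN-mapped pages; only a buffered read, i.e. page cache plus a CPU copy, works:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/dma-heap.h>

int main(void)
{
	size_t len = 64 << 20;	/* 64 MiB */
	int heap = open("/dev/dma_heap/system", O_RDWR | O_CLOEXEC);

	struct dma_heap_allocation_data alloc = {
		.len = len,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	ioctl(heap, DMA_HEAP_IOCTL_ALLOC, &alloc);	/* alloc.fd is the dmabuf fd */

	/* remap_pfn_range()-based mmap: VM_PFNMAP, no pages visible to GUP */
	void *dmabuf_ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_SHARED, alloc.fd, 0);

	int file_fd = open("/data/big_file", O_RDONLY | O_DIRECT);
	ssize_t n = read(file_fd, dmabuf_ptr, len);
	printf("O_DIRECT read into dmabuf: %zd (%s)\n",
	       n, n < 0 ? strerror(errno) : "ok");
	return 0;
}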
While tmpfs/shmem support Direct I/O zero-copy, dmabuf does not. My patch adds an ioctl command (DMA_BUF_IOCTL_RW_FILE) that performs dmabuf Direct I/O zero-copy and achieves >80% bandwidth utilization even on low-power CPUs. Christian argues that udmabuf plus sendfile/splice/copy_file_range could enable zero-copy, but analysis and testing (detailed in my previous email) show that these syscalls cannot deliver high-performance dmabuf Direct I/O:

1. sendfile(dst_memfile, src_disk): requires a page-cache copy:
   [DISK] --DMA--> [page cache] --CPU copy--> [MEMORY file]
2. splice: requires a pipe endpoint, which is incompatible with regular files and dmabuf.
3. copy_file_range: cross-filesystem copies are prohibited.

Technical question: within the constraints of the fs/mm layers, can sendfile/splice/copy_file_range (or another syscall) be modified to achieve efficient dmabuf Direct I/O zero-copy, and if so, how? Your insights on the required syscall modifications would be invaluable. Thank you for your guidance.
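For comparison, the "udmabuf direct read" case in the tables quoted below corresponds roughly to the path sketched here (assuming the udmabuf UAPI from <linux/udmabuf.h>; the file path is a placeholder and error handling is omitted). Because the buffer is backed by shmem pages, Direct I/O can pin them and DMA straight from disk into the buffer, but creating the udmabuf itself remains comparatively expensive:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

int main(void)
{
	size_t len = 64 << 20;	/* 64 MiB */

	/* udmabuf requires a size-sealed memfd as backing storage */
	int memfd = memfd_create("udmabuf-backing", MFD_ALLOW_SEALING);
	ftruncate(memfd, len);
	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

	struct udmabuf_create create = {
		.memfd  = memfd,
		.flags  = UDMABUF_FLAGS_CLOEXEC,
		.offset = 0,
		.size   = len,
	};
	int udev = open("/dev/udmabuf", O_RDWR | O_CLOEXEC);
	int dmabuf_fd = ioctl(udev, UDMABUF_CREATE, &create);

	/* O_DIRECT read lands in the shmem pages shared with the dmabuf:
	 * zero-copy, no page-cache detour */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 memfd, 0);
	int file_fd = open("/data/big_file", O_RDONLY | O_DIRECT);
	read(file_fd, buf, len);

	/* dmabuf_fd can now be attached/mapped by a device as usual */
	(void)dmabuf_fd;
	return 0;
}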
> -----Original Message-----
> From: Christian König <christian.koe...@amd.com>
> Sent: Thursday, May 22, 2025 7:58 PM
> To: wangtao <tao.wang...@honor.com>; T.J. Mercier <tjmerc...@google.com>
> Cc: sumit.sem...@linaro.org; benjamin.gaign...@collabora.com;
> brian.star...@arm.com; jstu...@google.com; linux-me...@vger.kernel.org;
> dri-devel@lists.freedesktop.org; linaro-mm-...@lists.linaro.org;
> linux-ker...@vger.kernel.org; wangbintian(BintianWang) <bintian.w...@honor.com>;
> yipengxiang <yipengxi...@honor.com>; liulu 00013167 <liulu....@honor.com>;
> hanfeng 00012985 <feng....@honor.com>; amir7...@gmail.com
> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
>
> On 5/22/25 10:02, wangtao wrote:
> >> -----Original Message-----
> >> From: Christian König <christian.koe...@amd.com>
> >> Sent: Wednesday, May 21, 2025 7:57 PM
> >> To: wangtao <tao.wang...@honor.com>; T.J. Mercier <tjmerc...@google.com>
> >> Cc: sumit.sem...@linaro.org; benjamin.gaign...@collabora.com;
> >> brian.star...@arm.com; jstu...@google.com; linux-me...@vger.kernel.org;
> >> dri-devel@lists.freedesktop.org; linaro-mm-...@lists.linaro.org;
> >> linux-ker...@vger.kernel.org; wangbintian(BintianWang) <bintian.w...@honor.com>;
> >> yipengxiang <yipengxi...@honor.com>; liulu 00013167 <liulu....@honor.com>;
> >> hanfeng 00012985 <feng....@honor.com>; amir7...@gmail.com
> >> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
> >>
> >> On 5/21/25 12:25, wangtao wrote:
> >>> [wangtao] I previously explained that read/sendfile/splice/copy_file_range
> >>> syscalls can't achieve dmabuf direct IO zero-copy.
> >>
> >> And why can't you work on improving those syscalls instead of
> >> creating a new IOCTL?
> >>
> > [wangtao] As I mentioned in previous emails, these syscalls cannot
> > achieve dmabuf zero-copy due to technical constraints.
>
> Yeah, and why can't you work on removing those technical constraints?
>
> What is blocking you from improving the sendfile system call or proposing a
> patch to remove the copy_file_range restrictions?
>
> Regards,
> Christian.
>
> > Could you specify the technical points, code, or principles that need
> > optimization?
> >
> > Let me explain again why these syscalls can't work:
> >
> > 1. read() syscall
> >    - dmabuf fops lacks a read callback implementation. Even if implemented,
> >      the file_fd info cannot be transferred.
> >    - read(file_fd, dmabuf_ptr, len) with remap_pfn_range-based mmap
> >      cannot access the dmabuf pages, forcing buffer-mode reads.
> >
> > 2. sendfile() syscall
> >    - Requires a CPU copy from page cache to the memory file (tmpfs/shmem):
> >      [DISK] --DMA--> [page cache] --CPU copy--> [MEMORY file]
> >    - CPU overhead (both buffer and direct modes involve copies):
> >      55.08% do_sendfile
> >      |- 55.08% do_splice_direct
> >      |-|- 55.08% splice_direct_to_actor
> >      |-|-|- 22.51% copy_splice_read
> >      |-|-|-|- 16.57% f2fs_file_read_iter
> >      |-|-|-|-|- 15.12% __iomap_dio_rw
> >      |-|-|- 32.33% direct_splice_actor
> >      |-|-|-|- 32.11% iter_file_splice_write
> >      |-|-|-|-|- 28.42% vfs_iter_write
> >      |-|-|-|-|-|- 28.42% do_iter_write
> >      |-|-|-|-|-|-|- 28.39% shmem_file_write_iter
> >      |-|-|-|-|-|-|-|- 24.62% generic_perform_write
> >      |-|-|-|-|-|-|-|-|- 18.75% __pi_memmove
> >
> > 3. splice() requires one end to be a pipe, incompatible with regular
> >    files or dmabuf.
> >
> > 4. copy_file_range()
> >    - Blocked by cross-FS restrictions (Amir's commit 868f9f2f8e00).
> >    - Even without this restriction, implementing the copy_file_range
> >      callback in dmabuf fops would only allow dmabuf reads from regular
> >      files. This is because copy_file_range relies on
> >      file_out->f_op->copy_file_range, which cannot support dmabuf write
> >      operations to regular files.
> >
> > Test results confirm these limitations:
> >
> > T.J. Mercier's 1G from ext4 on 6.12.20 | read/sendfile (ms) w/ 3 > drop_caches
> > ---------------------------------------|------------------------------
> > udmabuf buffer read                    | 1210
> > udmabuf direct read                    | 671
> > udmabuf buffer sendfile                | 1096
> > udmabuf direct sendfile                | 2340
> >
> > My 3GHz CPU tests (cache cleared):
> > Method                  | alloc | read | vs. (%)
> > ------------------------|-------|------|--------
> > udmabuf buffer read     | 135   | 546  | 180%
> > udmabuf direct read     | 159   | 300  | 99%
> > udmabuf buffer sendfile | 134   | 303  | 100%
> > udmabuf direct sendfile | 141   | 912  | 301%
> > dmabuf buffer read      | 22    | 362  | 119%
> > my patch direct read    | 29    | 265  | 87%
> >
> > My 1GHz CPU tests (cache cleared):
> > Method                  | alloc | read | vs. (%)
> > ------------------------|-------|------|--------
> > udmabuf buffer read     | 552   | 2067 | 198%
> > udmabuf direct read     | 540   | 627  | 60%
> > udmabuf buffer sendfile | 497   | 1045 | 100%
> > udmabuf direct sendfile | 527   | 2330 | 223%
> > dmabuf buffer read      | 40    | 1111 | 106%
> > patch direct read       | 44    | 310  | 30%
> >
> > Test observations align with expectations:
> > 1. dmabuf buffer read requires slow CPU copies.
> > 2. udmabuf direct read achieves zero-copy but has page retrieval latency from vaddr.
> > 3. udmabuf buffer sendfile suffers CPU copy overhead.
> > 4. udmabuf direct sendfile combines CPU copies with frequent DMA operations due to small pipe buffers.
> > 5. dmabuf buffer read also requires CPU copies.
> > 6. My direct read patch enables zero-copy with better performance on low-power CPUs.
> > 7. udmabuf creation time remains problematic (as you've noted).
> >
> >>> My focus is enabling dmabuf direct I/O for [regular file] <--DMA-->
> >>> [dmabuf] zero-copy.
> >>
> >> Yeah and that focus is wrong. You need to work on a general solution
> >> to the issue and not specific to your problem.
> >>
> >>> Any API achieving this would work. Are there other uAPIs you think
> >>> could help? Could you recommend experts who might offer suggestions?
> >>
> >> Well once more: Either work on sendfile or copy_file_range or
> >> eventually splice to make it what you want to do.
> >>
> >> When that is done we can discuss with the VFS people if that approach
> >> is feasible.
> >>
> >> But just bypassing the VFS review by implementing a DMA-buf specific
> >> IOCTL is a NO-GO. That is clearly not something you can do in any way.
>
> > [wangtao] The issue is that only dmabuf lacks Direct I/O zero-copy
> > support. Tmpfs/shmem already work with Direct I/O zero-copy. As
> > explained, existing syscalls or generic methods can't enable dmabuf
> > direct I/O zero-copy, which is why I propose adding an IOCTL command.
> >
> > I respect your perspective. Could you clarify the specific technical
> > aspects, code requirements, or implementation principles for modifying
> > sendfile() or copy_file_range()? This would help advance our discussion.
> >
> > Thank you for engaging in this dialogue.
> >
> >>
> >> Regards,
> >> Christian.