> -----Original Message-----
> From: T.J. Mercier <tjmerc...@google.com>
> Sent: Wednesday, May 21, 2025 10:01 AM
> To: wangtao <tao.wang...@honor.com>
> Cc: Christian König <christian.koe...@amd.com>; sumit.sem...@linaro.org;
> benjamin.gaign...@collabora.com; brian.star...@arm.com; jstu...@google.com;
> linux-me...@vger.kernel.org; dri-devel@lists.freedesktop.org;
> linaro-mm-...@lists.linaro.org; linux-ker...@vger.kernel.org;
> wangbintian(BintianWang) <bintian.w...@honor.com>;
> yipengxiang <yipengxi...@honor.com>; liulu 00013167 <liulu....@honor.com>;
> hanfeng 00012985 <feng....@honor.com>
> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
>
> On Mon, May 19, 2025 at 9:06 PM wangtao <tao.wang...@honor.com> wrote:
> >
> > > -----Original Message-----
> > > From: wangtao
> > > Sent: Monday, May 19, 2025 8:04 PM
> > > To: 'T.J. Mercier' <tjmerc...@google.com>; Christian König <christian.koe...@amd.com>
> > > Cc: sumit.sem...@linaro.org; benjamin.gaign...@collabora.com;
> > > brian.star...@arm.com; jstu...@google.com; linux-me...@vger.kernel.org;
> > > dri-devel@lists.freedesktop.org; linaro-mm-...@lists.linaro.org;
> > > linux-ker...@vger.kernel.org; wangbintian(BintianWang) <bintian.w...@honor.com>;
> > > yipengxiang <yipengxi...@honor.com>; liulu 00013167 <liulu....@honor.com>;
> > > hanfeng 00012985 <feng....@honor.com>
> > > Subject: RE: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
> > >
> > > > -----Original Message-----
> > > > From: T.J. Mercier <tjmerc...@google.com>
> > > > Sent: Saturday, May 17, 2025 2:37 AM
> > > > To: Christian König <christian.koe...@amd.com>
> > > > Cc: wangtao <tao.wang...@honor.com>; sumit.sem...@linaro.org;
> > > > benjamin.gaign...@collabora.com; brian.star...@arm.com; jstu...@google.com;
> > > > linux-me...@vger.kernel.org; dri-devel@lists.freedesktop.org;
> > > > linaro-mm-...@lists.linaro.org; linux-ker...@vger.kernel.org;
> > > > wangbintian(BintianWang) <bintian.w...@honor.com>;
> > > > yipengxiang <yipengxi...@honor.com>; liulu 00013167 <liulu....@honor.com>;
> > > > hanfeng 00012985 <feng....@honor.com>
> > > > Subject: Re: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE
> > > > for system_heap
> > > >
> > > > On Fri, May 16, 2025 at 1:36 AM Christian König <christian.koe...@amd.com> wrote:
> > > > >
> > > > > On 5/16/25 09:40, wangtao wrote:
> > > > > >
> > > > > >> -----Original Message-----
> > > > > >> From: Christian König <christian.koe...@amd.com>
> > > > > >> Sent: Thursday, May 15, 2025 10:26 PM
> > > > > >> To: wangtao <tao.wang...@honor.com>; sumit.sem...@linaro.org;
> > > > > >> benjamin.gaign...@collabora.com; brian.star...@arm.com;
> > > > > >> jstu...@google.com; tjmerc...@google.com
> > > > > >> Cc: linux-me...@vger.kernel.org; dri-devel@lists.freedesktop.org;
> > > > > >> linaro-mm-...@lists.linaro.org; linux-ker...@vger.kernel.org;
> > > > > >> wangbintian(BintianWang) <bintian.w...@honor.com>;
> > > > > >> yipengxiang <yipengxi...@honor.com>; liulu 00013167 <liulu....@honor.com>;
> > > > > >> hanfeng 00012985 <feng....@honor.com>
> > > > > >> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> > > > > >> DMA_BUF_IOCTL_RW_FILE for system_heap
> > > > > >>
> > > > > >> On 5/15/25 16:03, wangtao wrote:
> > > > > >>> [wangtao] My Test Configuration (CPU 1GHz, 5-test average):
> > > > > >>> Allocation: 32x32MB buffer creation
> > > > > >>> - dmabuf 53ms vs. udmabuf 694ms (10X slower)
> > > > > >>> - Note: shmem shows excessive allocation time
> > > > > >>
> > > > > >> Yeah, that is something already noted by others as well. But
> > > > > >> that is orthogonal.
> > > > > >>
> > > > > >>> Read 1024MB File:
> > > > > >>> - dmabuf direct 326ms vs. udmabuf direct 461ms (40% slower)
> > > > > >>> - Note: pin_user_pages_fast consumes the majority of CPU cycles
> > > > > >>>
> > > > > >>> Key function call timing: See details below.
> > > > > >>
> > > > > >> Those aren't valid, you are comparing different functionalities here.
> > > > > >>
> > > > > >> Please try using udmabuf with sendfile() as confirmed to be working by T.J.
> > > > > > [wangtao] Using buffered I/O for dmabuf file reads/writes requires one
> > > > > > memory copy. Direct I/O removes this copy and enables zero-copy. The
> > > > > > sendfile system call reduces memory copies from two (read/write) to one.
> > > > > > However, with udmabuf, sendfile still keeps at least one copy, so it is
> > > > > > not zero-copy.
> > > > >
> > > > > Then please work on fixing this.
> > > > >
> > > > > Regards,
> > > > > Christian.
> > > > >
> > > > > > If udmabuf sendfile uses buffered I/O (file page cache), read
> > > > > > latency matches dmabuf buffered read, but allocation time is much longer.
> > > > > > With direct I/O, the default 16-page pipe size makes it slower than
> > > > > > buffered I/O.
> > > > > >
> > > > > > Test data shows:
> > > > > > udmabuf direct read is much faster than udmabuf sendfile.
> > > > > > dmabuf direct read outperforms udmabuf direct read by a large margin.
> > > > > >
> > > > > > Issue: After udmabuf is mapped via map_dma_buf, apps using
> > > > > > memfd or udmabuf for direct I/O might cause errors, but there
> > > > > > are no safeguards to prevent this.
> > > > > >
> > > > > > Allocate 32x32MB buffers and read a 1024MB file test:
> > > > > > Metric                 | alloc (ms) | read (ms) | total (ms)
> > > > > > -----------------------|------------|-----------|-----------
> > > > > > udmabuf buffer read    | 539        | 2017      | 2555
> > > > > > udmabuf direct read    | 522        | 658       | 1179
> > > >
> > > > I can't reproduce the part where udmabuf direct reads are faster
> > > > than buffered reads. That's the opposite of what I'd expect.
> > > > Something seems wrong with those buffered reads.
> > > >
> > > > > > udmabuf buffer sendfile| 505        | 1040      | 1546
> > > > > > udmabuf direct sendfile| 510        | 2269      | 2780
> > > >
> > > > I can reproduce the 3.5x slower udmabuf direct sendfile compared
> > > > to udmabuf direct read. It's a pretty disappointing result, so it
> > > > seems like something could be improved there.
> > > >
> > > > 1G from ext4 on 6.12.17 | read/sendfile (ms)
> > > > ------------------------|-------------------
> > > > udmabuf buffer read     | 351
> > > > udmabuf direct read     | 540
> > > > udmabuf buffer sendfile | 255
> > > > udmabuf direct sendfile | 1990
> > > >
> > > [wangtao] By the way, did you clear the file cache during testing?
> > > Looking at your data again, buffered read and sendfile are faster
> > > than direct I/O, which suggests the file cache wasn't cleared. If
> > > you didn't clear the file cache, the results are not a fair or
> > > reliable reference. On embedded devices, it's nearly impossible to
> > > keep multi-GB files stably cached.
> > > If such files could be cached, we might as well cache dmabufs directly
> > > and save the time spent creating dmabufs and reading file data.
> > > You can call posix_fadvise(file_fd, 0, len, POSIX_FADV_DONTNEED)
> > > after opening the file or before closing it to clear the file cache,
> > > ensuring actual file I/O operations are tested.
> > >
> > [wangtao] Please confirm whether cache clearing was performed during testing.
> > I reduced the test scope from 3GB to 1GB. While the results without cache
> > clearing are broadly consistent, udmabuf buffered read remains slower
> > than direct read. Comparative data:
> >
> > Your test reading 1GB (ext4 on 6.12.17):
> > Method                  | read/sendfile (ms) | read vs. (%)
> > ----------------------------------------------------------
> > udmabuf buffer read     | 351                | 138%
> > udmabuf direct read     | 540                | 212%
> > udmabuf buffer sendfile | 255                | 100%
> > udmabuf direct sendfile | 1990               | 780%
> >
> > My 3.5GHz tests (f2fs):
> > Without cache clearing:
> > Method                  | alloc | read | vs. (%)
> > -----------------------------------------------
> > udmabuf buffer read     | 140   | 386  | 310%
> > udmabuf direct read     | 151   | 326  | 262%
> > udmabuf buffer sendfile | 136   | 124  | 100%
> > udmabuf direct sendfile | 132   | 892  | 717%
> > dmabuf buffer read      | 23    | 154  | 124%
> > patch direct read       | 29    | 271  | 218%
> >
> > With cache clearing:
> > Method                  | alloc | read | vs. (%)
> > -----------------------------------------------
> > udmabuf buffer read     | 135   | 546  | 180%
> > udmabuf direct read     | 159   | 300  | 99%
> > udmabuf buffer sendfile | 134   | 303  | 100%
> > udmabuf direct sendfile | 141   | 912  | 301%
> > dmabuf buffer read      | 22    | 362  | 119%
> > patch direct read       | 29    | 265  | 87%
> >
> > Results without cache clearing aren't representative for embedded
> > mobile devices. Notably, on a low-power CPU @1GHz, sendfile latency
> > without cache clearing exceeds the dmabuf direct I/O read time.
> >
> > Without cache clearing:
> > Method                  | alloc | read | vs. (%)
> > -----------------------------------------------
> > udmabuf buffer read     | 546   | 1745 | 442%
> > udmabuf direct read     | 511   | 704  | 178%
> > udmabuf buffer sendfile | 496   | 395  | 100%
> > udmabuf direct sendfile | 498   | 2332 | 591%
> > dmabuf buffer read      | 43    | 453  | 115%
> > my patch direct read    | 49    | 310  | 79%
> >
> > With cache clearing:
> > Method                  | alloc | read | vs. (%)
> > -----------------------------------------------
> > udmabuf buffer read     | 552   | 2067 | 198%
> > udmabuf direct read     | 540   | 627  | 60%
> > udmabuf buffer sendfile | 497   | 1045 | 100%
> > udmabuf direct sendfile | 527   | 2330 | 223%
> > dmabuf buffer read      | 40    | 1111 | 106%
> > my patch direct read    | 44    | 310  | 30%
> >
> > Reducing CPU overhead/power consumption is critical for mobile devices.
> > We need simpler and more efficient dmabuf direct I/O support.
> >
> > Since Christian evaluated sendfile performance based on your data, could
> > you confirm whether the cache was cleared? If not, please share the
> > post-cache-clearing test data. Thank you for your support.
>
> Yes sorry, I was out yesterday riding motorcycles. I did not clear the cache for
> the buffered reads, I didn't realize you had. The IO plus the copy certainly
> explains the difference.
>
> Your point about the unlikelihood of any of that data being in the cache also
> makes sense.

[wangtao] Thank you for testing and clarifying.
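For reference, the cache-clearing step suggested above can look like this. It
is only a minimal sketch under the thread's assumptions: drop_file_cache(),
file_fd and len are illustrative names, not part of any patch; the point is
simply to flush and evict the backing file's page cache before the timed read
so real disk I/O is measured.

  #include <fcntl.h>
  #include <unistd.h>

  /* Drop the page cache for [0, len) of file_fd; returns nonzero on failure. */
  static int drop_file_cache(int file_fd, off_t len)
  {
          /* Write back dirty pages first so DONTNEED can actually evict them. */
          if (fsync(file_fd))
                  return -1;
          /* Advise the kernel to discard this file's cached pages. */
          return posix_fadvise(file_fd, 0, len, POSIX_FADV_DONTNEED);
  }

Calling this right after opening the backing file (or dropping caches
system-wide with 3 > drop_caches, as in the table below) puts the buffered and
direct paths on equal footing.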
> I'm not sure it changes anything about the ioctl approach though.
> Another way to do this would be to move the (optional) support for direct IO
> into the exporter via dma_buf_fops and dma_buf_ops. Then normal read()
> syscalls would just work for buffers that support them.
> I know that's more complicated, but at least it doesn't require inventing new
> uapi to do it.
>
[wangtao] Thank you for the discussion. I fully support any method that enables
dmabuf direct I/O. I understand that using sendfile/splice between a regular
file and a dmabuf adds an extra CPU copy, preventing zero-copy. For example,
the sendfile path is: [DISK] → DMA → [page cache] → CPU copy → [memory file].

The read() syscall has no way to pass a regular file fd along with the dmabuf,
so I added an ioctl command. copy_file_range() does take two fds
(fd_in/fd_out), but it blocks cross-filesystem use. Even without this
restriction, file_out->f_op->copy_file_range only enables dmabuf direct reads
from regular files, not writes.

Since dmabuf's direct I/O limitation comes from its unique attachment/map/fence
model and no existing syscall fits that model, adding an ioctl seems necessary.
When system exporters return a duplicated sg_table via map_dma_buf (used
exclusively, like a pages array), they should retain control over it.

I welcome all solutions that achieve dmabuf direct I/O! Your feedback is
greatly appreciated.

> 1G from ext4 on 6.12.20 | read/sendfile (ms) w/ 3 > drop_caches
> ------------------------|-------------------
> udmabuf buffer read     | 1210
> udmabuf direct read     | 671
> udmabuf buffer sendfile | 1096
> udmabuf direct sendfile | 2340
>
> > > > > > dmabuf buffer read    | 51         | 1068      | 1118
> > > > > > dmabuf direct read    | 52         | 297       | 349
> > > > > >
> > > > > > udmabuf sendfile test steps:
> > > > > > 1. Open data file (1024MB), get back_fd
> > > > > > 2. Create memfd (32MB)  # Loop steps 2-6
> > > > > > 3. Allocate udmabuf with memfd
> > > > > > 4. Call sendfile(memfd, back_fd)
> > > > > > 5. Close memfd after sendfile
> > > > > > 6. Close udmabuf
> > > > > > 7. Close back_fd
> > > > > >
> > > > > >> Regards,
> > > > > >> Christian.
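To make the quoted udmabuf sendfile test steps concrete, here is a rough
userspace sketch of that loop. It is only an illustration of steps 1-7 as I
read them, not the actual test program from the thread: the data.bin file name,
the per-chunk helper, and the loop structure are assumptions, error handling is
omitted, and it uses the stock UDMABUF_CREATE uapi from <linux/udmabuf.h>
rather than anything added by this patch series.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/sendfile.h>
  #include <unistd.h>
  #include <linux/udmabuf.h>

  #define CHUNK (32UL << 20)              /* 32MB per memfd/udmabuf */

  static int read_chunk_via_sendfile(int udmabuf_dev, int back_fd, off_t off)
  {
          /* Step 2: memfd that will back the udmabuf; udmabuf requires sealing. */
          int memfd = memfd_create("udmabuf-src", MFD_ALLOW_SEALING);
          ftruncate(memfd, CHUNK);
          fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

          /* Step 3: wrap the memfd pages in a udmabuf. */
          struct udmabuf_create create = {
                  .memfd = memfd, .offset = 0, .size = CHUNK,
          };
          int buf_fd = ioctl(udmabuf_dev, UDMABUF_CREATE, &create);

          /* Step 4: pull 32MB of file data into the memfd pages. */
          sendfile(memfd, back_fd, &off, CHUNK);

          /* Step 5: the memfd can be dropped; the udmabuf keeps the pages alive. */
          close(memfd);
          return buf_fd;                  /* caller closes it (step 6) */
  }

  int main(void)
  {
          int udmabuf_dev = open("/dev/udmabuf", O_RDWR);
          int back_fd = open("data.bin", O_RDONLY);       /* step 1 */

          for (off_t off = 0; off < (1024UL << 20); off += CHUNK) {
                  int buf_fd = read_chunk_via_sendfile(udmabuf_dev, back_fd, off);
                  close(buf_fd);                          /* step 6 */
          }

          close(back_fd);                                 /* step 7 */
          close(udmabuf_dev);
          return 0;
  }

Whatever the original test did around sealing and timing, the structure here
follows the quoted steps: one 32MB memfd/udmabuf pair per iteration, filled via
sendfile() from back_fd, so the copy from the page cache into the memfd pages
is where the extra CPU work discussed above shows up.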