> -----Original Message-----
> From: T.J. Mercier <tjmerc...@google.com>
> Sent: Wednesday, May 21, 2025 10:01 AM
> To: wangtao <tao.wang...@honor.com>
> Cc: Christian König <christian.koe...@amd.com>; sumit.sem...@linaro.org;
> benjamin.gaign...@collabora.com; brian.star...@arm.com;
> jstu...@google.com; linux-me...@vger.kernel.org; dri-
> de...@lists.freedesktop.org; linaro-mm-...@lists.linaro.org; linux-
> ker...@vger.kernel.org; wangbintian(BintianWang)
> <bintian.w...@honor.com>; yipengxiang <yipengxi...@honor.com>; liulu
> 00013167 <liulu....@honor.com>; hanfeng 00012985 <feng....@honor.com>
> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> DMA_BUF_IOCTL_RW_FILE for system_heap
> 
> On Mon, May 19, 2025 at 9:06 PM wangtao <tao.wang...@honor.com>
> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: wangtao
> > > Sent: Monday, May 19, 2025 8:04 PM
> > > To: 'T.J. Mercier' <tjmerc...@google.com>; Christian König
> > > <christian.koe...@amd.com>
> > > Cc: sumit.sem...@linaro.org; benjamin.gaign...@collabora.com;
> > > brian.star...@arm.com; jstu...@google.com;
> > > linux-me...@vger.kernel.org; dri-devel@lists.freedesktop.org;
> > > linaro-mm-...@lists.linaro.org; linux- ker...@vger.kernel.org;
> > > wangbintian(BintianWang) <bintian.w...@honor.com>; yipengxiang
> > > <yipengxi...@honor.com>; liulu
> > > 00013167 <liulu....@honor.com>; hanfeng 00012985
> > > <feng....@honor.com>
> > > Subject: RE: [PATCH 2/2] dmabuf/heaps: implement
> > > DMA_BUF_IOCTL_RW_FILE for system_heap
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: T.J. Mercier <tjmerc...@google.com>
> > > > Sent: Saturday, May 17, 2025 2:37 AM
> > > > To: Christian König <christian.koe...@amd.com>
> > > > Cc: wangtao <tao.wang...@honor.com>; sumit.sem...@linaro.org;
> > > > benjamin.gaign...@collabora.com; brian.star...@arm.com;
> > > > jstu...@google.com; linux-me...@vger.kernel.org; dri-
> > > > de...@lists.freedesktop.org; linaro-mm-...@lists.linaro.org;
> > > > linux- ker...@vger.kernel.org; wangbintian(BintianWang)
> > > > <bintian.w...@honor.com>; yipengxiang <yipengxi...@honor.com>;
> > > > liulu
> > > > 00013167 <liulu....@honor.com>; hanfeng 00012985
> > > <feng....@honor.com>
> > > > Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> > > DMA_BUF_IOCTL_RW_FILE
> > > > for system_heap
> > > >
> > > > On Fri, May 16, 2025 at 1:36 AM Christian König
> > > > <christian.koe...@amd.com>
> > > > wrote:
> > > > >
> > > > > On 5/16/25 09:40, wangtao wrote:
> > > > > >
> > > > > >
> > > > > >> -----Original Message-----
> > > > > >> From: Christian König <christian.koe...@amd.com>
> > > > > >> Sent: Thursday, May 15, 2025 10:26 PM
> > > > > >> To: wangtao <tao.wang...@honor.com>;
> sumit.sem...@linaro.org;
> > > > > >> benjamin.gaign...@collabora.com; brian.star...@arm.com;
> > > > > >> jstu...@google.com; tjmerc...@google.com
> > > > > >> Cc: linux-me...@vger.kernel.org;
> > > > > >> dri-devel@lists.freedesktop.org;
> > > > > >> linaro- mm-...@lists.linaro.org;
> > > > > >> linux-ker...@vger.kernel.org;
> > > > > >> wangbintian(BintianWang) <bintian.w...@honor.com>;
> > > > > >> yipengxiang <yipengxi...@honor.com>; liulu 00013167
> > > > > >> <liulu....@honor.com>; hanfeng
> > > > > >> 00012985 <feng....@honor.com>
> > > > > >> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> > > > > >> DMA_BUF_IOCTL_RW_FILE for system_heap
> > > > > >>
> > > > > >> On 5/15/25 16:03, wangtao wrote:
> > > > > >>> [wangtao] My Test Configuration (CPU 1GHz, 5-test average):
> > > > > >>> Allocation: 32x32MB buffer creation
> > > > > >>> - dmabuf 53ms vs. udmabuf 694ms (10X slower)
> > > > > >>> - Note: shmem shows excessive allocation time
> > > > > >>
> > > > > >> Yeah, that is something already noted by others as well. But
> > > > > >> that is orthogonal.
> > > > > >>
> > > > > >>>
> > > > > >>> Read 1024MB File:
> > > > > >>> - dmabuf direct 326ms vs. udmabuf direct 461ms (40% slower)
> > > > > >>> - Note: pin_user_pages_fast consumes majority CPU cycles
> > > > > >>>
> > > > > >>> Key function call timing: See details below.
> > > > > >>
> > > > > >> Those aren't valid, you are comparing different functionalities 
> > > > > >> here.
> > > > > >>
> > > > > >> Please try using udmabuf with sendfile() as confirmed to be
> > > > > >> working by T.J.
> > > > > > [wangtao] Using buffer IO with dmabuf file read/write requires
> > > > > > one memory copy. Direct IO removes this copy to enable zero-copy.
> > > > > > The sendfile system call reduces memory copies from two
> > > > > > (read/write) to one. However, with udmabuf, sendfile still keeps
> > > > > > at least one copy, so zero-copy is not achieved.
> > > > >
> > > > >
> > > > > Then please work on fixing this.
> > > > >
> > > > > Regards,
> > > > > Christian.
> > > > >
> > > > >
> > > > > >
> > > > > > If udmabuf sendfile uses buffer IO (file page cache), read
> > > > > > latency matches dmabuf buffer read, but allocation time is much
> > > > > > longer. With Direct IO, the default 16-page pipe size makes it
> > > > > > slower than buffer IO.
> > > > > >
> > > > > > Test data shows:
> > > > > > udmabuf direct read is much faster than udmabuf sendfile.
> > > > > > dmabuf direct read outperforms udmabuf direct read by a large
> > > > > > margin.
> > > > > >
> > > > > > Issue: After udmabuf is mapped via map_dma_buf, apps using
> > > > > > memfd or udmabuf for Direct IO might cause errors, but there
> > > > > > are no safeguards to prevent this.
> > > > > >
> > > > > > Test: allocate 32x32MB buffers and read a 1024MB file:
> > > > > > Metric                 | alloc (ms) | read (ms) | total (ms)
> > > > > > -----------------------|------------|-----------|-----------
> > > > > > udmabuf buffer read    | 539        | 2017      | 2555
> > > > > > udmabuf direct read    | 522        | 658       | 1179
> > > >
> > > > I can't reproduce the part where udmabuf direct reads are faster
> > > > than buffered reads. That's the opposite of what I'd expect.
> > > > Something seems wrong with those buffered reads.
> > > >
> > > > > > udmabuf buffer sendfile| 505        | 1040      | 1546
> > > > > > udmabuf direct sendfile| 510        | 2269      | 2780
> > > >
> > > > I can reproduce the 3.5x slower udambuf direct sendfile compared
> > > > to udmabuf direct read. It's a pretty disappointing result, so it
> > > > seems like something could be improved there.
> > > >
> > > > 1G from ext4 on 6.12.17 | read/sendfile (ms)
> > > > ------------------------|-------------------
> > > > udmabuf buffer read     | 351
> > > > udmabuf direct read     | 540
> > > > udmabuf buffer sendfile | 255
> > > > udmabuf direct sendfile | 1990
> > > >
> > > [wangtao] By the way, did you clear the file cache during testing?
> > > Looking at your data again, buffered read and sendfile are faster
> > > than Direct I/O, which suggests the file cache wasn't cleared. If
> > > you didn't clear the file cache, the results are unfair and
> > > unreliable as a reference. On embedded devices, it's nearly
> > > impossible to keep multi-GB files stably cached. If such files
> > > could be cached, we might as well cache dmabufs directly and save
> > > the time spent creating dmabufs and reading file data.
> > > You can call posix_fadvise(file_fd, 0, len, POSIX_FADV_DONTNEED)
> > > after opening the file or before closing it to drop the file cache,
> > > ensuring actual file I/O is measured.
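
[wangtao] For reference, the cache-drop step looks roughly like this
(a minimal sketch; the path is a placeholder and error handling is
omitted):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>

    /* Drop any cached pages of 'path' so the timed test that follows
     * reads from storage instead of the page cache. */
    static void drop_file_cache(const char *path)
    {
            struct stat st;
            int fd = open(path, O_RDONLY);

            if (fd < 0)
                    return;
            fstat(fd, &st);
            fsync(fd);      /* write back dirty pages first, if any */
            posix_fadvise(fd, 0, st.st_size, POSIX_FADV_DONTNEED);
            close(fd);
    }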
> > >
> > [wangtao] Please confirm whether cache clearing was performed during testing.
> > I reduced the test scope from 3GB to 1GB. While my results without cache
> > clearing broadly match yours, udmabuf buffer read remains slower
> > than direct read. Comparative data:
> >
> > Your test reading 1GB (ext4 on 6.12.17):
> > Method                  | read/sendfile (ms) | read vs. (%)
> > ------------------------|--------------------|-------------
> > udmabuf buffer read     | 351                | 138%
> > udmabuf direct read     | 540                | 212%
> > udmabuf buffer sendfile | 255                | 100%
> > udmabuf direct sendfile | 1990               | 780%
> >
> > My 3.5GHz tests (f2fs):
> > Without cache clearing:
> > Method                  | alloc | read | vs. (%)
> > --------------------------------------------------
> > udmabuf buffer read     | 140   | 386  | 310%
> > udmabuf direct read     | 151   | 326  | 262%
> > udmabuf buffer sendfile | 136   | 124  | 100%
> > udmabuf direct sendfile | 132   | 892  | 717%
> > dmabuf buffer read      | 23    | 154  | 124%
> > patch direct read       | 29    | 271  | 218%
> >
> > With cache clearing:
> > Method                  | alloc | read | vs. (%)
> > --------------------------------------------------
> > udmabuf buffer read     | 135   | 546  | 180%
> > udmabuf direct read     | 159   | 300  | 99%
> > udmabuf buffer sendfile | 134   | 303  | 100%
> > udmabuf direct sendfile | 141   | 912  | 301%
> > dmabuf buffer read      | 22    | 362  | 119%
> > patch direct read       | 29    | 265  | 87%
> >
> > Results without cache clearing aren't representative of embedded
> > mobile devices. Notably, on a low-power 1GHz CPU, sendfile latency
> > without cache clearing already exceeds the dmabuf direct I/O read time.
> >
> > Without cache clearing:
> > Method                  | alloc | read | vs. (%)
> > --------------------------------------------------
> > udmabuf buffer read     | 546   | 1745 | 442%
> > udmabuf direct read     | 511   | 704  | 178%
> > udmabuf buffer sendfile | 496   | 395  | 100%
> > udmabuf direct sendfile | 498   | 2332 | 591%
> > dmabuf buffer read      | 43    | 453  | 115%
> > my patch direct read    | 49    | 310  | 79%
> >
> > With cache clearing:
> > Method                  | alloc | read | vs. (%)
> > --------------------------------------------------
> > udmabuf buffer read     | 552   | 2067 | 198%
> > udmabuf direct read     | 540   | 627  | 60%
> > udmabuf buffer sendfile | 497   | 1045 | 100%
> > udmabuf direct sendfile | 527   | 2330 | 223%
> > dmabuf buffer read      | 40    | 1111 | 106%
> > my patch direct read    | 44    | 310  | 30%
> >
> > Reducing CPU overhead/power consumption is critical for mobile devices.
> > We need simpler and more efficient dmabuf direct I/O support.
> >
> > As Christian evaluated sendfile performance based on your data, could
> > you confirm whether the cache was cleared? If not, please share the
> > post-cache-clearing test data. Thank you for your support.
> 
> Yes sorry, I was out yesterday riding motorcycles. I did not clear the
> cache for the buffered reads; I didn't realize you had. The IO plus the
> copy certainly explains the difference.
> 
> Your point about the unlikelihood of any of that data being in the cache also
> makes sense.
[wangtao] Thank you for testing and clarifying.

> 
> I'm not sure it changes anything about the ioctl approach though.
> Another way to do this would be to move the (optional) support for direct IO
> into the exporter via dma_buf_fops and dma_buf_ops. Then normal read()
> syscalls would just work for buffers that support them.
> I know that's more complicated, but at least it doesn't require inventing new
> uapi to do it.
> 
[wangtao] Thank you for the discussion. I fully support any method that enables
dmabuf direct I/O.

I understand that using sendfile/splice between a regular file and a
dmabuf adds an extra CPU copy, preventing zero-copy. For example, the
sendfile path is: [DISK] → DMA → [page cache] → CPU copy → [memory file].
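
The sequence that exercises this path looks roughly like the sketch
below (assumptions: /dev/udmabuf is available, len is page-aligned,
and error handling is omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>
    #include <linux/udmabuf.h>

    /* Fill a udmabuf from back_fd via sendfile(): the data still goes
     * through the page cache and one CPU copy into the memfd pages. */
    static int fill_udmabuf_from_file(int back_fd, size_t len)
    {
            struct udmabuf_create create = { 0 };
            int memfd, dev, dmabuf_fd;
            off_t off = 0;

            memfd = memfd_create("udmabuf-src", MFD_ALLOW_SEALING);
            ftruncate(memfd, len);
            fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);  /* required by udmabuf */

            dev = open("/dev/udmabuf", O_RDWR);
            create.memfd  = memfd;
            create.offset = 0;
            create.size   = len;
            dmabuf_fd = ioctl(dev, UDMABUF_CREATE, &create);

            /* sendfile(out, in, ...): CPU-copies from back_fd's page cache
             * into the memfd pages that back the udmabuf. */
            sendfile(memfd, back_fd, &off, len);

            close(dev);
            close(memfd);
            return dmabuf_fd;
    }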

The read() syscall can't take a regular-file fd as a parameter, so I
added an ioctl command.
While copy_file_range() supports two fds (fd_in/fd_out), it blocks cross-fs use.
Even without this restriction, file_out->f_op->copy_file_range
only enables dmabuf direct reads from regular files, not writes.
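
For reference, the call shape I mean is roughly the sketch below (on
current kernels a copy between different filesystems typically fails
with EXDEV, and a dmabuf fd as fd_out is not supported at all):

    #define _GNU_SOURCE
    #include <unistd.h>

    /* Copy len bytes from fd_in (a regular file) to fd_out. */
    static ssize_t copy_into(int fd_in, int fd_out, size_t len)
    {
            off64_t off_in = 0, off_out = 0;

            return copy_file_range(fd_in, &off_in, fd_out, &off_out, len, 0);
    }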

Since dmabuf's direct I/O limitation stems from its unique
attachment/map/fence model, and no existing syscall fits that model,
adding an ioctl seems necessary.

When a system exporter returns a duplicated sg_table via map_dma_buf
(used exclusively, much like a pages array), it should retain control
over that table.

I welcome all solutions to achieve dmabuf direct I/O! Your feedback
is greatly appreciated.
 
> 1G from ext4 on 6.12.20 | read/sendfile (ms) w/ 3 > drop_caches
> ------------------------|-------------------
> udmabuf buffer read     | 1210
> udmabuf direct read     | 671
> udmabuf buffer sendfile | 1096
> udmabuf direct sendfile | 2340
> 
> 
> 
> >
> > > >
> > > > > > dmabuf buffer read     | 51         | 1068      | 1118
> > > > > > dmabuf direct read     | 52         | 297       | 349
> > > > > >
> > > > > > udmabuf sendfile test steps:
> > > > > > 1. Open data file (1024MB), get back_fd
> > > > > > 2. Create memfd (32MB)        # loop steps 2-6
> > > > > > 3. Allocate udmabuf with memfd
> > > > > > 4. Call sendfile(memfd, back_fd)
> > > > > > 5. Close memfd after sendfile
> > > > > > 6. Close udmabuf
> > > > > > 7. Close back_fd
> > > > > >
> > > > > >>
> > > > > >> Regards,
> > > > > >> Christian.
> > > > > >
> > > > >
> >
