> -----Original Message-----
> From: T.J. Mercier <tjmerc...@google.com>
> Sent: Saturday, May 17, 2025 2:37 AM
> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> DMA_BUF_IOCTL_RW_FILE for system_heap
>
> On Fri, May 16, 2025 at 1:36 AM Christian König <christian.koe...@amd.com>
> wrote:
> >
> > On 5/16/25 09:40, wangtao wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Christian König <christian.koe...@amd.com>
> > >> Sent: Thursday, May 15, 2025 10:26 PM
> > >> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> > >> DMA_BUF_IOCTL_RW_FILE for system_heap
> > >>
> > >> On 5/15/25 16:03, wangtao wrote:
> > >>> [wangtao] My Test Configuration (CPU 1GHz, 5-test average):
> > >>> Allocation: 32x32MB buffer creation
> > >>> - dmabuf 53ms vs. udmabuf 694ms (10X slower)
> > >>> - Note: shmem shows excessive allocation time
> > >>
> > >> Yeah, that is something already noted by others as well. But that
> > >> is orthogonal.
> > >>
> > >>>
> > >>> Read 1024MB File:
> > >>> - dmabuf direct 326ms vs. udmabuf direct 461ms (40% slower)
> > >>> - Note: pin_user_pages_fast consumes majority CPU cycles
> > >>>
> > >>> Key function call timing: See details below.
> > >>
> > >> Those aren't valid, you are comparing different functionalities here.
> > >>
> > >> Please try using udmabuf with sendfile() as confirmed to be working
> > >> by T.J.
> > > [wangtao] Using buffer IO with dmabuf file read/write requires one
> > > memory copy. Direct IO removes this copy to enable zero-copy. The
> > > sendfile system call reduces memory copies from two (read/write) to
> > > one. However, with udmabuf, sendfile still keeps at least one copy,
> > > so zero-copy is not achieved.
> >
> > Then please work on fixing this.
> >
> > Regards,
> > Christian.
> >
> > >
> > > If udmabuf sendfile uses buffer IO (file page cache), read latency
> > > matches dmabuf buffer read, but allocation time is much longer.
> > > With Direct IO, the default 16-page pipe size makes it slower than
> > > buffer IO.
> > >
> > > Test data shows:
> > > udmabuf direct read is much faster than udmabuf sendfile.
> > > dmabuf direct read outperforms udmabuf direct read by a large margin.
> > >
> > > Issue: After udmabuf is mapped via map_dma_buf, apps using memfd or
> > > udmabuf for Direct IO might cause errors, but there are no
> > > safeguards to prevent this.
> > >
> > > Allocate 32x32MB buffer and read 1024MB file test:
> > > Metric                  | alloc (ms) | read (ms) | total (ms)
> > > ------------------------|------------|-----------|-----------
> > > udmabuf buffer read     | 539        | 2017      | 2555
> > > udmabuf direct read     | 522        | 658       | 1179
>
> I can't reproduce the part where udmabuf direct reads are faster than
> buffered reads. That's the opposite of what I'd expect. Something seems
> wrong with those buffered reads.
>
[wangtao] Buffer read requires an extra CPU memory copy, and our device's
low CPU performance makes that copy expensive, which leads to the longer
latency. On a high-performance 3.5GHz CPU the ratio improves, but buffered
reads still lag behind direct I/O.
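For reference, the two read paths compared here boil down to roughly the
following userspace code (a minimal sketch, not the actual test program:
the file path, chunk handling and the already-mmap'ed dmabuf/udmabuf
target buffer are assumed, and O_DIRECT additionally requires
sector-aligned buffers, lengths and offsets):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>

/* Buffered: [DISK] -> DMA -> [page cache] -> CPU copy -> buf */
static ssize_t buffered_read(const char *path, void *buf, size_t len)
{
        int fd = open(path, O_RDONLY);
        ssize_t ret;

        if (fd < 0)
                return -1;
        ret = read(fd, buf, len);  /* extra kernel memcpy from page cache */
        close(fd);
        return ret;
}

/* Direct: [DISK] -> DMA -> buf (buffer pages pinned, e.g. pin_user_pages_fast) */
static ssize_t direct_read(const char *path, void *buf, size_t len)
{
        int fd = open(path, O_RDONLY | O_DIRECT);
        ssize_t ret;

        if (fd < 0)
                return -1;
        ret = read(fd, buf, len);  /* buf/len/offset must be aligned */
        close(fd);
        return ret;
}

With buf pointing into the mmap'ed udmabuf (or dmabuf) memory, the only
difference between the two timings is the O_DIRECT flag and the page-cache
copy it avoids.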
Tests used single-threaded programs with 32MB readahead to minimize
latency (embedded mobile devices usually use <= 2MB readahead).

Test results (time in ms; RD = read, SF = sendfile):

|                   | little core @1GHz | big core @3.5GHz |
|                   | alloc  | read     | alloc  | read    |
|-------------------|--------|----------|--------|---------|
| udmabuf buffer RD | 543    | 2078     | 135    | 549     |
| udmabuf direct RD | 543    | 640      | 163    | 291     |
| udmabuf buffer SF | 494    | 1058     | 137    | 315     |
| udmabuf direct SF | 529    | 2335     | 143    | 909     |
| dmabuf buffer RD  | 39     | 1077     | 23     | 349     |
| patch direct RD   | 51     | 306      | 30     | 267     |

> > > udmabuf buffer sendfile | 505        | 1040      | 1546
> > > udmabuf direct sendfile | 510        | 2269      | 2780
>
> I can reproduce the 3.5x slower udmabuf direct sendfile compared to
> udmabuf direct read. It's a pretty disappointing result, so it seems like
> something could be improved there.
>
> 1G from ext4 on 6.12.17 | read/sendfile (ms)
> ------------------------|-------------------
> udmabuf buffer read     | 351
> udmabuf direct read     | 540
> udmabuf buffer sendfile | 255
> udmabuf direct sendfile | 1990
>
[wangtao] Key observations:
1. Direct sendfile underperforms because the small pipe buffers and
   memory-file pages require more DMA operations.
2. ext4 vs. f2fs: ext4 supports hugepages/larger folios (unlike f2fs).
   Mobile devices mostly use f2fs, which hurts performance there.

I/O path comparison:
- Buffer read:     [DISK] → DMA → [page cache]  → CPU copy → [memory file]
- Direct read:     [DISK] → DMA → [memory file]
- Buffer sendfile: [DISK] → DMA → [page cache]  → CPU copy → [memory file]
- Direct sendfile: [DISK] → DMA → [pipe buffer] → CPU copy → [memory file]

The extra CPU copy and the pipe-size limitation explain the performance gap.

> > > dmabuf buffer read      | 51         | 1068      | 1118
> > > dmabuf direct read      | 52         | 297       | 349
> > >
> > > udmabuf sendfile test steps:
> > > 1. Open data file (1024MB), get back_fd
> > > 2. Create memfd (32MB)            # Loop steps 2-6
> > > 3. Allocate udmabuf with memfd
> > > 4. Call sendfile(memfd, back_fd)
> > > 5. Close memfd after sendfile
> > > 6. Close udmabuf
> > > 7. Close back_fd
> > >
> > >>
> > >> Regards,
> > >> Christian.
> > >
> >
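For completeness, the udmabuf sendfile test steps quoted above correspond
to roughly the following loop (a sketch under assumptions: the data file
name is made up, error handling is dropped, and the buffered-I/O variant
is shown; the direct variant would open the data file with O_DIRECT):

#define _GNU_SOURCE             /* memfd_create(), F_ADD_SEALS */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>
#include <linux/udmabuf.h>

#define BUF_SZ  (32UL << 20)    /* 32MB per buffer */
#define FILE_SZ (1024UL << 20)  /* 1024MB data file */

int main(void)
{
        int back_fd = open("data.bin", O_RDONLY);          /* 1. back_fd */
        int dev_fd = open("/dev/udmabuf", O_RDWR);
        off_t off = 0;

        while (off < (off_t)FILE_SZ) {                     /* loop steps 2-6 */
                struct udmabuf_create create = { 0 };
                int memfd, dmabuf_fd;

                memfd = memfd_create("ubuf", MFD_ALLOW_SEALING);   /* 2 */
                ftruncate(memfd, BUF_SZ);
                fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);  /* required by udmabuf */

                create.memfd  = memfd;                     /* 3. udmabuf from memfd */
                create.offset = 0;
                create.size   = BUF_SZ;
                dmabuf_fd = ioctl(dev_fd, UDMABUF_CREATE, &create);

                sendfile(memfd, back_fd, &off, BUF_SZ);    /* 4. back_fd -> memfd */

                close(memfd);                              /* 5 */
                close(dmabuf_fd);                          /* 6 */
        }

        close(back_fd);                                    /* 7 */
        close(dev_fd);
        return 0;
}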