> -----Original Message-----
> From: T.J. Mercier <tjmerc...@google.com>
> Sent: Saturday, May 17, 2025 2:37 AM
> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> DMA_BUF_IOCTL_RW_FILE for system_heap
>
> On Fri, May 16, 2025 at 1:36 AM Christian König <christian.koe...@amd.com>
> wrote:
> >
> > On 5/16/25 09:40, wangtao wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Christian König <christian.koe...@amd.com>
> > >> Sent: Thursday, May 15, 2025 10:26 PM
> > >> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> > >> DMA_BUF_IOCTL_RW_FILE for system_heap
> > >>
> > >> On 5/15/25 16:03, wangtao wrote:
> > >>> [wangtao] My Test Configuration (CPU 1GHz, 5-test average):
> > >>> Allocation: 32x32MB buffer creation
> > >>> - dmabuf 53ms vs. udmabuf 694ms (10X slower)
> > >>> - Note: shmem shows excessive allocation time
> > >>
> > >> Yeah, that is something already noted by others as well. But that
> > >> is orthogonal.
> > >>
> > >>>
> > >>> Read 1024MB File:
> > >>> - dmabuf direct 326ms vs. udmabuf direct 461ms (40% slower)
> > >>> - Note: pin_user_pages_fast consumes majority CPU cycles
> > >>>
> > >>> Key function call timing: See details below.
> > >>
> > >> Those aren't valid, you are comparing different functionalities here.
> > >>
> > >> Please try using udmabuf with sendfile() as confirmed to be working
> > >> by T.J.
> > > [wangtao] Using buffer IO with dmabuf file read/write requires one
> > > memory copy. Direct IO removes this copy to enable zero-copy. The
> > > sendfile system call reduces memory copies from two (read/write) to
> > > one. However, with udmabuf, sendfile still keeps at least one copy,
> > > so zero-copy is not achieved.
> >
> > Then please work on fixing this.
> >
> > Regards,
> > Christian.
> >
> > >
> > > If udmabuf sendfile uses buffer IO (file page cache), read latency
> > > matches dmabuf buffer read, but allocation time is much longer.
> > > With Direct IO, the default 16-page pipe size makes it slower than
> > > buffer IO.
> > >
> > > Test data shows:
> > > udmabuf direct read is much faster than udmabuf sendfile.
> > > dmabuf direct read outperforms udmabuf direct read by a large margin.
> > >
> > > Issue: After udmabuf is mapped via map_dma_buf, apps using memfd or
> > > udmabuf for Direct IO might cause errors, but there are no
> > > safeguards to prevent this.
> > >
> > > Allocate 32x32MB buffer and read 1024MB file test:
> > > Metric                  | alloc (ms) | read (ms) | total (ms)
> > > ------------------------|------------|-----------|-----------
> > > udmabuf buffer read     | 539        | 2017      | 2555
> > > udmabuf direct read     | 522        | 658       | 1179
>
> I can't reproduce the part where udmabuf direct reads are faster than
> buffered reads. That's the opposite of what I'd expect. Something seems
> wrong with those buffered reads.
>
[wangtao] Buffer read requires an extra CPU memory copy, and our device's
low CPU performance makes that copy expensive, which leads to the longer
latency. On a high-performance 3.5GHz CPU the ratio improves, but buffered
reads still lag behind direct I/O.
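For reference, the two read paths compared here boil down to roughly the
following userspace code (a minimal sketch, not the actual test program:
the file path, chunk handling and the already-mmap'ed dmabuf/udmabuf
target buffer are assumed, and O_DIRECT additionally requires
sector-aligned buffers, lengths and offsets):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>

/* Buffered: [DISK] -> DMA -> [page cache] -> CPU copy -> buf */
static ssize_t buffered_read(const char *path, void *buf, size_t len)
{
        int fd = open(path, O_RDONLY);
        ssize_t ret;

        if (fd < 0)
                return -1;
        ret = read(fd, buf, len);  /* extra kernel memcpy from page cache */
        close(fd);
        return ret;
}

/* Direct: [DISK] -> DMA -> buf (buffer pages pinned, e.g. pin_user_pages_fast) */
static ssize_t direct_read(const char *path, void *buf, size_t len)
{
        int fd = open(path, O_RDONLY | O_DIRECT);
        ssize_t ret;

        if (fd < 0)
                return -1;
        ret = read(fd, buf, len);  /* buf/len/offset must be aligned */
        close(fd);
        return ret;
}

With buf pointing into the mmap'ed udmabuf (or dmabuf) memory, the only
difference between the two timings is the O_DIRECT flag and the page-cache
copy it avoids.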
Tests used single-threaded programs with 32MB readahead to minimize
latency (embedded mobile devices usually use <= 2MB readahead).

Test results (time in ms; RD = read, SF = sendfile):

|                   | little core @1GHz | big core @3.5GHz |
|                   | alloc  | read     | alloc  | read    |
|-------------------|--------|----------|--------|---------|
| udmabuf buffer RD | 543    | 2078     | 135    | 549     |
| udmabuf direct RD | 543    | 640      | 163    | 291     |
| udmabuf buffer SF | 494    | 1058     | 137    | 315     |
| udmabuf direct SF | 529    | 2335     | 143    | 909     |
| dmabuf buffer RD  | 39     | 1077     | 23     | 349     |
| patch direct RD   | 51     | 306      | 30     | 267     |

> > > udmabuf buffer sendfile | 505        | 1040      | 1546
> > > udmabuf direct sendfile | 510        | 2269      | 2780
>
> I can reproduce the 3.5x slower udmabuf direct sendfile compared to
> udmabuf direct read. It's a pretty disappointing result, so it seems like
> something could be improved there.
>
> 1G from ext4 on 6.12.17 | read/sendfile (ms)
> ------------------------|-------------------
> udmabuf buffer read     | 351
> udmabuf direct read     | 540
> udmabuf buffer sendfile | 255
> udmabuf direct sendfile | 1990
>
[wangtao] Key observations:
1. Direct sendfile underperforms because the small pipe buffers and
   memory-file pages require more DMA operations.
2. ext4 vs. f2fs: ext4 supports hugepages/larger folios (unlike f2fs).
   Mobile devices mostly use f2fs, which hurts performance there.

I/O path comparison:
- Buffer read:     [DISK] → DMA → [page cache]  → CPU copy → [memory file]
- Direct read:     [DISK] → DMA → [memory file]
- Buffer sendfile: [DISK] → DMA → [page cache]  → CPU copy → [memory file]
- Direct sendfile: [DISK] → DMA → [pipe buffer] → CPU copy → [memory file]

The extra CPU copy and the pipe-size limitation explain the performance gap.

> > > dmabuf buffer read      | 51         | 1068      | 1118
> > > dmabuf direct read      | 52         | 297       | 349
> > >
> > > udmabuf sendfile test steps:
> > > 1. Open data file (1024MB), get back_fd
> > > 2. Create memfd (32MB)            # Loop steps 2-6
> > > 3. Allocate udmabuf with memfd
> > > 4. Call sendfile(memfd, back_fd)
> > > 5. Close memfd after sendfile
> > > 6. Close udmabuf
> > > 7. Close back_fd
> > >
> > >>
> > >> Regards,
> > >> Christian.
> > >
> >
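For completeness, the udmabuf sendfile test steps quoted above correspond
to roughly the following loop (a sketch under assumptions: the data file
name is made up, error handling is dropped, and the buffered-I/O variant
is shown; the direct variant would open the data file with O_DIRECT):

#define _GNU_SOURCE             /* memfd_create(), F_ADD_SEALS */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>
#include <linux/udmabuf.h>

#define BUF_SZ  (32UL << 20)    /* 32MB per buffer */
#define FILE_SZ (1024UL << 20)  /* 1024MB data file */

int main(void)
{
        int back_fd = open("data.bin", O_RDONLY);          /* 1. back_fd */
        int dev_fd = open("/dev/udmabuf", O_RDWR);
        off_t off = 0;

        while (off < (off_t)FILE_SZ) {                     /* loop steps 2-6 */
                struct udmabuf_create create = { 0 };
                int memfd, dmabuf_fd;

                memfd = memfd_create("ubuf", MFD_ALLOW_SEALING);   /* 2 */
                ftruncate(memfd, BUF_SZ);
                fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);  /* required by udmabuf */

                create.memfd  = memfd;                     /* 3. udmabuf from memfd */
                create.offset = 0;
                create.size   = BUF_SZ;
                dmabuf_fd = ioctl(dev_fd, UDMABUF_CREATE, &create);

                sendfile(memfd, back_fd, &off, BUF_SZ);    /* 4. back_fd -> memfd */

                close(memfd);                              /* 5 */
                close(dmabuf_fd);                          /* 6 */
        }

        close(back_fd);                                    /* 7 */
        close(dev_fd);
        return 0;
}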