On Sun, 26 May 2002, José Fonseca wrote:
>
> The vertex data alone (no textures here) can be several MBs per frame

Yes, yes, I realize that the cumulative sizes are big. The question is not
the absolute size, but the size of "one bunch".

> Throwing some numbers just to get a rough idea: 2[MB/frame] x
> 25[frames/second] / 4[kB/buffer] = 12800 buffers/second.

The thing is, if you're doing processing of vertices, I wouldn't be
surprised if you're better off re-using an 8kB buffer over and over and
doing 6400 system call entries, than trying to buffer up 2MB and then
doing just 25 system call entries.

Sure, in one case you do 6400 system calls, and in the other case you do
only 25, so people who are afraid of system calls think that "obviously"
the 25 system calls must be faster.

But that "obviously" is just wrong. Pretty much all modern CPUs handle
big working sets badly, and handle nice tight loops very well.
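
In code, the difference is basically this (a sketch only: the
SUBMIT_VERTS ioctl number and the device fd are made up for
illustration, this is not the real DRI interface):

    #include <stddef.h>
    #include <string.h>
    #include <sys/ioctl.h>

    #define CHUNK        (8 * 1024)
    #define SUBMIT_VERTS 0x4008     /* hypothetical ioctl number */

    static char buf[CHUNK];         /* re-used, so it stays cache-hot */

    void submit(int fd, const char *verts, size_t len)
    {
        size_t off, n;

        for (off = 0; off < len; off += n) {
            n = (len - off < CHUNK) ? len - off : CHUNK;
            memcpy(buf, verts + off, n);    /* dst stays in L1 */
            ioctl(fd, SUBMIT_VERTS, buf);
        }
    }

The 2MB version does the same memcpy() work in total, but both the copy
and the kernel end up walking a working set that doesn't fit in any
cache.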

Just to take a non-graphics-related example: on my machine, lmbench
reports that I get pipe bandwidths that sometimes exceed 1GB/s.

At the same time, a normal "memcpy()" goes along at 625MB/s.

In short: according to that benchmark it is _faster_ to copy data from one
process to another through a pipe than it is to use memcpy() within one
process.

That's obviously a load of bull, and yet lmbench isn't really lying. The
reason the pipe throughput is higher than the memory copy throughput is
simply that the pipe data is chunked up into 4kB pieces, and because the
pipe benchmark re-uses its source and destination buffers in 64kB chunks,
you get much better cache behaviour.
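
You can reproduce the effect with nothing but memcpy() itself: copy the
same total amount of data once as one big streaming copy, and once in
4kB pieces between a pair of re-used 64kB buffers. Something like this
(an untested sketch, and the numbers will obviously vary by machine):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define TOTAL  (64 * 1024 * 1024)
    #define WINDOW (64 * 1024)
    #define CHUNK  (4 * 1024)
    #define PASSES 8

    int main(void)
    {
        char *big_src = malloc(TOTAL), *big_dst = malloc(TOTAL);
        static char src[WINDOW], dst[WINDOW];
        size_t off;
        clock_t t;
        int i;

        if (!big_src || !big_dst)
            return 1;
        memset(big_src, 1, TOTAL);
        memset(big_dst, 0, TOTAL);

        t = clock();
        for (i = 0; i < PASSES; i++)
            memcpy(big_dst, big_src, TOTAL);  /* streams through memory */
        printf("big memcpy:   %.2fs\n",
               (double)(clock() - t) / CLOCKS_PER_SEC);

        t = clock();
        for (i = 0; i < PASSES; i++)
            for (off = 0; off < TOTAL; off += CHUNK)
                memcpy(dst + off % WINDOW, src + off % WINDOW, CHUNK);
        printf("64kB windows: %.2fs\n",
               (double)(clock() - t) / CLOCKS_PER_SEC);

        /* touch the data so the copies can't be optimized away */
        printf("(%d)\n", big_dst[0] + dst[0]);
        free(big_src);
        free(big_dst);
        return 0;
    }

The second loop copies exactly as many bytes, but after the first pass
its 128kB of buffers just sit in the cache.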

(In fact, even TCP beats a plain memcpy() occasionally, which also says
that the Linux TCP layer is an impressive piece of work: it gets the same
cache advantage, but has far more per-packet work to do.)

> I'm not very familiar with these issues but won't this number of ioctls
> per second create a significant overhead here? Or would the benefits of
> having each buffer fit in the cache (facilitating the copy) prevail?

A hot system call takes about 0.2 us on an Athlon (it takes significantly
longer on a P4, something I keep beating up Intel over). The ioctl path
goes through slightly more layers, but we're not talking huge numbers
here. System calls are fast enough that you're better off trying to keep
stuff in the cache than trying to minimize system calls.

(The "memcpy()" example is a perfect example of something where _zero_
system calls is slower than two system calls and a task switch, simply
because the zero system call example ends up being all noncached).
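
If you want to put your own number on the system call overhead, it's
trivial to measure: time a few million of the cheapest call you can
find and divide. A rough sketch (the raw syscall() is there to bypass
any caching in the C library; on an Athlon this should land around the
0.2 us mentioned above):

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define N 1000000

    int main(void)
    {
        struct timeval a, b;
        int i;

        gettimeofday(&a, NULL);
        for (i = 0; i < N; i++)
            syscall(SYS_getpid);    /* about the cheapest kernel entry */
        gettimeofday(&b, NULL);

        printf("%.3f us per syscall\n",
               ((b.tv_sec - a.tv_sec) * 1e6 +
                (b.tv_usec - a.tv_usec)) / N);
        return 0;
    }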

Note that the cache issues show up on the instruction side too, especially
on the P4, which has a fairly small trace cache. You can often make things
go faster by simplifying and streamlining the code rather than trying to
be clever and having a big footprint. Ask Keith Packard about the X frame
buffer code and this very issue some day.

NOTE NOTE NOTE! The tradeoffs are seldom all that clear. Sometimes big
buffers and few system calls are better. Sometimes they aren't. It just
depends on a lot of things.

                                Linus

