Well, it seems I should give all this a try now... I'll have
a look at display/kgi/ and we'll see...

On Fri, 22 Mar 2002, Brian S. Julin wrote:

> On Fri, 22 Mar 2002, Rodolphe Ortalo wrote:
> > Concerning 1) I guess this is because you think the mmap()-trick
> > and ring buffers idea is not good. IIRC you would rather see a sort of
> > "dmalloc()" resource that would give userspace an area of memory with an
> > "Exec" capability. Plus, you would like the memory to be returned intact
> > to the application after execution.
> 
> >  I see three problems with this. One is probably due to my lack of
> > understanding of kgi resources so I'll skip. The second is: this resource
> > is in fact a "resource factory" that you probably expect to allocate
> > several times (to get several areas). I don't see how to handle this sort
> > of behavior easily (and we need to put some limits on this).
> 
> Hmmm... well, I was thinking more like the resource had a max length
> and you could grow or shrink it within the limits -- simpler than multiple 
> allocations that way.

I am not sure a single (DMA) area would fit your needs: the driver expects
to drive a single resource as a single unit. (At least, that's the simpler
way.) So that single resource would be locked by the kernel for its own use
while the graphics engine executes it.
 However, I expect that the application would like to work concurrently
with the graphics engine, preparing some other part of the display list
while the engine executes. A big performance improvement is achieved when
asynchronous execution on the CPU and the GPU is possible. For that, you
would need either several DMA resources (that can be queued for execution
independently) or some heavy machinery in that single resource. Maybe this
last option is the one you want; but then, in some sense, it would be the
current accel resource.
 For example, currently, the accel buffers' content *is* preserved - it's
just not written down anywhere that we will always preserve it. :-) Plus,
the application does not have much control over the granularity of the
area(s). When allocating the accel resource (via the mmap() trick), the
application can only say: I want a queue of N areas of L pages of memory.
As soon as the application wants to touch area[2], (asynchronous)
execution of area[1] is scheduled first. Plus, the application will
not be able to touch area[1] again before going through area[2]..area[N].
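
To make the pattern concrete, here is a minimal userspace sketch of that
FIFO usage. The device path, the constants and the fault-driven scheduling
are illustrative assumptions, not the actual KGI interface:

/* A hypothetical sketch of the current accel-resource usage pattern:
 * the application mmap()s a queue of N areas of L pages and fills
 * them strictly in FIFO order.  Device path and constants are
 * illustrative, not the real KGI interface. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define N_AREAS    8                    /* N areas in the queue */
#define AREA_BYTES (4 * 4096)           /* L = 4 pages per area */

int main(void)
{
    int fd = open("/dev/graphic", O_RDWR);  /* illustrative node */
    if (fd < 0)
        return 1;

    /* One mapping covering the whole queue; the driver plays the
     * mmap() trick underneath to track which area is "current". */
    uint8_t *queue = mmap(NULL, N_AREAS * AREA_BYTES,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (queue == MAP_FAILED)
        return 1;

    for (int frame = 0; frame < 16; frame++) {
        uint8_t *area = queue + (frame % N_AREAS) * AREA_BYTES;

        /* Touching area[i] faults into the driver, which first
         * schedules area[i-1] for asynchronous execution: the
         * application can never skip ahead or reach back. */
        memset(area, 0, AREA_BYTES);
        /* ... emit drawing commands into 'area' here ... */
    }

    munmap(queue, N_AREAS * AREA_BYTES);
    close(fd);
    return 0;
}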

This "FIFO" behavior seems to be the one that worries you (aside from the
mmap() trick). It seems you would like to say, e.g., "execute
area[1],area[3]" and then muck around into area[2] before saying "execute
area[2]". (The fact that the application be blocked if area[2] is still in
use while trying to access it does not seem to annoy you too much: the
area will become available again as soon as the graphics engine is done
with it.)

With the current scheme, the application only owns, in its memory space,
the memory area that it is working with *now*; the kernel driver owns all
the other ones (either in EXEC or IDLE state).
 Am I right if I say that you would like the opposite: the kernel driver
should only own the memory areas currently being executed by the graphics
engine, and all the other areas should stay mapped in the application's
memory space?
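
To restate that ownership question as a rough sketch (the state names are
invented for this discussion, not taken from the KGI sources):

/* Invented state names for the queue areas, just to frame the
 * question; they are not taken from the KGI sources. */
enum area_state {
    AREA_APP,   /* mapped in the application's address space   */
    AREA_EXEC,  /* kernel-owned, being executed by the engine  */
    AREA_IDLE,  /* kernel-owned, waiting in the queue          */
};

/* Current scheme: exactly one area (the one being filled) belongs
 * to the application; everything else is kernel-owned. */
enum area_state current_scheme(int i, int app_idx, int exec_idx)
{
    if (i == app_idx)  return AREA_APP;
    if (i == exec_idx) return AREA_EXEC;
    return AREA_IDLE;
}

/* The opposite: only the area in execution is kernel-owned; all
 * the others stay mapped in the application. */
enum area_state opposite_scheme(int i, int exec_idx)
{
    return (i == exec_idx) ? AREA_EXEC : AREA_APP;
}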

 BTW, I've never tried to mmap() the accelerator resource several
times... It would solve a lot of things and make this whole
discussion moot... I wonder...


> I assume you mean the KGI kernel module by driver.  If the DMA stream 
> contains security violations, then it is userspace's fault for putting
> them there, so it loses its right to expect the DMA buffer will retain
> data.

OK. That's to be expected.

> I suspect the chipset version compatibility may have many answers 
> depending on the details.  In most cases, the answer would be that
> userspace is responsible for putting the right material in the DMA
> buffer in the first place, but there could be cases where that just
> doesn't work, and if there's no elegant work-around, then worst case
> we have to eat the overhead of a buffer memcpy in the driver.

There is no real memcpy issue here: the driver can walk the DMA stream to
check it before execution (this is optional of course, and it does not
seem to cost that much). It can either fault when the DMA stream is
"illegal" (whatever illegal means), or "passivate" the faulty commands
(replacing them with NOOPs, for example), or even "correct" the commands
(in some rare cases where the correction is totally unambiguous).
 Note that even if this may sound totally awful to speed freaks, I found
it *extremely* useful to have the driver protect me from sending random
bits to the graphics engine. (A faulty application always turns up sooner
or later, and it is very convenient if it does not freeze the graphics
engine.) I'd gladly spend the 5% CPU load needed to do that, even in a
production system.
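
As an illustration of what "walking the stream" can look like, here is a
toy validator. The 32-bit header encoding, the opcode table and all the
names are invented for this example; a real engine (the G400, say) has
its own command format:

#include <stddef.h>
#include <stdint.h>

/* Toy command encoding, invented for this example: a 32-bit header
 * (8-bit opcode, 24-bit operand count) followed by the operands. */
#define OP(hdr)    ((hdr) >> 24)
#define NARGS(hdr) ((hdr) & 0x00ffffffu)
#define OP_NOOP    0x00u

/* Per-chipset table of permitted opcodes; anything security-
 * sensitive (raw register writes, say) simply stays 0. */
static const uint8_t op_allowed[256] = {
    [OP_NOOP] = 1,
    [0x10]    = 1,   /* e.g. a BLIT command */
    [0x11]    = 1,   /* e.g. a LINE command */
};

/* Walk the stream before execution.  Illegal commands are
 * "passivated": header and operands are overwritten with NOOPs so
 * the engine still sees a well-formed stream.  Returns the number
 * of commands that had to be patched. */
int validate_dma_stream(uint32_t *buf, size_t words)
{
    size_t i = 0;
    int patched = 0;

    while (i < words) {
        size_t nargs = NARGS(buf[i]);

        if (nargs >= words - i)     /* would run off the buffer */
            nargs = words - i - 1;

        if (!op_allowed[OP(buf[i])]) {
            for (size_t j = 0; j <= nargs; j++)
                buf[i + j] = OP_NOOP << 24;   /* NOOP, 0 operands */
            patched++;
        }
        i += 1 + nargs;
    }
    return patched;
}

One linear pass over the buffer, no copy: that is roughly where the 5%
figure above would go.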

> Well, if you can get textures directly from RAM into the card, then
> that would allow the display-kgi to benchmark WAY above all other targets
> on all the Put* functions.

This is indeed an "unexplored" issue (assuming I understand you
correctly): using in-main-memory textures. I anticipate a big question
there: should the in-kernel driver control *all* (main) memory
allocations *related* to its graphics engine?
 It seems this is needed (DMA-capable areas, locking, resource limits,
etc.). And it may be where KGI currently lacks some features (because it
primarily targeted on-card hw resources).
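
Just to fix ideas, here is what such a kernel-controlled allocation could
look like from userspace. The ioctl, the structure and the flags are
entirely invented for the sake of the argument; the point is only that
the driver picks and pins DMA-capable pages and accounts them against
some limit:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Invented interface: the in-kernel driver, not the application,
 * allocates the texture memory, so it can guarantee DMA-capable,
 * pinned pages and enforce per-process resource limits. */
struct kgi_mem_req {
    uint32_t size;    /* in:  requested size in bytes            */
    uint32_t flags;   /* in:  e.g. KGI_MEM_DMA | KGI_MEM_PINNED  */
    uint64_t offset;  /* out: mmap() offset of the allocation    */
};

#define KGI_MEM_DMA      0x1u
#define KGI_MEM_PINNED   0x2u
#define KGI_IOC_MEMALLOC _IOWR('K', 0x42, struct kgi_mem_req)

void *alloc_texture(int fd, uint32_t size)
{
    struct kgi_mem_req req = {
        .size  = size,
        .flags = KGI_MEM_DMA | KGI_MEM_PINNED,
    };

    if (ioctl(fd, KGI_IOC_MEMALLOC, &req) < 0)
        return NULL;

    /* The engine can later fetch the texture straight from these
     * pages, without any memcpy through on-card memory. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, (off_t)req.offset);
    return (p == MAP_FAILED) ? NULL : p;
}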

> > that once the graphics engine is 100% busy (with drawing) and the CPU is
> > <10% busy feeding the pipeline, the exact way that pipeline is filled
> > won't matter much to application developers. No? (*I* do *not* say that
> > it does not matter!)
> 
> Well, let's see if that 10% number happens in reality on anything
> short of an Athlon.  If it does, then great :-)

These figures are of the same order as those I observed in my
configuration: a G400 on PCI driven by a K6-III at 450MHz, compiled with
'-g -O'. But obviously, they are just an indication and a lot of
uncertainty remains (e.g. the G400 is not so fast an engine, and it may
have been operating at a reduced speed due to incomplete configuration).
 Keep on doubting (we need to be careful). But this is the kind of thing
that initial experiments seem to indicate: the
fast_{map,remap}_page_range() functions of KGI apparently killed the main
bottleneck...

Rodolphe
