I've been running some tests on bus mastering in the Mach64 chip, as I 
mentioned yesterday on IRC. Since this was discussed pretty late, I would 
like to briefly document for the other DRI developers what I'm trying to do:


    Since there is no way of caching DMA buffers on the Mach64 chip (it is 
done via the CCE on Rage128, or the primary DMA buffer on Matrox) nor of 
getting a notification when it's done, apparently the only way left would 
be to poll (either when new buffers are received by the DRM from the 
client, or on a periodic interrupt such as VBLANK). This means the engine 
could be stopped quite often, yielding lower performance.

    A last-resort alternative is to modify the descriptor table, which 
holds pointers to 4k DMA buffer blocks, and add to it _while_ the engine is 
running, trying to resolve the resulting race condition with buffer 
aging. The following tests try to assess whether that scheme is 
feasible.


I've had very good results in the tests so far. I want to tell you so 
that you know what we can depend upon at this point:

1. The card expects no alignment on the descriptor table besides the 
16-byte alignment due to the size of each descriptor entry. That is, the 
BM_GUI_TABLE register makes use of the full 31:4 bits as in the specs. Note 
that the whole table chunk has to be aligned with the 
CIRCULAR_BUF_SIZE@BM_GUI_TABLE but there is no restriction on where in 
that table we tell the card to read.

2. The card expects no alignment on the data buffers. That is, the 
BM_SYSTEM_MEM_ADDR register makes use of the full 31:0 bits as in the 
specs.

    Note: Later on we should see if we can use scatter-gather memory for 
the buffers to be able to allocate greater amounts of DMA space without 
straining the kernel VM.

3. We can mess with the descriptor table after the bus mastering operation 
has begun(!)

(The tests used are attached.)
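
As an illustration of points 1 and 2, here is a minimal C sketch of a 
descriptor entry and the only alignment check the tests suggest is needed. 
Only the 16-byte entry size and the BM_SYSTEM_MEM_ADDR and BM_COMMAND names 
come from the text above; the field order, the other two fields and the 
helper names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of one descriptor entry. The field order and the
 * first and last fields are assumptions; only the total size (16 bytes)
 * is taken from the test results. */
struct bm_descriptor {
    uint32_t frame_buf_offset; /* assumed: destination offset on the card */
    uint32_t system_mem_addr;  /* BM_SYSTEM_MEM_ADDR: full 31:0, no alignment */
    uint32_t command;          /* BM_COMMAND: byte count plus flags */
    uint32_t reserved;         /* assumed padding */
};

static_assert(sizeof(struct bm_descriptor) == 16, "entry must be 16 bytes");

/* Bus address of entry `index` in a table that starts at `table_base`. */
static uint32_t entry_addr(uint32_t table_base, uint32_t index)
{
    return table_base + index * (uint32_t)sizeof(struct bm_descriptor);
}

/* BM_GUI_TABLE uses bits 31:4, so an entry address only needs to be
 * 16-byte aligned. */
static int table_addr_ok(uint32_t entry_bus_addr)
{
    return (entry_bus_addr & 0xFu) == 0;
}
```

Any entry of a 16-byte-aligned table passes the check, which is what lets 
us point the card at an arbitrary entry inside the table.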


Although this seems promising, I still have to work out more details:

   a) check that there is no other buffering besides the FIFO going on. 
This can only be checked by making a full proof-of-concept example and 
checking that nothing goes wrong.

   b) see if the descriptor table can be made into a circular buffer. The 
specs mention something about this but they aren't clear. They say the 
circular buffer is in the card memory, but if the card was copying the 
whole buffer then test 3 couldn't be happening...

   c) instead of using a GUI register it's probably better to use 
END_OF_LIST_STATUS@BM_COMMAND to see if the card is processing the last 
entry of the descriptor table. If that bit is set then there is no point 
in adding to the table, as the engine will surely stop. We'll still need 
the buffer aging register to resolve the race condition if the engine 
stops while we change the table.
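
The check described in c) could be sketched in C roughly as follows. This 
is a simulation under stated assumptions, not driver code: the DESC_EOL bit 
position, the struct layouts and all function names are hypothetical, and a 
real implementation would also need memory barriers and actual register 
reads.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_EOL (1u << 31) /* hypothetical end-of-list flag in the command */

struct desc { uint32_t fb_offset, sys_addr, command, reserved; };

/* Hypothetical snapshot of the state the driver can read from the card. */
struct engine_view {
    int idle;               /* engine has stopped */
    int end_of_list_status; /* END_OF_LIST_STATUS@BM_COMMAND */
    uint32_t last_age;      /* buffer aging register: last completed buffer */
};

/* If the engine is already on the last entry it will stop anyway. */
static int worth_appending(const struct engine_view *v)
{
    return !v->end_of_list_status;
}

/* Append after `tail` and move the end-of-list mark; returns the new tail. */
static uint32_t append_entry(struct desc *table, uint32_t capacity,
                             uint32_t tail, struct desc entry)
{
    assert(tail + 1 < capacity);
    entry.command |= DESC_EOL;
    table[tail + 1] = entry;           /* write the new entry completely */
    table[tail].command &= ~DESC_EOL;  /* only then unmark the old tail */
    return tail + 1;
}

/* After appending: the engine stopped without completing the new buffer,
 * so it must have hit the old end of list before we changed it. */
static int lost_race(const struct engine_view *v, uint32_t new_age)
{
    return v->idle && v->last_age != new_age;
}
```

When lost_race() reports a stop, the driver would have to restart the 
engine at the new entry; otherwise the append went through without a stall.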

Failing to verify a) is the worst that can happen, as it would make it 
impossible to solve the race condition.

If b) is not true we can still get advantages from this scheme. The 
maximum descriptor table size is 128 KB, i.e., 8K entries, which is 32 MB 
of buffer memory. That means that we would be able to fill several frames 
before we need to do a wait_for_idle. Note that we don't really need 32 MB, 
as we can reuse the buffers in the process.
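
The arithmetic behind those numbers, as a small C sketch (the constant and 
function names are made up for the example; the sizes are the ones quoted 
above):

```c
#include <assert.h>
#include <stdint.h>

enum {
    TABLE_BYTES  = 128 * 1024, /* maximum descriptor table size */
    ENTRY_BYTES  = 16,         /* size of one descriptor entry */
    BUFFER_BYTES = 4 * 1024    /* one DMA buffer block per entry */
};

static_assert(TABLE_BYTES % ENTRY_BYTES == 0, "table holds whole entries");

/* Number of descriptor entries that fit in a full-size table. */
static uint32_t max_entries(void)
{
    return TABLE_BYTES / ENTRY_BYTES;
}

/* Buffer memory addressable if every entry points to a distinct block. */
static uint64_t max_buffer_bytes(void)
{
    return (uint64_t)max_entries() * BUFFER_BYTES;
}
```

max_entries() gives 8192 and max_buffer_bytes() gives 32 MB, matching the 
figures above.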

I hope that tomorrow I can give more info regarding b) and c).


Leif, to be able to do a) I would like to build on the buffer aging code 
you have already written. There is no need to commit anything as I 
don't want to update my tree now - could you just send me a diff of your 
current tree as is so that I can see how you did it?



José Fonseca

Attachment: mach64-dma-tests.tar.bz2
Description: BZip2 compressed data
