I've been running some tests on bus mastering in the Mach64 chip, as I mentioned yesterday on IRC. Since this was discussed pretty late, I would like to briefly document for the other DRI developers what I'm trying to do:
Since there is no way of caching DMA buffers on the Mach64 chip (it is
done via the CCE on the Rage128, or the primary DMA buffer on Matrox), nor
of getting a notification when a buffer is done, apparently the only way
left would be to poll (either when new buffers are received by the DRM
from the client, or on a periodic interrupt such as VBLANK). This means
the engine could be stopped quite often, yielding lower performance.
A last-resort alternative is to modify the descriptor table, which
holds pointers to 4 KB DMA buffer blocks, and add to it _while_ the engine
is running, resolving the resulting race condition with buffer
aging. The following tests try to assess whether that scheme is
feasible.
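To make the buffer aging idea concrete, here is a minimal sketch of the comparison it relies on. It assumes each buffer is stamped with an increasing 32-bit age and that the last completed age can be read back from the card; the function name is hypothetical and nothing here is taken from the existing driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if a buffer stamped with `buf_age` has
 * already been consumed, given `done_age`, the last age reported as done.
 * The signed difference keeps the comparison valid when the 32-bit counter
 * wraps around (assumes two's-complement conversion, as on all usual
 * targets).
 */
static bool buffer_is_done(uint32_t buf_age, uint32_t done_age)
{
    return (int32_t)(done_age - buf_age) >= 0;
}
```

A buffer for which this returns false must not be reused or removed from the table yet.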
I've come up with very good results in the tests so far. I want to tell
you about them so that you know what we can depend upon:
1. The card expects no alignment on the descriptor table besides the
16 bytes due to the size of each descriptor entry. That is, the
BM_GUI_TABLE register makes use of the full 31:4 bits, as in the specs.
Note that the whole table chunk has to be aligned to the
CIRCULAR_BUF_SIZE@BM_GUI_TABLE, but there is no restriction on where in
that table we tell the card to start reading.
2. The card expects no alignment on the data buffers. That is, the
BM_SYSTEM_MEM_ADDR register makes use of the full 31:0 bits as in the
specs.
Note: Later on we should see if we can use scatter-gather memory for
the buffers to be able to allocate greater amounts of DMA space without
straining the kernel VM.
3. We can mess with the descriptor table after the bus mastering operation
has begun(!)
(The tests used are attached.)
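The alignment rule from result 1 can be written down as a small check. This is only a sketch: it assumes that "aligned with the CIRCULAR_BUF_SIZE" means the table base is aligned to the table's own power-of-two size, and the helper names and addresses are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_ENTRY_BYTES 16u  /* size of one descriptor entry (from result 1) */

/* Hypothetical helper: true if v is a non-zero power of two. */
static bool is_pow2(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical check of a start address for the descriptor table:
 *  - the table chunk must be aligned to its size (assumption, see above),
 *  - the start address must lie inside the table,
 *  - the start address only needs 16-byte alignment (bits 31:4).
 */
static bool valid_table_start(uint32_t table_base, uint32_t table_size,
                              uint32_t start)
{
    if (!is_pow2(table_size))
        return false;
    if (table_base & (table_size - 1))
        return false;
    if (start < table_base || start - table_base >= table_size)
        return false;
    return (start & (DESC_ENTRY_BYTES - 1)) == 0;
}
```

For example, with a 16 KB table at 0x00100000, a start address of 0x00100030 passes, while 0x00100008 does not.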
Although this seems promising, I still have to work out more details:
a) check that there is no other buffering besides the FIFO going on. This
can only be checked by writing a full proof-of-concept example and
checking that nothing goes wrong.
b) see if the descriptor table can be made into a circular buffer. The
specs mention something about this but they aren't clear. They say the
circular buffer is in the card memory, but if the card were copying the
whole table then result 3 could not have happened...
c) instead of using a GUI register it's probably better to use
END_OF_LIST_STATUS@BM_COMMAND to see if the card is processing the last
entry of the descriptor table. If that bit is set then there is no point
in adding to the table, as the engine will surely stop. We'll still need
the buffer aging register to resolve the race condition where the engine
stops while we change the table.
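The decision described in c) could look roughly like the sketch below. It is not driver code: the descriptor layout (four 32-bit words), the position of the end-of-list flag in the command word, and all names are assumptions for illustration, and the value of BM_COMMAND is passed in by the caller instead of being read from the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DMA_EOL (1u << 31)  /* assumed position of the end-of-list flag */

struct bm_desc {            /* assumed 16-byte descriptor layout */
    uint32_t frame_buf_offset;
    uint32_t sys_mem_addr;
    uint32_t command;       /* byte count plus flags */
    uint32_t reserved;
};

/*
 * Try to chain a new buffer after table[*tail] while the engine runs.
 * `bm_command_now` is the value the caller has just read from BM_COMMAND.
 * Returns false, leaving the table untouched, if the engine is already on
 * the last entry or the (non-circular) table is full.
 */
static bool try_append(struct bm_desc *table, uint32_t nentries,
                       uint32_t *tail, uint32_t buf_addr, uint32_t nbytes,
                       uint32_t bm_command_now)
{
    assert(*tail < nentries);
    if (bm_command_now & DMA_EOL)
        return false;                 /* engine will stop; do not append */
    if (*tail + 1 >= nentries)
        return false;                 /* no free entry left */

    struct bm_desc *next = &table[*tail + 1];
    next->frame_buf_offset = 0;       /* placeholder target offset */
    next->sys_mem_addr = buf_addr;
    next->command = (nbytes & ~DMA_EOL) | DMA_EOL;  /* new end of list */
    next->reserved = 0;

    /* Only after the new entry is complete, un-terminate the old tail.
     * A real driver would need a write barrier before this store. */
    table[*tail].command &= ~DMA_EOL;
    (*tail)++;
    return true;
}
```

Even when this returns true, the engine may have reached the old tail between the register read and the final store, so the caller would still have to compare the buffer aging register afterwards to find out whether the new entry was actually picked up.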
If a) cannot be verified, that is the worst that can happen, as it would
make it impossible to solve the race condition.
If b) is not true we can still get advantages from this scheme. The
maximum descriptor table size is 128 KB, i.e., 8K entries of 16 bytes,
which cover 32 MB of buffer memory at 4 KB per entry. That means that we
would be able to fill several frames before we need to do a
wait_for_idle. Note that we don't really need 32 MB, as we can reuse the
buffers in the process.
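The arithmetic behind those figures, spelled out as a small sketch (the constant and function names are mine; the values are the ones quoted above):

```c
#include <assert.h>
#include <stdint.h>

enum {
    DESC_ENTRY_BYTES     = 16,          /* one descriptor entry */
    DESC_TABLE_MAX_BYTES = 128 * 1024,  /* maximum descriptor table size */
    DMA_BLOCK_BYTES      = 4 * 1024     /* buffer block addressed per entry */
};

/* Number of descriptor entries that fit in the largest table. */
static uint32_t max_table_entries(void)
{
    return DESC_TABLE_MAX_BYTES / DESC_ENTRY_BYTES;
}

/* Buffer memory addressable by a completely filled table, in bytes. */
static uint32_t max_buffer_bytes(void)
{
    return max_table_entries() * DMA_BLOCK_BYTES;
}
```

That is 128 KB / 16 bytes = 8192 entries, and 8192 entries x 4 KB = 32 MB.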
I hope that tomorrow I can give more info regarding b) and c).
Leif, to be able to do a) I would like to base my work on the buffer aging
code you have already written. There is no need to commit anything, as I
don't want to update my tree now. Could you just send me a diff of your
current tree as it is, so that I can see how you did it?
José Fonseca
mach64-dma-tests.tar.bz2
Description: BZip2 compressed data
