I've been running some tests on bus mastering in the Mach64 chip, as I mentioned yesterday on IRC. Since this was discussed pretty late, I would like to briefly document for the other DRI developers what I'm trying to do:
Since there is no way of caching DMA buffers on the Mach64 chip (it is done via the CCE on the Rage128, or the primary DMA buffer on Matrox), nor of getting a notification when a buffer is done, apparently the only way left would be to poll (either when new buffers are received by the DRM from the client, or on a periodic interrupt such as VBLANK). This means the engine could be stopped quite often, yielding lower performance.

A last-resort alternative is to modify the descriptor table, which holds pointers to 4 KB DMA buffer blocks, and add to it _while_ the engine is running, trying to resolve the resulting race condition with buffer aging. The following tests try to assess whether that scheme is feasible.

I've come up with very good results in the tests so far. I want to tell you so that you know what we can depend upon so far:

1. The card expects no alignment on the descriptor table besides the 16 bytes due to the size of each descriptor entry. That is, the BM_GUI_TABLE register makes use of the full 31:4 bits, as in the specs. Note that the whole table chunk has to be aligned according to CIRCULAR_BUF_SIZE@BM_GUI_TABLE, but there is no restriction on where in that table we tell the card to read.

2. The card expects no alignment on the data buffers. That is, the BM_SYSTEM_MEM_ADDR register makes use of the full 31:0 bits, as in the specs. Note: later on we should see if we can use scatter-gather memory for the buffers, to be able to allocate greater amounts of DMA space without straining the kernel VM.

3. We can mess with the descriptor table after the bus mastering operation has begun(!)

(The tests used are attached.)

Although this seems promising, I still have to work out more details:

a) Check that there is no other buffering going on besides the FIFO. This can only be checked by making a full proof-of-concept example and checking that nothing goes wrong.

b) See if the descriptor table can be made into a circular buffer. The specs mention something about this, but they aren't clear. They say the circular buffer is in the card memory, but if the card were copying the whole buffer then test 3 couldn't be happening...

c) Instead of using a GUI register, it's probably better to use END_OF_LIST_STATUS@BM_COMMAND to see if the card is processing the last entry of the descriptor table. If that bit is set then there is no point in adding to the table, as the engine will surely stop. We'll still need the buffer aging register to resolve the race condition if the engine stops while we change the table.

If a) does not hold, that is the worst that can happen, as it makes it impossible to resolve the race condition. If b) is not true we can still get advantages from this scheme. The maximum descriptor table size is 128 KB, i.e., 8K entries of 16 bytes each, which at 4 KB per buffer block is 32 MB of buffer memory. That means that we would be able to fill several frames before we need to do a wait_for_idle. Note that we don't really need 32 MB, as we can reuse the buffers in the process.

I hope that tomorrow I can give more info regarding b) and c).

Leif, to be able to do a) I would like to base my work on the buffer aging code you have already written. There is no need to commit anything, as I don't want to update my tree now. Could you just send me a diff of your current tree as is, so that I can see how you did it?

José Fonseca
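
For reference, the sizing and alignment figures quoted above can be checked with a small C sketch. This is only an illustration under stated assumptions: the constants come from the numbers in this message (16-byte descriptor entries, 4 KB buffer blocks, 128 KB maximum table), while the macro and function names are hypothetical and do not come from the DRM sources or the specs.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in the message; the names are hypothetical. */
#define DESC_ENTRY_SIZE  16u             /* bytes per descriptor entry   */
#define DMA_BLOCK_SIZE   4096u           /* bytes per DMA buffer block   */
#define MAX_TABLE_SIZE   (128u * 1024u)  /* maximum descriptor table size */

/* Number of descriptor entries that fit in a table of table_bytes. */
static uint32_t table_entries(uint32_t table_bytes)
{
    return table_bytes / DESC_ENTRY_SIZE;
}

/* Total buffer memory a table of table_bytes can point at. */
static uint64_t table_buffer_bytes(uint32_t table_bytes)
{
    return (uint64_t)table_entries(table_bytes) * DMA_BLOCK_SIZE;
}

/* A descriptor address only needs 16-byte alignment (bits 3:0 clear),
 * matching the 31:4 address field described for BM_GUI_TABLE. */
static int desc_addr_is_valid(uint32_t addr)
{
    return (addr & (DESC_ENTRY_SIZE - 1u)) == 0;
}

/* Address of entry number index inside a table starting at table_base. */
static uint32_t desc_entry_addr(uint32_t table_base, uint32_t index)
{
    assert(desc_addr_is_valid(table_base));
    assert(index < table_entries(MAX_TABLE_SIZE));
    return table_base + index * DESC_ENTRY_SIZE;
}
```

With these definitions, table_entries(MAX_TABLE_SIZE) gives the 8K entries and table_buffer_bytes(MAX_TABLE_SIZE) gives the 32 MB mentioned above.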
mach64-dma-tests.tar.bz2
Description: BZip2 compressed data