Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver

Martin Sperl Mon, 04 Nov 2013 09:33:48 -0800

Hi!

(after some emails that went offline)
I have created the "prepared" message interface for the spi-bcm2835dma driver
on the 3.10 kernel I am working off right now. 
(so I had to do a bit of "linking" trickery)


So here is a link to an example showing how much prepared spi_messages
can really improve SPI throughput - even without changing anything
besides preparing the messages for direct DMA use.

http://www.raspberrypi.org/phpBB3/viewtopic.php?f=44&t=19489&p=448328#p448328
The link goes to the RPI forum and includes the measurements plus some images
showing both cases (configured via a module parameter for now)

Quick summary:
For my test use case, just enabling "prepared messages" reduced the
time for a "simple" transfer from 390us (without prepare) to 230us (with
prepare), and the driver is still using the threaded message_pump.

OK - the "default" spi-bcm2835.c driver currently in mainline takes 245us
for the same thing, but it runs at 130k interrupts/s and 95% system load,
which leaves hardly any room to do anything else.
So getting this down to 80% system load plus shorter responses is already
quite an improvement.

It is a drawback that the DMA driver needs more overhead for the unprepared
spi_message case. But if a performance-critical driver does not make use of
prepared messages yet and there is a performance issue, then it needs to be
modified to prepare its messages during setup.

Anyway, if a driver cares about throughput, it had better avoid allocating
new memory to create (and release) an SPI message in the interrupt handler
or callback in the first place.
Then adding the single extra statement to prepare and unprepare the message
comes cheap.
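
As a rough illustration of why preparing once during setup pays off, here is
a small userspace model. It is only a sketch: all example_* names are
hypothetical and not part of any kernel API, and "building the DMA chain" is
reduced to a counter.

```c
#include <assert.h>

/* Hypothetical stand-in for an SPI message; not the kernel struct. */
struct example_message {
        int prepared;           /* non-zero once the DMA chain is cached */
        unsigned int builds;    /* how often the DMA chain had to be built */
};

/* Build the (imaginary) DMA chain once and cache it in the message. */
static int example_prepare_message(struct example_message *msg)
{
        if (!msg->prepared) {
                msg->builds++;
                msg->prepared = 1;
        }
        return 0;
}

/* Drop the cached chain again, e.g. at driver teardown. */
static int example_unprepare_message(struct example_message *msg)
{
        msg->prepared = 0;
        return 0;
}

/* Hot path: an unprepared message pays for a chain build on every call. */
static int example_submit(struct example_message *msg)
{
        if (!msg->prepared)
                msg->builds++;
        return 0;
}

/* Submit the same message n times and report how many builds that cost. */
static unsigned int example_builds_for(int n, int prepare_first)
{
        struct example_message msg = { 0, 0 };
        int i;

        if (prepare_first)
                example_prepare_message(&msg);
        for (i = 0; i < n; i++)
                example_submit(&msg);
        if (prepare_first)
                example_unprepare_message(&msg);
        return msg.builds;
}
```

In this model, submitting the same message 1000 times costs 1000 chain builds
without prepare and a single build with prepare.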

There is one hard assumption:
each and every xfer has to be configured with a DMA address for source
and destination (unless NULL).
Obviously the message is "static" as soon as prepare has been called.

Otherwise the overhead of iterating over the messages and calling
dma_(un)map_single becomes the limiting factor, and the difference in code
(compared to creating the whole chain from scratch) is minimal apart from the
additional memory allocations - we have to "walk", we have to parse,
...

(also think about how it works if the memory itself is fragmented in bus
address space - depending on the alignment of the data a different number of
DMA transfers would be needed - thus it seems quite impractical to implement)
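
To make the alignment point concrete, here is a small sketch that counts the
worst-case number of DMA transfers for a buffer, assuming 4096-byte pages and
assuming no two pages are contiguous on the bus. The name
example_dma_segments and the page size are my assumptions for illustration
only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_PAGE_SIZE 4096u  /* assumed page size, for illustration */

/*
 * Worst-case number of DMA segments for a buffer whose pages are not
 * contiguous on the bus: one segment per page the buffer touches.
 */
static size_t example_dma_segments(uintptr_t addr, size_t len)
{
        uintptr_t first, last;

        if (len == 0)
                return 0;
        first = addr / EXAMPLE_PAGE_SIZE;
        last = (addr + len - 1) / EXAMPLE_PAGE_SIZE;
        return (size_t)(last - first + 1);
}
```

A 32-byte buffer that sits inside one page needs a single transfer, but the
same 32 bytes starting 16 bytes before a page boundary need two, so the length
of the DMA chain is not known from the transfer length alone.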

So from my experience I would recommend adding something like this to spi.h,
so that it can be used for real and not in the export-linked way I had to use
for this proof of concept.

static int bcm2835dma_spi_prepare_message(struct spi_device *spi,
        struct spi_message *msg)
{
        /* the hook is optional: no hook means nothing to prepare */
        if (spi->prepare_message)
                return spi->prepare_message(spi, msg);
        return 0;
}

static int bcm2835dma_spi_unprepare_message(struct spi_device *spi,
        struct spi_message *msg)
{
        /* the hook is optional: no hook means nothing to unprepare */
        if (spi->unprepare_message)
                return spi->unprepare_message(spi, msg);
        return 0;
}

and finally also some management functionality for "finding" those prepared
messages - better to implement something once and make it opaque than to have
every driver implement its own thing and then have to track that.
My mcp2515a driver is already allocating something like 14 prepared messages,
so walking that list is not that expensive yet, but at some point a binary
tree might be better from the performance perspective.


struct spi_prepared_message {
        /* the list in which we store the message */
        struct list_head prepared_list;
        /* the identification data for matching */
        struct spi_device *spi;
        struct spi_message *message;
};

something like this:
struct list_head prepared_messages_list;
plus a spinlock protecting the above in spi_master

and the following prototypes:
static struct spi_prepared_message *spi_find_prepared_message_nolock(
        struct spi_device *spi,
        struct spi_message *message);
static struct spi_prepared_message *spi_find_prepared_message(
        struct spi_device *spi,
        struct spi_message *message);
static int spi_add_prepared_message(
        struct spi_prepared_message *prepared);
static struct spi_prepared_message *spi_remove_prepared_message(
        struct spi_device *spi,
        struct spi_message *message);

I got the above prototypes implemented (but with the data sitting in the 
master-private data structure, so I am not sure if it is worth sharing them)
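
For illustration only, here is a userspace model of how such a list-based
lookup could behave. This is not the code from my driver: the kernel types are
replaced by empty stand-ins, struct list_head by a plain "next" pointer, the
spinlock is left out, and the -1 return for a duplicate entry is just an
assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel types; only their addresses are used as keys. */
struct spi_device { int unused; };
struct spi_message { int unused; };

/* Model of the proposed entry, singly linked instead of struct list_head. */
struct spi_prepared_message {
        struct spi_prepared_message *next;
        struct spi_device *spi;
        struct spi_message *message;
};

/* Models the list head that would live in spi_master (no locking here). */
static struct spi_prepared_message *prepared_messages;

/* Linear search for the entry matching both device and message. */
static struct spi_prepared_message *spi_find_prepared_message_nolock(
        struct spi_device *spi, struct spi_message *message)
{
        struct spi_prepared_message *p;

        for (p = prepared_messages; p; p = p->next)
                if (p->spi == spi && p->message == message)
                        return p;
        return NULL;
}

/* Push a new entry at the head; refuse duplicates (assumed behaviour). */
static int spi_add_prepared_message(struct spi_prepared_message *prepared)
{
        if (spi_find_prepared_message_nolock(prepared->spi,
                                             prepared->message))
                return -1;
        prepared->next = prepared_messages;
        prepared_messages = prepared;
        return 0;
}

/* Unlink and return the matching entry, or NULL if there is none. */
static struct spi_prepared_message *spi_remove_prepared_message(
        struct spi_device *spi, struct spi_message *message)
{
        struct spi_prepared_message **pp;

        for (pp = &prepared_messages; *pp; pp = &(*pp)->next) {
                struct spi_prepared_message *p = *pp;

                if (p->spi == spi && p->message == message) {
                        *pp = p->next;
                        p->next = NULL;
                        return p;
                }
        }
        return NULL;
}
```

Every lookup walks the whole list in the worst case, which is fine for a
handful of entries; a tree keyed on the message pointer would only start to
matter with many more prepared messages per master.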

Maybe we can get those into the "next" tree so that it can get pulled into 3.13?

Thanks,
         Martin

P.S.:
Note: I have not switched the driver to the dmaengine interface.
There are too many things I would need to learn to make that switch now.
Also (as far as Mark said during our offline discussions) there seem to be some
gaps that would require extending the DMA engine, which would make it hard to
get the driver working at first. Finally I fear (from my limited knowledge)
that scheduling via the DMA engine itself would require a prepare, and that I
would also have to put up with double allocations to link everything in place,
further reducing the "thruput" for the case of non-prepared drivers.

The driver itself can be found at: https://github.com/msperl/spi-bcm2835


