Hi again,

I made a mistake in my double buffering implementation.
I assumed dma_unmap did not do any cache operations. Well, it does.
Due to L2 read prefetch, the L2 cache needs to be invalidated at dma_unmap.

I made a quick test to see how much throughput would improve if
dma_unmap could be run in parallel.
In this run dma_unmap is removed altogether.
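
For reference, the call that this run skips is roughly the following
(a minimal sketch, the helper name is made up; dma_unmap_sg() on a
read buffer is where that L2 invalidate is done):

#include <linux/dma-mapping.h>
#include <linux/mmc/core.h>
#include <linux/mmc/host.h>

/*
 * Hypothetical helper, not the actual hack: unmap the scatterlist of a
 * finished MMC data transfer.  For reads (DMA_FROM_DEVICE) this is
 * where the L2 lines pulled in by speculative prefetch get invalidated,
 * i.e. the cost that the NO-UNMAP run leaves out.
 */
static void host_dma_unmap_data(struct mmc_host *mmc, struct mmc_data *data)
{
        enum dma_data_direction dir = (data->flags & MMC_DATA_READ) ?
                                      DMA_FROM_DEVICE : DMA_TO_DEVICE;

        dma_unmap_sg(mmc_dev(mmc), data->sg, data->sg_len, dir);
}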

The figures for read then become:
* 7-16 % gain with double buffering in the ideal case, closing in on
the same performance as PIO.

Relative diff: MMC-VANILLA-DMA-LOG -> MMC-MMCI-2-BUF-DMA-LOG-NO-UNMAP
(throughput rows are relative diffs, cpu rows are absolute diffs)
                                                        random  random
        KB      reclen  write   rewrite read    reread  read    write
        51200   4       +0%     +0%     +7%     +8%     +2%     +0%
        cpu:            +0.0    +0.0    +0.7    +0.7    -0.0    +0.0

        51200   8       +0%     +0%     +10%    +10%    +6%     +0%
        cpu:            -0.1    +0.1    +0.6    +0.9    +0.3    +0.0

        51200   16      +0%     +0%     +11%    +11%    +8%     +0%
        cpu:            -0.0    -0.1    +0.9    +1.0    +0.3    +0.0

        51200   32      +0%     +0%     +13%    +13%    +10%    +0%
        cpu:            -0.1    +0.0    +1.0    +0.5    +0.8    +0.0

        51200   64      +0%     +0%     +13%    +13%    +12%    +1%
        cpu:            +0.0    +0.0    +0.4    +1.0    +0.9    +0.1

        51200   128     +0%     +5%     +14%    +14%    +14%    +1%
        cpu:            +0.0    +0.2    +1.0    +0.9    +1.0    +0.0

        51200   256     +0%     +2%     +13%    +13%    +13%    +1%
        cpu:            +0.0    +0.1    +0.9    +0.3    +1.6    -0.1

        51200   512     +0%     +1%     +14%    +14%    +14%    +8%
        cpu:            -0.0    +0.3    +2.5    +1.8    +2.4    +0.3

        51200   1024    +0%     +2%     +14%    +15%    +15%    +0%
        cpu:            +0.0    +0.3    +1.3    +1.4    +1.3    +0.1

        51200   2048    +2%     +2%     +15%    +15%    +15%    +4%
        cpu:            +0.3    +0.1    +1.6    +2.1    +0.9    +0.3

        51200   4096    +5%     +3%     +15%    +16%    +16%    +5%
        cpu:            +0.0    +0.4    +1.1    +1.7    +1.7    +0.5

        51200   8192    +5%     +3%     +16%    +16%    +16%    +2%
        cpu:            +0.0    +0.4    +2.0    +1.3    +1.8    +0.1

        51200   16384   +1%     +1%     +16%    +16%    +16%    +4%
        cpu:            +0.1    -0.2    +2.3    +1.7    +2.6    +0.2

I will work on adding unmap to double buffering next week.
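
The rough shape I have in mind is sketched below. Every helper name
here is invented and it only shows the intended ordering, not working
code: the unmap of a finished request is issued while the following
transfer is already running, so neither the map nor the unmap cache
maintenance sits between two transfers.

/* Sketch only, all helpers below are invented for illustration. */
static void mmc_2buf_issue_loop(struct mmc_queue *mq)
{
        struct mmc_request *cur = NULL, *next;

        while ((next = mmc_fetch_request(mq)) != NULL) {
                /* map next (cache clean/inv) while cur is still on the DMAC */
                mmc_dma_map(next);
                if (cur) {
                        mmc_wait_for_dma(cur);
                        /* start next right away, the card sees a minimal gap */
                        mmc_dma_start(next);
                        /*
                         * unmap cur (L2 invalidate) while next is running;
                         * this is the part still missing from my hack.
                         */
                        mmc_dma_unmap(cur);
                } else {
                        mmc_dma_start(next);
                }
                cur = next;
        }

        /* drain the last request */
        if (cur) {
                mmc_wait_for_dma(cur);
                mmc_dma_unmap(cur);
        }
}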

/Per

On 16 December 2010 15:15, Per Forlin <per.for...@linaro.org> wrote:
> Hi,
>
> I am working on the blueprint
> https://blueprints.launchpad.net/linux-linaro/+spec/other-storage-performance-emmc.
> Currently I am investigating performance for DMA vs PIO on eMMC.
>
> Pros and cons for DMA on MMC
> + Offloads CPU
> + Fewer interrupts, a single interrupt per transfer compared to
> hundreds or even thousands
> + Power savings, DMA consumes less power than the CPU
> - Less bandwidth / throughput compared to PIO-CPU
>
> The reason for introducing double buffering in the MMC framework is to
> address the throughput issue for DMA on MMC.
> The assumption is that the CPU and DMA have higher throughput than the
> MMC / SD-card.
> My hypothesis is that the difference in performance between PIO mode
> and DMA mode for MMC is due to the latency of preparing a DMA job.
> If the next DMA job could be prepared while the current job is ongoing,
> this latency would be reduced. The biggest part of preparing a DMA job
> is cache maintenance.
> In my case I run on U5500 (mach-ux500) which has both L1 and L2
> caches. The host mmc driver in use is the mmci driver (PL180).
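
(A minimal sketch of what "preparing the next DMA job" amounts to; the
helper name is made up. dma_map_sg() is where the cache clean, and
invalidate for reads, is performed, so calling it for request N+1 while
request N is still on the DMAC takes that cost off the critical path.)

#include <linux/dma-mapping.h>
#include <linux/mmc/core.h>
#include <linux/mmc/host.h>

/*
 * Hypothetical helper, not the real framework code: map the scatterlist
 * of the next request while the DMAC is still busy with the current
 * one.  The cache maintenance done by dma_map_sg() then overlaps with
 * the ongoing transfer instead of delaying the next one.
 */
static int host_dma_prepare_next(struct mmc_host *mmc, struct mmc_data *next)
{
        enum dma_data_direction dir = (next->flags & MMC_DATA_READ) ?
                                      DMA_FROM_DEVICE : DMA_TO_DEVICE;

        return dma_map_sg(mmc_dev(mmc), next->sg, next->sg_len, dir);
}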
>
> I have done a hack in both the MMC framework and mmci in order to make
> a proof of concept. I have run IOZone to get measurements that support
> my case.
> The next step, if the results are promising, will be to clean up my
> work and send out patches for review.
>
> The DMAC in ux500 supports two modes, LOG and PHY.
> LOG - Many logical channels are multiplexed on top of one physical channel
> PHY - Only one channel per physical channel
>
> DMA modes LOG and PHY have different latency, both HW- and SW-wise. One
> could almost treat them as "two different DMACs". To get a wider test
> scope I have tested using both modes.
>
> Summary of the results.
> * It is optional for the mmc host driver to utilize the 2-buf
> support. 2-buf in the framework requires no change in the host
> drivers (see the sketch after this list).
> * IOZone shows no performance hit on existing drivers* when adding
> 2-buf to the framework but not to the host driver.
>  (* So far I have only tested one driver)
> * The performance gain for DMA using 2-buf is probably proportional to
> the cache maintenance time.
>  The faster the card is, the more significant the cache maintenance
> part becomes, and vice versa.
> * For U5500 with 2-buf, the performance for DMA is:
> Throughput: DMA vanilla vs DMA 2-buf
>  * read +5-10 %
>  * write +0-3 %
> CPU load: CPU vs DMA 2-buf
>  * read large data: minus 10-20 percentage points
>  * read small data: same as PIO
>  * write: same load as PIO (why?)
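
As noted in the first bullet above, the 2-buf hook is meant to be
optional for host drivers. A minimal sketch of that fallback (the ops
struct and names are invented, not the real mmc_host_ops):

#include <linux/mmc/core.h>
#include <linux/mmc/host.h>

/*
 * Illustration only, these names do not exist in the kernel: the
 * framework calls the prepare hook when the host driver provides one
 * and otherwise keeps the old synchronous path, which is why vanilla
 * host drivers are unaffected by adding 2-buf to the framework.
 */
struct mmc_2buf_ops {
        void (*prepare_next)(struct mmc_host *host, struct mmc_request *next);
};

static void mmc_maybe_prepare_next(struct mmc_host *host,
                                   const struct mmc_2buf_ops *ops,
                                   struct mmc_request *next)
{
        if (ops && ops->prepare_next)
                ops->prepare_next(host, next);
        /* else: legacy single-buffer path, nothing changes for the driver */
}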
>
> Here follow two of the measurements from IOZone comparing MMC with and
> without double buffering. The rest you can find in the attached text
> files.
>
> === Performance CPU compared with DMA vanilla kernel ===
> Absolute diff: MMC-VANILLA-CPU -> MMC-VANILLA-DMA-LOG
>                                                        random  random
>        KB      reclen  write   rewrite read    reread  read    write
>        51200   4       -14     -8      -1005   -988    -679    -1
>        cpu:            -0.0    -0.1    -0.8    -0.9    -0.7    +0.0
>
>        51200   8       -35     -34     -1763   -1791   -1327   +0
>        cpu:            +0.0    -0.1    -0.9    -1.2    -0.7    +0.0
>
>        51200   16      +6      -38     -2712   -2728   -2225   +0
>        cpu:            -0.1    -0.0    -1.6    -1.2    -0.7    -0.0
>
>        51200   32      -10     -79     -3640   -3710   -3298   -1
>        cpu:            -0.1    -0.2    -1.2    -1.2    -0.7    -0.0
>
>        51200   64      +31     -16     -4401   -4533   -4212   -1
>        cpu:            -0.2    -0.2    -0.6    -1.2    -1.2    -0.0
>
>        51200   128     +58     -58     -4749   -4776   -4532   -4
>        cpu:            -0.2    -0.0    -1.2    -1.1    -1.2    +0.1
>
>        51200   256     +192    +283    -5343   -5347   -5184   +13
>        cpu:            +0.0    +0.1    -1.2    -0.6    -1.2    +0.0
>
>        51200   512     +232    +470    -4663   -4690   -4588   +171
>        cpu:            +0.1    +0.1    -4.5    -3.9    -3.8    -0.1
>
>        51200   1024    +250    +68     -3151   -3318   -3303   +122
>        cpu:            -0.1    -0.5    -14.0   -13.5   -14.0   -0.1
>
>        51200   2048    +224    +401    -2708   -2601   -2612   +161
>        cpu:            -1.7    -1.3    -18.4   -19.5   -17.8   -0.5
>
>        51200   4096    +194    +417    -2380   -2361   -2520   +242
>        cpu:            -1.3    -1.6    -19.4   -19.9   -19.4   -0.6
>
>        51200   8192    +228    +315    -2279   -2327   -2291   +270
>        cpu:            -1.0    -0.9    -20.8   -20.3   -21.0   -0.6
>
>        51200   16384   +254    +289    -2260   -2232   -2269   +308
>        cpu:            -0.8    -0.8    -20.5   -19.9   -21.5   -0.4
>
> === Performance CPU compared with DMA with MMC double buffering ===
> Absolute diff: MMC-VANILLA-CPU -> MMC-MMCI-2-BUF-DMA-LOG
>                                                        random  random
>        KB      reclen  write   rewrite read    reread  read    write
>        51200   4       -7      -11     -533    -513    -365    +0
>        cpu:            -0.0    -0.1    -0.5    -0.7    -0.4    +0.0
>
>        51200   8       -19     -28     -916    -932    -671    +0
>        cpu:            -0.0    -0.0    -0.3    -0.6    -0.2    +0.0
>
>        51200   16      +14     -13     -1467   -1479   -1203   +1
>        cpu:            +0.0    -0.1    -0.7    -0.7    -0.2    -0.0
>
>        51200   32      +61     +24     -2008   -2088   -1853   +4
>        cpu:            -0.3    -0.2    -0.7    -0.7    -0.2    -0.0
>
>        51200   64      +130    +84     -2571   -2692   -2483   +5
>        cpu:            +0.0    -0.4    -0.1    -0.7    -0.7    +0.0
>
>        51200   128     +275    +279    -2760   -2747   -2607   +19
>        cpu:            -0.1    +0.1    -0.7    -0.6    -0.7    +0.1
>
>        51200   256     +558    +503    -3455   -3429   -3216   +55
>        cpu:            -0.1    +0.1    -0.8    -0.1    -0.8    +0.0
>
>        51200   512     +608    +820    -2476   -2497   -2504   +154
>        cpu:            +0.2    +0.5    -3.3    -2.1    -2.7    +0.0
>
>        51200   1024    +652    +493    -818    -977    -1023   +291
>        cpu:            +0.0    -0.1    -13.2   -12.8   -13.3   +0.1
>
>        51200   2048    +654    +809    -241    -218    -242    +501
>        cpu:            -1.5    -1.2    -16.9   -18.2   -17.0   -0.2
>
>        51200   4096    +482    +908    -80     +82     -154    +633
>        cpu:            -1.4    -1.2    -19.1   -18.4   -18.6   -0.2
>
>        51200   8192    +643    +810    +199    +186    +182    +675
>        cpu:            -0.8    -0.7    -19.8   -19.2   -19.5   -0.7
>
>        51200   16384   +684    +724    +275    +323    +269    +724
>        cpu:            -0.6    -0.7    -19.2   -18.6   -19.8   -0.2
>
