Thanks, no problem.
We also want to test it, and we'll wait until you release it to the mmc mailing list.

I saw the MMC performance blueprint. We are currently suffering from poor MMC performance at low CPU frequencies: even though the input clock is constant at 50 MHz, the throughput depends on the CPU frequency. We need to investigate it.

Thank you,
Kyungmin Park

On Sat, Dec 18, 2010 at 11:19 PM, Per Forlin <per.for...@linaro.org> wrote:
> Hi,
>
> Thanks for your interest. I am in the middle of rewriting parts due to
> my findings about dma_unmap. If everything goes well I should have a
> new prototype ready on Tuesday.
> My code base is 2.6.37-rc4. Will that work for you?
>
> After Tuesday I will go on vacation until the Linaro sprint in Dallas on
> Jan 10. I will not make any new updates to my code during my vacation,
> but I'll try to keep up with my emails.
> I don't want to send it out for a full review yet because the code is
> far from ready. It would only cause too much noise, I'm afraid, and
> since I am going on vacation it is not the best timing.
>
> Patches:
> Is it OK for you to wait until Tuesday (or a few days later if I run
> into trouble)? Then you can test my latest version, which supports
> double buffering for unmap. I can send the patches directly to you.
>
> BR
> Per
>
> On 18 December 2010 03:50, Kyungmin Park <kmp...@infradead.org> wrote:
>> Hi,
>>
>> It's interesting.
>>
>> Can you send us your working code so we can test it in our environment
>> (a Samsung SoC)?
>>
>> Thank you,
>> Kyungmin Park
>>
>> On Sat, Dec 18, 2010 at 12:38 AM, Per Forlin <per.for...@linaro.org> wrote:
>>> Hi again,
>>>
>>> I made a mistake in my double-buffering implementation.
>>> I assumed dma_unmap did not do any cache operations. Well, it does.
>>> Due to L2 read prefetch, the L2 needs to be invalidated at dma_unmap.
>>>
>>> I made a quick test to see how much throughput would improve if
>>> dma_unmap could be run in parallel.
>>> In this run dma_unmap is removed.
>>>
>>> Then the figures for read become:
>>> * 7-16 % gain with double buffering in the ideal case, closing in on
>>>   the same performance as PIO.
>>>
>>> Relative diff: MMC-VANILLA-DMA-LOG -> MMC-MMCI-2-BUF-DMA-LOG-NO-UNMAP
>>> The cpu rows are absolute diffs.
>>>                                                   random  random
>>>     KB  reclen   write  rewrite    read   reread    read   write
>>>  51200       4     +0%      +0%     +7%      +8%     +2%     +0%
>>>   cpu:            +0.0     +0.0    +0.7     +0.7    -0.0    +0.0
>>>
>>>  51200       8     +0%      +0%    +10%     +10%     +6%     +0%
>>>   cpu:            -0.1     +0.1    +0.6     +0.9    +0.3    +0.0
>>>
>>>  51200      16     +0%      +0%    +11%     +11%     +8%     +0%
>>>   cpu:            -0.0     -0.1    +0.9     +1.0    +0.3    +0.0
>>>
>>>  51200      32     +0%      +0%    +13%     +13%    +10%     +0%
>>>   cpu:            -0.1     +0.0    +1.0     +0.5    +0.8    +0.0
>>>
>>>  51200      64     +0%      +0%    +13%     +13%    +12%     +1%
>>>   cpu:            +0.0     +0.0    +0.4     +1.0    +0.9    +0.1
>>>
>>>  51200     128     +0%      +5%    +14%     +14%    +14%     +1%
>>>   cpu:            +0.0     +0.2    +1.0     +0.9    +1.0    +0.0
>>>
>>>  51200     256     +0%      +2%    +13%     +13%    +13%     +1%
>>>   cpu:            +0.0     +0.1    +0.9     +0.3    +1.6    -0.1
>>>
>>>  51200     512     +0%      +1%    +14%     +14%    +14%     +8%
>>>   cpu:            -0.0     +0.3    +2.5     +1.8    +2.4    +0.3
>>>
>>>  51200    1024     +0%      +2%    +14%     +15%    +15%     +0%
>>>   cpu:            +0.0     +0.3    +1.3     +1.4    +1.3    +0.1
>>>
>>>  51200    2048     +2%      +2%    +15%     +15%    +15%     +4%
>>>   cpu:            +0.3     +0.1    +1.6     +2.1    +0.9    +0.3
>>>
>>>  51200    4096     +5%      +3%    +15%     +16%    +16%     +5%
>>>   cpu:            +0.0     +0.4    +1.1     +1.7    +1.7    +0.5
>>>
>>>  51200    8192     +5%      +3%    +16%     +16%    +16%     +2%
>>>   cpu:            +0.0     +0.4    +2.0     +1.3    +1.8    +0.1
>>>
>>>  51200   16384     +1%      +1%    +16%     +16%    +16%     +4%
>>>   cpu:            +0.1     -0.2    +2.3     +1.7    +2.6    +0.2
>>>
>>> I will work on adding unmap to double buffering next week.
>>>
>>> /Per
>>>
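The dma_unmap cost discussed above can be illustrated with a minimal sketch of the streaming DMA API around a read transfer. This is not taken from the posted code: the function name example_dma_read is made up, and the "submit and wait" step is elided.

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>

/* Illustrative only: names and structure are not from the posted patches. */
static int example_dma_read(struct device *dev, struct scatterlist *sg,
			    unsigned int sg_len)
{
	int nents;

	/*
	 * Map for the device: clean/invalidate the buffer so the DMAC and
	 * the CPU agree on the memory contents before the transfer starts.
	 */
	nents = dma_map_sg(dev, sg, sg_len, DMA_FROM_DEVICE);
	if (!nents)
		return -ENOMEM;

	/* ... submit the DMA job and wait for it to complete ... */

	/*
	 * Unmap after completion: for DMA_FROM_DEVICE this invalidates
	 * again, because speculative fetches (e.g. L2 read prefetch) may
	 * have pulled stale lines into the caches while the DMA was in
	 * flight.  This cost cannot simply be dropped, only moved out of
	 * the critical path, e.g. by overlapping it with the next transfer.
	 */
	dma_unmap_sg(dev, sg, sg_len, DMA_FROM_DEVICE);

	return 0;
}

The NO-UNMAP run above effectively measures what happens when that second invalidate disappears from the path between two transfers.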
>>> On 16 December 2010 15:15, Per Forlin <per.for...@linaro.org> wrote:
>>>> Hi,
>>>>
>>>> I am working on the blueprint
>>>> https://blueprints.launchpad.net/linux-linaro/+spec/other-storage-performance-emmc.
>>>> Currently I am investigating performance for DMA vs PIO on eMMC.
>>>>
>>>> Pros and cons of DMA on MMC:
>>>> + Offloads the CPU
>>>> + Fewer interrupts: one single interrupt per transfer, compared to
>>>>   hundreds or even thousands with PIO
>>>> + Power savings: DMA consumes less power than the CPU
>>>> - Less bandwidth / throughput compared to CPU-driven PIO
>>>>
>>>> The reason for introducing double buffering in the MMC framework is to
>>>> address the throughput issue for DMA on MMC.
>>>> The assumption is that the CPU and DMA have higher throughput than the
>>>> MMC / SD card.
>>>> My hypothesis is that the difference in performance between PIO mode
>>>> and DMA mode for MMC is due to the latency of preparing a DMA job.
>>>> If the next DMA job could be prepared while the current job is ongoing,
>>>> this latency would be reduced. The biggest part of preparing a DMA job
>>>> is cache maintenance.
>>>> In my case I run on U5500 (mach-ux500), which has both L1 and L2
>>>> caches. The host MMC driver in use is the mmci driver (PL180).
>>>>
>>>> I have made a hack in both the MMC framework and mmci as a proof of
>>>> concept, and I have run IOZone to get measurements to support my case.
>>>> The next step, if the results are promising, will be to clean up my
>>>> work and send out patches for review.
>>>>
>>>> The DMAC in ux500 supports two modes, LOG and PHY:
>>>> LOG - many logical channels are multiplexed on top of one physical channel
>>>> PHY - only one channel per physical channel
>>>>
>>>> The LOG and PHY DMA modes have different latency, both HW- and SW-wise.
>>>> One could almost treat them as two different DMACs. To get a wider test
>>>> scope I have tested using both modes.
>>>>
>>>> Summary of the results:
>>>> * It is optional for the MMC host driver to utilize the 2-buf support.
>>>>   2-buf in the framework requires no change in the host drivers.
>>>> * IOZone shows no performance hit on existing drivers* when adding 2-buf
>>>>   to the framework but not to the host driver.
>>>>   (* So far I have only tested one driver.)
>>>> * The performance gain for DMA using 2-buf is probably proportional to
>>>>   the cache maintenance time.
>>>>   The faster the card is, the more significant the cache maintenance
>>>>   part becomes, and vice versa.
>>>> * For U5500 with 2-buf, DMA performance is:
>>>>   Throughput: DMA vanilla vs DMA 2-buf
>>>>    * read  +5-10 %
>>>>    * write +0-3 %
>>>>   CPU load: CPU (PIO) vs DMA 2-buf
>>>>    * read, large data: minus 10-20 units of %
>>>>    * read, small data: same as PIO
>>>>    * write: same load as PIO (why?)
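As a rough sketch of the double-buffering idea in the hypothesis above, i.e. doing the cache maintenance for the next DMA job while the current one is running on the channel: the names below (prep_desc, prepare_next_req, post_prev_req) are made up for illustration and are not the posted implementation.

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>
#include <linux/types.h>

/* Made-up descriptor; the real patch set keeps this state elsewhere. */
struct prep_desc {
	struct scatterlist *sg;
	unsigned int sg_len;
	enum dma_data_direction dir;
	bool mapped;
};

/*
 * Prepare the *next* request up front so its cache maintenance runs
 * while the current transfer is still moving on the DMA channel.
 */
static int prepare_next_req(struct device *dev, struct prep_desc *next)
{
	if (!dma_map_sg(dev, next->sg, next->sg_len, next->dir))
		return -ENOMEM;
	next->mapped = true;
	return 0;
}

/*
 * Retire the *previous* request after its completion has been reported;
 * the invalidate done by dma_unmap_sg then no longer sits between two
 * back-to-back transfers.
 */
static void post_prev_req(struct device *dev, struct prep_desc *prev)
{
	if (prev->mapped) {
		dma_unmap_sg(dev, prev->sg, prev->sg_len, prev->dir);
		prev->mapped = false;
	}
}

In this sketch the framework would call prepare_next_req as soon as the following request is known, and post_prev_req once a completed request has been handed back, so neither the map nor the unmap separates two transfers.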
>>>> Here are two of the measurements from IOZone, comparing MMC with and
>>>> without double buffering. The rest you can find in the attached text
>>>> files.
>>>>
>>>> === Performance: CPU compared with DMA, vanilla kernel ===
>>>> Absolute diff: MMC-VANILLA-CPU -> MMC-VANILLA-DMA-LOG
>>>>                                                  random  random
>>>>     KB  reclen   write  rewrite    read   reread    read   write
>>>>  51200       4     -14       -8   -1005     -988    -679      -1
>>>>   cpu:            -0.0     -0.1    -0.8     -0.9    -0.7    +0.0
>>>>
>>>>  51200       8     -35      -34   -1763    -1791   -1327      +0
>>>>   cpu:            +0.0     -0.1    -0.9     -1.2    -0.7    +0.0
>>>>
>>>>  51200      16      +6      -38   -2712    -2728   -2225      +0
>>>>   cpu:            -0.1     -0.0    -1.6     -1.2    -0.7    -0.0
>>>>
>>>>  51200      32     -10      -79   -3640    -3710   -3298      -1
>>>>   cpu:            -0.1     -0.2    -1.2     -1.2    -0.7    -0.0
>>>>
>>>>  51200      64     +31      -16   -4401    -4533   -4212      -1
>>>>   cpu:            -0.2     -0.2    -0.6     -1.2    -1.2    -0.0
>>>>
>>>>  51200     128     +58      -58   -4749    -4776   -4532      -4
>>>>   cpu:            -0.2     -0.0    -1.2     -1.1    -1.2    +0.1
>>>>
>>>>  51200     256    +192     +283   -5343    -5347   -5184     +13
>>>>   cpu:            +0.0     +0.1    -1.2     -0.6    -1.2    +0.0
>>>>
>>>>  51200     512    +232     +470   -4663    -4690   -4588    +171
>>>>   cpu:            +0.1     +0.1    -4.5     -3.9    -3.8    -0.1
>>>>
>>>>  51200    1024    +250      +68   -3151    -3318   -3303    +122
>>>>   cpu:            -0.1     -0.5   -14.0    -13.5   -14.0    -0.1
>>>>
>>>>  51200    2048    +224     +401   -2708    -2601   -2612    +161
>>>>   cpu:            -1.7     -1.3   -18.4    -19.5   -17.8    -0.5
>>>>
>>>>  51200    4096    +194     +417   -2380    -2361   -2520    +242
>>>>   cpu:            -1.3     -1.6   -19.4    -19.9   -19.4    -0.6
>>>>
>>>>  51200    8192    +228     +315   -2279    -2327   -2291    +270
>>>>   cpu:            -1.0     -0.9   -20.8    -20.3   -21.0    -0.6
>>>>
>>>>  51200   16384    +254     +289   -2260    -2232   -2269    +308
>>>>   cpu:            -0.8     -0.8   -20.5    -19.9   -21.5    -0.4
>>>>
>>>> === Performance: CPU compared with DMA with MMC double buffering ===
>>>> Absolute diff: MMC-VANILLA-CPU -> MMC-MMCI-2-BUF-DMA-LOG
>>>>                                                  random  random
>>>>     KB  reclen   write  rewrite    read   reread    read   write
>>>>  51200       4      -7      -11    -533     -513    -365      +0
>>>>   cpu:            -0.0     -0.1    -0.5     -0.7    -0.4    +0.0
>>>>
>>>>  51200       8     -19      -28    -916     -932    -671      +0
>>>>   cpu:            -0.0     -0.0    -0.3     -0.6    -0.2    +0.0
>>>>
>>>>  51200      16     +14      -13   -1467    -1479   -1203      +1
>>>>   cpu:            +0.0     -0.1    -0.7     -0.7    -0.2    -0.0
>>>>
>>>>  51200      32     +61      +24   -2008    -2088   -1853      +4
>>>>   cpu:            -0.3     -0.2    -0.7     -0.7    -0.2    -0.0
>>>>
>>>>  51200      64    +130      +84   -2571    -2692   -2483      +5
>>>>   cpu:            +0.0     -0.4    -0.1     -0.7    -0.7    +0.0
>>>>
>>>>  51200     128    +275     +279   -2760    -2747   -2607     +19
>>>>   cpu:            -0.1     +0.1    -0.7     -0.6    -0.7    +0.1
>>>>
>>>>  51200     256    +558     +503   -3455    -3429   -3216     +55
>>>>   cpu:            -0.1     +0.1    -0.8     -0.1    -0.8    +0.0
>>>>
>>>>  51200     512    +608     +820   -2476    -2497   -2504    +154
>>>>   cpu:            +0.2     +0.5    -3.3     -2.1    -2.7    +0.0
>>>>
>>>>  51200    1024    +652     +493    -818     -977   -1023    +291
>>>>   cpu:            +0.0     -0.1   -13.2    -12.8   -13.3    +0.1
>>>>
>>>>  51200    2048    +654     +809    -241     -218    -242    +501
>>>>   cpu:            -1.5     -1.2   -16.9    -18.2   -17.0    -0.2
>>>>
>>>>  51200    4096    +482     +908     -80      +82    -154    +633
>>>>   cpu:            -1.4     -1.2   -19.1    -18.4   -18.6    -0.2
>>>>
>>>>  51200    8192    +643     +810    +199     +186    +182    +675
>>>>   cpu:            -0.8     -0.7   -19.8    -19.2   -19.5    -0.7
>>>>
>>>>  51200   16384    +684     +724    +275     +323    +269    +724
>>>>   cpu:            -0.6     -0.7   -19.2    -18.6   -19.8    -0.2
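Coming back to the summary point that 2-buf is optional for the host driver: one way such optional support can be expressed at the API level is sketched below. The hook names are illustrative and not taken from the posted patches; the framework calls them only when the driver provides them, so an unmodified driver behaves exactly as before.

/* Illustrative only: not the posted MMC framework interface. */
struct twobuf_ops {
	/* Both hooks are optional and may be left NULL. */
	void (*prepare_req)(void *host, void *next_req);
	void (*post_req)(void *host, void *done_req);
};

static void issue_with_overlap(const struct twobuf_ops *ops, void *host,
			       void *cur_req, void *next_req)
{
	if (ops->prepare_req && next_req)
		ops->prepare_req(host, next_req); /* overlaps with cur_req */

	/* ... start cur_req and wait for it to finish ... */

	if (ops->post_req)
		ops->post_req(host, cur_req);
}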