Hi Grazvydas, Kevin,

I did some gather some performance measurements and statistics using
custom tracepoints in __omap3_enter_idle.
All the details are at
http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
.

The setup is:
- Beagleboard (OMAP3530) at 500MHz,
- l-o master kernel + functional power states + per-device PM QoS. It
has been checked that the changes from l-o master do not have an
impact on the performance.
- The data transfer is performed using dd from a file in JFFS2 to
/dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.

On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman <khil...@ti.com> wrote:
> Grazvydas Ignotas <nota...@gmail.com> writes:
>
>> On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khil...@ti.com> wrote:
>>> It would be helpful now to narrow down what are the big contributors to
>>> the overhead in omap_sram_idle().  Most of the code there is skipped for
>>> C1 because the next states for MPU and CORE are both ON.
>>
>> Ok I did some tests, all in mostly idle system with just init, busybox
>> shell and dd doing a NAND read to /dev/null .
>
...
>
>> MB/s is throughput that
>> dd reports, mA and approx. current draw during the transfer, read from
>> fuel gauge that's onboard.
>>
>> MB/s| mA|comment
>>  3.7|218|mainline f549e088b80
>>  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
>>  4.4|220|[1] + pwrdm_p*_transition commented [2]
>>  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
>>  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
>>  4.0|224|[1] + 'Deny idle' [5]
>>  5.1|210|[2] + [4] + [5]
>>  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
>>  5.5|243|!CONFIG_PM
>>  6.1|282|busywait DMA end (for reference)

Here are the results (BW in MB/s) on Beagleboard:
- 4.7: without using DMA,

- Using DMA
  2.1: [0]
  2.1: [1] only C1
  2.6: [1]+[2] no pre_ post_
  2.3: [1]+[5] no pwrdm_for_each_clkdm
  2.8: [1]+[5]+[2]
  3.1: [1]+[5]+[6] no omap_sram_idle
  3.1: No IDLE, no omap_sram_idle, all pwrdms to ON

So indeed this shows there is some serious performance issue with the
C1 C-state.

> Thanks for the detailed experiments.  This definitely confirms we have
> some serious unwanted overhead for C1, and our C-state latency values
> are clearly way off base, since they only account HW latency and not any
> of the SW latency introduced in omap_sram_idle().
>
>>> There are 2 primary differences that I see as possible causes.  I list
>>> them here with a couple more experiments for you to try to help us
>>> narrow this down.
>>>
>>> 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()
>>>
>>> Could you try using omap_sram_idle() and just commenting out those
>>> calls?  Does that help performance?  Those iterate over all the
>>> powerdomains, so defintely add some overhead, but I don't think it
>>> would be as significant as what you're seeing.
>>
>> Seems to be taking good part of it.
>>
>>>    Much more likely is...
>>>
>>> 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds
>>
>> Could not notice any difference.
>>
>> To me it looks like this results from many small things adding up..
>> Idle is called so often that pwrdm_p*_transition() and those
>> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
>> because they access lots of registers on slow buses?

>From the list of contributors, the main ones are:
    (140us) pwrdm_pre_transition and pwrdm_post_transition,
    (105us) omap2_gpio_prepare_for_idle and
omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
the latency-critical C-states,
    (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
    (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
    (11 us) clkdm_allow_idle(mpu). Is this needed?

Here are a few questions and suggestions:
- In case of latency critical C-states could the high-latency code be
bypassed in favor of a much simpler version? Pushing the concept a bit
farther one could have a C1 state that just relaxes the cpu (no WFI),
a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
rest of the C-states as we have today,
- Is it needed to iterate through all the power and clock domains in
order to keep them active?
- Trying to idle some non related power domains (e.g. PER) causes a
performance hit. How to link all the power domains states to the
cpuidle C-state? The per-device PM QoS framework could be used to
constraint some power domains, but this is highly dependent on the use
case.

> Yes PRCM register accesses are unfortunately rather slow, and we've
> known that for some time, but haven't done any detailed analysis of the
> overhead.
That would be worth doing the analysis. A lot of read accesses to the
current, next and previous power states are performed in the idle
code.

> Using the function_graph tracer, I was able to see that the pre/post
> transition are taking an enormous amount of time:
>
>  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
>  - pwrdm post-transtion: 1600+ us at 600MHz (6000+ us at 125MHz)
>
> Notice the big difference between 600MHz OPP and 125MHz OPP.  Are you
> using CPUfreq at all in your tests?  If using cpufreq + ondemand
> governor, you're probably running at low OPP due to lack of CPU activity
> which will also affect the latencies in the idle path.
>
>> Maybe some register cache would help us there, or are those registers
>> expected to be changed by hardware often?
>
> Yes, we've known that some sort of register cache here would be useful
> for some time, but haven't got to implementing it.
I can try some proof of concept code, just to prove its usefulness.

>> Also trying to idle PER while transfer is ongoing (as reported in
>> previous mail) doesn't sound like a good idea and is one of the
>> reasons for slowdown. Seems to also causing more current drain,
>> ironically.
>
> Agreed.  Again, using the function_graph tracer, I get some pretty big
> latencies from the GPIO pre/post idling process:
>
>  - gpio_prepare_for_idle(): 2400+ us at 600MHz (8200+ us at 125MHz)
>  - gpio_resume_from_idle(): 2200+ us at 600MHz (7600+ us at 125MHz)
>
> Removing PER transtions as you did will get rid of those.
>
> I'm looking into this in more detail know, and will likely have a few
> patches for you to experiment with.
>
> Thanks again for digging into this with us,
>
> Kevin
>

Any thoughts?

Regards,
Jean

>>
>>
>> changes (again, sorry for corrupted diffs, but they should be easy to
>> reproduce):
>> [2]:
>> --- a/arch/arm/mach-omap2/pm34xx.c
>> +++ b/arch/arm/mach-omap2/pm34xx.c
>> @@ -307,7 +307,7 @@ void omap_sram_idle(void)
>>                         omap3_enable_io_chain();
>>         }
>>
>> -       pwrdm_pre_transition();
>> +//     pwrdm_pre_transition();
>>
>>         /* PER */
>>         if (per_next_state < PWRDM_POWER_ON) {
>> @@ -372,7 +373,7 @@ void omap_sram_idle(void)
>>         }
>>         omap3_intc_resume_idle();
>>
>> -       pwrdm_post_transition();
>> +//     pwrdm_post_transition();
>>
>>         /* PER */
>>         if (per_next_state < PWRDM_POWER_ON) {
>> [3]:
>> --- a/arch/arm/mach-omap2/pm34xx.c
>> +++ b/arch/arm/mach-omap2/pm34xx.c
>> @@ -347,7 +347,7 @@ void omap_sram_idle(void)
>>         if (save_state == 1 || save_state == 3)
>>                 cpu_suspend(save_state, omap34xx_do_sram_idle);
>>         else
>> -               omap34xx_do_sram_idle(save_state);
>> +               cpu_do_idle();
>>
>>         /* Restore normal SDRC POWER settings */
>>         if (cpu_is_omap3430() && omap_rev() >= OMAP3430_REV_ES3_0 &&
>> [4]:
>> --- a/arch/arm/mach-omap2/cpuidle34xx.c
>> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
>> @@ -107,6 +107,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
>>         if (index == 0) {
>>                 pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
>>                 pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
>> +               pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON);
>>         }
>>
>>         /*
>> [5]:
>> --- a/arch/arm/mach-omap2/cpuidle34xx.c
>> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
>> @@ -105,8 +105,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
>>
>>         /* Deny idle for C1 */
>>         if (index == 0) {
>> -               pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
>> -               pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
>> +               clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
>>         }
>>
>>         /*
>> @@ -128,8 +128,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
>>
>>         /* Re-allow idle for C1 */
>>         if (index == 0) {
>> -               pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle);
>> -               pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle);
>> +               clkdm_allow_idle(mpu_pd->pwrdm_clkdms[0]);
>>         }
>>
>>  return_sleep_time:
>> [6]:
>> --- a/arch/arm/mach-omap2/cpuidle34xx.c
>> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
>> @@ -117,7 +116,8 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
>>                 cpu_pm_enter();
>>
>>         /* Execute ARM wfi */
>> -       omap_sram_idle();
>> +       //omap_sram_idle();
>> +       cpu_do_idle();
>>
>>         /*
>>          * Call idle CPU PM enter notifier chain to restore
> --
> To unsubscribe from this list: send the line "unsubscribe linux-omap" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to