Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-07-01 Thread Willy Wolff
On 2020-06-29-12-52-10, Lukasz Luba wrote:
> Hi Chanwoo,
> 
> On 6/29/20 2:43 AM, Chanwoo Choi wrote:
> > Hi,
> > 
> > Sorry for late reply because of my personal issue. I could not check the
> > email.
> 
> I hope you are good now.
> 
> > 
> > On 6/26/20 8:22 PM, Bartlomiej Zolnierkiewicz wrote:
> > > 
> > > On 6/25/20 2:12 PM, Kamil Konieczny wrote:
> > > > On 25.06.2020 14:02, Lukasz Luba wrote:
> > > > > 
> > > > > 
> > > > > On 6/25/20 12:30 PM, Kamil Konieczny wrote:
> > > > > > Hi Lukasz,
> > > > > > 
> > > > > > On 25.06.2020 12:02, Lukasz Luba wrote:
> > > > > > > Hi Sylwester,
> > > > > > > 
> > > > > > > On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
> > > > > > > > Hi All,
> > > > > > > > 
> > > > > > > > On 24.06.2020 12:32, Lukasz Luba wrote:
> > > > > > > > > I had issues with devfreq governor which wasn't called by 
> > > > > > > > > devfreq
> > > > > > > > > workqueue. The old DELAYED vs DEFERRED work discussions and 
> > > > > > > > > my patches
> > > > > > > > > for it [1]. If the CPU which scheduled the next work went 
> > > > > > > > > idle, the
> > > > > > > > > devfreq workqueue will not be kicked and devfreq governor 
> > > > > > > > > won't check
> > > > > > > > > DMC status and will not decide to decrease the frequency 
> > > > > > > > > based on low
> > > > > > > > > busy_time.
> > > > > > > > > The same applies for going up with the frequency. They both 
> > > > > > > > > are
> > > > > > > > > done by the governor but the workqueue must be scheduled 
> > > > > > > > > periodically.
> > > > > > > > 
> > > > > > > > As I have been working on resolving the video mixer IOMMU fault 
> > > > > > > > issue
> > > > > > > > described here: https://patchwork.kernel.org/patch/10861757
> > > > > > > > I did some investigation of the devfreq operation, mostly on 
> > > > > > > > Odroid U3.
> > > > > > > > 
> > > > > > > > My conclusions are similar to what Lukasz says above. I would 
> > > > > > > > like to add
> > > > > > > > that broken scheduling of the performance counters read and the 
> > > > > > > > devfreq
> > > > > > > > updates seems to have one more serious implication. In each 
> > > > > > > > call, which
> > > > > > > > normally should happen periodically with fixed interval we stop 
> > > > > > > > the counters,
> > > > > > > > read counter values and start the counters again. But if period 
> > > > > > > > between
> > > > > > > > calls becomes long enough to let any of the counters overflow, 
> > > > > > > > we will
> > > > > > > > get wrong performance measurement results. My observations are 
> > > > > > > > that
> > > > > > > > the workqueue job can be suspended for several seconds and 
> > > > > > > > conditions for
> > > > > > > > the counter overflow occur sooner or later, depending among 
> > > > > > > > others
> > > > > > > > on the CPUs load.
> > > > > > > > Wrong bus load measurement can lead to setting too low 
> > > > > > > > interconnect bus
> > > > > > > > clock frequency and then bad things happen in peripheral 
> > > > > > > > devices.
> > > > > > > > 
> > > > > > > > I agree the workqueue issue needs to be fixed. I have some WIP 
> > > > > > > > code to use
> > > > > > > > the performance counters overflow interrupts instead of SW 
> > > > > > > > polling and with
> > > > > > > > that the interconnect bus clock control seems to work much 
> > > > > > > > better.
> > > > > > > > 
> > > > > > > 
> > > > > > > Thank you for sharing your use case and investigation results. I 
> > > > > > > think
> > > > > > > we are reaching a decent number of developers to maybe address 
> > > > > > > this
> > > > > > > issue: 'workqueue issue needs to be fixed'.
> > > > > > > I have been facing this devfreq workqueue issue ~5 times in 
> > > > > > > different
> > > > > > > platforms.
> > > > > > > 
> > > > > > > Regarding the 'performance counters overflow interrupts' there is 
> > > > > > > one
> > > > > > > thing worth to keep in mind: variable utilization and frequency.
> > > > > > > For example, in order to make a conclusion in algorithm deciding 
> > > > > > > that
> > > > > > > the device should increase or decrease the frequency, we fix the 
> > > > > > > period
> > > > > > > of observation, i.e. to 500ms. That can cause the long delay if 
> > > > > > > the
> > > > > > > utilization of the device suddenly drops. For example we set an
> > > > > > > overflow threshold to value i.e. 1000 and we know that at 1000MHz
> > > > > > > and full utilization (100%) the counter will reach that threshold
> > > > > > > after 500ms (which we want, because we don't want too many 
> > > > > > > interrupts
> > > > > > > per sec). What if suddenly utilization drops to 2% (i.e. from 
> > > > > > > 5GB/s
> > > > > > > to 250MB/s (what if it drops to 25MB/s?!)), the counter will 
> > > > > > > reach the
> > > > > > > threshold after 50*500ms = 25s. It is impossible just for the 
> > > > > > > counters
> > > > > > > to predict next utilization and adjust the threshold. [...]
> > > > > > 
> > > 

Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-29 Thread Lukasz Luba

Hi Chanwoo,

On 6/29/20 2:43 AM, Chanwoo Choi wrote:

Hi,

Sorry for late reply because of my personal issue. I could not check the email.


I hope you are good now.



On 6/26/20 8:22 PM, Bartlomiej Zolnierkiewicz wrote:


On 6/25/20 2:12 PM, Kamil Konieczny wrote:

On 25.06.2020 14:02, Lukasz Luba wrote:



On 6/25/20 12:30 PM, Kamil Konieczny wrote:

Hi Lukasz,

On 25.06.2020 12:02, Lukasz Luba wrote:

Hi Sylwester,

On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:

Hi All,

On 24.06.2020 12:32, Lukasz Luba wrote:

I had issues with devfreq governor which wasn't called by devfreq
workqueue. The old DELAYED vs DEFERRED work discussions and my patches
for it [1]. If the CPU which scheduled the next work went idle, the
devfreq workqueue will not be kicked and devfreq governor won't check
DMC status and will not decide to decrease the frequency based on low
busy_time.
The same applies for going up with the frequency. They both are
done by the governor but the workqueue must be scheduled periodically.
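
The scheduling problem described above can be sketched with a small Python model (not kernel code). The function name `fire_times` and the wake-up list are hypothetical. The only behaviour assumed is that a deferrable timer does not wake an idle CPU and instead runs at that CPU's next unrelated wake-up, while an ordinary delayed timer fires on time.

```python
def fire_times(period_ms, count, cpu_wakeups_ms, deferrable):
    """Return the times (ms) at which a periodic work item actually runs.

    cpu_wakeups_ms: sorted times at which the CPU wakes up for other reasons.
    The CPU is assumed to be idle at all other times (a simplification).
    """
    fired = []
    now = 0
    for _ in range(count):
        due = now + period_ms
        if deferrable:
            # A deferrable timer waits for the next unrelated wake-up.
            later = [t for t in cpu_wakeups_ms if t >= due]
            if not later:
                break  # the CPU never wakes again, so the work never runs
            due = later[0]
        fired.append(due)
        now = due  # the work re-arms itself relative to when it ran
    return fired


# Polling every 100 ms on a CPU that only wakes at 0.1 s, 3 s and 7 s.
wakeups = [100, 3000, 7000]
print(fire_times(100, 3, wakeups, deferrable=False))  # [100, 200, 300]
print(fire_times(100, 3, wakeups, deferrable=True))   # [100, 3000, 7000]
```

In the deferrable case the governor is called seconds apart instead of every 100 ms, which matches the symptom reported in this thread.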


As I have been working on resolving the video mixer IOMMU fault issue
described here: https://patchwork.kernel.org/patch/10861757
I did some investigation of the devfreq operation, mostly on Odroid U3.

My conclusions are similar to what Lukasz says above. I would like to add
that broken scheduling of the performance counters read and the devfreq
updates seems to have one more serious implication. In each call, which
normally should happen periodically with fixed interval we stop the counters,
read counter values and start the counters again. But if period between
calls becomes long enough to let any of the counters overflow, we will
get wrong performance measurement results. My observations are that
the workqueue job can be suspended for several seconds and conditions for
the counter overflow occur sooner or later, depending among others
on the CPUs load.
Wrong bus load measurement can lead to setting too low interconnect bus
clock frequency and then bad things happen in peripheral devices.
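
A minimal Python sketch of the overflow effect described above. The 32-bit counter width, the event rates, and the function names are assumptions chosen for illustration, not values taken from the hardware manual.

```python
COUNTER_BITS = 32  # assumed counter width, for illustration only


def read_counter(events_per_s, interval_s, bits=COUNTER_BITS):
    """Value a free-running counter shows after interval_s, with wrap-around."""
    true_count = int(events_per_s * interval_s)
    return true_count % (1 << bits)


def measured_load(busy_per_s, cycles_per_s, interval_s):
    """Load in percent as computed from the (possibly wrapped) counters."""
    busy = read_counter(busy_per_s, interval_s)
    total = read_counter(cycles_per_s, interval_s)
    return 100.0 * busy / total if total else None


# A bus that is 90% busy at 400 MHz.
print(measured_load(360e6, 400e6, 0.1))  # polled on time: 90% load
print(measured_load(360e6, 400e6, 12))   # polled after 12 s: both counters wrapped
```

With the late read both counters have wrapped once, and the computed load comes out at roughly 5% instead of 90%. A governor acting on that value would lower the bus clock even though the bus is heavily used.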

I agree the workqueue issue needs to be fixed. I have some WIP code to use
the performance counters overflow interrupts instead of SW polling and with
that the interconnect bus clock control seems to work much better.



Thank you for sharing your use case and investigation results. I think
we are reaching a decent number of developers to maybe address this
issue: 'workqueue issue needs to be fixed'.
I have been facing this devfreq workqueue issue ~5 times in different
platforms.

Regarding the 'performance counters overflow interrupts' there is one
thing worth to keep in mind: variable utilization and frequency.
For example, in order to make a conclusion in algorithm deciding that
the device should increase or decrease the frequency, we fix the period
of observation, i.e. to 500ms. That can cause the long delay if the
utilization of the device suddenly drops. For example we set an
overflow threshold to value i.e. 1000 and we know that at 1000MHz
and full utilization (100%) the counter will reach that threshold
after 500ms (which we want, because we don't want too many interrupts
per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
threshold after 50*500ms = 25s. It is impossible just for the counters
to predict next utilization and adjust the threshold. [...]
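
The arithmetic in this example can be checked with a short Python sketch. The function name is hypothetical, and it assumes the counter advances in proportion to utilization at a fixed clock frequency.

```python
def time_to_threshold_s(base_period_s, utilization):
    """Seconds until the counter reaches a threshold that is hit after
    base_period_s at 100% utilization (utilization is a fraction, 0..1)."""
    if utilization <= 0:
        return float("inf")  # an idle device never reaches the threshold
    return base_period_s / utilization


for label, util in [("100%", 1.0), ("5%", 0.05), ("2%", 0.02), ("0.5%", 0.005)]:
    print(label, round(time_to_threshold_s(0.5, util), 3), "s")
```

The 25 s figure corresponds to 2% utilization. Taken literally, 250MB/s is 5% of 5GB/s, which would give about 10 s, and 25MB/s (0.5%) would give about 100 s. Either way the interrupt arrives far too late to be useful on its own.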


irq triggers for underflow and overflow, so driver can adjust freq



Probably possible on some platforms, depends on how many PMU registers
are available, what information can be assigned to them and type of
interrupt. A lot of hassle and still - platform and device specific.
Also, drivers should not adjust the freq, governors (different types
of them with different settings that they can handle) should do it.

What the framework can do is to take this responsibility and provide
generic way to monitor the devices (or stop if they are suspended).
That should work nicely with the governors, which try to predict the
next best frequency. From my experience the more fluctuating intervals
the governors are called, the more odd decisions they make.
That's why I think having a predictable interval i.e. 100ms is something
desirable. Tuning the governors is easier in this case, statistics
are easier to trace and interpret, solution is not too platform specific,
etc.
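
For readers unfamiliar with how such a governor predicts the next frequency, here is a simplified Python sketch loosely modelled on a threshold-based on-demand policy. The threshold values, the function name, and the frequencies are assumptions for illustration. This is not a copy of the kernel source.

```python
UPTHRESHOLD = 90       # percent, assumed value
DOWNDIFFERENTIAL = 5   # percent, assumed value


def next_freq(busy_time, total_time, cur_freq, max_freq):
    """Pick the next frequency from one observation window."""
    if total_time == 0 or cur_freq == 0:
        return max_freq  # no usable data: play safe
    if busy_time * 100 > total_time * UPTHRESHOLD:
        return max_freq  # busy enough: jump to the maximum
    if busy_time * 100 > total_time * (UPTHRESHOLD - DOWNDIFFERENTIAL):
        return cur_freq  # inside the hysteresis band: keep the frequency
    # Otherwise scale the frequency down in proportion to the load.
    target = busy_time * cur_freq // total_time
    return target * 100 // (UPTHRESHOLD - DOWNDIFFERENTIAL // 2)


print(next_freq(95, 100, 400_000_000, 825_000_000))  # above the up threshold
print(next_freq(87, 100, 400_000_000, 825_000_000))  # inside the band
print(next_freq(20, 100, 400_000_000, 825_000_000))  # low load
```

The decision depends only on the ratio of busy_time to total_time within one window, so a window that stretches unpredictably gives the policy a stale or averaged-out picture of the load. The caller would still have to round the result to a frequency the device supports.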

Kamil do you have plans to refresh and push your next version of the
workqueue solution?


I do not, as Bartek takes over my work,
+CC Bartek


Hi Lukasz,

As you remember in January Chanwoo has proposed another idea (to allow
selecting workqueue type by devfreq device driver):

"I'm developing the RFC patch and then I'll send it as soon as possible."
(https://lore.kernel.org/linux-pm/6107fa2b-81ad-060d-89a2-d8941ac4d...@samsung.com/)

"After posting my suggestion, we can discuss it"
(https://lore.kernel.org/linux-pm/f5c5cd64-b72c-2802-f6ea-ab3d28483...@samsung.com/)

so we have been waiting on the patch to be posted..



Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-29 Thread Lukasz Luba




On 6/26/20 6:50 PM, Sylwester Nawrocki wrote:

Hi Lukasz,

On 25.06.2020 12:02, Lukasz Luba wrote:

Regarding the 'performance counters overflow interrupts' there is one
thing worth to keep in mind: variable utilization and frequency.
For example, in order to make a conclusion in algorithm deciding that
the device should increase or decrease the frequency, we fix the period
of observation, i.e. to 500ms. That can cause the long delay if the
utilization of the device suddenly drops. For example we set an
overflow threshold to value i.e. 1000 and we know that at 1000MHz
and full utilization (100%) the counter will reach that threshold
after 500ms (which we want, because we don't want too many interrupts
per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
threshold after 50*500ms = 25s. It is impossible just for the counters
to predict next utilization and adjust the threshold.


Agreed, that's in case when we use just the performance counter (PMCNT)
overflow interrupts. In my experiments I used the (total) cycle counter
(CCNT) overflow interrupts. As that counter is clocked with fixed rate
between devfreq updates it can be used as a timer by pre-loading it with
initial value depending on current bus frequency. But we could as well
use some reliable system timer mechanism to generate periodic events.
I was hoping to use the cycle counter to generate low frequency monitor
events and the actual performance counters overflow interrupts to detect
any sudden changes of utilization. However, it seems it cannot be done
with as simple performance counters HW architecture as on Exynos4412.
It looks like on Exynos5422 we have all that is needed, there is more
flexibility in selecting the counter source signal, e.g. each counter
can be a clock cycle counter or can count various bus events related to
actual utilization. Moreover, we could configure the counter gating period
and alarm interrupts are available for when the counter value drops below
configured MIN threshold or exceeds configured MAX value.


I see. I don't have TRM for Exynos5422 so couldn't see that. I also
have to keep in mind other platforms which might not have this feature.



So it should be possible to configure the HW to generate the utilization
monitoring events without excessive continuous CPU intervention.


I agree, that would be desirable especially for low load in the system.


But I'm rather not going to work on the Exynos5422 SoC support at the moment.


I see.




To address that, we still need to have another mechanism (like watchdog)
which will be triggered just to check if the threshold needs adjustment.
This mechanism can be a local timer in the driver or a framework
timer running kind of 'for loop' on all this type of devices (like
the scheduled workqueue). In both cases in the system there will be
interrupts, timers (even at workqueues) and scheduling.
The approach to force developers to implement their local watchdog
timers (or workqueues) in drivers is IMHO wrong and that's why we have
frameworks.


Yes, it should be also possible in the framework to use the counter alarm
events where the hardware is advanced enough, in order to avoid excessive
SW polling.


Looks promising, but that would need more plumbing I assume.

Regards,
Lukasz



--
Regards,
Sylwester



Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-29 Thread Lukasz Luba




On 6/26/20 12:22 PM, Bartlomiej Zolnierkiewicz wrote:


On 6/25/20 2:12 PM, Kamil Konieczny wrote:

On 25.06.2020 14:02, Lukasz Luba wrote:



On 6/25/20 12:30 PM, Kamil Konieczny wrote:


[snip]



Kamil do you have plans to refresh and push your next version of the
workqueue solution?


I do not, as Bartek takes over my work,
+CC Bartek


Hi Lukasz,


Hi Bartek,



As you remember in January Chanwoo has proposed another idea (to allow
selecting workqueue type by devfreq device driver):

"I'm developing the RFC patch and then I'll send it as soon as possible."
(https://lore.kernel.org/linux-pm/6107fa2b-81ad-060d-89a2-d8941ac4d...@samsung.com/)

"After posting my suggestion, we can discuss it"
(https://lore.kernel.org/linux-pm/f5c5cd64-b72c-2802-f6ea-ab3d28483...@samsung.com/)

so we have been waiting on the patch to be posted..

Similarly we have been waiting on (any) feedback for exynos-bus/nocp
fixes for Exynos5422 support (which have been posted by Kamil also in
January):

https://lore.kernel.org/linux-pm/8f82d8d5-927b-afb4-272f-45c16b5a2...@samsung.com/

Considering the above and how hard it has been to push the changes
through review/merge process last year we are near giving up when it
comes to upstream devfreq contributions. Sylwester is still working on
exynos-bus & interconnect integration (continuation of Artur Swigon's
work from last year) & related issues (IRQ support for PPMU) but
I'm seriously considering putting it all on-hold..


Thank you for detailed explanation and update. I see. Anyway, if you or
Sylwester need some help with this devfreq workqueue, I offer my time
as a reviewer.

The more generic solution you propose, the better for all platforms.

Regards,
Lukasz



Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics



Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-28 Thread Chanwoo Choi
Hi Sylwester,

On 6/25/20 12:11 AM, Sylwester Nawrocki wrote:
> Hi All,
> 
> On 24.06.2020 12:32, Lukasz Luba wrote:
>> I had issues with devfreq governor which wasn't called by devfreq
>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>> for it [1]. If the CPU which scheduled the next work went idle, the
>> devfreq workqueue will not be kicked and devfreq governor won't check
>> DMC status and will not decide to decrease the frequency based on low
>> busy_time.
>> The same applies for going up with the frequency. They both are
>> done by the governor but the workqueue must be scheduled periodically.
> 
> As I have been working on resolving the video mixer IOMMU fault issue
> described here: https://patchwork.kernel.org/patch/10861757
> I did some investigation of the devfreq operation, mostly on Odroid U3.
> 
> My conclusions are similar to what Lukasz says above. I would like to add
> that broken scheduling of the performance counters read and the devfreq 
> updates seems to have one more serious implication. In each call, which
> normally should happen periodically with fixed interval we stop the counters, 
> read counter values and start the counters again. But if period between 
> calls becomes long enough to let any of the counters overflow, we will
> get wrong performance measurement results. My observations are that 
> the workqueue job can be suspended for several seconds and conditions for 
> the counter overflow occur sooner or later, depending among others 
> on the CPUs load.
> Wrong bus load measurement can lead to setting too low interconnect bus 
> clock frequency and then bad things happen in peripheral devices.
> 
> I agree the workqueue issue needs to be fixed. I have some WIP code to use
> the performance counters overflow interrupts instead of SW polling and with 

It is good way to resolve the overflow issue.

> that the interconnect bus clock control seems to work much better.
>
-- 
Best Regards,
Chanwoo Choi
Samsung Electronics


Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-28 Thread Chanwoo Choi
Hi,

Sorry for late reply because of my personal issue. I could not check the email.

On 6/26/20 8:22 PM, Bartlomiej Zolnierkiewicz wrote:
> 
> On 6/25/20 2:12 PM, Kamil Konieczny wrote:
>> On 25.06.2020 14:02, Lukasz Luba wrote:
>>>
>>>
>>> On 6/25/20 12:30 PM, Kamil Konieczny wrote:
>>>> Hi Lukasz,
>>>>
>>>> On 25.06.2020 12:02, Lukasz Luba wrote:
>>>>> Hi Sylwester,
>>>>>
>>>>> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>>>>>> I had issues with devfreq governor which wasn't called by devfreq
>>>>>>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>>>>>>> for it [1]. If the CPU which scheduled the next work went idle, the
>>>>>>> devfreq workqueue will not be kicked and devfreq governor won't check
>>>>>>> DMC status and will not decide to decrease the frequency based on low
>>>>>>> busy_time.
>>>>>>> The same applies for going up with the frequency. They both are
>>>>>>> done by the governor but the workqueue must be scheduled periodically.
>>>>>>
>>>>>> As I have been working on resolving the video mixer IOMMU fault issue
>>>>>> described here: https://patchwork.kernel.org/patch/10861757
>>>>>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>>>>>
>>>>>> My conclusions are similar to what Lukasz says above. I would like to add
>>>>>> that broken scheduling of the performance counters read and the devfreq
>>>>>> updates seems to have one more serious implication. In each call, which
>>>>>> normally should happen periodically with fixed interval we stop the counters,
>>>>>> read counter values and start the counters again. But if period between
>>>>>> calls becomes long enough to let any of the counters overflow, we will
>>>>>> get wrong performance measurement results. My observations are that
>>>>>> the workqueue job can be suspended for several seconds and conditions for
>>>>>> the counter overflow occur sooner or later, depending among others
>>>>>> on the CPUs load.
>>>>>> Wrong bus load measurement can lead to setting too low interconnect bus
>>>>>> clock frequency and then bad things happen in peripheral devices.
>>>>>>
>>>>>> I agree the workqueue issue needs to be fixed. I have some WIP code to use
>>>>>> the performance counters overflow interrupts instead of SW polling and with
>>>>>> that the interconnect bus clock control seems to work much better.
>>>>>>
>>>>>
>>>>> Thank you for sharing your use case and investigation results. I think
>>>>> we are reaching a decent number of developers to maybe address this
>>>>> issue: 'workqueue issue needs to be fixed'.
>>>>> I have been facing this devfreq workqueue issue ~5 times in different
>>>>> platforms.
>>>>>
>>>>> Regarding the 'performance counters overflow interrupts' there is one
>>>>> thing worth to keep in mind: variable utilization and frequency.
>>>>> For example, in order to make a conclusion in algorithm deciding that
>>>>> the device should increase or decrease the frequency, we fix the period
>>>>> of observation, i.e. to 500ms. That can cause the long delay if the
>>>>> utilization of the device suddenly drops. For example we set an
>>>>> overflow threshold to value i.e. 1000 and we know that at 1000MHz
>>>>> and full utilization (100%) the counter will reach that threshold
>>>>> after 500ms (which we want, because we don't want too many interrupts
>>>>> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
>>>>> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
>>>>> threshold after 50*500ms = 25s. It is impossible just for the counters
>>>>> to predict next utilization and adjust the threshold. [...]
>>>>
>>>> irq triggers for underflow and overflow, so driver can adjust freq
>>>>
>>>
>>> Probably possible on some platforms, depends on how many PMU registers
>>> are available, what information can be assigned to them and type of
>>> interrupt. A lot of hassle and still - platform and device specific.
>>> Also, drivers should not adjust the freq, governors (different types
>>> of them with different settings that they can handle) should do it.
>>>
>>> What the framework can do is to take this responsibility and provide
>>> generic way to monitor the devices (or stop if they are suspended).
>>> That should work nicely with the governors, which try to predict the
>>> next best frequency. From my experience the more fluctuating intervals
>>> the governors are called, the more odd decisions they make.
>>> That's why I think having a predictable interval i.e. 100ms is something
>>> desirable. Tuning the governors is easier in this case, statistics
>>> are easier to trace and interpret, solution is not too platform specific,
>>> etc.
>>>
>>> Kamil do you have plans to refresh and push your next version of the
>>> workqueue solution?
>>
>> I do not, as Bartek takes over my work,
>> +CC Bartek
> 
> Hi Lukasz,
> 
> As you remember in January Chanwoo has 

Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-26 Thread Sylwester Nawrocki
Hi Lukasz,

On 25.06.2020 12:02, Lukasz Luba wrote:
> Regarding the 'performance counters overflow interrupts' there is one
> thing worth to keep in mind: variable utilization and frequency.
> For example, in order to make a conclusion in algorithm deciding that
> the device should increase or decrease the frequency, we fix the period
> of observation, i.e. to 500ms. That can cause the long delay if the
> utilization of the device suddenly drops. For example we set an
> overflow threshold to value i.e. 1000 and we know that at 1000MHz
> and full utilization (100%) the counter will reach that threshold
> after 500ms (which we want, because we don't want too many interrupts
> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
> threshold after 50*500ms = 25s. It is impossible just for the counters
> to predict next utilization and adjust the threshold.

Agreed, that's in case when we use just the performance counter (PMCNT)
overflow interrupts. In my experiments I used the (total) cycle counter
(CCNT) overflow interrupts. As that counter is clocked with fixed rate
between devfreq updates it can be used as a timer by pre-loading it with 
initial value depending on current bus frequency. But we could as well 
use some reliable system timer mechanism to generate periodic events. 
I was hoping to use the cycle counter to generate low frequency monitor 
events and the actual performance counters overflow interrupts to detect 
any sudden changes of utilization. However, it seems it cannot be done 
with as simple performance counters HW architecture as on Exynos4412.
It looks like on Exynos5422 we have all that is needed, there is more
flexibility in selecting the counter source signal, e.g. each counter
can be a clock cycle counter or can count various bus events related to 
actual utilization. Moreover, we could configure the counter gating period 
and alarm interrupts are available for when the counter value drops below 
configured MIN threshold or exceeds configured MAX value.

So it should be possible to configure the HW to generate the utilization 
monitoring events without excessive continuous CPU intervention.
But I'm rather not going to work on the Exynos5422 SoC support at the moment.
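
The idea above of using the cycle counter as a timer can be sketched in Python as follows. The 32-bit, up-counting counter and the function name are assumptions for illustration. The real register layout is not shown in this thread.

```python
COUNTER_BITS = 32  # assumed width of the cycle counter


def ccnt_preload(bus_freq_hz, period_ms, bits=COUNTER_BITS):
    """Initial value for an up-counting cycle counter so that it overflows
    after period_ms when clocked at bus_freq_hz."""
    ticks = bus_freq_hz * period_ms // 1000
    limit = 1 << bits
    if not 0 < ticks < limit:
        # Conservative check: the period does not fit into the counter.
        raise ValueError("period not representable at this frequency")
    return limit - ticks


# The preload has to be recomputed whenever devfreq changes the bus clock.
for freq_hz in (100_000_000, 400_000_000):
    print(freq_hz, ccnt_preload(freq_hz, 100))
```

At a lower bus frequency fewer cycles elapse per period, so the preload value is closer to the overflow point. That is why the initial value depends on the current bus frequency.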

> To address that, we still need to have another mechanism (like watchdog)
> which will be triggered just to check if the threshold needs adjustment.
> This mechanism can be a local timer in the driver or a framework
> timer running kind of 'for loop' on all this type of devices (like
> the scheduled workqueue). In both cases in the system there will be
> interrupts, timers (even at workqueues) and scheduling.
> The approach to force developers to implement their local watchdog
> timers (or workqueues) in drivers is IMHO wrong and that's why we have
> frameworks.

Yes, it should be also possible in the framework to use the counter alarm
events where the hardware is advanced enough, in order to avoid excessive 
SW polling.

--
Regards,
Sylwester


Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-26 Thread Bartlomiej Zolnierkiewicz


On 6/25/20 2:12 PM, Kamil Konieczny wrote:
> On 25.06.2020 14:02, Lukasz Luba wrote:
>>
>>
>> On 6/25/20 12:30 PM, Kamil Konieczny wrote:
>>> Hi Lukasz,
>>>
>>> On 25.06.2020 12:02, Lukasz Luba wrote:
>>>> Hi Sylwester,
>>>>
>>>> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>>>>> Hi All,
>>>>>
>>>>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>>>>> I had issues with devfreq governor which wasn't called by devfreq
>>>>>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>>>>>> for it [1]. If the CPU which scheduled the next work went idle, the
>>>>>> devfreq workqueue will not be kicked and devfreq governor won't check
>>>>>> DMC status and will not decide to decrease the frequency based on low
>>>>>> busy_time.
>>>>>> The same applies for going up with the frequency. They both are
>>>>>> done by the governor but the workqueue must be scheduled periodically.
>>>>>
>>>>> As I have been working on resolving the video mixer IOMMU fault issue
>>>>> described here: https://patchwork.kernel.org/patch/10861757
>>>>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>>>>
>>>>> My conclusions are similar to what Lukasz says above. I would like to add
>>>>> that broken scheduling of the performance counters read and the devfreq
>>>>> updates seems to have one more serious implication. In each call, which
>>>>> normally should happen periodically with fixed interval we stop the counters,
>>>>> read counter values and start the counters again. But if period between
>>>>> calls becomes long enough to let any of the counters overflow, we will
>>>>> get wrong performance measurement results. My observations are that
>>>>> the workqueue job can be suspended for several seconds and conditions for
>>>>> the counter overflow occur sooner or later, depending among others
>>>>> on the CPUs load.
>>>>> Wrong bus load measurement can lead to setting too low interconnect bus
>>>>> clock frequency and then bad things happen in peripheral devices.
>>>>>
>>>>> I agree the workqueue issue needs to be fixed. I have some WIP code to use
>>>>> the performance counters overflow interrupts instead of SW polling and with
>>>>> that the interconnect bus clock control seems to work much better.
>>>>>
>>>>
>>>> Thank you for sharing your use case and investigation results. I think
>>>> we are reaching a decent number of developers to maybe address this
>>>> issue: 'workqueue issue needs to be fixed'.
>>>> I have been facing this devfreq workqueue issue ~5 times in different
>>>> platforms.
>>>>
>>>> Regarding the 'performance counters overflow interrupts' there is one
>>>> thing worth to keep in mind: variable utilization and frequency.
>>>> For example, in order to make a conclusion in algorithm deciding that
>>>> the device should increase or decrease the frequency, we fix the period
>>>> of observation, i.e. to 500ms. That can cause the long delay if the
>>>> utilization of the device suddenly drops. For example we set an
>>>> overflow threshold to value i.e. 1000 and we know that at 1000MHz
>>>> and full utilization (100%) the counter will reach that threshold
>>>> after 500ms (which we want, because we don't want too many interrupts
>>>> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
>>>> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
>>>> threshold after 50*500ms = 25s. It is impossible just for the counters
>>>> to predict next utilization and adjust the threshold. [...]
>>>
>>> irq triggers for underflow and overflow, so driver can adjust freq
>>>
>>
>> Probably possible on some platforms, depends on how many PMU registers
>> are available, what information can be assigned to them and type of
>> interrupt. A lot of hassle and still - platform and device specific.
>> Also, drivers should not adjust the freq, governors (different types
>> of them with different settings that they can handle) should do it.
>>
>> What the framework can do is to take this responsibility and provide
>> generic way to monitor the devices (or stop if they are suspended).
>> That should work nicely with the governors, which try to predict the
>> next best frequency. From my experience the more fluctuating intervals
>> the governors are called, the more odd decisions they make.
>> That's why I think having a predictable interval i.e. 100ms is something
>> desirable. Tuning the governors is easier in this case, statistics
>> are easier to trace and interpret, solution is not too platform specific,
>> etc.
>>
>> Kamil do you have plans to refresh and push your next version of the
>> workqueue solution?
> 
> I do not, as Bartek takes over my work,
> +CC Bartek

Hi Lukasz,

As you remember in January Chanwoo has proposed another idea (to allow
selecting workqueue type by devfreq device driver):

"I'm developing the RFC patch and then I'll send it as soon as possible."
(https://lore.kernel.org/linux-pm/6107fa2b-81ad-060d-89a2-d8941ac4d...@samsung.com/)

"After 

Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-25 Thread Kamil Konieczny
On 25.06.2020 14:02, Lukasz Luba wrote:
> 
> 
> On 6/25/20 12:30 PM, Kamil Konieczny wrote:
>> Hi Lukasz,
>>
>> On 25.06.2020 12:02, Lukasz Luba wrote:
>>> Hi Sylwester,
>>>
>>> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>>>> Hi All,
>>>>
>>>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>>>> I had issues with devfreq governor which wasn't called by devfreq
>>>>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>>>>> for it [1]. If the CPU which scheduled the next work went idle, the
>>>>> devfreq workqueue will not be kicked and devfreq governor won't check
>>>>> DMC status and will not decide to decrease the frequency based on low
>>>>> busy_time.
>>>>> The same applies for going up with the frequency. They both are
>>>>> done by the governor but the workqueue must be scheduled periodically.
>>>>
>>>> As I have been working on resolving the video mixer IOMMU fault issue
>>>> described here: https://patchwork.kernel.org/patch/10861757
>>>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>>>
>>>> My conclusions are similar to what Lukasz says above. I would like to add
>>>> that broken scheduling of the performance counters read and the devfreq
>>>> updates seems to have one more serious implication. In each call, which
>>>> normally should happen periodically with fixed interval we stop the counters,
>>>> read counter values and start the counters again. But if period between
>>>> calls becomes long enough to let any of the counters overflow, we will
>>>> get wrong performance measurement results. My observations are that
>>>> the workqueue job can be suspended for several seconds and conditions for
>>>> the counter overflow occur sooner or later, depending among others
>>>> on the CPUs load.
>>>> Wrong bus load measurement can lead to setting too low interconnect bus
>>>> clock frequency and then bad things happen in peripheral devices.
>>>>
>>>> I agree the workqueue issue needs to be fixed. I have some WIP code to use
>>>> the performance counters overflow interrupts instead of SW polling and with
>>>> that the interconnect bus clock control seems to work much better.
>>>>
>>>
>>> Thank you for sharing your use case and investigation results. I think
>>> we are reaching a decent number of developers to maybe address this
>>> issue: 'workqueue issue needs to be fixed'.
>>> I have been facing this devfreq workqueue issue ~5 times in different
>>> platforms.
>>>
>>> Regarding the 'performance counters overflow interrupts' there is one
>>> thing worth to keep in mind: variable utilization and frequency.
>>> For example, in order to make a conclusion in algorithm deciding that
>>> the device should increase or decrease the frequency, we fix the period
>>> of observation, i.e. to 500ms. That can cause the long delay if the
>>> utilization of the device suddenly drops. For example we set an
>>> overflow threshold to value i.e. 1000 and we know that at 1000MHz
>>> and full utilization (100%) the counter will reach that threshold
>>> after 500ms (which we want, because we don't want too many interrupts
>>> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
>>> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
>>> threshold after 50*500ms = 25s. It is impossible just for the counters
>>> to predict next utilization and adjust the threshold. [...]
>>
>> irq triggers for underflow and overflow, so driver can adjust freq
>>
> 
> Probably possible on some platforms, depends on how many PMU registers
> are available, what information can be assigned to them and the type of
> interrupt. A lot of hassle and still - platform and device specific.
> Also, drivers should not adjust the freq, governors (different types
> of them with different settings that they can handle) should do it.
> 
> What the framework can do is to take this responsibility and provide
> generic way to monitor the devices (or stop if they are suspended).
> That should work nicely with the governors, which try to predict the
> next best frequency. From my experience the more fluctuating intervals
> the governors are called, the more odd decisions they make.
> That's why I think having a predictable interval i.e. 100ms is something
> desirable. Tuning the governors is easier in this case, statistics
> are easier to trace and interpret, solution is not too platform specific,
> etc.
> 
> Kamil do you have plans to refresh and push your next version of the
> workqueue solution?

I do not, as Bartek takes over my work,
+CC Bartek

-- 
Best regards,
Kamil Konieczny
Samsung R&D Institute Poland



Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-25 Thread Lukasz Luba




On 6/25/20 12:30 PM, Kamil Konieczny wrote:

> Hi Lukasz,
>
> On 25.06.2020 12:02, Lukasz Luba wrote:
>> Hi Sylwester,
>>
>> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>>> Hi All,
>>>
>>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>>> I had issues with devfreq governor which wasn't called by devfreq
>>>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>>>> for it [1]. If the CPU which scheduled the next work went idle, the
>>>> devfreq workqueue will not be kicked and devfreq governor won't check
>>>> DMC status and will not decide to decrease the frequency based on low
>>>> busy_time.
>>>> The same applies for going up with the frequency. They both are
>>>> done by the governor but the workqueue must be scheduled periodically.
>>>
>>> As I have been working on resolving the video mixer IOMMU fault issue
>>> described here: https://patchwork.kernel.org/patch/10861757
>>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>>
>>> My conclusions are similar to what Lukasz says above. I would like to add
>>> that broken scheduling of the performance counters read and the devfreq
>>> updates seems to have one more serious implication. In each call, which
>>> normally should happen periodically with fixed interval we stop the counters,
>>> read counter values and start the counters again. But if period between
>>> calls becomes long enough to let any of the counters overflow, we will
>>> get wrong performance measurement results. My observations are that
>>> the workqueue job can be suspended for several seconds and conditions for
>>> the counter overflow occur sooner or later, depending among others
>>> on the CPUs load.
>>> Wrong bus load measurement can lead to setting too low interconnect bus
>>> clock frequency and then bad things happen in peripheral devices.
>>>
>>> I agree the workqueue issue needs to be fixed. I have some WIP code to use
>>> the performance counters overflow interrupts instead of SW polling and with
>>> that the interconnect bus clock control seems to work much better.
>>
>> Thank you for sharing your use case and investigation results. I think
>> we are reaching a decent number of developers to maybe address this
>> issue: 'workqueue issue needs to be fixed'.
>> I have been facing this devfreq workqueue issue ~5 times in different
>> platforms.
>>
>> Regarding the 'performance counters overflow interrupts' there is one
>> thing worth to keep in mind: variable utilization and frequency.
>> For example, in order to make a conclusion in algorithm deciding that
>> the device should increase or decrease the frequency, we fix the period
>> of observation, i.e. to 500ms. That can cause the long delay if the
>> utilization of the device suddenly drops. For example we set an
>> overflow threshold to value i.e. 1000 and we know that at 1000MHz
>> and full utilization (100%) the counter will reach that threshold
>> after 500ms (which we want, because we don't want too many interrupts
>> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
>> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
>> threshold after 50*500ms = 25s. It is impossible just for the counters
>> to predict next utilization and adjust the threshold. [...]
>
> irq triggers for underflow and overflow, so driver can adjust freq



Probably possible on some platforms, depends on how many PMU registers
are available, what information can be assigned to them and the type of
interrupt. A lot of hassle and still - platform and device specific.
Also, drivers should not adjust the freq, governors (different types
of them with different settings that they can handle) should do it.

What the framework can do is to take this responsibility and provide
generic way to monitor the devices (or stop if they are suspended).
That should work nicely with the governors, which try to predict the
next best frequency. From my experience the more fluctuating intervals
the governors are called, the more odd decisions they make.
That's why I think having a predictable interval i.e. 100ms is something
desirable. Tuning the governors is easier in this case, statistics
are easier to trace and interpret, solution is not too platform specific,
etc.

Kamil do you have plans to refresh and push your next version of the
workqueue solution?

Regards,
Lukasz



Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-25 Thread Kamil Konieczny
Hi Lukasz,

On 25.06.2020 12:02, Lukasz Luba wrote:
> Hi Sylwester,
> 
> On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:
>> Hi All,
>>
>> On 24.06.2020 12:32, Lukasz Luba wrote:
>>> I had issues with devfreq governor which wasn't called by devfreq
>>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>>> for it [1]. If the CPU which scheduled the next work went idle, the
>>> devfreq workqueue will not be kicked and devfreq governor won't check
>>> DMC status and will not decide to decrease the frequency based on low
>>> busy_time.
>>> The same applies for going up with the frequency. They both are
>>> done by the governor but the workqueue must be scheduled periodically.
>>
>> As I have been working on resolving the video mixer IOMMU fault issue
>> described here: https://patchwork.kernel.org/patch/10861757
>> I did some investigation of the devfreq operation, mostly on Odroid U3.
>>
>> My conclusions are similar to what Lukasz says above. I would like to add
>> that broken scheduling of the performance counters read and the devfreq
>> updates seems to have one more serious implication. In each call, which
>> normally should happen periodically with fixed interval we stop the counters,
>> read counter values and start the counters again. But if period between
>> calls becomes long enough to let any of the counters overflow, we will
>> get wrong performance measurement results. My observations are that
>> the workqueue job can be suspended for several seconds and conditions for
>> the counter overflow occur sooner or later, depending among others
>> on the CPUs load.
>> Wrong bus load measurement can lead to setting too low interconnect bus
>> clock frequency and then bad things happen in peripheral devices.
>>
>> I agree the workqueue issue needs to be fixed. I have some WIP code to use
>> the performance counters overflow interrupts instead of SW polling and with
>> that the interconnect bus clock control seems to work much better.
>>
> 
> Thank you for sharing your use case and investigation results. I think
> we are reaching a decent number of developers to maybe address this
> issue: 'workqueue issue needs to be fixed'.
> I have been facing this devfreq workqueue issue ~5 times in different
> platforms.
> 
> Regarding the 'performance counters overflow interrupts' there is one
> thing worth to keep in mind: variable utilization and frequency.
> For example, in order to make a conclusion in algorithm deciding that
> the device should increase or decrease the frequency, we fix the period
> of observation, i.e. to 500ms. That can cause the long delay if the
> utilization of the device suddenly drops. For example we set an
> overflow threshold to value i.e. 1000 and we know that at 1000MHz
> and full utilization (100%) the counter will reach that threshold
> after 500ms (which we want, because we don't want too many interrupts
> per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
> to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
> threshold after 50*500ms = 25s. It is impossible just for the counters
> to predict next utilization and adjust the threshold. [...]

irq triggers for underflow and overflow, so driver can adjust freq

-- 
Best regards,
Kamil Konieczny
Samsung R&D Institute Poland



Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-25 Thread Lukasz Luba

Hi Sylwester,

On 6/24/20 4:11 PM, Sylwester Nawrocki wrote:

> Hi All,
>
> On 24.06.2020 12:32, Lukasz Luba wrote:
>> I had issues with devfreq governor which wasn't called by devfreq
>> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
>> for it [1]. If the CPU which scheduled the next work went idle, the
>> devfreq workqueue will not be kicked and devfreq governor won't check
>> DMC status and will not decide to decrease the frequency based on low
>> busy_time.
>> The same applies for going up with the frequency. They both are
>> done by the governor but the workqueue must be scheduled periodically.
>
> As I have been working on resolving the video mixer IOMMU fault issue
> described here: https://patchwork.kernel.org/patch/10861757
> I did some investigation of the devfreq operation, mostly on Odroid U3.
>
> My conclusions are similar to what Lukasz says above. I would like to add
> that broken scheduling of the performance counters read and the devfreq
> updates seems to have one more serious implication. In each call, which
> normally should happen periodically with fixed interval we stop the counters,
> read counter values and start the counters again. But if period between
> calls becomes long enough to let any of the counters overflow, we will
> get wrong performance measurement results. My observations are that
> the workqueue job can be suspended for several seconds and conditions for
> the counter overflow occur sooner or later, depending among others
> on the CPUs load.
> Wrong bus load measurement can lead to setting too low interconnect bus
> clock frequency and then bad things happen in peripheral devices.
>
> I agree the workqueue issue needs to be fixed. I have some WIP code to use
> the performance counters overflow interrupts instead of SW polling and with
> that the interconnect bus clock control seems to work much better.



Thank you for sharing your use case and investigation results. I think
we are reaching a decent number of developers to maybe address this
issue: 'workqueue issue needs to be fixed'.
I have been facing this devfreq workqueue issue ~5 times in different
platforms.

Regarding the 'performance counters overflow interrupts' there is one
thing worth to keep in mind: variable utilization and frequency.
For example, in order to make a conclusion in algorithm deciding that
the device should increase or decrease the frequency, we fix the period
of observation, i.e. to 500ms. That can cause the long delay if the
utilization of the device suddenly drops. For example we set an
overflow threshold to value i.e. 1000 and we know that at 1000MHz
and full utilization (100%) the counter will reach that threshold
after 500ms (which we want, because we don't want too many interrupts
per sec). What if suddenly utilization drops to 2% (i.e. from 5GB/s
to 250MB/s (what if it drops to 25MB/s?!)), the counter will reach the
threshold after 50*500ms = 25s. It is impossible just for the counters
to predict next utilization and adjust the threshold.
To address that, we still need to have another mechanism (like watchdog)
which will be triggered just to check if the threshold needs adjustment.
This mechanism can be a local timer in the driver or a framework
timer running kind of 'for loop' on all this type of devices (like
the scheduled workqueue). In both cases in the system there will be
interrupts, timers (even at workqueues) and scheduling.
The approach to force developers to implement their local watchdog
timers (or workqueues) in drivers is IMHO wrong and that's why we have
frameworks.

Regards,
Lukasz
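
The delay described in the message above can be checked with a few lines of Python. This is only a sketch: it assumes the counter advances in direct proportion to utilization at a fixed clock frequency, and the function name is hypothetical rather than anything from devfreq.

```python
def time_to_threshold_s(period_at_full_load_s, utilization):
    """Seconds until the overflow threshold is reached.

    period_at_full_load_s: time to reach the threshold at 100% utilization.
    utilization: fraction of full load, 0 < utilization <= 1.
    Assumes the counter rate scales linearly with utilization.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return period_at_full_load_s / utilization


# Threshold tuned for 500 ms at full load, as in the message above.
for util in (1.0, 0.02, 0.002):
    print(f"utilization {util:.1%}: {time_to_threshold_s(0.5, util):.1f} s")
```

With the threshold tuned for 500 ms at full load, a drop to 2% stretches the interval to about 25 s, which is the case made above for keeping a periodic check alongside any overflow interrupt.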



Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-24 Thread Sylwester Nawrocki
Hi All,

On 24.06.2020 12:32, Lukasz Luba wrote:
> I had issues with devfreq governor which wasn't called by devfreq
> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
> for it [1]. If the CPU which scheduled the next work went idle, the
> devfreq workqueue will not be kicked and devfreq governor won't check
> DMC status and will not decide to decrease the frequency based on low
> busy_time.
> The same applies for going up with the frequency. They both are
> done by the governor but the workqueue must be scheduled periodically.

As I have been working on resolving the video mixer IOMMU fault issue
described here: https://patchwork.kernel.org/patch/10861757
I did some investigation of the devfreq operation, mostly on Odroid U3.

My conclusions are similar to what Lukasz says above. I would like to add
that broken scheduling of the performance counters read and the devfreq 
updates seems to have one more serious implication. In each call, which
normally should happen periodically with fixed interval we stop the counters, 
read counter values and start the counters again. But if period between 
calls becomes long enough to let any of the counters overflow, we will
get wrong performance measurement results. My observations are that 
the workqueue job can be suspended for several seconds and conditions for 
the counter overflow occur sooner or later, depending among others 
on the CPUs load.
Wrong bus load measurement can lead to setting too low interconnect bus 
clock frequency and then bad things happen in peripheral devices.

I agree the workqueue issue needs to be fixed. I have some WIP code to use
the performance counters overflow interrupts instead of SW polling and with 
that the interconnect bus clock control seems to work much better.

-- 
Regards,
Sylwester
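
The overflow effect described in the message above can be illustrated with a toy model in Python. The 32-bit counter width, the wrap-around (rather than saturating) behaviour, the 800 MHz clock and the 90% load are all assumptions made for the example; the thread does not state the real hardware parameters, and the function names are hypothetical.

```python
COUNTER_BITS = 32  # assumed counter width; the thread does not state the real one


def raw_count(true_events, bits=COUNTER_BITS):
    """Value a free-running counter of the given width would report (wraps)."""
    return true_events % (1 << bits)


def measured_load(clock_hz, interval_ms, true_load_pct, bits=COUNTER_BITS):
    """Load (0..1) computed from possibly wrapped cycle and busy counters."""
    total = clock_hz * interval_ms // 1000   # cycles elapsed in the interval
    busy = total * true_load_pct // 100      # cycles the bus was busy
    total_raw = raw_count(total, bits)
    busy_raw = raw_count(busy, bits)
    if total_raw == 0:
        return None                          # nothing usable to divide by
    return busy_raw / total_raw


# Same true load (90%), two different gaps between counter reads.
print(measured_load(800_000_000, 100, 90))   # short gap: counters did not wrap
print(measured_load(800_000_000, 6000, 90))  # multi-second gap: both wrapped
```

In this model the short gap reports the true 90% load, while the multi-second gap reports only a few percent, which would push a load-based governor toward a lower frequency.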


Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-24 Thread Lukasz Luba




On 6/24/20 2:13 PM, Krzysztof Kozlowski wrote:

> On Wed, Jun 24, 2020 at 02:03:03PM +0100, Lukasz Luba wrote:
>>
>> On 6/24/20 1:06 PM, Krzysztof Kozlowski wrote:
>>> My case was clearly showing wrong behavior. System was idle but not
>>> sleeping - network working, SSH connection ongoing.  Therefore at least
>>> one CPU was not idle and could adjust the devfreq/DMC... but this did not
>>> happen. The system stayed for like a minute in 633 MHz OPP.
>>>
>>> Not-waking up idle processors - ok... so why not using power efficient
>>> workqueue? It is exactly for this purpose - wake up from time to time on
>>> whatever CPU to do the necessary job.
>>
>> IIRC I've done this experiment, still keeping in devfreq:
>> INIT_DEFERRABLE_WORK()
>> just applying patch [1]. It uses a system_wq which should
>> be the same as system_power_efficient_wq when
>> CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set (our case).
>> This wasn't solving the issue for the deferred work. That's
>> why the patch 2/2 following patch 1/2 [1] was needed.
>>
>> The deferred work uses TIMER_DEFERRABLE in its initialization
>> and this is the problem. When the deferred work was queued on a CPU,
>> next that CPU went idle, the work was not migrated to some other CPU.
>> The former cpu is also not woken up according to the documentation [2].
>
> Yes, you need either workqueue.power_efficient kernel param or CONFIG
> option to actually enable it.  But at least it could then work on any
> CPU.
>
> Another solution is to use directly WQ_UNBOUND.
>
>> That's why Kamil's approach should be continued IMHO. It gives more
>> control over important devices like: bus, dmc, gpu, which utilization
>> does not strictly correspond to cpu utilization (which might be low or
>> even 0 and cpu put into idle).
>>
>> I think Kamil was pointing out also some other issues not only dmc
>> (buses probably), but I realized too late to help him.
>
> This should not be a configurable option. Why someone would prefer to
> use one over another and decide about this during build or run time?
> Instead it should be just *right* all the time. Always.

I had the same opinion, as you can see in my explanation to those
patches, but I failed. That's why I agree with Kamil's approach
because it had a higher chance to get into mainline and fix at least some
of the use cases.

> Argument that we want to save power so we will not wake up any CPU is
> ridiculous if because of this system stays in high-power mode.
>
> If system is idle and memory going to be idle, someone should be woken
> up to save more power and slow down memory controller.
>
> If system is idle but memory going to be busy, the currently busy CPU
> (which performs some memory-intensive job) could do the job and ramp up
> the devfreq performance.

I agree. I think this devfreq mechanism was designed in the times
where there was/were 1 or 2 CPUs in the system. After a while we got ~8
and not all of them are used. This scenario was probably not
experimented widely on mainline platforms.

That is a good material for improvements, for someone who has time and
power.

Regards,
Lukasz

> Best regards,
> Krzysztof



Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-24 Thread Krzysztof Kozlowski
On Wed, Jun 24, 2020 at 02:03:03PM +0100, Lukasz Luba wrote:
> 
> 
> On 6/24/20 1:06 PM, Krzysztof Kozlowski wrote:
> > My case was clearly showing wrong behavior. System was idle but not
> > sleeping - network working, SSH connection ongoing.  Therefore at least
> > one CPU was not idle and could adjust the devfreq/DMC... but this did not
> > happen. The system stayed for like a minute in 633 MHz OPP.
> > 
> > Not-waking up idle processors - ok... so why not using power efficient
> > workqueue? It is exactly for this purpose - wake up from time to time on
> > whatever CPU to do the necessary job.
> 
> IIRC I've done this experiment, still keeping in devfreq:
> INIT_DEFERRABLE_WORK()
> just applying patch [1]. It uses a system_wq which should
> be the same as system_power_efficient_wq when
> CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set (our case).
> This wasn't solving the issue for the deferred work. That's
> why the patch 2/2 following patch 1/2 [1] was needed.
> 
> The deferred work uses TIMER_DEFERRABLE in its initialization
> and this is the problem. When the deferred work was queued on a CPU,
> next that CPU went idle, the work was not migrated to some other CPU.
> The former cpu is also not woken up according to the documentation [2].

Yes, you need either workqueue.power_efficient kernel param or CONFIG
option to actually enable it.  But at least it could then work on any
CPU.

Another solution is to use directly WQ_UNBOUND.

> That's why Kamil's approach should be continued IMHO. It gives more
> control over important devices like: bus, dmc, gpu, which utilization
> does not strictly correspond to cpu utilization (which might be low or
> even 0 and cpu put into idle).
> 
> I think Kamil was pointing out also some other issues not only dmc
> (buses probably), but I realized too late to help him.

This should not be a configurable option. Why someone would prefer to
use one over another and decide about this during build or run time?
Instead it should be just *right* all the time. Always.

Argument that we want to save power so we will not wake up any CPU is
ridiculous if because of this system stays in high-power mode.

If system is idle and memory going to be idle, someone should be woken
up to save more power and slow down memory controller.

If system is idle but memory going to be busy, the currently busy CPU
(which performs some memory-intensive job) could do the job and ramp up
the devfreq performance.

Best regards,
Krzysztof
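
The behaviour discussed in this exchange can be pictured with a toy timeline model in Python. It is not kernel code: it assumes a single CPU owns the polling work, that a deferrable timer which expires while that CPU is idle is serviced only when the CPU next wakes for another reason, and that a non-deferrable timer fires on time. The function name, the 100 ms period and the idle window are hypothetical.

```python
def poll_times(period_ms, end_ms, idle_windows, deferrable):
    """Return the times (ms) at which the periodic poll actually runs.

    idle_windows: list of (start_ms, stop_ms) during which the owning CPU
    is idle; stop_ms is when it wakes up for some unrelated reason.
    """
    times = []
    due = period_ms
    while due <= end_ms:
        fire = due
        if deferrable:
            for start, stop in idle_windows:
                if start <= due < stop:
                    fire = stop  # serviced only once the CPU is awake again
                    break
        if fire > end_ms:
            break
        times.append(fire)
        due = fire + period_ms   # the work re-arms itself after running
    return times


idle = [(250, 60_000)]  # owning CPU idle for about a minute
for deferrable in (False, True):
    t = poll_times(100, 60_500, idle, deferrable)
    gaps = [b - a for a, b in zip(t, t[1:])]
    print(f"deferrable={deferrable}: {len(t)} polls, longest gap {max(gaps)} ms")
```

In this model the deferrable variant skips almost a minute of polling, during which the governor would never get a chance to lower (or raise) the frequency.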



Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-24 Thread Lukasz Luba




On 6/24/20 1:06 PM, Krzysztof Kozlowski wrote:

On Wed, Jun 24, 2020 at 01:18:42PM +0200, Kamil Konieczny wrote:

Hi,

On 24.06.2020 12:32, Lukasz Luba wrote:

Hi Krzysztof and Willy

On 6/23/20 8:11 PM, Krzysztof Kozlowski wrote:

On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:

On Tue, 23 Jun 2020 at 18:47, Willy Wolff  wrote:


Hi everybody,

Is DVFS for memory bus really working on Odroid XU3/4 board?
Using a simple microbenchmark that is doing only memory accesses, memory DVFS
seems not to be working properly:

The microbenchmark is doing pointer chasing by following index in an array.
Indices in the array are set to follow a random pattern (cutting prefetcher),
and forcing RAM access.

git clone https://github.com/wwilly/benchmark.git \
    && cd benchmark \
    && source env.sh \
    && ./bench_build.sh \
    && bash source/scripts/test_dvfs_mem.sh

Python 3, cmake and sudo rights are required.

Results:
DVFS CPU with performance governor
mem_gov = simple_ondemand at 16500 Hz in idle, should be bumped when the
benchmark is running.
- on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
- on the big cluster it takes 4.76556 s to run (980.343 c per memory access).

While forcing DVFS memory bus to use performance governor,
mem_gov = performance at 82500 Hz in idle,
- on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
- on the big cluster it takes 1.18448 s to run (243.664 c per memory access).

The kernel used is the last 5.7.5 stable with default exynos_defconfig.
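
For reference, the two sets of timings quoted above differ by roughly a factor of four on both clusters. The short Python sketch below only restates the reported numbers; the dictionary layout and function name are hypothetical.

```python
# (run time in seconds, cycles per memory access) as reported above
runs = {
    "LITTLE": {"simple_ondemand": (4.74308, 683.004),
               "performance": (1.1451, 164.894)},
    "big": {"simple_ondemand": (4.76556, 980.343),
            "performance": (1.18448, 243.664)},
}


def slowdown(cluster):
    """Run-time ratio of simple_ondemand to performance for one cluster."""
    ondemand_s, _ = runs[cluster]["simple_ondemand"]
    performance_s, _ = runs[cluster]["performance"]
    return ondemand_s / performance_s


for cluster in runs:
    print(f"{cluster}: simple_ondemand is {slowdown(cluster):.2f}x slower")
```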


Thanks for the report. Few thoughts:
1. What trans_stat are saying? Except DMC driver you can also check
all other devfreq devices (e.g. wcore) - maybe the devfreq events
(nocp) are not properly assigned?
2. Try running the measurement for ~1 minutes or longer. The counters
might have some delay (which would require probably fixing but the
point is to narrow the problem).
3. What do you understand by "mem_gov"? Which device is it?


+Cc Lukasz who was working on this.


Thanks Krzysztof for adding me here.



I just run memtester and more-or-less ondemand works (at least ramps
up):

Before:
/sys/class/devfreq/10c2.memory-controller$ cat trans_stat
   From  :   To
         : 16500 20600 27500 41300 54300 63300 72800 82500   time(ms)
*   16500:     0     0     0     0     0     0     0     0    1795950
    20600:     1     0     0     0     0     0     0     0       4770
    27500:     0     1     0     0     0     0     0     0      15540
    41300:     0     0     1     0     0     0     0     0      20780
    54300:     0     0     0     1     0     0     0     1      10760
    63300:     0     0     0     0     2     0     0     0      10310
    72800:     0     0     0     0     0     0     0     0          0
    82500:     0     0     0     0     0     2     0     0      25920
Total transition : 9


$ sudo memtester 1G

During memtester:
/sys/class/devfreq/10c2.memory-controller$ cat trans_stat
   From  :   To
         : 16500 20600 27500 41300 54300 63300 72800 82500   time(ms)
    16500:     0     0     0     0     0     0     0     1    1801490
    20600:     1     0     0     0     0     0     0     0       4770
    27500:     0     1     0     0     0     0     0     0      15540
    41300:     0     0     1     0     0     0     0     0      20780
    54300:     0     0     0     1     0     0     0     2      11090
    63300:     0     0     0     0     3     0     0     0      17210
    72800:     0     0     0     0     0     0     0     0          0
*   82500:     0     0     0     0     0     3     0     0     169020
Total transition : 13

However after killing memtester it stays at 633 MHz for very long time
and does not slow down. This is indeed weird...
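
One way to read the two trans_stat snapshots above is to subtract the time(ms) columns and see where the time between the two reads was spent. The sketch below hard-codes the values from the tables, with the frequency labels exactly as printed in this thread; the variable and function names are hypothetical.

```python
# time(ms) column of the two trans_stat snapshots quoted above
before_ms = {16500: 1795950, 20600: 4770, 27500: 15540, 41300: 20780,
             54300: 10760, 63300: 10310, 72800: 0, 82500: 25920}
during_ms = {16500: 1801490, 20600: 4770, 27500: 15540, 41300: 20780,
             54300: 11090, 63300: 17210, 72800: 0, 82500: 169020}


def residency_delta(first, second):
    """Milliseconds spent at each frequency between two snapshots."""
    return {freq: second[freq] - first[freq] for freq in first}


delta = residency_delta(before_ms, during_ms)
total = sum(delta.values())
for freq, ms in delta.items():
    print(f"{freq:>6}: {ms:>7} ms ({100.0 * ms / total:5.1f}%)")
```

Between the two reads most of the time is accounted to the highest entry, which matches the observation that the governor does ramp up under load.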


I had issues with devfreq governor which wasn't called by devfreq
workqueue. The old DELAYED vs DEFERRED work discussions and my patches
for it [1]. If the CPU which scheduled the next work went idle, the
devfreq workqueue will not be kicked and devfreq governor won't check
DMC status and will not decide to decrease the frequency based on low
busy_time.
The same applies for going up with the frequency. They both are
done by the 

Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-24 Thread Krzysztof Kozlowski
On Wed, Jun 24, 2020 at 01:18:42PM +0200, Kamil Konieczny wrote:
> Hi,
> 
> On 24.06.2020 12:32, Lukasz Luba wrote:
> > Hi Krzysztof and Willy
> > 
> > On 6/23/20 8:11 PM, Krzysztof Kozlowski wrote:
> >> On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:
> >>> On Tue, 23 Jun 2020 at 18:47, Willy Wolff wrote:
> >>>>
> >>>> Hi everybody,
> >>>>
> >>>> Is DVFS for memory bus really working on Odroid XU3/4 board?
> >>>> Using a simple microbenchmark that is doing only memory accesses, memory
> >>>> DVFS seems not to be working properly:
> >>>>
> >>>> The microbenchmark is doing pointer chasing by following index in an
> >>>> array. Indices in the array are set to follow a random pattern (cutting
> >>>> prefetcher), and forcing RAM access.
> >>>>
> >>>> git clone https://github.com/wwilly/benchmark.git \
> >>>>     && cd benchmark \
> >>>>     && source env.sh \
> >>>>     && ./bench_build.sh \
> >>>>     && bash source/scripts/test_dvfs_mem.sh
> >>>>
> >>>> Python 3, cmake and sudo rights are required.
> >>>>
> >>>> Results:
> >>>> DVFS CPU with performance governor
> >>>> mem_gov = simple_ondemand at 16500 Hz in idle, should be bumped when the
> >>>> benchmark is running.
> >>>> - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory
> >>>>   access),
> >>>> - on the big cluster it takes 4.76556 s to run (980.343 c per memory
> >>>>   access).
> >>>>
> >>>> While forcing DVFS memory bus to use performance governor,
> >>>> mem_gov = performance at 82500 Hz in idle,
> >>>> - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory
> >>>>   access),
> >>>> - on the big cluster it takes 1.18448 s to run (243.664 c per memory
> >>>>   access).
> >>>>
> >>>> The kernel used is the last 5.7.5 stable with default exynos_defconfig.
> >>>
> >>> Thanks for the report. Few thoughts:
> >>> 1. What trans_stat are saying? Except DMC driver you can also check
> >>> all other devfreq devices (e.g. wcore) - maybe the devfreq events
> >>> (nocp) are not properly assigned?
> >>> 2. Try running the measurement for ~1 minutes or longer. The counters
> >>> might have some delay (which would require probably fixing but the
> >>> point is to narrow the problem).
> >>> 3. What do you understand by "mem_gov"? Which device is it?
> >>
> >> +Cc Lukasz who was working on this.
> > 
> > Thanks Krzysztof for adding me here.
> > 
> >>
> >> I just run memtester and more-or-less ondemand works (at least ramps
> >> up):
> >>
> >> Before:
> >> /sys/class/devfreq/10c2.memory-controller$ cat trans_stat
> >>    From  :   To
> >>          : 16500 20600 27500 41300 54300 63300 72800 82500   time(ms)
> >> *   16500:     0     0     0     0     0     0     0     0    1795950
> >>     20600:     1     0     0     0     0     0     0     0       4770
> >>     27500:     0     1     0     0     0     0     0     0      15540
> >>     41300:     0     0     1     0     0     0     0     0      20780
> >>     54300:     0     0     0     1     0     0     0     1      10760
> >>     63300:     0     0     0     0     2     0     0     0      10310
> >>     72800:     0     0     0     0     0     0     0     0          0
> >>     82500:     0     0     0     0     0     2     0     0      25920
> >> Total transition : 9
> >>
> >>
> >> $ sudo memtester 1G
> >>
> >> During memtester:
> >> /sys/class/devfreq/10c2.memory-controller$ cat trans_stat
> >>    From  :   To
> >>          : 16500 20600 27500 41300 54300 63300 72800 82500   time(ms)
> >>     16500:     0     0     0     0     0     0     0     1    1801490
> >>     20600:     1     0     0     0     0     0     0     0       4770
> >>     27500:     0     1     0     0     0     0     0     0      15540
> >>     41300:     0     0     1     0     0     0     0     0      20780
> >>     54300:     0     0     0     1     0     0     0     2      11090
> >>     63300:     0     0     0     0     3     0     0     0      17210
> >>     72800:     0     0     0     0     0     0     0     0          0
> >> *   82500:     0     0     0     0     0     3     0     0     169020
> >> Total 

Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-24 Thread Kamil Konieczny
Hi,

On 24.06.2020 12:32, Lukasz Luba wrote:
> Hi Krzysztof and Willy
> 
> On 6/23/20 8:11 PM, Krzysztof Kozlowski wrote:
>> On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:
>>> On Tue, 23 Jun 2020 at 18:47, Willy Wolff wrote:
>>>>
>>>> Hi everybody,
>>>>
>>>> Is DVFS for memory bus really working on Odroid XU3/4 board?
>>>> Using a simple microbenchmark that is doing only memory accesses, memory
>>>> DVFS seems not to be working properly:
>>>>
>>>> The microbenchmark is doing pointer chasing by following index in an array.
>>>> Indices in the array are set to follow a random pattern (cutting
>>>> prefetcher), and forcing RAM access.
>>>>
>>>> git clone https://github.com/wwilly/benchmark.git \
>>>>     && cd benchmark \
>>>>     && source env.sh \
>>>>     && ./bench_build.sh \
>>>>     && bash source/scripts/test_dvfs_mem.sh
>>>>
>>>> Python 3, cmake and sudo rights are required.
>>>>
>>>> Results:
>>>> DVFS CPU with performance governor
>>>> mem_gov = simple_ondemand at 16500 Hz in idle, should be bumped when the
>>>> benchmark is running.
>>>> - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory
>>>>   access),
>>>> - on the big cluster it takes 4.76556 s to run (980.343 c per memory
>>>>   access).
>>>>
>>>> While forcing DVFS memory bus to use performance governor,
>>>> mem_gov = performance at 82500 Hz in idle,
>>>> - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory
>>>>   access),
>>>> - on the big cluster it takes 1.18448 s to run (243.664 c per memory
>>>>   access).
>>>>
>>>> The kernel used is the last 5.7.5 stable with default exynos_defconfig.
>>>
>>> Thanks for the report. Few thoughts:
>>> 1. What trans_stat are saying? Except DMC driver you can also check
>>> all other devfreq devices (e.g. wcore) - maybe the devfreq events
>>> (nocp) are not properly assigned?
>>> 2. Try running the measurement for ~1 minutes or longer. The counters
>>> might have some delay (which would require probably fixing but the
>>> point is to narrow the problem).
>>> 3. What do you understand by "mem_gov"? Which device is it?
>>
>> +Cc Lukasz who was working on this.
> 
> Thanks Krzysztof for adding me here.
> 
>>
>> I just run memtester and more-or-less ondemand works (at least ramps
>> up):
>>
>> Before:
>> /sys/class/devfreq/10c2.memory-controller$ cat trans_stat
>>    From  :   To
>>          : 16500 20600 27500 41300 54300 63300 72800 82500   time(ms)
>> *   16500:     0     0     0     0     0     0     0     0    1795950
>>     20600:     1     0     0     0     0     0     0     0       4770
>>     27500:     0     1     0     0     0     0     0     0      15540
>>     41300:     0     0     1     0     0     0     0     0      20780
>>     54300:     0     0     0     1     0     0     0     1      10760
>>     63300:     0     0     0     0     2     0     0     0      10310
>>     72800:     0     0     0     0     0     0     0     0          0
>>     82500:     0     0     0     0     0     2     0     0      25920
>> Total transition : 9
>>
>>
>> $ sudo memtester 1G
>>
>> During memtester:
>> /sys/class/devfreq/10c2.memory-controller$ cat trans_stat
>>   From  :   To
>>         :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
>>    16500:      0      0      0      0      0      0      0      1    1801490
>>    20600:      1      0      0      0      0      0      0      0       4770
>>    27500:      0      1      0      0      0      0      0      0      15540
>>    41300:      0      0      1      0      0      0      0      0      20780
>>    54300:      0      0      0      1      0      0      0      2      11090
>>    63300:      0      0      0      0      3      0      0      0      17210
>>    72800:      0      0      0      0      0      0      0      0          0
>> *  82500:      0      0      0      0      0      3      0      0     169020
>> Total transition : 13
>>
>> However after killing memtester it stays at 633 MHz for very long time
>> and does not slow down. This is indeed weird...
> 
> I had issues with devfreq governor which wasn't called by devfreq
> workqueue. The old DELAYED vs DEFERRED work discussions and my patches
> for it [1]. If 

Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-24 Thread Lukasz Luba

Hi Krzysztof and Willy

On 6/23/20 8:11 PM, Krzysztof Kozlowski wrote:

On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:

On Tue, 23 Jun 2020 at 18:47, Willy Wolff  wrote:


Hi everybody,

Is DVFS for memory bus really working on Odroid XU3/4 board?
Using a simple microbenchmark that is doing only memory accesses, memory DVFS
seems to not working properly:

The microbenchmark is doing pointer chasing by following index in an array.
Indices in the array are set to follow a random pattern (cutting prefetcher),
and forcing RAM access.

git clone https://github.com/wwilly/benchmark.git \
   && cd benchmark \
   && source env.sh \
   && ./bench_build.sh \
   && bash source/scripts/test_dvfs_mem.sh

Python 3, cmake and sudo rights are required.

Results:
DVFS CPU with performance governor
mem_gov = simple_ondemand at 16500 Hz in idle, should be bumped when the
benchmark is running.
- on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
- on the big cluster it takes 4.76556 s to run (980.343 c per memory access).

While forcing DVFS memory bus to use performance governor,
mem_gov = performance at 82500 Hz in idle,
- on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
- on the big cluster it takes 1.18448 s to run (243.664 c per memory access).

The kernel used is the last 5.7.5 stable with default exynos_defconfig.


Thanks for the report. Few thoughts:
1. What trans_stat are saying? Except DMC driver you can also check
all other devfreq devices (e.g. wcore) - maybe the devfreq events
(nocp) are not properly assigned?
2. Try running the measurement for ~1 minutes or longer. The counters
might have some delay (which would require probably fixing but the
point is to narrow the problem).
3. What do you understand by "mem_gov"? Which device is it?


+Cc Lukasz who was working on this.


Thanks Krzysztof for adding me here.



I just run memtester and more-or-less ondemand works (at least ramps
up):

Before:
/sys/class/devfreq/10c2.memory-controller$ cat trans_stat
  From  :   To
        :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
*  16500:      0      0      0      0      0      0      0      0    1795950
   20600:      1      0      0      0      0      0      0      0       4770
   27500:      0      1      0      0      0      0      0      0      15540
   41300:      0      0      1      0      0      0      0      0      20780
   54300:      0      0      0      1      0      0      0      1      10760
   63300:      0      0      0      0      2      0      0      0      10310
   72800:      0      0      0      0      0      0      0      0          0
   82500:      0      0      0      0      0      2      0      0      25920
Total transition : 9


$ sudo memtester 1G

During memtester:
/sys/class/devfreq/10c2.memory-controller$ cat trans_stat
  From  :   To
        :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
   16500:      0      0      0      0      0      0      0      1    1801490
   20600:      1      0      0      0      0      0      0      0       4770
   27500:      0      1      0      0      0      0      0      0      15540
   41300:      0      0      1      0      0      0      0      0      20780
   54300:      0      0      0      1      0      0      0      2      11090
   63300:      0      0      0      0      3      0      0      0      17210
   72800:      0      0      0      0      0      0      0      0          0
*  82500:      0      0      0      0      0      3      0      0     169020
Total transition : 13

However after killing memtester it stays at 633 MHz for very long time
and does not slow down. This is indeed weird...


I had issues with the devfreq governor not being called by the devfreq
workqueue. See the old DELAYED vs DEFERRED work discussions and my patches
for it [1]. If the CPU which scheduled the next work went idle, the
devfreq workqueue will not be kicked, so the devfreq governor won't check
the DMC status and will not decide to decrease the frequency based on low
busy_time.
The same applies to raising the frequency. Both are
done by the governor, but the workqueue must be scheduled periodically.

I couldn't do much about this back then. I gave an example showing that
this causes issues with the DMC [2]. There is also a description
of your situation, staying at 633 MHz for a long time:
' When it is missing opportunity
to change the 

Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-24 Thread Willy Wolff
On 2020-06-24-10-14-38, Krzysztof Kozlowski wrote:
> On Wed, Jun 24, 2020 at 10:01:17AM +0200, Willy Wolff wrote:
> > Hi Krzysztof,
> > Thanks to look at it.
> > 
> > mem_gov is /sys/class/devfreq/10c2.memory-controller/governor
> > 
> > Here some numbers after increasing the running time:
> > 
> > Running using simple_ondemand:
> > Before:
> >   From  :   To
> >         :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
> > *  16500:      0      0      0      0      0      0      0      4    4528600
> >    20600:      5      0      0      0      0      0      0      0      57780
> >    27500:      0      5      0      0      0      0      0      0      50060
> >    41300:      0      0      5      0      0      0      0      0      46240
> >    54300:      0      0      0      5      0      0      0      0      48970
> >    63300:      0      0      0      0      5      0      0      0      47330
> >    72800:      0      0      0      0      0      0      0      0          0
> >    82500:      0      0      0      0      0      5      0      0     331300
> > Total transition : 34
> > 
> > 
> > After:
> >   From  :   To
> >         :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
> > *  16500:      0      0      0      0      0      0      0      4    5098890
> >    20600:      5      0      0      0      0      0      0      0      57780
> >    27500:      0      5      0      0      0      0      0      0      50060
> >    41300:      0      0      5      0      0      0      0      0      46240
> >    54300:      0      0      0      5      0      0      0      0      48970
> >    63300:      0      0      0      0      5      0      0      0      47330
> >    72800:      0      0      0      0      0      0      0      0          0
> >    82500:      0      0      0      0      0      5      0      0     331300
> > Total transition : 34
> > 
> > With a running time of:
> > LITTLE => 283.699 s (680.877 c per mem access)
> > big => 284.47 s (975.327 c per mem access)
> 
> I see there were no transitions during your memory test.
> 
> > 
> > And when I set to the performance governor:
> > Before:
> >   From  :   To
> >         :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
> >    16500:      0      0      0      0      0      0      0      5    5099040
> >    20600:      5      0      0      0      0      0      0      0      57780
> >    27500:      0      5      0      0      0      0      0      0      50060
> >    41300:      0      0      5      0      0      0      0      0      46240
> >    54300:      0      0      0      5      0      0      0      0      48970
> >    63300:      0      0      0      0      5      0      0      0      47330
> >    72800:      0      0      0      0      0      0      0      0          0
> > *  82500:      0      0      0      0      0      5      0      0     331350
> > Total transition : 35
> > 
> > After:
> >   From  :   To
> >         :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
> >    16500:      0      0      0      0      0      0      0      5    5099040
> >    20600:      5      0      0      0      0      0      0      0      57780
> >    27500:      0      5      0      0      0      0      0      0      50060
> >    41300:      0      0      5      0      0      0      0      0      46240
> >    54300:      0      0      0      5      0      0      0      0      48970
> >    63300:      0      0      0      0      5      0      0      0      47330
> >    72800:      0      0      0      0      0      0      0      0          0
> > *  82500:      0      0      0      0      0      5      0      0     472980
> > Total transition : 35
> > 
> > With a running time of:
> 

Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-24 Thread Krzysztof Kozlowski
On Wed, Jun 24, 2020 at 10:01:17AM +0200, Willy Wolff wrote:
> Hi Krzysztof,
> Thanks to look at it.
> 
> mem_gov is /sys/class/devfreq/10c2.memory-controller/governor
> 
> Here some numbers after increasing the running time:
> 
> Running using simple_ondemand:
> Before:
>   From  :   To
>         :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
> *  16500:      0      0      0      0      0      0      0      4    4528600
>    20600:      5      0      0      0      0      0      0      0      57780
>    27500:      0      5      0      0      0      0      0      0      50060
>    41300:      0      0      5      0      0      0      0      0      46240
>    54300:      0      0      0      5      0      0      0      0      48970
>    63300:      0      0      0      0      5      0      0      0      47330
>    72800:      0      0      0      0      0      0      0      0          0
>    82500:      0      0      0      0      0      5      0      0     331300
> Total transition : 34
> 
> 
> After:
>   From  :   To
>         :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
> *  16500:      0      0      0      0      0      0      0      4    5098890
>    20600:      5      0      0      0      0      0      0      0      57780
>    27500:      0      5      0      0      0      0      0      0      50060
>    41300:      0      0      5      0      0      0      0      0      46240
>    54300:      0      0      0      5      0      0      0      0      48970
>    63300:      0      0      0      0      5      0      0      0      47330
>    72800:      0      0      0      0      0      0      0      0          0
>    82500:      0      0      0      0      0      5      0      0     331300
> Total transition : 34
> 
> With a running time of:
> LITTLE => 283.699 s (680.877 c per mem access)
> big => 284.47 s (975.327 c per mem access)

I see there were no transitions during your memory test.

> 
> And when I set to the performance governor:
> Before:
>   From  :   To
>         :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
>    16500:      0      0      0      0      0      0      0      5    5099040
>    20600:      5      0      0      0      0      0      0      0      57780
>    27500:      0      5      0      0      0      0      0      0      50060
>    41300:      0      0      5      0      0      0      0      0      46240
>    54300:      0      0      0      5      0      0      0      0      48970
>    63300:      0      0      0      0      5      0      0      0      47330
>    72800:      0      0      0      0      0      0      0      0          0
> *  82500:      0      0      0      0      0      5      0      0     331350
> Total transition : 35
> 
> After:
>   From  :   To
>         :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
>    16500:      0      0      0      0      0      0      0      5    5099040
>    20600:      5      0      0      0      0      0      0      0      57780
>    27500:      0      5      0      0      0      0      0      0      50060
>    41300:      0      0      5      0      0      0      0      0      46240
>    54300:      0      0      0      5      0      0      0      0      48970
>    63300:      0      0      0      0      5      0      0      0      47330
>    72800:      0      0      0      0      0      0      0      0          0
> *  82500:      0      0      0      0      0      5      0      0     472980
> Total transition : 35
> 
> With a running time of:
> LITTLE: 68.8428 s (165.223 c per mem access)
> big: 71.3268 s (244.549 c per mem access)
> 
> 
> I see some transition, but not occuring during the benchmark.
> I haven't dive into the code, but maybe it is the heuristic behind that is not
> well defined? If you know 

Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-24 Thread Willy Wolff
Hi Krzysztof,
Thanks for looking at it.

mem_gov is /sys/class/devfreq/10c2.memory-controller/governor

Here are some numbers after increasing the running time:

Running using simple_ondemand:
Before:
  From  :   To
        :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
*  16500:      0      0      0      0      0      0      0      4    4528600
   20600:      5      0      0      0      0      0      0      0      57780
   27500:      0      5      0      0      0      0      0      0      50060
   41300:      0      0      5      0      0      0      0      0      46240
   54300:      0      0      0      5      0      0      0      0      48970
   63300:      0      0      0      0      5      0      0      0      47330
   72800:      0      0      0      0      0      0      0      0          0
   82500:      0      0      0      0      0      5      0      0     331300
Total transition : 34


After:
  From  :   To
        :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
*  16500:      0      0      0      0      0      0      0      4    5098890
   20600:      5      0      0      0      0      0      0      0      57780
   27500:      0      5      0      0      0      0      0      0      50060
   41300:      0      0      5      0      0      0      0      0      46240
   54300:      0      0      0      5      0      0      0      0      48970
   63300:      0      0      0      0      5      0      0      0      47330
   72800:      0      0      0      0      0      0      0      0          0
   82500:      0      0      0      0      0      5      0      0     331300
Total transition : 34

With a running time of:
LITTLE => 283.699 s (680.877 c per mem access)
big => 284.47 s (975.327 c per mem access)


And when I set the performance governor:
Before:
  From  :   To
        :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
   16500:      0      0      0      0      0      0      0      5    5099040
   20600:      5      0      0      0      0      0      0      0      57780
   27500:      0      5      0      0      0      0      0      0      50060
   41300:      0      0      5      0      0      0      0      0      46240
   54300:      0      0      0      5      0      0      0      0      48970
   63300:      0      0      0      0      5      0      0      0      47330
   72800:      0      0      0      0      0      0      0      0          0
*  82500:      0      0      0      0      0      5      0      0     331350
Total transition : 35

After:
  From  :   To
        :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
   16500:      0      0      0      0      0      0      0      5    5099040
   20600:      5      0      0      0      0      0      0      0      57780
   27500:      0      5      0      0      0      0      0      0      50060
   41300:      0      0      5      0      0      0      0      0      46240
   54300:      0      0      0      5      0      0      0      0      48970
   63300:      0      0      0      0      5      0      0      0      47330
   72800:      0      0      0      0      0      0      0      0          0
*  82500:      0      0      0      0      0      5      0      0     472980
Total transition : 35

With a running time of:
LITTLE: 68.8428 s (165.223 c per mem access)
big: 71.3268 s (244.549 c per mem access)


I see some transitions, but they do not occur during the benchmark.
I haven't dived into the code yet, but maybe the heuristic behind it is not
well defined? If you know how it works, that would be helpful before I dive
into it.
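
For orientation, the decision rule of simple_ondemand can be sketched as
below. This is a simplified model written from memory of the mainline governor
(drivers/devfreq/governor_simpleondemand.c), not a copy of it. The function
name is hypothetical, and the default thresholds of 90 and 5 are assumed to be
the generic defaults; a driver can pass its own values, so do not read them as
what the DMC driver uses. The devfreq core then maps the returned value to a
supported frequency, which is not modelled here.

```python
def simple_ondemand_target(busy_time, total_time, current_freq, max_freq,
                           upthreshold=90, downdifferential=5):
    """Return the frequency the governor would ask for (same unit as inputs).

    busy_time / total_time come from the devfreq event counters for the
    last polling interval. Thresholds are percentages (assumed defaults).
    """
    # No usable statistics or unknown current frequency: go to maximum.
    if total_time == 0 or current_freq == 0:
        return max_freq
    # Busy enough: jump straight to the maximum frequency.
    if busy_time * 100 > total_time * upthreshold:
        return max_freq
    # Inside the hysteresis band: keep the current frequency.
    if busy_time * 100 > total_time * (upthreshold - downdifferential):
        return current_freq
    # Otherwise scale the frequency down in proportion to the load.
    scaled = busy_time * current_freq // total_time
    return scaled * 100 // (upthreshold - downdifferential // 2)
```

Note that this rule is only evaluated when the polling work actually runs, so
a sound heuristic can still appear stuck if the work is never scheduled.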

I ran your test as well, and indeed it seems to work for large chunks of memory,
with some delay before each transition (it seems to be around 10 s).
When you kill memtester, it reduces the frequency step by step every ~10 s.

Note that the timing shown above account for the 

Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-23 Thread Krzysztof Kozlowski
On Tue, Jun 23, 2020 at 09:02:38PM +0200, Krzysztof Kozlowski wrote:
> On Tue, 23 Jun 2020 at 18:47, Willy Wolff  wrote:
> >
> > Hi everybody,
> >
> > Is DVFS for memory bus really working on Odroid XU3/4 board?
> > Using a simple microbenchmark that is doing only memory accesses, memory DVFS
> > seems to not working properly:
> >
> > The microbenchmark is doing pointer chasing by following index in an array.
> > Indices in the array are set to follow a random pattern (cutting prefetcher),
> > and forcing RAM access.
> >
> > git clone https://github.com/wwilly/benchmark.git \
> >   && cd benchmark \
> >   && source env.sh \
> >   && ./bench_build.sh \
> >   && bash source/scripts/test_dvfs_mem.sh
> >
> > Python 3, cmake and sudo rights are required.
> >
> > Results:
> > DVFS CPU with performance governor
> > mem_gov = simple_ondemand at 16500 Hz in idle, should be bumped when the
> > benchmark is running.
> > - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
> > - on the big cluster it takes 4.76556 s to run (980.343 c per memory access).
> >
> > While forcing DVFS memory bus to use performance governor,
> > mem_gov = performance at 82500 Hz in idle,
> > - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
> > - on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
> >
> > The kernel used is the last 5.7.5 stable with default exynos_defconfig.
> 
> Thanks for the report. Few thoughts:
> 1. What trans_stat are saying? Except DMC driver you can also check
> all other devfreq devices (e.g. wcore) - maybe the devfreq events
> (nocp) are not properly assigned?
> 2. Try running the measurement for ~1 minutes or longer. The counters
> might have some delay (which would require probably fixing but the
> point is to narrow the problem).
> 3. What do you understand by "mem_gov"? Which device is it?

+Cc Lukasz who was working on this.

I just ran memtester and ondemand more or less works (at least it ramps
up):

Before:
/sys/class/devfreq/10c2.memory-controller$ cat trans_stat
  From  :   To
        :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
*  16500:      0      0      0      0      0      0      0      0    1795950
   20600:      1      0      0      0      0      0      0      0       4770
   27500:      0      1      0      0      0      0      0      0      15540
   41300:      0      0      1      0      0      0      0      0      20780
   54300:      0      0      0      1      0      0      0      1      10760
   63300:      0      0      0      0      2      0      0      0      10310
   72800:      0      0      0      0      0      0      0      0          0
   82500:      0      0      0      0      0      2      0      0      25920
Total transition : 9


$ sudo memtester 1G

During memtester:
/sys/class/devfreq/10c2.memory-controller$ cat trans_stat
  From  :   To
        :  16500  20600  27500  41300  54300  63300  72800  82500   time(ms)
   16500:      0      0      0      0      0      0      0      1    1801490
   20600:      1      0      0      0      0      0      0      0       4770
   27500:      0      1      0      0      0      0      0      0      15540
   41300:      0      0      1      0      0      0      0      0      20780
   54300:      0      0      0      1      0      0      0      2      11090
   63300:      0      0      0      0      3      0      0      0      17210
   72800:      0      0      0      0      0      0      0      0          0
*  82500:      0      0      0      0      0      3      0      0     169020
Total transition : 13

However, after killing memtester it stays at 633 MHz for a very long time
and does not slow down. This is indeed weird...

Best regards,
Krzysztof


Re: brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-23 Thread Krzysztof Kozlowski
On Tue, 23 Jun 2020 at 18:47, Willy Wolff  wrote:
>
> Hi everybody,
>
> Is DVFS for memory bus really working on Odroid XU3/4 board?
> Using a simple microbenchmark that is doing only memory accesses, memory DVFS
> seems to not working properly:
>
> The microbenchmark is doing pointer chasing by following index in an array.
> Indices in the array are set to follow a random pattern (cutting prefetcher),
> and forcing RAM access.
>
> git clone https://github.com/wwilly/benchmark.git \
>   && cd benchmark \
>   && source env.sh \
>   && ./bench_build.sh \
>   && bash source/scripts/test_dvfs_mem.sh
>
> Python 3, cmake and sudo rights are required.
>
> Results:
> DVFS CPU with performance governor
> mem_gov = simple_ondemand at 16500 Hz in idle, should be bumped when the
> benchmark is running.
> - on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
> - on the big cluster it takes 4.76556 s to run (980.343 c per memory access).
>
> While forcing DVFS memory bus to use performance governor,
> mem_gov = performance at 82500 Hz in idle,
> - on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
> - on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
>
> The kernel used is the last 5.7.5 stable with default exynos_defconfig.

Thanks for the report. A few thoughts:
1. What does trans_stat say? Besides the DMC driver, you can also check
all other devfreq devices (e.g. wcore) - maybe the devfreq events
(nocp) are not properly assigned?
2. Try running the measurement for ~1 minute or longer. The counters
might have some delay (which would probably require fixing, but the
point is to narrow down the problem).
3. What do you mean by "mem_gov"? Which device is it?
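
One way to answer the trans_stat question is to take a snapshot before and
after the run and compare the time spent at each frequency. The sketch below
is only an illustration: the function names are hypothetical, and it assumes
the usual layout where each data row is "frequency: counts... time(ms)" and
the current frequency is marked with "*".

```python
def parse_trans_stat(text):
    """Return ({freq: time_ms}, current_freq) from the text of a trans_stat file."""
    residency = {}
    current = None
    for line in text.splitlines():
        head, sep, tail = line.partition(":")
        label = head.strip()
        is_current = label.startswith("*")
        label = label.lstrip("*").strip()
        fields = tail.split()
        # Header rows and the "Total transition" line are skipped: a data row
        # has a numeric frequency label and only numeric fields after the colon.
        if not sep or not label.isdigit() or not fields:
            continue
        if not all(field.isdigit() for field in fields):
            continue
        freq = int(label)
        residency[freq] = int(fields[-1])  # last column is time(ms)
        if is_current:
            current = freq
    return residency, current


def residency_delta(before, after):
    """Time in ms spent at each frequency between two snapshots."""
    return {freq: after[freq] - before.get(freq, 0) for freq in sorted(after)}
```

The two snapshots would be read from a file such as
/sys/class/devfreq/10c2.memory-controller/trans_stat right before and right
after the benchmark; a delta concentrated on a single frequency means no
transition happened during the run.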

Best regards,
Krzysztof


brocken devfreq simple_ondemand for Odroid XU3/4?

2020-06-23 Thread Willy Wolff
Hi everybody,

Is DVFS for the memory bus really working on the Odroid XU3/4 boards?
Using a simple microbenchmark that does only memory accesses, memory DVFS
does not seem to work properly:

The microbenchmark does pointer chasing by following indices stored in an array.
The indices are set to follow a random pattern (defeating the prefetcher),
forcing accesses to RAM.
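
As an illustration of that access pattern (not the code from the repository),
here is a minimal sketch. The function names are hypothetical, and Sattolo's
algorithm is used as one possible way to get a single random cycle, so that
every slot is visited once per lap. Python only shows the index pattern here;
a real measurement needs compiled code and an array much larger than the
last-level cache.

```python
import random


def build_chase_array(n, seed=0):
    """Return a list where entry i holds the index to visit after i.

    Sattolo's algorithm yields a permutation consisting of one single cycle,
    so following the indices touches every slot exactly once per lap, in an
    order that a hardware prefetcher cannot easily predict.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i, never i itself: keeps a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent loads and return the last index."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]  # each load depends on the result of the previous one
    return idx
```

Because each load address depends on the previous load, the time per step
approximates the memory access latency once the array no longer fits in cache.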

git clone https://github.com/wwilly/benchmark.git \
  && cd benchmark \
  && source env.sh \
  && ./bench_build.sh \
  && bash source/scripts/test_dvfs_mem.sh

Python 3, cmake and sudo rights are required.

Results:
CPU DVFS with the performance governor.
mem_gov = simple_ondemand at 16500 Hz when idle; it should be bumped when the
benchmark is running.
- on the LITTLE cluster it takes 4.74308 s to run (683.004 c per memory access),
- on the big cluster it takes 4.76556 s to run (980.343 c per memory access).

When forcing the memory bus DVFS to use the performance governor,
mem_gov = performance at 82500 Hz when idle,
- on the LITTLE cluster it takes 1.1451 s to run (164.894 c per memory access),
- on the big cluster it takes 1.18448 s to run (243.664 c per memory access).
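
Put as ratios, the runs are roughly four times slower with simple_ondemand,
while the two idle frequencies quoted above differ by a factor of five. That
is consistent with the memory bus staying at its lowest frequency during the
benchmark. A small sketch of the arithmetic, using only the numbers reported
here (the variable names are just for illustration):

```python
def slowdown(t_ondemand, t_performance):
    """How many times longer the simple_ondemand run took."""
    return t_ondemand / t_performance


# (simple_ondemand seconds, performance seconds) as reported above
runs = {"LITTLE": (4.74308, 1.1451), "big": (4.76556, 1.18448)}
ratios = {cluster: slowdown(*times) for cluster, times in runs.items()}

# Ratio of the two idle frequencies quoted for mem_gov
freq_ratio = 82500 / 16500

for cluster, ratio in ratios.items():
    print(f"{cluster}: {ratio:.2f}x slower with simple_ondemand")
print(f"frequency ratio: {freq_ratio:.0f}x")
```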

The kernel used is the latest stable, 5.7.5, with the default exynos_defconfig.

Cheers,
Willy