[PATCH 1/1] Makefile: Fix cross building

2021-04-20 Thread Patrick Waterlander
The native pkg-config is always called, which may result in incorrect
flags and paths being passed to the cross compiler. Implement
$(CROSS_COMPILE)pkg-config to avoid this issue.

Signed-off-by: Patrick Waterlander 
---
 Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index bc19584fee59..9aa2073f68d6 100644
--- a/Makefile
+++ b/Makefile
@@ -462,6 +462,7 @@ LZMA= lzma
 LZ4= lz4c
 XZ = xz
 ZSTD   = zstd
+PKG_CONFIG ?= $(CROSS_COMPILE)pkg-config
 
 CHECKFLAGS := -D__linux__ -Dlinux -D__STDC__ -Dunix -D__unix__ \
  -Wbitwise -Wno-return-void -Wno-unknown-attribute $(CF)
@@ -1089,7 +1090,7 @@ mod_sign_cmd = true
 endif
 export mod_sign_cmd
 
-HOST_LIBELF_LIBS = $(shell pkg-config libelf --libs 2>/dev/null || echo -lelf)
+HOST_LIBELF_LIBS = $(shell $(PKG_CONFIG) libelf --libs 2>/dev/null || echo -lelf)
 
 has_libelf = $(call try-run,\
   echo "int main() {}" | $(HOSTCC) -xc -o /dev/null $(HOST_LIBELF_LIBS) -,1,0)
-- 
2.17.1
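
For reference, the override only takes effect if a cross-aware
pkg-config actually exists under the $(CROSS_COMPILE)-prefixed name. A
minimal sketch of such a wrapper (the aarch64 prefix and the sysroot
path below are placeholders, not part of the patch):

    #!/bin/sh
    # aarch64-linux-gnu-pkg-config: query the target's .pc files,
    # not the build host's
    export PKG_CONFIG_LIBDIR=/path/to/sysroot/usr/lib/pkgconfig
    export PKG_CONFIG_SYSROOT_DIR=/path/to/sysroot
    exec pkg-config "$@"

With that wrapper somewhere in PATH, a cross build picks it up
automatically:

    make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- vmlinux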



Re: [PATCH 5.10 000/103] 5.10.32-rc1 review

2021-04-19 Thread Patrick Mccormick
We ran tests on this kernel:

commit 32f5704a0a4f7dcc8aa74a49dbcce359d758f6d5 (HEAD -> rc/linux-5.10.y)
Author: Greg Kroah-Hartman 
Date:   Thu Apr 15 16:44:09 2021 +0200

Linux 5.10.31-rc1

No problems found.

Hardware tested on:

model name : Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

Specific tests run:

ok 1 ltp.py:LTP.test_nptl
ok 2 ltp.py:LTP.test_math
ok 3 ltp.py:LTP.test_dio
ok 4 ltp.py:LTP.test_io
ok 5 ltp.py:LTP.test_power_management_tests
ok 6 ltp.py:LTP.test_can
ok 7 ltp.py:LTP.test_input
ok 8 ltp.py:LTP.test_hugetlb
ok 9 ltp.py:LTP.test_ipc
ok 10 ltp.py:LTP.test_uevent
ok 11 ltp.py:LTP.test_smoketest
ok 12 ltp.py:LTP.test_containers
ok 13 ltp.py:LTP.test_filecaps
ok 14 ltp.py:LTP.test_sched
ok 15 ltp.py:LTP.test_hyperthreading
ok 16 ltp.py:LTP.test_cap_bounds
ok 17 kpatch.sh
ok 18 perf.py:PerfNonPriv.test_perf_help
ok 19 perf.py:PerfNonPriv.test_perf_version
ok 20 perf.py:PerfNonPriv.test_perf_list
ok 21 perf.py:PerfPriv.test_perf_record
ok 22 perf.py:PerfPriv.test_perf_cmd_kallsyms
ok 23 perf.py:PerfPriv.test_perf_cmd_annotate
ok 24 perf.py:PerfPriv.test_perf_cmd_evlist
ok 25 perf.py:PerfPriv.test_perf_cmd_script
ok 26 perf.py:PerfPriv.test_perf_stat
ok 27 perf.py:PerfPriv.test_perf_bench
ok 28 kselftest.py:kselftest.test_sysctl
ok 29 kselftest.py:kselftest.test_size
ok 30 kselftest.py:kselftest.test_sync
ok 31 kselftest.py:kselftest.test_capabilities
ok 32 kselftest.py:kselftest.test_x86
ok 33 kselftest.py:kselftest.test_pidfd
ok 34 kselftest.py:kselftest.test_membarrier
ok 35 kselftest.py:kselftest.test_sigaltstack
ok 36 kselftest.py:kselftest.test_tmpfs
ok 37 kselftest.py:kselftest.test_user
ok 38 kselftest.py:kselftest.test_sched
ok 39 kselftest.py:kselftest.test_timens
ok 40 kselftest.py:kselftest.test_timers

Tested-By: Patrick McCormick 

On Mon, Apr 19, 2021 at 10:27 AM Fox Chen  wrote:
>
> On Mon, 19 Apr 2021 15:05:11 +0200, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 5.10.32 release.
> > There are 103 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Wed, 21 Apr 2021 13:05:09 +.
> > Anything received after that time might be too late.
> >
> > The whole patch series can be found in one patch at:
> >   
> > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.32-rc1.gz
> > or in the git tree and branch at:
> >   
> > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> > linux-5.10.y
> > and the diffstat can be found below.
> >
> > thanks,
> >
> > greg k-h
> >
>
> 5.10.32-rc1 Successfully Compiled and booted on my Raspberry PI 4b (8g) 
> (bcm2711)
>
> Tested-by: Fox Chen 
>


Re: [kbuild-all] Re: include/linux/compiler_types.h:315:38: error: call to '__compiletime_assert_536' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct can_frame, len) != offsetof(st

2021-03-23 Thread Patrick Menschel


On 23.03.21 at 07:06, Rong Chen wrote:
>>>> Ugh! Why did the compiler extend the space for the union to 4 bytes?!?
>> Just a random idea but maybe the added padding is due to some
>> kind of odd interaction with the __attribute__((__aligned__(8)))
>> just below? Does this reproduce if we remove the
>> __attribute__((__aligned__(8)))?
> 
> Here is the layout without __attribute__((__aligned__(8))),
> the union is still extended to 4 bytes:
> 
> struct can_frame {
>     canid_t    can_id;   /* 0 4 */
>     union {
>     __u8   len;  /* 4 1 */
>     __u8   can_dlc;  /* 4 1 */
>     };   /* 4 4 */
>     __u8   __pad;    /* 8 1 */
>     __u8   __res0;   /* 9 1 */
>     __u8   len8_dlc; /* 10 1 */
>     __u8   data[8];  /* 11 8 */
> 
>     /* size: 20, cachelines: 1, members: 6 */
>     /* padding: 1 */
>     /* last cacheline: 20 bytes */
> };
> 
> Best Regards,
> Rong Chen

Hi,

I would suggest a try with __attribute__((__aligned__(8))) only on
can_frame, not on data[8].

If the structure length is a multiple of 8, the compiler should
recognize this and keep the union a single byte, while arrays of
that struct stay correctly aligned.

The __attribute__((__aligned__(8))) on data[8] has strange propagation
effects upwards.

If the attributes are really necessary, I would suggest having both

__attribute__((packed))
__attribute__((__aligned__(8)))

at structure level instead of inside, so no padding happens inside the
structure while the structure itself is aligned.
Using aligned and packed inside a structure may be contradictory for
the compiler.
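
A quick way to check this suggestion from userspace, with stdint types
standing in for the kernel's canid_t/__u8 (a sketch, not the kernel
definition):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t canid_t;

/* both attributes at structure level, nothing on data[] */
struct can_frame_alt {
	canid_t can_id;
	union {
		uint8_t len;
		uint8_t can_dlc;
	};
	uint8_t pad;
	uint8_t res0;
	uint8_t len8_dlc;
	uint8_t data[8];
} __attribute__((packed, aligned(8)));

int main(void)
{
	/* expected: size 16, alignment 8, the union stays a single byte */
	printf("size=%zu align=%zu\n",
	       sizeof(struct can_frame_alt),
	       (size_t)_Alignof(struct can_frame_alt));
	return 0;
}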

This reminds me of the alignment/gap issue with my python3
implementation of the BCM message while alternating between x86_64 and
armhf. Using ctypes was a mess but bytestrings worked in the end.
Be aware that native alignment is apparently 4 on armhf Linux and 8 on
x86_64 Linux.

https://marc.info/?l=linux-can&m=161251622904527&w=2
https://gitlab.com/Menschel/socketcan/-/commit/afc6744129448ae4333629fc0297808dd42e3530#e522710a8423075cfd1147ae6b7f44facac3ffb0_133_132

Best Regards,
Patrick Menschel


Re: [PATCH net 1/4] net: freescale/fman: Split the main resource region reservation

2020-12-10 Thread Patrick Havelange

On 2020-12-10 10:05, Madalin Bucur wrote:

-Original Message-
From: Patrick Havelange 


[snipped]



But then that change would not be compatible with the existing device
trees in already existing hardware. I'm not sure how to handle that case
properly.


One needs to be backwards compatible with the old device trees, so we do not
really have a simple answer, I know.


If we want to hack it,
instead of splitting ioremaps, we can reserve 4 kB in the FMan driver,
and keep the ioremap as it is now, with the benefit of less code churn.


but then the ioremap and the memory reservation do not match. Why bother
at all then with the mem reservation, just ioremap only and be done with
it. What I'm saying is, I don't see the point of having a "fake"
reservation call if it does not correspond to what is being used.


The reservation is not fake, it is just covering the first portion of the ioremap.
Another hypothetical FMan driver would presumably reserve and ioremap starting
from the same point, thus the desired error would be met.

Regarding removing reservation altogether, yes, we can do that, in the child
device drivers. That will fix that use after free issue you've found and align
with the custom, hierarchical structure of the FMan devices/drivers. But it would
leave them without the double use guard we have when using the reservation.


In the end, what the reservation is trying to achieve is to make sure there
is a single driver controlling a certain peripheral, and this basic
requirement would be addressed by that change plus devm_of_iomap() for child
devices (ports, MACs).


Again, correct me if I'm wrong, but with the fake mem reservation, it
would *not* make sure that a single driver is controlling a certain
peripheral.


Actually, it would. If the current FMan driver reserves the first part of the
FMan memory, then another FMan driver (I do not expect a random driver trying
to map the FMan register area)


Ha! Now I understand your point. I still think it is not a clean
solution, as the reservation does not match the ioremap usage.



would likely try to use that same part (with the same or
a different size, it does not matter, there will be an error anyway) and the
reservation attempt will fail. If we fix the child device drivers, then they
will have normal mappings and reservations.


My point is, either have a *correct* mem reservation, or don't have one
at all. There is no point in trying to cheat the system.


Now we do not have correct reservations for the child devices because the
parent takes it all for itself. Reduce the parent reservation and make room
for correct reservations for the children. The two-section change you've made
may try to be correct but it's overkill for the purpose of detecting double use.


But it is not overkill if we want to detect potential subdrivers mapping
sections that do not start at the beginning of the main fman region (but
are still part of it).



And I also find that the patch obfuscates the already not-so-readable
code, so I'd opt for a simpler fix.


As said already, I'm not in favor of having a reservation that does
not match the real usage.


And in my opinion, having a mismatch between the mem reservation and
the mem usage is also the kind of obfuscation that we want to avoid.


Yes now the code is slightly more complex, but it is also slightly more 
correct.


I currently don't see another way to make it simpler *and* correct
at the same time.


Patrick H.



Madalin



Re: [PATCH net 1/4] net: freescale/fman: Split the main resource region reservation

2020-12-10 Thread Patrick Havelange

On 2020-12-09 19:55, Madalin Bucur wrote:

-Original Message-
From: Patrick Havelange 
Sent: 09 December 2020 16:17
To: Madalin Bucur ; David S. Miller
; Jakub Kicinski ;
net...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH net 1/4] net: freescale/fman: Split the main resource
region reservation


area. I'm assuming this is the problem you are trying to address here,
besides the stack corruption issue.


Yes exactly.
I did not add this behaviour (having a main region and subdrivers using
subregions), I'm just trying to correct what is already there.
For example: this is some content of /proc/iomem for one board I'm
working with, with the current existing code:
ffe40-ffe4fdfff : fman
 ffe4e-ffe4e0fff : mac
 ffe4e2000-ffe4e2fff : mac
 ffe4e4000-ffe4e4fff : mac
 ffe4e6000-ffe4e6fff : mac
 ffe4e8000-ffe4e8fff : mac

and now with my patches:
ffe40-ffe4fdfff : /soc@ffe00/fman@40
 ffe40-ffe480fff : fman
 ffe488000-ffe488fff : fman-port
 ffe489000-ffe489fff : fman-port
 ffe48a000-ffe48afff : fman-port
 ffe48b000-ffe48bfff : fman-port
 ffe48c000-ffe48cfff : fman-port
 ffe4a8000-ffe4a8fff : fman-port
 ffe4a9000-ffe4a9fff : fman-port
 ffe4aa000-ffe4aafff : fman-port
 ffe4ab000-ffe4abfff : fman-port
 ffe4ac000-ffe4acfff : fman-port
 ffe4c-ffe4d : fman
 ffe4e-ffe4e0fff : mac
 ffe4e2000-ffe4e2fff : mac
 ffe4e4000-ffe4e4fff : mac
 ffe4e6000-ffe4e6fff : mac
 ffe4e8000-ffe4e8fff : mac


While for the latter I think we can
put together a quick fix, for the former I'd like to take a bit of time
to select the best fix, if one is really needed. So, please, let's split
the two problems and first address the incorrect stack memory use.


I have no idea how you can fix it without a (more correct this time)
dummy region passed as parameter (and you don't want to use the first
patch). But then it will be useless to do the call anyway, as it won't
do any proper verification at all, so it could also be removed entirely,
which begs the question, why do it at all in the first place (the
devm_request_mem_region).

I'm not an expert in that part of the code so feel free to correct me if
I missed something.

BR,

Patrick H.


Hi, Patrick,

the DPAA entities are described in the device tree. Adding some hardcoding in
the driver is not really the solution for this problem. And I'm not sure
we have

I'm not seeing any problem here, the offsets used by the fman driver
were already there, I just reorganized them in 2 blocks.


a clear problem statement to start with. Can you help me on that part?


- The current call to __devm_request_region in fman_port.c is not correct.

One way to fix this is to use devm_request_mem_region, however this
requires that the main fman driver not reserve the whole region.
This leads to the second problem:
- Make sure the main fman driver is not reserving the whole region.

Is that clearer like this?

Patrick H.


Hi,



The overlapping IO areas result from the device tree description, that in turn
mimics the HW description in the manual. If we really want to remove the 
nesting,
we should change the device trees, not the drivers.


But then that change would not be compatible with the existing device 
trees in already existing hardware. I'm not sure how to handle that case 
properly.



If we want to hack it,
instead of splitting ioremaps, we can reserve 4 kB in the FMan driver,
and keep the ioremap as it is now, with the benefit of less code churn.


but then the ioremap and the memory reservation do not match. Why bother 
at all then with the mem reservation, just ioremap only and be done with 
it. What I'm saying is, I don't see the point of having a "fake" 
reservation call if it does not correspond to what is being used.



In the end, what the reservation is trying to achieve is to make sure there
is a single driver controlling a certain peripheral, and this basic
requirement would be addressed by that change plus devm_of_iomap() for child
devices (ports, MACs).


Again, correct me if I'm wrong, but with the fake mem reservation, it 
would *not* make sure that a single driver is controlling a certain 
peripheral.


My point is, either have a *correct* mem reservation, or don't have one 
at all. There is no point in trying to cheat the system.


Patrick H.



Madalin



Re: [PATCH net 1/4] net: freescale/fman: Split the main resource region reservation

2020-12-09 Thread Patrick Havelange

area. I'm assuming this is the problem you are trying to address here,
besides the stack corruption issue.


Yes exactly.
I did not add this behaviour (having a main region and subdrivers using
subregions), I'm just trying to correct what is already there.
For example: this is some content of /proc/iomem for one board I'm
working with, with the current existing code:
ffe40-ffe4fdfff : fman
ffe4e-ffe4e0fff : mac
ffe4e2000-ffe4e2fff : mac
ffe4e4000-ffe4e4fff : mac
ffe4e6000-ffe4e6fff : mac
ffe4e8000-ffe4e8fff : mac

and now with my patches:
ffe40-ffe4fdfff : /soc@ffe00/fman@40
ffe40-ffe480fff : fman
ffe488000-ffe488fff : fman-port
ffe489000-ffe489fff : fman-port
ffe48a000-ffe48afff : fman-port
ffe48b000-ffe48bfff : fman-port
ffe48c000-ffe48cfff : fman-port
ffe4a8000-ffe4a8fff : fman-port
ffe4a9000-ffe4a9fff : fman-port
ffe4aa000-ffe4aafff : fman-port
ffe4ab000-ffe4abfff : fman-port
ffe4ac000-ffe4acfff : fman-port
ffe4c-ffe4d : fman
ffe4e-ffe4e0fff : mac
ffe4e2000-ffe4e2fff : mac
ffe4e4000-ffe4e4fff : mac
ffe4e6000-ffe4e6fff : mac
ffe4e8000-ffe4e8fff : mac


While for the latter I think we can
put together a quick fix, for the former I'd like to take a bit of time
to select the best fix, if one is really needed. So, please, let's split
the two problems and first address the incorrect stack memory use.


I have no idea how you can fix it without a (more correct this time)
dummy region passed as parameter (and you don't want to use the first
patch). But then it will be useless to do the call anyway, as it won't
do any proper verification at all, so it could also be removed entirely,
which begs the question, why do it at all in the first place (the
devm_request_mem_region).

I'm not an expert in that part of the code so feel free to correct me if
I missed something.

BR,

Patrick H.


Hi, Patrick,

the DPAA entities are described in the device tree. Adding some hardcoding in
the driver is not really the solution for this problem. And I'm not sure we have


I'm not seeing any problem here, the offsets used by the fman driver 
were already there, I just reorganized them in 2 blocks.



a clear problem statement to start with. Can you help me on that part?


- The current call to __devm_request_region in fman_port.c is not correct.

One way to fix this is to use devm_request_mem_region, however this
requires that the main fman driver not reserve the whole region.
This leads to the second problem:

- Make sure the main fman driver is not reserving the whole region.

Is that clearer like this?

Patrick H.



Madalin



Re: [PATCH net 1/4] net: freescale/fman: Split the main resource region reservation

2020-12-08 Thread Patrick Havelange

On 2020-12-03 16:47, Madalin Bucur wrote:

-Original Message-
From: Patrick Havelange 
Sent: 03 December 2020 15:51
To: Madalin Bucur ; David S. Miller
; Jakub Kicinski ;
net...@vger.kernel.org; linux-kernel@vger.kernel.org
Cc: Patrick Havelange 
Subject: [PATCH net 1/4] net: freescale/fman: Split the main resource
region reservation

The main fman driver is only using some parts of the fman memory
region.
Split the reservation of the main region in 2, so that the other
regions that will be used by fman-port and fman-mac drivers can
be reserved properly and not be in conflict with the main fman
reservation.

Signed-off-by: Patrick Havelange 


I think the problem you are trying to work on here is that the device
tree entry that describes the FMan IP allots to the parent FMan device the
whole memory-mapped registers area, as described in the device datasheet.
The smaller FMan building blocks (ports, MDIO controllers, etc.) are
each using a nested subset of this memory-mapped registers area.
While this hierarchical depiction of the hardware has not posed a problem
to date, it is possible to cause issues if both the FMan driver and any
of the sub-blocks drivers are trying to exclusively reserve the registers
area. I'm assuming this is the problem you are trying to address here,
besides the stack corruption issue.


Yes exactly.
I did not add this behaviour (having a main region and subdrivers using 
subregions), I'm just trying to correct what is already there.
For example: this is some content of /proc/iomem for one board I'm 
working with, with the current existing code:

ffe40-ffe4fdfff : fman
  ffe4e-ffe4e0fff : mac
  ffe4e2000-ffe4e2fff : mac
  ffe4e4000-ffe4e4fff : mac
  ffe4e6000-ffe4e6fff : mac
  ffe4e8000-ffe4e8fff : mac

and now with my patches:
ffe40-ffe4fdfff : /soc@ffe00/fman@40
  ffe40-ffe480fff : fman
  ffe488000-ffe488fff : fman-port
  ffe489000-ffe489fff : fman-port
  ffe48a000-ffe48afff : fman-port
  ffe48b000-ffe48bfff : fman-port
  ffe48c000-ffe48cfff : fman-port
  ffe4a8000-ffe4a8fff : fman-port
  ffe4a9000-ffe4a9fff : fman-port
  ffe4aa000-ffe4aafff : fman-port
  ffe4ab000-ffe4abfff : fman-port
  ffe4ac000-ffe4acfff : fman-port
  ffe4c-ffe4d : fman
  ffe4e-ffe4e0fff : mac
  ffe4e2000-ffe4e2fff : mac
  ffe4e4000-ffe4e4fff : mac
  ffe4e6000-ffe4e6fff : mac
  ffe4e8000-ffe4e8fff : mac


While for the latter I think we can
put together a quick fix, for the former I'd like to take a bit of time
to select the best fix, if one is really needed. So, please, let's split
the two problems and first address the incorrect stack memory use.


I have no idea how you can fix it without a (more correct this time) 
dummy region passed as parameter (and you don't want to use the first 
patch). But then it will be useless to do the call anyway, as it won't 
do any proper verification at all, so it could also be removed entirely, 
which begs the question, why do it at all in the first place (the 
devm_request_mem_region).


I'm not an expert in that part of the code so feel free to correct me if 
I missed something.


BR,

Patrick H.


Re: [PATCH 2/4] net: freescale/fman-port: remove direct use of __devm_request_region

2020-12-03 Thread Patrick Havelange

On 2020-12-03 09:44, Madalin Bucur wrote:

-Original Message-
From: Patrick Havelange 
Sent: 02 December 2020 18:16
To: Madalin Bucur ; David S. Miller
; Jakub Kicinski ;
net...@vger.kernel.org; linux-kernel@vger.kernel.org
Cc: Patrick Havelange 
Subject: [PATCH 2/4] net: freescale/fman-port: remove direct use of
__devm_request_region

This driver was directly calling __devm_request_region with a specific
resource on the stack as parameter. This is invalid as
__devm_request_region expects the given resource passed as parameter
to live longer than the call itself, as the pointer to the resource
will be stored inside the internal struct used by the devres
management.

In addition to this issue, a related bug has been found by kmemleak
with this trace :
unreferenced object 0xc001efc01880 (size 64):
   comm "swapper/0", pid 1, jiffies 4294669078 (age 3620.536s)
   hex dump (first 32 bytes):
 00 00 00 0f fe 4a c0 00 00 00 00 0f fe 4a cf ff  .J...J..
 c0 00 00 00 00 ee 9d 98 00 00 00 00 80 00 02 00  
   backtrace:
 [] .alloc_resource+0xb8/0xe0
 [] .__request_region+0x70/0x1c4
 [] .__devm_request_region+0x8c/0x138
 [] .fman_port_probe+0x170/0x420
 [] .platform_drv_probe+0x84/0x108
 [] .driver_probe_device+0x2c4/0x394
 [] .__driver_attach+0x124/0x128
 [] .bus_for_each_dev+0xb4/0x110
 [] .driver_attach+0x34/0x4c
 [] .bus_add_driver+0x264/0x2a4
 [] .driver_register+0x94/0x160
 [] .__platform_driver_register+0x60/0x7c
 [] .fman_port_load+0x28/0x64
 [] .do_one_initcall+0xd4/0x1a8
 [] .kernel_init_freeable+0x1bc/0x2a4
 [] .kernel_init+0x24/0x138

Indeed, the new resource (created in __request_region) will be linked
to the given resource living on the stack, which will end its lifetime
after the function calling __devm_request_region has finished.
This means the newly allocated resource is no longer reachable.

Now that the main fman driver is no longer reserving the region
used by fman-port, this previous hack is no longer needed
and we can use the regular call to devm_request_mem_region instead,
solving those bugs at the same time.

Signed-off-by: Patrick Havelange 
---
  drivers/net/ethernet/freescale/fman/fman_port.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman_port.c
b/drivers/net/ethernet/freescale/fman/fman_port.c
index d9baac0dbc7d..354974939d9d 100644
--- a/drivers/net/ethernet/freescale/fman/fman_port.c
+++ b/drivers/net/ethernet/freescale/fman/fman_port.c
@@ -1878,10 +1878,10 @@ static int fman_port_probe(struct platform_device
*of_dev)

of_node_put(port_node);

-   dev_res = __devm_request_region(port->dev, &res, res.start,
-   resource_size(&res), "fman-port");
+   dev_res = devm_request_mem_region(port->dev, res.start,
+ resource_size(&res), "fman-port");
if (!dev_res) {
-   dev_err(port->dev, "%s: __devm_request_region() failed\n",
+   dev_err(port->dev, "%s: devm_request_mem_region() failed\n",
__func__);
err = -EINVAL;
goto free_port;
--
2.17.1


Hi Patrick,

please send a fix for the issue, targeting the net tree, separate from the
other change you are trying to introduce. We need a better explanation for
the latter and it should go through the net-next tree, if accepted.


Hello,

I've resent the series with a cover letter having a bit more 
explanations. I think all the patches should be applied on net, as they 
are linked to the same issue/resolution, and are not independent.


BR,

Patrick H.


[PATCH net 3/4] net: freescale/fman-mac: remove direct use of __devm_request_region

2020-12-03 Thread Patrick Havelange
Since the main fman driver is no longer reserving the complete fman
memory region, there is no longer any need for a custom call to
__devm_request_region, so replace it with devm_request_mem_region.

Signed-off-by: Patrick Havelange 
---
 drivers/net/ethernet/freescale/fman/mac.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/mac.c 
b/drivers/net/ethernet/freescale/fman/mac.c
index 901749a7a318..35ca5aed 100644
--- a/drivers/net/ethernet/freescale/fman/mac.c
+++ b/drivers/net/ethernet/freescale/fman/mac.c
@@ -690,12 +690,10 @@ static int mac_probe(struct platform_device *_of_dev)
goto _return_of_get_parent;
}
 
-   mac_dev->res = __devm_request_region(dev,
-fman_get_mem_region(priv->fman),
-res.start, resource_size(&res),
-"mac");
+   mac_dev->res = devm_request_mem_region(dev, res.start,
+  resource_size(&res), "mac");
if (!mac_dev->res) {
-   dev_err(dev, "__devm_request_mem_region(mac) failed\n");
+   dev_err(dev, "devm_request_mem_region(mac) failed\n");
err = -EBUSY;
goto _return_of_get_parent;
}
-- 
2.17.1



[PATCH net 2/4] net: freescale/fman-port: remove direct use of __devm_request_region

2020-12-03 Thread Patrick Havelange
This driver was directly calling __devm_request_region with a specific
resource on the stack as parameter. This is invalid as
__devm_request_region expects the given resource passed as parameter
to live longer than the call itself, as the pointer to the resource
will be stored inside the internal struct used by the devres
management.

In addition to this issue, a related bug has been found by kmemleak
with this trace :
unreferenced object 0xc001efc01880 (size 64):
  comm "swapper/0", pid 1, jiffies 4294669078 (age 3620.536s)
  hex dump (first 32 bytes):
00 00 00 0f fe 4a c0 00 00 00 00 0f fe 4a cf ff  .J...J..
c0 00 00 00 00 ee 9d 98 00 00 00 00 80 00 02 00  
  backtrace:
[] .alloc_resource+0xb8/0xe0
[] .__request_region+0x70/0x1c4
[] .__devm_request_region+0x8c/0x138
[] .fman_port_probe+0x170/0x420
[] .platform_drv_probe+0x84/0x108
[] .driver_probe_device+0x2c4/0x394
[] .__driver_attach+0x124/0x128
[] .bus_for_each_dev+0xb4/0x110
[] .driver_attach+0x34/0x4c
[] .bus_add_driver+0x264/0x2a4
[] .driver_register+0x94/0x160
[] .__platform_driver_register+0x60/0x7c
[] .fman_port_load+0x28/0x64
[] .do_one_initcall+0xd4/0x1a8
[] .kernel_init_freeable+0x1bc/0x2a4
[] .kernel_init+0x24/0x138

Indeed, the new resource (created in __request_region) will be linked
to the given resource living on the stack, which will end its lifetime
after the function calling __devm_request_region has finished.
This means the newly allocated resource is no longer reachable.

Now that the main fman driver is no longer reserving the region
used by fman-port, this previous hack is no longer needed
and we can use the regular call to devm_request_mem_region instead,
solving those bugs at the same time.

Signed-off-by: Patrick Havelange 
---
 drivers/net/ethernet/freescale/fman/fman_port.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman_port.c 
b/drivers/net/ethernet/freescale/fman/fman_port.c
index d9baac0dbc7d..354974939d9d 100644
--- a/drivers/net/ethernet/freescale/fman/fman_port.c
+++ b/drivers/net/ethernet/freescale/fman/fman_port.c
@@ -1878,10 +1878,10 @@ static int fman_port_probe(struct platform_device 
*of_dev)
 
of_node_put(port_node);
 
-   dev_res = __devm_request_region(port->dev, &res, res.start,
-   resource_size(&res), "fman-port");
+   dev_res = devm_request_mem_region(port->dev, res.start,
+ resource_size(&res), "fman-port");
if (!dev_res) {
-   dev_err(port->dev, "%s: __devm_request_region() failed\n",
+   dev_err(port->dev, "%s: devm_request_mem_region() failed\n",
__func__);
err = -EINVAL;
goto free_port;
-- 
2.17.1
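
The lifetime rule this commit message describes can be reproduced in a
few lines of plain C; everything below is a made-up analogue of the
devres pattern, not kernel code:

#include <stdio.h>

struct region {
	long start, end;
	struct region *parent;
};

/* registry entry that outlives the probe call, like the devres list */
static struct region registered;

/* stand-in for __request_region(): it stores the parent pointer */
static void request_region_like(struct region *parent)
{
	registered.parent = parent;	/* parent must outlive 'registered' */
}

static void probe_like(void)
{
	struct region res = { 0x100, 0x1ff, NULL };	/* on the stack */
	request_region_like(&res);
}	/* res dies here: registered.parent now dangles */

int main(void)
{
	probe_like();
	/* any later use of registered.parent is a use-after-scope bug */
	printf("dangling parent pointer: %p\n", (void *)registered.parent);
	return 0;
}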



[PATCH net 4/4] net: freescale/fman: remove fman_get_mem_region

2020-12-03 Thread Patrick Havelange
This function is no longer used, so we can remove it.
The pointer to the resource that was kept inside
struct fman_state_struct can also be removed for the same reason.

Signed-off-by: Patrick Havelange 
---
 drivers/net/ethernet/freescale/fman/fman.c | 17 -
 drivers/net/ethernet/freescale/fman/fman.h |  2 --
 2 files changed, 19 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.c 
b/drivers/net/ethernet/freescale/fman/fman.c
index 2e85209d560d..bf78e12a1683 100644
--- a/drivers/net/ethernet/freescale/fman/fman.c
+++ b/drivers/net/ethernet/freescale/fman/fman.c
@@ -531,8 +531,6 @@ struct fman_state_struct {
 
u32 qman_channel_base;
u32 num_of_qman_channels;
-
-   struct resource *res;
 };
 
 /* Structure that holds FMan initial configuration */
@@ -1737,7 +1735,6 @@ static int fman_config(struct fman *fman)
fman->state->qman_channel_base = fman->dts_params.qman_channel_base;
fman->state->num_of_qman_channels =
fman->dts_params.num_of_qman_channels;
-   fman->state->res = fman->dts_params.res;
fman->exception_cb = fman_exceptions;
fman->bus_error_cb = fman_bus_error;
fman->fpm_regs = fman->dts_params.base_addr_pol + FPM_OFFSET_FROM_POL;
@@ -2405,20 +2402,6 @@ u32 fman_get_qman_channel_id(struct fman *fman, u32 
port_id)
 }
 EXPORT_SYMBOL(fman_get_qman_channel_id);
 
-/**
- * fman_get_mem_region
- * @fman:  A Pointer to FMan device
- *
- * Get FMan memory region
- *
- * Return: A structure with FMan memory region information
- */
-struct resource *fman_get_mem_region(struct fman *fman)
-{
-   return fman->state->res;
-}
-EXPORT_SYMBOL(fman_get_mem_region);
-
 /* Bootargs defines */
 /* Extra headroom for RX buffers - Default, min and max */
 #define FSL_FM_RX_EXTRA_HEADROOM   64
diff --git a/drivers/net/ethernet/freescale/fman/fman.h 
b/drivers/net/ethernet/freescale/fman/fman.h
index e6b339c57230..e326aa37b8b2 100644
--- a/drivers/net/ethernet/freescale/fman/fman.h
+++ b/drivers/net/ethernet/freescale/fman/fman.h
@@ -398,8 +398,6 @@ int fman_set_mac_max_frame(struct fman *fman, u8 mac_id, 
u16 mfl);
 
 u32 fman_get_qman_channel_id(struct fman *fman, u32 port_id);
 
-struct resource *fman_get_mem_region(struct fman *fman);
-
 u16 fman_get_max_frm(void);
 
 int fman_get_rx_extra_headroom(void);
-- 
2.17.1



[PATCH net 1/4] net: freescale/fman: Split the main resource region reservation

2020-12-03 Thread Patrick Havelange
The main fman driver is only using some parts of the fman memory
region.
Split the reservation of the main region in 2, so that the other
regions that will be used by fman-port and fman-mac drivers can
be reserved properly and not be in conflict with the main fman
reservation.

Signed-off-by: Patrick Havelange 
---
 drivers/net/ethernet/freescale/fman/fman.c | 103 +
 drivers/net/ethernet/freescale/fman/fman.h |   9 +-
 2 files changed, 69 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.c 
b/drivers/net/ethernet/freescale/fman/fman.c
index ce0a121580f6..2e85209d560d 100644
--- a/drivers/net/ethernet/freescale/fman/fman.c
+++ b/drivers/net/ethernet/freescale/fman/fman.c
@@ -58,12 +58,15 @@
 /* Modules registers offsets */
 #define BMI_OFFSET 0x0008
 #define QMI_OFFSET 0x00080400
-#define KG_OFFSET  0x000C1000
-#define DMA_OFFSET 0x000C2000
-#define FPM_OFFSET 0x000C3000
-#define IMEM_OFFSET0x000C4000
-#define HWP_OFFSET 0x000C7000
-#define CGP_OFFSET 0x000DB000
+#define SIZE_REGION_0  0x00081000
+#define POL_OFFSET 0x000C
+#define KG_OFFSET_FROM_POL 0x1000
+#define DMA_OFFSET_FROM_POL0x2000
+#define FPM_OFFSET_FROM_POL0x3000
+#define IMEM_OFFSET_FROM_POL   0x4000
+#define HWP_OFFSET_FROM_POL0x7000
+#define CGP_OFFSET_FROM_POL0x0001B000
+#define SIZE_REGION_FROM_POL   0x0002
 
 /* Exceptions bit map */
 #define EX_DMA_BUS_ERROR   0x8000
@@ -1433,7 +1436,7 @@ static int clear_iram(struct fman *fman)
struct fman_iram_regs __iomem *iram;
int i, count;
 
-   iram = fman->base_addr + IMEM_OFFSET;
+   iram = fman->base_addr_pol + IMEM_OFFSET_FROM_POL;
 
/* Enable the auto-increment */
iowrite32be(IRAM_IADD_AIE, &iram->iadd);
@@ -1710,11 +1713,8 @@ static int set_num_of_open_dmas(struct fman *fman, u8 
port_id,
 
 static int fman_config(struct fman *fman)
 {
-   void __iomem *base_addr;
int err;
 
-   base_addr = fman->dts_params.base_addr;
-
fman->state = kzalloc(sizeof(*fman->state), GFP_KERNEL);
if (!fman->state)
goto err_fm_state;
@@ -1740,13 +1740,14 @@ static int fman_config(struct fman *fman)
fman->state->res = fman->dts_params.res;
fman->exception_cb = fman_exceptions;
fman->bus_error_cb = fman_bus_error;
-   fman->fpm_regs = base_addr + FPM_OFFSET;
-   fman->bmi_regs = base_addr + BMI_OFFSET;
-   fman->qmi_regs = base_addr + QMI_OFFSET;
-   fman->dma_regs = base_addr + DMA_OFFSET;
-   fman->hwp_regs = base_addr + HWP_OFFSET;
-   fman->kg_regs = base_addr + KG_OFFSET;
-   fman->base_addr = base_addr;
+   fman->fpm_regs = fman->dts_params.base_addr_pol + FPM_OFFSET_FROM_POL;
+   fman->bmi_regs = fman->dts_params.base_addr_0 + BMI_OFFSET;
+   fman->qmi_regs = fman->dts_params.base_addr_0 + QMI_OFFSET;
+   fman->dma_regs = fman->dts_params.base_addr_pol + DMA_OFFSET_FROM_POL;
+   fman->hwp_regs = fman->dts_params.base_addr_pol + HWP_OFFSET_FROM_POL;
+   fman->kg_regs = fman->dts_params.base_addr_pol + KG_OFFSET_FROM_POL;
+   fman->base_addr_0 = fman->dts_params.base_addr_0;
+   fman->base_addr_pol = fman->dts_params.base_addr_pol;
 
spin_lock_init(&fman->spinlock);
fman_defconfig(fman->cfg);
@@ -1937,8 +1938,8 @@ static int fman_init(struct fman *fman)
fman->state->exceptions &= ~FMAN_EX_QMI_SINGLE_ECC;
 
/* clear CPG */
-   memset_io((void __iomem *)(fman->base_addr + CGP_OFFSET), 0,
- fman->state->fm_port_num_of_cg);
+   memset_io((void __iomem *)(fman->base_addr_pol + CGP_OFFSET_FROM_POL),
+ 0, fman->state->fm_port_num_of_cg);
 
/* Save LIODN info before FMan reset
 * Skipping non-existent port 0 (i = 1)
@@ -2717,13 +2718,11 @@ static struct fman *read_dts_node(struct 
platform_device *of_dev)
 {
struct fman *fman;
struct device_node *fm_node, *muram_node;
-   struct resource *res;
+   struct resource *tmp_res, *main_res;
u32 val, range[2];
int err, irq;
struct clk *clk;
u32 clk_rate;
-   phys_addr_t phys_base_addr;
-   resource_size_t mem_size;
 
fman = kzalloc(sizeof(*fman), GFP_KERNEL);
if (!fman)
@@ -2740,34 +2739,31 @@ static struct fman *read_dts_node(struct 
platform_device *of_dev)
fman->dts_params.id = (u8)val;
 
/* Get the FM interrupt */
-   res = platform_get_resource(of_dev, IORESOURCE_IRQ, 0);
-   if (!res) {
+   tmp_res = platform_get_resource(of_dev, IORESOURCE_IRQ, 0);
+   if (!tmp_res) {
dev_err(&of_dev->dev, "%s: Can't get FMan IR

[PATCH net 0/4] freescale/fman: remove usage of __devm_request_region

2020-12-03 Thread Patrick Havelange
Hello, 

This patch series solves a bug inside
ethernet/freescale/fman/fman_port.c caused by an incorrect usage of
__devm_request_region.
The bug comes from the fact that the resource given as parameter to the
function __devm_request_region is on the stack, leading to an invalid
pointer being stored inside the devres structure, and the new resource
pointer being lost.
In order to solve this, it is first necessary to make the regular call
devm_request_mem_region work.
This call cannot work as is, because the main fman driver has already
reserved the memory region used by the fman_port driver.

Thus the first action is to split the memory region reservation done in
the main fman driver in two, so that the regions used by fman_port will
not be reserved. The main memory regions have also been reduced to not
request the memory regions used by fman_mac.

Once this first step is done, regular usage of devm_request_mem_region
can be done.
This also leads to a bit of code removal done in the last patch.

1. fman: split the memory region of the main fman driver, so the memory that
will be used by fman_port & fman_mac will not be already reserved.

2. fman_port: replace the call to __devm_request_region with devm_request_mem_region

3. fman_mac: replace the call to __devm_request_region with devm_request_mem_region

4. the code is now simplified a bit, there is no need to store the main region
anymore

The whole point of this series is to be able to apply the patch N°2.
Since N°3 is a similar change, let's do it at the same time.

I think they all should be part of the same series.

Patrick Havelange (4):
  net: freescale/fman: Split the main resource region reservation
  net: freescale/fman-port: remove direct use of __devm_request_region
  net: freescale/fman-mac: remove direct use of __devm_request_region
  net: freescale/fman: remove fman_get_mem_region

 drivers/net/ethernet/freescale/fman/fman.c| 120 +-
 drivers/net/ethernet/freescale/fman/fman.h|  11 +-
 .../net/ethernet/freescale/fman/fman_port.c   |   6 +-
 drivers/net/ethernet/freescale/fman/mac.c |   8 +-
 4 files changed, 75 insertions(+), 70 deletions(-)


base-commit: 832e09798c261cf58de3a68cfcc6556408c16a5a
-- 
2.17.1
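
In outline, the split amounts to two reservations made with the offsets
patch 1/4 defines. A sketch of the intent, assuming read_dts_node()
performs them on the parent region (error handling elided; the exact
call sites in the patch may differ):

	/* region 0: the block holding the BMI/QMI registers */
	if (!devm_request_mem_region(&of_dev->dev, main_res->start,
				     SIZE_REGION_0, "fman"))
		goto fman_free;

	/* the policer-relative block (KG/DMA/FPM/IMEM/HWP/CGP) */
	if (!devm_request_mem_region(&of_dev->dev,
				     main_res->start + POL_OFFSET,
				     SIZE_REGION_FROM_POL, "fman"))
		goto fman_free;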



[PATCH 4/4] net: freescale/fman: remove fman_get_mem_region

2020-12-02 Thread Patrick Havelange
This function is no longer used, so we can remove it.
The pointer to the resource that was kept inside
struct fman_state_struct can also be removed for the same reason.

Signed-off-by: Patrick Havelange 
---
 drivers/net/ethernet/freescale/fman/fman.c | 17 -
 drivers/net/ethernet/freescale/fman/fman.h |  2 --
 2 files changed, 19 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.c 
b/drivers/net/ethernet/freescale/fman/fman.c
index 2e85209d560d..bf78e12a1683 100644
--- a/drivers/net/ethernet/freescale/fman/fman.c
+++ b/drivers/net/ethernet/freescale/fman/fman.c
@@ -531,8 +531,6 @@ struct fman_state_struct {
 
u32 qman_channel_base;
u32 num_of_qman_channels;
-
-   struct resource *res;
 };
 
 /* Structure that holds FMan initial configuration */
@@ -1737,7 +1735,6 @@ static int fman_config(struct fman *fman)
fman->state->qman_channel_base = fman->dts_params.qman_channel_base;
fman->state->num_of_qman_channels =
fman->dts_params.num_of_qman_channels;
-   fman->state->res = fman->dts_params.res;
fman->exception_cb = fman_exceptions;
fman->bus_error_cb = fman_bus_error;
fman->fpm_regs = fman->dts_params.base_addr_pol + FPM_OFFSET_FROM_POL;
@@ -2405,20 +2402,6 @@ u32 fman_get_qman_channel_id(struct fman *fman, u32 
port_id)
 }
 EXPORT_SYMBOL(fman_get_qman_channel_id);
 
-/**
- * fman_get_mem_region
- * @fman:  A Pointer to FMan device
- *
- * Get FMan memory region
- *
- * Return: A structure with FMan memory region information
- */
-struct resource *fman_get_mem_region(struct fman *fman)
-{
-   return fman->state->res;
-}
-EXPORT_SYMBOL(fman_get_mem_region);
-
 /* Bootargs defines */
 /* Extra headroom for RX buffers - Default, min and max */
 #define FSL_FM_RX_EXTRA_HEADROOM   64
diff --git a/drivers/net/ethernet/freescale/fman/fman.h 
b/drivers/net/ethernet/freescale/fman/fman.h
index e6b339c57230..e326aa37b8b2 100644
--- a/drivers/net/ethernet/freescale/fman/fman.h
+++ b/drivers/net/ethernet/freescale/fman/fman.h
@@ -398,8 +398,6 @@ int fman_set_mac_max_frame(struct fman *fman, u8 mac_id, 
u16 mfl);
 
 u32 fman_get_qman_channel_id(struct fman *fman, u32 port_id);
 
-struct resource *fman_get_mem_region(struct fman *fman);
-
 u16 fman_get_max_frm(void);
 
 int fman_get_rx_extra_headroom(void);
-- 
2.17.1



[PATCH 3/4] net: freescale/fman-mac: remove direct use of __devm_request_region

2020-12-02 Thread Patrick Havelange
Since the main fman driver is no longer reserving the complete fman
memory region, there is no longer any need for a custom call to
__devm_request_region, so replace it with devm_request_mem_region.

Signed-off-by: Patrick Havelange 
---
 drivers/net/ethernet/freescale/fman/mac.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/mac.c 
b/drivers/net/ethernet/freescale/fman/mac.c
index 901749a7a318..35ca5aed 100644
--- a/drivers/net/ethernet/freescale/fman/mac.c
+++ b/drivers/net/ethernet/freescale/fman/mac.c
@@ -690,12 +690,10 @@ static int mac_probe(struct platform_device *_of_dev)
goto _return_of_get_parent;
}
 
-   mac_dev->res = __devm_request_region(dev,
-fman_get_mem_region(priv->fman),
-res.start, resource_size(&res),
-"mac");
+   mac_dev->res = devm_request_mem_region(dev, res.start,
+  resource_size(&res), "mac");
if (!mac_dev->res) {
-   dev_err(dev, "__devm_request_mem_region(mac) failed\n");
+   dev_err(dev, "devm_request_mem_region(mac) failed\n");
err = -EBUSY;
goto _return_of_get_parent;
}
-- 
2.17.1



[PATCH 2/4] net: freescale/fman-port: remove direct use of __devm_request_region

2020-12-02 Thread Patrick Havelange
This driver was directly calling __devm_request_region with a specific
resource on the stack as parameter. This is invalid as
__devm_request_region expects the given resource passed as parameter
to live longer than the call itself, as the pointer to the resource
will be stored inside the internal struct used by the devres
management.

In addition to this issue, a related bug has been found by kmemleak
with this trace :
unreferenced object 0xc001efc01880 (size 64):
  comm "swapper/0", pid 1, jiffies 4294669078 (age 3620.536s)
  hex dump (first 32 bytes):
00 00 00 0f fe 4a c0 00 00 00 00 0f fe 4a cf ff  .J...J..
c0 00 00 00 00 ee 9d 98 00 00 00 00 80 00 02 00  
  backtrace:
[] .alloc_resource+0xb8/0xe0
[] .__request_region+0x70/0x1c4
[] .__devm_request_region+0x8c/0x138
[] .fman_port_probe+0x170/0x420
[] .platform_drv_probe+0x84/0x108
[] .driver_probe_device+0x2c4/0x394
[] .__driver_attach+0x124/0x128
[] .bus_for_each_dev+0xb4/0x110
[] .driver_attach+0x34/0x4c
[] .bus_add_driver+0x264/0x2a4
[] .driver_register+0x94/0x160
[] .__platform_driver_register+0x60/0x7c
[] .fman_port_load+0x28/0x64
[] .do_one_initcall+0xd4/0x1a8
[] .kernel_init_freeable+0x1bc/0x2a4
[] .kernel_init+0x24/0x138

Indeed, the new resource (created in __request_region) will be linked
to the given resource living on the stack, which will end its lifetime
after the function calling __devm_request_region has finished.
This means the newly allocated resource is no longer reachable.

Now that the main fman driver is no longer reserving the region
used by fman-port, this previous hack is no longer needed
and we can use the regular call to devm_request_mem_region instead,
solving those bugs at the same time.

Signed-off-by: Patrick Havelange 
---
 drivers/net/ethernet/freescale/fman/fman_port.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman_port.c 
b/drivers/net/ethernet/freescale/fman/fman_port.c
index d9baac0dbc7d..354974939d9d 100644
--- a/drivers/net/ethernet/freescale/fman/fman_port.c
+++ b/drivers/net/ethernet/freescale/fman/fman_port.c
@@ -1878,10 +1878,10 @@ static int fman_port_probe(struct platform_device 
*of_dev)
 
of_node_put(port_node);
 
-   dev_res = __devm_request_region(port->dev, &res, res.start,
-   resource_size(&res), "fman-port");
+   dev_res = devm_request_mem_region(port->dev, res.start,
+ resource_size(&res), "fman-port");
if (!dev_res) {
-   dev_err(port->dev, "%s: __devm_request_region() failed\n",
+   dev_err(port->dev, "%s: devm_request_mem_region() failed\n",
__func__);
err = -EINVAL;
goto free_port;
-- 
2.17.1



[PATCH 1/4] net: freescale/fman: Split the main resource region reservation

2020-12-02 Thread Patrick Havelange
The main fman driver is only using some parts of the fman memory
region.
Split the reservation of the main region in 2, so that the other
regions that will be used by fman-port and fman-mac drivers can
be reserved properly and not be in conflict with the main fman
reservation.

Signed-off-by: Patrick Havelange 
---
 drivers/net/ethernet/freescale/fman/fman.c | 103 +
 drivers/net/ethernet/freescale/fman/fman.h |   9 +-
 2 files changed, 69 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.c 
b/drivers/net/ethernet/freescale/fman/fman.c
index ce0a121580f6..2e85209d560d 100644
--- a/drivers/net/ethernet/freescale/fman/fman.c
+++ b/drivers/net/ethernet/freescale/fman/fman.c
@@ -58,12 +58,15 @@
 /* Modules registers offsets */
 #define BMI_OFFSET 0x0008
 #define QMI_OFFSET 0x00080400
-#define KG_OFFSET  0x000C1000
-#define DMA_OFFSET 0x000C2000
-#define FPM_OFFSET 0x000C3000
-#define IMEM_OFFSET0x000C4000
-#define HWP_OFFSET 0x000C7000
-#define CGP_OFFSET 0x000DB000
+#define SIZE_REGION_0  0x00081000
+#define POL_OFFSET 0x000C
+#define KG_OFFSET_FROM_POL 0x1000
+#define DMA_OFFSET_FROM_POL0x2000
+#define FPM_OFFSET_FROM_POL0x3000
+#define IMEM_OFFSET_FROM_POL   0x4000
+#define HWP_OFFSET_FROM_POL0x7000
+#define CGP_OFFSET_FROM_POL0x0001B000
+#define SIZE_REGION_FROM_POL   0x0002
 
 /* Exceptions bit map */
 #define EX_DMA_BUS_ERROR   0x8000
@@ -1433,7 +1436,7 @@ static int clear_iram(struct fman *fman)
struct fman_iram_regs __iomem *iram;
int i, count;
 
-   iram = fman->base_addr + IMEM_OFFSET;
+   iram = fman->base_addr_pol + IMEM_OFFSET_FROM_POL;
 
/* Enable the auto-increment */
iowrite32be(IRAM_IADD_AIE, &iram->iadd);
@@ -1710,11 +1713,8 @@ static int set_num_of_open_dmas(struct fman *fman, u8 
port_id,
 
 static int fman_config(struct fman *fman)
 {
-   void __iomem *base_addr;
int err;
 
-   base_addr = fman->dts_params.base_addr;
-
fman->state = kzalloc(sizeof(*fman->state), GFP_KERNEL);
if (!fman->state)
goto err_fm_state;
@@ -1740,13 +1740,14 @@ static int fman_config(struct fman *fman)
fman->state->res = fman->dts_params.res;
fman->exception_cb = fman_exceptions;
fman->bus_error_cb = fman_bus_error;
-   fman->fpm_regs = base_addr + FPM_OFFSET;
-   fman->bmi_regs = base_addr + BMI_OFFSET;
-   fman->qmi_regs = base_addr + QMI_OFFSET;
-   fman->dma_regs = base_addr + DMA_OFFSET;
-   fman->hwp_regs = base_addr + HWP_OFFSET;
-   fman->kg_regs = base_addr + KG_OFFSET;
-   fman->base_addr = base_addr;
+   fman->fpm_regs = fman->dts_params.base_addr_pol + FPM_OFFSET_FROM_POL;
+   fman->bmi_regs = fman->dts_params.base_addr_0 + BMI_OFFSET;
+   fman->qmi_regs = fman->dts_params.base_addr_0 + QMI_OFFSET;
+   fman->dma_regs = fman->dts_params.base_addr_pol + DMA_OFFSET_FROM_POL;
+   fman->hwp_regs = fman->dts_params.base_addr_pol + HWP_OFFSET_FROM_POL;
+   fman->kg_regs = fman->dts_params.base_addr_pol + KG_OFFSET_FROM_POL;
+   fman->base_addr_0 = fman->dts_params.base_addr_0;
+   fman->base_addr_pol = fman->dts_params.base_addr_pol;
 
spin_lock_init(&fman->spinlock);
fman_defconfig(fman->cfg);
@@ -1937,8 +1938,8 @@ static int fman_init(struct fman *fman)
fman->state->exceptions &= ~FMAN_EX_QMI_SINGLE_ECC;
 
/* clear CPG */
-   memset_io((void __iomem *)(fman->base_addr + CGP_OFFSET), 0,
- fman->state->fm_port_num_of_cg);
+   memset_io((void __iomem *)(fman->base_addr_pol + CGP_OFFSET_FROM_POL),
+ 0, fman->state->fm_port_num_of_cg);
 
/* Save LIODN info before FMan reset
 * Skipping non-existent port 0 (i = 1)
@@ -2717,13 +2718,11 @@ static struct fman *read_dts_node(struct 
platform_device *of_dev)
 {
struct fman *fman;
struct device_node *fm_node, *muram_node;
-   struct resource *res;
+   struct resource *tmp_res, *main_res;
u32 val, range[2];
int err, irq;
struct clk *clk;
u32 clk_rate;
-   phys_addr_t phys_base_addr;
-   resource_size_t mem_size;
 
fman = kzalloc(sizeof(*fman), GFP_KERNEL);
if (!fman)
@@ -2740,34 +2739,31 @@ static struct fman *read_dts_node(struct 
platform_device *of_dev)
fman->dts_params.id = (u8)val;
 
/* Get the FM interrupt */
-   res = platform_get_resource(of_dev, IORESOURCE_IRQ, 0);
-   if (!res) {
+   tmp_res = platform_get_resource(of_dev, IORESOURCE_IRQ, 0);
+   if (!tmp_res) {
dev_err(&of_dev->dev, "%s: Can't get FMan IR

Re: [PATCH 0/4] ARM: dts: aspeed: Add Facebook Galaxy100 BMC

2020-11-11 Thread Patrick Williams
On Wed, Nov 11, 2020 at 11:34:10PM +, Joel Stanley wrote:
> On Wed, 11 Nov 2020 at 23:23,  wrote:
> >
> > From: Tao Ren 
> >
> > The patch series adds the initial version of device tree for Facebook
> > Galaxy100 (AST2400) BMC.
> >
> > Patch #1 adds common dtsi to minimize duplicated device entries across
> > Facebook Network AST2400 BMC device trees.
> >
> > Patch #2 simplifies Wedge40 device tree by using the common dtsi.
> >
> > Patch #3 simplifies Wedge100 device tree by using the common dtsi.
> >
> > Patch #4 adds the initial version of device tree for Facebook Galaxy100
> > BMC.
> 
> Nice. They look good to me.
> 
> Reviewed-by: Joel Stanley 
> 
> Is there another person familiar with the design you would like to
> review before I merge?

Also,

Reviewed-by: Patrick Williams 

-- 
Patrick Williams




Re: [PATCH v5 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-11-06 Thread Patrick Bellasi


Hi Yun,
thanks for continuing to improve this.

I'm replying here but still considering all other reviewers comments.

Best,
Patrick

On Tue, Nov 03, 2020 at 03:37:56 +0100, Yun Hsiang wrote...

> If the user wants to stop controlling uclamp and let the task inherit
> the value from the group, we need a method to reset.
>
> Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via
> sched_setattr syscall.
>
> The policy is
> _CLAMP_RESET   => reset both min and max
> _CLAMP_RESET | _CLAMP_MIN  => reset min value
> _CLAMP_RESET | _CLAMP_MAX  => reset max value
> _CLAMP_RESET | _CLAMP_MIN | _CLAMP_MAX => reset both min and max

This documentation should be added to the uapi header and, most
importantly, in:
  include/uapi/linux/sched/types.h
where the documentation for struct sched_attr is provided.


> Signed-off-by: Yun Hsiang 
> Reported-by: kernel test robot 
> ---
>  include/uapi/linux/sched.h |  7 +++--
>  kernel/sched/core.c| 59 --
>  2 files changed, 49 insertions(+), 17 deletions(-)
>
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 3bac0a8ceab2..6c823ddb1a1e 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -132,17 +132,20 @@ struct clone_args {
>  #define SCHED_FLAG_KEEP_PARAMS   0x10
>  #define SCHED_FLAG_UTIL_CLAMP_MIN0x20
>  #define SCHED_FLAG_UTIL_CLAMP_MAX0x40
> +#define SCHED_FLAG_UTIL_CLAMP_RESET  0x80
>  
>  #define SCHED_FLAG_KEEP_ALL  (SCHED_FLAG_KEEP_POLICY | \
>SCHED_FLAG_KEEP_PARAMS)
>  

(Related to the following discussion point)
What about adding in a comment here to call out that the following
definitions are "internal only"?
Moreover, we could probably wrap the following two define within an
#ifdef __KERNEL__/#endif block


Something like:

+ /*
+ * The following definitions are internal only, do not use them to set
+ * sched_{set,get}attr() params. Use instead a valid combination of the
+ * flags defined above.
+ */
+ #ifdef __KERNEL__

>  #define SCHED_FLAG_UTIL_CLAMP(SCHED_FLAG_UTIL_CLAMP_MIN | \
> -  SCHED_FLAG_UTIL_CLAMP_MAX)
> +  SCHED_FLAG_UTIL_CLAMP_MAX | \
> +  SCHED_FLAG_UTIL_CLAMP_RESET)

We need the _RESET flag only here (not below), since this is a UCLAMP
feature and it's worth/useful to have a single "all uclamp flags"
definition...

>  #define SCHED_FLAG_ALL   (SCHED_FLAG_RESET_ON_FORK   | \
>SCHED_FLAG_RECLAIM | \
>SCHED_FLAG_DL_OVERRUN  | \
>SCHED_FLAG_KEEP_ALL| \
> -  SCHED_FLAG_UTIL_CLAMP)
> +  SCHED_FLAG_UTIL_CLAMP  | \
> +  SCHED_FLAG_UTIL_CLAMP_RESET)

... i.e., you can drop the chunk above.

+ #endif /* __KERNEL__ */

Regarding Qais' comment, I had the same thought as Dietmar: I doubt there
are apps using _FLAGS_ALL from userspace. For DL tasks, since they
cannot fork, it makes no sense to specify, for example
_RESET_ON_FORK|_RECLAIM. For CFS/RT tasks, where UCLAMP is supported, it
makes no sense to specify DL flags.
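
To make the flag combinations from the changelog concrete, a userspace
sketch of calling the proposed interface (the 0x80 value comes from
this patch and is not in a released kernel; the struct mirrors
include/uapi/linux/sched/types.h):

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* local mirror of struct sched_attr, to avoid relying on libc headers */
struct my_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_RESET	0x80	/* proposed by this patch */

int main(void)
{
	struct my_sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	/* _CLAMP_RESET | _CLAMP_MIN: drop only the task-specific min clamp */
	attr.sched_flags = SCHED_FLAG_UTIL_CLAMP_RESET |
			   SCHED_FLAG_UTIL_CLAMP_MIN;

	return syscall(SYS_sched_setattr, 0, &attr, 0);
}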

It's true however that having this def here when it's supposed to be
used only internally, can be kind of "confusing", but it's also useful
to keep the definition aligned with the flags defined above.
The ifdef wrapping proposed above should make this even more explicit.

Perhaps we can also better call this out also with an additional note
right after:

  
https://elixir.bootlin.com/linux/v5.9.6/source/include/uapi/linux/sched/types.h#L43

In that file, I believe the "Task Utilization Attributes" section can
also be improved by adding a description of the _UCLAMP flags semantics.


>  #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8160ab5263f8..6ae463b64834 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1004,7 +1004,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum 
> uclamp_id clamp_id,
>   return uclamp_idle_value(rq, clamp_id, clamp_value);
>  }
>  
> -static void __uclamp_update_util_min_rt_default(struct task_struct *p)
> +static inline void __uclamp_update_util_min_rt_default(struct task_struct *p)
>  {

Again, IMO, this is _not_ an unrelated change at all. Actually, I still
would like to do one step more and inline this function in the _only
place_ where it's used. Qais' arguments for not doing that were [1]:

  Updating the default rt value is done from different contexts. Hence
  it is important to document the rules under which this update must

From the kindness of our heart!! You have been selected to receive (£1,000,000.00 British Pound)

2020-10-31 Thread Patrick and Frances Connolly
Dear Beneficiary, You have been selected to receive (£1,000,000.00 British 
Pound) as charity donation/grant from Patrick and Frances Connolly. Therefore, 
you are required to contact her through email for more details

patrickcon2...@gmail.com


Re: [PATCH v3 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-28 Thread Patrick Bellasi


Hi Dietmar, Yun,
I hope I'm not too late before v4 posting ;)

I think the overall approach is sound, I just added in a couple of
cleanups and a possible fix (user_defined reset).

Best,
Patrick


On Tue, Oct 27, 2020 at 16:58:13 +0100, Yun Hsiang wrote:

> Hi Dietmar,
> On Mon, Oct 26, 2020 at 08:00:48PM +0100, Dietmar Eggemann wrote:
>> On 26/10/2020 16:45, Yun Hsiang wrote:

[...]

>> I thought about something like this. Only lightly tested. 
>> 
>> ---8<---
>> 
>> From: Dietmar Eggemann 
>> Date: Mon, 26 Oct 2020 13:52:23 +0100
>> Subject: [PATCH] SCHED_FLAG_UTIL_CLAMP_RESET
>> 
>> Signed-off-by: Dietmar Eggemann 
>> ---
>>  include/uapi/linux/sched.h |  4 +++-
>>  kernel/sched/core.c| 31 +++
>>  2 files changed, 30 insertions(+), 5 deletions(-)
>> 
>> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
>> index 3bac0a8ceab2..0dd890822751 100644
>> --- a/include/uapi/linux/sched.h
>> +++ b/include/uapi/linux/sched.h
>> @@ -132,12 +132,14 @@ struct clone_args {
>>  #define SCHED_FLAG_KEEP_PARAMS  0x10
>>  #define SCHED_FLAG_UTIL_CLAMP_MIN   0x20
>>  #define SCHED_FLAG_UTIL_CLAMP_MAX   0x40
>> +#define SCHED_FLAG_UTIL_CLAMP_RESET 0x80
>>  
>>  #define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
>>   SCHED_FLAG_KEEP_PARAMS)
>>  
>>  #define SCHED_FLAG_UTIL_CLAMP   (SCHED_FLAG_UTIL_CLAMP_MIN | \
>> - SCHED_FLAG_UTIL_CLAMP_MAX)
>> + SCHED_FLAG_UTIL_CLAMP_MAX | \
>> + SCHED_FLAG_UTIL_CLAMP_RESET)
>>  
>>  #define SCHED_FLAG_ALL  (SCHED_FLAG_RESET_ON_FORK   | \
>>   SCHED_FLAG_RECLAIM | \
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 3dc415f58bd7..717b1cf5cf1f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1438,6 +1438,23 @@ static int uclamp_validate(struct task_struct *p,
>>  return 0;
>>  }
>>  
>> +static bool uclamp_reset(enum uclamp_id clamp_id, unsigned long flags)
>> +{

Maybe we can add in some comments?

/* No _UCLAMP_RESET flag set: do not reset */
>> +if (!(flags & SCHED_FLAG_UTIL_CLAMP_RESET))
>> +return false;
>> +

/* Only _UCLAMP_RESET flag set: reset both clamps */
>> +if (!(flags & (SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX)))
>> +return true;
>> +
/* Both _UCLAMP_RESET and _UCLAMP_MIN flags are set: reset only min */
>> +if ((flags & SCHED_FLAG_UTIL_CLAMP_MIN) && clamp_id == UCLAMP_MIN)
>> +return true;
>> +

/* Both _UCLAMP_RESET and _UCLAMP_MAX flags are set: reset only max */
>> +if ((flags & SCHED_FLAG_UTIL_CLAMP_MAX) && clamp_id == UCLAMP_MAX)
>> +return true;

Since the evaluation ordering is important, wouldn't it be better to
_always_ use READ_ONCE() for all flags accesses above, to ensure the
ordering is preserved?

>> +
>> +return false;
>> +}
>> +
>>  static void __setscheduler_uclamp(struct task_struct *p,
>>const struct sched_attr *attr)
>>  {
>> @@ -1449,24 +1466,30 @@ static void __setscheduler_uclamp(struct task_struct 
>> *p,
>>   */

Perhaps we should update the comment above this loop with something
like:

/*
 * Reset to default clamps on forced _UCLAMP_RESET (always) and
 * for tasks without a task-specific value (on scheduling class change).
 */
>>  for_each_clamp_id(clamp_id) {
>>  struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
>> +unsigned int value;
>>  
>>  /* Keep using defined clamps across class changes */
>> -if (uc_se->user_defined)
>> +if (!uclamp_reset(clamp_id, attr->sched_flags) &&
>> +uc_se->user_defined) {
>>  continue;
>> +}

I think we're missing a reset of the user_defined flag here.

What about replacing the above chunk with:

    if (uclamp_reset(clamp_id, attr->sched_flags))
        uc_se->user_defined = false;
    if (uc_se->user_defined)
        continue;

?


>>  
>>  /*
>>   * RT by default have a 100% boost value that could be modified
>>   * at runtime.
>>   */
>>  if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
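
A minimal, runnable userspace simulation of the uclamp_reset() decision
table above may help; the flag values follow the v3 patch, everything
else is illustrative scaffolding:

#include <stdbool.h>
#include <stdio.h>

#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
#define SCHED_FLAG_UTIL_CLAMP_RESET	0x80	/* proposed, not mainline */

enum uclamp_id { UCLAMP_MIN, UCLAMP_MAX };

static bool uclamp_reset(enum uclamp_id clamp_id, unsigned long flags)
{
	/* No _UCLAMP_RESET flag set: do not reset */
	if (!(flags & SCHED_FLAG_UTIL_CLAMP_RESET))
		return false;

	/* Only _UCLAMP_RESET set: reset both clamps */
	if (!(flags & (SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX)))
		return true;

	/* _UCLAMP_RESET + _UCLAMP_MIN: reset only the min clamp */
	if ((flags & SCHED_FLAG_UTIL_CLAMP_MIN) && clamp_id == UCLAMP_MIN)
		return true;

	/* _UCLAMP_RESET + _UCLAMP_MAX: reset only the max clamp */
	if ((flags & SCHED_FLAG_UTIL_CLAMP_MAX) && clamp_id == UCLAMP_MAX)
		return true;

	return false;
}

int main(void)
{
	unsigned long combos[] = {
		SCHED_FLAG_UTIL_CLAMP_RESET,
		SCHED_FLAG_UTIL_CLAMP_RESET | SCHED_FLAG_UTIL_CLAMP_MIN,
		SCHED_FLAG_UTIL_CLAMP_RESET | SCHED_FLAG_UTIL_CLAMP_MAX,
		SCHED_FLAG_UTIL_CLAMP_MIN,	/* no reset at all */
	};

	for (unsigned int i = 0; i < sizeof(combos) / sizeof(combos[0]); i++)
		printf("flags=0x%03lx: reset_min=%d reset_max=%d\n",
		       combos[i],
		       uclamp_reset(UCLAMP_MIN, combos[i]),
		       uclamp_reset(UCLAMP_MAX, combos[i]));
	return 0;
}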

Re: [PATCH v3 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-28 Thread Patrick Bellasi


On Wed, Oct 28, 2020 at 12:39:43 +0100, Qais Yousef  
wrote...

> On 10/28/20 11:11, Patrick Bellasi wrote:
>> >>  
>> >>   /*
>> >>* RT by default have a 100% boost value that could be modified
>> >>* at runtime.
>> >>*/
>> >>   if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
>> >> - __uclamp_update_util_min_rt_default(p);
>> >> + value = sysctl_sched_uclamp_util_min_rt_default;
>> 
>> By removing this usage of __uclamp_update_util_min_rt_default(p),
>> the only other usage remaining is the call from:
>>    uclamp_update_util_min_rt_default().
>> 
>> What about an additional cleanup by in-lining the only surviving usage?
>
> This is not a cleanup IMO. There are special rules about updating that are
> encoded and documented in this helper function. Namely:
>
>   * p->pi_lock must be held.
>   * p->uclamp_req[].user_defined must be false.

Both these conditions are satisfied in the above call site:
 - user_defined is tested just 4 lines above
 - pi_lock is taken by the caller, i.e. __sched_setscheduler()
Thus, there is no need to test them two times.

Moreover, the same granted pi_lock you check in
__uclamp_update_util_min_rt_default() is not checked at all in the rest
of __setscheduler_uclamp().

Thus, perhaps we should have just avoided adding
__uclamp_update_util_min_rt_default() from the beginning and:
 - have all its logic in the _only_ place where it's required
 - added the lockdep_assert_held() in __setscheduler_uclamp()

That's why I consider this a very good cleanup opportunity.

> I don't see how open coding helps; rather, it makes the code harder to read and
> prone to introduce bugs if anything gets reshuffled in the future.

It's not open coding IMHO, it's just adding the code that's required.



Good News!!

2020-10-27 Thread Frances & Patrick Connolly
You have a donation of £2,000,000.00 GBP from Frances & Patrick Connolly
and Family. Reply to this email: (francespatrickconnol...@gmail.com) for
more details.
Please Note: This is the third time we have been trying to contact you.
Best Regards
Mrs. Mary Hamilton.




[PATCH] ARM: dts: stm32: reorder spi4 within stm32mp15-pinctrl

2020-10-22 Thread Patrick Delaunay
Move spi4 to the right alphabetical place within stm32mp15-pinctrl.

Fixes: 4fe663890ac5 ("ARM: dts: stm32: Fix spi4 pins in stm32mp15-pinctrl")
Signed-off-by: Patrick Delaunay 
---

 arch/arm/boot/dts/stm32mp15-pinctrl.dtsi | 28 
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi 
b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
index d84686e00370..c9e514165672 100644
--- a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
+++ b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
@@ -1591,6 +1591,20 @@
};
};
 
+   spi4_pins_a: spi4-0 {
+   pins {
+   pinmux = , /* SPI4_SCK */
+;  /* SPI4_MOSI */
+   bias-disable;
+   drive-push-pull;
+   slew-rate = <1>;
+   };
+   pins2 {
+   pinmux = ; /* SPI4_MISO */
+   bias-disable;
+   };
+   };
+
uart4_pins_a: uart4-0 {
pins1 {
pinmux = ; /* UART4_TX */
@@ -1726,20 +1740,6 @@
};
};
 
-   spi4_pins_a: spi4-0 {
-   pins {
-   pinmux = , /* SPI4_SCK */
-;  /* SPI4_MOSI */
-   bias-disable;
-   drive-push-pull;
-   slew-rate = <1>;
-   };
-   pins2 {
-   pinmux = ; /* SPI4_MISO */
-   bias-disable;
-   };
-   };
-
usart2_pins_a: usart2-0 {
pins1 {
pinmux = , /* USART2_TX */
-- 
2.17.1



Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-14 Thread Patrick Bellasi


On Tue, Oct 13, 2020 at 22:25:48 +0200, Dietmar Eggemann 
 wrote...

Hi Dietmar,

> Hi Yun,
>
> On 12/10/2020 18:31, Yun Hsiang wrote:
>> If the user wants to stop controlling uclamp and let the task inherit
>> the value from the group, we need a method to reset.
>> 
>> Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via
>> sched_setattr syscall.
>
> before we decide on how to implement the 'uclamp user_defined reset'
> feature, could we come back to your use case in
> https://lkml.kernel.org/r/20201002053812.GA176142@ubuntu ?
>
> Lets just consider uclamp min for now. We have:
>
> (1) system-wide:
>
> # cat /proc/sys/kernel/sched_util_clamp_min
>
> 1024
>
> (2) tg (hierarchy) with top-app's cpu.uclamp.min to ~200 (20% of 1024):
>
> # cat /sys/fs/cgroup/cpu/top-app/cpu.uclamp.min
> 20
>
> (3) and 2 cfs tasks A and B in top-app:
>
> # cat /sys/fs/cgroup/cpu/top-app/tasks
>
> pid_A
> pid_B
>
> Then you set A and B's uclamp min to 100. A and B are now user_defined.
> A and B's effective uclamp min value is 100.
>
> Since the task uclamp min values (3) are less than (1) and (2), their
> uclamp min value is not affected by (1) or (2).
>
> If A doesn't want to control itself anymore, it can set its uclamp min
> to e.g. 300. Now A's effective uclamp min value is ~200, i.e. controlled
> by (2), the one of B stays 100.
>
> So the policy is:
>
> (a) If the user_defined task wants to control its uclamp, use task
> uclamp value less than the tg (hierarchy) (and the system-wide)
> value.
>
> (b) If the user_defined task doesn't want to control its uclamp
> anymore, use a uclamp value greater than or equal to the tg (hierarchy)
> (and the system-wide) value.
>
> So where exactly is the use case which would require a 'uclamp
> user_defined reset' functionality?

I'm not sure what specific use-case Yun is after, but I have at least
one in mind.

Let's say a task does not need boosting at all, independently of the
cgroup it's configured to run in. We can go ahead and set its
task-specific value to util_min=0.

In this case, when the task is running alone on a CPU, it will always
get the minimum OPP, independently of its utilization.

Now, after a while (e.g. some special event happens) we want to relax
this constraint and allow the task to run:
  1. at whatever OPP is required by its utilization
  2. with any additional boost possibly enforced by its cgroup

Right now we have only quite cumbersome or hackish solutions:
  a) go check the current cgroup util_min value and set for the task
  something higher than that
  b) set task::util_min=1024 thus asking for the maximum possible boost

Solution a) is more code for userspace and it's also racy. Solution b)
is misleading since the task does not really want to run at 1024.
It's also potentially overkill in case the task should be moved to
the root group, which is normally unbounded; thus the task will always
get executed at the max OPP without any specific reason.

A simple _UCLAMP_RESET flag will allow user-space to easily switch a
task to the default behavior (follow utilization or recommended
boosts) which is what a task usually gets when it does not opt-in to
uclamp.

Looking forward to seeing if Yun has an even more specific use-case.
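
A minimal userspace sketch of the use case above, assuming the proposed
SCHED_FLAG_UTIL_CLAMP_RESET flag (value 0x80 as in the later v3 patch;
not a mainline flag): the task first opts out of any boost with
util_min=0, then, after the "special event", goes back to the defaults:

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Existing uapi flags */
#define SCHED_FLAG_KEEP_POLICY		0x08
#define SCHED_FLAG_KEEP_PARAMS		0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
/* Proposed in this thread */
#define SCHED_FLAG_UTIL_CLAMP_RESET	0x80

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime, sched_deadline, sched_period;
	uint32_t sched_util_min, sched_util_max;
};

static long set_attr(struct sched_attr *attr)
{
	/* pid 0: act on the calling task */
	return syscall(SYS_sched_setattr, 0, attr, 0);
}

int main(void)
{
	struct sched_attr attr = {
		.size = sizeof(attr),
		.sched_flags = SCHED_FLAG_KEEP_POLICY |
			       SCHED_FLAG_KEEP_PARAMS |
			       SCHED_FLAG_UTIL_CLAMP_MIN,
		.sched_util_min = 0,	/* no boost at all */
	};

	if (set_attr(&attr))
		return 1;

	/* ... later, relax the constraint: back to default behavior */
	attr.sched_flags = SCHED_FLAG_KEEP_POLICY |
			   SCHED_FLAG_KEEP_PARAMS |
			   SCHED_FLAG_UTIL_CLAMP_RESET;
	return set_attr(&attr) ? 1 : 0;
}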



Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-13 Thread Patrick Bellasi


On Tue, Oct 13, 2020 at 15:32:46 +0200, Qais Yousef  
wrote...

> On 10/13/20 13:46, Patrick Bellasi wrote:
>> > So IMO you just need a single SCHED_FLAG_UTIL_CLAMP_RESET that if set in 
>> > the
>> > attr, you just execute that loop in __setscheduler_uclamp() + reset
>> > uc_se->user_defined.
>> >
>> > It should be invalid to pass the SCHED_FLAG_UTIL_CLAMP_RESET with
>> > SCHED_FLAG_UTIL_CLAMP_MIN/MAX. Both have contradictory meaning IMO.
>> > If user passes both we should return an EINVAL error.
>> 
>> Passing in  _CLAMP_RESET|_CLAMP_MIN will mean reset the min value while
>> keeping the max at whatever it is. I think there could be cases where
>> this support could be on hand.
>
> I am not convinced personally. I'm anxious about what this fine-grained
> control means and how it should be used. I think less is more in this case
> and we can always relax the restriction (appropriately) later if it's
> *really* required.
>
> Particularly the fact that this user_defined is per uclamp_se and that it
> affects the cgroup behavior is an implementation detail this API shouldn't rely
> on.

The user_defined flag is an implementation detail: true, but since the
beginning uclamp has _always_ allowed a task to set only one of its clamp
values.

That's why we have UTIL_CLAMP_{MIN,MAX} as separate flags and all the
logic in place to set only one of the two.


> A generic 'RESET my uclamp settings' makes more sense to me as a long-term
> API to maintain.
>
> Actually maybe we should even go for a more explicit
> SCHED_FLAG_UTIL_CLAMP_INHERIT_CGROUP flag instead. If we decide to abandon the
> support for this feature in the future, at least we can make it return an
> error without affecting other functionality, because the implicit nature
> of SCHED_FLAG_UTIL_CLAMP_RESET means inheriting the cgroup value too.

That's not true, and what you want to expose is an even worse
implementation detail.

A task without a task-specific clamp _always_ inherits the system
defaults. Resetting a task specific clamp already makes sense also
_without_ cgroups. It means: just do whatever the system allows you to
do.

Only if you are running with CGRoups enabled and the task happens to be
_not_ in the root group, the "CGroups inheritance" happens.
But that's exactly an internal detail a task should not care about.


> That being said, I am not strongly against the fine-grained approach if that's
> what Yun wants now or what you both prefer.

It's not a fine-grained approach; it's just adding a reset mechanism for
what uclamp already allows to do: setting min and max clamps
independently.

Regarding use cases, I also believe we have many more use cases of tasks
interested in setting/resetting just one clamp than tasks interested in
"fine-grained" control of both clamps at the same time.


> I just think the name of the flag needs to change to be more explicit
> too then.

I don't agree on that and, again, I see much more fine-grained detail and
internals exposure in what you propose compared to a single generic
_RESET flag.

> It'd be good to hear what others think.

I agree on that ;)



Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-13 Thread Patrick Bellasi


On Tue, Oct 13, 2020 at 12:29:51 +0200, Qais Yousef  
wrote...

> On 10/13/20 10:21, Patrick Bellasi wrote:
>> 

[...]

>> > +#define SCHED_FLAG_UTIL_CLAMP_RESET (SCHED_FLAG_UTIL_CLAMP_RESET_MIN | \
>> > +  SCHED_FLAG_UTIL_CLAMP_RESET_MAX)
>> > +
>> >  #define SCHED_FLAG_ALL(SCHED_FLAG_RESET_ON_FORK   | \
>> > SCHED_FLAG_RECLAIM | \
>> > SCHED_FLAG_DL_OVERRUN  | \
>> > SCHED_FLAG_KEEP_ALL| \
>> > -   SCHED_FLAG_UTIL_CLAMP)
>> > +   SCHED_FLAG_UTIL_CLAMP  | \
>> > +   SCHED_FLAG_UTIL_CLAMP_RESET)
>> 
>> 
>> ... and use it in conjunction with the existing _CLAMP_{MIN,MAX} to know
>> which clamp should be reset?
>
> I think the RESET should restore *both* MIN and MAX and reset the user_defined
> flag. Since the latter is the main purpose of this interface, I don't think 
> you
> can reset the user_defined flag without resetting both MIN and MAX to
> uclamp_none[UCLAMP_MIN/MAX].

We can certainly set one clamp and not the other, and indeed the
user_defined flag is per-clamp_id, thus we can reset one clamp while
still keeping user-defined the other one.


>> >  #endif /* _UAPI_LINUX_SCHED_H */
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 9a2fbf98fd6f..ed4cb412dde7 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -1207,15 +1207,22 @@ static void __setscheduler_uclamp(struct 
>> > task_struct *p,
>> >uclamp_se_set(uc_se, clamp_value, false);
>> >}
>> >  
>> > -  if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
>> > +  if (likely(!(attr->sched_flags &
>> >  (SCHED_FLAG_UTIL_CLAMP | SCHED_FLAG_UTIL_CLAMP_RESET))))
>> >return;
>> 
>> This check will not be changed, while we will have to add a bypass in
>> uclamp_validate().
>> 
>> >  
>> > -  if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>> > +  if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET_MIN) {
>> > +  uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
>> > +0, false);
>> > +  } else if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>> >uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
>> >  attr->sched_util_min, true);
>> >}
>> >
>> 
>> These checks also will have to be updated to check _RESET and
>> _{MIN,MAX} combinations.
>> 
>> Bonus points if it were possible to pass in just the _RESET flag when
>> we want to reset both clamps. IOW, passing in _RESET only should be
>> consumed as if we passed in _RESET|_MIN|_MAX.
>> 
>> Caveat, RT tasks have got a special 'reset value' for _MIN.
>> We should check and ensure __uclamp_update_util_min_rt_default() is
>> properly called for those tasks, which likely will require some
>> additional refactoring :/
>
> Hmm I am probably missing something. But if the SCHED_FLAG_UTIL_CLAMP_RESET is
> set, just reset uc_se->user_defined in the loop in __setscheduler_uclamp().
> This should take care of doing the reset properly then. Including for
> RT tasks.

Yes and no. Yes because in principle we can just reset the flag for a
clamp_id without updating the request values, as it is done by the
snippets above, and the internals should work.

However, we will end up reporting the old values when reading from
user-space. We should better check all those reporting code paths or...
just update the requested values as Yun is proposing above.

I like Yun's approach better, so that we keep internal data structures
aligned with features.

> So IMO you just need a single SCHED_FLAG_UTIL_CLAMP_RESET that if set in the
> attr, you just execute that loop in __setscheduler_uclamp() + reset
> uc_se->user_defined.
>
> It should be invalid to pass the SCHED_FLAG_UTIL_CLAMP_RESET with
> SCHED_FLAG_UTIL_CLAMP_MIN/MAX. Both have contradictory meaning IMO.
> If user passes both we should return an EINVAL error.

Passing in  _CLAMP_RESET|_CLAMP_MIN will mean reset the min value while
keeping the max at whatever it is. I think there could be cases where
this support could be on hand.

However, as in my previous email, by passing in only _CLAMP_RESET, we
should go and reset both.
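
Summarizing the combinations discussed in this sub-thread (one reading
of the proposal, not settled semantics):

  Flags passed to sched_setattr()        Effect on the task's clamps
  _UTIL_CLAMP_RESET                      reset both min and max
  _UTIL_CLAMP_RESET | _UTIL_CLAMP_MIN    reset min only, keep max
  _UTIL_CLAMP_RESET | _UTIL_CLAMP_MAX    reset max only, keep min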



Re: [PATCH v2] sched/features: Fix !CONFIG_JUMP_LABEL case

2020-10-13 Thread Patrick Bellasi


On Tue, Oct 13, 2020 at 07:31:14 +0200, Juri Lelli  
wrote...

> Commit 765cc3a4b224e ("sched/core: Optimize sched_feat() for
> !CONFIG_SCHED_DEBUG builds") made sched features static for
> !CONFIG_SCHED_DEBUG configurations, but overlooked the CONFIG_
> SCHED_DEBUG enabled and !CONFIG_JUMP_LABEL cases. For the latter echoing
> changes to /sys/kernel/debug/sched_features has the nasty effect of
> effectively changing what sched_features reports, but without actually
> changing the scheduler behaviour (since different translation units get
> different sysctl_sched_features).

Oops, yes, I think I missed properly checking that config :/
Good spot!

> Fix CONFIG_SCHED_DEBUG and !CONFIG_JUMP_LABEL configurations by properly
> restructuring ifdefs.
>
> Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for 
> !CONFIG_SCHED_DEBUG builds")
> Co-developed-by: Daniel Bristot de Oliveira 
> Signed-off-by: Daniel Bristot de Oliveira 
> Signed-off-by: Juri Lelli 

(did you get some wrong formatting for the changelog above?)

> ---
> v1->v2
>  - use CONFIG_JUMP_LABEL (and not the old HAVE_JUMP_LABEL) [Valentin]
> ---
>  kernel/sched/core.c  |  2 +-
>  kernel/sched/sched.h | 13 ++---
>  2 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3dc415f58bd7..a7949e3ed7e7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -44,7 +44,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>  
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>  
> -#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
> +#ifdef CONFIG_SCHED_DEBUG
>  /*
>   * Debugging: various feature bits
>   *
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 28709f6b0975..8d1ca65db3b0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1629,7 +1629,7 @@ enum {
>  
>  #undef SCHED_FEAT
>  
> -#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
> +#ifdef CONFIG_SCHED_DEBUG
>  
>  /*
>   * To support run-time toggling of sched features, all the translation units
> @@ -1637,6 +1637,7 @@ enum {
>   */
>  extern const_debug unsigned int sysctl_sched_features;
>  
> +#ifdef CONFIG_JUMP_LABEL
>  #define SCHED_FEAT(name, enabled)\
>  static __always_inline bool static_branch_##name(struct static_key *key) \
>  {\
> @@ -1649,7 +1650,13 @@ static __always_inline bool 
> static_branch_##name(struct static_key *key) \
>  extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
>  #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
>  
> -#else /* !(SCHED_DEBUG && CONFIG_JUMP_LABEL) */
> +#else /* !CONFIG_JUMP_LABEL */
> +
> +#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
> +
> +#endif /* CONFIG_JUMP_LABEL */
> +
> +#else /* !SCHED_DEBUG */
>  
>  /*
>   * Each translation unit has its own copy of sysctl_sched_features to allow
> @@ -1665,7 +1672,7 @@ static const_debug __maybe_unused unsigned int 
> sysctl_sched_features =
>  
>  #define sched_feat(x) !!(sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
>  
> -#endif /* SCHED_DEBUG && CONFIG_JUMP_LABEL */
> +#endif /* SCHED_DEBUG */
>  
>  extern struct static_key_false sched_numa_balancing;
>  extern struct static_key_false sched_schedstats;
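
A runnable userspace illustration of the !CONFIG_JUMP_LABEL fallback at
stake here (names simplified): sched_feat() degrades to a plain bit test
on a single shared sysctl_sched_features, so a runtime toggle is visible
to every caller; the bug was that each translation unit ended up with
its own static copy, so writes did not propagate:

#include <stdio.h>

enum {
	__SCHED_FEAT_GENTLE_FAIR_SLEEPERS,
	__SCHED_FEAT_WAKEUP_PREEMPTION,
	__SCHED_FEAT_NR,
};

/* single shared copy: toggles are seen by all "translation units" */
static unsigned int sysctl_sched_features =
	(1U << __SCHED_FEAT_GENTLE_FAIR_SLEEPERS) |
	(1U << __SCHED_FEAT_WAKEUP_PREEMPTION);

#define sched_feat(x) \
	(!!(sysctl_sched_features & (1U << __SCHED_FEAT_##x)))

int main(void)
{
	printf("GENTLE_FAIR_SLEEPERS=%d\n", sched_feat(GENTLE_FAIR_SLEEPERS));

	/* emulate "echo NO_GENTLE_FAIR_SLEEPERS > sched_features" */
	sysctl_sched_features &= ~(1U << __SCHED_FEAT_GENTLE_FAIR_SLEEPERS);
	printf("GENTLE_FAIR_SLEEPERS=%d\n", sched_feat(GENTLE_FAIR_SLEEPERS));
	return 0;
}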



Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-13 Thread Patrick Bellasi


Hi Yun,
thanks for sharing this new implementation.

On Mon, Oct 12, 2020 at 18:31:40 +0200, Yun Hsiang  
wrote...

> If the user wants to stop controlling uclamp and let the task inherit
> the value from the group, we need a method to reset.
>
> Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via
> sched_setattr syscall.

Looks like what you say here is not what you've coded, since you actually
add two new flags, _RESET_{MIN,MAX}.

I think we'd value instead a simple user-space interface, where just one
additional flag, _RESET, should be good enough.

> Signed-off-by: Yun Hsiang 
> ---
>  include/uapi/linux/sched.h |  9 -
>  kernel/sched/core.c| 16 
>  2 files changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 3bac0a8ceab2..a12e88c362d8 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -132,6 +132,9 @@ struct clone_args {
>  #define SCHED_FLAG_KEEP_PARAMS   0x10
>  #define SCHED_FLAG_UTIL_CLAMP_MIN0x20
>  #define SCHED_FLAG_UTIL_CLAMP_MAX0x40
> +#define SCHED_FLAG_UTIL_CLAMP_RESET_MIN  0x80
> +#define SCHED_FLAG_UTIL_CLAMP_RESET_MAX  0x100

What about adding just SCHED_FLAG_UTIL_CLAMP_RESET 0x08 ...

>  
>  #define SCHED_FLAG_KEEP_ALL  (SCHED_FLAG_KEEP_POLICY | \
>SCHED_FLAG_KEEP_PARAMS)
> @@ -139,10 +142,14 @@ struct clone_args {
>  #define SCHED_FLAG_UTIL_CLAMP(SCHED_FLAG_UTIL_CLAMP_MIN | \
>SCHED_FLAG_UTIL_CLAMP_MAX)

... making it part of SCHED_FLAG_UTIL_CLAMP ...

>  
> +#define SCHED_FLAG_UTIL_CLAMP_RESET (SCHED_FLAG_UTIL_CLAMP_RESET_MIN | \
> + SCHED_FLAG_UTIL_CLAMP_RESET_MAX)
> +
>  #define SCHED_FLAG_ALL   (SCHED_FLAG_RESET_ON_FORK   | \
>SCHED_FLAG_RECLAIM | \
>SCHED_FLAG_DL_OVERRUN  | \
>SCHED_FLAG_KEEP_ALL| \
> -  SCHED_FLAG_UTIL_CLAMP)
> +  SCHED_FLAG_UTIL_CLAMP  | \
> +  SCHED_FLAG_UTIL_CLAMP_RESET)


... and use it in conjunction with the existing _CLAMP_{MIN,MAX} to know
which clamp should be reset?


>  #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9a2fbf98fd6f..ed4cb412dde7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1207,15 +1207,22 @@ static void __setscheduler_uclamp(struct task_struct 
> *p,
>   uclamp_se_set(uc_se, clamp_value, false);
>   }
>  
> - if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> + if (likely(!(attr->sched_flags &
> + (SCHED_FLAG_UTIL_CLAMP | SCHED_FLAG_UTIL_CLAMP_RESET))))
>   return;

This check will not be changed, while we will have to add a bypass in
uclamp_validate().

>  
> - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET_MIN) {
> + uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> +   0, false);
> + } else if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>   uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> attr->sched_util_min, true);
>   }
>

These checks also will have to be updated to check _RESET and
_{MIN,MAX} combinations.

Bonus points if it were possible to pass in just the _RESET flag when
we want to reset both clamps. IOW, passing in _RESET only should be
consumed as if we passed in _RESET|_MIN|_MAX.

Caveat, RT tasks have got a special 'reset value' for _MIN.
We should check and ensure __uclamp_update_util_min_rt_default() is
properly called for those tasks, which likely will require some
additional refactoring :/

> - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET_MAX) {
> + uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> +   1024, false);
> + } else if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
>   uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> attr->sched_util_max, true);
>   }
> @@ -4901,7 +4908,8 @@ static int __sched_setscheduler(struct task_struct *p,
>   goto change;
>   if (dl_policy(policy) && dl_param_changed(p, attr))
>   goto change;
> - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
> + if (attr->sched_flags &
> + (SCHED_FLAG_UTIL_CLAMP | 
> SCHED_FLAG_UTIL_CLAMP_RESET))
>   goto change;
>  
>   p->sched_reset_on_fork = reset_on_fork;



Re: [PATCH 1/1] sched/uclamp: release per-task uclamp control if user set to default value

2020-10-05 Thread Patrick Bellasi


Hi Yun, Dietmar,

On Mon, Oct 05, 2020 at 14:38:18 +0200, Dietmar Eggemann 
 wrote...

> + Patrick Bellasi 
> + Qais Yousef 
>
> On 02.10.20 07:38, Yun Hsiang wrote:
>> On Wed, Sep 30, 2020 at 03:12:51PM +0200, Dietmar Eggemann wrote:
>
> [...]
>
>>> On 28/09/2020 10:26, Yun Hsiang wrote:
>>>> If the user wants to release the util clamp and let the cgroup control it,
>>>> we need a method to reset.
>>>>
>>>> So if the user set the task uclamp to the default value (0 for UCLAMP_MIN
>>>> and 1024 for UCLAMP_MAX), reset the user_defined flag to release control.
>>>>
>>>> Signed-off-by: Yun Hsiang 
>>>
>>> could you explain in a little more detail why you would need this
>>> feature?
>>>
>>> Currently we assume that once the per-task uclamp (user-defined) values
>>> are set, you could only change the effective uclamp values of this task
>>> by (1) moving it into another taskgroup or (2) changing the system
>>> default uclamp values.
>>>
>> 
>> Assume a module that controls group (such as top-app in android) uclamp and
>> task A in the group.
>> Once task A sets its uclamp, it will not be affected by the group setting.

That's not true, and Dietmar's example hereafter is correct.

We call it uclamp since the values are clamps, which are always
aggregated somehow at different levels. IOW, a task never has a fully
free choice of the final effective value.

> This depends on the uclamp values of A and /TG (the task group).
>
> Both uclamp values are max aggregated (max aggregation between
> system-wide, taskgroup and task values for UCLAMP_MIN and UCLAMP_MAX).
>
> (1) system-wide: /proc/sys/kernel/sched_util_clamp_[min,max]
>
> (2) taskgroup (hierarchy): /sys/fs/cgroup/cpu/TG/cpu.uclamp.[min,max]
>
> (3) task A:
>
> Example: [uclamp_min, uclamp_max]
>
> (1)  [1024, 1024]
>
> (2)  [25.00 (256), 75.00 (768)]
>
> (3a) [128, 512] : both values are not affected by /TG
>
> (3b) [384, 896] : both values are affected by /TG
>
>
>> If task A doesn't want to control itself anymore,

To be precise, in this case we should say: "if a task doesn't want to
give up anymore".

Indeed, the base idea is that a task can always and only
"ask for less". What it really gets (effective value) is the minimum
among its request, what the group allows, and the system-wide value on
top, i.e. ref [4,5]:

   eff_value = MIN(system-wide, MIN(tg, task))
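
For concreteness, a runnable sketch of this aggregation using Dietmar's
example values above (a simplified per-clamp-index model):

#include <stdio.h>

/* eff = MIN(system-wide, MIN(tg, task)), per clamp index */
static unsigned int eff(unsigned int sys, unsigned int tg, unsigned int task)
{
	unsigned int v = tg < task ? tg : task;

	return sys < v ? sys : v;
}

int main(void)
{
	/* system-wide [1024, 1024], /TG [256, 768] */
	printf("3a: [%u, %u]\n", eff(1024, 256, 128), eff(1024, 768, 512));
	printf("3b: [%u, %u]\n", eff(1024, 256, 384), eff(1024, 768, 896));
	return 0;	/* prints 3a: [128, 512] and 3b: [256, 768] */
}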


>> it cannot go back to the initial state to let the module (group) control it.
>
> In case A changes its values e.g. from 3a to 3b it will go back to be
> controlled by /TG again (like it was when it had no user defined
> values).

True, however it's also true that strictly speaking once a task has
defined a per-task value, we will always aggregate/clamp that value wrt
the TG and system-wide values.

>> But the other tasks in the group will be affected by the group.

This is not clear to me.

All tasks in a group will be treated independently. All the tasks are
subject to the same _individual_ aggregation/clamping policy.

> Yes, in case they have no user defined values or have values greater
> than the one of /TG.
>
>> The policy might be
>> 1) if the task wants to control its uclamp, use task uclamp value

Again, it's worth stressing that a task has _never_ full control of its
clamp. Precisely, a task has the freedom to always ask for less than
what is enforced at the TG/system level.

IOW, task-specific uclamp values support only a "nice" policy, where a
task can only give up something. Either be _less_ boosted or _more_
capped, which in both cases corresponds to asking for _less_ CPU
bandwidth.

>> (but under group uclamp constraint)
>
> That would be example 3a.
>
>> 2) if the task doesn't want to control its uclamp, use group uclamp value.
>
> That would be example 3b.
>
>> If the policy is proper, we need a reset method for per-task uclamp.
>> 
>>>> ---
>>>>  kernel/sched/core.c | 7 +--
>>>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 9a2fbf98fd6f..fa63d70d783a 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -1187,6 +1187,7 @@ static void __setscheduler_uclamp(struct task_struct 
>>>> *p,
>>>>  const struct sched_attr *attr)
>>>>  {
>>>>enum uclamp_id clamp_id;
>>>> +  bool user_defined;
>>>>  
>>>>/*
>>>> * On scheduling class change, reset to default

Re: [PATCH 4/5] ARM: dts: aspeed: minipack: Update 64MB FMC flash layout

2020-08-25 Thread Patrick Williams
On Mon, Aug 24, 2020 at 02:19:47PM -0700, rentao.b...@gmail.com wrote:
> From: Tao Ren 
> 
> Set 64MB FMC flash layout in Minipack device tree explicitly because the
> flash layout was removed from "ast2500-facebook-netbmc-common.dtsi".
> 
> Please note "data0" partition's size is updated to 4MB to be consistent
> with other Facebook OpenBMC platforms.
> 
> Signed-off-by: Tao Ren 
> ---
>  .../boot/dts/aspeed-bmc-facebook-minipack.dts | 47 ++-
>  1 file changed, 45 insertions(+), 2 deletions(-)
> 

Reviewed-by: Patrick Williams 

-- 
Patrick Williams




Re: [PATCH 3/5] ARM: dts: aspeed: yamp: Set 32MB FMC flash layout

2020-08-25 Thread Patrick Williams
On Mon, Aug 24, 2020 at 02:19:46PM -0700, rentao.b...@gmail.com wrote:
> From: Tao Ren 
> 
> Set 32MB FMC flash layout in Yamp device tree explicitly because flash
> layout settings were removed from "ast2500-facebook-netbmc-common.dtsi".
> 
> Signed-off-by: Tao Ren 
> ---
>  arch/arm/boot/dts/aspeed-bmc-facebook-yamp.dts | 17 +
>  1 file changed, 17 insertions(+)
> 

Reviewed-by: Patrick Williams 

-- 
Patrick Williams




Re: [PATCH 2/5] ARM: dts: aspeed: cmm: Set 32MB FMC flash layout

2020-08-25 Thread Patrick Williams
On Mon, Aug 24, 2020 at 02:19:45PM -0700, rentao.b...@gmail.com wrote:
> From: Tao Ren 
> 
> Set 32MB FMC flash layout in CMM device tree explicitly because the flash
> layout settings were removed from "ast2500-facebook-netbmc-common.dtsi".
> 
> Signed-off-by: Tao Ren 
> ---
>  arch/arm/boot/dts/aspeed-bmc-facebook-cmm.dts | 17 +
>  1 file changed, 17 insertions(+)
> 

Reviewed-by: Patrick Williams 

-- 
Patrick Williams




Re: [PATCH 1/5] ARM: dts: aspeed: Remove flash layout from Facebook AST2500 Common dtsi

2020-08-25 Thread Patrick Williams
On Mon, Aug 24, 2020 at 02:19:44PM -0700, rentao.b...@gmail.com wrote:
> From: Tao Ren 
> 
> Remove FMC flash layout from ast2500-facebook-netbmc-common.dtsi because
> flash size and layout varies across different Facebook AST2500 OpenBMC
> platforms.
> 
> Signed-off-by: Tao Ren 
> ---
>  .../boot/dts/ast2500-facebook-netbmc-common.dtsi| 13 -
>  1 file changed, 13 deletions(-)
> 

Reviewed-by: Patrick Williams 

-- 
Patrick Williams




[PATCH V2] USB: serial: ftdi_sio: add IDs for Xsens Mti USB converter

2020-08-06 Thread Patrick Riphagen
From: Patrick Riphagen 

The device added has an FTDI chip inside.
The device is used to connect Xsens USB Motion Trackers.

Cc: sta...@vger.kernel.org
Signed-off-by: Patrick Riphagen 
---
 drivers/usb/serial/ftdi_sio.c | 1 +
 drivers/usb/serial/ftdi_sio_ids.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c
index 9ad44a96dfe3..2c08cad32f1d 100644
--- a/drivers/usb/serial/ftdi_sio.c
+++ b/drivers/usb/serial/ftdi_sio.c
@@ -713,6 +713,7 @@ static const struct usb_device_id id_table_combined[] = {
{ USB_DEVICE(XSENS_VID, XSENS_AWINDA_STATION_PID) },
{ USB_DEVICE(XSENS_VID, XSENS_CONVERTER_PID) },
{ USB_DEVICE(XSENS_VID, XSENS_MTDEVBOARD_PID) },
+   { USB_DEVICE(XSENS_VID, XSENS_MTIUSBCONVERTER_PID) },
{ USB_DEVICE(XSENS_VID, XSENS_MTW_PID) },
{ USB_DEVICE(FTDI_VID, FTDI_OMNI1509) },
{ USB_DEVICE(MOBILITY_VID, MOBILITY_USB_SERIAL_PID) },
diff --git a/drivers/usb/serial/ftdi_sio_ids.h 
b/drivers/usb/serial/ftdi_sio_ids.h
index e8373528264c..b5ca17a5967a 100644
--- a/drivers/usb/serial/ftdi_sio_ids.h
+++ b/drivers/usb/serial/ftdi_sio_ids.h
@@ -160,6 +160,7 @@
 #define XSENS_AWINDA_DONGLE_PID 0x0102
 #define XSENS_MTW_PID  0x0200  /* Xsens MTw */
 #define XSENS_MTDEVBOARD_PID   0x0300  /* Motion Tracker Development Board */
+#define XSENS_MTIUSBCONVERTER_PID  0x0301  /* MTi USB converter */
 #define XSENS_CONVERTER_PID0xD00D  /* Xsens USB-serial converter */
 
 /* Xsens devices using FTDI VID */
-- 
2.25.1
V2: Added CC to stable, From line to match SoB and version in subject



[PATCH] USB: serial: ftdi_sio: add IDs for Xsens Mti USB converter

2020-08-06 Thread Patrick Riphagen
The device added has an FTDI chip inside.
The device is used to connect Xsens USB Motion Trackers.

Cc: sta...@vger.kernel.org
Signed-off-by: Patrick Riphagen 
---
 drivers/usb/serial/ftdi_sio.c | 1 +
 drivers/usb/serial/ftdi_sio_ids.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c
index 9ad44a96dfe3..2c08cad32f1d 100644
--- a/drivers/usb/serial/ftdi_sio.c
+++ b/drivers/usb/serial/ftdi_sio.c
@@ -713,6 +713,7 @@ static const struct usb_device_id id_table_combined[] = {
{ USB_DEVICE(XSENS_VID, XSENS_AWINDA_STATION_PID) },
{ USB_DEVICE(XSENS_VID, XSENS_CONVERTER_PID) },
{ USB_DEVICE(XSENS_VID, XSENS_MTDEVBOARD_PID) },
+   { USB_DEVICE(XSENS_VID, XSENS_MTIUSBCONVERTER_PID) },
{ USB_DEVICE(XSENS_VID, XSENS_MTW_PID) },
{ USB_DEVICE(FTDI_VID, FTDI_OMNI1509) },
{ USB_DEVICE(MOBILITY_VID, MOBILITY_USB_SERIAL_PID) },
diff --git a/drivers/usb/serial/ftdi_sio_ids.h 
b/drivers/usb/serial/ftdi_sio_ids.h
index e8373528264c..b5ca17a5967a 100644
--- a/drivers/usb/serial/ftdi_sio_ids.h
+++ b/drivers/usb/serial/ftdi_sio_ids.h
@@ -160,6 +160,7 @@
 #define XSENS_AWINDA_DONGLE_PID 0x0102
 #define XSENS_MTW_PID  0x0200  /* Xsens MTw */
 #define XSENS_MTDEVBOARD_PID   0x0300  /* Motion Tracker Development Board */
+#define XSENS_MTIUSBCONVERTER_PID  0x0301  /* MTi USB converter */
 #define XSENS_CONVERTER_PID0xD00D  /* Xsens USB-serial converter */
 
 /* Xsens devices using FTDI VID */
-- 
2.25.1



[PATCH] USB: serial: ftdi_sio: add IDs for Xsens Mti USB converter

2020-08-05 Thread Patrick Riphagen
The device added has an FTDI chip inside.
The device is used to connect Xsens USB Motion Trackers.

Signed-off-by: Patrick Riphagen 
---
 drivers/usb/serial/ftdi_sio.c | 1 +
 drivers/usb/serial/ftdi_sio_ids.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c
index 9ad44a96dfe3..2c08cad32f1d 100644
--- a/drivers/usb/serial/ftdi_sio.c
+++ b/drivers/usb/serial/ftdi_sio.c
@@ -713,6 +713,7 @@ static const struct usb_device_id id_table_combined[] = {
{ USB_DEVICE(XSENS_VID, XSENS_AWINDA_STATION_PID) },
{ USB_DEVICE(XSENS_VID, XSENS_CONVERTER_PID) },
{ USB_DEVICE(XSENS_VID, XSENS_MTDEVBOARD_PID) },
+   { USB_DEVICE(XSENS_VID, XSENS_MTIUSBCONVERTER_PID) },
{ USB_DEVICE(XSENS_VID, XSENS_MTW_PID) },
{ USB_DEVICE(FTDI_VID, FTDI_OMNI1509) },
{ USB_DEVICE(MOBILITY_VID, MOBILITY_USB_SERIAL_PID) },
diff --git a/drivers/usb/serial/ftdi_sio_ids.h 
b/drivers/usb/serial/ftdi_sio_ids.h
index e8373528264c..b5ca17a5967a 100644
--- a/drivers/usb/serial/ftdi_sio_ids.h
+++ b/drivers/usb/serial/ftdi_sio_ids.h
@@ -160,6 +160,7 @@
 #define XSENS_AWINDA_DONGLE_PID 0x0102
 #define XSENS_MTW_PID  0x0200  /* Xsens MTw */
 #define XSENS_MTDEVBOARD_PID   0x0300  /* Motion Tracker Development Board */
+#define XSENS_MTIUSBCONVERTER_PID  0x0301  /* MTi USB converter */
 #define XSENS_CONVERTER_PID0xD00D  /* Xsens USB-serial converter */
 
 /* Xsens devices using FTDI VID */
-- 
2.25.1



Dear

2020-08-05 Thread Dossou Cabinet patrick
Dear friend,
Please accept my apologies, I didn't mean to invade your privacy, but
I wrote you an email already, but I'm not sure you have it, so I'm
resending it to you.  I pray that this letter reaches you in good
health.  I am Barrister Dossou Patrick.  I have a US $ 10 million
business proposal in the codicil and last will from a deceased who
happens to be my client.

He also has the same last name with you and he is from your country as
well.  I want to introduce you to his bank as his next of kin so that
the bank will pay the 10 million US dollars to your account in your
country and I will come to your country for my own part.

I will provide more details on the subject once I have received your
response via the attached email address: [dossou.cabin...@gmail.com]
You can also request my phone number for further discussion  On the
question.  We may also take this opportunity to discuss the split
ratio and fund transfer terms on your behalf.
Regards,
Dossou patrick, Esq






Donation

2020-07-23 Thread Frances Patrick Connolly
Hello,

 We are Frances and Patrick Connolly the mega winner of £115m EuroMillions
lottery, if you get this email then your email was selected after a spin ball.
We have spread most of our wealth over several charities
and organizations. We have voluntarily decided to donate the sum
of £1 Million to you, to verify our winnings please see our YouTube
page below.

WATCH HERE: https://www.youtube.com/watch?v=NdAfkA8QPpg

THIS IS YOUR DONATION CODE: [ 0034219 ]
Reply with the DONATION CODE to this email: francespatrickconnol...@gmail.com

Hope to make you and your family happy.

Regards
Frances and Patrick Connolly.


Re: nouveau regression with 5.7 caused by "PCI/PM: Assume ports without DLL Link Active train links in 100 ms"

2020-07-21 Thread Patrick Volkerding
On 7/21/20 10:27 AM, Mika Westerberg wrote:
> On Tue, Jul 21, 2020 at 11:01:55AM -0400, Lyude Paul wrote:
>> Sure thing. Also, feel free to let me know if you'd like access to one of the
>> systems we saw breaking with this patch - I'm fairly sure I've got one of 
>> them
>> locally at my apartment and don't mind setting up AMT/KVM/SSH
> Probably no need for remote access (thanks for the offer, though). I
> attached a test patch to the bug report:
>
>   https://bugzilla.kernel.org/show_bug.cgi?id=208597
>
> that tries to work it around (based on the ->pm_cap == 0). I wonder if
> anyone would have time to try it out.


Hi Mika,

I can confirm that this patch applied to 5.4.52 fixes the issue with
hybrid graphics on the Thinkpad X1 Extreme gen2.

Thanks,

Pat



Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus

2020-07-16 Thread Patrick Bellasi


Hi Vincent,

On Mon, Jul 13, 2020 at 14:59:51 +0200, Vincent Guittot 
 wrote...

> On Fri, 10 Jul 2020 at 21:59, Patrick Bellasi  
> wrote:
>> On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot 
>>  wrote...
>>
>> [...]
>>
>> >> > C) Existing control paths
>> >>
>> >> Assuming:
>> >>
>> >>  C: CFS task currently running on CPUx
>> >>  W: CFS task waking up on the same CPUx
>> >>
>> >> And considering the overall simplified workflow:
>> >>
>> >> core::try_to_wake_up()
>> >>
>> >>   // 1) Select on which CPU W will run
>> >>   core::select_task_rq()
>> >> fair::select_task_rq_fair()
>> >>
>> >>   // 2) Enqueue W on the selected CPU
>> >>   core::ttwu_queue()
>> >> core::ttwu_do_activate()
>> >>   core::activate_task()
>> >> core::enqueue_task()
>> >>   fair::enqueue_task_fair()
>> >> fair::enqueue_entity()
>> >>
>> >>   // 3) Set W's vruntime bonus
>> >>   fair::place_entity()
>> >> se->vruntime = ...
>> >>
>> >>   // 4) Check if C can be preempted by W
>> >>   core::ttwu_do_wakeup()
>> >> core::check_preempt_curr()
>> >>   fair::check_preempt_curr()
>> >> fair::check_preempt_wakeup(curr, se)
>> >>   fair::wakeup_preempt_entity(curr, se)
>> >> vdiff = curr.vruntime - se.vruntime
>> >> return vdiff > wakeup_gran(se)
>> >>
>> >> We see that W preempts C iff:
>> >>
>> >>vdiff > wakeup_gran(se)
>> >>
>> >> Since:
>> >>
>> >> enqueue_entity(cfs_rq, se, flags)
>> >>   place_entity(cfs_rq, se, initial=0)
>> >> thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
>> >> vruntime = cfs_rq->min_vruntime - thresh
>> >> se->vruntime = max_vruntime(se->vruntime, vruntime)
>> >>
>> >> a waking task's W.vruntime can get a "vruntime bonus" up to:
>>  - 1/2 scheduler latency (w/  GENTLE_FAIR_SLEEPERS)
>>  - 1   scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
>> >>
>> >>
>> >> > D) Desired behavior
>> >>
>> >> The "vruntime bonus" (thresh) computed in place_entity() should have a
>> >> per-task definition, which defaults to the current implementation.
>> >>
>> >> A bigger vruntime bonus can be configured for latency sensitive tasks.
>> >> A smaller vruntime bonus can be configured for latency tolerant tasks.
>> >
>> > I'm not sure that adjusting what you called "vruntime bonus" is the
>> > right way to provide some latency because it doesn't only provide a
>> > wakeup latency bonus but also provides a runtime bonus.
>>
>> True, however that's what we already do but _just_ in an hard-coded way.
>>
>> A task waking up from sleep gets a 1/2 sched_latency bonus, or 1 w/o
>> GENTLE_FAIR_SLEEPERS. Point is that not all tasks are the same: for some this
>
> From a nice and fair PoV, it's not a bonus but the opposite. In fact
> it's limiting how much credit, the task will keep from its sleep time.

I agree about 'limiting a credit'; thus, being a _credit_, IMO it's a bonus,
and the limiting happens only with GENTLE_FAIR_SLEEPERS.

So, in general, tasks _waking up_ get a (limited) credit, i.e. a
vruntime bonus.

From the FAIR PoV it is even more of a bonus since, AFAIU, all the
machinery is designed to give some vruntime bonus to _non_ CPU-bound /
batch tasks.

That's done to compensate for them being suspended and thus not having a
chance to consume all their fair CPU share in the previous activation.

> Also keep in mind that this value is clamped by its vruntime so a task
> can't get bonus

True, at wakeup we clamp it with the SE (normalized) vruntime.

But still since we do:

   se->vruntime = max(se->vruntime, cfs_rq->min_vruntime - VRUNTIME_BONUS)
                      \--- A ----/  \---------------- B ----------------/

The bigger B is, the more likely we are to "penalize" the SE vruntime.


>> bonus can be not really required, for others too small.
>>
Regarding the 'runtime bonus', I think it's kind of unavoidable
if we really want a latency-sensitive task to be scheduled
before the others.
>
> That's where I disagree.

Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus

2020-07-10 Thread Patrick Bellasi


On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot 
 wrote...

> Hi Patrick,

Hi Vincent,

[...]

>> > C) Existing control paths
>>
>> Assuming:
>>
>>  C: CFS task currently running on CPUx
>>  W: CFS task waking up on the same CPUx
>>
>> And considering the overall simplified workflow:
>>
>> core::try_to_wake_up()
>>
>>   // 1) Select on which CPU W will run
>>   core::select_task_rq()
>> fair::select_task_rq_fair()
>>
>>   // 2) Enqueue W on the selected CPU
>>   core::ttwu_queue()
>> core::ttwu_do_activate()
>>   core::activate_task()
>> core::enqueue_task()
>>   fair::enqueue_task_fair()
>> fair::enqueue_entity()
>>
>>   // 3) Set W's vruntime bonus
>>   fair::place_entity()
>> se->vruntime = ...
>>
>>   // 4) Check if C can be preempted by W
>>   core::ttwu_do_wakeup()
>> core::check_preempt_curr()
>>   fair::check_preempt_curr()
>> fair::check_preempt_wakeup(curr, se)
>>   fair::wakeup_preempt_entity(curr, se)
>> vdiff = curr.vruntime - se.vruntime
>> return vdiff > wakeup_gran(se)
>>
>> We see that W preempts C iff:
>>
>>vdiff > wakeup_gran(se)
>>
>> Since:
>>
>> enqueue_entity(cfs_rq, se, flags)
>>   place_entity(cfs_rq, se, initial=0)
>> thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
>> vruntime = cfs_rq->min_vruntime - thresh
>> se->vruntime = max_vruntime(se->vruntime, vruntime)
>>
>> a waking task's W.vruntime can get a "vruntime bonus" up to:
>>  - 1/2 scheduler latency (w/  GENTLE_FAIR_SLEEPERS)
>>  - 1   scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
>>
>>
>> > D) Desired behavior
>>
>> The "vruntime bonus" (thresh) computed in place_entity() should have a
>> per-task definition, which defaults to the current implementation.
>>
>> A bigger vruntime bonus can be configured for latency sensitive tasks.
>> A smaller vruntime bonus can be configured for latency tolerant tasks.
>
> I'm not sure that adjusting what you called "vruntime bonus" is the
> right way to provide some latency because it doesn't only provide a
> wakeup latency bonus but also provides a runtime bonus.

True, however that's what we already do but _just_ in an hard-coded way.

A task waking up from sleep gets a 1/2 sched_latency bonus, or 1 w/o
GENTLE_FAIR_SLEEPERS. Point is that not all tasks are the same: for some this
bonus can be not really required, for others too small.

Regarding the 'runtime bonus', I think it's kind of unavoidable
if we really want a latency-sensitive task to be scheduled
before the others.

> It means that one can impact the running time by playing with
> latency_nice whereas the goal is only to impact the wakeup latency.

Well, but I'm not sure how much you can really gain considering that
this bonus is given only at wakeup time: the task should keep
suspending itself. It would get a better result by just asking for a
lower nice value.

Now, asking for a reduced nice value is RLIMIT_NICE and CAP_SYS_NICE
protected. The same will be for latency_nice.

Moreover, considering that by default tasks will get what we already
have as hard-coded, or less of a bonus, I don't see how easy it would be
to abuse.

On the contrary, we can introduce a very useful knob to allow certain
tasks to voluntarily demote themselves and avoid annoying a currently
running task. 

> Instead, it should weight the decision in wakeup_preempt_entity() and
> wakeup_gran()

In those functions we already take the task prio into consideration
(ref details at the end of this message).

Lower nice value tasks have more chances to preempt current since they
will have a smaller wakeup_gran, indeed:

we preempt  IFF   vdiff(se, current) > wakeup_gran(se)
                  \------ A -------/   \---- B ------/

While the task's prio affects B, in this proposal latency_nice works on
the A side of the equation above by making it a bit more task-specific.

That said, it's true that both latency_nice and prio will ultimately
play a role in how much CPU bandwidth a task gets.

Question is: do we deem it useful to have an additional knob working on
the A side of the equation above?

Best,
Patrick



---8<--8<--8<--8<--8<--8<--8<--8<--8<--8<---

TL;DR: The nice value already affects the wakeup latency

As reported above:

   check_preempt_wakeup(rq, p, wake_flags)
 wakeup_preempt_entity(curr, se)
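
A runnable toy model of the place_entity() bonus and the preemption
check walked through in this thread; the 6ms/1ms constants stand in for
the sched_latency and wakeup granularity defaults, and the weight
scaling done by wakeup_gran() is ignored:

#include <stdbool.h>
#include <stdio.h>

#define SYSCTL_SCHED_LATENCY		6000000ULL	/* 6 ms, illustrative */
#define SYSCTL_SCHED_WAKEUP_GRAN	1000000ULL	/* 1 ms, illustrative */

static unsigned long long place_entity(unsigned long long se_vruntime,
					unsigned long long min_vruntime,
					bool gentle_fair_sleepers)
{
	unsigned long long thresh = SYSCTL_SCHED_LATENCY;

	if (gentle_fair_sleepers)
		thresh >>= 1;

	/* the "vruntime bonus": place the waker up to thresh behind */
	unsigned long long target = min_vruntime - thresh;

	return se_vruntime > target ? se_vruntime : target;
}

static bool preempts(unsigned long long curr_vruntime,
		     unsigned long long se_vruntime)
{
	long long vdiff = (long long)(curr_vruntime - se_vruntime);

	return vdiff > (long long)SYSCTL_SCHED_WAKEUP_GRAN;
}

int main(void)
{
	unsigned long long min_vr = 100000000ULL;	/* cfs_rq->min_vruntime */
	unsigned long long w = place_entity(0, min_vr, true);

	printf("W placed at %llu (bonus %llu ns)\n", w, min_vr - w);
	printf("W preempts C running at min_vruntime? %d\n",
	       preempts(min_vr, w));
	return 0;
}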

[PATCH] ARM: dts: stm32: Correct spi4 pins in stm32mp15-pinctrl.dtsi

2020-07-08 Thread Patrick Delaunay
Move spi4_pins_a nodes from pinctrl_z to pinctrl
as the associated pins are not in BANK Z.

Fixes: 498a7014989dfdd9a47864b55704dc829ed0dc90
Signed-off-by: Patrick Delaunay 
---

 arch/arm/boot/dts/stm32mp15-pinctrl.dtsi | 28 
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi 
b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
index 7eb858732d6d..6aedbd7077ff 100644
--- a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
+++ b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
@@ -1574,6 +1574,20 @@
};
};
 
+   spi4_pins_a: spi4-0 {
+   pins {
+   pinmux = , /* SPI4_SCK */
+;  /* SPI4_MOSI */
+   bias-disable;
+   drive-push-pull;
+   slew-rate = <1>;
+   };
+   pins2 {
+   pinmux = ; /* SPI4_MISO */
+   bias-disable;
+   };
+   };
+
usart2_pins_a: usart2-0 {
pins1 {
pinmux = , /* USART2_TX */
@@ -1776,18 +1790,4 @@
bias-disable;
};
};
-
-   spi4_pins_a: spi4-0 {
-   pins {
-   pinmux = , /* SPI4_SCK */
-;  /* SPI4_MOSI */
-   bias-disable;
-   drive-push-pull;
-   slew-rate = <1>;
-   };
-   pins2 {
-   pinmux = ; /* SPI4_MISO */
-   bias-disable;
-   };
-   };
 };
-- 
2.17.1



Re: [PATCH v5 2/2] sched/uclamp: Protect uclamp fast path code with static key

2020-06-30 Thread Patrick Bellasi


On Tue, Jun 30, 2020 at 17:40:34 +0200, Qais Yousef  
wrote...

> Hi Patrick
>
> On 06/30/20 16:55, Patrick Bellasi wrote:
>> 
>> Hi Qais,
>> sorry for commenting on v5 with a v6 already posted, but...
>> ... I cannot keep up with your re-spinning rate ;)
>
> I classified that as a nit really and doesn't affect correctness. We have
> different subjective view on what is better here. I did all the work in the
> past 2 weeks and I think as the author of this patch I have the right to keep
> my preference on subjective matters. I did consider your feedback and didn't
> ignore it and improved the naming and added a comment to make sure there's no
> confusion.
>
> We could nitpick the best name forever, but is it really that important?

Which leans toward confirming the impression I had while reading your
previous response, i.e. you stopped reading at the name change
observation, which would be _just_ nit-picking, although still
worthwhile IMHO.

Instead, I went further and asked you to consider a different approach:
not adding a new kernel symbol to represent a concept already there.

> I really don't see any added value for one approach or another here to start
> a long debate about it.

Then you could have just called out that instead of silently ignoring
the comment/proposal.

> The comments were small enough that I didn't see any controversy that
> warrants holding the patches longer. I agreed with your proposal to use
> uc_se->active and clarified why your other suggestions don't hold.
>
> You pointed that uclamp_is_enabled() confused you; and I responded that I'll
> change the name.

Perhaps it would not be only me who's confused by having
'something_enabled()' refer to 'something_used'.

> Sorry for not being explicit about answering the below, but
> I thought my answer implied that I don't prefer it.

Your answer was about a name change; I don't see the correlation with a
different approach... but that may be just me.

>> >> Thus, perhaps we can just use the same pattern used by the
>> >> sched_numa_balancing static key:
>> >> 
>> >>   $ git grep sched_numa_balancing
>> >>   kernel/sched/core.c:DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
>> >>   kernel/sched/core.c:    static_branch_enable(&sched_numa_balancing);
>> >>   kernel/sched/core.c:    static_branch_disable(&sched_numa_balancing);
>> >>   kernel/sched/core.c:    int state = static_branch_likely(&sched_numa_balancing);
>> >>   kernel/sched/fair.c:    if (!static_branch_likely(&sched_numa_balancing))
>> >>   kernel/sched/fair.c:    if (!static_branch_likely(&sched_numa_balancing))
>> >>   kernel/sched/fair.c:    if (!static_branch_likely(&sched_numa_balancing))
>> >>   kernel/sched/fair.c:    if (static_branch_unlikely(&sched_numa_balancing))
>> >>   kernel/sched/sched.h:extern struct static_key_false sched_numa_balancing;
>> >> 
>> >> IOW: unconditionally define sched_uclamp_used as non static in core.c,
>> >> and use it directly on schedutil too.
>> 
>> So, what about this instead of adding the (renamed) method above?
>
> I am sorry there's no written rule that says one should do it in a specific
> way. And AFAIK both way are implemented in the kernel. I appreciate your
> suggestion but as the person who did all the hard work, I think my preference
> matters here too.

You surely know that sometimes reviewing code can be "hard work" too, so I
would not go down that path at all in this discussion. Quite likely I
have a different "subjective" view on how Open Source development works.

> And actually with my approach, when uclamp is not compiled in, there's no
> need to define an extra variable; and since uclamp_is_used() is defined as
> false for !CONFIG_UCLAMP_TASK, it'll help with DCE, so we're less likely to
> end up with dead code that'll never run in the final binary.

Good, this is the simple and small reply I've politely asked for.

Best,
Patrick



Re: [PATCH v5 2/2] sched/uclamp: Protect uclamp fast path code with static key

2020-06-30 Thread Patrick Bellasi


Hi Qais,
sorry for commenting on v5 with a v6 already posted, but...
... I cannot keep up with your re-spinning rate ;)

More importantly, perhaps you missed commenting on one of my previous
points.

Will have a better look at the rest of v6 later today.

Cheers,
Patrick

On Tue, Jun 30, 2020 at 11:46:24 +0200, Qais Yousef  
wrote...
> On 06/30/20 10:11, Patrick Bellasi wrote:
>> On Mon, Jun 29, 2020 at 18:26:33 +0200, Qais Yousef  
>> wrote...

[...]

>> > +
>> > +static inline bool uclamp_is_enabled(void)
>> > +{
>> > +  return static_branch_likely(&sched_uclamp_used);
>> > +}
>> 
>> Looks like here we mix up terms, which can be confusing.
>> AFAIKS, we use:
>> - *_enabled for the sched class flags (compile time)
>> - *_usedfor the user-space opting in (run time)
>
> I wanted to add a comment here.
>
> I can rename it to uclamp_is_used() if you want.

In my previous message I was mostly asking about this:

>> Thus, perhaps we can just use the same pattern used by the
>> sched_numa_balancing static key:
>> 
>>   $ git grep sched_numa_balancing
>>   kernel/sched/core.c:DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
>>   kernel/sched/core.c:    static_branch_enable(&sched_numa_balancing);
>>   kernel/sched/core.c:    static_branch_disable(&sched_numa_balancing);
>>   kernel/sched/core.c:    int state = static_branch_likely(&sched_numa_balancing);
>>   kernel/sched/fair.c:    if (!static_branch_likely(&sched_numa_balancing))
>>   kernel/sched/fair.c:    if (!static_branch_likely(&sched_numa_balancing))
>>   kernel/sched/fair.c:    if (!static_branch_likely(&sched_numa_balancing))
>>   kernel/sched/fair.c:    if (static_branch_unlikely(&sched_numa_balancing))
>>   kernel/sched/sched.h:extern struct static_key_false sched_numa_balancing;
>> 
>> IOW: unconditionally define sched_uclamp_used as non static in core.c,
>> and use it directly on schedutil too.

So, what about this instead of adding the (renamed) method above?



Re: [PATCH v5 2/2] sched/uclamp: Protect uclamp fast path code with static key

2020-06-30 Thread Patrick Bellasi

[...]

> +/**
> + * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
> + * @rq:  The rq to clamp against. Must not be NULL.
> + * @util:The util value to clamp.
> + * @p:   The task to clamp against. Can be NULL if you want to clamp
> + *   against @rq only.
> + *
> + * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
> + *
> + * If sched_uclamp_used static key is disabled, then just return the util
> + * without any clamping since uclamp aggregation at the rq level in the fast
> + * path is disabled, rendering this operation a NOP.
> + *
> + * Use uclamp_eff_value() if you don't care about uclamp values at rq
> + * level. It will return the correct effective uclamp value of the task
> + * even if the static key is disabled.

Well, if you don't care about rq, you don't call a uclamp_rq_* method.

I would say that the above paragraph is redundant; moreover, it adds a
cross-reference to a different method (name) which requires maintenance.

What about removing it?

> + */
>  static __always_inline
>  unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> struct task_struct *p)
>  {
> - unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> - unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
> + unsigned long min_util;
> + unsigned long max_util;
> +
> + if (!static_branch_likely(&sched_uclamp_used))
> + return util;
> +
> + min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> + max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

I think moving the initialization is not required; the compiler should
be smart enough to place them where it's best.

>   if (p) {
>   min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
> @@ -2371,6 +2396,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, 
> unsigned long util,
>  
>   return clamp(util, min_util, max_util);
>  }
> +
> +static inline bool uclamp_is_enabled(void)
> +{
> +	return static_branch_likely(&sched_uclamp_used);
> +}

Looks like here we mix up terms, which can be confusing.
AFAIKS, we use:
- *_enabled for the sched class flags (compile time)
- *_used    for the user-space opting in (run time)

Thus, perhaps we can just use the same pattern used by the
sched_numa_balancing static key:

  $ git grep sched_numa_balancing
  kernel/sched/core.c:DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
  kernel/sched/core.c:		static_branch_enable(&sched_numa_balancing);
  kernel/sched/core.c:		static_branch_disable(&sched_numa_balancing);
  kernel/sched/core.c:	int state = static_branch_likely(&sched_numa_balancing);
  kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
  kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
  kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
  kernel/sched/fair.c:	if (static_branch_unlikely(&sched_numa_balancing))
  kernel/sched/sched.h:extern struct static_key_false sched_numa_balancing;

IOW: unconditionally define sched_uclamp_used as non static in core.c,
and use it directly on schedutil too.

>  #else /* CONFIG_UCLAMP_TASK */
>  static inline
>  unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> @@ -2378,6 +2408,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, 
> unsigned long util,
>  {
>   return util;
>  }
> +
> +static inline bool uclamp_is_enabled(void)
> +{
> + return false;
> +}
>  #endif /* CONFIG_UCLAMP_TASK */
>  
>  #ifdef arch_scale_freq_capacity

Best,
Patrick


Re: [PATCH v4 2/2] sched/uclamp: Protect uclamp fast path code with static key

2020-06-26 Thread Patrick Bellasi


On Thu, Jun 25, 2020 at 17:43:52 +0200, Qais Yousef  
wrote...

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 235b2cae00a0..e2f1fffa013c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -794,6 +794,25 @@ unsigned int sysctl_sched_uclamp_util_max = 
> SCHED_CAPACITY_SCALE;
>  /* All clamps are required to be less or equal than these values */
>  static struct uclamp_se uclamp_default[UCLAMP_CNT];
>  
> +/*
> + * This static key is used to reduce the uclamp overhead in the fast path. It
> + * only disables the call to uclamp_rq_{inc, dec}() in 
> enqueue/dequeue_task().
> + *
> + * This allows users to continue to enable uclamp in their kernel config with
> + * minimum uclamp overhead in the fast path.
> + *
> + * As soon as userspace modifies any of the uclamp knobs, the static key is
> + * enabled, since we have actual users that make use of uclamp
> + * functionality.
> + *
> + * The knobs that would enable this static key are:
> + *
> + *   * A task modifying its uclamp value with sched_setattr().
> + *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
> + *   * An admin modifying the cgroup cpu.uclamp.{min, max}
> + */
> +static DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

This is me being choosy, but given that:

---8<---
% grep -e '_used[ )]' fair.c core.c
fair.c: return static_key_false(&__cfs_bandwidth_used);
fair.c: static_key_slow_inc_cpuslocked(&__cfs_bandwidth_used);
fair.c: static_key_slow_dec_cpuslocked(&__cfs_bandwidth_used);

% grep -e '_enabled[ )]' fair.c core.c
fair.c: if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
fair.c: if (!cfs_rq->runtime_enabled || cfs_rq->nr_running)
fair.c: if (!cfs_rq->runtime_enabled || cfs_rq->curr)
fair.c: if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
fair.c: cfs_rq->runtime_enabled = 0;
fair.c: cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
fair.c: if (!cfs_rq->runtime_enabled)
fair.c: cfs_rq->runtime_enabled = 0;
core.c: if (static_key_false(&paravirt_steal_rq_enabled)) {
core.c: if (unlikely(!p->sched_class->uclamp_enabled))
core.c: if (unlikely(!p->sched_class->uclamp_enabled))
core.c:  * Prevent race between setting of cfs_rq->runtime_enabled and
core.c: runtime_enabled = quota != RUNTIME_INF;
core.c: runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
core.c: if (runtime_enabled && !runtime_was_enabled)
core.c: if (runtime_enabled)
core.c: cfs_rq->runtime_enabled = runtime_enabled;
core.c: if (runtime_was_enabled && !runtime_enabled)
---8<---

even just for consistency's sake, I would still prefer sched_uclamp_enabled for
the static key name.

> +
>  /* Integer rounded range for each bucket */
>  #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, 
> UCLAMP_BUCKETS)
>  
> @@ -994,9 +1013,16 @@ static inline void uclamp_rq_dec_id(struct rq *rq, 
> struct task_struct *p,
>   lockdep_assert_held(&rq->lock);
>  
>   bucket = &uc_rq->bucket[uc_se->bucket_id];
> - SCHED_WARN_ON(!bucket->tasks);
> - if (likely(bucket->tasks))
> - bucket->tasks--;
> +
> + /*
> +  * This could happen if sched_uclamp_used was enabled while the
> +  * current task was running, hence we could end up with unbalanced call
> +  * to uclamp_rq_dec_id().
> +  */
> + if (unlikely(!bucket->tasks))
> + return;
> +
> + bucket->tasks--;
>   uc_se->active = false;

In this chunk you are indeed changing the code.

Are we sure there are not issues with patterns like:

  enqueue(taskA)
  // uclamp gets enabled
  enqueue(taskB)
  dequeue(taskA)
  // bucket->tasks is now 0
  dequeue(taskB)

TaskB has been enqueued with uclamp enabled, thus it
has got uc_se->active=True and enforced its clamp value at RQ level.

But with your change above we don't reset that anymore.

As per my previous proposal: why not just remove the SCHED_WARN_ON?
That's the only real problem in the code above, since we are no longer
guaranteed to have balanced inc/dec calls.
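IOW, something as simple as this (just a sketch):

---8<---
	bucket = &uc_rq->bucket[uc_se->bucket_id];
	/* No SCHED_WARN_ON(): inc/dec can be unbalanced across the
	 * static key flip, so just avoid the counter underflow... */
	if (likely(bucket->tasks))
		bucket->tasks--;
	/* ...while still always clearing the task's active flag */
	uc_se->active = false;
---8<---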

[...]

> +bool uclamp_is_enabled(void)
> +{
> +	return static_branch_likely(&sched_uclamp_used);
> +}
> +

The above and the following changes are not necessary if we just
add the guards at the beginning of uclamp_rq_util_with() instead of what
you do in PATCH1.

I think this will remove more overhead by avoiding unnecessary min/max
clamping for both RT and CFS.
  
> diff --git a/kernel/sched/cpufreq_schedutil.c 
> b/kernel/sched/cpufreq_schedutil.c
> index 7fbaee24c824..3f4e296ccb67 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long 
> util_cfs,
>   unsigned long dl_util, util, irq;
>   struct rq *rq = cpu_rq(cpu);
>  
> - if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
> + if (!uclamp_is_enabled() &&
>   type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
>   return max;
>   }
> diff --git 

Re: [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq

2020-06-26 Thread Patrick Bellasi


On Thu, Jun 25, 2020 at 17:43:51 +0200, Qais Yousef  
wrote...

> struct uclamp_rq was zeroed out entirely in assumption that in the first
> call to uclamp_rq_inc() they'd be initialized correctly in accordance to
> default settings.

Perhaps I was not clear in my previous comment:

   https://lore.kernel.org/lkml/87sgekorfq.derkl...@matbug.net/

when I did say:

   Doesn't this mean the problem is more likely with
   uclamp_rq_util_with(), which should be guarded?

I did not mean that we have to guard the calls to that function but
instead that we should just make that function aware of uclamp being
opted in or not.

> But when next patch introduces a static key to skip
> uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
> will fail to perform any frequency changes because the
> rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
> means all rqs are capped to 0 by default.

The initialization you want to do here is needed because, with the
current approach, you keep calling the same uclamp_rq_util_with() and
keep doing min/max aggregations even when uclamp is not opted in.
But this also means that we do min/max aggregation _when not really
required_.

> Fix it by making sure we do proper initialization at init without
> relying on uclamp_rq_inc() doing it later.

My proposal was as simple as:

---8<---
  static __always_inline
  unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
  struct task_struct *p)
  {
unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
  
+	if (!static_branch_unlikely(&sched_uclamp_used))
+		return rt_task(p) ? uclamp_none(UCLAMP_MAX) : util;
  
if (p) {
min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
max_util = max(max_util, uclamp_eff_value(p, UCLAMP_MAX));
}
  
/*
 * Since CPU's {min,max}_util clamps are MAX aggregated considering
 * RUNNABLE tasks with _different_ clamps, we can end up with an
 * inversion. Fix it now when the clamps are applied.
 */
if (unlikely(min_util >= max_util))
return min_util;
  
return clamp(util, min_util, max_util);
  }
---8<---

Such a small change is more self-contained IMHO and does not remove
existing optimizations like the lazy RQ initialization at first
usage.

Moreover, it can be folded into the following patch, along with all the
other static key shortcuts.



Re: [PATCH v2 2/2] sched/uclamp: Protect uclamp fast path code with static key

2020-06-24 Thread Patrick Bellasi
> 	/*
> 	 * We update all RUNNABLE tasks only when task groups are in use.
> 	 * Otherwise, keep it simple and do just a lazy update at each next
> @@ -1221,6 +1265,9 @@ static void __setscheduler_uclamp(struct task_struct *p,
>   if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
>   return;
>  
> +	if (static_branch_unlikely(&sched_uclamp_unused))
> +		static_branch_disable(&sched_uclamp_unused);
> +
>   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>   uclamp_se_set(>uclamp_req[UCLAMP_MIN],
>     attr->sched_util_min, true);
> @@ -7315,6 +7362,9 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file 
> *of, char *buf,
>   if (req.ret)
>   return req.ret;
>  
> +	if (static_branch_unlikely(&sched_uclamp_unused))
> +		static_branch_disable(&sched_uclamp_unused);
> +
>   mutex_lock(&uclamp_mutex);
>   rcu_read_lock();

Best,
Patrick


Re: [PATCH v2 1/2] sched/uclamp: Fix initialization of strut uclamp_rq

2020-06-24 Thread Patrick Bellasi


Hi Qais,

On Fri, Jun 19, 2020 at 19:20:10 +0200, Qais Yousef  
wrote...

> struct uclamp_rq was zeroed out entirely in assumption that in the first
> call to uclamp_rq_inc() they'd be initialized correctly in accordance to
> default settings.
>
> But when next patch introduces a static key to skip
> uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
> will fail to perform any frequency changes because the
> rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
> means all rqs are capped to 0 by default.

Doesn't this mean the problem is more likely with uclamp_rq_util_with(),
which should be guarded?

Otherwise, we will also keep doing useless min/max aggregations each
time schedutil calls that function, thus not completely removing
uclamp overheads while user-space has not opted in.

What about dropping this and add the guard in the following patch, along
with the others?

> Fix it by making sure we do proper initialization at init without
> relying on uclamp_rq_inc() doing it later.
>
> Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
> Signed-off-by: Qais Yousef 
> Cc: Juri Lelli 
> Cc: Vincent Guittot 
> Cc: Dietmar Eggemann 
> Cc: Steven Rostedt 
> Cc: Ben Segall 
> Cc: Mel Gorman 
> CC: Patrick Bellasi 
> Cc: Chris Redpath 
> Cc: Lukasz Luba 
> Cc: linux-kernel@vger.kernel.org
> ---
>  kernel/sched/core.c | 23 ++-
>  1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a43c84c27c6f..4265861e13e9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1248,6 +1248,22 @@ static void uclamp_fork(struct task_struct *p)
>   }
>  }
>  
> +static void __init init_uclamp_rq(struct rq *rq)
> +{
> + enum uclamp_id clamp_id;
> + struct uclamp_rq *uc_rq = rq->uclamp;
> +
> + for_each_clamp_id(clamp_id) {
> + memset(uc_rq[clamp_id].bucket,
> +0,
> +sizeof(struct uclamp_bucket)*UCLAMP_BUCKETS);
> +
> + uc_rq[clamp_id].value = uclamp_none(clamp_id);
> + }
> +
> + rq->uclamp_flags = 0;
> +}
> +
>  static void __init init_uclamp(void)
>  {
>   struct uclamp_se uc_max = {};
> @@ -1256,11 +1272,8 @@ static void __init init_uclamp(void)
>  
>   mutex_init(&uclamp_mutex);
>  
> - for_each_possible_cpu(cpu) {
> - memset(&cpu_rq(cpu)->uclamp, 0,
> - sizeof(struct uclamp_rq)*UCLAMP_CNT);
> - cpu_rq(cpu)->uclamp_flags = 0;
> - }
> + for_each_possible_cpu(cpu)
> + init_uclamp_rq(cpu_rq(cpu));
>  
>   for_each_clamp_id(clamp_id) {
>   uclamp_se_set(&init_task.uclamp_req[clamp_id],



[SchedulerWakeupLatency] Per-task vruntime wakeup bonus

2020-06-23 Thread Patrick Bellasi


On Tue, Jun 23, 2020 at 09:29:03 +0200, Patrick Bellasi 
 wrote...

> .:: Scheduler Wakeup Path Requirements Collection Template
> ==
>
> A) Name

Runtime tunable vruntime wakeup bonus.


> B) Target behavior

All SCHED_OTHER tasks get the same (max) vruntime wakeup bonus. This
bonus affects the chance the task has to preempt the currently running
task. Some tasks, which are (known to be) latency tolerant, should have
a smaller chance to preempt a (known to be) latency sensitive task. To
the contrary, latency sensitive tasks should have a higher chance to
preempt a currently running latency tolerant task.

This task specific distinction is not provided by the current
implementation and all SCHED_OTHER tasks are handled according to the
same simple, system-wide and not run-time tunable policy.


> C) Existing control paths

Assuming:

 C: CFS task currently running on CPUx
 W: CFS task waking up on the same CPUx

And considering the overall simplified workflow:

core::try_to_wake_up()

  // 1) Select on which CPU W will run
  core::select_task_rq()
fair::select_task_rq_fair()

  // 2) Enqueue W on the selected CPU
  core::ttwu_queue()
core::ttwu_do_activate()
  core::activate_task()
core::enqueue_task()
  fair::enqueue_task_fair()
fair::enqueue_entity()

  // 3) Set W's vruntime bonus
  fair::place_entity()
se->vruntime = ...

  // 4) Check if C can be preempted by W
  core::ttwu_do_wakeup()
core::check_preempt_curr()
  fair::check_preempt_curr()
fair::check_preempt_wakeup(curr, se)
  fair::wakeup_preempt_entity(curr, se)
vdiff = curr.vruntime - se.vruntime
return vdiff > wakeup_gran(se)

We see that W preempts C iff:

   vdiff > wakeup_gran(se)

Since:

enqueue_entity(cfs_rq, se, flags)
  place_entity(cfs_rq, se, initial=0)
thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
vruntime = cfs_rq->min_vruntime - thresh
se->vruntime = max_vruntime(se->vruntime, vruntime)

a waking task's W.vruntime can get a "vruntime bonus" up to:
 - 1/2 scheduler latency (w/  GENTLE_FAIR_SLEEPERS)
 - 1   scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
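As a worked example: with sysctl_sched_latency = 24ms (e.g. the
log-scaled default on an 8-CPU system), a task waking from a long sleep
gets placed up to 12ms (w/ GENTLE_FAIR_SLEEPERS) or 24ms (w/o) behind
cfs_rq->min_vruntime; that margin is what it can spend to pass the
wakeup_gran() check and preempt the current task.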


> D) Desired behavior

The "vruntime bonus" (thresh) computed in place_entity() should have a
per-task definition, which defaults to the current implementation.

A bigger vruntime bonus can be configured for latency sensitive tasks.
A smaller vruntime bonus can be configured for latency tolerant tasks.

TL;DR

The "vruntime bonus" is meant to give sleepers a compensation for the
service deficit due to them not having (possibly) fully consumed their
assigned fair CPU quota within the current sched_latency interval, see:

  commit 51e0304ce6e5 ("sched: Implement a gentler fair-sleepers feature")

The scheduler does that based on a conservative assumption: when a task
sleeps it gives up a portion (P) of its fair CPU bandwidth share in the
current sched_latency period.
Striving to be FAIR, i.e. to give each task a FAIR quota of the CPU in each
sched_latency period, the scheduler wants to give back P to waking
tasks.

However, striving to minimize overheads and complexity, the CFS
scheduler does that using a simple heuristic: each task waking up gets a
bonus, which is capped at one sched_latency period, independently of
"its nature".

What the scheduler completely disregards is that being completely FAIR
is not always necessary. Depending on the nature of a task, not all
tasks require a bonus. To the contrary:

 - a GENTLE_FAIR_SLEEPERS bonus given to a background task could result
   in preempting a latency sensitive currently running task

 - giving only 1/2 scheduler latency bonus to a latency sensitive task
   could result in that task being preempted before it completes its
   current activation.


> E) Existing knobs

The SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true) defined vruntime bonus value
can be considered the current mainline default value.

This means that "all" CFS tasks waking up will get a

   0.5 * sysctl_sched_latency

vruntime bonus wrt the cfs_rq->min_vruntime.


> F) Existing knobs limitations

GENTLE_FAIR_SLEEPERS is a system-wide knob and it's not run-time
tunable on production systems (being a SCHED_DEBUG feature).

Thus, the sched_feature should be removed and replaced by a per-task
knob.


> G) Proportionality Analysis

The value of the vruntime bonus directly affects the chance a task has
to preempt the currently running task.

Indeed, from the code analysis in C:

  thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)

is the "wakeup bonus", which is used as:

  vruntime = cfs_rq->min_vruntime - thresh
  se->vruntime = max_vruntime(se->vruntime, vruntime)
  vdiff = curr.vruntime - se.vruntime

Scheduler wakeup path tuning surface: Use-Cases and Requirements

2020-06-23 Thread Patrick Bellasi


Since last year's OSPM Summit we started conceiving the idea that task
wakeup path could be better tuned for certain classes of workloads
and usage scenarios. Various people showed interest for a possible
tuning interface for the scheduler wakeup path.


.:: The Problem
===

The discussions we had so far [1] have not been effective in clearly
identifying if a common tuning surface is possible. The last discussion
at this year's OSPM Summit [2,3] was also kind of inconclusive and left
us with the message: start by collecting the requirements and then see
what interface fits them the best.

General consensus is that a unified interface can be challenging and
maybe not feasible. However, generalisation is also a value
and we should strive for it whenever it's possible.

Someone might think that we did not put enough effort in the analysis of
requirements. Perhaps what we missed so far is also a structured and
organised way to collect requirements which also can help in factoring
out the things they have in common.


.:: The Proposal


This thread aims at providing a guided template for the description of
different task wakeup use-cases. It does that by setting a series of
questions aimed at precisely describing what's "currently broken", what
we would like to have instead and how we could achieve it.

What we propose here is that, for each wakeup use-case, we start
by replying to this email to provide the required details/comments for
a predefined list of questions. This will generate independent
discussion threads. Each thread will be free to focus on a specific
proposal but still all the thread will be reasoning around a common set
of fundamental concepts.

The hope is that, by representing all the use-cases as sufficiently
detailed responses to a common set of questions, once the discussion
settles down, we can more easily verify if there are common things
surfacing which then can be naturally wrapped into a unified user-space
API.

A first use-case description, following the template guidelines, will
be posted shortly after this message. This also will provide an example
for how to use the template.

NOTE: Whenever required, pseudo-code or simplified C can be used.

I hope this can drive a fruitful discussion in preparation for LPC!

Best,
Patrick


---8<--- For templates submissions: reply only to the following ---8<---


.:: Scheduler Wakeup Path Requirements Collection Template
==

A) Name: unique one-liner name for the proposed use-case

B) Target behaviour: one paragraph to describe the wakeup path issue

C) Existing control paths: reference to code paths to justify B)

   Assuming v5.6 as the reference kernel, this section should provide
   links to code paths such as, e.g.

   fair.c:3917
   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/fair.c?h=v5.6#n3917

   Alternatively code snippets can be added inline, e.g.

/*
 * The 'current' period is already promised to the current tasks,
 * however the extra weight of the new task will slow them down a
 * little, place the new task so that it fits in the slot that
 * stays open at the end.
 */
if (initial && sched_feat(START_DEBIT))
vruntime += sched_vslice(cfs_rq, se);

   NOTE: if the use-case exists only outside the mainline Linux kernel
 this section can stay empty

D) Desired behaviour: one paragraph to describe the desired update

   NOTE: the current mainline expression is assumed to be correct
 for existing use-cases. Thus, here we are looking for run-time
 tuning of those existing features.

E) Existing knobs (if any): reference to whatever existing tunable

   Some features can already be tuned, but perhaps only via compile time
   knobs, SCHED_FEATs or system wide tunable.
   If that's the case, we should document them here and explain how they
   currently work and what are (if any) the implicit assumptions, e.g.
   what is the currently implemented scheduling policy/heuristic.

F) Existing knobs (if any): one paragraph description of the limitations

   If the existing knobs are not up to the job for this use-case,
   shortly explain here why. It could be because a tuning surface is
   already there but it's hardcoded (e.g. compile time knob) or too
   coarse grained (e.g. a SCHED_FEAT).

G) Proportionality Analysis: check the nature of the target behavior

   Goal here is to verify and discuss if the behaviour (B) has a
   proportional nature: different values of the control knobs (E) are
   expected to produce different effects for (B).
   
   Special care should be taken to check if the target behaviour has an
   intrinsically "binary nature", i.e. only two values make really
   sense. In this case it would be very useful to argument why a
   generalisation towards a non-binary behaviour still makes sense.

[PATCH] ARM: dts: stm32: cosmetic update in stm32mp15-pinctrl.dtsi

2020-06-16 Thread Patrick Delaunay
Use tabs where possible and remove multiple blank lines.

Signed-off-by: Patrick Delaunay 
---

 arch/arm/boot/dts/stm32mp15-pinctrl.dtsi | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi 
b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
index 7eb858732d6d..7d351757f2f8 100644
--- a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
+++ b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
@@ -210,8 +210,8 @@
 , /* ETH_RGMII_TXD3 */
 , /* ETH_RGMII_TX_CTL */
 , /* ETH_MDC */
-, /* ETH_MDIO */
-, /* ETH_RGMII_RXD0 */
+, /* ETH_MDIO */
+, /* ETH_RGMII_RXD0 */
 , /* ETH_RGMII_RXD1 */
 , /* ETH_RGMII_RXD2 */
 , /* ETH_RGMII_RXD3 */
@@ -453,7 +453,7 @@
i2c5_pins_b: i2c5-1 {
pins {
pinmux = , /* I2C5_SCL */
-; /* I2C5_SDA */
+; /* I2C5_SDA */
bias-disable;
drive-open-drain;
slew-rate = <0>;
@@ -463,7 +463,7 @@
i2c5_sleep_pins_b: i2c5-sleep-1 {
pins {
pinmux = , /* I2C5_SCL */
-; /* I2C5_SDA */
+; /* I2C5_SDA */
};
};
 
@@ -1072,7 +1072,6 @@
};
};
 
-
sai2a_pins_b: sai2a-1 {
pins1 {
pinmux = ,  /* SAI2_SD_A */
-- 
2.17.1



Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-06-05 Thread Patrick Bellasi


On Fri, Jun 05, 2020 at 13:32:04 +0200, Qais Yousef  
wrote...

> On 06/05/20 09:55, Patrick Bellasi wrote:
>> On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef  
>> wrote...

[...]

>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 0464569f26a7..9f48090eb926 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, 
>> > struct task_struct *p,
>> >  * e.g. due to future modification, warn and fixup the expected 
>> > value.
>> >  */
>> > SCHED_WARN_ON(bucket->value > rq_clamp);
>> > +#if 0
>> > if (bucket->value >= rq_clamp) {
>> > bkt_clamp = uclamp_rq_max_value(rq, clamp_id, 
>> > uc_se->value);
>> > WRITE_ONCE(uc_rq->value, bkt_clamp);
>> > }
>> > +#endif
>> 
>> Yep, that's likely where we have most of the overhead at dequeue time,
>> since _sometimes_ we need to update the cpu's clamp value.
>> 
>> However, while running perf sched pipe, I expect:
>>  - all tasks to have the same clamp value
>>  - all CPUs to have _always_ at least one RUNNABLE task
>> 
>> Given these two conditions above, if the CPU is never "CFS idle" (i.e.
>> without RUNNABLE CFS tasks), the code above should never be triggered.
>> More on that later...
>
> So the cost is only incurred by idle cpus is what you're saying.

Not really, you pay the cost every time you need to reduce the CPU clamp
value. This can happen also on a busy CPU but only when you dequeue the
last task defining the current uclamp(cpu) value and the remaining
RUNNABLE tasks have a lower value.

>> >  }
>> >
>> >  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>> >
>> >
>> >
>> > uclamp_rq_max_value() could be expensive as it loops over all buckets.
>> 
>> It loops over UCLAMP_CNT values which are defined to fit into a single
>
> I think you meant to say UCLAMP_BUCKETS, which is defined as 5 by default.

Right, UCLAMP_BUCKETS.

>> $L. That was the optimal space/time complexity compromise we found to
>> get the MAX of a set of values.
>
> It actually covers two cachelines, see below and my other email to
> Mel.

The two cache lines are covered if you consider both min and max clamps.
One single CLAMP_ID has a _size_ which fits into a single cache line.

However, to be precise:
- while uclamp_min spans a single cache line, uclamp_max is likely
  across two
- at enqueue/dequeue time we update both min/max, thus we can touch
  both cache lines
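
For reference, a rough size estimate (assuming the current layout, where
each struct uclamp_bucket packs value and tasks into one unsigned long):
one clamp_id's struct uclamp_rq is 4B (value) + padding + 5 * 8B
(buckets) ~= 48B, so uclamp_min can fit a single 64B line, while the two
clamp ids together take ~96B, which is why uclamp_max spills into a
second line.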

>> > Commenting this whole path out strangely doesn't just 'fix' it,
>> > but produces better results than the no-uclamp kernel :-/
>> >
>> > # ./perf bench -r 20 sched pipe -T -l 5
>> > Without uclamp:5039
>> > With uclamp:   4832
>> > With uclamp+patch: 5729
>> 
>> I explain it below: with that code removed you never decrease the CPU's
>> uclamp value. Thus, the first time you schedule an RT task you go to MAX
>> OPP and stay there forever.
>
> Okay.
>
>> 
>> > It might be because schedutil gets biased differently by uclamp..? If I 
>> > move to
>> > performance governor these numbers almost double.
>> >
>> > I don't know. But this promoted me to look closer and
>> 
>> To summarize, when a task is dequeued we can have only these cases:
>> 
>> - uclamp(task) < uclamp(cpu):
>>   this happens when the task was co-scheduled with other tasks with
>>   higher clamp values which are still RUNNABLE.
>>   In this case there are no uclamp(cpu) updates.
>> 
>> - uclamp(task) == uclamp(cpu):
>>   this happens when the task was one of the tasks defining the current
>>   uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
>>   tasks clamp values.
>> 
>> In this last case we do _not_ always need to do a uclamp(cpu) update.
>> Indeed the update is required _only_ when that task was _the last_ task
>> defining the current uclamp(cpu) value.
>> 
>> In this case we use uclamp_rq_max_value() to do a linear scan of
>> UCLAMP_CNT values which fits into a single cache line.
>
> Again, I think you mean UCLAMP_BUCKETS here. Unless I missed something, they
> span 2 cachelines on 64-bit machines with a 64-byte cacheline size.

Correct:
- s/UCLAMP_CNT/UCLAMP_BUCKETS/
- 1 cacheline per CLAMP_ID
- the array scan works on 1 CLAMP_ID:
  - spanning 1 cache line for uclamp_min
  - spanning 2 cache lines for uclamp_max

Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-06-05 Thread Patrick Bellasi


Hi Qais,

On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef  
wrote...

> On 06/03/20 16:59, Vincent Guittot wrote:
>> When I want to stress the fast path i usually use "perf bench sched pipe -T "
>> The tip/sched/core on my arm octo core gives the following results for
>> 20 iterations of perf bench sched pipe -T -l 5
>> 
>> all uclamp config disabled  50035.4(+/- 0.334%)
>> all uclamp config enabled  48749.8(+/- 0.339%)   -2.64%

I used to run the same test but I don't remember such big numbers :/

>> It's quite easy to reproduce and probably easier to study the impact
>
> Thanks Vincent. This is very useful!
>
> I could reproduce that on my Juno.
>
> One of the codepath I was suspecting seems to affect it.
>
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..9f48090eb926 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, 
> struct task_struct *p,
>  * e.g. due to future modification, warn and fixup the expected value.
>  */
> SCHED_WARN_ON(bucket->value > rq_clamp);
> +#if 0
> if (bucket->value >= rq_clamp) {
> bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> WRITE_ONCE(uc_rq->value, bkt_clamp);
> }
> +#endif

Yep, that's likely where we have most of the overhead at dequeue time,
since _sometimes_ we need to update the cpu's clamp value.

However, while running perf sched pipe, I expect:
 - all tasks to have the same clamp value
 - all CPUs to have _always_ at least one RUNNABLE task

Given these two conditions above, if the CPU is never "CFS idle" (i.e.
without RUNNABLE CFS tasks), the code above should never be triggered.
More on that later...

>  }
>
>  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
>
>
> uclamp_rq_max_value() could be expensive as it loops over all buckets.

It loops over UCLAMP_CNT values which are defined to fit into a single
$L. That was the optimal space/time complexity compromise we found to
get the MAX of a set of values.

> Commenting this whole path out strangely doesn't just 'fix' it,
> but produces better results than the no-uclamp kernel :-/
>
> # ./perf bench -r 20 sched pipe -T -l 5
> Without uclamp:   5039
> With uclamp:  4832
> With uclamp+patch:5729

I explain it below: with that code removed you never decrease the CPU's
uclamp value. Thus, the first time you schedule an RT task you go to MAX
OPP and stay there forever.

> It might be because schedutil gets biased differently by uclamp..? If I move 
> to
> performance governor these numbers almost double.
>
> I don't know. But this promoted me to look closer and

To summarize, when a task is dequeued we can have only these cases:

- uclamp(task) < uclamp(cpu):
  this happens when the task was co-scheduled with other tasks with
  higher clamp values which are still RUNNABLE.
  In this case there are no uclamp(cpu) updates.

- uclamp(task) == uclamp(cpu):
  this happens when the task was one of the tasks defining the current
  uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
  tasks clamp values.

In this last case we do _not_ always need to do a uclamp(cpu) update.
Indeed the update is required _only_ when that task was _the last_ task
defining the current uclamp(cpu) value.

In this case we use uclamp_rq_max_value() to do a linear scan of
UCLAMP_CNT values which fits into a single cache line.

> I think I spotted a bug where in the if condition we check for '>='
> instead of '>', causing us to take the supposedly impossible fail safe
> path.

The fail safe path is when the '>' condition matches, which is what the
SCHED_WARN_ON tell us. Indeed, we never expect uclamp(cpu) to be bigger
than one of its RUNNABLE tasks. If that should happen we WARN and fix
the cpu clamp value for the best.

The normal path is instead '==' and, according to my previous summary,
it's expected to be executed _only_ when we dequeue the last task of the
clamp group defining the current uclamp(cpu) value.

> Mind trying with the below patch please?
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..50d66d4016ff 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,7 +1063,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, 
> struct task_struct *p,
>  * e.g. due to future modification, warn and fixup the expected value.
>  */
> SCHED_WARN_ON(bucket->value > rq_clamp);
> -   if (bucket->value >= rq_clamp) {
> +   if (bucket->value > rq_clamp) {
> bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> WRITE_ONCE(uc_rq->value, bkt_clamp);
> }

This patch is thus bogus, since we never expect to have uclamp(cpu)
bigger than uclamp(task) and thus we will never reset the clamp value of
a cpu.



Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-06-03 Thread Patrick Bellasi


Hi Dietmar,
thanks for sharing these numbers.

On Tue, Jun 02, 2020 at 18:46:00 +0200, Dietmar Eggemann 
 wrote...

[...]

> I ran these tests on 'Ubuntu 18.04 Desktop' on Intel E5-2690 v2
> (2 sockets * 10 cores * 2 threads) with powersave governor as:
>
> $ numactl -N 0 ./run-mmtests.sh XXX

Great setup, it's worth ruling out all possible noise sources (freq
scaling, thermal throttling, NUMA scheduler, etc.).
Wondering if disabling HT can also help here in reducing results "noise"?

> w/ config-network-netperf-unbound.
>
> Running w/o 'numactl -N 0' gives slightly worse results.
>
> without-clamp  : CONFIG_UCLAMP_TASK is not set
> with-clamp : CONFIG_UCLAMP_TASK=y,
>  CONFIG_UCLAMP_TASK_GROUP is not set
> with-clamp-tskgrp  : CONFIG_UCLAMP_TASK=y,
>  CONFIG_UCLAMP_TASK_GROUP=y
>
>
> netperf-udp
>                        ./5.7.0-rc7        ./5.7.0-rc7          ./5.7.0-rc7
>                      without-clamp         with-clamp    with-clamp-tskgrp

Can you please specify how to read the following scores? I gave it a run
with my local netperf and it reports Throughput, thus I would expect the
higher the better... but... this seems to be something different.

> Hmean send-64       153.62 (   0.00%)    151.80 *  -1.19%*      155.60 *   1.28%*
> Hmean send-128      306.77 (   0.00%)    306.27 *  -0.16%*      309.39 *   0.85%*
> Hmean send-256      608.54 (   0.00%)    604.28 *  -0.70%*      613.42 *   0.80%*
> Hmean send-1024    2395.80 (   0.00%)   2365.67 *  -1.26%*     2409.50 *   0.57%*
> Hmean send-2048    4608.70 (   0.00%)   4544.02 *  -1.40%*     4665.96 *   1.24%*
> Hmean send-3312    7223.97 (   0.00%)   7158.88 *  -0.90%*     7331.23 *   1.48%*
> Hmean send-4096    8729.53 (   0.00%)   8598.78 *  -1.50%*     8860.47 *   1.50%*
> Hmean send-8192   14961.77 (   0.00%)  14418.92 *  -3.63%*    14908.36 *  -0.36%*
> Hmean send-16384  25799.50 (   0.00%)  25025.64 *  -3.00%*    25831.20 *   0.12%*

If I read it as the lower the score the better, all the above results
tell us that with-clamp is even better, while with-clamp-tskgrp
is not that much worse.

The other way around (the higher the score the better) would look odd
since we definitively add in more code and complexity when uclamp has
the TG support enabled we would not expect better scores.

> Hmean recv-64       153.62 (   0.00%)    151.80 *  -1.19%*      155.60 *   1.28%*
> Hmean recv-128      306.77 (   0.00%)    306.27 *  -0.16%*      309.39 *   0.85%*
> Hmean recv-256      608.54 (   0.00%)    604.28 *  -0.70%*      613.42 *   0.80%*
> Hmean recv-1024    2395.80 (   0.00%)   2365.67 *  -1.26%*     2409.50 *   0.57%*
> Hmean recv-2048    4608.70 (   0.00%)   4544.02 *  -1.40%*     4665.95 *   1.24%*
> Hmean recv-3312    7223.97 (   0.00%)   7158.88 *  -0.90%*     7331.23 *   1.48%*
> Hmean recv-4096    8729.53 (   0.00%)   8598.78 *  -1.50%*     8860.47 *   1.50%*
> Hmean recv-8192   14961.61 (   0.00%)  14418.88 *  -3.63%*    14908.30 *  -0.36%*
> Hmean recv-16384  25799.39 (   0.00%)  25025.49 *  -3.00%*    25831.00 *   0.12%*
>
> netperf-tcp
>
> Hmean 64            818.65 (   0.00%)    812.98 *  -0.69%*      826.17 *   0.92%*
> Hmean 128          1569.55 (   0.00%)   1555.79 *  -0.88%*     1586.94 *   1.11%*
> Hmean 256          2952.86 (   0.00%)   2915.07 *  -1.28%*     2968.15 *   0.52%*
> Hmean 1024        10425.91 (   0.00%)  10296.68 *  -1.24%*    10418.38 *  -0.07%*
> Hmean 2048        17454.51 (   0.00%)  17369.57 *  -0.49%*    17419.24 *  -0.20%*
> Hmean 3312        22509.95 (   0.00%)      9.69 *  -1.25%*    22373.32 *  -0.61%*
> Hmean 4096        25033.23 (   0.00%)  24859.59 *  -0.69%*    24912.50 *  -0.48%*
> Hmean 8192        32080.51 (   0.00%)  31744.51 *  -1.05%*    31800.45 *  -0.87%*
> Hmean 16384       36531.86 (   0.00%)  37064.68 *   1.46%*    37397.71 *   2.37%*
>
> The diffs are smaller than on openSUSE Leap 15.1 and some of the
> uclamp taskgroup results are better?
>
> With this test setup we now can play with the uclamp code in
> enqueue_task() and dequeue_task().
>
> ---
>
> W/ config-network-netperf-unbound (only netperf-udp and buffer size 64):
>
> $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp/perf.data 
> | grep activate_task
>
> # Event 'cycles:ppp'
> #
> # Baseline  Delta Abs  Shared ObjectSymbol
>
>  0.02% +0.54%  [kernel.vmlinux] [k] activate_task
>  0.02% +0.38%  [kernel.vmlinux] [k] deactivate_task
>
> $ perf diff 5.7.0-rc7_without-clamp/perf.data 
> 5.7.0-rc7_with-clamp-tskgrp/perf.data | grep activate_task
>
>  0.02% +0.35%  [kernel.vmlinux] [k] 

Re: [PATCH v2] hwmon:(adm1275) Enable adm1278 ADM1278_TEMP1_EN

2020-05-29 Thread Patrick Williams
Hi Guenter,

Thanks for the initial look at this.

One question for you below...

On Fri, May 29, 2020 at 10:30:16AM -0700, Guenter Roeck wrote:
> On 5/29/20 5:46 AM, Manikandan Elumalai wrote:
> > +   /* Enable TEMP1 by default */
> > +   config |= ADM1278_TEMP1_EN;
> > +   ret = i2c_smbus_write_byte_data(client,
> > +   ADM1275_PMON_CONFIG,
> > +   config);
> > +   if (ret < 0) {
> > +   dev_err(&client->dev,
> > +   "Failed to enable temperature config\n");
> > +   return -ENODEV;
> > +   }
> 
> This can be handled in a single operation, together with ADM1278_VOUT_EN
> below. There is no need for two separate write operations.

I don't know if you noticed here but the change ends up enabling
TEMP1_EN in all cases.  Is this acceptable?  If not, do you have any
preference on how it is selected for enablement?
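
FWIW, a minimal sketch of the single read-modify-write Guenter suggests
(leaving your question of whether TEMP1_EN may be set unconditionally
open):

---8<---
	config = i2c_smbus_read_byte_data(client, ADM1275_PMON_CONFIG);
	if (config < 0)
		return config;

	/* One write for both enables */
	config |= ADM1278_TEMP1_EN | ADM1278_VOUT_EN;
	ret = i2c_smbus_write_byte_data(client, ADM1275_PMON_CONFIG, config);
	if (ret < 0) {
		dev_err(&client->dev, "Failed to update PMON config\n");
		return -ENODEV;
	}
---8<---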

> > /* Enable VOUT if not enabled (it is disabled by default) */
> > if (!(config & ADM1278_VOUT_EN)) {
> > @@ -681,9 +697,6 @@ static int adm1275_probe(struct i2c_client *client,
> > }
> > }
> >  
> > -   if (config & ADM1278_TEMP1_EN)
> > -   info->func[0] |=
> > -   PMBUS_HAVE_TEMP | PMBUS_HAVE_STATUS_TEMP;
> > if (config & ADM1278_VIN_EN)
> > info->func[0] |= PMBUS_HAVE_VIN;
> > break;
> > 
> 

-- 
Patrick Williams


signature.asc
Description: PGP signature


Re: [PATCH] hwmon:(adm1275) Enable adm1278 ADM1278_TEMP1_EN

2020-05-29 Thread Patrick Williams
On Thu, May 28, 2020 at 07:45:23PM +0530, Manikandan Elumalai wrote:

Hi Manikandan,

Adding the PMBus maintainers...

> 
> The adm1278 temperature sysfs attribute is needed for one of our openbmc
> platforms.
> This functionality is not enabled by default, so PMON_CONFIG needs to be
> modified in order to enable it.

Vijay already mentioned the Signed-off-by here.

Since this is a kernel patch and your first time contributing one, please
read through:
https://www.kernel.org/doc/html/latest/process/1.Intro.html
and the MAINTAINERS file.  Another thing you've missed is using the
get_maintainer.pl script to find out who you're supposed to CC.  It is
fine to have additional CCs but we're missing the pmbus maintainer on
this patch.

> 
> ---
>  drivers/hwmon/pmbus/adm1275.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/hwmon/pmbus/adm1275.c b/drivers/hwmon/pmbus/adm1275.c
> index 5caa37fb..47b293d 100644
> --- a/drivers/hwmon/pmbus/adm1275.c
> +++ b/drivers/hwmon/pmbus/adm1275.c
> @@ -681,6 +681,21 @@ static int adm1275_probe(struct i2c_client *client,
>   }
>   }
>  
> + config = i2c_smbus_read_byte_data(client, ADM1275_PMON_CONFIG);
> + if (config < 0)
> + return config;
> +
> + /* Enable TEMP1 by default */
> + config |= ADM1278_TEMP1_EN;
> + ret = i2c_smbus_write_byte_data(client,
> + ADM1275_PMON_CONFIG,
> + config);
> + if (ret < 0) {
> + dev_err(&client->dev,
> + "Failed to enable temperature config\n");
> + return -ENODEV;
> + }
> +

This code might work for your design, but likely doesn't work for
everyone and isn't likely to be accepted in its current state.  I think
you need some kind of detection logic here to know if TEMP1_EN *should*
be enabled.  Do we need a device-tree entry for this?
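
For example, something like this (the property name below is made up,
purely to illustrate the idea):

---8<---
	/* Hypothetical DT opt-in instead of enabling TEMP1 everywhere */
	if (device_property_read_bool(&client->dev, "adi,enable-temp1"))
		config |= ADM1278_TEMP1_EN;
---8<---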


>   if (config & ADM1278_TEMP1_EN)
>   info->func[0] |=
>   PMBUS_HAVE_TEMP | PMBUS_HAVE_STATUS_TEMP;
> -- 
> 2.7.4
> 

-- 
Patrick Williams


signature.asc
Description: PGP signature


Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-05-28 Thread Patrick Bellasi


[+Giovanni]

On Thu, May 28, 2020 at 20:29:14 +0200, Peter Zijlstra  
wrote...

> On Thu, May 28, 2020 at 05:51:31PM +0100, Qais Yousef wrote:

>> I had a humble try to catch the overhead but wasn't successful. The 
>> observation
>> wasn't missed by us too then.
>
> Right, I remember us doing benchmarks when we introduced all this and
> clearly we missed something. I would be good if Mel can share which
> benchmark hurt most so we can go have a look.

Indeed, it would be great to have a description of their test setup and
results. Perhaps Giovanni can also support us on that.



Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-05-15 Thread Patrick Bellasi


Hi Qais,
I see we are converging toward the final shape. :)

Function wise code looks ok to me now.

Lemme just point out few more remarks and possible nit-picks.
I guess at the end it's up to you to decide if you wanna follow up with
a v6 and to the maintainers to decide how picky they wanna be.

Otherwise, FWIW, feel free to consider this a LGTM.

Best,
Patrick

On Mon, May 11, 2020 at 17:40:52 +0200, Qais Yousef  
wrote...

[...]

> +static inline void uclamp_sync_util_min_rt_default(struct task_struct *p,
> +enum uclamp_id clamp_id)
> +{
> + unsigned int default_util_min = sysctl_sched_uclamp_util_min_rt_default;
> + struct uclamp_se *uc_se;
> +
> + /* Only sync for UCLAMP_MIN and RT tasks */
> + if (clamp_id != UCLAMP_MIN || !rt_task(p))
> + return;
> +
> + uc_se = &p->uclamp_req[UCLAMP_MIN];

I went back to the v3 version, where this was done above:

   https://lore.kernel.org/lkml/20200429113255.ga19...@codeaurora.org/
   Message-ID: 20200429113255.ga19...@codeaurora.org

and I still don't see why we want to keep it after this first check.

IMO it's just not required and it makes the code a tiny bit uglier.

> +
> + /*
> +  * Only sync if user didn't override the default request and the sysctl
> +  * knob has changed.
> +  */
> + if (uc_se->user_defined || uc_se->value == default_util_min)
> + return;
> +

nit-pick: the two comments above are stating the obvious.

> + uclamp_se_set(uc_se, default_util_min, false);
> +}
> +
>  static inline struct uclamp_se
>  uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> @@ -907,8 +949,13 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id 
> clamp_id)
>  static inline struct uclamp_se
>  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> - struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
> - struct uclamp_se uc_max = uclamp_default[clamp_id];
> + struct uclamp_se uc_req, uc_max;
> +
> + /* Sync up any change to sysctl_sched_uclamp_util_min_rt_default. */

same here: the comment is stating the obvious.

Maybe just by using a different function name we could better document
the code, e.g. uclamp_rt_restrict(p, clamp_id);

This would implicitly convey the purpose: RT tasks can be somehow
further restricted, i.e. in addition to the TG restriction that follows.


> + uclamp_sync_util_min_rt_default(p, clamp_id);
> +
> + uc_req = uclamp_tg_restrict(p, clamp_id);
> + uc_max = uclamp_default[clamp_id];
>  
>   /* System default restrictions always apply */
>   if (unlikely(uc_req.value > uc_max.value))

[...]



Re: file system permissions regression affecting root

2020-05-13 Thread Patrick Donnelly
On Wed, May 13, 2020 at 9:11 AM Al Viro  wrote:
>
> On Wed, May 13, 2020 at 08:00:28AM -0700, Patrick Donnelly wrote:
> > In newer kernels (at least 5.6), it appears root is not able to write
> > to files owned by other users in a sticky directory:
>
> Yes.  Controlled by /proc/sys/fs/protected_regular, which systemd crowd
> has decided to enable in commit 2732587540035227fe59e4b64b60127352611b35
>  [...]

Thanks for the information Al!

However, it seems odd that this depends on the owner of the directory,
i.e. this protection only seems to be enforced if the sticky directory
is owned by root. Is that expected?


-- 
Patrick Donnelly


file system permissions regression affecting root

2020-05-13 Thread Patrick Donnelly
> gtmp/foo"],
0x7fff588732f0 /* 15 vars */) = 0
openat(AT_FDCWD, "gtmp/foo", O_WRONLY|O_CREAT|O_TRUNC, 0666) = -1
EACCES (Permission denied)
write(2, "/bin/sh: gtmp/foo: Permission de"..., 37/bin/sh: gtmp/foo:
Permission denied


-- 
Patrick Donnelly


Re: [PATCH v4 2/2] Documentation/sysctl: Document uclamp sysctl knobs

2020-05-11 Thread Patrick Bellasi


Hi Qais,

On Tue, May 05, 2020 at 16:56:37 +0200, Qais Yousef  
wrote...

>> > +sched_util_clamp_min_rt_default:
>> > +
>> > +
>> > +By default Linux is tuned for performance. Which means that RT tasks 
>> > always run
>> > +at the highest frequency and most capable (highest capacity) CPU (in
>> > +heterogeneous systems).
>> > +
>> > +Uclamp achieves this by setting the requested uclamp.min of all RT tasks 
>> > to
>> > +SCHED_CAPACITY_SCALE (1024) by default, which effectively boosts the 
>> > tasks to
>> > +run at the highest frequency and biases them to run on the biggest CPU.
>> > +
>> > +This knob allows admins to change the default behavior when uclamp is 
>> > being
>> > +used. In battery powered devices particularly, running at the maximum
>> > +capacity and frequency will increase energy consumption and shorten the 
>> > battery
>> > +life.
>> > +
>> > +This knob is only effective for RT tasks which the user hasn't modified 
>> > their
>> > +requested uclamp.min value via sched_setattr() syscall.
>> > +
>> > +This knob will not escape the constraint imposed by sched_util_clamp_min
>> > +defined above.
>> 
>> Perhaps it's worth to specify that this value is going to be clamped by
>> the values above? Otherwise it's a bit ambiguous to know what happen
>> when it's bigger than schedu_util_clamp_min.
>
> Hmm for me that sentence says exactly what you're asking for.
>
> So what you want is
>
>   s/will not escape the constraint imposed by/will be clamped by/
>
> ?
>
> I'm not sure if this will help if the above is already ambiguous. Maybe if
> I explicitly say
>
>   ..will not escape the *range* constrained imposed by..
>
> sched_util_clamp_min is already defined as a range constraint, so hopefully it
> should hit the mark better now?

Right, that also can work.

>> 
>> > +Any modification is applied lazily on the next opportunity the scheduler 
>> > needs
>> > +to calculate the effective value of uclamp.min of the task.
>> ^
>> 
>> This is also an implementation detail, I would remove it.
>
> The idea is that this value is not updated 'immediately'/synchronously. So
> currently RUNNING tasks will not see the effect, which could generate 
> confusion
> when users trip over it. IMO giving an idea of how it's updated will help with
> expectation of the users. I doubt any will care, but I think it's an important
> behavior element that is worth conveying and documenting. I'd be happy to
> reword it if necessary.

Right, I agree on giving an hint on the lazy update. What I was pointing
out was mainly the reference to the 'effective' value. Maybe we can just
drop that word.

> I have this now
>
> """
>  984 This knob will not escape the range constraint imposed by sched_util_clamp_min
>  985 defined above.
>  986
>  987 For example if
>  988
>  989 sched_util_clamp_min_rt_default = 800
>  990 sched_util_clamp_min = 600
>  991
>  992 Then the boost will be clamped to 600 because 800 is outside of the permissible
>  993 range of [0:600]. This could happen for instance if a powersave mode will
>  994 restrict all boosts temporarily by modifying sched_util_clamp_min. As soon as
>  995 this restriction is lifted, the requested sched_util_clamp_min_rt_default
>  996 will take effect.
>  997
>  998 Any modification is applied lazily to currently running tasks and should be
>  999 visible by the next wakeup.
> """

That's better IMHO, would just slightly change the last sentence to:

   Any modification is applied lazily to tasks and is effective
   starting from their next wakeup.

Best,
Patrick



Re: [PATCH v4 2/2] Documentation/sysctl: Document uclamp sysctl knobs

2020-05-03 Thread Patrick Bellasi


Hi Qais,

On Fri, May 01, 2020 at 13:49:27 +0200, Qais Yousef  
wrote...

[...]

> diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
> b/Documentation/admin-guide/sysctl/kernel.rst
> index 0d427fd10941..521c18ce3d92 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -940,6 +940,54 @@ Enables/disables scheduler statistics. Enabling this 
> feature
>  incurs a small amount of overhead in the scheduler but is
>  useful for debugging and performance tuning.
>  
> +sched_util_clamp_min:
> +=
> +
> +Max allowed *minimum* utilization.
> +
> +Default value is SCHED_CAPACITY_SCALE (1024), which is the maximum possible
^^^

Mmm... I feel one of the two is an implementation detail which should
probably not be exposed?

The user perhaps needs to know the value (1024) but we don't need to
expose the internal representation.


> +value.
> +
> +It means that any requested uclamp.min value cannot be greater than
> +sched_util_clamp_min, i.e., it is restricted to the range
> +[0:sched_util_clamp_min].
> +
> +sched_util_clamp_max:
> +=
> +
> +Max allowed *maximum* utilization.
> +
> +Default value is SCHED_CAPACITY_SCALE (1024), which is the maximum possible
> +value.
> +
> +It means that any requested uclamp.max value cannot be greater than
> +sched_util_clamp_max, i.e., it is restricted to the range
> +[0:sched_util_clamp_max].
> +
> +sched_util_clamp_min_rt_default:
> +
> +
> +By default Linux is tuned for performance. Which means that RT tasks always 
> run
> +at the highest frequency and most capable (highest capacity) CPU (in
> +heterogeneous systems).
> +
> +Uclamp achieves this by setting the requested uclamp.min of all RT tasks to
> +SCHED_CAPACITY_SCALE (1024) by default, which effectively boosts the tasks to
> +run at the highest frequency and biases them to run on the biggest CPU.
> +
> +This knob allows admins to change the default behavior when uclamp is being
> +used. In battery powered devices particularly, running at the maximum
> +capacity and frequency will increase energy consumption and shorten the 
> battery
> +life.
> +
> +This knob is only effective for RT tasks which the user hasn't modified their
> +requested uclamp.min value via sched_setattr() syscall.
> +
> +This knob will not escape the constraint imposed by sched_util_clamp_min
> +defined above.

Perhaps it's worth to specify that this value is going to be clamped by
the values above? Otherwise it's a bit ambiguous to know what happen
when it's bigger than schedu_util_clamp_min.

> +Any modification is applied lazily on the next opportunity the scheduler 
> needs
> +to calculate the effective value of uclamp.min of the task.
^

This is also an implementation detail, I would remove it.

>  
>  seccomp
>  ===


Best,
Patrick



Re: [PATCH v4 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-05-03 Thread Patrick Bellasi
 
>   /* System default restrictions always apply */
>   if (unlikely(uc_req.value > uc_max.value))
> @@ -1114,12 +1163,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
> *table, int write,
>   loff_t *ppos)
>  {
>   bool update_root_tg = false;
> - int old_min, old_max;
> + int old_min, old_max, old_min_rt;
>   int result;
>  
>   mutex_lock(&uclamp_mutex);
>   old_min = sysctl_sched_uclamp_util_min;
>   old_max = sysctl_sched_uclamp_util_max;
> + old_min_rt = sysctl_sched_uclamp_util_min_rt_default;
>  
>   result = proc_dointvec(table, write, buffer, lenp, ppos);
>   if (result)
> @@ -1133,6 +1183,18 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
> *table, int write,
>   goto undo;
>   }
>  
> + /*
> +  * The new value will be applied to RT tasks the next time the
> +  * scheduler needs to calculate the effective uclamp.min for that task,
> +  * assuming the task is using the system default and not a user
> +  * specified value. In the latter we shall leave the value as the user
> +  * requested.

IMO it does not make sense to explain here what you will do with this
value. This will make it even more complicated to maintain the comment
above if the code using it changes in the future.

So, if the code where we use the knob is not clear enough, maybe we can
move this comment to the description of:
   uclamp_sync_util_min_rt_default()
or to be part of the documentation of:
  sysctl_sched_uclamp_util_min_rt_default

By doing that you can also just add this if condition with the previous ones.

> +  */
> + if (sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
> + result = -EINVAL;
> + goto undo;
> + }
> +
>   if (old_min != sysctl_sched_uclamp_util_min) {
>   uclamp_se_set(&uclamp_default[UCLAMP_MIN],
> sysctl_sched_uclamp_util_min, false);
> @@ -1158,6 +1220,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
> *table, int write,
>  undo:
>   sysctl_sched_uclamp_util_min = old_min;
>   sysctl_sched_uclamp_util_max = old_max;
> + sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
>  done:
>   mutex_unlock(&uclamp_mutex);
>  
> @@ -1200,9 +1263,13 @@ static void __setscheduler_uclamp(struct task_struct 
> *p,
>   if (uc_se->user_defined)
>   continue;
>  
> - /* By default, RT tasks always get 100% boost */
> + /*
> +  * By default, RT tasks always get 100% boost, which the admins
> +  * are allowed to change via
> +  * sysctl_sched_uclamp_util_min_rt_default knob.
> +  */
>   if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> - clamp_value = uclamp_none(UCLAMP_MAX);
> + clamp_value = sysctl_sched_uclamp_util_min_rt_default;

Mmm... I suspect we don't need this anymore.

If the task has a user_defined value, we skip this anyway.
If the task has not a user_defined value, we will do set this anyway at
each enqueue time.

No?

>  
>   uclamp_se_set(uc_se, clamp_value, false);
>   }
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8a176d8727a3..64117363c502 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -453,6 +453,13 @@ static struct ctl_table kern_table[] = {
>   .mode   = 0644,
>   .proc_handler   = sysctl_sched_uclamp_handler,
>   },
> + {
> + .procname   = "sched_util_clamp_min_rt_default",
> + .data   = &sysctl_sched_uclamp_util_min_rt_default,
> + .maxlen = sizeof(unsigned int),
> + .mode   = 0644,
> + .proc_handler   = sysctl_sched_uclamp_handler,
> + },
>  #endif
>  #ifdef CONFIG_SCHED_AUTOGROUP
>   {

Best,
Patrick



Re: [PATCH v7] ARM: DTS: Aspeed: Add YADRO Nicole BMC

2020-05-01 Thread Patrick Williams
On Wed, Apr 29, 2020 at 02:37:11PM +0300, Alexander Filippov wrote:
> Nicole is an OpenPower machine with an Aspeed 2500 BMC SoC manufactured
> by YADRO.
> 
> Signed-off-by: Alexander Filippov 
> ---
>  arch/arm/boot/dts/Makefile  |   1 +
>  arch/arm/boot/dts/aspeed-bmc-opp-nicole.dts | 326 
>  2 files changed, 327 insertions(+)
>  create mode 100644 arch/arm/boot/dts/aspeed-bmc-opp-nicole.dts
> 

Reviewed-by: Patrick Williams 

-- 
Patrick Williams


signature.asc
Description: PGP signature


Re: [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases

2019-10-21 Thread Patrick Bellasi
Hi Peter,

On 14-Oct 16:52, Peter Zijlstra wrote:
> 
> The energy aware schedutil patches reminded me this was still pending.
> 
> On Fri, Aug 02, 2019 at 10:47:25AM +0100, Patrick Bellasi wrote:
> > Hi Peter, Vincent,
> > is there anything different I can do on this?
> 
> I think both Vincent and me are basically fine with the patch, it was
> the Changelog/explanation for it that sat uneasy.
> 
> Specifically I think the 'confusion' around the PELT invariance stuff
> doesn't help.
> 
> I think that if you present it simply as making util_est directly follow
> upward motion and only decay on downward -- and the rationale for it --
> then it should be fine.

Ok, I'll update the commit message to remove the PELT related
ambiguity and post a new version soon.

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH 2/2] firmware: coreboot: Export active CBFS partition

2019-10-10 Thread patrick . rudolph
On Wed, 2019-10-09 at 14:19 -0700, Julius Werner wrote:
> > Somehow we've gotten /sys/firmware/log to be the coreboot log, and
> > quite
> > frankly that blows my mind that this path was accepted upstream.
> > Userspace has to know it's running on coreboot firmware to know
> > that
> > /sys/firmware/log is actually the coreboot log.
> 
> Not really sure I understand your concern here? That's the generic
> node for the log from the mainboard firmware, whatever it is. It was
> originally added for non-coreboot firmware and that use is still
> supported. If some other non-coreboot firmware wants to join in, it's
> welcome to do so -- the interface is separated out enough to make it
> easy to add more backends.
> 
> I do agree that if we want to add other, more coreboot-specific
> nodes,
> they should be explicitly namespaced.
> 

I'll add a new namespace called 'coreboot'.

> > But I also wonder why this is being exposed by the kernel at all?
> 

It's difficult for userspace tools to find out how to access the flash
and then to find the FMAP, which resides somewhere in it on a 4KiB
boundary. The "boot media params" usually give the offset to the FMAP
from the beginning of the flash, but tell nothing about how to access it.

> I share Stephen's concern that I'm not sure this belongs in the
> kernel
> at all. There are existing ways for userspace to access this
> information like the cbmem utility does... if you want it accessible
> from fwupd, it could chain-call into cbmem or we could factor that
> functionality out into a library. If you want to get away from using
> /dev/mem for this, we could maybe add a driver that exports CBMEM or
> coreboot table areas via sysfs... but then I think that should be a
> generic driver which makes them all accessible in one go, rather than
> having to add yet another driver whenever someone needs to parse
> another coreboot table blob for some reason. We could design an
> interface like /sys/firmware/coreboot/table/ where every entry
> in
> the table gets exported as a binary file.

I don't even consider using binaries that operate on /dev/mem.
In my opinion CBMEM is something coreboot-internal that the OS or
userspace shouldn't even know about.

> I think a specific sysfs driver only makes sense for things that are
> human readable and that you'd actually expect a human to want to go
> read directly, like the log. Maybe exporting FMAP entries one by one
> like Stephen suggests could be such a case, but I doubt that there's
> a
> common enough need for that since there are plenty of existing ways
> to
> show FMAP entries from userspace (and if there was a generic
> interface
> like /sys/firmware/coreboot/table/37 to access it, we could just add
> a
> new flag to the dump_fmap utility to read it from there).

I'll expose the coreboot tables using a sysfs driver, which then can be
used by coreboot tools instead of accessing /dev/mem. As it holds the
FMAP and "boot media params" that's all I need for now.

The downside is that the userspace tools need to be kept in sync with
the binary interface the coreboot tables are providing.



[PATCH 1/2] firmware: coreboot: Export the binary FMAP

2019-10-08 Thread patrick . rudolph
From: Patrick Rudolph 

Expose coreboot's binary FMAP[1] to /sys/firmware/fmap.

coreboot copies the FMAP to a CBMEM buffer at boot since CB:35377[2],
allowing an architecture-independent way of exposing the FMAP to userspace.

Will be used by fwupd[3] to determine the current firmware layout.

[1]: https://doc.coreboot.org/lib/flashmap.html
[2]: https://review.coreboot.org/c/coreboot/+/35377
[3]: https://fwupd.org/

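A hypothetical userspace sketch of consuming this file; the header layout
follows the flashmap spec in [1], and the struct below is illustrative
rather than the kernel's own definition:

	#include <stdio.h>
	#include <stdint.h>

	struct fmap_header {
		char	 signature[8];	/* "__FMAP__" */
		uint8_t  ver_major, ver_minor;
		uint64_t base;		/* base address of the flash device */
		uint32_t size;		/* size of the flash device */
		char	 name[32];
		uint16_t nareas;	/* number of areas that follow */
	} __attribute__((packed));

	int main(void)
	{
		struct fmap_header h;
		FILE *f = fopen("/sys/firmware/fmap", "rb");

		if (!f || fread(&h, sizeof(h), 1, f) != 1)
			return 1;
		printf("%.32s: %u areas, flash size 0x%x\n",
		       h.name, (unsigned)h.nareas, (unsigned)h.size);
		fclose(f);
		return 0;
	}
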
Signed-off-by: Patrick Rudolph 
---
 drivers/firmware/google/Kconfig   |   8 ++
 drivers/firmware/google/Makefile  |   1 +
 drivers/firmware/google/fmap-coreboot.c   | 156 ++
 drivers/firmware/google/fmap-coreboot.h   |  13 ++
 drivers/firmware/google/fmap_serialized.h |  59 
 5 files changed, 237 insertions(+)
 create mode 100644 drivers/firmware/google/fmap-coreboot.c
 create mode 100644 drivers/firmware/google/fmap-coreboot.h
 create mode 100644 drivers/firmware/google/fmap_serialized.h

diff --git a/drivers/firmware/google/Kconfig b/drivers/firmware/google/Kconfig
index a3a6ca659ffa..5fbbd7b8fef6 100644
--- a/drivers/firmware/google/Kconfig
+++ b/drivers/firmware/google/Kconfig
@@ -74,4 +74,12 @@ config GOOGLE_VPD
  This option enables the kernel to expose the content of Google VPD
  under /sys/firmware/vpd.
 
+config GOOGLE_FMAP
+   tristate "Coreboot FMAP access"
+   depends on GOOGLE_COREBOOT_TABLE
+   help
+ This option enables the kernel to search for a Google FMAP in
+ the coreboot table.  If found, this binary file is exported to userland
+ in the file /sys/firmware/fmap.
+
 endif # GOOGLE_FIRMWARE
diff --git a/drivers/firmware/google/Makefile b/drivers/firmware/google/Makefile
index d17caded5d88..6d31fe167700 100644
--- a/drivers/firmware/google/Makefile
+++ b/drivers/firmware/google/Makefile
obj-$(CONFIG_GOOGLE_FRAMEBUFFER_COREBOOT)  += framebuffer-coreboot.o
 obj-$(CONFIG_GOOGLE_MEMCONSOLE)+= memconsole.o
 obj-$(CONFIG_GOOGLE_MEMCONSOLE_COREBOOT)   += memconsole-coreboot.o
 obj-$(CONFIG_GOOGLE_MEMCONSOLE_X86_LEGACY) += memconsole-x86-legacy.o
+obj-$(CONFIG_GOOGLE_FMAP)  += fmap-coreboot.o
 
 vpd-sysfs-y := vpd.o vpd_decode.o
 obj-$(CONFIG_GOOGLE_VPD)   += vpd-sysfs.o
diff --git a/drivers/firmware/google/fmap-coreboot.c b/drivers/firmware/google/fmap-coreboot.c
new file mode 100644
index ..14050030ebc6
--- /dev/null
+++ b/drivers/firmware/google/fmap-coreboot.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * fmap-coreboot.c
+ *
+ * Exports the binary FMAP through coreboot table.
+ *
+ * Copyright 2012-2013 David Herrmann 
+ * Copyright 2017 Google Inc.
+ * Copyright 2019 9elements Agency GmbH 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "coreboot_table.h"
+#include "fmap_serialized.h"
+
+#define CB_TAG_FMAP 0x37
+
+static void *fmap;
+static u32 fmap_size;
+
+/*
+ * Convert FMAP region to name.
+ * The caller has to free the string.
+ * Return NULL if no containing region was found.
+ */
+const char *coreboot_fmap_region_to_name(const u32 start, const u32 size)
+{
+   const char *name = NULL;
+   struct fmap *iter;
+   u32 size_old = ~0;
+   int i;
+
+   iter = fmap;
+   /* Find smallest containing region */
+   for (i = 0; i < iter->nareas && fmap; i++) {
+   if (iter->areas[i].offset <= start &&
+   iter->areas[i].size >= size &&
+   iter->areas[i].size <= size_old) {
+   size_old = iter->areas[i].size;
+   name = iter->areas[i].name;
+   }
+   }
+
+   if (name)
+   return kstrdup(name, GFP_KERNEL);
+   return NULL;
+}
+EXPORT_SYMBOL(coreboot_fmap_region_to_name);
+
+static ssize_t fmap_read(struct file *filp, struct kobject *kobp,
+struct bin_attribute *bin_attr, char *buf,
+loff_t pos, size_t count)
+{
+   if (!fmap)
+   return -ENODEV;
+
+   return memory_read_from_buffer(buf, count, &pos, fmap, fmap_size);
+}
+
+static int fmap_probe(struct coreboot_device *dev)
+{
+   struct lb_cbmem_ref *cbmem_ref = &dev->cbmem_ref;
+   struct fmap *header;
+
+   if (!cbmem_ref)
+   return -ENODEV;
+
+   header = memremap(cbmem_ref->cbmem_addr, sizeof(*header), MEMREMAP_WB);
+   if (!header) {
+   pr_warn("coreboot: Failed to remap FMAP\n");
+   return -ENOMEM;
+   }
+
+   /* Validate FMAP signature */
+   if (memcmp(header->signature, FMAP_SIGNATURE,
+  sizeof(header->signature))) {
+   pr_warn("coreboot: FMAP signature mismatch\n");
+   memunmap(header);
+   return -ENODEV;
+   }
+
+   /* Validate FMAP version *

[PATCH 2/2] firmware: coreboot: Export active CBFS partition

2019-10-08 Thread patrick . rudolph
From: Patrick Rudolph 

Expose the name of the active CBFS partition under
/sys/firmware/cbfs_active_partition

In case of Google's VBOOT[1] that currently can be one of:
"FW_MAIN_A", "FW_MAIN_B" or "COREBOOT"

Will be used by fwupd[2] to determine the active partition in the
VBOOT A/B partition update scheme.

[1]: https://doc.coreboot.org/security/vboot/index.html
[2]: https://fwupd.org/

Signed-off-by: Patrick Rudolph 
---
 drivers/firmware/google/Kconfig  |  10 ++
 drivers/firmware/google/Makefile |   1 +
 drivers/firmware/google/bootmedia-coreboot.c | 115 +++
 drivers/firmware/google/coreboot_table.h |  13 +++
 4 files changed, 139 insertions(+)
 create mode 100644 drivers/firmware/google/bootmedia-coreboot.c

diff --git a/drivers/firmware/google/Kconfig b/drivers/firmware/google/Kconfig
index 5fbbd7b8fef6..f58b723d77de 100644
--- a/drivers/firmware/google/Kconfig
+++ b/drivers/firmware/google/Kconfig
@@ -82,4 +82,14 @@ config GOOGLE_FMAP
  the coreboot table.  If found, this binary file is exported to userland
  in the file /sys/firmware/fmap.
 
+config GOOGLE_BOOTMEDIA
+   tristate "Coreboot bootmedia params access"
+   depends on GOOGLE_COREBOOT_TABLE
+   depends on GOOGLE_FMAP
+   help
+ This option enables the kernel to search for a boot media params
+ table, providing the active CBFS partition.  If found, the name of
+ the partition is exported to userland in the file
+ /sys/firmware/cbfs_active_partition.
+
 endif # GOOGLE_FIRMWARE
diff --git a/drivers/firmware/google/Makefile b/drivers/firmware/google/Makefile
index 6d31fe167700..e47c59376566 100644
--- a/drivers/firmware/google/Makefile
+++ b/drivers/firmware/google/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_GOOGLE_MEMCONSOLE)+= memconsole.o
 obj-$(CONFIG_GOOGLE_MEMCONSOLE_COREBOOT)   += memconsole-coreboot.o
 obj-$(CONFIG_GOOGLE_MEMCONSOLE_X86_LEGACY) += memconsole-x86-legacy.o
 obj-$(CONFIG_GOOGLE_FMAP)  += fmap-coreboot.o
+obj-$(CONFIG_GOOGLE_BOOTMEDIA) += bootmedia-coreboot.o
 
 vpd-sysfs-y := vpd.o vpd_decode.o
 obj-$(CONFIG_GOOGLE_VPD)   += vpd-sysfs.o
diff --git a/drivers/firmware/google/bootmedia-coreboot.c b/drivers/firmware/google/bootmedia-coreboot.c
new file mode 100644
index ..09c0caedaf05
--- /dev/null
+++ b/drivers/firmware/google/bootmedia-coreboot.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * bootmedia-coreboot.c
+ *
+ * Exports the active VBOOT partition name through boot media params.
+ *
+ * Copyright 2012-2013 David Herrmann 
+ * Copyright 2017 Google Inc.
+ * Copyright 2019 Patrick Rudolph 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "coreboot_table.h"
+#include "fmap-coreboot.h"
+
+#define CB_TAG_BOOT_MEDIA_PARAMS 0x30
+
+/* The current CBFS partition */
+static char *name;
+
+static ssize_t cbfs_active_partition_read(struct file *filp,
+ struct kobject *kobp,
+ struct bin_attribute *bin_attr,
+ char *buf,
+ loff_t pos, size_t count)
+{
+   if (!name)
+   return -ENODEV;
+
+   return memory_read_from_buffer(buf, count, &pos, name, strlen(name));
+}
+
+static int bmp_probe(struct coreboot_device *dev)
+{
+   struct lb_boot_media_params *b = &dev->bmp;
+   const char *tmp;
+
+   /* Sanity checks on the data we got */
+   if ((b->cbfs_offset == ~0) ||
+   b->cbfs_size == 0 ||
+   ((b->cbfs_offset + b->cbfs_size) > b->boot_media_size)) {
+   pr_warn("coreboot: Boot media params contains invalid data\n");
+   return -ENODEV;
+   }
+
+   tmp = coreboot_fmap_region_to_name(b->cbfs_offset, b->cbfs_size);
+   if (!tmp) {
+   pr_warn("coreboot: Active CBFS region not found in FMAP\n");
+   return -ENODEV;
+   }
+
+   name = devm_kmalloc(&dev->dev, strlen(tmp) + 2, GFP_KERNEL);
+   if (!name) {
+   kfree(tmp);
+   return -ENODEV;
+   }
+   snprintf(name, strlen(tmp) + 2, "%s\n", tmp);
+
+   kfree(tmp);
+
+   return 0;
+}
+
+static int bmp_remove(struct coreboot_device *dev)
+{
+   struct platform_device *pdev = dev_get_drvdata(&dev->dev);
+
+   platform_device_unregister(pdev);
+
+   return 0;
+}
+
+static struct coreboot_driver bmp_driver = {
+   .probe = bmp_probe,
+   .remove = bmp_remove,
+   .drv = {
+   .name = "bootmediaparams",
+   },
+   .tag = CB_TAG_BOOT_MEDIA_PARAMS,
+};
+
+static struct bin_attribute cbfs_partition_bin_attr = {
+   .attr = {.name = "cbfs_active_partition"

Re: [PATCH 1/2] i2c: pxa: migrate to new i2c_slave APIs

2019-10-01 Thread Patrick Williams
Thanks for the review Andy.

On Tue, Oct 01, 2019 at 07:29:13PM +0300, Andy Shevchenko wrote:
> 
> 
> On Tue, Oct 01, 2019 at 10:59:59AM -0500, Patrick Williams wrote:
> There are quite a few people in the Cc list. I'm not sure they all are
> interested in this. I deliberately dropped few names, sorry, if I was 
> mistaken.

Agree it was kind of a big list.  Just chose what was given to me by
get_maintainer.pl.  It seems like there isn't a direct identified
maintainer of this file.

> 
> > +   if (isr & ISR_RWM) {
> > +   u8 byte = 0;
> > +
> > +   i2c_slave_event(i2c->slave, I2C_SLAVE_READ_REQUESTED,
> > +   );
> > +   writel(byte, _IDBR(i2c));
> > +   } else {
> > +   i2c_slave_event(i2c->slave, I2C_SLAVE_WRITE_REQUESTED,
> > +   NULL);
> > +   }
> 
> Hmm... Perhaps
> 
>   u8 byte = 0;
> 
>   i2c_slave_event(i2c->slave, I2C_SLAVE_READ_REQUESTED, &byte);
>   if (isr & ISR_RWM)
>   writel(byte, _IDBR(i2c));
> 

The two different paths also require READ_REQUESTED vs WRITE_REQUESTED.  I
could do a ternary there but it seemed more obvious to just unroll the
logic.

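For completeness, the ternary variant under discussion would look roughly
like this (sketch only):

	u8 byte = 0;

	i2c_slave_event(i2c->slave,
			(isr & ISR_RWM) ? I2C_SLAVE_READ_REQUESTED
					: I2C_SLAVE_WRITE_REQUESTED,
			(isr & ISR_RWM) ? &byte : NULL);
	if (isr & ISR_RWM)
		writel(byte, _IDBR(i2c));
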
-- 
- Patrick


[PATCH 2/2] i2c: slave-eeprom: support additional models

2019-10-01 Thread Patrick Williams
Add support for emulating the following EEPROMs:
* 24c01  - 1024 bit
* 24c128 - 128k bit
* 24c256 - 256k bit
* 24c512 - 512k bit

The flag bits in the device id were shifted up 1 bit to make
room for saving the 24c512's size.  24c512 uses the full 16-bit
address space of a 2-byte addressable EEPROM.
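
A quick check of the arithmetic behind that shift (illustrative only):

	/* 24c512: 512 Kibit = 524288 / 8 = 65536 bytes = BIT(16).
	 * GENMASK(15, 0) tops out at 65535, so the length field must
	 * grow to GENMASK(16, 0) and both flag bits move up by one. */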

Signed-off-by: Patrick Williams 
---
 drivers/i2c/i2c-slave-eeprom.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/i2c/i2c-slave-eeprom.c b/drivers/i2c/i2c-slave-eeprom.c
index efee56106251..65419441313b 100644
--- a/drivers/i2c/i2c-slave-eeprom.c
+++ b/drivers/i2c/i2c-slave-eeprom.c
@@ -37,9 +37,9 @@ struct eeprom_data {
u8 buffer[];
 };
 
-#define I2C_SLAVE_BYTELEN GENMASK(15, 0)
-#define I2C_SLAVE_FLAG_ADDR16 BIT(16)
-#define I2C_SLAVE_FLAG_RO BIT(17)
+#define I2C_SLAVE_BYTELEN GENMASK(16, 0)
+#define I2C_SLAVE_FLAG_ADDR16 BIT(17)
+#define I2C_SLAVE_FLAG_RO BIT(18)
 #define I2C_SLAVE_DEVICE_MAGIC(_len, _flags) ((_flags) | (_len))
 
 static int i2c_slave_eeprom_slave_cb(struct i2c_client *client,
@@ -171,12 +171,20 @@ static int i2c_slave_eeprom_remove(struct i2c_client 
*client)
 }
 
 static const struct i2c_device_id i2c_slave_eeprom_id[] = {
+   { "slave-24c01", I2C_SLAVE_DEVICE_MAGIC(1024 / 8,  0) },
+   { "slave-24c01ro", I2C_SLAVE_DEVICE_MAGIC(1024 / 8,  I2C_SLAVE_FLAG_RO) 
},
{ "slave-24c02", I2C_SLAVE_DEVICE_MAGIC(2048 / 8,  0) },
{ "slave-24c02ro", I2C_SLAVE_DEVICE_MAGIC(2048 / 8,  I2C_SLAVE_FLAG_RO) 
},
{ "slave-24c32", I2C_SLAVE_DEVICE_MAGIC(32768 / 8, 
I2C_SLAVE_FLAG_ADDR16) },
{ "slave-24c32ro", I2C_SLAVE_DEVICE_MAGIC(32768 / 8, 
I2C_SLAVE_FLAG_ADDR16 | I2C_SLAVE_FLAG_RO) },
{ "slave-24c64", I2C_SLAVE_DEVICE_MAGIC(65536 / 8, 
I2C_SLAVE_FLAG_ADDR16) },
{ "slave-24c64ro", I2C_SLAVE_DEVICE_MAGIC(65536 / 8, 
I2C_SLAVE_FLAG_ADDR16 | I2C_SLAVE_FLAG_RO) },
+   { "slave-24c128", I2C_SLAVE_DEVICE_MAGIC(131072 / 8, 
I2C_SLAVE_FLAG_ADDR16) },
+   { "slave-24c128ro", I2C_SLAVE_DEVICE_MAGIC(131072 / 8, 
I2C_SLAVE_FLAG_ADDR16 | I2C_SLAVE_FLAG_RO) },
+   { "slave-24c256", I2C_SLAVE_DEVICE_MAGIC(262144 / 8, 
I2C_SLAVE_FLAG_ADDR16) },
+   { "slave-24c256ro", I2C_SLAVE_DEVICE_MAGIC(262144 / 8, 
I2C_SLAVE_FLAG_ADDR16 | I2C_SLAVE_FLAG_RO) },
+   { "slave-24c512", I2C_SLAVE_DEVICE_MAGIC(524288 / 8, 
I2C_SLAVE_FLAG_ADDR16) },
+   { "slave-24c512ro", I2C_SLAVE_DEVICE_MAGIC(524288 / 8, 
I2C_SLAVE_FLAG_ADDR16 | I2C_SLAVE_FLAG_RO) },
{ }
 };
 MODULE_DEVICE_TABLE(i2c, i2c_slave_eeprom_id);
-- 
2.17.2 (Apple Git-113)



[PATCH 1/2] i2c: slave-eeprom: initialize empty eeprom properly

2019-10-01 Thread Patrick Williams
The i2c-slave-eeprom driver emulates at24 class eeprom devices,
which come initialized with all 1s.  Do the same in the software
emulation.

Signed-off-by: Patrick Williams 
---
 drivers/i2c/i2c-slave-eeprom.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/i2c/i2c-slave-eeprom.c b/drivers/i2c/i2c-slave-eeprom.c
index db9763cb4dae..efee56106251 100644
--- a/drivers/i2c/i2c-slave-eeprom.c
+++ b/drivers/i2c/i2c-slave-eeprom.c
@@ -131,6 +131,8 @@ static int i2c_slave_eeprom_probe(struct i2c_client 
*client, const struct i2c_de
if (!eeprom)
return -ENOMEM;
 
+   memset(eeprom->buffer, 0xFF, size);
+
eeprom->idx_write_cnt = 0;
eeprom->num_address_bytes = flag_addr16 ? 2 : 1;
eeprom->address_mask = size - 1;
-- 
2.17.2 (Apple Git-113)



[PATCH 2/2] i2c: pxa: remove unused i2c-slave APIs

2019-10-01 Thread Patrick Williams
With the i2c-pxa driver migrated to the standard i2c-slave
APIs, the custom APIs and structures are no longer needed
or used.  Remove them.

Signed-off-by: Patrick Williams 
---
 drivers/i2c/busses/i2c-pxa.c  |  1 -
 include/linux/i2c-pxa.h   | 18 --
 include/linux/platform_data/i2c-pxa.h |  4 
 3 files changed, 23 deletions(-)
 delete mode 100644 include/linux/i2c-pxa.h

diff --git a/drivers/i2c/busses/i2c-pxa.c b/drivers/i2c/busses/i2c-pxa.c
index c811646e809f..466e4f681d7a 100644
--- a/drivers/i2c/busses/i2c-pxa.c
+++ b/drivers/i2c/busses/i2c-pxa.c
@@ -25,7 +25,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/include/linux/i2c-pxa.h b/include/linux/i2c-pxa.h
deleted file mode 100644
index a897e2b507b6..
--- a/include/linux/i2c-pxa.h
+++ /dev/null
@@ -1,18 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_I2C_ALGO_PXA_H
-#define _LINUX_I2C_ALGO_PXA_H
-
-typedef enum i2c_slave_event_e {
-   I2C_SLAVE_EVENT_START_READ,
-   I2C_SLAVE_EVENT_START_WRITE,
-   I2C_SLAVE_EVENT_STOP
-} i2c_slave_event_t;
-
-struct i2c_slave_client {
-   void *data;
-   void (*event)(void *ptr, i2c_slave_event_t event);
-   int  (*read) (void *ptr);
-   void (*write)(void *ptr, unsigned int val);
-};
-
-#endif /* _LINUX_I2C_ALGO_PXA_H */
diff --git a/include/linux/platform_data/i2c-pxa.h b/include/linux/platform_data/i2c-pxa.h
index cb290092599c..6a9b28399b39 100644
--- a/include/linux/platform_data/i2c-pxa.h
+++ b/include/linux/platform_data/i2c-pxa.h
@@ -55,11 +55,7 @@
  */
 #define I2C_ISR_INIT   0x7FF  /* status register init */
 
-struct i2c_slave_client;
-
 struct i2c_pxa_platform_data {
-   unsigned intslave_addr;
-   struct i2c_slave_client *slave;
unsigned intclass;
unsigned intuse_pio :1;
unsigned intfast_mode :1;
-- 
2.17.2 (Apple Git-113)



[PATCH 1/2] i2c: pxa: migrate to new i2c_slave APIs

2019-10-01 Thread Patrick Williams
The i2c subsystem was enhanced circa 2015 to support operating as
an i2c-slave device.  Prior to that, the i2c-pxa driver supported
an i2c-slave but had its own APIs.  There are no existing in-kernel
drivers or platforms that utilize the i2c-pxa APIs.

Migrate the i2c-pxa driver to the general i2c-slave APIs so that
existing drivers, such as the i2c-slave-eeprom, can be used.

This has been tested with a Marvell EspressoBin, using i2c-pxa and
i2c-slave-eeprom, acting as a slave, and a Raspberry Pi 3, using the
at24 driver, acting as a master.

Signed-off-by: Patrick Williams 
---
 drivers/i2c/busses/Kconfig   |  1 +
 drivers/i2c/busses/i2c-pxa.c | 74 +---
 2 files changed, 61 insertions(+), 14 deletions(-)

diff --git a/drivers/i2c/busses/Kconfig b/drivers/i2c/busses/Kconfig
index 146ce40d8e0a..d0c79ac9ffdb 100644
--- a/drivers/i2c/busses/Kconfig
+++ b/drivers/i2c/busses/Kconfig
@@ -874,6 +874,7 @@ config I2C_PXA_PCI
 config I2C_PXA_SLAVE
bool "Intel PXA2XX I2C Slave comms support"
depends on I2C_PXA && !X86_32
+   select I2C_SLAVE
help
  Support I2C slave mode communications on the PXA I2C bus.  This
  is necessary for systems where the PXA may be a target on the
diff --git a/drivers/i2c/busses/i2c-pxa.c b/drivers/i2c/busses/i2c-pxa.c
index 2c3c3d6935c0..c811646e809f 100644
--- a/drivers/i2c/busses/i2c-pxa.c
+++ b/drivers/i2c/busses/i2c-pxa.c
@@ -180,7 +180,7 @@ struct pxa_i2c {
struct i2c_adapter  adap;
struct clk  *clk;
 #ifdef CONFIG_I2C_PXA_SLAVE
-   struct i2c_slave_client *slave;
+   struct i2c_client   *slave;
 #endif
 
unsigned intirqlogidx;
@@ -544,22 +544,23 @@ static void i2c_pxa_slave_txempty(struct pxa_i2c *i2c, 
u32 isr)
if (isr & ISR_BED) {
/* what should we do here? */
} else {
-   int ret = 0;
+   u8 byte = 0;
 
if (i2c->slave != NULL)
-   ret = i2c->slave->read(i2c->slave->data);
+   i2c_slave_event(i2c->slave, I2C_SLAVE_READ_PROCESSED,
+   &byte);
 
-   writel(ret, _IDBR(i2c));
+   writel(byte, _IDBR(i2c));
	writel(readl(_ICR(i2c)) | ICR_TB, _ICR(i2c));   /* allow next byte */
}
 }
 
 static void i2c_pxa_slave_rxfull(struct pxa_i2c *i2c, u32 isr)
 {
-   unsigned int byte = readl(_IDBR(i2c));
+   u8 byte = readl(_IDBR(i2c));
 
if (i2c->slave != NULL)
-   i2c->slave->write(i2c->slave->data, byte);
+   i2c_slave_event(i2c->slave, I2C_SLAVE_WRITE_RECEIVED, &byte);
 
writel(readl(_ICR(i2c)) | ICR_TB, _ICR(i2c));
 }
@@ -572,9 +573,18 @@ static void i2c_pxa_slave_start(struct pxa_i2c *i2c, u32 
isr)
dev_dbg(>adap.dev, "SAD, mode is slave-%cx\n",
   (isr & ISR_RWM) ? 'r' : 't');
 
-   if (i2c->slave != NULL)
-   i2c->slave->event(i2c->slave->data,
-(isr & ISR_RWM) ? I2C_SLAVE_EVENT_START_READ : I2C_SLAVE_EVENT_START_WRITE);
+   if (i2c->slave != NULL) {
+   if (isr & ISR_RWM) {
+   u8 byte = 0;
+
+   i2c_slave_event(i2c->slave, I2C_SLAVE_READ_REQUESTED,
+   &byte);
+   writel(byte, _IDBR(i2c));
+   } else {
+   i2c_slave_event(i2c->slave, I2C_SLAVE_WRITE_REQUESTED,
+   NULL);
+   }
+   }
 
/*
 * slave could interrupt in the middle of us generating a
@@ -607,7 +617,7 @@ static void i2c_pxa_slave_stop(struct pxa_i2c *i2c)
dev_dbg(>adap.dev, "ISR: SSD (Slave Stop)\n");
 
if (i2c->slave != NULL)
-   i2c->slave->event(i2c->slave->data, I2C_SLAVE_EVENT_STOP);
+   i2c_slave_event(i2c->slave, I2C_SLAVE_STOP, NULL);
 
if (i2c_debug > 2)
dev_dbg(>adap.dev, "ISR: SSD (Slave Stop) acked\n");
@@ -619,6 +629,38 @@ static void i2c_pxa_slave_stop(struct pxa_i2c *i2c)
if (i2c->msg)
i2c_pxa_master_complete(i2c, I2C_RETRY);
 }
+
+static int i2c_pxa_slave_reg(struct i2c_client *slave)
+{
+   struct pxa_i2c *i2c = slave->adapter->algo_data;
+
+   if (i2c->slave)
+   return -EBUSY;
+
+   if (!i2c->reg_isar)
+   return -EAFNOSUPPORT;
+
+   i2c->slave = slave;
+   i2c->slave_addr = slave->addr;
+
+   writel(i2c->slave_addr, _ISAR(i2c));
+
+   return 0;
+}
+
+static int i2c_pxa_slave_unreg(struct i2c_client *slave)
+{
+   struct pxa_i2c *i2c = slave->adapter->algo_data;
+
+   WARN_ON(!i2c->slave);
+
+   i2c->slave_a

[PATCH 0/2] i2c: pxa: migrate to i2c-core-slave APIs

2019-10-01 Thread Patrick Williams
i2c-pxa has its own i2c slave APIs rather than using the i2c-core
APIs.  There are no in-tree drivers currently using this custom
slave API set.

Migrate the i2c-pxa driver from its own i2c slave APIs to the
i2c-core slave APIs so that in-tree slave devices can be used with
the i2c-pxa hardware (ex. i2c-slave-eeprom).

Patrick Williams (2):
  i2c: pxa: migrate to new i2c_slave APIs
  i2c: pxa: remove unused i2c-slave APIs

 drivers/i2c/busses/Kconfig|  1 +
 drivers/i2c/busses/i2c-pxa.c  | 75 +--
 include/linux/i2c-pxa.h   | 18 ---
 include/linux/platform_data/i2c-pxa.h |  4 --
 4 files changed, 61 insertions(+), 37 deletions(-)
 delete mode 100644 include/linux/i2c-pxa.h

-- 
2.17.2 (Apple Git-113)



[PATCH] pinctrl: armada-37xx: swap polarity on LED group

2019-10-01 Thread Patrick Williams
The configuration registers for the LED group have inverted
polarity, which puts the GPIO into open-drain state when used in
GPIO mode.  Switch to '0' for GPIO and '1' for LED modes.

Fixes: 87466ccd9401 ("pinctrl: armada-37xx: Add pin controller support for 
Armada 37xx")
Signed-off-by: Patrick Williams 
Cc: 
---
 drivers/pinctrl/mvebu/pinctrl-armada-37xx.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c b/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c
index 6462d3ca7ceb..6310963ce5f0 100644
--- a/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c
+++ b/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c
@@ -183,10 +183,10 @@ static struct armada_37xx_pin_group 
armada_37xx_nb_groups[] = {
PIN_GRP_EXTRA("uart2", 9, 2, BIT(1) | BIT(13) | BIT(14) | BIT(19),
  BIT(1) | BIT(13) | BIT(14), BIT(1) | BIT(19),
  18, 2, "gpio", "uart"),
-   PIN_GRP_GPIO("led0_od", 11, 1, BIT(20), "led"),
-   PIN_GRP_GPIO("led1_od", 12, 1, BIT(21), "led"),
-   PIN_GRP_GPIO("led2_od", 13, 1, BIT(22), "led"),
-   PIN_GRP_GPIO("led3_od", 14, 1, BIT(23), "led"),
+   PIN_GRP_GPIO_2("led0_od", 11, 1, BIT(20), BIT(20), 0, "led"),
+   PIN_GRP_GPIO_2("led1_od", 12, 1, BIT(21), BIT(21), 0, "led"),
+   PIN_GRP_GPIO_2("led2_od", 13, 1, BIT(22), BIT(22), 0, "led"),
+   PIN_GRP_GPIO_2("led3_od", 14, 1, BIT(23), BIT(23), 0, "led"),
 
 };
 
-- 
2.17.2 (Apple Git-113)



[PATCH v2] pinctrl: armada-37xx: fix control of pins 32 and up

2019-10-01 Thread Patrick Williams
The 37xx configuration registers are only 32 bits long, so
pins 32-35 spill over into the next register.  The calculation
for the register address was done, but the bitmask was not, so
any configuration to pin 32 or above resulted in a bitmask that
overflowed and performed no action.

Fix the register / offset calculation to also adjust the offset.
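
A worked example with GPIO_PER_REG == 32, as defined in this driver:

	/* Pin 33, before the fix: reg was advanced to the second
	 * register, but mask = BIT(33) does not fit in the 32-bit
	 * register value, so the update was a no-op. */

	/* Pin 33, after the fix: */
	offset = 33 - 32;	/* 1 */
	mask = BIT(offset);	/* BIT(1) in the second register */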

Fixes: 5715092a458c ("pinctrl: armada-37xx: Add gpio support")
Signed-off-by: Patrick Williams 
Acked-by: Gregory CLEMENT 
Cc: 
---
 drivers/pinctrl/mvebu/pinctrl-armada-37xx.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c b/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c
index 6462d3ca7ceb..34c1fee52cbe 100644
--- a/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c
+++ b/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c
@@ -221,11 +221,11 @@ static const struct armada_37xx_pin_data 
armada_37xx_pin_sb = {
 };
 
 static inline void armada_37xx_update_reg(unsigned int *reg,
- unsigned int offset)
+ unsigned int *offset)
 {
/* We never have more than 2 registers */
-   if (offset >= GPIO_PER_REG) {
-   offset -= GPIO_PER_REG;
+   if (*offset >= GPIO_PER_REG) {
+   *offset -= GPIO_PER_REG;
*reg += sizeof(u32);
}
 }
@@ -376,7 +376,7 @@ static inline void armada_37xx_irq_update_reg(unsigned int 
*reg,
 {
int offset = irqd_to_hwirq(d);
 
-   armada_37xx_update_reg(reg, offset);
+   armada_37xx_update_reg(reg, &offset);
 }
 
 static int armada_37xx_gpio_direction_input(struct gpio_chip *chip,
@@ -386,7 +386,7 @@ static int armada_37xx_gpio_direction_input(struct 
gpio_chip *chip,
unsigned int reg = OUTPUT_EN;
unsigned int mask;
 
-   armada_37xx_update_reg(&reg, offset);
+   armada_37xx_update_reg(&reg, &offset);
mask = BIT(offset);
 
return regmap_update_bits(info->regmap, reg, mask, 0);
@@ -399,7 +399,7 @@ static int armada_37xx_gpio_get_direction(struct gpio_chip 
*chip,
unsigned int reg = OUTPUT_EN;
unsigned int val, mask;
 
-   armada_37xx_update_reg(&reg, offset);
+   armada_37xx_update_reg(&reg, &offset);
mask = BIT(offset);
	regmap_read(info->regmap, reg, &val);
 
@@ -413,7 +413,7 @@ static int armada_37xx_gpio_direction_output(struct 
gpio_chip *chip,
unsigned int reg = OUTPUT_EN;
unsigned int mask, val, ret;
 
-   armada_37xx_update_reg(&reg, offset);
+   armada_37xx_update_reg(&reg, &offset);
mask = BIT(offset);
 
ret = regmap_update_bits(info->regmap, reg, mask, mask);
@@ -434,7 +434,7 @@ static int armada_37xx_gpio_get(struct gpio_chip *chip, 
unsigned int offset)
unsigned int reg = INPUT_VAL;
unsigned int val, mask;
 
-   armada_37xx_update_reg(&reg, offset);
+   armada_37xx_update_reg(&reg, &offset);
mask = BIT(offset);
 
	regmap_read(info->regmap, reg, &val);
@@ -449,7 +449,7 @@ static void armada_37xx_gpio_set(struct gpio_chip *chip, 
unsigned int offset,
unsigned int reg = OUTPUT_VAL;
unsigned int mask, val;
 
-   armada_37xx_update_reg(&reg, offset);
+   armada_37xx_update_reg(&reg, &offset);
mask = BIT(offset);
val = value ? mask : 0;
 
-- 
2.17.2 (Apple Git-113)



Re: Usecases for the per-task latency-nice attribute

2019-09-18 Thread Patrick Bellasi


On Wed, Sep 18, 2019 at 16:22:32 +0100, Vincent Guittot wrote...

> On Wed, 18 Sep 2019 at 16:19, Patrick Bellasi  wrote:

[...]

>> $> Wakeup path tunings
>> ==
>>
>> Some additional possible use-cases was already discussed in [3]:
>>
>>  - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
>>depending on crossing certain pre-configured threshold of latency
>>niceness.
>>
>>  - dynamically bias the vruntime updates we do in place_entity()
>>depending on the actual latency niceness of a task.
>>
>>PeterZ thinks this is dangerous but that we can "(carefully) fumble a
>>bit there."
>
> I agree with Peter that we can easily break the fairness if we bias vruntime

Just to be more precise here and also to better understand, here I'm
talking about turning the tweaks we already have for:

 - START_DEBIT
 - GENTLE_FAIR_SLEEPERS

a bit more parametric and proportional to the latency-nice of a task.

In principle, if a task declares a positive latency niceness, could we
not read this also as "I accept to be a bit penalised in terms of
fairness at wakeup time"?

Whatever tweaks we do there should affect anyway only one sched_latency
period... although I'm not yet sure if that's possible and how.
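
Purely as a strawman (the latency_nice field and the scaling are
hypothetical), the GENTLE_FAIR_SLEEPERS credit in place_entity() could
be made proportional along these lines:

	unsigned long thresh = sysctl_sched_latency;

	if (sched_feat(GENTLE_FAIR_SLEEPERS))
		thresh >>= 1;

	/* Positive latency niceness => accept a smaller wakeup credit,
	 * i.e. a bit less fairness at wakeup time. */
	thresh -= thresh * (p->latency_nice + 20) / 40;

	vruntime -= thresh;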

>>  - bias the decisions we take in check_preempt_tick() still depending
>>on a relative comparison of the current and wakeup task latency
>>niceness values.
>
> This one seems possible as it will mainly enable a task to preempt
> "earlier" the running task but will not break the fairness
> So the main impact will be the number of context switch between tasks
> to favor or not the scheduling latency

Preempting before is definitively a nice-to-have feature.

At the same time, it would be interesting to support the case where a
low latency-nice task (e.g. TOP_APP) RUNNABLE on a CPU has better
chances to be executed up to completion without being preempted by a
high latency-nice task (e.g. BACKGROUND) waking up on its CPU.

For that to happen, we need a mechanism to "delay" the execution of a
less important RUNNABLE task up to a certain period.

It's impacting the fairness, true, but latency-nice in this case will
mean that we want to "complete faster", not just "start faster".

Is this definition something we can reason about?

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: Usecases for the per-task latency-nice attribute

2019-09-18 Thread Patrick Bellasi
with a different name, since jitters clashes
with other RT related concepts.

Maybe we don't even need a name at all, the other two attributes you
specify are good enough to identify those tasks: they are just "small
background" tasks.

  small  : because of their small util_est value
  background : because of their high latency-nice value

> already active core and thus refrains from waking up of a new core if
> possible. This requires tagging of tasks from the userspace hinting which
> tasks are un-important and thus waking-up a new core to minimize the
> latency is un-necessary for such tasks.
> As per the discussion on the posted RFC, it will be appropriate to use the
> task latency property where a task with the highest latency-nice value can
> be packed.

We should better define here what you mean by "highest" latency-nice
value: do you really mean the top of the range, e.g. 19?

Or...

> But for this specific use-cases, having just a binary value to know which
> task is latency-sensitive and which not is sufficient enough, but having a
> range is also a good way to go where above some threshold the task can be
> packed.

... yes, maybe we can reason about a "threshold based profiling" where
something like for example:

   /proc/sys/kernel/sched_packing_util_max: 200
   /proc/sys/kernel/sched_packing_latency_min : 17

means that a task with latency-nice >= 17 and util_est <= 200 will be packed?
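
Something like this sketch (sysctl names as proposed above, task fields
hypothetical):

	static inline bool task_packable(struct task_struct *p)
	{
		return task_util_est(p) <= sysctl_sched_packing_util_max &&
		       p->latency_nice >= sysctl_sched_packing_latency_min;
	}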


$> Wakeup path tunings
==

Some additional possible use-cases was already discussed in [3]:

 - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
   depending on crossing certain pre-configured threshold of latency
   niceness.
  
 - dynamically bias the vruntime updates we do in place_entity()
   depending on the actual latency niceness of a task.
  
   PeterZ thinks this is dangerous but that we can "(carefully) fumble a
   bit there."
  
 - bias the decisions we take in check_preempt_tick() still depending
   on a relative comparison of the current and wakeup task latency
   niceness values.

> References:
> ===
> [1]. https://lkml.org/lkml/2019/8/30/829
> [2]. https://lkml.org/lkml/2019/7/25/296

  [3]. Message-ID: <20190905114709.gm2...@hirez.programming.kicks-ass.net>
   
https://lore.kernel.org/lkml/20190905114709.gm2...@hirez.programming.kicks-ass.net/


Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: linux-next: Tree for Sep 16 (kernel/sched/core.c)

2019-09-18 Thread Patrick Bellasi


On Wed, Sep 18, 2019 at 07:05:53 +0100, Ingo Molnar wrote...

> * Randy Dunlap  wrote:
>
>> On 9/17/19 6:38 AM, Patrick Bellasi wrote:
>> > 
>> > On Tue, Sep 17, 2019 at 08:52:42 +0100, Ingo Molnar wrote...
>> > 
>> >> * Randy Dunlap  wrote:
>> >>
>> >>> On 9/16/19 3:38 PM, Mark Brown wrote:
>> >>>> Hi all,
>> >>>>
>> >>>> Changes since 20190915:
>> >>>>
>> >>>
>> >>> on x86_64:
>> >>>
>> >>> when CONFIG_CGROUPS is not set:
>> > 
>> > Hi Randy,
>> > thanks for the report.
>> > 
>> >>>   CC  kernel/sched/core.o
>> >>> ../kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
>> >>> ../kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
>> >>>   struct css_task_iter it;
>> >>>^~
>> >>>   CC  kernel/printk/printk_safe.o
>> >>> ../kernel/sched/core.c:1084:2: error: implicit declaration of function 
>> >>> ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? 
>> >>> [-Werror=implicit-function-declaration]
>> >>>   css_task_iter_start(css, 0, &it);
>> >>>   ^~~
>> >>>   __sg_page_iter_start
>> >>> ../kernel/sched/core.c:1085:14: error: implicit declaration of function 
>> >>> ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? 
>> >>> [-Werror=implicit-function-declaration]
>> >>>   while ((p = css_task_iter_next(&it))) {
>> >>>   ^~
>> >>>   __sg_page_iter_next
>> >>> ../kernel/sched/core.c:1091:2: error: implicit declaration of function 
>> >>> ‘css_task_iter_end’; did you mean ‘get_task_cred’? 
>> >>> [-Werror=implicit-function-declaration]
>> >>>   css_task_iter_end(&it);
>> >>>   ^
>> >>>   get_task_cred
>> >>> ../kernel/sched/core.c:1081:23: warning: unused variable ‘it’ 
>> >>> [-Wunused-variable]
>> >>>   struct css_task_iter it;
>> >>>^~
>> >>>
>> >>
>> >> I cannot reproduce this build failure: I took Linus's latest which has all 
>> >> the -next scheduler commits included (ad062195731b), and an x86-64 "make 
>> >> defconfig" and a disabling of CONFIG_CGROUPS still resuls in a kernel 
>> >> that builds fine.
>> > 
>> > Same here Ingo, I cannot reproduce on arm64 and !CONFIG_CGROUPS and
>> > testing on tip/sched/core.
>> > 
>> > However, if you like, the following patch can make that code a
>> > bit more "robust".
>> > 
>> > Best,
>> > Patrick
>> > 
>> > ---8<---
>> > From 7e17b7bb08dd8dfc57e01c2a7b6875439eb47cbe Mon Sep 17 00:00:00 2001
>> > From: Patrick Bellasi 
>> > Date: Tue, 17 Sep 2019 14:12:10 +0100
>> > Subject: [PATCH 1/1] sched/core: uclamp: Fix compile error on 
>> > !CONFIG_CGROUPS
>> > 
>> > Randy reported a compiler error on x86_64 and !CONFIG_CGROUPS which is due
>> > to uclamp_update_active_tasks() using the undefined css_task_iter().
>> > 
>> > Since uclamp_update_active_tasks() is used only when cgroup support is
>> > enabled, fix that by properly guarding that function at compile time.
>> > 
>> > Signed-off-by: Patrick Bellasi 
>> > Link: 
>> > https://lore.kernel.org/lkml/1898d3c9-1997-17ce-a022-a5e28c8dc...@infradead.org/
>> > Fixes: commit babbe170e05 ("sched/uclamp: Update CPU's refcount on TG's 
>> > clamp changes")
>> 
>> Acked-by: Randy Dunlap  # build-tested
>> 
>> Thanks.
>
> Build failures like this one shouldn't depend on the compiler version - 
> and it's still a mystery how and why this build bug triggered - we cannot 
> apply the fix without knowing the answer to those questions.

Right, but it's also quite strange it's not triggering without the
guarding above. The only definition of struct css_task_iter I can see is
the one
provided in:

   include/linux/cgroup.h:50
   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/cgroup.h?h=35f7a95266153b1cf0caca3aa9661cb721864527#n50

which is CONFIG_CGROUPS guarded.

> Can you reproduce the build bug with Linus's latest tree? If not, which 
> part of -next triggers the build failure?

I tried again using this morning's Linus tree headed at:

  commit 35f7a9526615 ("Merge tag 'devprop-5.4-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")

and compilation actually fails for me too.

Everything is fine in v5.3 with !CONFIG_CGROUPS and a git bisect
between v5.3 and Linus master points to:

  commit babbe170e053c ("sched/uclamp: Update CPU's refcount on TG's clamp 
changes")

So, I think it's really my fault not properly testing !CONFIG_CGROUP,
which is enforced by default from CONFIG_SCHED_AUTOGROUP.

The patch above fixes the compilation error, hope this helps.

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: linux-next: Tree for Sep 16 (kernel/sched/core.c)

2019-09-17 Thread Patrick Bellasi


On Tue, Sep 17, 2019 at 08:52:42 +0100, Ingo Molnar wrote...

> * Randy Dunlap  wrote:
>
>> On 9/16/19 3:38 PM, Mark Brown wrote:
>> > Hi all,
>> > 
>> > Changes since 20190915:
>> > 
>> 
>> on x86_64:
>> 
>> when CONFIG_CGROUPS is not set:

Hi Randy,
thanks for the report.

>>   CC  kernel/sched/core.o
>> ../kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
>> ../kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
>>   struct css_task_iter it;
>>^~
>>   CC  kernel/printk/printk_safe.o
>> ../kernel/sched/core.c:1084:2: error: implicit declaration of function 
>> ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? 
>> [-Werror=implicit-function-declaration]
>>   css_task_iter_start(css, 0, &it);
>>   ^~~
>>   __sg_page_iter_start
>> ../kernel/sched/core.c:1085:14: error: implicit declaration of function 
>> ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? 
>> [-Werror=implicit-function-declaration]
>>   while ((p = css_task_iter_next(&it))) {
>>   ^~
>>   __sg_page_iter_next
>> ../kernel/sched/core.c:1091:2: error: implicit declaration of function 
>> ‘css_task_iter_end’; did you mean ‘get_task_cred’? 
>> [-Werror=implicit-function-declaration]
>>   css_task_iter_end(&it);
>>   ^
>>   get_task_cred
>> ../kernel/sched/core.c:1081:23: warning: unused variable ‘it’ 
>> [-Wunused-variable]
>>   struct css_task_iter it;
>>^~
>> 
>
> I cannot reproduce this build failure: I took Linus's latest which has all 
> the -next scheduler commits included (ad062195731b), and an x86-64 "make 
> defconfig" and a disabling of CONFIG_CGROUPS still resuls in a kernel 
> that builds fine.

Same here Ingo, I cannot reproduce on arm64 and !CONFIG_CGROUPS and
testing on tip/sched/core.

However, if you like, the following patch can make that code a
bit more "robust".

Best,
Patrick

---8<---
From 7e17b7bb08dd8dfc57e01c2a7b6875439eb47cbe Mon Sep 17 00:00:00 2001
From: Patrick Bellasi 
Date: Tue, 17 Sep 2019 14:12:10 +0100
Subject: [PATCH 1/1] sched/core: uclamp: Fix compile error on !CONFIG_CGROUPS

Randy reported a compiler error on x86_64 and !CONFIG_CGROUPS which is due
to uclamp_update_active_tasks() using the undefined css_task_iter().

Since uclamp_update_active_tasks() is used only when cgroup support is
enabled, fix that by properly guarding that function at compile time.

Signed-off-by: Patrick Bellasi 
Link: 
https://lore.kernel.org/lkml/1898d3c9-1997-17ce-a022-a5e28c8dc...@infradead.org/
Fixes: commit babbe170e05 ("sched/uclamp: Update CPU's refcount on TG's clamp 
changes")
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c7b90bcbe4e..14873ad4b34a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,7 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
 static inline void
 uclamp_update_active(struct task_struct *p, enum uclamp_id clamp_id)
 {
@@ -1091,7 +1092,6 @@ uclamp_update_active_tasks(struct cgroup_subsys_state 
*css,
	css_task_iter_end(&it);
 }
 
-#ifdef CONFIG_UCLAMP_TASK_GROUP
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 static void uclamp_update_root_tg(void)
 {
-- 
2.22.0
---8<---

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...

> On 05/09/2019 12:18, Patrick Bellasi wrote:
>>> There's a few things wrong there; I really feel that if we call it nice,
>>> it should be like nice. Otherwise we should call it latency-bias and not
>>> have the association with nice to confuse people.
>>>
>>> Secondly; the default should be in the middle of the range. Naturally
>>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>>> want [0,1024], then the default really should be 512, but personally I
>>> like 0 better as a default, in which case we need negative numbers.
>>>
>>> This is important because we want to be able to bias towards less
>>> importance to (tail) latency as well as more importantance to (tail)
>>> latency.
>>>
>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>> 
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>> 
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>> 
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>> 
>> Given the two extremes above, would it not be much simpler and more
>> intuitive to have 0 implementing the FB/Android (no latency) case and
>> 1024 the (max latency) Oracle case?
>> 
>
> For something like latency-, I don't see the point of having
> such a wide range. The nice range is probably more than enough - and before
> even bothering about the range, we should probably agree on what the range
> should represent.
>
> If it's niceness, I read it as: positive latency-nice value means we're
> nice to latency, means we reduce it. So the further up you go, the more you
> restrict your wakeup scan. I think it's quite easy to map that into the
> code: current behaviour at 0, with a decreasing scan mask size as we go
> towards +19. I don't think anyone needs 512 steps to tune this.
>
> I don't know what logic we'd follow for negative values though. Maybe
> latency-nice -20 means always going through the slowpath, but what of the
> intermediate values?

Yep, I think so far we are all converging towards the idea to use a
signed range. Regarding the range itself, yes: 1024 looks very
oversized, but +-20 is still something which leaves room for a bit of
flexibility and also better matches the idea that we don't want to
"enumerate behaviours" but just expose a knob. To map certain "bias" we
could benefit from a slightly larger range.

> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be

For the wakeup path there is also the TurboSched proposal by Parth:

   Message-ID: <20190725070857.6639-1-pa...@linux.ibm.com> 
   https://lore.kernel.org/lkml/20190725070857.6639-1-pa...@linux.ibm.com/

we should keep in mind.

> applied elsewhere (e.g. load-balance, something like task_hot() and its
> use of sysctl_sched_migration_cost).

For LB can you come up with some better description of what usages you
see could benefit from a "per task" or "per task-group" latency niceness?

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:40:30 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:
>
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>> 
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>> 
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>
> I'm arguing that signed _is_ more human friendly ;-)

... but you are not human. :)

>> Given the two extremes above, would it not be much simpler and more
>> intuitive to have 0 implementing the FB/Android (no latency) case and
>> 1024 the (max latency) Oracle case?
>
> See, I find the signed thing more natural, negative is a bias away from
> latency sensitive, positive is a bias towards latency sensitive.
>
> Also; 0 is a good default value ;-)

Yes, that's appealing indeed.

>> Moreover, we will never match completely the nice semantic, given that
>> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
>> usage change for each step?
>
> Only because we were nice when implementing it. Posix leaves it
> unspecified and we could change it at any time. The only real semantics
> is a relative 'weight' (opengroup uses the term 'favourable').

Good to know, I was considering it a POSIX requirement.

>> Could changing the name to "latency-tolerance" break the tie by marking
>> its difference wrt prio/nice levels? AFAIR, that was also the original
>> proposal [1] by PaulT during the OSPM discussion.
>
> latency torrerance could still be a signed entity, positive would
> signify we're more tolerant of latency (ie. less sensitive) while
> negative would be less tolerant (ie. more sensitive).

Right.

>> For latency-nice instead we will likely base our biasing strategies on
>> some predefined (maybe system-wide configurable) const thresholds.
>
> I'm not quite sure; yes, for some of these things, like the idle search
> on wakeup, certainly. But say for wakeup-preemption, we could definitely
> make it a task relative attribute.

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:30:02 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>> On 09/05/19 12:46, Peter Zijlstra wrote:
>
>> > This is important because we want to be able to bias towards less
>> > importance to (tail) latency as well as more importantance to (tail)
>> > latency.
>> > 
>> > Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>> 
>> Another use case I'm considering is using latency-nice to prefer an idle CPU 
>> if
>> latency-nice is set otherwise go for the most energy efficient CPU.
>> 
>> Ie: sacrifice (some) energy for latency.
>> 
>> The way I see interpreting latency-nice here as a binary switch. But
>> maybe we can use the range to select what (some) energy to sacrifice
>> mean here. Hmmm.
>
> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> and >0 (or middle value if you're of that persuasion).
>
> In your case, I'm thinking you mean >0, we want to lower the latency.
>
> Anyway; there were a number of things mentioned at OSPM that we could
> tie into this thing and finding sensible mappings is going to be a bit
> of trial and error I suppose.
>
> But as patrick said; we're very much exporting a BIAS knob, not a set of
> behaviours.

Right, although I think behaviours could still be exported but via a
different and configurable interface, using thresholds.

Either at compile time or via procfs maybe we can expose and properly
document what happen in the scheduler if/when a task has a "latency
niceness" crossing a given threshold.

For example, by setting something like:

   /proc/sys/kernel/sched_cfs_latency_idle = 1000

we state that the task is going to be scheduled according to the
SCHED_IDLE policy.

  ( ( (tomatoes target here) ) )
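
As a minimal sketch of that threshold mapping (all names hypothetical):

	/* Treat a CFS task as SCHED_IDLE once its latency niceness
	 * crosses the configured threshold. */
	static inline bool task_latency_idle(struct task_struct *p)
	{
		return p->latency_nice >= sysctl_sched_cfs_latency_idle;
	}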

Not sure also if we wanna commit to user-space APIs how we internally
map/translate a "latency niceness" value into a scheduler behaviour
bias. Maybe better not at least at the very beginning.

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:13:47 +0100, Qais Yousef wrote...

> On 09/05/19 12:46, Peter Zijlstra wrote:
>> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>> 
>> > > From just reading the above, I would expect it to have the range
>> > > [-20,19] just like normal nice. Apparently this is not so.
>> > 
>> > Regarding the range for the latency-nice values, I guess we have two
>> > options:
>> > 
>> >   - [-20..19], which makes it similar to priorities
>> >   downside: we quite likely end up with a kernel space representation
>> >   which does not match the user-space one, e.g. look at
>> >   task_struct::prio.
>> > 
>> >   - [0..1024], which makes it more similar to a "percentage"
>> > 
>> > Being latency-nice a new concept, we are not constrained by POSIX and
>> > IMHO the [0..1024] scale is a better fit.
>> > 
>> > That will translate into:
>> > 
>> >   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>> >   policies are disabled and we wakeup up as fast as possible
>> > 
>> >   latency-nice=1024 : maximum niceness, where for example we can imaging
>> >   to turn switch a CFS task to be SCHED_IDLE?
>> 
>> There's a few things wrong there; I really feel that if we call it nice,
>> it should be like nice. Otherwise we should call it latency-bias and not
>> have the association with nice to confuse people.
>> 
>> Secondly; the default should be in the middle of the range. Naturally
>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>> want [0,1024], then the default really should be 512, but personally I
>> like 0 better as a default, in which case we need negative numbers.
>> 
>> This is important because we want to be able to bias towards less
>> importance to (tail) latency as well as more importantance to (tail)
>> latency.
>> 
>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>
> Another use case I'm considering is using latency-nice to prefer an idle CPU 
> if
> latency-nice is set otherwise go for the most energy efficient CPU.
>
> Ie: sacrifice (some) energy for latency.
>
> The way I see interpreting latency-nice here as a binary switch. But maybe we
> can use the range to select what (some) energy to sacrifice mean here. Hmmm.

I see this concept possibly evolving into something more than just a
binary switch. Not yet convinced if it makes sense and/or is possible
but, in principle, I was thinking about these possible usages for CFS
tasks:

 - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
   depending on crossing certain pre-configured threshold of latency
   niceness.

 - dynamically bias the vruntime updates we do in place_entity()
   depending on the actual latency niceness of a task.

 - bias the decisions we take in check_preempt_tick() still depending
   on a relative comparison of the current and wakeup task latency
   niceness values.

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 11:46:16 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>
>> > From just reading the above, I would expect it to have the range
>> > [-20,19] just like normal nice. Apparently this is not so.
>> 
>> Regarding the range for the latency-nice values, I guess we have two
>> options:
>> 
>>   - [-20..19], which makes it similar to priorities
>>   downside: we quite likely end up with a kernel space representation
>>   which does not match the user-space one, e.g. look at
>>   task_struct::prio.
>> 
>>   - [0..1024], which makes it more similar to a "percentage"
>> 
>> Being latency-nice a new concept, we are not constrained by POSIX and
>> IMHO the [0..1024] scale is a better fit.
>> 
>> That will translate into:
>> 
>>   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>>   policies are disabled and we wakeup up as fast as possible
>> 
>>   latency-nice=1024 : maximum niceness, where for example we can imaging
>>   to turn switch a CFS task to be SCHED_IDLE?
>
> There's a few things wrong there; I really feel that if we call it nice,
> it should be like nice. Otherwise we should call it latency-bias and not
> have the association with nice to confuse people.
>
> Secondly; the default should be in the middle of the range. Naturally
> this would be a signed range like nice [-(x+1),x] for some x. but if you
> want [0,1024], then the default really should be 512, but personally I
> like 0 better as a default, in which case we need negative numbers.
>
> This is important because we want to be able to bias towards less
> importance to (tail) latency as well as more importantance to (tail)
> latency.
>
> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> Facebook OTOH seems to want to sacrifice (some) throughput for latency.

Right, we have this dualism to deal with and current mainline behaviour
is somehow in the middle.

BTW, the FB requirement is the same we have in Android.
We want some CFS tasks to have very small latency and a low chance
to be preempted by the wake-up of less-important "background" tasks.

I'm not totally against the usage of a signed range, but I'm thinking
that since we are introducing a new (non POSIX) concept we can get the
chance to make it more human friendly.

Given the two extremes above, would it not be much simpler and more
intuitive to have 0 implementing the FB/Android (no latency) case and
1024 the (max latency) Oracle case?

Moreover, we will never match completely the nice semantic, given that
a 1 nice unit has a proper math meaning, isn't something like 10% CPU
usage change for each step?

For latency-nice instead we will likely base our biasing strategies on
some predefined (maybe system-wide configurable) const thresholds.

Could changing the name to "latency-tolerance" break the tie by marking
its difference wrt prio/nice levels? AFAIR, that was also the original
proposal [1] by PaulT during the OSPM discussion.

Best,
Patrick

[1] https://youtu.be/oz43thSFqmk?t=1302

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 0/9] Task latency-nice

2019-09-05 Thread Patrick Bellasi


On Fri, Aug 30, 2019 at 18:49:35 +0100, subhra mazumdar wrote...

> Introduce new per task property latency-nice for controlling scalability
> in scheduler idle CPU search path.

As per my comments in other message, we should try to better split the
"latency nice" concept introduction (API and mechanisms) from its usage
in different scheduler code paths.

This distinction should be evident from the cover letter too. What you
use "latency nice" for is just one possible use-case, thus it's
important to make sure it's generic enough to fit other usages too.

> Valid latency-nice values are from 1 to
> 100 indicating 1% to 100% search of the LLC domain in select_idle_cpu. New
> CPU cgroup file cpu.latency-nice is added as an interface to set and get.
> All tasks in the same cgroup share the same latency-nice value. Using a
> lower latency-nice value can help latency intolerant tasks, e.g. very short
> running OLTP threads where full LLC search cost can be significant compared
> to run time of the threads. The default latency-nice value is 5.
>
> In addition to latency-nice, it also adds a new sched feature SIS_CORE to
> be able to disable idle core search altogether which is costly and hurts
> more than it helps in short running workloads.

I don't see why your latency-nice cannot be used to implement what you
get with NO_SIS_CORE.

IMHO, "latency nice" should be exposed as a pretty generic concept which
progressively enables different "levels of biasing" both at wake-up time
and load-balance time.

Why is it not possible to implement the SIS_CORE/NO_SIS_CORE switch
just as different threshold values for the latency-nice value of a task?
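
Something along these lines is what I have in mind (a rough, untested
sketch: the p->latency_nice field and the threshold constant are
illustrative names only, not code from this series):

	/*
	 * Illustrative sketch only: gate the idle core scan on a
	 * per-task latency-nice threshold instead of a global
	 * sched_feat. p->latency_nice and SIS_CORE_NICE_THRESHOLD
	 * are made-up names for the sake of the example.
	 */
	if (p->latency_nice < SIS_CORE_NICE_THRESHOLD) {
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}

	i = select_idle_cpu(p, sd, target);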

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 4/9] sched: SIS_CORE to disable idle core search

2019-09-05 Thread Patrick Bellasi


On Fri, Aug 30, 2019 at 18:49:39 +0100, subhra mazumdar wrote...

> Use SIS_CORE to disable idle core search. For some workloads
> select_idle_core becomes a scalability bottleneck, removing it improves
> throughput. Also there are workloads where disabling it can hurt latency,
> so need to have an option.
>
> Signed-off-by: subhra mazumdar 
> ---
>  kernel/sched/fair.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c31082d..23ec9c6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   if (!sd)
>   return target;
>  
> - i = select_idle_core(p, sd, target);
> - if ((unsigned)i < nr_cpumask_bits)
> - return i;
> + if (sched_feat(SIS_CORE)) {
> + i = select_idle_core(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
>  
>   i = select_idle_cpu(p, sd, target);
>   if ((unsigned)i < nr_cpumask_bits)

This looks like it should be squashed with the previous one, or with
whatever code you'll add to define when this "biasing" is to be used or
not.

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 3/9] sched: add sched feature to disable idle core search

2019-09-05 Thread Patrick Bellasi


On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...

> Add a new sched feature SIS_CORE to have an option to disable idle core
> search (select_idle_core).
>
> Signed-off-by: subhra mazumdar 
> ---
>  kernel/sched/features.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 858589b..de4d506 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_AVG_CPU, false)
>  SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_CORE, true)

Why do we need a sched_feature? If you think there are systems in
which the usage of latency-nice does not make sense in "Select Idle
Sibling", then we should probably add a new Kconfig option instead.

If that's the case, you can probably use the init/Kconfig's
"Scheduler features" section, recently added by:

  commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
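
If a Kconfig option is the way to go, the call site sketched above
could reduce to a compile-time switch, something like this
(CONFIG_SCHED_SIS_CORE being a made-up symbol for illustration, not
one defined by this series):

	#ifdef CONFIG_SCHED_SIS_CORE
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	#endif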

>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote...

> On 9/4/19 11:02 PM, Tim Chen wrote:
>> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>> "latency-nice" which is shared by all the threads in that Cgroup.
>> 
>> 
>> Subhra,
>> 
>> Thanks for posting the patchset.  Having a latency nice hint
>> is useful beyond idle load balancing.  I can think of other
>> application scenarios, like scheduling batch machine learning AVX 512
>> processes with latency sensitive processes.  AVX512 limits the frequency
> of the CPU and it is best to avoid latency sensitive tasks on the
>> same core with AVX512.  So latency nice hint allows the scheduler
>> to have a criteria to determine the latency sensitivity of a task
>> and arrange latency sensitive tasks away from AVX512 tasks.
>> 
>
>
> Hi Tim and Subhra,
>
> This patchset seems to be interesting for my TurboSched patches as well,
> where I try to pack jitter tasks on fewer cores to get higher Turbo
> Frequencies. Well, the problem I face is that we sometimes end up
> putting multiple jitter tasks on a core running some latency sensitive
> application which may see performance degradation. So my plan was to
> classify such tasks as latency sensitive, thereby hinting the load
> balancer to not put tasks on such cores.
>
> TurboSched: https://lkml.org/lkml/2019/7/25/296
>
>> You configure the latency hint on a cgroup basis.
>> But I think not all tasks in a cgroup necessarily have the same
>> latency sensitivity.
>> 
>> For example, I can see that cgroup can be applied on a per user basis,
>> and the user could run different tasks that have different latency 
>> sensitivity.
>> We may also need a way to configure latency sensitivity on a per task
>> basis instead of on a per cgroup basis.
>> 
>
> AFAIU, the problem defined above intersects with my patches as well,
> where the interface is required to classify the jitter tasks. I have
> already tried a few methods like syscall and cgroup to classify such
> tasks, and maybe something like that can be adopted with this patchset
> as well.

Agreed, these two patchsets definitely overlap in terms of mechanisms
and APIs to expose to user-space. You two seem to target different
goals, but the general approach should be:

 - expose a single, abstract concept to user-space:
   latency-nice or latency-tolerance, as PaulT proposed at OSPM

 - map this concept in kernel-space to different kinds of bias, both at
   wakeup time and load-balance time, and use both for RT and CFS tasks.

That's my understanding at least ;)

I guess we will have interesting discussions at the upcoming LPC to
figure out a solution fitting all needs.

> Thanks,
> Parth

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi
> + .write_u64 = cpu_latency_nice_write_u64,
> + },
>   { } /* Terminate */
>  };
>  
> @@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
>   .write = cpu_max_write,
>   },
>  #endif
> + {
> + .name = "latency-nice",
> + .read_u64 = cpu_latency_nice_read_u64,
> + .write_u64 = cpu_latency_nice_write_u64,
> + },
>   { } /* terminate */
>  };
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..b08d00c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>   goto err;
>  
>   tg->shares = NICE_0_LOAD;
> + tg->latency_nice = LATENCY_NICE_DEFAULT;

Maybe better NICE_0_LATENCY to be more consistent?


>   init_cfs_bandwidth(tg_cfs_bandwidth(tg));
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..365c928 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
>  #define NICE_0_LOAD  (1L << NICE_0_LOAD_SHIFT)
>  
>  /*
> + * Latency-nice default value
> + */
> +#define LATENCY_NICE_DEFAULT   5
> +#define LATENCY_NICE_MIN       1
> +#define LATENCY_NICE_MAX       100

Values 1 and 5 look kind of arbitrary.
For the range specifically, I already commented in this other message:

   Message-ID: <87r24v2i14@arm.com>
   https://lore.kernel.org/lkml/87r24v2i14@arm.com/

> +
> +/*
>   * Single value that decides SCHED_DEADLINE internal math precision.
>   * 10 -> just above 1us
>   * 9  -> just above 0.5us
> @@ -362,6 +369,7 @@ struct cfs_bandwidth {
>  /* Task group related information */
>  struct task_group {
>   struct cgroup_subsys_state css;
> + u64 latency_nice;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   /* schedulable entities of this group on each CPU */


Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote...

> On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>> "latency-nice" which is shared by all the threads in that Cgroup.
>
> *sigh*, no. We start with a normal per task attribute, and then later,
> if it is needed and makes sense, we add it to cgroups.

FWIW, to add on top of what Peter says, we used this same approach for
uclamp and it proved to be a very effective way to come up with a good
design. General principles have been:

 - a system wide API [1] (under /proc/sys/kernel/sched_*) defines
   default values for all tasks affected by that feature.
   This interface also has to define upper bounds for task-specific
   values. Thus, in the case of latency-nice, it should be set by
   default to the MIN value, since that's the current mainline
   behaviour: all tasks are latency sensitive.

 - a per-task API [2] (via the sched_setattr() syscall) can be used to
   relax the system wide setting thus implementing a "nice" policy.

 - a per-taskgroup API [3] (via cpu controller's attributes) can be used
   to relax the system-wide settings and restrict the per-task API.

The above features are worth adding in that exact order.
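
For the per-task step [2], the user-space side could eventually look
like this for latency-nice (a sketch only: the sched_latency_nice
field is hypothetical and does not exist in any released kernel; the
existing fields match today's struct sched_attr):

	#define _GNU_SOURCE
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	struct sched_attr_ln {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		uint64_t sched_runtime, sched_deadline, sched_period;
		uint32_t sched_util_min, sched_util_max;
		int32_t  sched_latency_nice;	/* hypothetical extension */
	};

	int main(void)
	{
		struct sched_attr_ln attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.sched_policy = 0;		/* SCHED_OTHER */
		attr.sched_latency_nice = 512;	/* relax the system default */

		/* Would fail on current kernels: the field does not exist. */
		if (syscall(SYS_sched_setattr, 0, &attr, 0))
			perror("sched_setattr");
		return 0;
	}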

> Also, your Changelog fails on pretty much every point. It doesn't
> explain why, it doesn't describe anything and so on.

On the description side, it's worth mentioning somewhere which
scheduling classes this feature can be useful for. It's worth
mentioning that it can apply only to:

 - CFS tasks: for example, at wakeup time a task with a high
   latency-nice value should avoid preempting a low latency-nice task.
   Maybe by mapping the latency-nice value into a proper vruntime
   normalization value? (See the sketch after this list.)

 - RT tasks: for example, at wakeup time a task with a high
   latency-nice value could avoid preempting a CFS task.
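
For the CFS case, the wakeup bias could take a shape like this (purely
illustrative and untested: p->latency_nice in [0..1024] and the helper
name are made up; only sysctl_sched_wakeup_granularity exists today):

	/*
	 * Hypothetical sketch: make wakeup preemption less eager for
	 * latency-tolerant wakees by widening the preemption
	 * granularity. latency_nice == 0 keeps the current mainline
	 * behaviour.
	 */
	static unsigned long wakeup_gran_biased(struct task_struct *p)
	{
		unsigned long gran = sysctl_sched_wakeup_granularity;

		return gran + gran * p->latency_nice / 1024;
	}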

I'm sure there will be discussion about some of these features; that's
why it's important in the proposal presentation to keep a well-defined
distinction between the "mechanisms and API" and how we use the new
concept to "bias" some scheduler policies.

> From just reading the above, I would expect it to have the range
> [-20,19] just like normal nice. Apparently this is not so.

Regarding the range for the latency-nice values, I guess we have two
options:

  - [-20..19], which makes it similar to priorities
  downside: we quite likely end up with a kernel space representation
  which does not match the user-space one, e.g. look at
  task_struct::prio.

  - [0..1024], which makes it more similar to a "percentage"

Latency-nice being a new concept, we are not constrained by POSIX, and
IMHO the [0..1024] scale is a better fit.

That will translate into:

  latency-nice=0 : default (current mainline) behaviour, all "biasing"
  policies are disabled and we wake up as fast as possible

  latency-nice=1024 : maximum niceness, where for example we could
  imagine switching a CFS task to SCHED_IDLE?

Best,
Patrick

[1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
[2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
[3] 5 patches in today's tip/sched/core, up to:
    commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")

-- 
#include 

Patrick Bellasi


[tip: sched/core] sched/uclamp: Propagate parent clamps

2019-09-03 Thread tip-bot2 for Patrick Bellasi
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     0b60ba2dd342016e4e717dbaa4ca9af3a43f4434
Gitweb:        https://git.kernel.org/tip/0b60ba2dd342016e4e717dbaa4ca9af3a43f4434
Author:        Patrick Bellasi 
AuthorDate:    Thu, 22 Aug 2019 14:28:07 +01:00
Committer:     Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 09:17:38 +02:00

sched/uclamp: Propagate parent clamps

In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on the
parent's assigned resources. This requires properly propagating and
aggregating parent attributes down to its descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation time to ensure user-space
writes never fail while still always tracking the most restrictive
values.
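
To make the propagation rule concrete, here is a toy user-space model
of it (illustrative only; the real implementation is
cpu_util_update_eff() in the diff below):

	#include <stdio.h>

	enum { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

	/* eff = min(requested, parent eff); protection capped by limit */
	static void propagate(const unsigned int parent_eff[UCLAMP_CNT],
			      const unsigned int req[UCLAMP_CNT],
			      unsigned int eff[UCLAMP_CNT])
	{
		for (int id = 0; id < UCLAMP_CNT; id++)
			eff[id] = req[id] < parent_eff[id]
				? req[id] : parent_eff[id];
		if (eff[UCLAMP_MIN] > eff[UCLAMP_MAX])
			eff[UCLAMP_MIN] = eff[UCLAMP_MAX];
	}

	int main(void)
	{
		unsigned int root[UCLAMP_CNT]  = { 1024, 1024 }; /* free  */
		unsigned int child[UCLAMP_CNT] = {  600,  400 }; /* min>max */
		unsigned int eff[UCLAMP_CNT];

		propagate(root, child, eff);
		printf("effective: min=%u max=%u\n",
		       eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
		return 0;
	}

The inconsistent request (min=600, max=400) is accepted but yields an
effective min=400, max=400, as described above.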

Signed-off-by: Patrick Bellasi 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Alessio Balsini 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Morten Rasmussen 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Quentin Perret 
Cc: Rafael J . Wysocki 
Cc: Steve Muckle 
Cc: Suren Baghdasaryan 
Cc: Thomas Gleixner 
Cc: Todd Kjos 
Cc: Vincent Guittot 
Cc: Viresh Kumar 
Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bell...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c  | 44 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c186abe..8855481 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1166,6 +1166,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
 #ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+   root_task_group.uclamp[clamp_id] = uc_max;
 #endif
}
 }
@@ -6824,6 +6825,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
for_each_clamp_id(clamp_id) {
uclamp_se_set(&tg->uclamp_req[clamp_id],
  uclamp_none(clamp_id), false);
+   tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
}
 #endif
 }
@@ -7070,6 +7072,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+   struct cgroup_subsys_state *top_css = css;
+   struct uclamp_se *uc_parent = NULL;
+   struct uclamp_se *uc_se = NULL;
+   unsigned int eff[UCLAMP_CNT];
+   unsigned int clamp_id;
+   unsigned int clamps;
+
+   css_for_each_descendant_pre(css, top_css) {
+   uc_parent = css_tg(css)->parent
+   ? css_tg(css)->parent->uclamp : NULL;
+
+   for_each_clamp_id(clamp_id) {
+   /* Assume effective clamps matches requested clamps */
+   eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+   /* Cap effective clamps with parent's effective clamps */
+   if (uc_parent &&
+   eff[clamp_id] > uc_parent[clamp_id].value) {
+   eff[clamp_id] = uc_parent[clamp_id].value;
+   }
+   }
+   /* Ensure protection is always capped by limit */
+   eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
+
+   /* Propagate most restrictive effective clamps */
+   clamps = 0x0;
+   uc_se = css_tg(css)->uclamp;
+   for_each_clamp_id(clamp_id) {
+   if (eff[clamp_id] == uc_se[clamp_id].value)
+   continue;
+   uc_se[clamp_id].value = eff[clamp_id];
+   uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
+   clamps |= (0x1 << clamp_id);
+   }
+   if (!clamps)
+   css = css_rightmost_descendant(css);
+   }
+}
 
 /*
  * Integer 10^N with a given N exponent by casting to integer the literal "1eN"
@@ -7138,6 +7179,9 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
 */
   

[tip: sched/core] sched/uclamp: Extend CPU's cgroup controller

2019-09-03 Thread tip-bot2 for Patrick Bellasi
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2480c093130f64ac3a410504fa8b3db1fc4b87ce
Gitweb:        https://git.kernel.org/tip/2480c093130f64ac3a410504fa8b3db1fc4b87ce
Author:        Patrick Bellasi 
AuthorDate:    Thu, 22 Aug 2019 14:28:06 +01:00
Committer:     Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 09:17:37 +02:00

sched/uclamp: Extend CPU's cgroup controller

The cgroup CPU bandwidth controller allows assigning a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU runs that task at.
The amount of computation can also be affected by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes, uclamp.{min,max},
which allow enforcing utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
  i.e. the RUNNABLE tasks of this group will run at least at a
  minimum frequency which corresponds to the uclamp.min
  utilization

- uclamp.max: defines the maximum utilization which should be considered
  i.e. the RUNNABLE tasks of this group will run up to a
  maximum frequency which corresponds to the uclamp.max
  utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depend on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
related updates.
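
From user-space, usage would then look roughly like this (a sketch:
the cgroup mount point and group name are examples, and the accepted
value format is whatever the final interface defines):

	#include <stdio.h>

	int main(void)
	{
		/* Request at least ~50% of CPU capacity for "mygroup". */
		FILE *f = fopen("/sys/fs/cgroup/mygroup/cpu.uclamp.min", "w");

		if (!f) {
			perror("fopen");
			return 1;
		}
		fprintf(f, "50\n");
		fclose(f);
		return 0;
	}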

Signed-off-by: Patrick Bellasi 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Alessio Balsini 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Morten Rasmussen 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Quentin Perret 
Cc: Rafael J . Wysocki 
Cc: Steve Muckle 
Cc: Suren Baghdasaryan 
Cc: Thomas Gleixner 
Cc: Todd Kjos 
Cc: Vincent Guittot 
Cc: Viresh Kumar 
Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bell...@arm.com
Signed-off-by: Ingo Molnar 
---
 Documentation/admin-guide/cgroup-v2.rst |  34 -
 init/Kconfig|  22 +++-
 kernel/sched/core.c | 193 ++-
 kernel/sched/sched.h|   8 +-
 4 files changed, 253 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3b29005..5f1c266 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the freq
