[PATCH 1/1] Makefile: Fix cross building
The native pkg-config is always called, which may result in incorrect
flags and paths being passed to the cross compiler. Implement
$(CROSS_COMPILE)pkg-config to avoid this issue.

Signed-off-by: Patrick Waterlander
---
 Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index bc19584fee59..9aa2073f68d6 100644
--- a/Makefile
+++ b/Makefile
@@ -462,6 +462,7 @@ LZMA	= lzma
 LZ4	= lz4c
 XZ	= xz
 ZSTD	= zstd
+PKG_CONFIG	?= $(CROSS_COMPILE)pkg-config

 CHECKFLAGS := -D__linux__ -Dlinux -D__STDC__ -Dunix -D__unix__ \
	       -Wbitwise -Wno-return-void -Wno-unknown-attribute $(CF)
@@ -1089,7 +1090,7 @@ mod_sign_cmd = true
 endif
 export mod_sign_cmd

-HOST_LIBELF_LIBS = $(shell pkg-config libelf --libs 2>/dev/null || echo -lelf)
+HOST_LIBELF_LIBS = $(shell $(PKG_CONFIG) libelf --libs 2>/dev/null || echo -lelf)

 has_libelf = $(call try-run,\
	echo "int main() {}" | $(HOSTCC) -xc -o /dev/null $(HOST_LIBELF_LIBS) -,1,0)

--
2.17.1
Re: [PATCH 5.10 000/103] 5.10.32-rc1 review
We ran tests on this kernel:

commit 32f5704a0a4f7dcc8aa74a49dbcce359d758f6d5 (HEAD -> rc/linux-5.10.y)
Author: Greg Kroah-Hartman
Date:   Thu Apr 15 16:44:09 2021 +0200

    Linux 5.10.31-rc1

No problems found.

Hardware tested on:
model name : Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

Specific tests ran:
ok 1 ltp.py:LTP.test_nptl
ok 2 ltp.py:LTP.test_math
ok 3 ltp.py:LTP.test_dio
ok 4 ltp.py:LTP.test_io
ok 5 ltp.py:LTP.test_power_management_tests
ok 6 ltp.py:LTP.test_can
ok 7 ltp.py:LTP.test_input
ok 8 ltp.py:LTP.test_hugetlb
ok 9 ltp.py:LTP.test_ipc
ok 10 ltp.py:LTP.test_uevent
ok 11 ltp.py:LTP.test_smoketest
ok 12 ltp.py:LTP.test_containers
ok 13 ltp.py:LTP.test_filecaps
ok 14 ltp.py:LTP.test_sched
ok 15 ltp.py:LTP.test_hyperthreading
ok 16 ltp.py:LTP.test_cap_bounds
ok 17 kpatch.sh
ok 18 perf.py:PerfNonPriv.test_perf_help
ok 19 perf.py:PerfNonPriv.test_perf_version
ok 20 perf.py:PerfNonPriv.test_perf_list
ok 21 perf.py:PerfPriv.test_perf_record
ok 22 perf.py:PerfPriv.test_perf_cmd_kallsyms
ok 23 perf.py:PerfPriv.test_perf_cmd_annotate
ok 24 perf.py:PerfPriv.test_perf_cmd_evlist
ok 25 perf.py:PerfPriv.test_perf_cmd_script
ok 26 perf.py:PerfPriv.test_perf_stat
ok 27 perf.py:PerfPriv.test_perf_bench
ok 28 kselftest.py:kselftest.test_sysctl
ok 29 kselftest.py:kselftest.test_size
ok 30 kselftest.py:kselftest.test_sync
ok 31 kselftest.py:kselftest.test_capabilities
ok 32 kselftest.py:kselftest.test_x86
ok 33 kselftest.py:kselftest.test_pidfd
ok 34 kselftest.py:kselftest.test_membarrier
ok 35 kselftest.py:kselftest.test_sigaltstack
ok 36 kselftest.py:kselftest.test_tmpfs
ok 37 kselftest.py:kselftest.test_user
ok 38 kselftest.py:kselftest.test_sched
ok 39 kselftest.py:kselftest.test_timens
ok 40 kselftest.py:kselftest.test_timers

Tested-By: Patrick McCormick

On Mon, Apr 19, 2021 at 10:27 AM Fox Chen wrote:
>
> On Mon, 19 Apr 2021 15:05:11 +0200, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 5.10.32 release.
> > There are 103 patches in this series, all will be posted as a response
> > to this one. If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Wed, 21 Apr 2021 13:05:09 +.
> > Anything received after that time might be too late.
> >
> > The whole patch series can be found in one patch at:
> >     https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.32-rc1.gz
> > or in the git tree and branch at:
> >     git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.10.y
> > and the diffstat can be found below.
> >
> > thanks,
> >
> > greg k-h
>
> 5.10.32-rc1 Successfully compiled and booted on my Raspberry Pi 4b (8g)
> (bcm2711)
>
> Tested-by: Fox Chen
>
Re: [kbuild-all] Re: include/linux/compiler_types.h:315:38: error: call to '__compiletime_assert_536' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct can_frame, len) != offsetof(st
Am 23.03.21 um 07:06 schrieb Rong Chen:
>>>> Ugh! Why did the compiler extend the space for the union to 4 bytes?!?
>>
>> Just a random idea but maybe the added padding is due to some
>> kind of odd intrication with the __attribute__((__aligned__(8)))
>> just below? Does this reproduce if we remove the
>> __attribute__((__aligned__(8)))?
>
> Here is the layout without __attribute__((__aligned__(8))),
> the union is still extended to 4 bytes:
>
> struct can_frame {
>         canid_t                  can_id;        /*  0  4 */
>         union {
>                 __u8             len;           /*  4  1 */
>                 __u8             can_dlc;       /*  4  1 */
>         };                                      /*  4  4 */
>         __u8                     __pad;         /*  8  1 */
>         __u8                     __res0;        /*  9  1 */
>         __u8                     len8_dlc;      /* 10  1 */
>         __u8                     data[8];       /* 11  8 */
>
>         /* size: 20, cachelines: 1, members: 6 */
>         /* padding: 1 */
>         /* last cacheline: 20 bytes */
> };
>
> Best Regards,
> Rong Chen

Hi,

I would suggest trying __attribute__((__aligned__(8))) only on struct
can_frame itself, not on data[8]. If the structure length is a multiple
of 8, the compiler should recognize this and keep the union a single
byte, in favour of an array configuration of that struct. The
__attribute__((__aligned__(8))) on data[8] has strange propagation
effects upwards.

If the attributes are really necessary, I would suggest having both
__attribute__((packed)) __attribute__((__aligned__(8))) at structure
level instead of inside, so no padding happens inside the structure
while the structure itself is aligned. Using aligned and packed inside
a structure may be contradictory to the compiler.

This reminds me of the alignment/gap issue with my python3
implementation of the BCM message while alternating between x86_64 and
armhf. Using ctypes was a mess but bytestrings worked in the end. Be
aware that native alignment apparently is 4 on armhf Linux and 8 on
x86_64 Linux.

https://marc.info/?l=linux-can&m=161251622904527&w=2
https://gitlab.com/Menschel/socketcan/-/commit/afc6744129448ae4333629fc0297808dd42e3530#e522710a8423075cfd1147ae6b7f44facac3ffb0_133_132

Best Regards,
Patrick Menschel
Re: [PATCH net 1/4] net: freescale/fman: Split the main resource region reservation
On 2020-12-10 10:05, Madalin Bucur wrote:

-----Original Message-----
From: Patrick Havelange
[snipped]

But then that change would not be compatible with the existing device
trees in already existing hardware. I'm not sure how to handle that
case properly.

One needs to be backwards compatible with the old device trees, so we
do not really have a simple answer, I know. If we want to hack it,
instead of splitting ioremaps, we can reserve 4 kB in the FMan driver,
and keep the ioremap as it is now, with the benefit of less code churn.

but then the ioremap and the memory reservation do not match. Why
bother at all then with the mem reservation? Just ioremap only and be
done with it. What I'm saying is, I don't see the point of having a
"fake" reservation call if it does not correspond to what is being
used.

The reservation is not fake, it is just covering the first portion of
the ioremap. Another hypothetical FMan driver would presumably reserve
and ioremap starting from the same point, thus the desired error would
be met. Regarding removing the reservation altogether, yes, we can do
that, in the child device drivers. That will fix that use-after-free
issue you've found and align with the custom, hierarchical structure of
the FMan devices/drivers. But it would leave them without the
double-use guard we have when using the reservation.

In the end, what the reservation is trying to achieve is to make sure
there is a single driver controlling a certain peripheral, and this
basic requirement would be addressed by that change plus
devm_of_iomap() for child devices (ports, MACs).

Again, correct me if I'm wrong, but with the fake mem reservation, it
would *not* make sure that a single driver is controlling a certain
peripheral.

Actually, it would. If the current FMan driver reserves the first part
of the FMan memory, then another FMan driver (I do not expect a random
driver trying to map the FMan register area) would likely try to use
that same part (with the same or a different size, it does not matter,
there will be an error anyway) and the reservation attempt will fail.
If we fix the child device drivers, then they will have normal mappings
and reservations.

Ha!, now I understand your point. I still think it is not a clean
solution, as the reservation does not match the ioremap usage.

My point is, either have a *correct* mem reservation, or don't have one
at all. There is no point in trying to cheat the system.

Now we do not have correct reservations for the child devices because
the parent takes it all for himself. Reduce the parent reservation and
make room for correct reservations for the child. The two-sections
change you've made may try to be correct but it's overkill for the
purpose of detecting double use.

But it is not overkill if we want to detect potential subdrivers
mapping sections that would not start at the main fman region (but
still be part of the main fman region).

And I also find the patch to obfuscate the already not so readable
code, so I'd opt for a simpler fix.

As said already, I'm not in favor of having a reservation that does not
match the real usage. And in my opinion, having a mismatch between the
mem reservation and the mem usage is also the kind of obfuscation that
we want to avoid. Yes, now the code is slightly more complex, but it is
also slightly more correct. I'm not currently seeing another way to
make it simpler *and* correct at the same time.

Patrick H.
Re: [PATCH net 1/4] net: freescale/fman: Split the main resource region reservation
On 2020-12-09 19:55, Madalin Bucur wrote:

-----Original Message-----
From: Patrick Havelange
Sent: 09 December 2020 16:17
To: Madalin Bucur; David S. Miller; Jakub Kicinski;
    net...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH net 1/4] net: freescale/fman: Split the main
    resource region reservation

area. I'm assuming this is the problem you are trying to address here,
besides the stack corruption issue.

Yes exactly. I did not add this behaviour (having a main region and
subdrivers using subregions), I'm just trying to correct what is
already there. For example, this is some content of /proc/iomem for one
board I'm working with, with the current existing code:

ffe40-ffe4fdfff : fman
  ffe4e-ffe4e0fff : mac
  ffe4e2000-ffe4e2fff : mac
  ffe4e4000-ffe4e4fff : mac
  ffe4e6000-ffe4e6fff : mac
  ffe4e8000-ffe4e8fff : mac

and now with my patches:

ffe40-ffe4fdfff : /soc@ffe00/fman@40
  ffe40-ffe480fff : fman
  ffe488000-ffe488fff : fman-port
  ffe489000-ffe489fff : fman-port
  ffe48a000-ffe48afff : fman-port
  ffe48b000-ffe48bfff : fman-port
  ffe48c000-ffe48cfff : fman-port
  ffe4a8000-ffe4a8fff : fman-port
  ffe4a9000-ffe4a9fff : fman-port
  ffe4aa000-ffe4aafff : fman-port
  ffe4ab000-ffe4abfff : fman-port
  ffe4ac000-ffe4acfff : fman-port
  ffe4c-ffe4d : fman
  ffe4e-ffe4e0fff : mac
  ffe4e2000-ffe4e2fff : mac
  ffe4e4000-ffe4e4fff : mac
  ffe4e6000-ffe4e6fff : mac
  ffe4e8000-ffe4e8fff : mac

While for the latter I think we can put together a quick fix, for the
former I'd like to take a bit of time to select the best fix, if one is
really needed. So, please, let's split the two problems and first
address the incorrect stack memory use.

I have no idea how you can fix it without a (more correct this time)
dummy region passed as parameter (and you don't want to use the first
patch). But then it will be useless to do the call anyway, as it won't
do any proper verification at all, so it could also be removed
entirely, which begs the question: why do it at all in the first place
(the devm_request_mem_region)? I'm not an expert in that part of the
code so feel free to correct me if I missed something.

BR,
Patrick H.

Hi, Patrick, the DPAA entities are described in the device tree. Adding
some hardcoding in the driver is not really the solution for this
problem. And I'm not sure we have a clear problem statement to start
with. Can you help me on that part?

I'm not seeing any problem here; the offsets used by the fman driver
were already there, I just reorganized them in 2 blocks.

- The current call to __devm_request_region in fman_port.c is not
  correct. One way to fix this is to use devm_request_mem_region,
  however this requires that the main fman would not be reserving the
  whole region. This leads to the second problem:
- Make sure the main fman driver is not reserving the whole region.

Is that clearer like this?

Patrick H.

Hi,

The overlapping IO areas result from the device tree description, that
in turn mimics the HW description in the manual. If we really want to
remove the nesting, we should change the device trees, not the drivers.

But then that change would not be compatible with the existing device
trees in already existing hardware. I'm not sure how to handle that
case properly.

If we want to hack it, instead of splitting ioremaps, we can reserve
4 kB in the FMan driver, and keep the ioremap as it is now, with the
benefit of less code churn.

but then the ioremap and the memory reservation do not match. Why
bother at all then with the mem reservation? Just ioremap only and be
done with it. What I'm saying is, I don't see the point of having a
"fake" reservation call if it does not correspond to what is being
used.

In the end, what the reservation is trying to achieve is to make sure
there is a single driver controlling a certain peripheral, and this
basic requirement would be addressed by that change plus
devm_of_iomap() for child devices (ports, MACs).

Again, correct me if I'm wrong, but with the fake mem reservation, it
would *not* make sure that a single driver is controlling a certain
peripheral.

My point is, either have a *correct* mem reservation, or don't have one
at all. There is no point in trying to cheat the system.

Patrick H.
Re: [PATCH net 1/4] net: freescale/fman: Split the main resource region reservation
area. I'm assuming this is the problem you are trying to address here,
besides the stack corruption issue.

Yes exactly. I did not add this behaviour (having a main region and
subdrivers using subregions), I'm just trying to correct what is
already there. For example, this is some content of /proc/iomem for one
board I'm working with, with the current existing code:

ffe40-ffe4fdfff : fman
  ffe4e-ffe4e0fff : mac
  ffe4e2000-ffe4e2fff : mac
  ffe4e4000-ffe4e4fff : mac
  ffe4e6000-ffe4e6fff : mac
  ffe4e8000-ffe4e8fff : mac

and now with my patches:

ffe40-ffe4fdfff : /soc@ffe00/fman@40
  ffe40-ffe480fff : fman
  ffe488000-ffe488fff : fman-port
  ffe489000-ffe489fff : fman-port
  ffe48a000-ffe48afff : fman-port
  ffe48b000-ffe48bfff : fman-port
  ffe48c000-ffe48cfff : fman-port
  ffe4a8000-ffe4a8fff : fman-port
  ffe4a9000-ffe4a9fff : fman-port
  ffe4aa000-ffe4aafff : fman-port
  ffe4ab000-ffe4abfff : fman-port
  ffe4ac000-ffe4acfff : fman-port
  ffe4c-ffe4d : fman
  ffe4e-ffe4e0fff : mac
  ffe4e2000-ffe4e2fff : mac
  ffe4e4000-ffe4e4fff : mac
  ffe4e6000-ffe4e6fff : mac
  ffe4e8000-ffe4e8fff : mac

While for the latter I think we can put together a quick fix, for the
former I'd like to take a bit of time to select the best fix, if one is
really needed. So, please, let's split the two problems and first
address the incorrect stack memory use.

I have no idea how you can fix it without a (more correct this time)
dummy region passed as parameter (and you don't want to use the first
patch). But then it will be useless to do the call anyway, as it won't
do any proper verification at all, so it could also be removed
entirely, which begs the question: why do it at all in the first place
(the devm_request_mem_region)? I'm not an expert in that part of the
code so feel free to correct me if I missed something.

BR,
Patrick H.

Hi, Patrick, the DPAA entities are described in the device tree. Adding
some hardcoding in the driver is not really the solution for this
problem. And I'm not sure we have a clear problem statement to start
with. Can you help me on that part?

I'm not seeing any problem here; the offsets used by the fman driver
were already there, I just reorganized them in 2 blocks.

- The current call to __devm_request_region in fman_port.c is not
  correct. One way to fix this is to use devm_request_mem_region,
  however this requires that the main fman would not be reserving the
  whole region. This leads to the second problem:
- Make sure the main fman driver is not reserving the whole region.

Is that clearer like this?

Patrick H.
Re: [PATCH net 1/4] net: freescale/fman: Split the main resource region reservation
On 2020-12-03 16:47, Madalin Bucur wrote:

-----Original Message-----
From: Patrick Havelange
Sent: 03 December 2020 15:51
To: Madalin Bucur; David S. Miller; Jakub Kicinski;
    net...@vger.kernel.org; linux-kernel@vger.kernel.org
Cc: Patrick Havelange
Subject: [PATCH net 1/4] net: freescale/fman: Split the main resource
    region reservation

The main fman driver is only using some parts of the fman memory
region. Split the reservation of the main region in 2, so that the
other regions that will be used by fman-port and fman-mac drivers can
be reserved properly and not be in conflict with the main fman
reservation.

Signed-off-by: Patrick Havelange

I think the problem you are trying to work on here is that the device
tree entry that describes the FMan IP allots to the parent FMan device
the whole memory-mapped registers area, as described in the device
datasheet. The smaller FMan building blocks (ports, MDIO controllers,
etc.) are each using a nested subset of this memory-mapped registers
area. While this hierarchical depiction of the hardware has not posed a
problem to date, it is possible to cause issues if both the FMan driver
and any of the sub-blocks drivers are trying to exclusively reserve the
registers area. I'm assuming this is the problem you are trying to
address here, besides the stack corruption issue.

Yes exactly. I did not add this behaviour (having a main region and
subdrivers using subregions), I'm just trying to correct what is
already there. For example, this is some content of /proc/iomem for one
board I'm working with, with the current existing code:

ffe40-ffe4fdfff : fman
  ffe4e-ffe4e0fff : mac
  ffe4e2000-ffe4e2fff : mac
  ffe4e4000-ffe4e4fff : mac
  ffe4e6000-ffe4e6fff : mac
  ffe4e8000-ffe4e8fff : mac

and now with my patches:

ffe40-ffe4fdfff : /soc@ffe00/fman@40
  ffe40-ffe480fff : fman
  ffe488000-ffe488fff : fman-port
  ffe489000-ffe489fff : fman-port
  ffe48a000-ffe48afff : fman-port
  ffe48b000-ffe48bfff : fman-port
  ffe48c000-ffe48cfff : fman-port
  ffe4a8000-ffe4a8fff : fman-port
  ffe4a9000-ffe4a9fff : fman-port
  ffe4aa000-ffe4aafff : fman-port
  ffe4ab000-ffe4abfff : fman-port
  ffe4ac000-ffe4acfff : fman-port
  ffe4c-ffe4d : fman
  ffe4e-ffe4e0fff : mac
  ffe4e2000-ffe4e2fff : mac
  ffe4e4000-ffe4e4fff : mac
  ffe4e6000-ffe4e6fff : mac
  ffe4e8000-ffe4e8fff : mac

While for the latter I think we can put together a quick fix, for the
former I'd like to take a bit of time to select the best fix, if one is
really needed. So, please, let's split the two problems and first
address the incorrect stack memory use.

I have no idea how you can fix it without a (more correct this time)
dummy region passed as parameter (and you don't want to use the first
patch). But then it will be useless to do the call anyway, as it won't
do any proper verification at all, so it could also be removed
entirely, which begs the question: why do it at all in the first place
(the devm_request_mem_region)? I'm not an expert in that part of the
code so feel free to correct me if I missed something.

BR,
Patrick H.
Re: [PATCH 2/4] net: freescale/fman-port: remove direct use of __devm_request_region
On 2020-12-03 09:44, Madalin Bucur wrote:

-----Original Message-----
From: Patrick Havelange
Sent: 02 December 2020 18:16
To: Madalin Bucur; David S. Miller; Jakub Kicinski;
    net...@vger.kernel.org; linux-kernel@vger.kernel.org
Cc: Patrick Havelange
Subject: [PATCH 2/4] net: freescale/fman-port: remove direct use of
    __devm_request_region

This driver was directly calling __devm_request_region with a specific
resource on the stack as parameter. This is invalid as
__devm_request_region expects the given resource passed as parameter to
live longer than the call itself, as the pointer to the resource will
be stored inside the internal struct used by the devres management.

In addition to this issue, a related bug has been found by kmemleak
with this trace:

unreferenced object 0xc001efc01880 (size 64):
  comm "swapper/0", pid 1, jiffies 4294669078 (age 3620.536s)
  hex dump (first 32 bytes):
    00 00 00 0f fe 4a c0 00 00 00 00 0f fe 4a cf ff  .J...J..
    c0 00 00 00 00 ee 9d 98 00 00 00 00 80 00 02 00
  backtrace:
    [] .alloc_resource+0xb8/0xe0
    [] .__request_region+0x70/0x1c4
    [] .__devm_request_region+0x8c/0x138
    [] .fman_port_probe+0x170/0x420
    [] .platform_drv_probe+0x84/0x108
    [] .driver_probe_device+0x2c4/0x394
    [] .__driver_attach+0x124/0x128
    [] .bus_for_each_dev+0xb4/0x110
    [] .driver_attach+0x34/0x4c
    [] .bus_add_driver+0x264/0x2a4
    [] .driver_register+0x94/0x160
    [] .__platform_driver_register+0x60/0x7c
    [] .fman_port_load+0x28/0x64
    [] .do_one_initcall+0xd4/0x1a8
    [] .kernel_init_freeable+0x1bc/0x2a4
    [] .kernel_init+0x24/0x138

Indeed, the new resource (created in __request_region) will be linked
to the given resource living on the stack, which will end its lifetime
after the function calling __devm_request_region has finished. Meaning
the new resource allocated is no longer reachable.

Now that the main fman driver is no longer reserving the region used by
fman-port, this previous hack is no longer needed and we can use the
regular call to devm_request_mem_region instead, solving those bugs at
the same time.

Signed-off-by: Patrick Havelange
---
 drivers/net/ethernet/freescale/fman/fman_port.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman_port.c b/drivers/net/ethernet/freescale/fman/fman_port.c
index d9baac0dbc7d..354974939d9d 100644
--- a/drivers/net/ethernet/freescale/fman/fman_port.c
+++ b/drivers/net/ethernet/freescale/fman/fman_port.c
@@ -1878,10 +1878,10 @@ static int fman_port_probe(struct platform_device *of_dev)

 	of_node_put(port_node);

-	dev_res = __devm_request_region(port->dev, &res, res.start,
-					resource_size(&res), "fman-port");
+	dev_res = devm_request_mem_region(port->dev, res.start,
+					  resource_size(&res), "fman-port");
 	if (!dev_res) {
-		dev_err(port->dev, "%s: __devm_request_region() failed\n",
+		dev_err(port->dev, "%s: devm_request_mem_region() failed\n",
 			__func__);
 		err = -EINVAL;
 		goto free_port;
--
2.17.1

Hi Patrick, please send a fix for the issue, targeting the net tree,
separate from the other change you are trying to introduce. We need a
better explanation for the latter and it should go through the net-next
tree, if accepted.

Hello,

I've resent the series with a cover letter having a bit more
explanations. I think all the patches should be applied on net, as they
are linked to the same issue/resolution, and are not independent.

BR,
Patrick H.
[PATCH net 3/4] net: freescale/fman-mac: remove direct use of __devm_request_region
Since the main fman driver is no longer reserving the complete fman
memory region, it is no longer needed to use a custom call to
__devm_request_region, so replace it with devm_request_mem_region.

Signed-off-by: Patrick Havelange
---
 drivers/net/ethernet/freescale/fman/mac.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/mac.c b/drivers/net/ethernet/freescale/fman/mac.c
index 901749a7a318..35ca5aed 100644
--- a/drivers/net/ethernet/freescale/fman/mac.c
+++ b/drivers/net/ethernet/freescale/fman/mac.c
@@ -690,12 +690,10 @@ static int mac_probe(struct platform_device *_of_dev)
 		goto _return_of_get_parent;
 	}

-	mac_dev->res = __devm_request_region(dev,
-					     fman_get_mem_region(priv->fman),
-					     res.start, resource_size(&res),
-					     "mac");
+	mac_dev->res = devm_request_mem_region(dev, res.start,
+					       resource_size(&res), "mac");
 	if (!mac_dev->res) {
-		dev_err(dev, "__devm_request_mem_region(mac) failed\n");
+		dev_err(dev, "devm_request_mem_region(mac) failed\n");
 		err = -EBUSY;
 		goto _return_of_get_parent;
 	}
--
2.17.1
[PATCH net 2/4] net: freescale/fman-port: remove direct use of __devm_request_region
This driver was directly calling __devm_request_region with a specific
resource on the stack as parameter. This is invalid as
__devm_request_region expects the given resource passed as parameter to
live longer than the call itself, as the pointer to the resource will
be stored inside the internal struct used by the devres management.

In addition to this issue, a related bug has been found by kmemleak
with this trace:

unreferenced object 0xc001efc01880 (size 64):
  comm "swapper/0", pid 1, jiffies 4294669078 (age 3620.536s)
  hex dump (first 32 bytes):
    00 00 00 0f fe 4a c0 00 00 00 00 0f fe 4a cf ff  .J...J..
    c0 00 00 00 00 ee 9d 98 00 00 00 00 80 00 02 00
  backtrace:
    [] .alloc_resource+0xb8/0xe0
    [] .__request_region+0x70/0x1c4
    [] .__devm_request_region+0x8c/0x138
    [] .fman_port_probe+0x170/0x420
    [] .platform_drv_probe+0x84/0x108
    [] .driver_probe_device+0x2c4/0x394
    [] .__driver_attach+0x124/0x128
    [] .bus_for_each_dev+0xb4/0x110
    [] .driver_attach+0x34/0x4c
    [] .bus_add_driver+0x264/0x2a4
    [] .driver_register+0x94/0x160
    [] .__platform_driver_register+0x60/0x7c
    [] .fman_port_load+0x28/0x64
    [] .do_one_initcall+0xd4/0x1a8
    [] .kernel_init_freeable+0x1bc/0x2a4
    [] .kernel_init+0x24/0x138

Indeed, the new resource (created in __request_region) will be linked
to the given resource living on the stack, which will end its lifetime
after the function calling __devm_request_region has finished. Meaning
the new resource allocated is no longer reachable.

Now that the main fman driver is no longer reserving the region used by
fman-port, this previous hack is no longer needed and we can use the
regular call to devm_request_mem_region instead, solving those bugs at
the same time.

Signed-off-by: Patrick Havelange
---
 drivers/net/ethernet/freescale/fman/fman_port.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman_port.c b/drivers/net/ethernet/freescale/fman/fman_port.c
index d9baac0dbc7d..354974939d9d 100644
--- a/drivers/net/ethernet/freescale/fman/fman_port.c
+++ b/drivers/net/ethernet/freescale/fman/fman_port.c
@@ -1878,10 +1878,10 @@ static int fman_port_probe(struct platform_device *of_dev)

 	of_node_put(port_node);

-	dev_res = __devm_request_region(port->dev, &res, res.start,
-					resource_size(&res), "fman-port");
+	dev_res = devm_request_mem_region(port->dev, res.start,
+					  resource_size(&res), "fman-port");
 	if (!dev_res) {
-		dev_err(port->dev, "%s: __devm_request_region() failed\n",
+		dev_err(port->dev, "%s: devm_request_mem_region() failed\n",
 			__func__);
 		err = -EINVAL;
 		goto free_port;
--
2.17.1
[PATCH net 4/4] net: freescale/fman: remove fman_get_mem_region
This function is no longer used, so we can remove it. The pointer to
the resource that was kept inside struct fman_state_struct can also be
removed for the same reason.

Signed-off-by: Patrick Havelange
---
 drivers/net/ethernet/freescale/fman/fman.c | 17 -----------------
 drivers/net/ethernet/freescale/fman/fman.h |  2 --
 2 files changed, 19 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.c b/drivers/net/ethernet/freescale/fman/fman.c
index 2e85209d560d..bf78e12a1683 100644
--- a/drivers/net/ethernet/freescale/fman/fman.c
+++ b/drivers/net/ethernet/freescale/fman/fman.c
@@ -531,8 +531,6 @@ struct fman_state_struct {

 	u32 qman_channel_base;
 	u32 num_of_qman_channels;
-
-	struct resource *res;
 };

 /* Structure that holds FMan initial configuration */
@@ -1737,7 +1735,6 @@ static int fman_config(struct fman *fman)
 	fman->state->qman_channel_base = fman->dts_params.qman_channel_base;
 	fman->state->num_of_qman_channels =
		fman->dts_params.num_of_qman_channels;
-	fman->state->res = fman->dts_params.res;
 	fman->exception_cb = fman_exceptions;
 	fman->bus_error_cb = fman_bus_error;
 	fman->fpm_regs = fman->dts_params.base_addr_pol + FPM_OFFSET_FROM_POL;
@@ -2405,20 +2402,6 @@ u32 fman_get_qman_channel_id(struct fman *fman, u32 port_id)
 }
 EXPORT_SYMBOL(fman_get_qman_channel_id);

-/**
- * fman_get_mem_region
- * @fman: A Pointer to FMan device
- *
- * Get FMan memory region
- *
- * Return: A structure with FMan memory region information
- */
-struct resource *fman_get_mem_region(struct fman *fman)
-{
-	return fman->state->res;
-}
-EXPORT_SYMBOL(fman_get_mem_region);
-
 /* Bootargs defines */
 /* Extra headroom for RX buffers - Default, min and max */
 #define FSL_FM_RX_EXTRA_HEADROOM	64
diff --git a/drivers/net/ethernet/freescale/fman/fman.h b/drivers/net/ethernet/freescale/fman/fman.h
index e6b339c57230..e326aa37b8b2 100644
--- a/drivers/net/ethernet/freescale/fman/fman.h
+++ b/drivers/net/ethernet/freescale/fman/fman.h
@@ -398,8 +398,6 @@ int fman_set_mac_max_frame(struct fman *fman, u8 mac_id, u16 mfl);

 u32 fman_get_qman_channel_id(struct fman *fman, u32 port_id);

-struct resource *fman_get_mem_region(struct fman *fman);
-
 u16 fman_get_max_frm(void);

 int fman_get_rx_extra_headroom(void);
--
2.17.1
[PATCH net 1/4] net: freescale/fman: Split the main resource region reservation
The main fman driver is only using some parts of the fman memory
region. Split the reservation of the main region in 2, so that the
other regions that will be used by fman-port and fman-mac drivers can
be reserved properly and not be in conflict with the main fman
reservation.

Signed-off-by: Patrick Havelange
---
 drivers/net/ethernet/freescale/fman/fman.c | 103 +
 drivers/net/ethernet/freescale/fman/fman.h |   9 +-
 2 files changed, 69 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.c b/drivers/net/ethernet/freescale/fman/fman.c
index ce0a121580f6..2e85209d560d 100644
--- a/drivers/net/ethernet/freescale/fman/fman.c
+++ b/drivers/net/ethernet/freescale/fman/fman.c
@@ -58,12 +58,15 @@
 /* Modules registers offsets */
 #define BMI_OFFSET		0x0008
 #define QMI_OFFSET		0x00080400
-#define KG_OFFSET		0x000C1000
-#define DMA_OFFSET		0x000C2000
-#define FPM_OFFSET		0x000C3000
-#define IMEM_OFFSET		0x000C4000
-#define HWP_OFFSET		0x000C7000
-#define CGP_OFFSET		0x000DB000
+#define SIZE_REGION_0		0x00081000
+#define POL_OFFSET		0x000C
+#define KG_OFFSET_FROM_POL	0x1000
+#define DMA_OFFSET_FROM_POL	0x2000
+#define FPM_OFFSET_FROM_POL	0x3000
+#define IMEM_OFFSET_FROM_POL	0x4000
+#define HWP_OFFSET_FROM_POL	0x7000
+#define CGP_OFFSET_FROM_POL	0x0001B000
+#define SIZE_REGION_FROM_POL	0x0002

 /* Exceptions bit map */
 #define EX_DMA_BUS_ERROR	0x8000
@@ -1433,7 +1436,7 @@ static int clear_iram(struct fman *fman)
 	struct fman_iram_regs __iomem *iram;
 	int i, count;

-	iram = fman->base_addr + IMEM_OFFSET;
+	iram = fman->base_addr_pol + IMEM_OFFSET_FROM_POL;

 	/* Enable the auto-increment */
 	iowrite32be(IRAM_IADD_AIE, &iram->iadd);
@@ -1710,11 +1713,8 @@ static int set_num_of_open_dmas(struct fman *fman, u8 port_id,

 static int fman_config(struct fman *fman)
 {
-	void __iomem *base_addr;
 	int err;

-	base_addr = fman->dts_params.base_addr;
-
 	fman->state = kzalloc(sizeof(*fman->state), GFP_KERNEL);
 	if (!fman->state)
 		goto err_fm_state;
@@ -1740,13 +1740,14 @@ static int fman_config(struct fman *fman)
 	fman->state->res = fman->dts_params.res;
 	fman->exception_cb = fman_exceptions;
 	fman->bus_error_cb = fman_bus_error;
-	fman->fpm_regs = base_addr + FPM_OFFSET;
-	fman->bmi_regs = base_addr + BMI_OFFSET;
-	fman->qmi_regs = base_addr + QMI_OFFSET;
-	fman->dma_regs = base_addr + DMA_OFFSET;
-	fman->hwp_regs = base_addr + HWP_OFFSET;
-	fman->kg_regs = base_addr + KG_OFFSET;
-	fman->base_addr = base_addr;
+	fman->fpm_regs = fman->dts_params.base_addr_pol + FPM_OFFSET_FROM_POL;
+	fman->bmi_regs = fman->dts_params.base_addr_0 + BMI_OFFSET;
+	fman->qmi_regs = fman->dts_params.base_addr_0 + QMI_OFFSET;
+	fman->dma_regs = fman->dts_params.base_addr_pol + DMA_OFFSET_FROM_POL;
+	fman->hwp_regs = fman->dts_params.base_addr_pol + HWP_OFFSET_FROM_POL;
+	fman->kg_regs = fman->dts_params.base_addr_pol + KG_OFFSET_FROM_POL;
+	fman->base_addr_0 = fman->dts_params.base_addr_0;
+	fman->base_addr_pol = fman->dts_params.base_addr_pol;

 	spin_lock_init(&fman->spinlock);
 	fman_defconfig(fman->cfg);
@@ -1937,8 +1938,8 @@ static int fman_init(struct fman *fman)
 		fman->state->exceptions &= ~FMAN_EX_QMI_SINGLE_ECC;

 	/* clear CPG */
-	memset_io((void __iomem *)(fman->base_addr + CGP_OFFSET), 0,
-		  fman->state->fm_port_num_of_cg);
+	memset_io((void __iomem *)(fman->base_addr_pol + CGP_OFFSET_FROM_POL),
+		  0, fman->state->fm_port_num_of_cg);

 	/* Save LIODN info before FMan reset
 	 * Skipping non-existent port 0 (i = 1)
@@ -2717,13 +2718,11 @@ static struct fman *read_dts_node(struct platform_device *of_dev)
 {
 	struct fman *fman;
 	struct device_node *fm_node, *muram_node;
-	struct resource *res;
+	struct resource *tmp_res, *main_res;
 	u32 val, range[2];
 	int err, irq;
 	struct clk *clk;
 	u32 clk_rate;
-	phys_addr_t phys_base_addr;
-	resource_size_t mem_size;

 	fman = kzalloc(sizeof(*fman), GFP_KERNEL);
 	if (!fman)
@@ -2740,34 +2739,31 @@ static struct fman *read_dts_node(struct platform_device *of_dev)
 	fman->dts_params.id = (u8)val;

 	/* Get the FM interrupt */
-	res = platform_get_resource(of_dev, IORESOURCE_IRQ, 0);
-	if (!res) {
+	tmp_res = platform_get_resource(of_dev, IORESOURCE_IRQ, 0);
+	if (!tmp_res) {
		dev_err(&of_dev->dev, "%s: Can't get FMan IR
[PATCH net 0/4] freescale/fman: remove usage of __devm_request_region
Hello,

This patch series fixes a bug inside ethernet/freescale/fman/fman_port.c caused by an incorrect usage of __devm_request_region. The bug comes from the fact that the resource given as a parameter to __devm_request_region is on the stack, leading to an invalid pointer being stored inside the devres structure and the new resource pointer being lost.

In order to solve this, it is first necessary to make the regular devm_request_mem_region call work. That call cannot work as is, because the main fman driver has already reserved the memory region used by the fman_port driver. Thus the first step is to split the memory region reservation done in the main fman driver in two, so that the regions used by fman_port are no longer reserved. The main memory regions have also been reduced so as not to request the memory regions used by fman_mac. Once this first step is done, devm_request_mem_region can be used normally. This also allows a bit of code removal, done in the last patch.

1. fman: split the memory region of the main fman driver, so that the memory used by fman_port & fman_mac is not already reserved.
2. fman_port: replace the call to __devm_request_region with devm_request_mem_region.
3. fman_mac: replace the call to __devm_request_region with devm_request_mem_region.
4. The code is now simplified a bit; there is no need to store the main region anymore.

The whole point of this series is to be able to apply patch 2. Since patch 3 is a similar change, let's do it at the same time. I think they all should be part of the same series.
Patrick Havelange (4): net: freescale/fman: Split the main resource region reservation net: freescale/fman-port: remove direct use of __devm_request_region net: freescale/fman-mac: remove direct use of __devm_request_region net: freescale/fman: remove fman_get_mem_region drivers/net/ethernet/freescale/fman/fman.c| 120 +- drivers/net/ethernet/freescale/fman/fman.h| 11 +- .../net/ethernet/freescale/fman/fman_port.c | 6 +- drivers/net/ethernet/freescale/fman/mac.c | 8 +- 4 files changed, 75 insertions(+), 70 deletions(-) base-commit: 832e09798c261cf58de3a68cfcc6556408c16a5a -- 2.17.1
[PATCH 4/4] net: freescale/fman: remove fman_get_mem_region
This function is no longer used, so we can remove it. The pointer to the resource that was kept inside struct fman_state_struct can also be removed for the same reason. Signed-off-by: Patrick Havelange --- drivers/net/ethernet/freescale/fman/fman.c | 17 - drivers/net/ethernet/freescale/fman/fman.h | 2 -- 2 files changed, 19 deletions(-) diff --git a/drivers/net/ethernet/freescale/fman/fman.c b/drivers/net/ethernet/freescale/fman/fman.c index 2e85209d560d..bf78e12a1683 100644 --- a/drivers/net/ethernet/freescale/fman/fman.c +++ b/drivers/net/ethernet/freescale/fman/fman.c @@ -531,8 +531,6 @@ struct fman_state_struct { u32 qman_channel_base; u32 num_of_qman_channels; - - struct resource *res; }; /* Structure that holds FMan initial configuration */ @@ -1737,7 +1735,6 @@ static int fman_config(struct fman *fman) fman->state->qman_channel_base = fman->dts_params.qman_channel_base; fman->state->num_of_qman_channels = fman->dts_params.num_of_qman_channels; - fman->state->res = fman->dts_params.res; fman->exception_cb = fman_exceptions; fman->bus_error_cb = fman_bus_error; fman->fpm_regs = fman->dts_params.base_addr_pol + FPM_OFFSET_FROM_POL; @@ -2405,20 +2402,6 @@ u32 fman_get_qman_channel_id(struct fman *fman, u32 port_id) } EXPORT_SYMBOL(fman_get_qman_channel_id); -/** - * fman_get_mem_region - * @fman: A Pointer to FMan device - * - * Get FMan memory region - * - * Return: A structure with FMan memory region information - */ -struct resource *fman_get_mem_region(struct fman *fman) -{ - return fman->state->res; -} -EXPORT_SYMBOL(fman_get_mem_region); - /* Bootargs defines */ /* Extra headroom for RX buffers - Default, min and max */ #define FSL_FM_RX_EXTRA_HEADROOM 64 diff --git a/drivers/net/ethernet/freescale/fman/fman.h b/drivers/net/ethernet/freescale/fman/fman.h index e6b339c57230..e326aa37b8b2 100644 --- a/drivers/net/ethernet/freescale/fman/fman.h +++ b/drivers/net/ethernet/freescale/fman/fman.h @@ -398,8 +398,6 @@ int fman_set_mac_max_frame(struct fman *fman, 
u8 mac_id, u16 mfl); u32 fman_get_qman_channel_id(struct fman *fman, u32 port_id); -struct resource *fman_get_mem_region(struct fman *fman); - u16 fman_get_max_frm(void); int fman_get_rx_extra_headroom(void); -- 2.17.1
[PATCH 3/4] net: freescale/fman-mac: remove direct use of __devm_request_region
Since the main fman driver is no longer reserving the complete fman memory region, it is no longer necessary to use a custom call to __devm_request_region, so replace it with devm_request_mem_region. Signed-off-by: Patrick Havelange --- drivers/net/ethernet/freescale/fman/mac.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/freescale/fman/mac.c b/drivers/net/ethernet/freescale/fman/mac.c index 901749a7a318..35ca5aed 100644 --- a/drivers/net/ethernet/freescale/fman/mac.c +++ b/drivers/net/ethernet/freescale/fman/mac.c @@ -690,12 +690,10 @@ static int mac_probe(struct platform_device *_of_dev) goto _return_of_get_parent; } - mac_dev->res = __devm_request_region(dev, -fman_get_mem_region(priv->fman), -res.start, resource_size(&res), -"mac"); + mac_dev->res = devm_request_mem_region(dev, res.start, + resource_size(&res), "mac"); if (!mac_dev->res) { - dev_err(dev, "__devm_request_mem_region(mac) failed\n"); + dev_err(dev, "devm_request_mem_region(mac) failed\n"); err = -EBUSY; goto _return_of_get_parent; } -- 2.17.1
[PATCH 2/4] net: freescale/fman-port: remove direct use of __devm_request_region
This driver was directly calling __devm_request_region with a specific resource on the stack as parameter. This is invalid as __devm_request_region expects the given resource passed as parameter to live longer than the call itself, as the pointer to the resource will be stored inside the internal struct used by the devres management. In addition to this issue, a related bug has been found by kmemleak with this trace : unreferenced object 0xc001efc01880 (size 64): comm "swapper/0", pid 1, jiffies 4294669078 (age 3620.536s) hex dump (first 32 bytes): 00 00 00 0f fe 4a c0 00 00 00 00 0f fe 4a cf ff .J...J.. c0 00 00 00 00 ee 9d 98 00 00 00 00 80 00 02 00 backtrace: [] .alloc_resource+0xb8/0xe0 [] .__request_region+0x70/0x1c4 [] .__devm_request_region+0x8c/0x138 [] .fman_port_probe+0x170/0x420 [] .platform_drv_probe+0x84/0x108 [] .driver_probe_device+0x2c4/0x394 [] .__driver_attach+0x124/0x128 [] .bus_for_each_dev+0xb4/0x110 [] .driver_attach+0x34/0x4c [] .bus_add_driver+0x264/0x2a4 [] .driver_register+0x94/0x160 [] .__platform_driver_register+0x60/0x7c [] .fman_port_load+0x28/0x64 [] .do_one_initcall+0xd4/0x1a8 [] .kernel_init_freeable+0x1bc/0x2a4 [] .kernel_init+0x24/0x138 Indeed, the new resource (created in __request_region) will be linked to the given resource living on the stack, which will end its lifetime after the function calling __devm_request_region has finished. Meaning the new resource allocated is no longer reachable. Now that the main fman driver is no longer reserving the region used by fman-port, this previous hack is no longer needed and we can use the regular call to devm_request_mem_region instead, solving those bugs at the same time. 
Signed-off-by: Patrick Havelange --- drivers/net/ethernet/freescale/fman/fman_port.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/freescale/fman/fman_port.c b/drivers/net/ethernet/freescale/fman/fman_port.c index d9baac0dbc7d..354974939d9d 100644 --- a/drivers/net/ethernet/freescale/fman/fman_port.c +++ b/drivers/net/ethernet/freescale/fman/fman_port.c @@ -1878,10 +1878,10 @@ static int fman_port_probe(struct platform_device *of_dev) of_node_put(port_node); - dev_res = __devm_request_region(port->dev, &res, res.start, - resource_size(&res), "fman-port"); + dev_res = devm_request_mem_region(port->dev, res.start, + resource_size(&res), "fman-port"); if (!dev_res) { - dev_err(port->dev, "%s: __devm_request_region() failed\n", + dev_err(port->dev, "%s: devm_request_mem_region() failed\n", __func__); err = -EINVAL; goto free_port; -- 2.17.1
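The lifetime problem described in this patch — a long-lived structure keeping a pointer to a caller's stack object — can be reduced to a minimal user-space C sketch. The `registry`/`register_region` names are illustrative only, not the real devres API:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct resource. */
struct resource {
	unsigned long start;
	unsigned long end;
};

/* Long-lived registry, standing in for the devres list that
 * __devm_request_region appended to. */
static struct resource *registry[8];
static size_t registry_len;

/* Stores the caller's pointer, like __devm_request_region stored the
 * parent resource pointer it was handed. */
static void register_region(struct resource *parent)
{
	registry[registry_len++] = parent;
}

/* BROKEN pattern (what fman_port_probe did): the resource lives on the
 * stack, so the stored pointer dangles as soon as we return. */
static void probe_broken(void)
{
	struct resource res = { 0x1000, 0x1fff };

	register_region(&res);	/* dangling pointer once we return */
}

/* FIXED pattern: the resource outlives the probe function, so the
 * stored pointer stays valid. devm_request_mem_region achieves the
 * same by allocating the resource itself rather than borrowing the
 * caller's. */
static struct resource persistent = { 0x2000, 0x2fff };

static void probe_fixed(void)
{
	register_region(&persistent);
}
```

The kmemleak report quoted in the commit message is the other half of the same problem: the child resource allocated by __request_region stayed linked to the dead stack object and so became unreachable.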
[PATCH 1/4] net: freescale/fman: Split the main resource region reservation
The main fman driver is only using some parts of the fman memory region. Split the reservation of the main region in 2, so that the other regions that will be used by fman-port and fman-mac drivers can be reserved properly and not be in conflict with the main fman reservation. Signed-off-by: Patrick Havelange --- drivers/net/ethernet/freescale/fman/fman.c | 103 + drivers/net/ethernet/freescale/fman/fman.h | 9 +- 2 files changed, 69 insertions(+), 43 deletions(-) diff --git a/drivers/net/ethernet/freescale/fman/fman.c b/drivers/net/ethernet/freescale/fman/fman.c index ce0a121580f6..2e85209d560d 100644 --- a/drivers/net/ethernet/freescale/fman/fman.c +++ b/drivers/net/ethernet/freescale/fman/fman.c @@ -58,12 +58,15 @@ /* Modules registers offsets */ #define BMI_OFFSET 0x0008 #define QMI_OFFSET 0x00080400 -#define KG_OFFSET 0x000C1000 -#define DMA_OFFSET 0x000C2000 -#define FPM_OFFSET 0x000C3000 -#define IMEM_OFFSET0x000C4000 -#define HWP_OFFSET 0x000C7000 -#define CGP_OFFSET 0x000DB000 +#define SIZE_REGION_0 0x00081000 +#define POL_OFFSET 0x000C +#define KG_OFFSET_FROM_POL 0x1000 +#define DMA_OFFSET_FROM_POL0x2000 +#define FPM_OFFSET_FROM_POL0x3000 +#define IMEM_OFFSET_FROM_POL 0x4000 +#define HWP_OFFSET_FROM_POL0x7000 +#define CGP_OFFSET_FROM_POL0x0001B000 +#define SIZE_REGION_FROM_POL 0x0002 /* Exceptions bit map */ #define EX_DMA_BUS_ERROR 0x8000 @@ -1433,7 +1436,7 @@ static int clear_iram(struct fman *fman) struct fman_iram_regs __iomem *iram; int i, count; - iram = fman->base_addr + IMEM_OFFSET; + iram = fman->base_addr_pol + IMEM_OFFSET_FROM_POL; /* Enable the auto-increment */ iowrite32be(IRAM_IADD_AIE, >iadd); @@ -1710,11 +1713,8 @@ static int set_num_of_open_dmas(struct fman *fman, u8 port_id, static int fman_config(struct fman *fman) { - void __iomem *base_addr; int err; - base_addr = fman->dts_params.base_addr; - fman->state = kzalloc(sizeof(*fman->state), GFP_KERNEL); if (!fman->state) goto err_fm_state; @@ -1740,13 +1740,14 @@ static int 
fman_config(struct fman *fman) fman->state->res = fman->dts_params.res; fman->exception_cb = fman_exceptions; fman->bus_error_cb = fman_bus_error; - fman->fpm_regs = base_addr + FPM_OFFSET; - fman->bmi_regs = base_addr + BMI_OFFSET; - fman->qmi_regs = base_addr + QMI_OFFSET; - fman->dma_regs = base_addr + DMA_OFFSET; - fman->hwp_regs = base_addr + HWP_OFFSET; - fman->kg_regs = base_addr + KG_OFFSET; - fman->base_addr = base_addr; + fman->fpm_regs = fman->dts_params.base_addr_pol + FPM_OFFSET_FROM_POL; + fman->bmi_regs = fman->dts_params.base_addr_0 + BMI_OFFSET; + fman->qmi_regs = fman->dts_params.base_addr_0 + QMI_OFFSET; + fman->dma_regs = fman->dts_params.base_addr_pol + DMA_OFFSET_FROM_POL; + fman->hwp_regs = fman->dts_params.base_addr_pol + HWP_OFFSET_FROM_POL; + fman->kg_regs = fman->dts_params.base_addr_pol + KG_OFFSET_FROM_POL; + fman->base_addr_0 = fman->dts_params.base_addr_0; + fman->base_addr_pol = fman->dts_params.base_addr_pol; spin_lock_init(>spinlock); fman_defconfig(fman->cfg); @@ -1937,8 +1938,8 @@ static int fman_init(struct fman *fman) fman->state->exceptions &= ~FMAN_EX_QMI_SINGLE_ECC; /* clear CPG */ - memset_io((void __iomem *)(fman->base_addr + CGP_OFFSET), 0, - fman->state->fm_port_num_of_cg); + memset_io((void __iomem *)(fman->base_addr_pol + CGP_OFFSET_FROM_POL), + 0, fman->state->fm_port_num_of_cg); /* Save LIODN info before FMan reset * Skipping non-existent port 0 (i = 1) @@ -2717,13 +2718,11 @@ static struct fman *read_dts_node(struct platform_device *of_dev) { struct fman *fman; struct device_node *fm_node, *muram_node; - struct resource *res; + struct resource *tmp_res, *main_res; u32 val, range[2]; int err, irq; struct clk *clk; u32 clk_rate; - phys_addr_t phys_base_addr; - resource_size_t mem_size; fman = kzalloc(sizeof(*fman), GFP_KERNEL); if (!fman) @@ -2740,34 +2739,31 @@ static struct fman *read_dts_node(struct platform_device *of_dev) fman->dts_params.id = (u8)val; /* Get the FM interrupt */ - res = 
platform_get_resource(of_dev, IORESOURCE_IRQ, 0); - if (!res) { + tmp_res = platform_get_resource(of_dev, IORESOURCE_IRQ, 0); + if (!tmp_res) { dev_err(&of_dev->dev, "%s: Can't get FMan IR
Re: [PATCH 0/4] ARM: dts: aspeed: Add Facebook Galaxy100 BMC
On Wed, Nov 11, 2020 at 11:34:10PM +, Joel Stanley wrote: > On Wed, 11 Nov 2020 at 23:23, wrote: > > > > From: Tao Ren > > > > The patch series adds the initial version of device tree for Facebook > > Galaxy100 (AST2400) BMC. > > > > Patch #1 adds common dtsi to minimize duplicated device entries across > > Facebook Network AST2400 BMC device trees. > > > > Patch #2 simplifies Wedge40 device tree by using the common dtsi. > > > > Patch #3 simplifies Wedge100 device tree by using the common dtsi. > > > > Patch #4 adds the initial version of device tree for Facebook Galaxy100 > > BMC. > Nice. They look good to me. > > Reviewed-by: Joel Stanley > > Is there another person familiar with the design you would like to > review before I merge? Also, Reviewed-by: Patrick Williams -- Patrick Williams
Re: [PATCH v5 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp
Hi Yun, thanks for keep improving this. I'm replying here but still considering all other reviewers comments. Best, Patrick On Tue, Nov 03, 2020 at 03:37:56 +0100, Yun Hsiang wrote... > If the user wants to stop controlling uclamp and let the task inherit > the value from the group, we need a method to reset. > > Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via > sched_setattr syscall. > > The policy is > _CLAMP_RESET => reset both min and max > _CLAMP_RESET | _CLAMP_MIN => reset min value > _CLAMP_RESET | _CLAMP_MAX => reset max value > _CLAMP_RESET | _CLAMP_MIN | _CLAMP_MAX => reset both min and max This documentation should be added to the uapi header and, most importantly, in: include/uapi/linux/sched/types.h where the documentation for struct sched_attr is provided. > Signed-off-by: Yun Hsiang > Reported-by: kernel test robot > --- > include/uapi/linux/sched.h | 7 +++-- > kernel/sched/core.c| 59 -- > 2 files changed, 49 insertions(+), 17 deletions(-) > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h > index 3bac0a8ceab2..6c823ddb1a1e 100644 > --- a/include/uapi/linux/sched.h > +++ b/include/uapi/linux/sched.h > @@ -132,17 +132,20 @@ struct clone_args { > #define SCHED_FLAG_KEEP_PARAMS 0x10 > #define SCHED_FLAG_UTIL_CLAMP_MIN0x20 > #define SCHED_FLAG_UTIL_CLAMP_MAX0x40 > +#define SCHED_FLAG_UTIL_CLAMP_RESET 0x80 > > #define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \ >SCHED_FLAG_KEEP_PARAMS) > (Related to the following discussion point) What about adding in a comment here to call out that the following definitions are "internal only"? Moreover, we could probably wrap the following two define within an #ifdef __KERNEL__/#endif block Something like: + /* + * The following definitions are internal only, do not use them to set + * set_{set,get}attr() params. Use instead a valid combination of the + * flags defined above. 
+ */ + #ifdef __KERNEL__ > #define SCHED_FLAG_UTIL_CLAMP(SCHED_FLAG_UTIL_CLAMP_MIN | \ > - SCHED_FLAG_UTIL_CLAMP_MAX) > + SCHED_FLAG_UTIL_CLAMP_MAX | \ > + SCHED_FLAG_UTIL_CLAMP_RESET) We need the _RESET flag only here (not below), since this is a UCLAMP feature and it's worth/useful to have a single "all uclamp flags" definition... > #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \ >SCHED_FLAG_RECLAIM | \ >SCHED_FLAG_DL_OVERRUN | \ >SCHED_FLAG_KEEP_ALL| \ > - SCHED_FLAG_UTIL_CLAMP) > + SCHED_FLAG_UTIL_CLAMP | \ > + SCHED_FLAG_UTIL_CLAMP_RESET) ... i.e., you can drop the chunk above. + #endif /* __KERNEL__ */ Regarding Qais comment, I had the same Dietmar's thought: I doubt there are apps using _FLAGS_ALL from userspace. For DL tasks, since they cannot fork, it makes no sense to specify, for example _RESET_ON_FORK|_RECLAIM. For CFS/RT tasks, where UCLAMP is supported, it makes no sense to specify DL flags. It's true however that having this def here when it's supposed to be used only internally, can be kind of "confusing", but it's also useful to keep the definition aligned with the flags defined above. The ifdef wrapping proposed above should make this even more explicit. Perhaps we can also better call this out also with an additional note right after: https://elixir.bootlin.com/linux/v5.9.6/source/include/uapi/linux/sched/types.h#L43 In that file, I believe the "Task Utilization Attributes" section can also be improved by adding a description of the _UCLAMP flags semantics. 
> #endif /* _UAPI_LINUX_SCHED_H */ > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 8160ab5263f8..6ae463b64834 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1004,7 +1004,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum > uclamp_id clamp_id, > return uclamp_idle_value(rq, clamp_id, clamp_value); > } > > -static void __uclamp_update_util_min_rt_default(struct task_struct *p) > +static inline void __uclamp_update_util_min_rt_default(struct task_struct *p) > { Again, IMO, this is _not_ an unrelated change at all. Actually, I still would like to do one step more and inline this function in the _only place_ where it's used. Qais arguments for not doing that where [1]: Updating the default rt value is done from different contexts. Hence it is important to document the rules under which this update must
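The reset policy listed in the commit message at the top of this patch boils down to a small predicate. Below is a stand-alone user-space sketch of that decode; the flag and enum values are copied from the quoted (proposed, not yet mainline) uapi change:

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
#define SCHED_FLAG_UTIL_CLAMP_RESET	0x80	/* proposed by this patch */

enum uclamp_id { UCLAMP_MIN, UCLAMP_MAX };

/* Returns true if @clamp_id must be reset for the given @flags,
 * following the policy in the commit message. */
static bool uclamp_reset(enum uclamp_id clamp_id, unsigned long flags)
{
	if (!(flags & SCHED_FLAG_UTIL_CLAMP_RESET))
		return false;

	/* _RESET alone: reset both clamps. */
	if (!(flags & (SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX)))
		return true;

	/* _RESET | _MIN: reset (at least) the min clamp. */
	if ((flags & SCHED_FLAG_UTIL_CLAMP_MIN) && clamp_id == UCLAMP_MIN)
		return true;

	/* _RESET | _MAX: reset (at least) the max clamp. */
	if ((flags & SCHED_FLAG_UTIL_CLAMP_MAX) && clamp_id == UCLAMP_MAX)
		return true;

	return false;
}
```

Encoded this way, passing _RESET alone resets both clamps, and _RESET|_MIN|_MAX also resets both, matching the four-row table in the commit message.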
From the kindness of our heart!! You have been selected to receive (£1,000,000.00 British Pound)
Dear Beneficiary, You have been selected to receive (£1,000,000.00 British Pound) as charity donation/grant from Patrick and Frances Connolly. Therefore, you are required to contact her through email for more details patrickcon2...@gmail.com -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus
Re: [PATCH v3 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp
Hi Dietmar, Yun, I hope I'm not too late before v4 posting ;) I think the overall approach is sound, I just added in a couple of cleanups and a possible fix (user_defined reset). Best, Patrick On Tue, Oct 27, 2020 at 16:58:13 +0100, Yun Hsiang wrote... > Hi Diet mar, > On Mon, Oct 26, 2020 at 08:00:48PM +0100, Dietmar Eggemann wrote: >> On 26/10/2020 16:45, Yun Hsiang wrote: [...] >> I thought about something like this. Only lightly tested. >> >> ---8<--- >> >> From: Dietmar Eggemann >> Date: Mon, 26 Oct 2020 13:52:23 +0100 >> Subject: [PATCH] SCHED_FLAG_UTIL_CLAMP_RESET >> >> Signed-off-by: Dietmar Eggemann >> --- >> include/uapi/linux/sched.h | 4 +++- >> kernel/sched/core.c| 31 +++ >> 2 files changed, 30 insertions(+), 5 deletions(-) >> >> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h >> index 3bac0a8ceab2..0dd890822751 100644 >> --- a/include/uapi/linux/sched.h >> +++ b/include/uapi/linux/sched.h >> @@ -132,12 +132,14 @@ struct clone_args { >> #define SCHED_FLAG_KEEP_PARAMS 0x10 >> #define SCHED_FLAG_UTIL_CLAMP_MIN 0x20 >> #define SCHED_FLAG_UTIL_CLAMP_MAX 0x40 >> +#define SCHED_FLAG_UTIL_CLAMP_RESET 0x80 >> >> #define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \ >> SCHED_FLAG_KEEP_PARAMS) >> >> #define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \ >> - SCHED_FLAG_UTIL_CLAMP_MAX) >> + SCHED_FLAG_UTIL_CLAMP_MAX | \ >> + SCHED_FLAG_UTIL_CLAMP_RESET) >> >> #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \ >> SCHED_FLAG_RECLAIM | \ >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index 3dc415f58bd7..717b1cf5cf1f 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -1438,6 +1438,23 @@ static int uclamp_validate(struct task_struct *p, >> return 0; >> } >> >> +static bool uclamp_reset(enum uclamp_id clamp_id, unsigned long flags) >> +{ Maybe we can add in some comments? 
/* No _UCLAMP_RESET flag set: do not reset */ >> +if (!(flags & SCHED_FLAG_UTIL_CLAMP_RESET)) >> +return false; >> + /* Only _UCLAMP_RESET flag set: reset both clamps */ >> +if (!(flags & (SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX))) >> +return true; >> + /* Both _UCLAMP_RESET and _UCLAMP_MIN flags are set: reset only min */ >> +if ((flags & SCHED_FLAG_UTIL_CLAMP_MIN) && clamp_id == UCLAMP_MIN) >> +return true; >> + /* Both _UCLAMP_RESET and _UCLAMP_MAX flags are set: reset only max */ >> +if ((flags & SCHED_FLAG_UTIL_CLAMP_MAX) && clamp_id == UCLAMP_MAX) >> +return true; Since the evaluation ordering is important, do we have to better _always_ use a READ_ONCE() for all flags accesses above, to ensure it is preserved? >> + >> +return false; >> +} >> + >> static void __setscheduler_uclamp(struct task_struct *p, >>const struct sched_attr *attr) >> { >> @@ -1449,24 +1466,30 @@ static void __setscheduler_uclamp(struct task_struct >> *p, >> */ Perhaps we should update the comment above this loop with something like: /* * Reset to default clamps on forced _UCLAMP_RESET (always) and * for tasks without a task-specific value (on scheduling class change). */ >> for_each_clamp_id(clamp_id) { >> struct uclamp_se *uc_se = >uclamp_req[clamp_id]; >> +unsigned int value; >> >> /* Keep using defined clamps across class changes */ >> -if (uc_se->user_defined) >> +if (!uclamp_reset(clamp_id, attr->sched_flags) && >> +uc_se->user_defined) { >> continue; >> +} I think we miss to reset the user_defined flag here. What about replacing the above chunk with: if (uclamp_reset(clamp_id, attr->sched_flags)) uc_se->user_defined = false; if (uc-se->user_defined) continue; ? >> >> /* >> * RT by default have a 100% boost value that could be modified >> * at runtime. >> */ >> if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) >&g
Re: [PATCH v3 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp
On Wed, Oct 28, 2020 at 12:39:43 +0100, Qais Yousef wrote... > On 10/28/20 11:11, Patrick Bellasi wrote: >> >> >> >> /* >> >>* RT by default have a 100% boost value that could be modified >> >>* at runtime. >> >>*/ >> >> if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) >> >> - __uclamp_update_util_min_rt_default(p); >> >> + value = sysctl_sched_uclamp_util_min_rt_default; >> >> By removing this usage of __uclamp_update_util_min_rt_default(p), >> the only other usage remaining is the call from: >>uclamp_update_util_min_rt_default(). >> >> What about an additional cleanup by in-lining the only surviving usage? > > This is not a cleanup IMO. There are special rules about updating that are > encoded and documented in this helper function. Namely: > > * p->pi_lock must be held. > * p->uclamp_req[].user_defined must be false. Both these conditions are satisfied in the above call site: - user_defined is tested just 4 lines above - pi_lock is taken by the caller, i.e. __sched_setscheduler() Thus, there is no need to test them twice. Moreover, the same granted pi_lock you check in __uclamp_update_util_min_rt_default() is not checked at all in the rest of __setscheduler_uclamp(). Thus, perhaps we should have just avoided adding __uclamp_update_util_min_rt_default() since the beginning and: - have all its logic in the _only_ place where it's required - added the lockdep_assert_held() in __setscheduler_uclamp() That's why I consider this a very good cleanup opportunity. > I don't see how open coding helps but rather makes the code harder to read and > prone to introduce bugs if anything gets reshuffled in the future. It's not open coding IMHO, it's just adding the code that's required.
Good News!!
You have a donation of £2,000,000.00 GBP from Frances & Patrick Connolly and Family. Reply to this email: (francespatrickconnol...@gmail.com) for more details. Please Note: This is the third time we have been trying to contact you. Best Regards Mrs. Mary Hamilton.
[PATCH] ARM: dts: stm32: reorder spi4 within stm32mp15-pinctrl
Move spi4 at the right alphabetical place within stm32mp15-pinctrl Fixes: 4fe663890ac5 ("ARM: dts: stm32: Fix spi4 pins in stm32mp15-pinctrl") Signed-off-by: Patrick Delaunay --- arch/arm/boot/dts/stm32mp15-pinctrl.dtsi | 28 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi index d84686e00370..c9e514165672 100644 --- a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi +++ b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi @@ -1591,6 +1591,20 @@ }; }; + spi4_pins_a: spi4-0 { + pins { + pinmux = , /* SPI4_SCK */ +; /* SPI4_MOSI */ + bias-disable; + drive-push-pull; + slew-rate = <1>; + }; + pins2 { + pinmux = ; /* SPI4_MISO */ + bias-disable; + }; + }; + uart4_pins_a: uart4-0 { pins1 { pinmux = ; /* UART4_TX */ @@ -1726,20 +1740,6 @@ }; }; - spi4_pins_a: spi4-0 { - pins { - pinmux = , /* SPI4_SCK */ -; /* SPI4_MOSI */ - bias-disable; - drive-push-pull; - slew-rate = <1>; - }; - pins2 { - pinmux = ; /* SPI4_MISO */ - bias-disable; - }; - }; - usart2_pins_a: usart2-0 { pins1 { pinmux = , /* USART2_TX */ -- 2.17.1
Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp
On Tue, Oct 13, 2020 at 22:25:48 +0200, Dietmar Eggemann wrote... Hi Dietmar, > Hi Yun, > > On 12/10/2020 18:31, Yun Hsiang wrote: >> If the user wants to stop controlling uclamp and let the task inherit >> the value from the group, we need a method to reset. >> >> Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via >> sched_setattr syscall. > > before we decide on how to implement the 'uclamp user_defined reset' > feature, could we come back to your use case in > https://lkml.kernel.org/r/20201002053812.GA176142@ubuntu ? > > Lets just consider uclamp min for now. We have: > > (1) system-wide: > > # cat /proc/sys/kernel/sched_util_clamp_min > > 1024 > > (2) tg (hierarchy) with top-app's cpu.uclamp.min to ~200 (20% of 1024): > > # cat /sys/fs/cgroup/cpu/top-app/cpu.uclamp.min > 20 > > (3) and 2 cfs tasks A and B in top-app: > > # cat /sys/fs/cgroup/cpu/top-app/tasks > > pid_A > pid_B > > Then you set A and B's uclamp min to 100. A and B are now user_defined. > A and B's effective uclamp min value is 100. > > Since the task uclamp min values (3) are less than (1) and (2), their > uclamp min value is not affected by (1) or (2). > > If A doesn't want to control itself anymore, it can set its uclamp min > to e.g. 300. Now A's effective uclamp min value is ~200, i.e. controlled > by (2), the one of B stays 100. > > So the policy is: > > (a) If the user_defined task wants to control it's uclamp, use task > uclamp value less than the tg (hierarchy) (and the system-wide) > value. > > (b) If the user_defined task doesn't want to control it's uclamp > anymore, use a uclamp value greater than or equal the tg (hierarchy) > (and the system-wide) value. > > So where exactly is the use case which would require a 'uclamp > user_defined reset' functionality? Not sure what's the specific use-case Yun is after, but I have at least one in my mind. Let say a task does not need boost at all, independently from the cgroup it's configured to run into. 
We can go on and set its task-specific value to util_min=0. In this case, when the task is running alone on a CPU, it will always get the minimum OPP, independently of its utilization. Now, after a while (e.g. some special event happens) we want to relax this constraint and allow the task to run:
1. at whatever OPP is required by its utilization
2. with any additional boost possibly enforced by its cgroup
Right now we have only cumbersome or hacky solutions:
a) go check the current cgroup util_min value and set for the task something higher than that
b) set task::util_min=1024, thus asking for the maximum possible boost
Solution a) is more code for userspace and it's also racy. Solution b) is misleading since the task does not really want to run at 1024. It's also potentially overkill in case the task is moved to the root group, which is normally unbounded; the task would then always be executed at the max OPP without any specific reason why. A simple _UCLAMP_RESET flag will allow user-space to easily switch a task to the default behavior (follow utilization or recommended boosts), which is what a task usually gets when it does not opt in to uclamp. Looking forward to seeing if Yun has an even more specific use-case.
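Dietmar's worked example earlier in this thread (task min 100 vs. tg ~200 gives effective 100; task min 300 gives effective ~200) follows from the task's requested min being capped by the group and system values. A deliberately simplified model of that computation — illustrative only, the kernel's uclamp_tg_restrict()/uclamp_eff_get() logic handles more cases:

```c
#include <assert.h>

/* Effective UCLAMP_MIN: the task's request, capped by the task group's
 * cpu.uclamp.min and by the system-wide sched_util_clamp_min.
 * Simplified model for illustration only. */
static unsigned int effective_util_min(unsigned int task_req,
				       unsigned int tg_min,
				       unsigned int sys_min)
{
	unsigned int eff = task_req;

	if (eff > tg_min)
		eff = tg_min;	/* capped by the cgroup */
	if (eff > sys_min)
		eff = sys_min;	/* capped by the system-wide limit */
	return eff;
}
```

This also shows why solution b) above is only a workaround: task::util_min=1024 collapses to whatever the group allows, but in the unbounded root group it pins the task to the maximum for no reason.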
Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp
On Tue, Oct 13, 2020 at 15:32:46 +0200, Qais Yousef wrote... > On 10/13/20 13:46, Patrick Bellasi wrote: >> > So IMO you just need a single SCHED_FLAG_UTIL_CLAMP_RESET that if set in >> > the >> > attr, you just execute that loop in __setscheduler_uclamp() + reset >> > uc_se->user_defined. >> > >> > It should be invalid to pass the SCHED_FLAG_UTIL_CLAMP_RESET with >> > SCHED_FLAG_UTIL_CLAMP_MIN/MAX. Both have contradictory meaning IMO. >> > If user passes both we should return an EINVAL error. >> >> Passing in _CLAMP_RESET|_CLAMP_MIN will mean reset the min value while >> keeping the max at whatever it is. I think there could be cases where >> this support could be on hand. > > I am not convinced personally. I'm anxious about what this fine grained > control > means and how it should be used. I think less is more in this case and we can > always relax the restriction (appropriately) later if it's *really* required. > > Particularly the fact that this user_defined is per uclamp_se and that it > affects the cgroup behavior is implementation details this API shouldn't rely > on. The user_defined flag is an implementation details: true, but since the beginning uclamp _always_ allowed a task to set only one of its clamp values. That's why we have UTIL_CLAMP_{MIN,MAX} as separate flags and all the logic in place to set only one of the two. > A generic RESET my uclamp settings makes more sense for me as a long term > API to maintain. > > Actually maybe we should even go for a more explicit > SCHED_FLAG_UTIL_CLAMP_INHERIT_CGROUP flag instead. If we decide to abandon the > support for this feature in the future, at least we can make it return an > error > without affecting other functionality because of the implicit nature of > SCHED_FLAG_UTIL_CLAMP_RESET means inherit cgroup value too. That's not true and it's an even worst implementation detail what you want to expose. A task without a task specific clamp _always_ inherits the system defaults. 
Resetting a task-specific clamp already makes sense also _without_ cgroups. It means: just do whatever the system allows you to do. Only if you are running with CGroups enabled, and the task happens _not_ to be in the root group, does the "CGroups inheritance" happen. But that's exactly an internal detail a task should not care about. > That being said, I am not strongly against the fine grained approach if that's > what Yun wants now or what you both prefer. It's not a fine-grained approach, it's just adding a reset mechanism for what uclamp already allows you to do: setting min and max clamps independently. Regarding use cases, I also believe we have many more use cases of tasks interested in setting/resetting just one clamp than tasks interested in "fine grain" controlling both clamps at the same time. > I just think the name of the flag needs to change to be more explicit > too then. I don't agree on that and, again, I see much more fine-grained detail and internals exposure in what you propose compared to a single generic _RESET flag. > It'd be good to hear what others think. I agree on that ;)
Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp
On Tue, Oct 13, 2020 at 12:29:51 +0200, Qais Yousef wrote... > On 10/13/20 10:21, Patrick Bellasi wrote: >> [...] >> > +#define SCHED_FLAG_UTIL_CLAMP_RESET (SCHED_FLAG_UTIL_CLAMP_RESET_MIN | \ >> > + SCHED_FLAG_UTIL_CLAMP_RESET_MAX) >> > + >> > #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \ >> > SCHED_FLAG_RECLAIM | \ >> > SCHED_FLAG_DL_OVERRUN | \ >> > SCHED_FLAG_KEEP_ALL | \ >> > - SCHED_FLAG_UTIL_CLAMP) >> > + SCHED_FLAG_UTIL_CLAMP | \ >> > + SCHED_FLAG_UTIL_CLAMP_RESET) >> >> >> ... and use it in conjunction with the existing _CLAMP_{MIN,MAX} to know >> which clamp should be reset? > > I think the RESET should restore *both* MIN and MAX and reset the user_defined > flag. Since the latter is the main purpose of this interface, I don't think > you > can reset the user_defined flag without resetting both MIN and MAX to > uclamp_none[UCLAMP_MIN/MAX]. We can certainly set one clamp and not the other, and indeed the user_defined flag is per-clamp_id, thus we can reset one clamp while still keeping the other one user-defined. >> > #endif /* _UAPI_LINUX_SCHED_H */ >> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> > index 9a2fbf98fd6f..ed4cb412dde7 100644 >> > --- a/kernel/sched/core.c >> > +++ b/kernel/sched/core.c >> > @@ -1207,15 +1207,22 @@ static void __setscheduler_uclamp(struct >> > task_struct *p, >> > uclamp_se_set(uc_se, clamp_value, false); >> > } >> > >> > - if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP))) >> > + if (likely(!(attr->sched_flags & >> > + (SCHED_FLAG_UTIL_CLAMP | SCHED_FLAG_UTIL_CLAMP_RESET)))) >> > return; >> >> This check will not be changed, while we will have to add a bypass in >> uclamp_validate().
>> >> > >> > - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { >> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET_MIN) { >> > + uclamp_se_set(&p->uclamp_req[UCLAMP_MIN], >> > + 0, false); >> > + } else if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { >> > uclamp_se_set(&p->uclamp_req[UCLAMP_MIN], >> > attr->sched_util_min, true); >> > } >> > >> >> These checks also will have to be updated to check _RESET and >> _{MIN,MAX} combinations. >> >> A bonus point would be to make it possible to pass in just the _RESET flag if >> we want to reset both clamps. IOW, passing in _RESET only should be >> consumed as if we passed in _RESET|_MIN|_MAX. >> >> Caveat, RT tasks have got a special 'reset value' for _MIN. >> We should check and ensure __uclamp_update_util_min_rt_default() is >> properly called for those tasks, which likely will require some >> additional refactoring :/ > > Hmm I am probably missing something. But if the SCHED_FLAG_UTIL_CLAMP_RESET is > set, just reset uc_se->user_defined in the loop in __setscheduler_uclamp(). > This should take care of doing the reset properly then. Including for > RT tasks. Yes and no. Yes because in principle we can just reset the flag for a clamp_id without updating the request values, as it is done by the snippets above, and the internals should work. However, we will end up reporting the old values when reading from user-space. We'd better check all those reporting code paths or... just update the requested values as Yun is proposing above. I like Yun's approach better, so that we keep internal data structures aligned with features. > So IMO you just need a single SCHED_FLAG_UTIL_CLAMP_RESET that if set in the > attr, you just execute that loop in __setscheduler_uclamp() + reset > uc_se->user_defined. > > It should be invalid to pass the SCHED_FLAG_UTIL_CLAMP_RESET with > SCHED_FLAG_UTIL_CLAMP_MIN/MAX. Both have contradictory meaning IMO. > If user passes both we should return an EINVAL error.
Passing in _CLAMP_RESET|_CLAMP_MIN will mean reset the min value while keeping the max at whatever it is. I think there could be cases where this support could be handy. However, as in my previous email, by passing in only _CLAMP_RESET, we should go and reset both.
Re: [PATCH v2] sched/features: Fix !CONFIG_JUMP_LABEL case
On Tue, Oct 13, 2020 at 07:31:14 +0200, Juri Lelli wrote... > Commit 765cc3a4b224e ("sched/core: Optimize sched_feat() for > !CONFIG_SCHED_DEBUG builds") made sched features static for > !CONFIG_SCHED_DEBUG configurations, but overlooked the CONFIG_SCHED_DEBUG enabled and !CONFIG_JUMP_LABEL cases. For the latter echoing > changes to /sys/kernel/debug/sched_features has the nasty effect of > effectively changing what sched_features reports, but without actually > changing the scheduler behaviour (since different translation units get > different sysctl_sched_features). Oops, yes, I think I missed properly checking that config :/ Good spot! > Fix CONFIG_SCHED_DEBUG and !CONFIG_JUMP_LABEL configurations by properly > restructuring ifdefs. > > Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for > !CONFIG_SCHED_DEBUG builds") > Co-developed-by: Daniel Bristot de Oliveira > Signed-off-by: Daniel Bristot de Oliveira > Signed-off-by: Juri Lelli (did you get some wrong formatting for the changelog above?)
> --- > v1->v2 > - use CONFIG_JUMP_LABEL (and not the old HAVE_JUMP_LABEL) [Valentin] > --- > kernel/sched/core.c | 2 +- > kernel/sched/sched.h | 13 ++--- > 2 files changed, 11 insertions(+), 4 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 3dc415f58bd7..a7949e3ed7e7 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -44,7 +44,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp); > > DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); > > -#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL) > +#ifdef CONFIG_SCHED_DEBUG > /* > * Debugging: various feature bits > * > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 28709f6b0975..8d1ca65db3b0 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1629,7 +1629,7 @@ enum { > > #undef SCHED_FEAT > > -#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL) > +#ifdef CONFIG_SCHED_DEBUG > > /* > * To support run-time toggling of sched features, all the translation units > @@ -1637,6 +1637,7 @@ enum { > */ > extern const_debug unsigned int sysctl_sched_features; > > +#ifdef CONFIG_JUMP_LABEL > #define SCHED_FEAT(name, enabled)\ > static __always_inline bool static_branch_##name(struct static_key *key) \ > {\ > @@ -1649,7 +1650,13 @@ static __always_inline bool > static_branch_##name(struct static_key *key) \ > extern struct static_key sched_feat_keys[__SCHED_FEAT_NR]; > #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x])) > > -#else /* !(SCHED_DEBUG && CONFIG_JUMP_LABEL) */ > +#else /* !CONFIG_JUMP_LABEL */ > + > +#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x)) > + > +#endif /* CONFIG_JUMP_LABEL */ > + > +#else /* !SCHED_DEBUG */ > > /* > * Each translation unit has its own copy of sysctl_sched_features to allow > @@ -1665,7 +1672,7 @@ static const_debug __maybe_unused unsigned int > sysctl_sched_features = > > #define sched_feat(x) !!(sysctl_sched_features & (1UL <<
__SCHED_FEAT_##x)) > > -#endif /* SCHED_DEBUG && CONFIG_JUMP_LABEL */ > +#endif /* SCHED_DEBUG */ > > extern struct static_key_false sched_numa_balancing; > extern struct static_key_false sched_schedstats;
Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp
Hi Yun, thanks for sharing this new implementation. On Mon, Oct 12, 2020 at 18:31:40 +0200, Yun Hsiang wrote... > If the user wants to stop controlling uclamp and let the task inherit > the value from the group, we need a method to reset. > > Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via > sched_setattr syscall. Looks like what you say here is not what you coded, since you actually add two new flags, _RESET_{MIN,MAX}. I think we would instead value a simple user-space interface where a single additional _RESET flag should be good enough. > Signed-off-by: Yun Hsiang > --- > include/uapi/linux/sched.h | 9 - > kernel/sched/core.c | 16 > 2 files changed, 20 insertions(+), 5 deletions(-) > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h > index 3bac0a8ceab2..a12e88c362d8 100644 > --- a/include/uapi/linux/sched.h > +++ b/include/uapi/linux/sched.h > @@ -132,6 +132,9 @@ struct clone_args { > #define SCHED_FLAG_KEEP_PARAMS 0x10 > #define SCHED_FLAG_UTIL_CLAMP_MIN 0x20 > #define SCHED_FLAG_UTIL_CLAMP_MAX 0x40 > +#define SCHED_FLAG_UTIL_CLAMP_RESET_MIN 0x80 > +#define SCHED_FLAG_UTIL_CLAMP_RESET_MAX 0x100 What about adding just SCHED_FLAG_UTIL_CLAMP_RESET 0x08 ... > > #define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \ > SCHED_FLAG_KEEP_PARAMS) > @@ -139,10 +142,14 @@ struct clone_args { > #define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \ > SCHED_FLAG_UTIL_CLAMP_MAX) ... making it part of SCHED_FLAG_UTIL_CLAMP ... > > +#define SCHED_FLAG_UTIL_CLAMP_RESET (SCHED_FLAG_UTIL_CLAMP_RESET_MIN | \ > + SCHED_FLAG_UTIL_CLAMP_RESET_MAX) > + > #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \ > SCHED_FLAG_RECLAIM | \ > SCHED_FLAG_DL_OVERRUN | \ > SCHED_FLAG_KEEP_ALL | \ > - SCHED_FLAG_UTIL_CLAMP) > + SCHED_FLAG_UTIL_CLAMP | \ > + SCHED_FLAG_UTIL_CLAMP_RESET) ... and use it in conjunction with the existing _CLAMP_{MIN,MAX} to know which clamp should be reset?
> #endif /* _UAPI_LINUX_SCHED_H */ > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 9a2fbf98fd6f..ed4cb412dde7 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1207,15 +1207,22 @@ static void __setscheduler_uclamp(struct task_struct > *p, > uclamp_se_set(uc_se, clamp_value, false); > } > > - if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP))) > + if (likely(!(attr->sched_flags & > + (SCHED_FLAG_UTIL_CLAMP | SCHED_FLAG_UTIL_CLAMP_RESET)))) > return; This check will not be changed, while we will have to add a bypass in uclamp_validate(). > > - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET_MIN) { > + uclamp_se_set(&p->uclamp_req[UCLAMP_MIN], > + 0, false); > + } else if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { > uclamp_se_set(&p->uclamp_req[UCLAMP_MIN], > attr->sched_util_min, true); > } > These checks also will have to be updated to check _RESET and _{MIN,MAX} combinations. A bonus point would be to make it possible to pass in just the _RESET flag if we want to reset both clamps. IOW, passing in _RESET only should be consumed as if we passed in _RESET|_MIN|_MAX. Caveat, RT tasks have got a special 'reset value' for _MIN.
We should check and ensure __uclamp_update_util_min_rt_default() is properly called for those tasks, which likely will require some additional refactoring :/ > - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) { > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET_MAX) { > + uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], > + 1024, false); > + } else if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) { > uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], > attr->sched_util_max, true); > } > @@ -4901,7 +4908,8 @@ static int __sched_setscheduler(struct task_struct *p, > goto change; > if (dl_policy(policy) && dl_param_changed(p, attr)) > goto change; > - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) > + if (attr->sched_flags & > + (SCHED_FLAG_UTIL_CLAMP | > SCHED_FLAG_UTIL_CLAMP_RESET)) > goto change; > > p->sched_reset_on_fork = reset_on_fork;
Re: [PATCH 1/1] sched/uclamp: release per-task uclamp control if user set to default value
Hi Yun, Dietmar, On Mon, Oct 05, 2020 at 14:38:18 +0200, Dietmar Eggemann wrote... > + Patrick Bellasi > + Qais Yousef > > On 02.10.20 07:38, Yun Hsiang wrote: >> On Wed, Sep 30, 2020 at 03:12:51PM +0200, Dietmar Eggemann wrote: > > [...] > >>> On 28/09/2020 10:26, Yun Hsiang wrote: >>>> If the user wants to release the util clamp and let cgroup to control it, >>>> we need a method to reset. >>>> >>>> So if the user set the task uclamp to the default value (0 for UCLAMP_MIN >>>> and 1024 for UCLAMP_MAX), reset the user_defined flag to release control. >>>> >>>> Signed-off-by: Yun Hsiang >>> >>> could you explain with a little bit more detail why you would need this >>> feature? >>> >>> Currently we assume that once the per-task uclamp (user-defined) values >>> are set, you could only change the effective uclamp values of this task >>> by (1) moving it into another taskgroup or (2) changing the system >>> default uclamp values. >>> >> >> Assume a module that controls group (such as top-app in android) uclamp and >> task A in the group. >> Once task A set uclamp, it will not be affected by the group setting. That's not true, and Dietmar's example below is correct. We call it uclamp since the values are clamps, which are always aggregated somehow at different levels. IOW, a task never has a fully free choice of the final effective value. > This depends on the uclamp values of A and /TG (the task group). > > Both uclamp values are max aggregated (max aggregation between > system-wide, taskgroup and task values for UCLAMP_MIN and UCLAMP_MAX).
> > (1) system-wide: /proc/sys/kernel/sched_util_clamp_[min,max] > > (2) taskgroup (hierarchy): /sys/fs/cgroup/cpu/TG/cpu.uclamp.[min,max] > > (3) task A: > > Example: [uclamp_min, uclamp_max] > > (1) [1024, 1024] > > (2) [25.00 (256), 75.00 (768)] > > (3a) [128, 512] : both values are not affected by /TG > > (3b) [384, 896] : both values are affected by /TG > > >> If task A doesn't want to control itself anymore, To be precise, in this case we should say: "if a task doesn't want to give up anymore". Indeed, the base idea is that a task can always and only "ask for less". What it really gets (effective value) is the minimum among its request, what the group allows and the system wide value on top, i.e. ref [4,5]: eff_value = MIN(system-wide, MIN(tg, task)) >> it can not go back to the initial state to let the module(group) control. > In case A changes its values e.g. from 3a to 3b it will go back to be > controlled by /TG again (like it was when it had no user defined > values). True, however it's also true that, strictly speaking, once a task has defined a per-task value, we will always aggregate/clamp that value wrt the TG and SystemWide value. >> But the other tasks in the group will be affected by the group. This is not clear to me. All tasks in a group will be treated independently. All the tasks are subject to the same _individual_ aggregation/clamping policy. > Yes, in case they have no user defined values or have values greater > than the one of /TG. > >> The policy might be >> 1) if the task wants to control it's uclamp, use task uclamp value Again, worth stressing, a task has _never_ full control of its clamp. Precisely, a task has the freedom to always ask for less than what is enforced at TG/System level. IOW, task-specific uclamp values support only a "nice" policy, where a task can only give up something. Either be _less_ boosted or _more_ capped, which in both cases corresponds to asking for _less_ CPU bandwidth.
>> (but under group uclamp constraint) > > That would be example 3a. > >> 2) if the task doesn't want to control it's uclamp, use group uclamp value. > > That would be example 3b. > >> If the policy is proper, we need a reset method for per-task uclamp. >> >>>> --- >>>> kernel/sched/core.c | 7 +-- >>>> 1 file changed, 5 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>>> index 9a2fbf98fd6f..fa63d70d783a 100644 >>>> --- a/kernel/sched/core.c >>>> +++ b/kernel/sched/core.c >>>> @@ -1187,6 +1187,7 @@ static void __setscheduler_uclamp(struct task_struct >>>> *p, >>>> const struct sched_attr *attr) >>>> { >>>>enum uclamp_id clamp_id; >>>> + bool user_defined; >>>> >>>>/* >>>> * On scheduling class change, reset to defaul
Re: [PATCH 4/5] ARM: dts: aspeed: minipack: Update 64MB FMC flash layout
On Mon, Aug 24, 2020 at 02:19:47PM -0700, rentao.b...@gmail.com wrote: > From: Tao Ren > > Set 64Mb FMC flash layout in Minipack device tree explicitly because the > flash layout was removed from "ast2500-facebook-netbmc-common.dtsi". > > Please note "data0" partition's size is updated to 4MB to be consistent > with other Facebook OpenBMC platforms. > > Signed-off-by: Tao Ren > --- > .../boot/dts/aspeed-bmc-facebook-minipack.dts | 47 ++- > 1 file changed, 45 insertions(+), 2 deletions(-) > Reviewed-by: Patrick Williams -- Patrick Williams
Re: [PATCH 3/5] ARM: dts: aspeed: yamp: Set 32MB FMC flash layout
On Mon, Aug 24, 2020 at 02:19:46PM -0700, rentao.b...@gmail.com wrote: > From: Tao Ren > > Set 32MB FMC flash layout in Yamp device tree explicitly because flash > layout settings were removed from "ast2500-facebook-netbmc-common.dtsi". > > Signed-off-by: Tao Ren > --- > arch/arm/boot/dts/aspeed-bmc-facebook-yamp.dts | 17 + > 1 file changed, 17 insertions(+) > Reviewed-by: Patrick Williams -- Patrick Williams
Re: [PATCH 2/5] ARM: dts: aspeed: cmm: Set 32MB FMC flash layout
On Mon, Aug 24, 2020 at 02:19:45PM -0700, rentao.b...@gmail.com wrote: > From: Tao Ren > > Set 32MB FMC flash layout in CMM device tree explicitly because the flash > layout settings were removed from "ast2500-facebook-netbmc-common.dtsi". > > Signed-off-by: Tao Ren > --- > arch/arm/boot/dts/aspeed-bmc-facebook-cmm.dts | 17 + > 1 file changed, 17 insertions(+) > Reviewed-by: Patrick Williams -- Patrick Williams
Re: [PATCH 1/5] ARM: dts: aspeed: Remove flash layout from Facebook AST2500 Common dtsi
On Mon, Aug 24, 2020 at 02:19:44PM -0700, rentao.b...@gmail.com wrote: > From: Tao Ren > > Remove FMC flash layout from ast2500-facebook-netbmc-common.dtsi because > flash size and layout varies across different Facebook AST2500 OpenBMC > platforms. > > Signed-off-by: Tao Ren > --- > .../boot/dts/ast2500-facebook-netbmc-common.dtsi| 13 - > 1 file changed, 13 deletions(-) > Reviewed-by: Patrick Williams -- Patrick Williams
[PATCH V2] USB: serial: ftdi_sio: add IDs for Xsens Mti USB converter
From: Patrick Riphagen The device added has an FTDI chip inside. The device is used to connect Xsens USB Motion Trackers. Cc: sta...@vger.kernel.org Signed-off-by: Patrick Riphagen --- drivers/usb/serial/ftdi_sio.c | 1 + drivers/usb/serial/ftdi_sio_ids.h | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c index 9ad44a96dfe3..2c08cad32f1d 100644 --- a/drivers/usb/serial/ftdi_sio.c +++ b/drivers/usb/serial/ftdi_sio.c @@ -713,6 +713,7 @@ static const struct usb_device_id id_table_combined[] = { { USB_DEVICE(XSENS_VID, XSENS_AWINDA_STATION_PID) }, { USB_DEVICE(XSENS_VID, XSENS_CONVERTER_PID) }, { USB_DEVICE(XSENS_VID, XSENS_MTDEVBOARD_PID) }, + { USB_DEVICE(XSENS_VID, XSENS_MTIUSBCONVERTER_PID) }, { USB_DEVICE(XSENS_VID, XSENS_MTW_PID) }, { USB_DEVICE(FTDI_VID, FTDI_OMNI1509) }, { USB_DEVICE(MOBILITY_VID, MOBILITY_USB_SERIAL_PID) }, diff --git a/drivers/usb/serial/ftdi_sio_ids.h b/drivers/usb/serial/ftdi_sio_ids.h index e8373528264c..b5ca17a5967a 100644 --- a/drivers/usb/serial/ftdi_sio_ids.h +++ b/drivers/usb/serial/ftdi_sio_ids.h @@ -160,6 +160,7 @@ #define XSENS_AWINDA_DONGLE_PID 0x0102 #define XSENS_MTW_PID 0x0200 /* Xsens MTw */ #define XSENS_MTDEVBOARD_PID 0x0300 /* Motion Tracker Development Board */ +#define XSENS_MTIUSBCONVERTER_PID 0x0301 /* MTi USB converter */ #define XSENS_CONVERTER_PID 0xD00D /* Xsens USB-serial converter */ /* Xsens devices using FTDI VID */ -- 2.25.1 V2: Added CC to stable, From line to match SoB and version in subject
[PATCH] USB: serial: ftdi_sio: add IDs for Xsens Mti USB converter
The device added has an FTDI chip inside. The device is used to connect Xsens USB Motion Trackers. Cc: sta...@vger.kernel.org Signed-off-by: Patrick Riphagen --- drivers/usb/serial/ftdi_sio.c | 1 + drivers/usb/serial/ftdi_sio_ids.h | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c index 9ad44a96dfe3..2c08cad32f1d 100644 --- a/drivers/usb/serial/ftdi_sio.c +++ b/drivers/usb/serial/ftdi_sio.c @@ -713,6 +713,7 @@ static const struct usb_device_id id_table_combined[] = { { USB_DEVICE(XSENS_VID, XSENS_AWINDA_STATION_PID) }, { USB_DEVICE(XSENS_VID, XSENS_CONVERTER_PID) }, { USB_DEVICE(XSENS_VID, XSENS_MTDEVBOARD_PID) }, + { USB_DEVICE(XSENS_VID, XSENS_MTIUSBCONVERTER_PID) }, { USB_DEVICE(XSENS_VID, XSENS_MTW_PID) }, { USB_DEVICE(FTDI_VID, FTDI_OMNI1509) }, { USB_DEVICE(MOBILITY_VID, MOBILITY_USB_SERIAL_PID) }, diff --git a/drivers/usb/serial/ftdi_sio_ids.h b/drivers/usb/serial/ftdi_sio_ids.h index e8373528264c..b5ca17a5967a 100644 --- a/drivers/usb/serial/ftdi_sio_ids.h +++ b/drivers/usb/serial/ftdi_sio_ids.h @@ -160,6 +160,7 @@ #define XSENS_AWINDA_DONGLE_PID 0x0102 #define XSENS_MTW_PID 0x0200 /* Xsens MTw */ #define XSENS_MTDEVBOARD_PID 0x0300 /* Motion Tracker Development Board */ +#define XSENS_MTIUSBCONVERTER_PID 0x0301 /* MTi USB converter */ #define XSENS_CONVERTER_PID 0xD00D /* Xsens USB-serial converter */ /* Xsens devices using FTDI VID */ -- 2.25.1
[PATCH] USB: serial: ftdi_sio: add IDs for Xsens Mti USB converter
The device added has an FTDI chip inside. The device is used to connect Xsens USB Motion Trackers. Signed-off-by: Patrick Riphagen --- drivers/usb/serial/ftdi_sio.c | 1 + drivers/usb/serial/ftdi_sio_ids.h | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c index 9ad44a96dfe3..2c08cad32f1d 100644 --- a/drivers/usb/serial/ftdi_sio.c +++ b/drivers/usb/serial/ftdi_sio.c @@ -713,6 +713,7 @@ static const struct usb_device_id id_table_combined[] = { { USB_DEVICE(XSENS_VID, XSENS_AWINDA_STATION_PID) }, { USB_DEVICE(XSENS_VID, XSENS_CONVERTER_PID) }, { USB_DEVICE(XSENS_VID, XSENS_MTDEVBOARD_PID) }, + { USB_DEVICE(XSENS_VID, XSENS_MTIUSBCONVERTER_PID) }, { USB_DEVICE(XSENS_VID, XSENS_MTW_PID) }, { USB_DEVICE(FTDI_VID, FTDI_OMNI1509) }, { USB_DEVICE(MOBILITY_VID, MOBILITY_USB_SERIAL_PID) }, diff --git a/drivers/usb/serial/ftdi_sio_ids.h b/drivers/usb/serial/ftdi_sio_ids.h index e8373528264c..b5ca17a5967a 100644 --- a/drivers/usb/serial/ftdi_sio_ids.h +++ b/drivers/usb/serial/ftdi_sio_ids.h @@ -160,6 +160,7 @@ #define XSENS_AWINDA_DONGLE_PID 0x0102 #define XSENS_MTW_PID 0x0200 /* Xsens MTw */ #define XSENS_MTDEVBOARD_PID 0x0300 /* Motion Tracker Development Board */ +#define XSENS_MTIUSBCONVERTER_PID 0x0301 /* MTi USB converter */ #define XSENS_CONVERTER_PID 0xD00D /* Xsens USB-serial converter */ /* Xsens devices using FTDI VID */ -- 2.25.1
Dear
Dear friend, Please accept my apologies, I didn't mean to invade your privacy, but I wrote you an email already, but I'm not sure you have it, so I'm resending it to you. I pray that this letter reaches you in good health. I am Barrister Dossou Patrick. I have a US $ 10 million business proposal in the codicil and last will from a deceased who happens to be my client. He also has the same last name with you and he is from your country as well. I want to introduce you to his bank as his next of kin so that the bank will pay the 10 million US dollars to your account in your country and I will come to your country for my own part. I will provide more details on the subject once I have received your response via the attached email address: [dossou.cabin...@gmail.com] You can also request my phone number for further discussion On the question. We may also take this opportunity to discuss the split ratio and fund transfer terms on your behalf. Regards, Dossou patrick, Esq
Donation
Hello, We are Frances and Patrick Connolly the mega winner of £115m EuroMillions lottery, if you get this email then your email was selected after a spin ball. We have spread most of our wealth over several charities and organizations. We have voluntarily decided to donate the sum of £1 Million to you, to verify our winnings please see our YouTube page below. WATCH HERE: https://www.youtube.com/watch?v=NdAfkA8QPpg THIS IS YOUR DONATION CODE: [ 0034219 ] Reply with the DONATION CODE to this email: francespatrickconnol...@gmail.com Hope to make you and your family happy. Regards Frances and Patrick Connolly.
Re: nouveau regression with 5.7 caused by "PCI/PM: Assume ports without DLL Link Active train links in 100 ms"
On 7/21/20 10:27 AM, Mika Westerberg wrote: > On Tue, Jul 21, 2020 at 11:01:55AM -0400, Lyude Paul wrote: >> Sure thing. Also, feel free to let me know if you'd like access to one of the >> systems we saw breaking with this patch - I'm fairly sure I've got one of >> them >> locally at my apartment and don't mind setting up AMT/KVM/SSH > Probably no need for remote access (thanks for the offer, though). I > attached a test patch to the bug report: > > https://bugzilla.kernel.org/show_bug.cgi?id=208597 > > that tries to work it around (based on the ->pm_cap == 0). I wonder if > anyone would have time to try it out. Hi Mika, I can confirm that this patch applied to 5.4.52 fixes the issue with hybrid graphics on the Thinkpad X1 Extreme gen2. Thanks, Pat
Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus
Hi Vincent, On Mon, Jul 13, 2020 at 14:59:51 +0200, Vincent Guittot wrote... > On Fri, 10 Jul 2020 at 21:59, Patrick Bellasi > wrote: >> On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot >> wrote... >> >> [...] >> >> >> > C) Existing control paths >> >> >> >> Assuming: >> >> >> >> C: CFS task currently running on CPUx >> >> W: CFS task waking up on the same CPUx >> >> >> >> And considering the overall simplified workflow: >> >> >> >> core::try_to_wake_up() >> >> >> >> // 1) Select on which CPU W will run >> >> core::select_task_rq() >> >> fair::select_task_rq_fair() >> >> >> >> // 2) Enqueue W on the selected CPU >> >> core::ttwu_queue() >> >> core::ttwu_do_activate() >> >> core::activate_task() >> >> core::enqueue_task() >> >> fair::enqueue_task_fair() >> >> fair::enqueue_entity() >> >> >> >> // 3) Set W's vruntime bonus >> >> fair::place_entity() >> >> se->vruntime = ... >> >> >> >> // 4) Check if C can be preempted by W >> >> core::ttwu_do_wakeup() >> >> core::check_preempt_curr() >> >> fair::check_preempt_curr() >> >> fair::check_preempt_wakeup(curr, se) >> >> fair::wakeup_preempt_entity(curr, se) >> >> vdiff = curr.vruntime - se.vruntime >> >> return vdiff > wakeup_gran(se) >> >> >> >> We see that W preempts C iff: >> >> >> >>vdiff > wakeup_gran(se) >> >> >> >> Since: >> >> >> >> enqueue_entity(cfs_rq, se, flags) >> >> place_entity(cfs_rq, se, initial=0) >> >> thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1) >> >> vruntime = cfs_rq->min_vruntime - thresh >> >> se->vruntime = max_vruntime(se->vruntime, vruntime) >> >> >> >> a waking task's W.vruntime can get a "vruntime bonus" up to: >> >> - 1 scheduler latency (w/ GENTLE_FAIR_SLEEPERS) >> >> - 1/2 scheduler latency (w/o GENTLE_FAIR_SLEEPERS) >> >> >> >> >> >> > D) Desired behavior >> >> >> >> The "vruntime bonus" (thresh) computed in place_entity() should have a >> >> per-task definition, which defaults to the current implementation. 
>> >> >> >> A bigger vruntime bonus can be configured for latency sensitive tasks. >> >> A smaller vruntime bonus can be configured for latency tolerant tasks. >> > >> > I'm not sure that adjusting what you called "vruntime bonus" is the >> > right way to provide some latency because it doesn't only provide a >> > wakeup latency bonus but also provides a runtime bonus. >> >> True, however that's what we already do but _just_ in a hard-coded way. >> >> A task waking up from sleep gets 1 sched_latency bonus, or 1/2 w/o >> FAIR_SLEEPERS. Point is that not all tasks are the same: for some this > > From a nice and fair PoV, it's not a bonus but the opposite. In fact > it's limiting how much credit, the task will keep from its sleep time. I agree about 'limiting a credit', thus being a _credit_ IMO it's a bonus, and the limiting happens only with GENTLE_FAIR_SLEEPERS. So, in general, tasks _waking up_ get a (limited) credit, i.e. a vruntime bonus. From the FAIR PoV it is even more a bonus since all the machinery AFAIU is designed to give some vruntime bonus to _non_ CPU bound / batch tasks. That's done to compensate for them being suspended and thus not having a chance to consume all their fair CPU share in the previous activation. > Also keep in mind that this value is clamped by its vruntime so a task > can't get bonus True, at wakeup we clamp it with the SE (normalized) vruntime. But still, since we do: se->vruntime = max(se->vruntime, cfs_rq->min_vruntime-VRUNTIME_BONUS) \ A / \--- B ---/ The bigger B is, the more likely we are to "penalize" the SE vruntime. >> bonus can be not really required, for others too small. >> >> Regarding the 'runtime bonus' I think it's kind of unavoidable, >> if we really want a latency sensitive task being scheduled >> before the others. > > That's where I disa
Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus
On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot wrote... > Hi Patrick, Hi Vincent, [...] >> > C) Existing control paths >> >> Assuming: >> >> C: CFS task currently running on CPUx >> W: CFS task waking up on the same CPUx >> >> And considering the overall simplified workflow: >> >> core::try_to_wake_up() >> >> // 1) Select on which CPU W will run >> core::select_task_rq() >> fair::select_task_rq_fair() >> >> // 2) Enqueue W on the selected CPU >> core::ttwu_queue() >> core::ttwu_do_activate() >> core::activate_task() >> core::enqueue_task() >> fair::enqueue_task_fair() >> fair::enqueue_entity() >> >> // 3) Set W's vruntime bonus >> fair::place_entity() >> se->vruntime = ... >> >> // 4) Check if C can be preempted by W >> core::ttwu_do_wakeup() >> core::check_preempt_curr() >> fair::check_preempt_curr() >> fair::check_preempt_wakeup(curr, se) >> fair::wakeup_preempt_entity(curr, se) >> vdiff = curr.vruntime - se.vruntime >> return vdiff > wakeup_gran(se) >> >> We see that W preempts C iff: >> >>vdiff > wakeup_gran(se) >> >> Since: >> >> enqueue_entity(cfs_rq, se, flags) >> place_entity(cfs_rq, se, initial=0) >> thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1) >> vruntime = cfs_rq->min_vruntime - thresh >> se->vruntime = max_vruntime(se->vruntime, vruntime) >> >> a waking task's W.vruntime can get a "vruntime bonus" up to: >> - 1 scheduler latency (w/ GENTLE_FAIR_SLEEPERS) >> - 1/2 scheduler latency (w/o GENTLE_FAIR_SLEEPERS) >> >> >> > D) Desired behavior >> >> The "vruntime bonus" (thresh) computed in place_entity() should have a >> per-task definition, which defaults to the current implementation. >> >> A bigger vruntime bonus can be configured for latency sensitive tasks. >> A smaller vruntime bonus can be configured for latency tolerant tasks. > > I'm not sure that adjusting what you called "vruntime bonus" is the > right way to provide some latency because it doesn't only provide a > wakeup latency bonus but also provides a runtime bonus. 
True, however that's what we already do, but _just_ in a hard-coded way. A
task waking up from sleep gets 1 sched_latency bonus, or 1/2 w/o
FAIR_SLEEPERS. The point is that not all tasks are the same: for some this
bonus may not really be required, for others it's too small.

Regarding the 'runtime bonus', I think it's kind of unavoidable if we really
want a latency sensitive task to be scheduled before the others.

> It means that one can impact the running time by playing with
> latency_nice whereas the goal is only to impact the wakeup latency.

Well, I'm not sure how much you can really gain considering that this bonus
is given only at wakeup time: the task would have to keep suspending itself.
It would get a better result by just asking for a lower nice value.

Now, asking for a reduced nice value is RLIMIT_NICE and CAP_SYS_NICE
protected. The same will be true for latency_nice. Moreover, considering
that by default tasks will get what we already have hard-coded, or less of a
bonus, I don't see how easy it would be to abuse. To the contrary, we can
introduce a very useful knob to allow certain tasks to voluntarily demote
themselves and avoid annoying a currently running task.

> Instead, it should weight the decision in wakeup_preempt_entity() and
> wakeup_gran()

In those functions we already take the task prio into consideration (ref
details at the end of this message). Lower nice value tasks have more
chances to preempt current since they will have a smaller wakeup_gran;
indeed, we preempt IFF:

  vdiff(se, current) > wakeup_gran(se)
  \----------------/   \------------/
          A                  B

While a task's prio affects B, in this proposal latency_nice works on the A
side of the equation above, by making it a bit more task specific.

That said, it's true that both latency_nice and prio will ultimately play a
role in how much CPU bandwidth a task gets. The question is: do we deem it
useful to have an additional knob working on the A side of the equation
above?
Best, Patrick ---8<--8<--8<--8<--8<--8<--8<--8<--8<--8<--- TL;DR: The nice value already affects the wakeup latency As reported above: check_preempt_wakeup(rq, p, wake_flags) wakeup_preempt_entity(curr, se) (d)
[PATCH] ARM: dts: stm32: Correct spi4 pins in stm32mp15-pinctrl.dtsi
Move spi4_pins_a nodes from pinctrl_z to pinctrl as the associated pins are not in BANK Z. Fixes: 498a7014989dfdd9a47864b55704dc829ed0dc90 Signed-off-by: Patrick Delaunay --- arch/arm/boot/dts/stm32mp15-pinctrl.dtsi | 28 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi index 7eb858732d6d..6aedbd7077ff 100644 --- a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi +++ b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi @@ -1574,6 +1574,20 @@ }; }; + spi4_pins_a: spi4-0 { + pins { + pinmux = , /* SPI4_SCK */ +; /* SPI4_MOSI */ + bias-disable; + drive-push-pull; + slew-rate = <1>; + }; + pins2 { + pinmux = ; /* SPI4_MISO */ + bias-disable; + }; + }; + usart2_pins_a: usart2-0 { pins1 { pinmux = , /* USART2_TX */ @@ -1776,18 +1790,4 @@ bias-disable; }; }; - - spi4_pins_a: spi4-0 { - pins { - pinmux = , /* SPI4_SCK */ -; /* SPI4_MOSI */ - bias-disable; - drive-push-pull; - slew-rate = <1>; - }; - pins2 { - pinmux = ; /* SPI4_MISO */ - bias-disable; - }; - }; }; -- 2.17.1
Re: [PATCH v5 2/2] sched/uclamp: Protect uclamp fast path code with static key
On Tue, Jun 30, 2020 at 17:40:34 +0200, Qais Yousef wrote... > Hi Patrick > > On 06/30/20 16:55, Patrick Bellasi wrote: >> >> Hi Qais, >> sorry for commenting on v5 with a v6 already posted, but... >> ... I cannot keep up with your re-spinning rate ;) > > I classified that as a nit really and doesn't affect correctness. We have > different subjective view on what is better here. I did all the work in the > past 2 weeks and I think as the author of this patch I have the right to keep > my preference on subjective matters. I did consider your feedback and didn't > ignore it and improved the naming and added a comment to make sure there's no > confusion. > > We could nitpick the best name forever, but is it really that important? Which leans toward confirming the impression I had while reading your previous response, i.e. you stopped reading at the name change observation, which would be _just_ a nit-picking, although still worth IMHO. Instead, I went further and asked you to consider a different approach: not adding a new kernel symbol to represent a concept already there. > I really don't see any added value for one approach or another here to start > a long debate about it. Then you could have just called out that instead of silently ignoring the comment/proposal. > The comments were small enough that I didn't see any controversy that > warrants holding the patches longer. I agreed with your proposal to use > uc_se->active and clarified why your other suggestions don't hold. > > You pointed that uclamp_is_enabled() confused you; and I responded that I'll > change the name. Perhaps it would not confuse only me having 'something_enabled()' referring to 'something_used'. > Sorry for not being explicit about answering the below, but > I thought my answer implied that I don't prefer it. Your answer was about a name change, don't see correlation with a different approach... but should be just me. 
>> >> Thus, perhaps we can just use the same pattern used by the
>> >> sched_numa_balancing static key:
>> >>
>> >> $ git grep sched_numa_balancing
>> >> kernel/sched/core.c:DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
>> >> kernel/sched/core.c:	static_branch_enable(&sched_numa_balancing);
>> >> kernel/sched/core.c:	static_branch_disable(&sched_numa_balancing);
>> >> kernel/sched/core.c:	int state = static_branch_likely(&sched_numa_balancing);
>> >> kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>> >> kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>> >> kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>> >> kernel/sched/fair.c:	if (static_branch_unlikely(&sched_numa_balancing))
>> >> kernel/sched/sched.h:extern struct static_key_false sched_numa_balancing;
>> >>
>> >> IOW: unconditionally define sched_uclamp_used as non static in core.c,
>> >> and use it directly on schedutil too.
>>
>> So, what about this instead of adding the (renamed) method above?
>
> I am sorry there's no written rule that says one should do it in a specific
> way. And AFAIK both ways are implemented in the kernel. I appreciate your
> suggestion but as the person who did all the hard work, I think my preference
> matters here too.

You surely know that sometimes reviewing code can be "hard work" too, so I
would not go down that way at all with the discussion. Quite likely I have a
different "subjective" view on how Open Source development works.

> And actually with my approach when uclamp is not compiled in there's no need
> to define an extra variable; and since uclamp_is_used() is defined as false
> for !CONFIG_UCLAMP_TASK, it'll help with DCE, so less likely to end up with
> dead code that'll never run in the final binary.

Good, this is the simple and small reply I've politely asked for.

Best,
Patrick
Re: [PATCH v5 2/2] sched/uclamp: Protect uclamp fast path code with static key
Hi Qais,

sorry for commenting on v5 with a v6 already posted, but...
... I cannot keep up with your re-spinning rate ;)

More importantly, perhaps you missed commenting on one of my previous points.
Will have a better look at the rest of v6 later today.

Cheers,
Patrick

On Tue, Jun 30, 2020 at 11:46:24 +0200, Qais Yousef wrote...

> On 06/30/20 10:11, Patrick Bellasi wrote:
>> On Mon, Jun 29, 2020 at 18:26:33 +0200, Qais Yousef wrote...

[...]

>> > +
>> > +static inline bool uclamp_is_enabled(void)
>> > +{
>> > +	return static_branch_likely(&sched_uclamp_used);
>> > +}
>>
>> Looks like here we mix up terms, which can be confusing.
>> AFAIKS, we use:
>> - *_enabled for the sched class flags (compile time)
>> - *_used    for the user-space opting in (run time)
>
> I wanted to add a comment here.
>
> I can rename it to uclamp_is_used() if you want.

In my previous message I was mostly asking about this:

>> Thus, perhaps we can just use the same pattern used by the
>> sched_numa_balancing static key:
>>
>> $ git grep sched_numa_balancing
>> kernel/sched/core.c:DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
>> kernel/sched/core.c:	static_branch_enable(&sched_numa_balancing);
>> kernel/sched/core.c:	static_branch_disable(&sched_numa_balancing);
>> kernel/sched/core.c:	int state = static_branch_likely(&sched_numa_balancing);
>> kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>> kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>> kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>> kernel/sched/fair.c:	if (static_branch_unlikely(&sched_numa_balancing))
>> kernel/sched/sched.h:extern struct static_key_false sched_numa_balancing;
>>
>> IOW: unconditionally define sched_uclamp_used as non static in core.c,
>> and use it directly on schedutil too.

So, what about this instead of adding the (renamed) method above?
Re: [PATCH v5 2/2] sched/uclamp: Protect uclamp fast path code with static key
return;

> [...]
> +/**
> + * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
> + * @rq: The rq to clamp against. Must not be NULL.
> + * @util: The util value to clamp.
> + * @p: The task to clamp against. Can be NULL if you want to clamp
> + *     against @rq only.
> + *
> + * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
> + *
> + * If sched_uclamp_used static key is disabled, then just return the util
> + * without any clamping since uclamp aggregation at the rq level in the fast
> + * path is disabled, rendering this operation a NOP.
> + *
> + * Use uclamp_eff_value() if you don't care about uclamp values at rq level.
> + * It will return the correct effective uclamp value of the task even if the
> + * static key is disabled.

Well, if you don't care about the rq, you don't call a uclamp_rq_* method.

I would say that the above paragraph is redundant; moreover, it adds a
cross-reference to a different method (name) which requires maintenance.
What about removing it?

> + */
> static __always_inline
> unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> 				  struct task_struct *p)
> {
> -	unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> -	unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
> +	unsigned long min_util;
> +	unsigned long max_util;
> +
> +	if (!static_branch_likely(&sched_uclamp_used))
> +		return util;
> +
> +	min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> +	max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

I think moving the initialization is not required: the compiler should be
smart enough to place them where it's best.
> 	if (p) {
> 		min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
> @@ -2371,6 +2396,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
>
> 	return clamp(util, min_util, max_util);
> }
> +
> +static inline bool uclamp_is_enabled(void)
> +{
> +	return static_branch_likely(&sched_uclamp_used);
> +}

Looks like here we mix up terms, which can be confusing.
AFAIKS, we use:
- *_enabled for the sched class flags (compile time)
- *_used    for the user-space opting in (run time)

Thus, perhaps we can just use the same pattern used by the
sched_numa_balancing static key:

$ git grep sched_numa_balancing
kernel/sched/core.c:DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
kernel/sched/core.c:	static_branch_enable(&sched_numa_balancing);
kernel/sched/core.c:	static_branch_disable(&sched_numa_balancing);
kernel/sched/core.c:	int state = static_branch_likely(&sched_numa_balancing);
kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
kernel/sched/fair.c:	if (static_branch_unlikely(&sched_numa_balancing))
kernel/sched/sched.h:extern struct static_key_false sched_numa_balancing;

IOW: unconditionally define sched_uclamp_used as non static in core.c,
and use it directly on schedutil too.

> #else /* CONFIG_UCLAMP_TASK */
> static inline
> unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> @@ -2378,6 +2408,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> {
> 	return util;
> }
> +
> +static inline bool uclamp_is_enabled(void)
> +{
> +	return false;
> +}
> #endif /* CONFIG_UCLAMP_TASK */
>
> #ifdef arch_scale_freq_capacity

Best,
Patrick
Re: [PATCH v4 2/2] sched/uclamp: Protect uclamp fast path code with static key
On Thu, Jun 25, 2020 at 17:43:52 +0200, Qais Yousef wrote... [...] > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 235b2cae00a0..e2f1fffa013c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -794,6 +794,25 @@ unsigned int sysctl_sched_uclamp_util_max = > SCHED_CAPACITY_SCALE; > /* All clamps are required to be less or equal than these values */ > static struct uclamp_se uclamp_default[UCLAMP_CNT]; > > +/* > + * This static key is used to reduce the uclamp overhead in the fast path. It > + * only disables the call to uclamp_rq_{inc, dec}() in > enqueue/dequeue_task(). > + * > + * This allows users to continue to enable uclamp in their kernel config with > + * minimum uclamp overhead in the fast path. > + * > + * As soon as userspace modifies any of the uclamp knobs, the static key is > + * enabled, since we have an actual users that make use of uclamp > + * functionality. > + * > + * The knobs that would enable this static key are: > + * > + * * A task modifying its uclamp value with sched_setattr(). > + * * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs. 
> + * * An admin modifying the cgroup cpu.uclamp.{min, max}
> + */
> +static DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

This is me being choosy, but given that:

---8<---
% grep -e '_used[ )]' fair.c core.c
fair.c: return static_key_false(&__cfs_bandwidth_used);
fair.c: static_key_slow_inc_cpuslocked(&__cfs_bandwidth_used);
fair.c: static_key_slow_dec_cpuslocked(&__cfs_bandwidth_used);

% grep -e '_enabled[ )]' fair.c core.c
fair.c: if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
fair.c: if (!cfs_rq->runtime_enabled || cfs_rq->nr_running)
fair.c: if (!cfs_rq->runtime_enabled || cfs_rq->curr)
fair.c: if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
fair.c: cfs_rq->runtime_enabled = 0;
fair.c: cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
fair.c: if (!cfs_rq->runtime_enabled)
fair.c: cfs_rq->runtime_enabled = 0;
core.c: if (static_key_false((&paravirt_steal_rq_enabled))) {
core.c: if (unlikely(!p->sched_class->uclamp_enabled))
core.c: if (unlikely(!p->sched_class->uclamp_enabled))
core.c: * Prevent race between setting of cfs_rq->runtime_enabled and
core.c: runtime_enabled = quota != RUNTIME_INF;
core.c: runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
core.c: if (runtime_enabled && !runtime_was_enabled)
core.c: if (runtime_enabled)
core.c: cfs_rq->runtime_enabled = runtime_enabled;
core.c: if (runtime_was_enabled && !runtime_enabled)
---8<---

even just for consistency's sake, I would still prefer sched_uclamp_enabled
for the static key name.
> +
> /* Integer rounded range for each bucket */
> #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
>
> @@ -994,9 +1013,16 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> 	lockdep_assert_held(&rq->lock);
>
> 	bucket = &uc_rq->bucket[uc_se->bucket_id];
> -	SCHED_WARN_ON(!bucket->tasks);
> -	if (likely(bucket->tasks))
> -		bucket->tasks--;
> +
> +	/*
> +	 * This could happen if sched_uclamp_used was enabled while the
> +	 * current task was running, hence we could end up with unbalanced call
> +	 * to uclamp_rq_dec_id().
> +	 */
> +	if (unlikely(!bucket->tasks))
> +		return;
> +
> +	bucket->tasks--;
> 	uc_se->active = false;

In this chunk you are indeed changing the code. Are we sure there are no
issues with patterns like:

  enqueue(taskA)
  // uclamp gets enabled
  enqueue(taskB)
  dequeue(taskA)
  // bucket->tasks is now 0
  dequeue(taskB)

TaskB has been enqueued with uclamp enabled, thus it has got
uc_se->active=True and enforced its clamp value at RQ level. But with your
change above we don't reset that anymore.

As per my previous proposal: why not just remove the SCHED_WARN_ON? That's
the only real problem in the code above, since we are no longer guaranteed
to have balanced inc/dec.

[...]

> +bool uclamp_is_enabled(void)
> +{
> +	return static_branch_likely(&sched_uclamp_used);
> +}
> +

The above and the following changes are not necessary if we just add the
guards at the beginning of uclamp_rq_util_with() instead of what you do in
PATCH1. I think this will give better overhead removal by avoiding
unnecessary min/max clamps for both RT and CFS.
> diff --git a/kernel/sched/cpufreq_schedutil.c > b/kernel/sched/cpufreq_schedutil.c > index 7fbaee24c824..3f4e296ccb67 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long > util_cfs, > unsigned long dl_util, util, irq; > struct rq *rq = cpu_rq(cpu); > > - if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) && > + if (!uclamp_is_enabled() && > type == FREQUENCY_UTIL && rt_rq_is_runnable(>rt)) { > return max; > } > diff --git
Re: [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq
On Thu, Jun 25, 2020 at 17:43:51 +0200, Qais Yousef wrote...

> struct uclamp_rq was zeroed out entirely in assumption that in the first
> call to uclamp_rq_inc() they'd be initialized correctly in accordance to
> default settings.

Perhaps I was not clear in my previous comment:
https://lore.kernel.org/lkml/87sgekorfq.derkl...@matbug.net/
when I did say:

  Doesn't this mean the problem is more likely with uclamp_rq_util_with(),
  which should be guarded?

I did not mean that we have to guard the calls to that function but instead
that we should just make that function aware of uclamp being opted in or not.

> But when next patch introduces a static key to skip
> uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
> will fail to perform any frequency changes because the
> rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
> means all rqs are capped to 0 by default.

The initialization you want to do here is needed because, with the current
approach, you keep calling the same uclamp_rq_util_with() and keep doing
min/max aggregations even when uclamp is not opted in. But this also means
that we have min/max aggregation _when not really required_.

> Fix it by making sure we do proper initialization at init without
> relying on uclamp_rq_inc() doing it later.

My proposal was as simple as:

---8<---
static __always_inline
unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
				  struct task_struct *p)
{
	unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
	unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

+	if (!static_branch_unlikely(&sched_uclamp_used))
+		return rt_task(p) ? uclamp_none(UCLAMP_MAX) : util;

	if (p) {
		min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
		max_util = max(max_util, uclamp_eff_value(p, UCLAMP_MAX));
	}

	/*
	 * Since CPU's {min,max}_util clamps are MAX aggregated considering
	 * RUNNABLE tasks with _different_ clamps, we can end up with an
	 * inversion.
	 * Fix it now when the clamps are applied.
	 */
	if (unlikely(min_util >= max_util))
		return min_util;

	return clamp(util, min_util, max_util);
}
---8<---

Such a small change is more self-contained IMHO and does not remove an
existing optimization like the lazy RQ initialization at first usage.
Moreover, it can be folded into the following patch, with all the other
static key shortcuts.
Re: [PATCH v2 2/2] sched/uclamp: Protect uclamp fast path code with static key
;
> 	/*
> 	 * We update all RUNNABLE tasks only when task groups are in use.
> 	 * Otherwise, keep it simple and do just a lazy update at each next
> @@ -1221,6 +1265,9 @@ static void __setscheduler_uclamp(struct task_struct *p,
> 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> 		return;
>
> +	if (static_branch_unlikely(&sched_uclamp_unused))
> +		static_branch_disable(&sched_uclamp_unused);
> +
> 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> 			      attr->sched_util_min, true);
> @@ -7315,6 +7362,9 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
> 	if (req.ret)
> 		return req.ret;
>
> +	if (static_branch_unlikely(&sched_uclamp_unused))
> +		static_branch_disable(&sched_uclamp_unused);
> +
> 	mutex_lock(&uclamp_mutex);
> 	rcu_read_lock();

Best,
Patrick
Re: [PATCH v2 1/2] sched/uclamp: Fix initialization of strut uclamp_rq
Hi Qais,

On Fri, Jun 19, 2020 at 19:20:10 +0200, Qais Yousef wrote...

> struct uclamp_rq was zeroed out entirely in assumption that in the first
> call to uclamp_rq_inc() they'd be initialized correctly in accordance to
> default settings.
>
> But when next patch introduces a static key to skip
> uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
> will fail to perform any frequency changes because the
> rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
> means all rqs are capped to 0 by default.

Doesn't this mean the problem is more likely with uclamp_rq_util_with(),
which should be guarded?

Otherwise, we will also keep doing useless min/max aggregations each time
schedutil calls that function, thus not completely removing uclamp overheads
while user-space has not opted in.

What about dropping this and adding the guard in the following patch, along
with the others?

> Fix it by making sure we do proper initialization at init without
> relying on uclamp_rq_inc() doing it later.
>
> Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
> Signed-off-by: Qais Yousef
> Cc: Juri Lelli
> Cc: Vincent Guittot
> Cc: Dietmar Eggemann
> Cc: Steven Rostedt
> Cc: Ben Segall
> Cc: Mel Gorman
> CC: Patrick Bellasi
> Cc: Chris Redpath
> Cc: Lukasz Luba
> Cc: linux-kernel@vger.kernel.org
> ---
> kernel/sched/core.c | 23 ++-
> 1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a43c84c27c6f..4265861e13e9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1248,6 +1248,22 @@ static void uclamp_fork(struct task_struct *p)
> 	}
> }
>
> +static void __init init_uclamp_rq(struct rq *rq)
> +{
> +	enum uclamp_id clamp_id;
> +	struct uclamp_rq *uc_rq = rq->uclamp;
> +
> +	for_each_clamp_id(clamp_id) {
> +		memset(uc_rq[clamp_id].bucket,
> +		       0,
> +		       sizeof(struct uclamp_bucket)*UCLAMP_BUCKETS);
> +
> +		uc_rq[clamp_id].value = uclamp_none(clamp_id);
> +	}
> +
> +	rq->uclamp_flags = 0;
> +}
> +
> static void __init init_uclamp(void)
> {
> 	struct uclamp_se uc_max = {};
> @@ -1256,11 +1272,8 @@ static void __init init_uclamp(void)
>
> 	mutex_init(&uclamp_mutex);
>
> -	for_each_possible_cpu(cpu) {
> -		memset(&cpu_rq(cpu)->uclamp, 0,
> -		       sizeof(struct uclamp_rq)*UCLAMP_CNT);
> -		cpu_rq(cpu)->uclamp_flags = 0;
> -	}
> +	for_each_possible_cpu(cpu)
> +		init_uclamp_rq(cpu_rq(cpu));
>
> 	for_each_clamp_id(clamp_id) {
> 		uclamp_se_set(&init_task.uclamp_req[clamp_id],
[SchedulerWakeupLatency] Per-task vruntime wakeup bonus
On Tue, Jun 23, 2020 at 09:29:03 +0200, Patrick Bellasi wrote... > .:: Scheduler Wakeup Path Requirements Collection Template > == > > A) Name Runtime tunable vruntime wakeup bonus. > B) Target behavior All SCHED_OTHER tasks get the same (max) vruntime wakeup bonus. This bonus affects the chance the task has to preempt the currently running task. Some tasks, which are (known to be) latency tolerant, should have a smaller chance to preempt a (known to be) latency sensitive task. To the contrary, latency sensitive tasks should have a higher chance to preempt a currently running latency tolerant task. This task specific distinction is not provided by the current implementation and all SCHED_OTHER tasks are handled according to the same simple, system-wide and not run-time tunable policy. > C) Existing control paths Assuming: C: CFS task currently running on CPUx W: CFS task waking up on the same CPUx And considering the overall simplified workflow: core::try_to_wake_up() // 1) Select on which CPU W will run core::select_task_rq() fair::select_task_rq_fair() // 2) Enqueue W on the selected CPU core::ttwu_queue() core::ttwu_do_activate() core::activate_task() core::enqueue_task() fair::enqueue_task_fair() fair::enqueue_entity() // 3) Set W's vruntime bonus fair::place_entity() se->vruntime = ... // 4) Check if C can be preempted by W core::ttwu_do_wakeup() core::check_preempt_curr() fair::check_preempt_curr() fair::check_preempt_wakeup(curr, se) fair::wakeup_preempt_entity(curr, se) vdiff = curr.vruntime - se.vruntime return vdiff > wakeup_gran(se) We see that W preempts C iff: vdiff > wakeup_gran(se) Since: enqueue_entity(cfs_rq, se, flags) place_entity(cfs_rq, se, initial=0) thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 
2 : 1) vruntime = cfs_rq->min_vruntime - thresh se->vruntime = max_vruntime(se->vruntime, vruntime) a waking task's W.vruntime can get a "vruntime bonus" up to: - 1 scheduler latency (w/ GENTLE_FAIR_SLEEPERS) - 1/2 scheduler latency (w/o GENTLE_FAIR_SLEEPERS) > D) Desired behavior The "vruntime bonus" (thresh) computed in place_entity() should have a per-task definition, which defaults to the current implementation. A bigger vruntime bonus can be configured for latency sensitive tasks. A smaller vruntime bonus can be configured for latency tolerant tasks. TL;DR The "vruntime bonus" is meant to give sleepers a compensation for the service deficit due to them not having (possibly) fully consumed their assigned fair CPU quota within the current sched_latency interval, see: commit 51e0304ce6e5 ("sched: Implement a gentler fair-sleepers feature") The scheduler does that based on a conservative assumption: when a task sleeps it gives up a portion (P) of its fair CPU bandwidth share in the current sched_latency period. Willing to be FAIR, i.e. each task gets a FAIR quota of the CPU in each sched_latency period, the scheduler wants to give back P to waking tasks. However, striving to minimize overheads and complexity, the CFS scheduler does that using a simple heuristic: each task waking up gets a bonus, which is capped at one sched_latency period, independently from "its nature". What the scheduler completely disregards is that being completely FAIR is not always necessary. Depending on the nature of a task, not all tasks require a bonus. To the contrary: - a GENTLE_FAIR_SLEEPERS bonus given to a background task could result in preempting a latency sensitive currently running task - giving only 1/2 scheduler latency bonus to a latency sensitive task could result in that task being preempted before it completes its current activation. 
> E) Existing knobs The SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true) defined vruntime bonus value can be considered the current mainline default value. This means that "all" CFS tasks waking up will get a 0.5 * sysctl_sched_latency vruntime bonus wrt the cfs_rq->min_vruntime. > F) Existing knobs limitations GENTLE_FAIR_SLEEPERS is a system-wide knob and it's not run-time tunable on production systems (being a SCHED_DEBUG feature). Thus, the sched_feature should be removed and replaced by a per-task knob. > G) Proportionality Analysis The value of the vruntime bonus directly affects the chance a task has to preempt the currently running task. Indeed, from the code analysis in C: thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1) is the "wakeup bonus", which is used as: vruntime = cfs_rq->min_vruntime - thresh se->vruntime = max_vruntime(se->vruntime, vruntime) vdiff = curr.vrunti
Scheduler wakeup path tuning surface: Use-Cases and Requirements
Since last year's OSPM Summit we started conceiving the idea that task wakeup path could be better tuned for certain classes of workloads and usage scenarios. Various people showed interest for a possible tuning interface for the scheduler wakeup path. .:: The Problem === The discussions we had so far [1] have not been effective in clearly identifying if a common tuning surface is possible. The last discussion at this year's OSPM Summit [2,3] was also kind of inconclusive and left us with the message: start by collecting the requirements and then see what interface fits them the best. General consensus is that a unified interface can be challenging and maybe not feasible. However, generalisation is also a value and we should strive for it whenever it's possible. Someone might think that we did not put enough effort in the analysis of requirements. Perhaps what we missed so far is also a structured and organised way to collect requirements which also can help in factoring out the things they have in common. .:: The Proposal This thread aims at providing a guided template for the description of different task wakeup use-cases. It does that by setting a series of questions aimed at precisely describing what's "currently broken", what we would like to have instead and how we could achieve it. What we propose here is that, for each wakeup use-case, we start by replying to this email to provide the required details/comments for a predefined list of questions. This will generate independent discussion threads. Each thread will be free to focus on a specific proposal but still all the thread will be reasoning around a common set of fundamental concepts. The hope is that, by representing all the use-cases as sufficiently detailed responses to a common set of questions, once the discussion settles down, we can more easily verify if there are common things surfacing which then can be naturally wrapped into a unified user-space API. 
A first use-case description, following the template guidelines, will be posted shortly after this message. This also will provide an example for how to use the template. NOTE: Whenever required, pseudo-code or simplified C can be used. I hope this can drive a fruitful discussion in preparation for LPC! Best, Patrick ---8<--- For templates submissions: reply only to the following ---8<--- .:: Scheduler Wakeup Path Requirements Collection Template == A) Name: unique one-liner name for the proposed use-case B) Target behaviour: one paragraph to describe the wakeup path issue C) Existing control paths: reference to code paths to justify B) Assuming v5.6 as the reference kernel, this section should provide links to code paths such as, e.g. fair.c:3917 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/fair.c?h=v5.6#n3917 Alternatively code snippets can be added inline, e.g. /* * The 'current' period is already promised to the current tasks, * however the extra weight of the new task will slow them down a * little, place the new task so that it fits in the slot that * stays open at the end. */ if (initial && sched_feat(START_DEBIT)) vruntime += sched_vslice(cfs_rq, se); NOTE: if the use-case exists only outside the mainline Linux kernel this section can stay empty D) Desired behaviour: one paragraph to describe the desired update NOTE: the current mainline expression is assumed to be correct for existing use-cases. Thus, here we are looking for run-time tuning of those existing features. E) Existing knobs (if any): reference to whatever existing tunable Some features can already be tuned, but perhaps only via compile time knobs, SCHED_FEATs or system wide tunable. If that's the case, we should document them here and explain how they currently work and what are (if any) the implicit assumptions, e.g. what is the currently implemented scheduling policy/heuristic. 
F) Existing knobs (if any): one paragraph description of the limitations If the existing knobs are not up to the job for this use-case, shortly explain here why. It could be because a tuning surface is already there but it's hardcoded (e.g. compile time knob) or too coarse grained (e.g. a SCHED_FEAT). G) Proportionality Analysis: check the nature of the target behavior Goal here is to verify and discuss if the behaviour (B) has a proportional nature: different values of the control knobs (E) are expected to produce different effects for (B). Special care should be taken to check if the target behaviour has an intrinsically "binary nature", i.e. only two values make really sense. In this case it would be very useful to argument why a generalisation towards a non-binary behavio
[PATCH] ARM: dts: stm32: cosmetic update in stm32mp15-pinctrl.dtsi
Use tabs where possible and remove multiple blank lines.

Signed-off-by: Patrick Delaunay
---
 arch/arm/boot/dts/stm32mp15-pinctrl.dtsi | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
index 7eb858732d6d..7d351757f2f8 100644
--- a/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
+++ b/arch/arm/boot/dts/stm32mp15-pinctrl.dtsi
@@ -210,8 +210,8 @@
 , /* ETH_RGMII_TXD3 */
 , /* ETH_RGMII_TX_CTL */
 , /* ETH_MDC */
-, /* ETH_MDIO */
-, /* ETH_RGMII_RXD0 */
+, /* ETH_MDIO */
+, /* ETH_RGMII_RXD0 */
 , /* ETH_RGMII_RXD1 */
 , /* ETH_RGMII_RXD2 */
 , /* ETH_RGMII_RXD3 */
@@ -453,7 +453,7 @@
 i2c5_pins_b: i2c5-1 {
 	pins {
 		pinmux = , /* I2C5_SCL */
-			 ; /* I2C5_SDA */
+			 ; /* I2C5_SDA */
 		bias-disable;
 		drive-open-drain;
 		slew-rate = <0>;
@@ -463,7 +463,7 @@
 i2c5_sleep_pins_b: i2c5-sleep-1 {
 	pins {
 		pinmux = , /* I2C5_SCL */
-			 ; /* I2C5_SDA */
+			 ; /* I2C5_SDA */
 	};
 };
@@ -1072,7 +1072,6 @@
 	};
 };

-
 sai2a_pins_b: sai2a-1 {
 	pins1 {
 		pinmux = , /* SAI2_SD_A */
--
2.17.1
Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
On Fri, Jun 05, 2020 at 13:32:04 +0200, Qais Yousef wrote...

> On 06/05/20 09:55, Patrick Bellasi wrote:
>> On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef wrote...

[...]

>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 0464569f26a7..9f48090eb926 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>> >  	 * e.g. due to future modification, warn and fixup the expected value.
>> >  	 */
>> >  	SCHED_WARN_ON(bucket->value > rq_clamp);
>> > +#if 0
>> >  	if (bucket->value >= rq_clamp) {
>> >  		bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
>> >  		WRITE_ONCE(uc_rq->value, bkt_clamp);
>> >  	}
>> > +#endif
>>
>> Yep, that's likely where we have most of the overhead at dequeue time,
>> since _sometimes_ we need to update the cpu's clamp value.
>>
>> However, while running perf sched pipe, I expect:
>> - all tasks to have the same clamp value
>> - all CPUs to have _always_ at least one RUNNABLE task
>>
>> Given these two conditions above, if the CPU is never "CFS idle" (i.e.
>> without RUNNABLE CFS tasks), the code above should never be triggered.
>> More on that later...
>
> So the cost is only incurred by idle cpus is what you're saying.

Not really, you pay the cost every time you need to reduce the CPU clamp
value. This can happen also on a busy CPU, but only when you dequeue the
last task defining the current uclamp(cpu) value and the remaining
RUNNABLE tasks have a lower value.

>> >  }
>> >
>> >  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>> >
>> > uclamp_rq_max_value() could be expensive as it loops over all buckets.
>>
>> It loops over UCLAMP_CNT values which are defined to fit into a single
>
> I think you meant to say UCLAMP_BUCKETS, which is defined as 5 by default.

Right, UCLAMP_BUCKETS.

>> cache line.
>> That was the optimal space/time complexity compromise we found to
>> get the MAX of a set of values.
>
> It actually covers two cachelines, see below and my other email to
> Mel.

The two cache lines are covered if you consider both min and max clamps.
One single CLAMP_ID has a _size_ which fits into a single cache line.

However, to be precise:
- while uclamp_min spans a single cache line, uclamp_max is likely
  across two
- at enqueue/dequeue time we update both min/max, thus we can touch both
  cache lines

>> > Commenting this whole path out strangely doesn't just 'fix' it,
>> > but produces better results than the no-uclamp kernel :-/
>> >
>> > # ./perf bench -r 20 sched pipe -T -l 5
>> > Without uclamp:    5039
>> > With uclamp:       4832
>> > With uclamp+patch: 5729
>>
>> I explain it below: with that code removed you never decrease the CPU's
>> uclamp value. Thus, the first time you schedule an RT task you go to MAX
>> OPP and stay there forever.
>
> Okay.
>
>> > It might be because schedutil gets biased differently by uclamp..? If I
>> > move to performance governor these numbers almost double.
>> >
>> > I don't know. But this prompted me to look closer and
>>
>> Just to resume, when a task is dequeued we can have only these cases:
>>
>> - uclamp(task) < uclamp(cpu):
>>   this happens when the task was co-scheduled with other tasks with
>>   higher clamp values which are still RUNNABLE.
>>   In this case there are no uclamp(cpu) updates.
>>
>> - uclamp(task) == uclamp(cpu):
>>   this happens when the task was one of the tasks defining the current
>>   uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
>>   tasks' clamp values.
>>
>> In this last case we do _not_ always need to do a uclamp(cpu) update.
>> Indeed the update is required _only_ when that task was _the last_ task
>> defining the current uclamp(cpu) value.
>>
>> In this case we use uclamp_rq_max_value() to do a linear scan of
>> UCLAMP_CNT values which fits into a single cache line.
> Again, I think you mean UCLAMP_BUCKETS here. Unless I missed something, they
> span 2 cachelines on 64bit machines and 64b cacheline size.

Correct:
- s/UCLAMP_CNT/UCLAMP_BUCKETS/
- 1 cacheline per CLAMP_ID
- the array scan works on 1 CLAMP_ID:
  - spanning 1 cache line for uclamp_min
  - spanning likely 2 cache lines for uclamp_max
Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
Hi Qais,

On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef wrote...

> On 06/03/20 16:59, Vincent Guittot wrote:
>> When I want to stress the fast path i usually use "perf bench sched pipe -T "
>> The tip/sched/core on my arm octo core gives the following results for
>> 20 iterations of perf bench sched pipe -T -l 5
>>
>> all uclamp config disabled  50035.4 (+/- 0.334%)
>> all uclamp config enabled   48749.8 (+/- 0.339%)  -2.64%

I used to run the same test but I don't remember such big numbers :/

>> It's quite easy to reproduce and probably easier to study the impact
>
> Thanks Vincent. This is very useful!
>
> I could reproduce that on my Juno.
>
> One of the codepaths I was suspecting seems to affect it.
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..9f48090eb926 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>  	 * e.g. due to future modification, warn and fixup the expected value.
>  	 */
>  	SCHED_WARN_ON(bucket->value > rq_clamp);
> +#if 0
>  	if (bucket->value >= rq_clamp) {
>  		bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
>  		WRITE_ONCE(uc_rq->value, bkt_clamp);
>  	}
> +#endif

Yep, that's likely where we have most of the overhead at dequeue time,
since _sometimes_ we need to update the cpu's clamp value.

However, while running perf sched pipe, I expect:
- all tasks to have the same clamp value
- all CPUs to have _always_ at least one RUNNABLE task

Given these two conditions above, if the CPU is never "CFS idle" (i.e.
without RUNNABLE CFS tasks), the code above should never be triggered.
More on that later...

> }
>
> static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
> uclamp_rq_max_value() could be expensive as it loops over all buckets.

It loops over UCLAMP_CNT values which are defined to fit into a single
cache line.
That was the optimal space/time complexity compromise we found to get
the MAX of a set of values.

> Commenting this whole path out strangely doesn't just 'fix' it,
> but produces better results than the no-uclamp kernel :-/
>
> # ./perf bench -r 20 sched pipe -T -l 5
> Without uclamp:    5039
> With uclamp:       4832
> With uclamp+patch: 5729

I explain it below: with that code removed you never decrease the CPU's
uclamp value. Thus, the first time you schedule an RT task you go to MAX
OPP and stay there forever.

> It might be because schedutil gets biased differently by uclamp..? If I
> move to performance governor these numbers almost double.
>
> I don't know. But this prompted me to look closer and

Just to resume, when a task is dequeued we can have only these cases:

- uclamp(task) < uclamp(cpu):
  this happens when the task was co-scheduled with other tasks with
  higher clamp values which are still RUNNABLE.
  In this case there are no uclamp(cpu) updates.

- uclamp(task) == uclamp(cpu):
  this happens when the task was one of the tasks defining the current
  uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
  tasks' clamp values.

In this last case we do _not_ always need to do a uclamp(cpu) update.
Indeed the update is required _only_ when that task was _the last_ task
defining the current uclamp(cpu) value.

In this case we use uclamp_rq_max_value() to do a linear scan of
UCLAMP_CNT values which fits into a single cache line.

> I think I spotted a bug where in the if condition we check for '>='
> instead of '>', causing us to take the supposedly impossible fail safe
> path.

The fail safe path is when the '>' condition matches, which is what the
SCHED_WARN_ON tells us. Indeed, we never expect uclamp(cpu) to be bigger
than that of one of its RUNNABLE tasks. If that should happen we WARN
and fix the cpu clamp value for the best.
The normal path is instead '==' and, according to my previous resume,
it's expected to be executed _only_ when we dequeue the last task of the
clamp group defining the current uclamp(cpu) value.

> Mind trying with the below patch please?
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..50d66d4016ff 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,7 +1063,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>  	 * e.g. due to future modification, warn and fixup the expected value.
>  	 */
>  	SCHED_WARN_ON(bucket->value > rq_clamp);
> -	if (bucket->value >= rq_clamp) {
> +	if (bucket->value > rq_clamp) {
>  		bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
>  		WRITE_ONCE(uc_rq->value, bkt_clamp);
>  	}

This patch is thus bogus, since we never expect to have uclamp(cpu)
bigger than uclamp(task), and thus we will never reset the clamp value
of a cpu.
Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
Hi Dietmar,

thanks for sharing these numbers.

On Tue, Jun 02, 2020 at 18:46:00 +0200, Dietmar Eggemann wrote...

[...]

> I ran these tests on 'Ubuntu 18.04 Desktop' on Intel E5-2690 v2
> (2 sockets * 10 cores * 2 threads) with powersave governor as:
>
> $ numactl -N 0 ./run-mmtests.sh XXX

Great setup, it's worth ruling out all possible noise sources (freq
scaling, thermal throttling, NUMA scheduler, etc.).

Wondering if disabling HT can also help here in reducing results
"noise"?

> w/ config-network-netperf-unbound.
>
> Running w/o 'numactl -N 0' gives slightly worse results.
>
> without-clamp     : CONFIG_UCLAMP_TASK is not set
> with-clamp        : CONFIG_UCLAMP_TASK=y,
>                     CONFIG_UCLAMP_TASK_GROUP is not set
> with-clamp-tskgrp : CONFIG_UCLAMP_TASK=y,
>                     CONFIG_UCLAMP_TASK_GROUP=y
>
> netperf-udp
>                  ./5.7.0-rc7        ./5.7.0-rc7        ./5.7.0-rc7
>                without-clamp         with-clamp  with-clamp-tskgrp

Can you please specify how to read the following scores? I gave my
local netperf a run and it reports Throughput, thus I would expect the
higher the better... but... this seems to be something different.

> Hmean send-64      153.62 (  0.00%)    151.80 * -1.19%*    155.60 *  1.28%*
> Hmean send-128     306.77 (  0.00%)    306.27 * -0.16%*    309.39 *  0.85%*
> Hmean send-256     608.54 (  0.00%)    604.28 * -0.70%*    613.42 *  0.80%*
> Hmean send-1024   2395.80 (  0.00%)   2365.67 * -1.26%*   2409.50 *  0.57%*
> Hmean send-2048   4608.70 (  0.00%)   4544.02 * -1.40%*   4665.96 *  1.24%*
> Hmean send-3312   7223.97 (  0.00%)   7158.88 * -0.90%*   7331.23 *  1.48%*
> Hmean send-4096   8729.53 (  0.00%)   8598.78 * -1.50%*   8860.47 *  1.50%*
> Hmean send-8192  14961.77 (  0.00%)  14418.92 * -3.63%*  14908.36 * -0.36%*
> Hmean send-16384 25799.50 (  0.00%)  25025.64 * -3.00%*  25831.20 *  0.12%*

If I read it as "the lower the score the better", all the above results
tell us that with-clamp is even better, while with-clamp-tskgrp is not
that much worse.
The other way around (the higher the score the better) would look odd:
since we definitely add more code and complexity when uclamp has the TG
support enabled, we would not expect better scores.

> Hmean recv-64      153.62 (  0.00%)    151.80 * -1.19%*    155.60 *  1.28%*
> Hmean recv-128     306.77 (  0.00%)    306.27 * -0.16%*    309.39 *  0.85%*
> Hmean recv-256     608.54 (  0.00%)    604.28 * -0.70%*    613.42 *  0.80%*
> Hmean recv-1024   2395.80 (  0.00%)   2365.67 * -1.26%*   2409.50 *  0.57%*
> Hmean recv-2048   4608.70 (  0.00%)   4544.02 * -1.40%*   4665.95 *  1.24%*
> Hmean recv-3312   7223.97 (  0.00%)   7158.88 * -0.90%*   7331.23 *  1.48%*
> Hmean recv-4096   8729.53 (  0.00%)   8598.78 * -1.50%*   8860.47 *  1.50%*
> Hmean recv-8192  14961.61 (  0.00%)  14418.88 * -3.63%*  14908.30 * -0.36%*
> Hmean recv-16384 25799.39 (  0.00%)  25025.49 * -3.00%*  25831.00 *  0.12%*
>
> netperf-tcp
>
> Hmean 64      818.65 (  0.00%)    812.98 * -0.69%*    826.17 *  0.92%*
> Hmean 128    1569.55 (  0.00%)   1555.79 * -0.88%*   1586.94 *  1.11%*
> Hmean 256    2952.86 (  0.00%)   2915.07 * -1.28%*   2968.15 *  0.52%*
> Hmean 1024  10425.91 (  0.00%)  10296.68 * -1.24%*  10418.38 * -0.07%*
> Hmean 2048  17454.51 (  0.00%)  17369.57 * -0.49%*  17419.24 * -0.20%*
> Hmean 3312  22509.95 (  0.00%)  22229.69 * -1.25%*  22373.32 * -0.61%*
> Hmean 4096  25033.23 (  0.00%)  24859.59 * -0.69%*  24912.50 * -0.48%*
> Hmean 8192  32080.51 (  0.00%)  31744.51 * -1.05%*  31800.45 * -0.87%*
> Hmean 16384 36531.86 (  0.00%)  37064.68 *  1.46%*  37397.71 *  2.37%*
>
> The diffs are smaller than on openSUSE Leap 15.1 and some of the
> uclamp taskgroup results are better?
>
> With this test setup we now can play with the uclamp code in
> enqueue_task() and dequeue_task().
> ---
>
> W/ config-network-netperf-unbound (only netperf-udp and buffer size 64):
>
> $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp/perf.data
>   | grep activate_task
>
> # Event 'cycles:ppp'
> #
> # Baseline  Delta Abs  Shared Object     Symbol
>
>      0.02%     +0.54%  [kernel.vmlinux]  [k] activate_task
>      0.02%     +0.38%  [kernel.vmlinux]  [k] deactivate_task
>
> $ perf diff 5.7.0-rc7_without-clamp/perf.data
>   5.7.0-rc7_with-clamp-tskgrp/perf.data | grep activate_task
>
>      0.02%     +0.35%  [kernel.vmlinux]  [k]
Re: [PATCH v2] hwmon:(adm1275) Enable adm1278 ADM1278_TEMP1_EN
Hi Guenter,

Thanks for the initial look at this. One question for you below...

On Fri, May 29, 2020 at 10:30:16AM -0700, Guenter Roeck wrote:
> On 5/29/20 5:46 AM, Manikandan Elumalai wrote:
> > +	/* Enable TEMP1 by default */
> > +	config |= ADM1278_TEMP1_EN;
> > +	ret = i2c_smbus_write_byte_data(client,
> > +					ADM1275_PMON_CONFIG,
> > +					config);
> > +	if (ret < 0) {
> > +		dev_err(&client->dev,
> > +			"Failed to enable temperature config\n");
> > +		return -ENODEV;
> > +	}
>
> This can be handled in a single operation, together with ADM1278_VOUT_EN
> below. There is no need for two separate write operations.

I don't know if you noticed here, but the change ends up enabling
TEMP1_EN in all cases. Is this acceptable? If not, do you have any
preference on how it is selected for enablement?

> >  	/* Enable VOUT if not enabled (it is disabled by default) */
> >  	if (!(config & ADM1278_VOUT_EN)) {
> > @@ -681,9 +697,6 @@ static int adm1275_probe(struct i2c_client *client,
> >  		}
> >  	}
> >
> > -	if (config & ADM1278_TEMP1_EN)
> > -		info->func[0] |=
> > -			PMBUS_HAVE_TEMP | PMBUS_HAVE_STATUS_TEMP;
> >  	if (config & ADM1278_VIN_EN)
> >  		info->func[0] |= PMBUS_HAVE_VIN;
> >  	break;

--
Patrick Williams
Re: [PATCH] hwmon:(adm1275) Enable adm1278 ADM1278_TEMP1_EN
On Thu, May 28, 2020 at 07:45:23PM +0530, Manikandan Elumalai wrote:

Hi Manikandan,

Adding the PMBus maintainers...

> The adm1278 temperature sysfs attribute is needed for one of our
> openbmc platforms. This functionality is not enabled by default, so
> PMON_CONFIG needs to be modified in order to enable it.

Vijay already mentioned the Signed-off-by here. Since this is a kernel
patch and your first time contributing one, please read through:

    https://www.kernel.org/doc/html/latest/process/1.Intro.html

and the MAINTAINERS file.

Another thing you've missed is using the get_maintainer.pl script to
find out who you're supposed to CC. It is fine to have additional CCs,
but we're missing the pmbus maintainer on this patch.

> ---
>  drivers/hwmon/pmbus/adm1275.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/drivers/hwmon/pmbus/adm1275.c b/drivers/hwmon/pmbus/adm1275.c
> index 5caa37fb..47b293d 100644
> --- a/drivers/hwmon/pmbus/adm1275.c
> +++ b/drivers/hwmon/pmbus/adm1275.c
> @@ -681,6 +681,21 @@ static int adm1275_probe(struct i2c_client *client,
>  		}
>  	}
>
> +	config = i2c_smbus_read_byte_data(client, ADM1275_PMON_CONFIG);
> +	if (config < 0)
> +		return config;
> +
> +	/* Enable TEMP1 by default */
> +	config |= ADM1278_TEMP1_EN;
> +	ret = i2c_smbus_write_byte_data(client,
> +					ADM1275_PMON_CONFIG,
> +					config);
> +	if (ret < 0) {
> +		dev_err(&client->dev,
> +			"Failed to enable temperature config\n");
> +		return -ENODEV;
> +	}
> +

This code might work for your design, but likely doesn't work for
everyone and isn't likely to be accepted in its current state. I think
you need some kind of detection logic here to know if TEMP1_EN *should*
be enabled. Do we need a device-tree entry for this?

>  	if (config & ADM1278_TEMP1_EN)
>  		info->func[0] |=
>  			PMBUS_HAVE_TEMP | PMBUS_HAVE_STATUS_TEMP;
> --
> 2.7.4

--
Patrick Williams
Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
[+Giovanni]

On Thu, May 28, 2020 at 20:29:14 +0200, Peter Zijlstra wrote...

> On Thu, May 28, 2020 at 05:51:31PM +0100, Qais Yousef wrote:
>> I had a humble try at catching the overhead but wasn't successful. The
>> observation wasn't missed by us too, then.
>
> Right, I remember us doing benchmarks when we introduced all this and
> clearly we missed something. It would be good if Mel could share which
> benchmark hurt most so we can go have a look.

Indeed, it would be great to have a description of their test setup and
results. Perhaps Giovanni can also support us on that.
Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
Hi Qais,

I see we are converging toward the final shape. :)

Function-wise the code looks ok to me now. Lemme just point out a few
more remarks and possible nit-picks. I guess at the end it's up to you
to decide if you wanna follow up with a v6 and to the maintainers to
decide how picky they wanna be.

Otherwise, FWIW, feel free to consider this a LGTM.

Best,
Patrick

On Mon, May 11, 2020 at 17:40:52 +0200, Qais Yousef wrote...

[...]

> +static inline void uclamp_sync_util_min_rt_default(struct task_struct *p,
> +						   enum uclamp_id clamp_id)
> +{
> +	unsigned int default_util_min = sysctl_sched_uclamp_util_min_rt_default;
> +	struct uclamp_se *uc_se;
> +
> +	/* Only sync for UCLAMP_MIN and RT tasks */
> +	if (clamp_id != UCLAMP_MIN || !rt_task(p))
> +		return;
> +
> +	uc_se = &p->uclamp_req[UCLAMP_MIN];

I went back to the v3 version, where this was done above:

   https://lore.kernel.org/lkml/20200429113255.ga19...@codeaurora.org/
   Message-ID: 20200429113255.ga19...@codeaurora.org

and still I don't see why we want to keep it after this first check?
IMO it's just not required and it makes the code a tiny bit uglier.

> +
> +	/*
> +	 * Only sync if user didn't override the default request and the sysctl
> +	 * knob has changed.
> +	 */
> +	if (uc_se->user_defined || uc_se->value == default_util_min)
> +		return;
> +

nit-pick: the two comments above are stating the obvious.

> +	uclamp_se_set(uc_se, default_util_min, false);
> +}
> +
>  static inline struct uclamp_se
>  uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> @@ -907,8 +949,13 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  static inline struct uclamp_se
>  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> -	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
> -	struct uclamp_se uc_max = uclamp_default[clamp_id];
> +	struct uclamp_se uc_req, uc_max;
> +
> +	/* Sync up any change to sysctl_sched_uclamp_util_min_rt_default. */

same here: the comment is stating the obvious.
Maybe even just by using a different function name we better document the code, e.g. uclamp_rt_restrict(p, clamp_id); This will implicitly convey the purpose: RT tasks can be somehow further restricted, i.e. in addition to the TG restriction following. > + uclamp_sync_util_min_rt_default(p, clamp_id); > + > + uc_req = uclamp_tg_restrict(p, clamp_id); > + uc_max = uclamp_default[clamp_id]; > > /* System default restrictions always apply */ > if (unlikely(uc_req.value > uc_max.value)) [...]
Re: file system permissions regression affecting root
On Wed, May 13, 2020 at 9:11 AM Al Viro wrote: > > On Wed, May 13, 2020 at 08:00:28AM -0700, Patrick Donnelly wrote: > > In newer kernels (at least 5.6), it appears root is not able to write > > to files owned by other users in a sticky directory: > > Yes. Controlled by /proc/sys/fs/protected_regular, which systemd crowd > has decided to enable in commit 2732587540035227fe59e4b64b60127352611b35 > [...] Thanks for the information Al! However, it seems odd that this depends on the owner of the directory. i.e. this protection only seems to be enforced if the sticky directory is owned by root. That's expected? -- Patrick Donnelly
file system permissions regression affecting root
> gtmp/foo"], 0x7fff588732f0 /* 15 vars */) = 0
openat(AT_FDCWD, "gtmp/foo", O_WRONLY|O_CREAT|O_TRUNC, 0666) = -1 EACCES (Permission denied)
write(2, "/bin/sh: gtmp/foo: Permission de"..., 37/bin/sh: gtmp/foo: Permission denied

--
Patrick Donnelly
Re: [PATCH v4 2/2] Documentation/sysctl: Document uclamp sysctl knobs
Hi Qais, On Tue, May 05, 2020 at 16:56:37 +0200, Qais Yousef wrote... >> > +sched_util_clamp_min_rt_default: >> > + >> > + >> > +By default Linux is tuned for performance. Which means that RT tasks >> > always run >> > +at the highest frequency and most capable (highest capacity) CPU (in >> > +heterogeneous systems). >> > + >> > +Uclamp achieves this by setting the requested uclamp.min of all RT tasks >> > to >> > +SCHED_CAPACITY_SCALE (1024) by default, which effectively boosts the >> > tasks to >> > +run at the highest frequency and biases them to run on the biggest CPU. >> > + >> > +This knob allows admins to change the default behavior when uclamp is >> > being >> > +used. In battery powered devices particularly, running at the maximum >> > +capacity and frequency will increase energy consumption and shorten the >> > battery >> > +life. >> > + >> > +This knob is only effective for RT tasks which the user hasn't modified >> > their >> > +requested uclamp.min value via sched_setattr() syscall. >> > + >> > +This knob will not escape the constraint imposed by sched_util_clamp_min >> > +defined above. >> >> Perhaps it's worth to specify that this value is going to be clamped by >> the values above? Otherwise it's a bit ambiguous to know what happen >> when it's bigger than schedu_util_clamp_min. > > Hmm for me that sentence says exactly what you're asking for. > > So what you want is > > s/will not escape the constraint imposed by/will be clamped by/ > > ? > > I'm not sure if this will help if the above is already ambiguous. Maybe if > I explicitly say > > ..will not escape the *range* constrained imposed by.. > > sched_util_clamp_min is already defined as a range constraint, so hopefully it > should hit the mark better now? Right, that also can work. >> >> > +Any modification is applied lazily on the next opportunity the scheduler >> > needs >> > +to calculate the effective value of uclamp.min of the task. 
>> ^ >> >> This is also an implementation detail, I would remove it. > > The idea is that this value is not updated 'immediately'/synchronously. So > currently RUNNING tasks will not see the effect, which could generate > confusion > when users trip over it. IMO giving an idea of how it's updated will help with > expectation of the users. I doubt any will care, but I think it's an important > behavior element that is worth conveying and documenting. I'd be happy to > reword it if necessary. Right, I agree on giving an hint on the lazy update. What I was pointing out was mainly the reference to the 'effective' value. Maybe we can just drop that word. > I have this now > > """ > 984 This knob will not escape the range constraint imposed by > sched_util_clamp_min > 985 defined above. > 986 > 987 For example if > 988 > 989 sched_util_clamp_min_rt_default = 800 > 990 sched_util_clamp_min = 600 > 991 > 992 Then the boost will be clamped to 600 because 800 is outside of the > permissible > 993 range of [0:600]. This could happen for instance if a powersave mode will > 994 restrict all boosts temporarily by modifying sched_util_clamp_min. As > soon as > 995 this restriction is lifted, the requested sched_util_clamp_min_rt_default > 996 will take effect. > 997 > 998 Any modification is applied lazily to currently running tasks and should > be > 999 visible by the next wakeup. > """ That's better IMHO, would just slightly change the last sentence to: Any modification is applied lazily to tasks and is effective starting from their next wakeup. Best, Patrick
Re: [PATCH v4 2/2] Documentation/sysctl: Document uclamp sysctl knobs
Hi Qais, On Fri, May 01, 2020 at 13:49:27 +0200, Qais Yousef wrote... [...] > diff --git a/Documentation/admin-guide/sysctl/kernel.rst > b/Documentation/admin-guide/sysctl/kernel.rst > index 0d427fd10941..521c18ce3d92 100644 > --- a/Documentation/admin-guide/sysctl/kernel.rst > +++ b/Documentation/admin-guide/sysctl/kernel.rst > @@ -940,6 +940,54 @@ Enables/disables scheduler statistics. Enabling this > feature > incurs a small amount of overhead in the scheduler but is > useful for debugging and performance tuning. > > +sched_util_clamp_min: > += > + > +Max allowed *minimum* utilization. > + > +Default value is SCHED_CAPACITY_SCALE (1024), which is the maximum possible ^^^ Mmm... I feel one of the two is an implementation detail which should probably not be exposed? The user perhaps needs to know the value (1024) but we don't need to expose the internal representation. > +value. > + > +It means that any requested uclamp.min value cannot be greater than > +sched_util_clamp_min, i.e., it is restricted to the range > +[0:sched_util_clamp_min]. > + > +sched_util_clamp_max: > += > + > +Max allowed *maximum* utilization. > + > +Default value is SCHED_CAPACITY_SCALE (1024), which is the maximum possible > +value. > + > +It means that any requested uclamp.max value cannot be greater than > +sched_util_clamp_max, i.e., it is restricted to the range > +[0:sched_util_clamp_max]. > + > +sched_util_clamp_min_rt_default: > + > + > +By default Linux is tuned for performance. Which means that RT tasks always > run > +at the highest frequency and most capable (highest capacity) CPU (in > +heterogeneous systems). > + > +Uclamp achieves this by setting the requested uclamp.min of all RT tasks to > +SCHED_CAPACITY_SCALE (1024) by default, which effectively boosts the tasks to > +run at the highest frequency and biases them to run on the biggest CPU. > + > +This knob allows admins to change the default behavior when uclamp is being > +used. 
In battery powered devices particularly, running at the maximum
> +capacity and frequency will increase energy consumption and shorten the
> +battery life.
> +
> +This knob is only effective for RT tasks which the user hasn't modified their
> +requested uclamp.min value via sched_setattr() syscall.
> +
> +This knob will not escape the constraint imposed by sched_util_clamp_min
> +defined above.

Perhaps it's worth specifying that this value is going to be clamped by
the values above? Otherwise it's a bit ambiguous to know what happens
when it's bigger than sched_util_clamp_min.

> +Any modification is applied lazily on the next opportunity the scheduler needs
> +to calculate the effective value of uclamp.min of the task.
  ^
This is also an implementation detail, I would remove it.

>
>  seccomp
>  =======

Best,
Patrick
Re: [PATCH v4 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
>  	/* System default restrictions always apply */
>  	if (unlikely(uc_req.value > uc_max.value))
> @@ -1114,12 +1163,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  				loff_t *ppos)
>  {
>  	bool update_root_tg = false;
> -	int old_min, old_max;
> +	int old_min, old_max, old_min_rt;
>  	int result;
>
>  	mutex_lock(&uclamp_mutex);
>  	old_min = sysctl_sched_uclamp_util_min;
>  	old_max = sysctl_sched_uclamp_util_max;
> +	old_min_rt = sysctl_sched_uclamp_util_min_rt_default;
>
>  	result = proc_dointvec(table, write, buffer, lenp, ppos);
>  	if (result)
> @@ -1133,6 +1183,18 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  		goto undo;
>  	}
>
> +	/*
> +	 * The new value will be applied to RT tasks the next time the
> +	 * scheduler needs to calculate the effective uclamp.min for that task,
> +	 * assuming the task is using the system default and not a user
> +	 * specified value. In the latter we shall leave the value as the user
> +	 * requested.

IMO it does not make sense to explain here what you will do with this
value. This will make it even more complicated to maintain the comment
above if the code using it should change in the future.

So, if the code where we use the knob is not clear enough, maybe we can
move this comment to the description of:

   uclamp_sync_util_min_rt_default()

or make it part of the documentation of:

   sysctl_sched_uclamp_util_min_rt_default

By doing that you can also just add this if condition with the previous
ones.
> +	 */
> +	if (sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
> +		result = -EINVAL;
> +		goto undo;
> +	}
> +
>  	if (old_min != sysctl_sched_uclamp_util_min) {
>  		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
>  			      sysctl_sched_uclamp_util_min, false);
> @@ -1158,6 +1220,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  undo:
>  	sysctl_sched_uclamp_util_min = old_min;
>  	sysctl_sched_uclamp_util_max = old_max;
> +	sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
>  done:
>  	mutex_unlock(&uclamp_mutex);
>
> @@ -1200,9 +1263,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
>  		if (uc_se->user_defined)
>  			continue;
>
> -		/* By default, RT tasks always get 100% boost */
> +		/*
> +		 * By default, RT tasks always get 100% boost, which the admins
> +		 * are allowed to change via
> +		 * sysctl_sched_uclamp_util_min_rt_default knob.
> +		 */
>  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> -			clamp_value = uclamp_none(UCLAMP_MAX);
> +			clamp_value = sysctl_sched_uclamp_util_min_rt_default;

Mmm... I suspect we don't need this anymore.

If the task has a user_defined value, we skip this anyway. If the task
does not have a user_defined value, we will set this anyway at each
enqueue time. No?

>
>  		uclamp_se_set(uc_se, clamp_value, false);
>  	}
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8a176d8727a3..64117363c502 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -453,6 +453,13 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= sysctl_sched_uclamp_handler,
>  	},
> +	{
> +		.procname	= "sched_util_clamp_min_rt_default",
> +		.data		= &sysctl_sched_uclamp_util_min_rt_default,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= sysctl_sched_uclamp_handler,
> +	},
>  #endif
>  #ifdef CONFIG_SCHED_AUTOGROUP
>  	{

Best,
Patrick
Re: [PATCH v7] ARM: DTS: Aspeed: Add YADRO Nicole BMC
On Wed, Apr 29, 2020 at 02:37:11PM +0300, Alexander Filippov wrote: > Nicole is an OpenPower machine with an Aspeed 2500 BMC SoC manufactured > by YADRO. > > Signed-off-by: Alexander Filippov > --- > arch/arm/boot/dts/Makefile | 1 + > arch/arm/boot/dts/aspeed-bmc-opp-nicole.dts | 326 > 2 files changed, 327 insertions(+) > create mode 100644 arch/arm/boot/dts/aspeed-bmc-opp-nicole.dts > Reviewed-by: Patrick Williams -- Patrick Williams
Re: [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases
Hi Peter, On 14-Oct 16:52, Peter Zijlstra wrote: > > The energy aware schedutil patches reminded me this was still pending. > > On Fri, Aug 02, 2019 at 10:47:25AM +0100, Patrick Bellasi wrote: > > Hi Peter, Vincent, > > is there anything different I can do on this? > > I think both Vincent and me are basically fine with the patch, it was > the Changelog/explanation for it that sat uneasy. > > Specifically I think the 'confusion' around the PELT invariance stuff > doesn't help. > > I think that if you present it simply as making util_est directly follow > upward motion and only decay on downward -- and the rationale for it -- > then it should be fine. Ok, I'll update the commit message to remove the PELT related ambiguity and post a new version soon. Cheers, Patrick -- #include Patrick Bellasi
Re: [PATCH 2/2] firmware: coreboot: Export active CBFS partition
On Wed, 2019-10-09 at 14:19 -0700, Julius Werner wrote: > > Somehow we've gotten /sys/firmware/log to be the coreboot log, and > > quite > > frankly that blows my mind that this path was accepted upstream. > > Userspace has to know it's running on coreboot firmware to know > > that > > /sys/firmware/log is actually the coreboot log. > > Not really sure I understand your concern here? That's the generic > node for the log from the mainboard firmware, whatever it is. It was > originally added for non-coreboot firmware and that use is still > supported. If some other non-coreboot firmware wants to join in, it's > welcome to do so -- the interface is separated out enough to make it > easy to add more backends. > > I do agree that if we want to add other, more coreboot-specific > nodes, > they should be explicitly namespaced. > I'll add a new namespace called 'coreboot'. > > But I also wonder why this is being exposed by the kernel at all? > It's difficult for userspace tools to find out how to access the flash and then to find the FMAP, which resides somewhere in it on 4KiB boundary. The "boot media params" usually give the offset to the FMAP from beginning of the flash, but tell nothing about how to access it. > I share Stephen's concern that I'm not sure this belongs in the > kernel > at all. There are existing ways for userspace to access this > information like the cbmem utility does... if you want it accessible > from fwupd, it could chain-call into cbmem or we could factor that > functionality out into a library. If you want to get away from using > /dev/mem for this, we could maybe add a driver that exports CBMEM or > coreboot table areas via sysfs... but then I think that should be a > generic driver which makes them all accessible in one go, rather than > having to add yet another driver whenever someone needs to parse > another coreboot table blob for some reason. 
We could design an > interface like /sys/firmware/coreboot/table/ where every entry > in > the table gets exported as a binary file. I don't even consider using binaries that operate on /dev/mem. In my opinion CBMEM is something coreboot internal, which the OS or userspace shouldn't even know about. > I think a specific sysfs driver only makes sense for things that are > human readable and that you'd actually expect a human to want to go > read directly, like the log. Maybe exporting FMAP entries one by one > like Stephen suggests could be such a case, but I doubt that there's > a > common enough need for that since there are plenty of existing ways > to > show FMAP entries from userspace (and if there was a generic > interface > like /sys/firmware/coreboot/table/37 to access it, we could just add > a > new flag to the dump_fmap utility to read it from there). I'll expose the coreboot tables using a sysfs driver, which then can be used by coreboot tools instead of accessing /dev/mem. As it holds the FMAP and "boot media params", that's all I need for now. The downside is that the userspace tools need to be kept in sync with the binary interface the coreboot tables are providing.
[PATCH 1/2] firmware: coreboot: Export the binary FMAP
From: Patrick Rudolph Expose coreboot's binary FMAP[1] to /sys/firmware/fmap. coreboot copies the FMAP to a CBMEM buffer at boot since CB:35377[2], allowing an architecture independent way of exposing the FMAP to userspace. Will be used by fwupd[3] to determine the current firmware layout. [1]: https://doc.coreboot.org/lib/flashmap.html [2]: https://review.coreboot.org/c/coreboot/+/35377 [3]: https://fwupd.org/ Signed-off-by: Patrick Rudolph --- drivers/firmware/google/Kconfig | 8 ++ drivers/firmware/google/Makefile | 1 + drivers/firmware/google/fmap-coreboot.c | 156 ++ drivers/firmware/google/fmap-coreboot.h | 13 ++ drivers/firmware/google/fmap_serialized.h | 59 5 files changed, 237 insertions(+) create mode 100644 drivers/firmware/google/fmap-coreboot.c create mode 100644 drivers/firmware/google/fmap-coreboot.h create mode 100644 drivers/firmware/google/fmap_serialized.h diff --git a/drivers/firmware/google/Kconfig b/drivers/firmware/google/Kconfig index a3a6ca659ffa..5fbbd7b8fef6 100644 --- a/drivers/firmware/google/Kconfig +++ b/drivers/firmware/google/Kconfig @@ -74,4 +74,12 @@ config GOOGLE_VPD This option enables the kernel to expose the content of Google VPD under /sys/firmware/vpd. +config GOOGLE_FMAP + tristate "Coreboot FMAP access" + depends on GOOGLE_COREBOOT_TABLE + help + This option enables the kernel to search for a Google FMAP in + the coreboot table. If found, this binary file is exported to userland + in the file /sys/firmware/fmap. 
+ endif # GOOGLE_FIRMWARE diff --git a/drivers/firmware/google/Makefile b/drivers/firmware/google/Makefile index d17caded5d88..6d31fe167700 100644 --- a/drivers/firmware/google/Makefile +++ b/drivers/firmware/google/Makefile @@ -6,6 +6,7 @@ obj-$(CONFIG_GOOGLE_FRAMEBUFFER_COREBOOT) += framebuffer-coreboot.o obj-$(CONFIG_GOOGLE_MEMCONSOLE)+= memconsole.o obj-$(CONFIG_GOOGLE_MEMCONSOLE_COREBOOT) += memconsole-coreboot.o obj-$(CONFIG_GOOGLE_MEMCONSOLE_X86_LEGACY) += memconsole-x86-legacy.o +obj-$(CONFIG_GOOGLE_FMAP) += fmap-coreboot.o vpd-sysfs-y := vpd.o vpd_decode.o obj-$(CONFIG_GOOGLE_VPD) += vpd-sysfs.o diff --git a/drivers/firmware/google/fmap-coreboot.c b/drivers/firmware/google/fmap-coreboot.c new file mode 100644 index ..14050030ebc6 --- /dev/null +++ b/drivers/firmware/google/fmap-coreboot.c @@ -0,0 +1,156 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * fmap-coreboot.c + * + * Exports the binary FMAP through coreboot table. + * + * Copyright 2012-2013 David Herrmann + * Copyright 2017 Google Inc. + * Copyright 2019 9elements Agency GmbH + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "coreboot_table.h" +#include "fmap_serialized.h" + +#define CB_TAG_FMAP 0x37 + +static void *fmap; +static u32 fmap_size; + +/* + * Convert FMAP region to name. + * The caller has to free the string. + * Return NULL if no containing region was found. 
+ */ +const char *coreboot_fmap_region_to_name(const u32 start, const u32 size) +{ + const char *name = NULL; + struct fmap *iter; + u32 size_old = ~0; + int i; + + iter = fmap; + /* Find smallest containing region */ + for (i = 0; i < iter->nareas && fmap; i++) { + if (iter->areas[i].offset <= start && + iter->areas[i].size >= size && + iter->areas[i].size <= size_old) { + size_old = iter->areas[i].size; + name = iter->areas[i].name; + } + } + + if (name) + return kstrdup(name, GFP_KERNEL); + return NULL; +} +EXPORT_SYMBOL(coreboot_fmap_region_to_name); + +static ssize_t fmap_read(struct file *filp, struct kobject *kobp, +struct bin_attribute *bin_attr, char *buf, +loff_t pos, size_t count) +{ + if (!fmap) + return -ENODEV; + + return memory_read_from_buffer(buf, count, , fmap, fmap_size); +} + +static int fmap_probe(struct coreboot_device *dev) +{ + struct lb_cbmem_ref *cbmem_ref = >cbmem_ref; + struct fmap *header; + + if (!cbmem_ref) + return -ENODEV; + + header = memremap(cbmem_ref->cbmem_addr, sizeof(*header), MEMREMAP_WB); + if (!header) { + pr_warn("coreboot: Failed to remap FMAP\n"); + return -ENOMEM; + } + + /* Validate FMAP signature */ + if (memcmp(header->signature, FMAP_SIGNATURE, + sizeof(header->signature))) { + pr_warn("coreboot: FMAP signature mismatch\n"); + memunmap(header); + return -ENODEV; + } + + /* Validate FMAP version *
[PATCH 2/2] firmware: coreboot: Export active CBFS partition
From: Patrick Rudolph Expose the name of the active CBFS partition under /sys/firmware/cbfs_active_partition In case of Google's VBOOT[1] that currently can be one of: "FW_MAIN_A", "FW_MAIN_B" or "COREBOOT" Will be used by fwupd[2] to determine the active partition in the VBOOT A/B partition update scheme. [1]: https://doc.coreboot.org/security/vboot/index.html [2]: https://fwupd.org/ Signed-off-by: Patrick Rudolph --- drivers/firmware/google/Kconfig | 10 ++ drivers/firmware/google/Makefile | 1 + drivers/firmware/google/bootmedia-coreboot.c | 115 +++ drivers/firmware/google/coreboot_table.h | 13 +++ 4 files changed, 139 insertions(+) create mode 100644 drivers/firmware/google/bootmedia-coreboot.c diff --git a/drivers/firmware/google/Kconfig b/drivers/firmware/google/Kconfig index 5fbbd7b8fef6..f58b723d77de 100644 --- a/drivers/firmware/google/Kconfig +++ b/drivers/firmware/google/Kconfig @@ -82,4 +82,14 @@ config GOOGLE_FMAP the coreboot table. If found, this binary file is exported to userland in the file /sys/firmware/fmap. +config GOOGLE_BOOTMEDIA + tristate "Coreboot bootmedia params access" + depends on GOOGLE_COREBOOT_TABLE + depends on GOOGLE_FMAP + help + This option enables the kernel to search for a boot media params + table, providing the active CBFS partition. If found, the name of + the partition is exported to userland in the file + /sys/firmware/cbfs_active_partition. 
+ endif # GOOGLE_FIRMWARE diff --git a/drivers/firmware/google/Makefile b/drivers/firmware/google/Makefile index 6d31fe167700..e47c59376566 100644 --- a/drivers/firmware/google/Makefile +++ b/drivers/firmware/google/Makefile @@ -7,6 +7,7 @@ obj-$(CONFIG_GOOGLE_MEMCONSOLE)+= memconsole.o obj-$(CONFIG_GOOGLE_MEMCONSOLE_COREBOOT) += memconsole-coreboot.o obj-$(CONFIG_GOOGLE_MEMCONSOLE_X86_LEGACY) += memconsole-x86-legacy.o obj-$(CONFIG_GOOGLE_FMAP) += fmap-coreboot.o +obj-$(CONFIG_GOOGLE_BOOTMEDIA) += bootmedia-coreboot.o vpd-sysfs-y := vpd.o vpd_decode.o obj-$(CONFIG_GOOGLE_VPD) += vpd-sysfs.o diff --git a/drivers/firmware/google/bootmedia-coreboot.c b/drivers/firmware/google/bootmedia-coreboot.c new file mode 100644 index ..09c0caedaf05 --- /dev/null +++ b/drivers/firmware/google/bootmedia-coreboot.c @@ -0,0 +1,115 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * bootmedia-coreboot.c + * + * Exports the active VBOOT partition name through boot media params. + * + * Copyright 2012-2013 David Herrmann + * Copyright 2017 Google Inc. 
+ * Copyright 2019 Patrick Rudolph + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "coreboot_table.h" +#include "fmap-coreboot.h" + +#define CB_TAG_BOOT_MEDIA_PARAMS 0x30 + +/* The current CBFS partition */ +static char *name; + +static ssize_t cbfs_active_partition_read(struct file *filp, + struct kobject *kobp, + struct bin_attribute *bin_attr, + char *buf, + loff_t pos, size_t count) +{ + if (!name) + return -ENODEV; + + return memory_read_from_buffer(buf, count, , name, strlen(name)); +} + +static int bmp_probe(struct coreboot_device *dev) +{ + struct lb_boot_media_params *b = >bmp; + const char *tmp; + + /* Sanity checks on the data we got */ + if ((b->cbfs_offset == ~0) || + b->cbfs_size == 0 || + ((b->cbfs_offset + b->cbfs_size) > b->boot_media_size)) { + pr_warn("coreboot: Boot media params contains invalid data\n"); + return -ENODEV; + } + + tmp = coreboot_fmap_region_to_name(b->cbfs_offset, b->cbfs_size); + if (!tmp) { + pr_warn("coreboot: Active CBFS region not found in FMAP\n"); + return -ENODEV; + } + + name = devm_kmalloc(>dev, strlen(tmp) + 2, GFP_KERNEL); + if (!name) { + kfree(tmp); + return -ENODEV; + } + snprintf(name, strlen(tmp) + 2, "%s\n", tmp); + + kfree(tmp); + + return 0; +} + +static int bmp_remove(struct coreboot_device *dev) +{ + struct platform_device *pdev = dev_get_drvdata(>dev); + + platform_device_unregister(pdev); + + return 0; +} + +static struct coreboot_driver bmp_driver = { + .probe = bmp_probe, + .remove = bmp_remove, + .drv = { + .name = "bootmediaparams", + }, + .tag = CB_TAG_BOOT_MEDIA_PARAMS, +}; + +static struct bin_attribute cbfs_partition_bin_attr = { + .attr = {.name = "cbfs_active_partition"
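The sanity checks at the top of bmp_probe() above (reject an unset offset, a zero size, or a CBFS window that runs past the end of the boot media) can be modelled in a few lines of userspace C. Field names mirror the driver; the 64-bit field widths here are an assumption for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the boot media params fields
 * validated in bmp_probe(). */
struct boot_media_params {
	uint64_t cbfs_offset;
	uint64_t cbfs_size;
	uint64_t boot_media_size;
};

/* Returns 1 when the params describe a plausible CBFS window. */
static int params_valid(const struct boot_media_params *b)
{
	if (b->cbfs_offset == UINT64_MAX ||	/* offset never filled in */
	    b->cbfs_size == 0 ||
	    b->cbfs_offset + b->cbfs_size > b->boot_media_size)
		return 0;
	return 1;
}
```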
Re: [PATCH 1/2] i2c: pxa: migrate to new i2c_slave APIs
Thanks for the review Andy. On Tue, Oct 01, 2019 at 07:29:13PM +0300, Andy Shevchenko wrote: > > > On Tue, Oct 01, 2019 at 10:59:59AM -0500, Patrick Williams wrote: > There are quite a few people in the Cc list. I'm not sure they all are > interested in this. I deliberately dropped few names, sorry, if I was > mistaken. Agree it was kind of a big list. Just chose what was given to me by get_maintainer.pl. It seems like there isn't a direct identified maintainer of this file. > > > + if (isr & ISR_RWM) { > > + u8 byte = 0; > > + > > + i2c_slave_event(i2c->slave, I2C_SLAVE_READ_REQUESTED, > > + ); > > + writel(byte, _IDBR(i2c)); > > + } else { > > + i2c_slave_event(i2c->slave, I2C_SLAVE_WRITE_REQUESTED, > > + NULL); > > + } > > Hmm... Perhaps > > u8 byte = 0; > > i2c_slave_event(i2c->slave, I2C_SLAVE_READ_REQUESTED, ); > if (isr & ISR_RWM) > writel(byte, _IDBR(i2c)); > The two different paths also require READ_REQUEST vs WRITE_REQUESTED. I could do a ternary there but it seemed more obvious to just unroll the logic. -- - Patrick
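The reason the branch stays unrolled rather than becoming a ternary can be shown with a small userspace model: the two paths differ in both the event delivered and whether a byte must be pushed back into the data register, so merging them would only halve one of the two differences. The ISR_RWM value and the event enum here are illustrative stand-ins, not the driver's definitions:

```c
#include <assert.h>

#define ISR_RWM 0x1	/* assumed bit value, for illustration only */

enum slave_event { SLAVE_READ_REQUESTED, SLAVE_WRITE_REQUESTED };

/* Model of the slave-start dispatch: a master read both raises a
 * different event and requires priming the data register, which is
 * why the original code keeps the two arms separate. */
static enum slave_event start_event(unsigned int isr, int *needs_writeback)
{
	if (isr & ISR_RWM) {
		*needs_writeback = 1;	/* byte must go back to IDBR */
		return SLAVE_READ_REQUESTED;
	}
	*needs_writeback = 0;
	return SLAVE_WRITE_REQUESTED;
}
```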
[PATCH 2/2] i2c: slave-eeprom: support additional models
Add support for emulating the following EEPROMs: * 24c01 - 1024 bit * 24c128 - 128k bit * 24c256 - 256k bit * 24c512 - 512k bit The flag bits in the device id were shifted up 1 bit to make room for saving the 24c512's size. 24c512 uses the full 16-bit address space of a 2-byte addressable EEPROM. Signed-off-by: Patrick Williams --- drivers/i2c/i2c-slave-eeprom.c | 14 +++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/drivers/i2c/i2c-slave-eeprom.c b/drivers/i2c/i2c-slave-eeprom.c index efee56106251..65419441313b 100644 --- a/drivers/i2c/i2c-slave-eeprom.c +++ b/drivers/i2c/i2c-slave-eeprom.c @@ -37,9 +37,9 @@ struct eeprom_data { u8 buffer[]; }; -#define I2C_SLAVE_BYTELEN GENMASK(15, 0) -#define I2C_SLAVE_FLAG_ADDR16 BIT(16) -#define I2C_SLAVE_FLAG_RO BIT(17) +#define I2C_SLAVE_BYTELEN GENMASK(16, 0) +#define I2C_SLAVE_FLAG_ADDR16 BIT(17) +#define I2C_SLAVE_FLAG_RO BIT(18) #define I2C_SLAVE_DEVICE_MAGIC(_len, _flags) ((_flags) | (_len)) static int i2c_slave_eeprom_slave_cb(struct i2c_client *client, @@ -171,12 +171,20 @@ static int i2c_slave_eeprom_remove(struct i2c_client *client) } static const struct i2c_device_id i2c_slave_eeprom_id[] = { + { "slave-24c01", I2C_SLAVE_DEVICE_MAGIC(1024 / 8, 0) }, + { "slave-24c01ro", I2C_SLAVE_DEVICE_MAGIC(1024 / 8, I2C_SLAVE_FLAG_RO) }, { "slave-24c02", I2C_SLAVE_DEVICE_MAGIC(2048 / 8, 0) }, { "slave-24c02ro", I2C_SLAVE_DEVICE_MAGIC(2048 / 8, I2C_SLAVE_FLAG_RO) }, { "slave-24c32", I2C_SLAVE_DEVICE_MAGIC(32768 / 8, I2C_SLAVE_FLAG_ADDR16) }, { "slave-24c32ro", I2C_SLAVE_DEVICE_MAGIC(32768 / 8, I2C_SLAVE_FLAG_ADDR16 | I2C_SLAVE_FLAG_RO) }, { "slave-24c64", I2C_SLAVE_DEVICE_MAGIC(65536 / 8, I2C_SLAVE_FLAG_ADDR16) }, { "slave-24c64ro", I2C_SLAVE_DEVICE_MAGIC(65536 / 8, I2C_SLAVE_FLAG_ADDR16 | I2C_SLAVE_FLAG_RO) }, + { "slave-24c128", I2C_SLAVE_DEVICE_MAGIC(131072 / 8, I2C_SLAVE_FLAG_ADDR16) }, + { "slave-24c128ro", I2C_SLAVE_DEVICE_MAGIC(131072 / 8, I2C_SLAVE_FLAG_ADDR16 | I2C_SLAVE_FLAG_RO) }, + { "slave-24c256", 
I2C_SLAVE_DEVICE_MAGIC(262144 / 8, I2C_SLAVE_FLAG_ADDR16) }, + { "slave-24c256ro", I2C_SLAVE_DEVICE_MAGIC(262144 / 8, I2C_SLAVE_FLAG_ADDR16 | I2C_SLAVE_FLAG_RO) }, + { "slave-24c512", I2C_SLAVE_DEVICE_MAGIC(524288 / 8, I2C_SLAVE_FLAG_ADDR16) }, + { "slave-24c512ro", I2C_SLAVE_DEVICE_MAGIC(524288 / 8, I2C_SLAVE_FLAG_ADDR16 | I2C_SLAVE_FLAG_RO) }, { } }; MODULE_DEVICE_TABLE(i2c, i2c_slave_eeprom_id); -- 2.17.2 (Apple Git-113)
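The one-bit shift of the flag bits can be checked with quick arithmetic: a 24c512 holds 524288 / 8 = 65536 bytes, which needs 17 bits, while the old GENMASK(15, 0) length field tops out at 65535. A userspace sketch (with assumed stand-ins for the kernel's GENMASK()/BIT() macros) demonstrates the new packing:

```c
#include <assert.h>

/* Assumed userspace equivalents of the kernel macros. */
#define GENMASK_U(h, l)	(((~0u) << (l)) & (~0u >> (31 - (h))))
#define BIT_U(n)	(1u << (n))

/* New layout from the patch: 17 size bits, flags shifted up one. */
#define I2C_SLAVE_BYTELEN	GENMASK_U(16, 0)
#define I2C_SLAVE_FLAG_ADDR16	BIT_U(17)
#define I2C_SLAVE_FLAG_RO	BIT_U(18)
#define I2C_SLAVE_DEVICE_MAGIC(len, flags)	((flags) | (len))

/* Recover the byte length from a packed device-id magic. */
static unsigned int magic_to_size(unsigned int magic)
{
	return magic & I2C_SLAVE_BYTELEN;
}
```

With the old 16-bit mask, the 24c512's 65536-byte length would have decoded to zero, which is exactly what the shift avoids.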
[PATCH 1/2] i2c: slave-eeprom: initialize empty eeprom properly
The i2c-slave-eeprom driver emulates at24 class eeprom devices, which come initialized with all 1s. Do the same in the software emulation. Signed-off-by: Patrick Williams --- drivers/i2c/i2c-slave-eeprom.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/i2c/i2c-slave-eeprom.c b/drivers/i2c/i2c-slave-eeprom.c index db9763cb4dae..efee56106251 100644 --- a/drivers/i2c/i2c-slave-eeprom.c +++ b/drivers/i2c/i2c-slave-eeprom.c @@ -131,6 +131,8 @@ static int i2c_slave_eeprom_probe(struct i2c_client *client, const struct i2c_de if (!eeprom) return -ENOMEM; + memset(eeprom->buffer, 0xFF, size); + eeprom->idx_write_cnt = 0; eeprom->num_address_bytes = flag_addr16 ? 2 : 1; eeprom->address_mask = size - 1; -- 2.17.2 (Apple Git-113)
[PATCH 2/2] i2c: pxa: remove unused i2c-slave APIs
With the i2c-pxa driver migrated to the standard i2c-slave APIs, the custom APIs and structures are no longer needed or used. Remove them. Signed-off-by: Patrick Williams --- drivers/i2c/busses/i2c-pxa.c | 1 - include/linux/i2c-pxa.h | 18 -- include/linux/platform_data/i2c-pxa.h | 4 3 files changed, 23 deletions(-) delete mode 100644 include/linux/i2c-pxa.h diff --git a/drivers/i2c/busses/i2c-pxa.c b/drivers/i2c/busses/i2c-pxa.c index c811646e809f..466e4f681d7a 100644 --- a/drivers/i2c/busses/i2c-pxa.c +++ b/drivers/i2c/busses/i2c-pxa.c @@ -25,7 +25,6 @@ #include #include #include -#include #include #include #include diff --git a/include/linux/i2c-pxa.h b/include/linux/i2c-pxa.h deleted file mode 100644 index a897e2b507b6.. --- a/include/linux/i2c-pxa.h +++ /dev/null @@ -1,18 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _LINUX_I2C_ALGO_PXA_H -#define _LINUX_I2C_ALGO_PXA_H - -typedef enum i2c_slave_event_e { - I2C_SLAVE_EVENT_START_READ, - I2C_SLAVE_EVENT_START_WRITE, - I2C_SLAVE_EVENT_STOP -} i2c_slave_event_t; - -struct i2c_slave_client { - void *data; - void (*event)(void *ptr, i2c_slave_event_t event); - int (*read) (void *ptr); - void (*write)(void *ptr, unsigned int val); -}; - -#endif /* _LINUX_I2C_ALGO_PXA_H */ diff --git a/include/linux/platform_data/i2c-pxa.h b/include/linux/platform_data/i2c-pxa.h index cb290092599c..6a9b28399b39 100644 --- a/include/linux/platform_data/i2c-pxa.h +++ b/include/linux/platform_data/i2c-pxa.h @@ -55,11 +55,7 @@ */ #define I2C_ISR_INIT 0x7FF /* status register init */ -struct i2c_slave_client; - struct i2c_pxa_platform_data { - unsigned intslave_addr; - struct i2c_slave_client *slave; unsigned intclass; unsigned intuse_pio :1; unsigned intfast_mode :1; -- 2.17.2 (Apple Git-113)
[PATCH 1/2] i2c: pxa: migrate to new i2c_slave APIs
The i2c subsystem was enhanced circa 2015 to support operating as an i2c-slave device. Prior to that, the i2c-pxa driver supported an i2c-slave but had its own APIs. There are no existing in-kernel drivers or platforms that utilize the i2c-pxa APIs. Migrate the i2c-pxa driver to the general i2c-slave APIs so that existing drivers, such as the i2c-slave-eeprom, can be used. This has been tested with a Marvell EspressoBin, using i2c-pxa and i2c-slave-eeprom, acting as a slave, and a RaspeberryPi 3, using the at24 driver, acting as a master. Signed-off-by: Patrick Williams --- drivers/i2c/busses/Kconfig | 1 + drivers/i2c/busses/i2c-pxa.c | 74 +--- 2 files changed, 61 insertions(+), 14 deletions(-) diff --git a/drivers/i2c/busses/Kconfig b/drivers/i2c/busses/Kconfig index 146ce40d8e0a..d0c79ac9ffdb 100644 --- a/drivers/i2c/busses/Kconfig +++ b/drivers/i2c/busses/Kconfig @@ -874,6 +874,7 @@ config I2C_PXA_PCI config I2C_PXA_SLAVE bool "Intel PXA2XX I2C Slave comms support" depends on I2C_PXA && !X86_32 + select I2C_SLAVE help Support I2C slave mode communications on the PXA I2C bus. This is necessary for systems where the PXA may be a target on the diff --git a/drivers/i2c/busses/i2c-pxa.c b/drivers/i2c/busses/i2c-pxa.c index 2c3c3d6935c0..c811646e809f 100644 --- a/drivers/i2c/busses/i2c-pxa.c +++ b/drivers/i2c/busses/i2c-pxa.c @@ -180,7 +180,7 @@ struct pxa_i2c { struct i2c_adapter adap; struct clk *clk; #ifdef CONFIG_I2C_PXA_SLAVE - struct i2c_slave_client *slave; + struct i2c_client *slave; #endif unsigned intirqlogidx; @@ -544,22 +544,23 @@ static void i2c_pxa_slave_txempty(struct pxa_i2c *i2c, u32 isr) if (isr & ISR_BED) { /* what should we do here? 
*/ } else { - int ret = 0; + u8 byte = 0; if (i2c->slave != NULL) - ret = i2c->slave->read(i2c->slave->data); + i2c_slave_event(i2c->slave, I2C_SLAVE_READ_PROCESSED, + ); - writel(ret, _IDBR(i2c)); + writel(byte, _IDBR(i2c)); writel(readl(_ICR(i2c)) | ICR_TB, _ICR(i2c)); /* allow next byte */ } } static void i2c_pxa_slave_rxfull(struct pxa_i2c *i2c, u32 isr) { - unsigned int byte = readl(_IDBR(i2c)); + u8 byte = readl(_IDBR(i2c)); if (i2c->slave != NULL) - i2c->slave->write(i2c->slave->data, byte); + i2c_slave_event(i2c->slave, I2C_SLAVE_WRITE_RECEIVED, ); writel(readl(_ICR(i2c)) | ICR_TB, _ICR(i2c)); } @@ -572,9 +573,18 @@ static void i2c_pxa_slave_start(struct pxa_i2c *i2c, u32 isr) dev_dbg(>adap.dev, "SAD, mode is slave-%cx\n", (isr & ISR_RWM) ? 'r' : 't'); - if (i2c->slave != NULL) - i2c->slave->event(i2c->slave->data, -(isr & ISR_RWM) ? I2C_SLAVE_EVENT_START_READ : I2C_SLAVE_EVENT_START_WRITE); + if (i2c->slave != NULL) { + if (isr & ISR_RWM) { + u8 byte = 0; + + i2c_slave_event(i2c->slave, I2C_SLAVE_READ_REQUESTED, + ); + writel(byte, _IDBR(i2c)); + } else { + i2c_slave_event(i2c->slave, I2C_SLAVE_WRITE_REQUESTED, + NULL); + } + } /* * slave could interrupt in the middle of us generating a @@ -607,7 +617,7 @@ static void i2c_pxa_slave_stop(struct pxa_i2c *i2c) dev_dbg(>adap.dev, "ISR: SSD (Slave Stop)\n"); if (i2c->slave != NULL) - i2c->slave->event(i2c->slave->data, I2C_SLAVE_EVENT_STOP); + i2c_slave_event(i2c->slave, I2C_SLAVE_STOP, NULL); if (i2c_debug > 2) dev_dbg(>adap.dev, "ISR: SSD (Slave Stop) acked\n"); @@ -619,6 +629,38 @@ static void i2c_pxa_slave_stop(struct pxa_i2c *i2c) if (i2c->msg) i2c_pxa_master_complete(i2c, I2C_RETRY); } + +static int i2c_pxa_slave_reg(struct i2c_client *slave) +{ + struct pxa_i2c *i2c = slave->adapter->algo_data; + + if (i2c->slave) + return -EBUSY; + + if (!i2c->reg_isar) + return -EAFNOSUPPORT; + + i2c->slave = slave; + i2c->slave_addr = slave->addr; + + writel(i2c->slave_addr, _ISAR(i2c)); + + return 0; +} + +static int 
i2c_pxa_slave_unreg(struct i2c_client *slave) +{ + struct pxa_i2c *i2c = slave->adapter->algo_data; + + WARN_ON(!i2c->slave); + + i2c->slave_a
[PATCH 0/2] i2c: pxa: migrate to i2c-core-slave APIs
i2c-pxa has its own i2c slave APIs rather than using the i2c-core APIs. There are no in-tree drivers currently using this custom slave API set. Migrate the i2c-pxa driver from its own i2c slave APIs to the i2c-core slave APIs so that in-tree slave devices can be used with the i2c-pxa hardware (ex. i2c-slave-eeprom). Patrick Williams (2): i2c: pxa: migrate to new i2c_slave APIs i2c: pxa: remove unused i2c-slave APIs drivers/i2c/busses/Kconfig| 1 + drivers/i2c/busses/i2c-pxa.c | 75 +-- include/linux/i2c-pxa.h | 18 --- include/linux/platform_data/i2c-pxa.h | 4 -- 4 files changed, 61 insertions(+), 37 deletions(-) delete mode 100644 include/linux/i2c-pxa.h -- 2.17.2 (Apple Git-113)
[PATCH] pinctrl: armada-37xx: swap polarity on LED group
The configuration registers for the LED group have inverted polarity, which puts the GPIO into open-drain state when used in GPIO mode. Switch to '0' for GPIO and '1' for LED modes. Fixes: 87466ccd9401 ("pinctrl: armada-37xx: Add pin controller support for Armada 37xx") Signed-off-by: Patrick Williams Cc: --- drivers/pinctrl/mvebu/pinctrl-armada-37xx.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c b/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c index 6462d3ca7ceb..6310963ce5f0 100644 --- a/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c +++ b/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c @@ -183,10 +183,10 @@ static struct armada_37xx_pin_group armada_37xx_nb_groups[] = { PIN_GRP_EXTRA("uart2", 9, 2, BIT(1) | BIT(13) | BIT(14) | BIT(19), BIT(1) | BIT(13) | BIT(14), BIT(1) | BIT(19), 18, 2, "gpio", "uart"), - PIN_GRP_GPIO("led0_od", 11, 1, BIT(20), "led"), - PIN_GRP_GPIO("led1_od", 12, 1, BIT(21), "led"), - PIN_GRP_GPIO("led2_od", 13, 1, BIT(22), "led"), - PIN_GRP_GPIO("led3_od", 14, 1, BIT(23), "led"), + PIN_GRP_GPIO_2("led0_od", 11, 1, BIT(20), BIT(20), 0, "led"), + PIN_GRP_GPIO_2("led1_od", 12, 1, BIT(21), BIT(21), 0, "led"), + PIN_GRP_GPIO_2("led2_od", 13, 1, BIT(22), BIT(22), 0, "led"), + PIN_GRP_GPIO_2("led3_od", 14, 1, BIT(23), BIT(23), 0, "led"), }; -- 2.17.2 (Apple Git-113)
[PATCH v2] pinctrl: armada-37xx: fix control of pins 32 and up
The 37xx configuration registers are only 32 bits long, so pins 32-35 spill over into the next register. The calculation for the register address was done, but the bitmask was not, so any configuration to pin 32 or above resulted in a bitmask that overflowed and performed no action. Fix the register / offset calculation to also adjust the offset. Fixes: 5715092a458c ("pinctrl: armada-37xx: Add gpio support") Signed-off-by: Patrick Williams Acked-by: Gregory CLEMENT Cc: --- drivers/pinctrl/mvebu/pinctrl-armada-37xx.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c b/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c index 6462d3ca7ceb..34c1fee52cbe 100644 --- a/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c +++ b/drivers/pinctrl/mvebu/pinctrl-armada-37xx.c @@ -221,11 +221,11 @@ static const struct armada_37xx_pin_data armada_37xx_pin_sb = { }; static inline void armada_37xx_update_reg(unsigned int *reg, - unsigned int offset) + unsigned int *offset) { /* We never have more than 2 registers */ - if (offset >= GPIO_PER_REG) { - offset -= GPIO_PER_REG; + if (*offset >= GPIO_PER_REG) { + *offset -= GPIO_PER_REG; *reg += sizeof(u32); } } @@ -376,7 +376,7 @@ static inline void armada_37xx_irq_update_reg(unsigned int *reg, { int offset = irqd_to_hwirq(d); - armada_37xx_update_reg(reg, offset); + armada_37xx_update_reg(reg, ); } static int armada_37xx_gpio_direction_input(struct gpio_chip *chip, @@ -386,7 +386,7 @@ static int armada_37xx_gpio_direction_input(struct gpio_chip *chip, unsigned int reg = OUTPUT_EN; unsigned int mask; - armada_37xx_update_reg(, offset); + armada_37xx_update_reg(, ); mask = BIT(offset); return regmap_update_bits(info->regmap, reg, mask, 0); @@ -399,7 +399,7 @@ static int armada_37xx_gpio_get_direction(struct gpio_chip *chip, unsigned int reg = OUTPUT_EN; unsigned int val, mask; - armada_37xx_update_reg(, offset); + armada_37xx_update_reg(, ); mask = BIT(offset); regmap_read(info->regmap, reg, 
); @@ -413,7 +413,7 @@ static int armada_37xx_gpio_direction_output(struct gpio_chip *chip, unsigned int reg = OUTPUT_EN; unsigned int mask, val, ret; - armada_37xx_update_reg(, offset); + armada_37xx_update_reg(, ); mask = BIT(offset); ret = regmap_update_bits(info->regmap, reg, mask, mask); @@ -434,7 +434,7 @@ static int armada_37xx_gpio_get(struct gpio_chip *chip, unsigned int offset) unsigned int reg = INPUT_VAL; unsigned int val, mask; - armada_37xx_update_reg(, offset); + armada_37xx_update_reg(, ); mask = BIT(offset); regmap_read(info->regmap, reg, ); @@ -449,7 +449,7 @@ static void armada_37xx_gpio_set(struct gpio_chip *chip, unsigned int offset, unsigned int reg = OUTPUT_VAL; unsigned int mask, val; - armada_37xx_update_reg(, offset); + armada_37xx_update_reg(, ); mask = BIT(offset); val = value ? mask : 0; -- 2.17.2 (Apple Git-113)
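The bug the patch fixes is a pass-by-value problem: the helper adjusted the register address but the caller's offset stayed >= 32, so the later BIT(offset) overflowed a 32-bit mask to zero. A minimal userspace sketch of the fixed helper (simplified, names mirroring the driver) shows why the offset must now travel by pointer:

```c
#include <assert.h>

#define GPIO_PER_REG 32u

/* Fixed helper: both the register address and the bit offset are
 * adjusted, so the offset is passed by pointer and the caller's
 * subsequent BIT(*offset) stays within a 32-bit register. */
static void update_reg(unsigned int *reg, unsigned int *offset)
{
	/* We never have more than 2 registers */
	if (*offset >= GPIO_PER_REG) {
		*offset -= GPIO_PER_REG;
		*reg += sizeof(unsigned int);
	}
}
```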
Re: Usecases for the per-task latency-nice attribute
On Wed, Sep 18, 2019 at 16:22:32 +0100, Vincent Guittot wrote... > On Wed, 18 Sep 2019 at 16:19, Patrick Bellasi wrote: [...] >> $> Wakeup path tunings >> == >> >> Some additional possible use-cases were already discussed in [3]: >> >> - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE} >>depending on crossing certain pre-configured threshold of latency >>niceness. >> >> - dynamically bias the vruntime updates we do in place_entity() >>depending on the actual latency niceness of a task. >> >>PeterZ thinks this is dangerous but that we can "(carefully) fumble a >>bit there." > > I agree with Peter that we can easily break the fairness if we bias vruntime Just to be more precise here and also to better understand, here I'm talking about turning the tweaks we already have for: - START_DEBIT - GENTLE_FAIR_SLEEPERS a bit more parametric and proportional to the latency-nice of a task. In principle, if a task declares a positive latency niceness, could we not read this also as "I accept to be a bit penalised in terms of fairness at wakeup time"? Whatever tweaks we do there should affect anyway only one sched_latency period... although I'm not yet sure if that's possible and how. >> - bias the decisions we take in check_preempt_tick() still depending >>on a relative comparison of the current and wakeup task latency >>niceness values. > > This one seems possible as it will mainly enable a task to preempt > "earlier" the running task but will not break the fairness > So the main impact will be the number of context switch between tasks > to favor or not the scheduling latency Preempting earlier is definitely a nice-to-have feature. At the same time, it would be interesting to support a case where a low latency-nice task (e.g. TOP_APP) RUNNABLE on a CPU has better chances to be executed up to completion without being preempted by a high latency-nice task (e.g. BACKGROUND) waking up on its CPU. 
For that to happen, we need a mechanism to "delay" the execution of a less important RUNNABLE task up to a certain period. It's impacting the fairness, true, but latency-nice in this case will mean that we want to "complete faster", not just "start faster". Is this definition something we can reason about? Best, Patrick -- #include Patrick Bellasi
Re: Usecases for the per-task latency-nice attribute
with a different name, since jitters clashes with other RT related concepts. Maybe we don't even need a name at all, the other two attributes you specify are good enough to identify those tasks: they are just "small background" tasks. small : because of their small util_est value background : because of their high latency-nice value > already active core and thus refrains from waking up of a new core if > possible. This requires tagging of tasks from the userspace hinting which > tasks are un-important and thus waking-up a new core to minimize the > latency is un-necessary for such tasks. > As per the discussion on the posted RFC, it will be appropriate to use the > task latency property where a task with the highest latency-nice value can > be packed. We should better define here what you mean with "highest" latency-nice value, do you really mean the top of the range, e.g. 19? Or... > But for this specific use-cases, having just a binary value to know which > task is latency-sensitive and which not is sufficient enough, but having a > range is also a good way to go where above some threshold the task can be > packed. ... yes, maybe we can reason about a "threshold based profiling" where something like for example: /proc/sys/kernel/sched_packing_util_max: 200 /proc/sys/kernel/sched_packing_latency_min : 17 means that a task with latency-nice >= 17 and util_est <= 200 will be packed? $> Wakeup path tunings == Some additional possible use-cases were already discussed in [3]: - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE} depending on crossing certain pre-configured threshold of latency niceness. - dynamically bias the vruntime updates we do in place_entity() depending on the actual latency niceness of a task. PeterZ thinks this is dangerous but that we can "(carefully) fumble a bit there." - bias the decisions we take in check_preempt_tick() still depending on a relative comparison of the current and wakeup task latency niceness values. 
> References:
> ===========
> [1]. https://lkml.org/lkml/2019/8/30/829
> [2]. https://lkml.org/lkml/2019/7/25/296

[3]. Message-ID: <20190905114709.gm2...@hirez.programming.kicks-ass.net>
     https://lore.kernel.org/lkml/20190905114709.gm2...@hirez.programming.kicks-ass.net/

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
Re: linux-next: Tree for Sep 16 (kernel/sched/core.c)
On Wed, Sep 18, 2019 at 07:05:53 +0100, Ingo Molnar wrote... > * Randy Dunlap wrote: > >> On 9/17/19 6:38 AM, Patrick Bellasi wrote: >> > >> > On Tue, Sep 17, 2019 at 08:52:42 +0100, Ingo Molnar wrote... >> > >> >> * Randy Dunlap wrote: >> >> >> >>> On 9/16/19 3:38 PM, Mark Brown wrote: >> >>>> Hi all, >> >>>> >> >>>> Changes since 20190915: >> >>>> >> >>> >> >>> on x86_64: >> >>> >> >>> when CONFIG_CGROUPS is not set: >> > >> > Hi Randy, >> > thanks for the report. >> > >> >>> CC kernel/sched/core.o >> >>> ../kernel/sched/core.c: In function ‘uclamp_update_active_tasks’: >> >>> ../kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known >> >>> struct css_task_iter it; >> >>>^~ >> >>> CC kernel/printk/printk_safe.o >> >>> ../kernel/sched/core.c:1084:2: error: implicit declaration of function >> >>> ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? >> >>> [-Werror=implicit-function-declaration] >> >>> css_task_iter_start(css, 0, ); >> >>> ^~~ >> >>> __sg_page_iter_start >> >>> ../kernel/sched/core.c:1085:14: error: implicit declaration of function >> >>> ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? >> >>> [-Werror=implicit-function-declaration] >> >>> while ((p = css_task_iter_next())) { >> >>> ^~ >> >>> __sg_page_iter_next >> >>> ../kernel/sched/core.c:1091:2: error: implicit declaration of function >> >>> ‘css_task_iter_end’; did you mean ‘get_task_cred’? >> >>> [-Werror=implicit-function-declaration] >> >>> css_task_iter_end(); >> >>> ^ >> >>> get_task_cred >> >>> ../kernel/sched/core.c:1081:23: warning: unused variable ‘it’ >> >>> [-Wunused-variable] >> >>> struct css_task_iter it; >> >>>^~ >> >>> >> >> >> >> I cannot reproduce this build failue: I took Linus's latest which has all >> >> the -next scheduler commits included (ad062195731b), and an x86-64 "make >> >> defconfig" and a disabling of CONFIG_CGROUPS still resuls in a kernel >> >> that builds fine. 
>> > >> > Same here Ingo, I cannot reproduce on arm64 and !CONFIG_CGROUPS and >> > testing on tip/sched/core. >> > >> > However, if you like, the following patch can make that code a >> > bit more "robust". >> > >> > Best, >> > Patrick >> > >> > ---8<--- >> > From 7e17b7bb08dd8dfc57e01c2a7b6875439eb47cbe Mon Sep 17 00:00:00 2001 >> > From: Patrick Bellasi >> > Date: Tue, 17 Sep 2019 14:12:10 +0100 >> > Subject: [PATCH 1/1] sched/core: uclamp: Fix compile error on >> > !CONFIG_CGROUPS >> > >> > Randy reported a compiler error on x86_64 and !CONFIG_CGROUPS which is due >> > to uclamp_update_active_tasks() using the undefined css_task_iter(). >> > >> > Since uclamp_update_active_tasks() is used only when cgroup support is >> > enabled, fix that by properly guarding that function at compile time. >> > >> > Signed-off-by: Patrick Bellasi >> > Link: >> > https://lore.kernel.org/lkml/1898d3c9-1997-17ce-a022-a5e28c8dc...@infradead.org/ >> > Fixes: commit babbe170e05 ("sched/uclamp: Update CPU's refcount on TG's >> > clamp changes") >> >> Acked-by: Randy Dunlap # build-tested >> >> Thanks. > > Build failures like this one shouldn't depend on the compiler version - > and it's still a mystery how and why this build bug triggered - we cannot > apply the fix without knowing the answer to those questions. Right, but it's also quite strange it's not triggering without the guarding above. The only definition of struct css_task_iter I can see is the one provided in: include/linux/cgroup.h:50 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/cgroup.h?h=35f7a95266153b1cf0caca3aa9661cb721864527#n50 which is CONFIG_CGROUPS guarded. > Can you reproduce the build bug with Linus's latest tree? If not, which > part of -next triggers the build failure? 
I tried again using this morning's Linus tree headed at: commit 35f7a9526615 ("Merge tag 'devprop-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm") and compilation actually fails for me too. Everything is fine in v5.3 with !CONFIG_CGROUPS and a git bisect between v5.3 and Linus master points to: commit babbe170e053c ("sched/uclamp: Update CPU's refcount on TG's clamp changes") So, I think it's really my fault not properly testing !CONFIG_CGROUP, which is enforced by default from CONFIG_SCHED_AUTOGROUP. The patch above fixes the compilation error, hope this helps. Cheers, Patrick -- #include Patrick Bellasi
Re: linux-next: Tree for Sep 16 (kernel/sched/core.c)
On Tue, Sep 17, 2019 at 08:52:42 +0100, Ingo Molnar wrote... > * Randy Dunlap wrote: > >> On 9/16/19 3:38 PM, Mark Brown wrote: >> > Hi all, >> > >> > Changes since 20190915: >> > >> >> on x86_64: >> >> when CONFIG_CGROUPS is not set: Hi Randy, thanks for the report. >> CC kernel/sched/core.o >> ../kernel/sched/core.c: In function ‘uclamp_update_active_tasks’: >> ../kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known >> struct css_task_iter it; >>^~ >> CC kernel/printk/printk_safe.o >> ../kernel/sched/core.c:1084:2: error: implicit declaration of function >> ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? >> [-Werror=implicit-function-declaration] >> css_task_iter_start(css, 0, ); >> ^~~ >> __sg_page_iter_start >> ../kernel/sched/core.c:1085:14: error: implicit declaration of function >> ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? >> [-Werror=implicit-function-declaration] >> while ((p = css_task_iter_next())) { >> ^~ >> __sg_page_iter_next >> ../kernel/sched/core.c:1091:2: error: implicit declaration of function >> ‘css_task_iter_end’; did you mean ‘get_task_cred’? >> [-Werror=implicit-function-declaration] >> css_task_iter_end(); >> ^ >> get_task_cred >> ../kernel/sched/core.c:1081:23: warning: unused variable ‘it’ >> [-Wunused-variable] >> struct css_task_iter it; >>^~ >> > > I cannot reproduce this build failue: I took Linus's latest which has all > the -next scheduler commits included (ad062195731b), and an x86-64 "make > defconfig" and a disabling of CONFIG_CGROUPS still resuls in a kernel > that builds fine. Same here Ingo, I cannot reproduce on arm64 and !CONFIG_CGROUPS and testing on tip/sched/core. However, if you like, the following patch can make that code a bit more "robust". 
Best,
Patrick

---8<---
From 7e17b7bb08dd8dfc57e01c2a7b6875439eb47cbe Mon Sep 17 00:00:00 2001
From: Patrick Bellasi
Date: Tue, 17 Sep 2019 14:12:10 +0100
Subject: [PATCH 1/1] sched/core: uclamp: Fix compile error on !CONFIG_CGROUPS

Randy reported a compiler error on x86_64 and !CONFIG_CGROUPS which is due
to uclamp_update_active_tasks() using the undefined css_task_iter().

Since uclamp_update_active_tasks() is used only when cgroup support is
enabled, fix that by properly guarding that function at compile time.

Signed-off-by: Patrick Bellasi
Link: https://lore.kernel.org/lkml/1898d3c9-1997-17ce-a022-a5e28c8dc...@infradead.org/
Fixes: commit babbe170e05 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c7b90bcbe4e..14873ad4b34a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,7 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 		uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
 static inline void
 uclamp_update_active(struct task_struct *p, enum uclamp_id clamp_id)
 {
@@ -1091,7 +1092,6 @@ uclamp_update_active_tasks(struct cgroup_subsys_state *css,
 	css_task_iter_end(&it);
 }
 
-#ifdef CONFIG_UCLAMP_TASK_GROUP
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 static void uclamp_update_root_tg(void)
 {
-- 
2.22.0
---8<---

-- 
#include <best/regards.h>

Patrick Bellasi
Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote... > On 05/09/2019 12:18, Patrick Bellasi wrote: >>> There's a few things wrong there; I really feel that if we call it nice, >>> it should be like nice. Otherwise we should call it latency-bias and not >>> have the association with nice to confuse people. >>> >>> Secondly; the default should be in the middle of the range. Naturally >>> this would be a signed range like nice [-(x+1),x] for some x. but if you >>> want [0,1024], then the default really should be 512, but personally I >>> like 0 better as a default, in which case we need negative numbers. >>> >>> This is important because we want to be able to bias towards less >>> importance to (tail) latency as well as more importantance to (tail) >>> latency. >>> >>> Specifically, Oracle wants to sacrifice (some) latency for throughput. >>> Facebook OTOH seems to want to sacrifice (some) throughput for latency. >> >> Right, we have this dualism to deal with and current mainline behaviour >> is somehow in the middle. >> >> BTW, the FB requirement is the same we have in Android. >> We want some CFS tasks to have very small latency and a low chance >> to be preempted by the wake-up of less-important "background" tasks. >> >> I'm not totally against the usage of a signed range, but I'm thinking >> that since we are introducing a new (non POSIX) concept we can get the >> chance to make it more human friendly. >> >> Give the two extremes above, would not be much simpler and intuitive to >> have 0 implementing the FB/Android (no latency) case and 1024 the >> (max latency) Oracle case? >> > > For something like latency-, I don't see the point of having > such a wide range. The nice range is probably more than enough - and before > even bothering about the range, we should probably agree on what the range > should represent. > > If it's niceness, I read it as: positive latency-nice value means we're > nice to latency, means we reduce it. 
> So the further up you go, the more you restrict your wakeup scan. I
> think it's quite easy to map that into the code: current behaviour at
> 0, with a decreasing scan mask size as we go towards +19. I don't think
> anyone needs 512 steps to tune this.
>
> I don't know what logic we'd follow for negative values though. Maybe
> latency-nice -20 means always going through the slowpath, but what of
> the intermediate values?

Yep, I think so far we are all converging towards the idea to use a
signed range.

Regarding the range itself, yes: 1024 looks very oversized, but +-20 is
still something which leaves room for a bit of flexibility and it also
better matches the idea that we don't want to "enumerate behaviours" but
just expose a knob. To map certain "bias" we could benefit from a
slightly larger range.

> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
> applied elsewhere (e.g. load-balance, something like task_hot() and its
> use of sysctl_sched_migration_cost).

For the wakeup path there is also the TurboSched proposal by Parth:

   Message-ID: <20190725070857.6639-1-pa...@linux.ibm.com>
   https://lore.kernel.org/lkml/20190725070857.6639-1-pa...@linux.ibm.com/

which we should keep in mind.

For LB, can you come up with some better description of which usages you
see could benefit from a "per task" or "per task-group" latency
niceness?

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
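The "decreasing scan mask size as we go towards +19" suggestion quoted above
can be sketched as a simple linear mapping. The function name, the linear
interpolation, and treating values <= 0 as "full scan" are all illustrative
assumptions, not code from the RFC:

```c
/*
 * Hypothetical mapping of latency-nice to idle-CPU scan width:
 * full LLC scan at latency-nice <= 0 (current behaviour), shrinking
 * linearly down to a single CPU at +19. Illustration only.
 */
int scan_width(int nr_llc_cpus, int latency_nice)
{
	if (latency_nice <= 0)
		return nr_llc_cpus;	/* current behaviour (or wider) */
	if (latency_nice >= 19)
		return 1;		/* most restricted scan */
	return nr_llc_cpus - (nr_llc_cpus - 1) * latency_nice / 19;
}
```

On a 64-CPU LLC this gives 64 CPUs scanned at latency-nice 0 and 1 CPU at +19,
decreasing monotonically in between, which matches the "nice to latency means
we reduce it" reading without needing 512 steps.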
Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
On Thu, Sep 05, 2019 at 12:40:30 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:
>
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>>
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>>
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>
> I'm arguing that signed _is_ more human friendly ;-)

... but you are not human. :)

>> Give the two extremes above, would not be much simpler and intuitive to
>> have 0 implementing the FB/Android (no latency) case and 1024 the
>> (max latency) Oracle case?
>
> See, I find the signed thing more natural, negative is a bias away from
> latency sensitive, positive is a bias towards latency sensitive.
>
> Also; 0 is a good default value ;-)

Yes, that's appealing indeed.

>> Moreover, we will never match completely the nice semantic, give that
>> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
>> usage change for each step?
>
> Only because we were nice when implementing it. Posix leaves it
> unspecified and we could change it at any time. The only real semantics
> is a relative 'weight' (opengroup uses the term 'favourable').

Good to know, I was considering it a POSIX requirement.

>> Could changing the name to "latency-tolerance" break the tie by marking
>> its difference wrt prior/nice levels? AFAIR, that was also the original
>> proposal [1] by PaulT during the OSPM discussion.
>
> latency torrerance could still be a signed entity, positive would
> signify we're more tolerant of latency (ie. less sensitive) while
> negative would be less tolerant (ie. more sensitive).

Right.
>> For latency-nice instead we will likely base our biasing strategies on >> some predefined (maybe system-wide configurable) const thresholds. > > I'm not quite sure; yes, for some of these things, like the idle search > on wakeup, certainly. But say for wakeup-preemption, we could definitely > make it a task relative attribute. -- #include Patrick Bellasi
Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
On Thu, Sep 05, 2019 at 12:30:02 +0100, Peter Zijlstra wrote... > On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote: >> On 09/05/19 12:46, Peter Zijlstra wrote: > >> > This is important because we want to be able to bias towards less >> > importance to (tail) latency as well as more importantance to (tail) >> > latency. >> > >> > Specifically, Oracle wants to sacrifice (some) latency for throughput. >> > Facebook OTOH seems to want to sacrifice (some) throughput for latency. >> >> Another use case I'm considering is using latency-nice to prefer an idle CPU >> if >> latency-nice is set otherwise go for the most energy efficient CPU. >> >> Ie: sacrifice (some) energy for latency. >> >> The way I see interpreting latency-nice here as a binary switch. But >> maybe we can use the range to select what (some) energy to sacrifice >> mean here. Hmmm. > > It cannot be binary, per definition is must be ternary, that is, <0, ==0 > and >0 (or middle value if you're of that persuasion). > > In your case, I'm thinking you mean >0, we want to lower the latency. > > Anyway; there were a number of things mentioned at OSPM that we could > tie into this thing and finding sensible mappings is going to be a bit > of trial and error I suppose. > > But as patrick said; we're very much exporting a BIAS knob, not a set of > behaviours. Right, although I think behaviours could still be exported but via a different and configurable interface, using thresholds. Either at compile time or via procfs maybe we can expose and properly document what happen in the scheduler if/when a task has a "latency niceness" crossing a given threshold. For example, by setting something like: /proc/sys/kernel/sched_cfs_latency_idle = 1000 we state that the task is going to be scheduled according to the SCHED_IDLE policy. ( ( (tomatoes target here) ) ) Not sure also if we wanna commit to user-space APIs how we internally map/translate a "latency niceness" value into a scheduler behaviour bias. 
Maybe better not at least at the very beginning. Best, Patrick -- #include Patrick Bellasi
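The sched_cfs_latency_idle example above boils down to a single threshold
check. A minimal sketch, where the sysctl name and the 1000 default come from
the mail, while the [0..1024] scale and the >= comparison are assumptions
taken from elsewhere in this thread:

```c
#include <stdbool.h>

/*
 * Hypothetical threshold above which a CFS task's latency niceness is
 * treated as a request for SCHED_IDLE-like scheduling. The default
 * (1000) is the example value from the mail; illustration only.
 */
int sysctl_sched_cfs_latency_idle = 1000;

bool latency_nice_implies_sched_idle(int latency_nice)
{
	return latency_nice >= sysctl_sched_cfs_latency_idle;
}
```

Documenting only the threshold, rather than the internal mapping, would leave
the kernel free to change how the bias is implemented without breaking the
user-space API, which is the concern raised above.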
Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
On Thu, Sep 05, 2019 at 12:13:47 +0100, Qais Yousef wrote... > On 09/05/19 12:46, Peter Zijlstra wrote: >> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote: >> >> > > From just reading the above, I would expect it to have the range >> > > [-20,19] just like normal nice. Apparently this is not so. >> > >> > Regarding the range for the latency-nice values, I guess we have two >> > options: >> > >> > - [-20..19], which makes it similar to priorities >> > downside: we quite likely end up with a kernel space representation >> > which does not match the user-space one, e.g. look at >> > task_struct::prio. >> > >> > - [0..1024], which makes it more similar to a "percentage" >> > >> > Being latency-nice a new concept, we are not constrained by POSIX and >> > IMHO the [0..1024] scale is a better fit. >> > >> > That will translate into: >> > >> > latency-nice=0 : default (current mainline) behaviour, all "biasing" >> > policies are disabled and we wakeup up as fast as possible >> > >> > latency-nice=1024 : maximum niceness, where for example we can imaging >> > to turn switch a CFS task to be SCHED_IDLE? >> >> There's a few things wrong there; I really feel that if we call it nice, >> it should be like nice. Otherwise we should call it latency-bias and not >> have the association with nice to confuse people. >> >> Secondly; the default should be in the middle of the range. Naturally >> this would be a signed range like nice [-(x+1),x] for some x. but if you >> want [0,1024], then the default really should be 512, but personally I >> like 0 better as a default, in which case we need negative numbers. >> >> This is important because we want to be able to bias towards less >> importance to (tail) latency as well as more importantance to (tail) >> latency. >> >> Specifically, Oracle wants to sacrifice (some) latency for throughput. >> Facebook OTOH seems to want to sacrifice (some) throughput for latency. 
> Another use case I'm considering is using latency-nice to prefer an idle CPU if
> latency-nice is set otherwise go for the most energy efficient CPU.
>
> Ie: sacrifice (some) energy for latency.
>
> The way I see interpreting latency-nice here as a binary switch. But maybe we
> can use the range to select what (some) energy to sacrifice mean here. Hmmm.

I see this concept possibly evolving into something more than just a
binary switch. I'm not yet convinced it makes sense and/or is possible
but, in principle, I was thinking about these possible usages for CFS
tasks:

 - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
   depending on crossing certain pre-configured thresholds of latency
   niceness.

 - dynamically bias the vruntime updates we do in place_entity()
   depending on the actual latency niceness of a task.

 - bias the decisions we take in check_preempt_tick() still depending on
   a relative comparison of the current and wakeup task latency niceness
   values.

-- 
#include <best/regards.h>

Patrick Bellasi
Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
On Thu, Sep 05, 2019 at 11:46:16 +0100, Peter Zijlstra wrote... > On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote: > >> > From just reading the above, I would expect it to have the range >> > [-20,19] just like normal nice. Apparently this is not so. >> >> Regarding the range for the latency-nice values, I guess we have two >> options: >> >> - [-20..19], which makes it similar to priorities >> downside: we quite likely end up with a kernel space representation >> which does not match the user-space one, e.g. look at >> task_struct::prio. >> >> - [0..1024], which makes it more similar to a "percentage" >> >> Being latency-nice a new concept, we are not constrained by POSIX and >> IMHO the [0..1024] scale is a better fit. >> >> That will translate into: >> >> latency-nice=0 : default (current mainline) behaviour, all "biasing" >> policies are disabled and we wakeup up as fast as possible >> >> latency-nice=1024 : maximum niceness, where for example we can imaging >> to turn switch a CFS task to be SCHED_IDLE? > > There's a few things wrong there; I really feel that if we call it nice, > it should be like nice. Otherwise we should call it latency-bias and not > have the association with nice to confuse people. > > Secondly; the default should be in the middle of the range. Naturally > this would be a signed range like nice [-(x+1),x] for some x. but if you > want [0,1024], then the default really should be 512, but personally I > like 0 better as a default, in which case we need negative numbers. > > This is important because we want to be able to bias towards less > importance to (tail) latency as well as more importantance to (tail) > latency. > > Specifically, Oracle wants to sacrifice (some) latency for throughput. > Facebook OTOH seems to want to sacrifice (some) throughput for latency. Right, we have this dualism to deal with and current mainline behaviour is somehow in the middle. BTW, the FB requirement is the same we have in Android. 
We want some CFS tasks to have very small latency and a low chance to be
preempted by the wake-up of less-important "background" tasks.

I'm not totally against the usage of a signed range, but I'm thinking
that, since we are introducing a new (non POSIX) concept, we can take
the chance to make it more human friendly.

Given the two extremes above, would it not be much simpler and more
intuitive to have 0 implementing the FB/Android (no latency) case and
1024 the (max latency) Oracle case?

Moreover, we will never match the nice semantic completely, given that a
1 nice unit has a proper mathematical meaning: isn't it something like a
10% CPU usage change for each step?

For latency-nice instead we will likely base our biasing strategies on
some predefined (maybe system-wide configurable) const thresholds.

Could changing the name to "latency-tolerance" break the tie by marking
its difference wrt prio/nice levels? AFAIR, that was also the original
proposal [1] by PaulT during the OSPM discussion.

Best,
Patrick

[1] https://youtu.be/oz43thSFqmk?t=1302

-- 
#include <best/regards.h>

Patrick Bellasi
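The "~10% per nice step" intuition questioned above does have a concrete basis
in mainline: CFS weights (the sched_prio_to_weight[] table in
kernel/sched/core.c) step by roughly 1.25x per nice level, e.g. 1277 at nice
-1, 1024 at nice 0, 820 at nice +1. The share calculation can be restated
self-contained:

```c
/*
 * CPU share (integer percent) of a task with weight wa when competing
 * against a single runnable task with weight wb. With mainline's
 * ~1.25x-per-level weight table, one nice level of separation moves
 * the split from 50/50 to roughly 55/45.
 */
int cpu_share_pct(int wa, int wb)
{
	return 100 * wa / (wa + wb);
}
```

Two nice-0 tasks split 50/50; renice one of them to +1 and the nice-0 task
gets about 55% (1024 vs 820), which is the ~10% relative change per step the
mail refers to.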
Re: [RFC PATCH 0/9] Task latency-nice
On Fri, Aug 30, 2019 at 18:49:35 +0100, subhra mazumdar wrote...

> Introduce new per task property latency-nice for controlling scalability
> in scheduler idle CPU search path.

As per my comments in the other message, we should try to better split
the "latency nice" concept introduction (API and mechanisms) from its
usage in different scheduler code paths. This distinction should be
evident from the cover letter too.

What you use "latency nice" for is just one possible use-case, thus it's
important to make sure it's generic enough to fit other usages too.

> Valid latency-nice values are from 1 to
> 100 indicating 1% to 100% search of the LLC domain in select_idle_cpu. New
> CPU cgroup file cpu.latency-nice is added as an interface to set and get.
> All tasks in the same cgroup share the same latency-nice value. Using a
> lower latency-nice value can help latency intolerant tasks e.g very short
> running OLTP threads where full LLC search cost can be significant compared
> to run time of the threads. The default latency-nice value is 5.
>
> In addition to latency-nice, it also adds a new sched feature SIS_CORE to
> be able to disable idle core search altogether which is costly and hurts
> more than it helps in short running workloads.

I don't see why latency-nice cannot be used to implement what you get
with NO_SIS_CORE.

IMHO, "latency nice" should be exposed as a pretty generic concept which
progressively enables different "levels of biasing" both at wake-up time
and load-balance time.

Why is it not possible to have the SIS_CORE/NO_SIS_CORE switch
implemented just as different threshold values for the latency-nice
value of a task?

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
Re: [RFC PATCH 4/9] sched: SIS_CORE to disable idle core search
On Fri, Aug 30, 2019 at 18:49:39 +0100, subhra mazumdar wrote... > Use SIS_CORE to disable idle core search. For some workloads > select_idle_core becomes a scalability bottleneck, removing it improves > throughput. Also there are workloads where disabling it can hurt latency, > so need to have an option. > > Signed-off-by: subhra mazumdar > --- > kernel/sched/fair.c | 8 +--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index c31082d..23ec9c6 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, > int prev, int target) > if (!sd) > return target; > > - i = select_idle_core(p, sd, target); > - if ((unsigned)i < nr_cpumask_bits) > - return i; > + if (sched_feat(SIS_CORE)) { > + i = select_idle_core(p, sd, target); > + if ((unsigned)i < nr_cpumask_bits) > + return i; > + } > > i = select_idle_cpu(p, sd, target); > if ((unsigned)i < nr_cpumask_bits) This looks like should be squashed with the previous one, or whatever code you'll add to define when this "biasing" is to be used or not. Best, Patrick -- #include Patrick Bellasi
Re: [RFC PATCH 3/9] sched: add sched feature to disable idle core search
On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote... > Add a new sched feature SIS_CORE to have an option to disable idle core > search (select_idle_core). > > Signed-off-by: subhra mazumdar > --- > kernel/sched/features.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/sched/features.h b/kernel/sched/features.h > index 858589b..de4d506 100644 > --- a/kernel/sched/features.h > +++ b/kernel/sched/features.h > @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true) > */ > SCHED_FEAT(SIS_AVG_CPU, false) > SCHED_FEAT(SIS_PROP, true) > +SCHED_FEAT(SIS_CORE, true) Why do we need a sched_feature? If you think there are systems in which the usage of latency-nice does not make sense for in "Select Idle Sibling", then we should probably better add a new Kconfig option. If that's the case, you can probably use the init/Kconfig's "Scheduler features" section, recently added by: commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting") > /* > * Issue a WARN when we do multiple update_rq_clock() calls Best, Patrick -- #include Patrick Bellasi
Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote... > On 9/4/19 11:02 PM, Tim Chen wrote: >> On 8/30/19 10:49 AM, subhra mazumdar wrote: >>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file >>> "latency-nice" which is shared by all the threads in that Cgroup. >> >> >> Subhra, >> >> Thanks for posting the patchset. Having a latency nice hint >> is useful beyond idle load balancing. I can think of other >> application scenarios, like scheduling batch machine learning AVX 512 >> processes with latency sensitive processes. AVX512 limits the frequency >> of the CPU and it is best to avoid latency sensitive task on the >> same core with AVX512. So latency nice hint allows the scheduler >> to have a criteria to determine the latency sensitivity of a task >> and arrange latency sensitive tasks away from AVX512 tasks. >> > > > Hi Tim and Subhra, > > This patchset seems to be interesting for my TurboSched patches as well > where I try to pack jitter tasks on fewer cores to get higher Turbo > Frequencies. > Well, the problem I face is that we sometime end up putting multiple jitter > tasks on a core > running some latency sensitive application which may see performance > degradation. > So my plan was to classify such tasks to be latency sensitive thereby hinting > the load > balancer to not put tasks on such cores. > > TurboSched: https://lkml.org/lkml/2019/7/25/296 > >> You configure the latency hint on a cgroup basis. >> But I think not all tasks in a cgroup necessarily have the same >> latency sensitivity. >> >> For example, I can see that cgroup can be applied on a per user basis, >> and the user could run different tasks that have different latency >> sensitivity. >> We may also need a way to configure latency sensitivity on a per task basis >> instead on >> a per cgroup basis. >> > > AFAIU, the problem defined above intersects with my patches as well where the > interface > is required to classify the jitter tasks. 
> I have already tried few methods like syscall and cgroup to classify
> such tasks and maybe something like that can be adopted with these
> patchset as well.

Agree, these two patchsets are definitely overlapping in terms of
mechanisms and APIs to expose to userspace. You two guys seem to target
different goals, but the general approach should be:

 - expose a single and abstract concept to user-space: latency-nice or
   latency-tolerant, as PaulT proposed at OSPM;

 - map this concept in kernel-space to different kinds of bias, both at
   wakeup time and load-balance time, and use both for RT and CFS tasks.

That's my understanding at least ;)

I guess we will have interesting discussions at the upcoming LPC to
figure out a solution fitting all needs.

> Thanks,
> Parth

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
> +		.write_u64 = cpu_latency_nice_write_u64,
> +	},
> 	{ }	/* Terminate */
> };
>
> @@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
> 		.write = cpu_max_write,
> 	},
>  #endif
> +	{
> +		.name = "latency-nice",
> +		.read_u64 = cpu_latency_nice_read_u64,
> +		.write_u64 = cpu_latency_nice_write_u64,
> +	},
> 	{ }	/* terminate */
> };
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..b08d00c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> 		goto err;
>
> 	tg->shares = NICE_0_LOAD;
> +	tg->latency_nice = LATENCY_NICE_DEFAULT;

Maybe better NICE_0_LATENCY, to be more consistent?

> 	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..365c928 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
>  #define NICE_0_LOAD	(1L << NICE_0_LOAD_SHIFT)
>
>  /*
> + * Latency-nice default value
> + */
> +#define LATENCY_NICE_DEFAULT	5
> +#define LATENCY_NICE_MIN	1
> +#define LATENCY_NICE_MAX	100

Values 1 and 5 look kind of arbitrary.
For the range specifically, I already commented in this other message:

   Message-ID: <87r24v2i14@arm.com>
   https://lore.kernel.org/lkml/87r24v2i14@arm.com/

> +
> +/*
>  * Single value that decides SCHED_DEADLINE internal math precision.
>  * 10 -> just above 1us
>  * 9  -> just above 0.5us
> @@ -362,6 +369,7 @@ struct cfs_bandwidth {
>  /* Task group related information */
>  struct task_group {
> 	struct cgroup_subsys_state	css;
> +	u64			latency_nice;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> 	/* schedulable entities of this group on each CPU */

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
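Given the LATENCY_NICE_{MIN,MAX} bounds in the hunk above, the write handler
also needs input validation. A hedged sketch — the helper name and the
reject-rather-than-clamp choice are assumptions on my side, since the RFC's
actual cpu_latency_nice_write_u64() body is not quoted here:

```c
#include <errno.h>

#define LATENCY_NICE_MIN	1
#define LATENCY_NICE_MAX	100

/*
 * Illustrative validation for a cgroup u64 attribute write: reject
 * out-of-range values with -EINVAL instead of silently clamping, as
 * cgroup attribute handlers conventionally do.
 */
int latency_nice_validate(unsigned long long val)
{
	if (val < LATENCY_NICE_MIN || val > LATENCY_NICE_MAX)
		return -EINVAL;
	return 0;
}
```

Whatever range is finally chosen, making the write handler the single
validation point keeps the stored tg->latency_nice always within bounds.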
Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote... > On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote: >> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file >> "latency-nice" which is shared by all the threads in that Cgroup. > > *sigh*, no. We start with a normal per task attribute, and then later, > if it is needed and makes sense, we add it to cgroups. FWIW, to add on top of what Peter says, we used this same approach for uclamp and it proved to be a very effective way to come up with a good design. General principles have been: - a system wide API [1] (under /proc/sys/kernel/sched_*) defines default values for all tasks affected by that feature. This interface has to define also upper bounds for task specific values. Thus, in the case of latency-nice, it should be set by default to the MIN value, since that's the current mainline behaviour: all tasks are latency sensitive. - a per-task API [2] (via the sched_setattr() syscall) can be used to relax the system wide setting thus implementing a "nice" policy. - a per-taskgroup API [3] (via cpu controller's attributes) can be used to relax the system-wide settings and restrict the per-task API. The above features are worth to be added in that exact order. > Also, your Changelog fails on pretty much every point. It doesn't > explain why, it doesn't describe anything and so on. On the description side, I guess it's worth to mention somewhere to which scheduling classes this feature can be useful for. It's worth to mention that it can apply only to: - CFS tasks: for example, at wakeup time a task with an high latency-nice should avoid to preempt a low latency-nice task. Maybe by mapping the latency nice value into proper vruntime normalization value? - RT tasks: for example, at wakeup time a task with an high latency-nice value could avoid to preempt a CFS task. 
I'm sure there will be discussion about some of these features; that's
why it's important, in the proposal presentation, to keep a well
defined distinction between the "mechanisms and API" and how we use
the new concept to "bias" some scheduler policies.

> From just reading the above, I would expect it to have the range
> [-20,19] just like normal nice. Apparently this is not so.

Regarding the range for the latency-nice values, I guess we have two
options:

 - [-20..19], which makes it similar to priorities
   downside: we quite likely end up with a kernel-space representation
   which does not match the user-space one, e.g. look at
   task_struct::prio

 - [0..1024], which makes it more similar to a "percentage"

Latency-nice being a new concept, we are not constrained by POSIX, and
IMHO the [0..1024] scale is a better fit. That will translate into:

 - latency-nice=0: default (current mainline) behaviour, all "biasing"
   policies are disabled and we wake up as fast as possible

 - latency-nice=1024: maximum niceness, where for example we can
   imagine switching a CFS task to SCHED_IDLE?

Best,
Patrick

[1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
[2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to
    support utilization clamping")
[3] 5 patches in today's tip/sched/core up to:
    commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's
    clamp changes")

--
#include
Patrick Bellasi
[tip: sched/core] sched/uclamp: Propagate parent clamps
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     0b60ba2dd342016e4e717dbaa4ca9af3a43f4434
Gitweb:        https://git.kernel.org/tip/0b60ba2dd342016e4e717dbaa4ca9af3a43f4434
Author:        Patrick Bellasi
AuthorDate:    Thu, 22 Aug 2019 14:28:07 +01:00
Committer:     Ingo Molnar
CommitterDate: Tue, 03 Sep 2019 09:17:38 +02:00

sched/uclamp: Propagate parent clamps

In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires to properly propagate and aggregate
parent attributes down to its descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation to ensure all user-space write
never fails while still always tracking the most restrictive values.

Signed-off-by: Patrick Bellasi
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Michal Koutny
Acked-by: Tejun Heo
Cc: Alessio Balsini
Cc: Dietmar Eggemann
Cc: Joel Fernandes
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Morten Rasmussen
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Quentin Perret
Cc: Rafael J. Wysocki
Cc: Steve Muckle
Cc: Suren Baghdasaryan
Cc: Thomas Gleixner
Cc: Todd Kjos
Cc: Vincent Guittot
Cc: Viresh Kumar
Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bell...@arm.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/core.c  | 44 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c186abe..8855481 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1166,6 +1166,7 @@ static void __init init_uclamp(void)
 		uclamp_default[clamp_id] = uc_max;
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 		root_task_group.uclamp_req[clamp_id] = uc_max;
+		root_task_group.uclamp[clamp_id] = uc_max;
 #endif
 	}
 }
@@ -6824,6 +6825,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
 	for_each_clamp_id(clamp_id) {
 		uclamp_se_set(&tg->uclamp_req[clamp_id],
 			      uclamp_none(clamp_id), false);
+		tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
 	}
 #endif
 }
@@ -7070,6 +7072,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+	struct cgroup_subsys_state *top_css = css;
+	struct uclamp_se *uc_parent = NULL;
+	struct uclamp_se *uc_se = NULL;
+	unsigned int eff[UCLAMP_CNT];
+	unsigned int clamp_id;
+	unsigned int clamps;
+
+	css_for_each_descendant_pre(css, top_css) {
+		uc_parent = css_tg(css)->parent
+			? css_tg(css)->parent->uclamp : NULL;
+
+		for_each_clamp_id(clamp_id) {
+			/* Assume effective clamps matches requested clamps */
+			eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+			/* Cap effective clamps with parent's effective clamps */
+			if (uc_parent &&
+			    eff[clamp_id] > uc_parent[clamp_id].value) {
+				eff[clamp_id] = uc_parent[clamp_id].value;
+			}
+		}
+		/* Ensure protection is always capped by limit */
+		eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
+
+		/* Propagate most restrictive effective clamps */
+		clamps = 0x0;
+		uc_se = css_tg(css)->uclamp;
+		for_each_clamp_id(clamp_id) {
+			if (eff[clamp_id] == uc_se[clamp_id].value)
+				continue;
+			uc_se[clamp_id].value = eff[clamp_id];
+			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
+			clamps |= (0x1 << clamp_id);
+		}
+		if (!clamps)
+			css = css_rightmost_descendant(css);
+	}
+}
 
 /*
  * Integer 10^N with a given N exponent by casting to integer the literal "1eN"
@@ -7138,6 +7179,9 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
 	 */
[tip: sched/core] sched/uclamp: Extend CPU's cgroup controller
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2480c093130f64ac3a410504fa8b3db1fc4b87ce
Gitweb:        https://git.kernel.org/tip/2480c093130f64ac3a410504fa8b3db1fc4b87ce
Author:        Patrick Bellasi
AuthorDate:    Thu, 22 Aug 2019 14:28:06 +01:00
Committer:     Ingo Molnar
CommitterDate: Tue, 03 Sep 2019 09:17:37 +02:00

sched/uclamp: Extend CPU's cgroup controller

The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.

The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able to drive
frequency selections based on actual task utilization. Moreover, the
utilization clamping support provides a mechanism to bias the frequency
selection operated by schedutil depending on constraints assigned to the
tasks currently RUNNABLE on a CPU.

Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu. This makes it possible
to better defined the actual computational power assigned to task
groups, thus improving the cgroup CPU bandwidth controller which is
currently based just on time constraints.

Extend the CPU controller with a couple of new attributes
uclamp.{min,max} which allow to enforce utilization boosting and capping
for all the tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be
  considered i.e. the RUNNABLE tasks of this group will run at least at
  a minimum frequency which corresponds to the uclamp.min utilization

- uclamp.max: defines the maximum utilization which should be
  considered i.e. the RUNNABLE tasks of this group will run up to a
  maximum frequency which corresponds to the uclamp.max utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depends on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its
     parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
relate updates.

Signed-off-by: Patrick Bellasi
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Michal Koutny
Acked-by: Tejun Heo
Cc: Alessio Balsini
Cc: Dietmar Eggemann
Cc: Joel Fernandes
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Morten Rasmussen
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Quentin Perret
Cc: Rafael J. Wysocki
Cc: Steve Muckle
Cc: Suren Baghdasaryan
Cc: Thomas Gleixner
Cc: Todd Kjos
Cc: Vincent Guittot
Cc: Viresh Kumar
Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bell...@arm.com
Signed-off-by: Ingo Molnar
---
 Documentation/admin-guide/cgroup-v2.rst |  34 +++-
 init/Kconfig                            |  22 ++-
 kernel/sched/core.c                     | 193 +++++++++++++++++++++++-
 kernel/sched/sched.h                    |   8 +-
 4 files changed, 253 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3b29005..5f1c266 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the freq