Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On 4/12/21 3:05 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>> With the recent introduction of the new slab memory controller, we
>> eliminate the need for having separate kmemcaches for each memory
>> cgroup and reduce overall kernel memory usage. However, we also add
>> additional memory accounting overhead to each call of
>> kmem_cache_alloc() and kmem_cache_free().
>>
>> Workloads that make a lot of kmemcache allocations and
>> de-allocations may experience a performance regression, as
>> illustrated in [1].
>>
>> With a simple kernel module that performs a repeated loop of
>> 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a
>> 64-byte object at module init, the execution times to load the
>> kernel module with and without memory accounting were:
>>
>>   with accounting = 6.798s
>>   w/o accounting  = 1.758s
>>
>> That is an increase of 5.04s (287%). With this patchset applied, the
>> execution time became 4.254s. So the memory accounting overhead is
>> now 2.496s, a 50% reduction.
>
> Btw, there were two recent independent reports about benchmark
> regressions caused by the introduction of the per-object accounting:
> 1) Xing reported a hackbench regression:
>    https://lkml.org/lkml/2021/1/13/1277
> 2) Masayoshi reported a pgbench regression:
>    https://www.spinics.net/lists/linux-mm/msg252540.html
>
> I wonder if you can run them (or at least one) and attach the result
> to the series? It would be very helpful.

Actually, it was a bug report filed by Masayoshi-san that prompted me to
work on reducing the memory accounting overhead. He is also on the cc
line and so is aware of it. I will cc Xing on my v2 patch.

Cheers,
Longman
Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On 4/12/21 1:47 PM, Roman Gushchin wrote:
> On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:
>> On 4/9/21 9:51 PM, Roman Gushchin wrote:
>>> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>>>> With the recent introduction of the new slab memory controller, we
>>>> eliminate the need for having separate kmemcaches for each memory
>>>> cgroup and reduce overall kernel memory usage. However, we also add
>>>> additional memory accounting overhead to each call of
>>>> kmem_cache_alloc() and kmem_cache_free().
>>>>
>>>> Workloads that make a lot of kmemcache allocations and
>>>> de-allocations may experience a performance regression, as
>>>> illustrated in [1].
>>>>
>>>> With a simple kernel module that performs a repeated loop of
>>>> 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a
>>>> 64-byte object at module init, the execution times to load the
>>>> kernel module with and without memory accounting were:
>>>>
>>>>   with accounting = 6.798s
>>>>   w/o accounting  = 1.758s
>>>>
>>>> That is an increase of 5.04s (287%). With this patchset applied,
>>>> the execution time became 4.254s. So the memory accounting overhead
>>>> is now 2.496s, a 50% reduction.
>>> Hi Waiman!
>>>
>>> Thank you for working on it, it's indeed very useful!
>>> A couple of questions:
>>> 1) did your config include lockdep or not?
>> The test kernel is based on a production kernel config, so lockdep
>> isn't enabled.
>>> 2) do you have a (rough) estimation of how much each change
>>> contributes to the overall reduction?
>> I now have a better breakdown of the effect of the individual
>> patches. I reran the benchmarking module with turbo-boosting disabled
>> to reduce run-to-run variation. The execution times were:
>>
>>   Before patch:  time = 10.800s (with memory accounting), 2.848s
>>                  (w/o accounting), overhead = 7.952s
>>   After patch 2: time = 9.140s, overhead = 6.292s
>>   After patch 3: time = 7.641s, overhead = 4.793s
>>   After patch 5: time = 6.801s, overhead = 3.953s
> Thank you!
> If there will be v2, I'd include this information in the commit logs.

Yes, I am planning to send out a v2 with this information in the cover
letter. I am just waiting a bit to see if there is more feedback.

>> Patches 1 & 4 are preparatory patches that should not affect
>> performance.
>>
>> So the memory accounting overhead was reduced by about half.

BTW, the benchmark that I used represents kind of a best-case behavior,
as all updates go to the percpu stocks. Real workloads will likely have
a certain number of updates to the memcg charges and vmstats, so the
performance benefit will be less.

Cheers,
Longman
Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> With the recent introduction of the new slab memory controller, we
> eliminate the need for having separate kmemcaches for each memory
> cgroup and reduce overall kernel memory usage. However, we also add
> additional memory accounting overhead to each call of
> kmem_cache_alloc() and kmem_cache_free().
>
> Workloads that make a lot of kmemcache allocations and de-allocations
> may experience a performance regression, as illustrated in [1].
>
> With a simple kernel module that performs a repeated loop of
> 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a
> 64-byte object at module init, the execution times to load the kernel
> module with and without memory accounting were:
>
>   with accounting = 6.798s
>   w/o accounting  = 1.758s
>
> That is an increase of 5.04s (287%). With this patchset applied, the
> execution time became 4.254s. So the memory accounting overhead is now
> 2.496s, a 50% reduction.

Btw, there were two recent independent reports about benchmark
regressions caused by the introduction of the per-object accounting:
1) Xing reported a hackbench regression:
   https://lkml.org/lkml/2021/1/13/1277
2) Masayoshi reported a pgbench regression:
   https://www.spinics.net/lists/linux-mm/msg252540.html

I wonder if you can run them (or at least one) and attach the result to
the series? It would be very helpful.

Thank you!
Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:
> On 4/9/21 9:51 PM, Roman Gushchin wrote:
>> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>>> With the recent introduction of the new slab memory controller, we
>>> eliminate the need for having separate kmemcaches for each memory
>>> cgroup and reduce overall kernel memory usage. However, we also add
>>> additional memory accounting overhead to each call of
>>> kmem_cache_alloc() and kmem_cache_free().
>>>
>>> Workloads that make a lot of kmemcache allocations and
>>> de-allocations may experience a performance regression, as
>>> illustrated in [1].
>>>
>>> With a simple kernel module that performs a repeated loop of
>>> 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a
>>> 64-byte object at module init, the execution times to load the
>>> kernel module with and without memory accounting were:
>>>
>>>   with accounting = 6.798s
>>>   w/o accounting  = 1.758s
>>>
>>> That is an increase of 5.04s (287%). With this patchset applied, the
>>> execution time became 4.254s. So the memory accounting overhead is
>>> now 2.496s, a 50% reduction.
>> Hi Waiman!
>>
>> Thank you for working on it, it's indeed very useful!
>> A couple of questions:
>> 1) did your config include lockdep or not?
> The test kernel is based on a production kernel config, so lockdep
> isn't enabled.
>> 2) do you have a (rough) estimation of how much each change
>> contributes to the overall reduction?
> I now have a better breakdown of the effect of the individual patches.
> I reran the benchmarking module with turbo-boosting disabled to reduce
> run-to-run variation. The execution times were:
>
>   Before patch:  time = 10.800s (with memory accounting), 2.848s
>                  (w/o accounting), overhead = 7.952s
>   After patch 2: time = 9.140s, overhead = 6.292s
>   After patch 3: time = 7.641s, overhead = 4.793s
>   After patch 5: time = 6.801s, overhead = 3.953s

Thank you!
If there will be v2, I'd include this information in the commit logs.

> Patches 1 & 4 are preparatory patches that should not affect
> performance.
>
> So the memory accounting overhead was reduced by about half.

This is really great! Thanks!
Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On 4/9/21 9:51 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>> With the recent introduction of the new slab memory controller, we
>> eliminate the need for having separate kmemcaches for each memory
>> cgroup and reduce overall kernel memory usage. However, we also add
>> additional memory accounting overhead to each call of
>> kmem_cache_alloc() and kmem_cache_free().
>>
>> Workloads that make a lot of kmemcache allocations and
>> de-allocations may experience a performance regression, as
>> illustrated in [1].
>>
>> With a simple kernel module that performs a repeated loop of
>> 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a
>> 64-byte object at module init, the execution times to load the
>> kernel module with and without memory accounting were:
>>
>>   with accounting = 6.798s
>>   w/o accounting  = 1.758s
>>
>> That is an increase of 5.04s (287%). With this patchset applied, the
>> execution time became 4.254s. So the memory accounting overhead is
>> now 2.496s, a 50% reduction.
> Hi Waiman!
>
> Thank you for working on it, it's indeed very useful!
> A couple of questions:
> 1) did your config include lockdep or not?

The test kernel is based on a production kernel config, so lockdep isn't
enabled.

> 2) do you have a (rough) estimation of how much each change contributes
> to the overall reduction?

I now have a better breakdown of the effect of the individual patches. I
reran the benchmarking module with turbo-boosting disabled to reduce
run-to-run variation. The execution times were:

  Before patch:  time = 10.800s (with memory accounting), 2.848s
                 (w/o accounting), overhead = 7.952s
  After patch 2: time = 9.140s, overhead = 6.292s
  After patch 3: time = 7.641s, overhead = 4.793s
  After patch 5: time = 6.801s, overhead = 3.953s

Patches 1 & 4 are preparatory patches that should not affect
performance.

So the memory accounting overhead was reduced by about half.

Cheers,
Longman
Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> With the recent introduction of the new slab memory controller, we
> eliminate the need for having separate kmemcaches for each memory
> cgroup and reduce overall kernel memory usage. However, we also add
> additional memory accounting overhead to each call of
> kmem_cache_alloc() and kmem_cache_free().
>
> Workloads that make a lot of kmemcache allocations and de-allocations
> may experience a performance regression, as illustrated in [1].
>
> With a simple kernel module that performs a repeated loop of
> 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a
> 64-byte object at module init, the execution times to load the kernel
> module with and without memory accounting were:
>
>   with accounting = 6.798s
>   w/o accounting  = 1.758s
>
> That is an increase of 5.04s (287%). With this patchset applied, the
> execution time became 4.254s. So the memory accounting overhead is now
> 2.496s, a 50% reduction.

Hi Waiman!

Thank you for working on it, it's indeed very useful!
A couple of questions:
1) did your config include lockdep or not?
2) do you have a (rough) estimation of how much each change contributes
to the overall reduction?

Thanks!

> It was found that a major part of the memory accounting overhead
> is caused by the local_irq_save()/local_irq_restore() sequences in
> updating the local stock charge bytes and the vmstat array, at least
> on x86 systems. There are two such sequences in kmem_cache_alloc() and
> two in kmem_cache_free(). This patchset tries to reduce the use of
> such sequences as much as possible; in fact, it eliminates them in the
> common case. Another part of this patchset caches vmstat data updates
> in the local stock as well, which also helps.
>
> [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
>
> Waiman Long (5):
>   mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
>   mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
>   mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
>   mm/memcg: Separate out object stock data into its own struct
>   mm/memcg: Optimize user context object stock access
>
>  include/linux/memcontrol.h |  14 ++-
>  mm/memcontrol.c            | 198 -
>  mm/percpu.c                |   9 +-
>  mm/slab.h                  |  32 +++---
>  4 files changed, 195 insertions(+), 58 deletions(-)
>
> --
> 2.18.1
[PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

Workloads that make a lot of kmemcache allocations and de-allocations
may experience a performance regression, as illustrated in [1].

With a simple kernel module that performs a repeated loop of
100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte
object at module init, the execution times to load the kernel module
with and without memory accounting were:

  with accounting = 6.798s
  w/o accounting  = 1.758s

That is an increase of 5.04s (287%). With this patchset applied, the
execution time became 4.254s. So the memory accounting overhead is now
2.496s, a 50% reduction.

It was found that a major part of the memory accounting overhead is
caused by the local_irq_save()/local_irq_restore() sequences in
updating the local stock charge bytes and the vmstat array, at least on
x86 systems. There are two such sequences in kmem_cache_alloc() and two
in kmem_cache_free(). This patchset tries to reduce the use of such
sequences as much as possible; in fact, it eliminates them in the
common case. Another part of this patchset caches vmstat data updates
in the local stock as well, which also helps.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h |  14 ++-
 mm/memcontrol.c            | 198 -
 mm/percpu.c                |   9 +-
 mm/slab.h                  |  32 +++---
 4 files changed, 195 insertions(+), 58 deletions(-)

--
2.18.1