Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers
On Thu, Mar 23, 2017 at 08:38:43AM -0700, Dave Hansen wrote:
> On 03/22/2017 01:41 AM, Aaron Lu wrote:
> > On Wed, Mar 22, 2017 at 03:33:35PM +0900, Minchan Kim wrote:
> >> On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
> >>> Introduce a workqueue for all the free workers so that user can fine
> >>> tune how many workers can be active through sysfs interface: max_active.
> >>> More workers will normally lead to better performance, but too many can
> >>> cause severe lock contention.
> >>
> >> Let me ask a question.
> >>
> >> How well can workqueue distribute the jobs in multiple CPU?
> >
> > I would say it's good enough for my needs.
> > After all, it doesn't need many kworkers to achieve the 50% time
> > decrease: 2-4 kworkers for EP and 4-8 kworkers for EX are enough from
> > previously attached data.
>
> It's also worth noting that we'd like to look into increasing how
> scalable freeing pages to a given zone is.

Still on EX, I restricted the allocation to be only on node 1, with 120G
memory allocated there:

max_active          time          compared to base  lock from perf
base (no parallel)  3.81s ±3.3%   N/A               <1%
1                   3.10s ±7.7%   ↓18.6%            14.76%
2                   2.44s ±13.6%  ↓35.9%            36.95%
4                   2.07s ±13.6%  ↓45.6%            59.67%
8                   1.98s ±0.4%   ↓48.0%            62.59%
16                  2.01s ±2.4%   ↓47.2%            79.62%

If we can improve the scalability of freeing pages to a given zone, then
parallel free will be able to achieve more.

BTW, the lock is basically pgdat->lru_lock in release_pages and
zone->lock in free_pcppages_bulk:

  62.59%  62.59%  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
    37.17% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;
           free_pcppages_bulk;free_hot_cold_page;free_hot_cold_page_list;
           release_pages;free_pages_and_swap_cache;
           tlb_flush_mmu_free_batches;batch_free_work;process_one_work;
           worker_thread;kthread;ret_from_fork
    25.27% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;
           release_pages;free_pages_and_swap_cache;
           tlb_flush_mmu_free_batches;batch_free_work;process_one_work;
           worker_thread;kthread;ret_from_fork
Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers
On 03/22/2017 01:41 AM, Aaron Lu wrote:
> On Wed, Mar 22, 2017 at 03:33:35PM +0900, Minchan Kim wrote:
>> On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
>>> Introduce a workqueue for all the free workers so that user can fine
>>> tune how many workers can be active through sysfs interface: max_active.
>>> More workers will normally lead to better performance, but too many can
>>> cause severe lock contention.
>>
>> Let me ask a question.
>>
>> How well can workqueue distribute the jobs in multiple CPU?
>
> I would say it's good enough for my needs.
> After all, it doesn't need many kworkers to achieve the 50% time
> decrease: 2-4 kworkers for EP and 4-8 kworkers for EX are enough from
> previously attached data.

It's also worth noting that we'd like to look into increasing how
scalable freeing pages to a given zone is.
Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers
On Wed, Mar 22, 2017 at 09:43:04PM +0800, Aaron Lu wrote:
> On Wed, Mar 22, 2017 at 05:55:12PM +0900, Minchan Kim wrote:
> > On Wed, Mar 22, 2017 at 04:41:04PM +0800, Aaron Lu wrote:
> > > My understanding of the unbound workqueue is that it will create a
> > > thread pool for each node, versus each CPU as in the bound workqueue
> > > case, and use threads from the thread pool (creating threads if there
> > > are not enough) to do the work.
> >
> > Yes, that was my understanding too, so I read the code and found:
> >
> > insert_work:
> > ..
> >         if (__need_more_worker(pool))
> >                 wake_up_worker(pool);
> >
> > so I thought that if there is a running thread in that node, the
> > workqueue will not wake any other threads, so parallelism should be
> > max 2. AFAIK, if the work sleeps, the scheduler will spawn a new
> > worker thread, so the number of active workers could be large, but I
> > cannot see any significant sleepable point in that work (i.e.,
> > batch_free_work).
>
> Looks like worker_thread() will spawn new workers through manage_workers().
>
> Note that pool->nr_running will always be zero for an unbound workqueue
> and thus need_more_worker() will return true as long as there are queued
> work items in the pool.

Aha, that answers my question. Thanks a lot!
Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers
On Wed, Mar 22, 2017 at 05:55:12PM +0900, Minchan Kim wrote:
> On Wed, Mar 22, 2017 at 04:41:04PM +0800, Aaron Lu wrote:
> > My understanding of the unbound workqueue is that it will create a
> > thread pool for each node, versus each CPU as in the bound workqueue
> > case, and use threads from the thread pool (creating threads if there
> > are not enough) to do the work.
>
> Yes, that was my understanding too, so I read the code and found:
>
> insert_work:
> ..
>         if (__need_more_worker(pool))
>                 wake_up_worker(pool);
>
> so I thought that if there is a running thread in that node, the
> workqueue will not wake any other threads, so parallelism should be
> max 2. AFAIK, if the work sleeps, the scheduler will spawn a new worker
> thread, so the number of active workers could be large, but I cannot
> see any significant sleepable point in that work (i.e., batch_free_work).

Looks like worker_thread() will spawn new workers through manage_workers().

Note that pool->nr_running will always be zero for an unbound workqueue
and thus need_more_worker() will return true as long as there are queued
work items in the pool.

Thanks,
Aaron
Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers
On Wed, Mar 22, 2017 at 04:41:04PM +0800, Aaron Lu wrote:
> On Wed, Mar 22, 2017 at 03:33:35PM +0900, Minchan Kim wrote:
> > Hi,
> >
> > On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
> > > Introduce a workqueue for all the free workers so that user can fine
> > > tune how many workers can be active through sysfs interface: max_active.
> > > More workers will normally lead to better performance, but too many can
> > > cause severe lock contention.
> >
> > Let me ask a question.
> >
> > How well can workqueue distribute the jobs in multiple CPU?
>
> I would say it's good enough for my needs.
> After all, it doesn't need many kworkers to achieve the 50% time
> decrease: 2-4 kworkers for EP and 4-8 kworkers for EX are enough from
> previously attached data.
>
> > I don't ask about concurrency but parallelism.
> > I guess the benefit you are seeing comes from the parallelism and
> > for your goal, an unbound wq should spawn a thread per CPU and
> > do the work on every CPU. Does it work?
>
> I don't think an unbound workqueue will spawn a thread per CPU; that
> seems too high a cost for an unbound workqueue.
>
> My understanding of the unbound workqueue is that it will create a
> thread pool for each node, versus each CPU as in the bound workqueue
> case, and use threads from the thread pool (creating threads if there
> are not enough) to do the work.

Yes, that was my understanding too, so I read the code and found:

insert_work:
..
        if (__need_more_worker(pool))
                wake_up_worker(pool);

so I thought that if there is a running thread in that node, the
workqueue will not wake any other threads, so parallelism should be
max 2. AFAIK, if the work sleeps, the scheduler will spawn a new worker
thread, so the number of active workers could be large, but I cannot
see any significant sleepable point in that work (i.e., batch_free_work).

> I guess you want to ask if the unbound workqueue can spawn enough
> threads to do the job? From the output of 'vmstat 1' during the free()
> test, I can see some 70+ processes in runnable state when I didn't
> set an upper limit for max_active of the workqueue.

Hmm, it seems I was wrong. I am really curious how we can have 70+
active workers in that case. Could you explain what I am missing?

Thanks.

> Thanks,
> Aaron
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers
On Wed, Mar 22, 2017 at 03:33:35PM +0900, Minchan Kim wrote:
> Hi,
>
> On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
> > Introduce a workqueue for all the free workers so that user can fine
> > tune how many workers can be active through sysfs interface: max_active.
> > More workers will normally lead to better performance, but too many can
> > cause severe lock contention.
>
> Let me ask a question.
>
> How well can workqueue distribute the jobs in multiple CPU?

I would say it's good enough for my needs.
After all, it doesn't need many kworkers to achieve the 50% time
decrease: 2-4 kworkers for EP and 4-8 kworkers for EX are enough from
previously attached data.

> I don't ask about concurrency but parallelism.
> I guess the benefit you are seeing comes from the parallelism and
> for your goal, an unbound wq should spawn a thread per CPU and
> do the work on every CPU. Does it work?

I don't think an unbound workqueue will spawn a thread per CPU; that
seems too high a cost for an unbound workqueue.

My understanding of the unbound workqueue is that it will create a
thread pool for each node, versus each CPU as in the bound workqueue
case, and use threads from the thread pool (creating threads if there
are not enough) to do the work.

I guess you want to ask if the unbound workqueue can spawn enough
threads to do the job? From the output of 'vmstat 1' during the free()
test, I can see some 70+ processes in runnable state when I didn't
set an upper limit for max_active of the workqueue.

Thanks,
Aaron
Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers
Hi,

On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
> Introduce a workqueue for all the free workers so that user can fine
> tune how many workers can be active through sysfs interface: max_active.
> More workers will normally lead to better performance, but too many can
> cause severe lock contention.

Let me ask a question.

How well can workqueue distribute the jobs in multiple CPU?
I don't ask about concurrency but parallelism.
I guess the benefit you are seeing comes from the parallelism and
for your goal, an unbound wq should spawn a thread per CPU and
do the work on every CPU. Does it work?

>
> Note that since the zone lock is global, the workqueue is also global
> for all processes, i.e. if we set max_active to 8, we will have at most
> 8 workers active for all processes that are doing munmap()/exit()/etc.
>
> Signed-off-by: Aaron Lu
> ---
>  mm/memory.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 001c7720d773..19b25bb5f45b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -253,6 +253,19 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
>  	__tlb_reset_range(tlb);
>  }
>
> +static struct workqueue_struct *batch_free_wq;
> +static int __init batch_free_wq_init(void)
> +{
> +	batch_free_wq = alloc_workqueue("batch_free_wq",
> +					WQ_UNBOUND | WQ_SYSFS, 0);
> +	if (!batch_free_wq) {
> +		pr_warn("failed to create workqueue batch_free_wq\n");
> +		return -ENOMEM;
> +	}
> +	return 0;
> +}
> +subsys_initcall(batch_free_wq_init);
> +
>  static void tlb_flush_mmu_free_batches(struct mmu_gather_batch *batch_start,
>  				       bool free_batch_page)
>  {
> @@ -306,7 +319,7 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
>  		batch_free->batch_start = tlb->local.next;
>  		INIT_WORK(&batch_free->work, batch_free_work);
>  		list_add_tail(&batch_free->list, &tlb->worker_list);
> -		queue_work(system_unbound_wq, &batch_free->work);
> +		queue_work(batch_free_wq, &batch_free->work);
>
>  		tlb->batch_count = 0;
>  		tlb->local.next = NULL;
> --
> 2.7.4