Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-12-27 Thread kemi



On 2018/12/28 10:55 AM, Waiman Long wrote:
> On 12/27/2018 08:31 PM, Wang, Kemi wrote:
>> Hi, Waiman
>>Did you post that patch? Let's see if it helps.
> 
> I did post the patch a while ago. I will need to rebase it to a new
> baseline. Will do that in a week or 2.
> 

OK. I will take a look at it and try to rebase it on Shi's patch to see if 
the regression can be fixed.
May I know where I can get that patch? I didn't find it in my inbox. Thanks.

> -Longman
> 
>>
>> -Original Message-
>> From: LKP [mailto:lkp-boun...@lists.01.org] On Behalf Of Waiman Long
>> Sent: Tuesday, November 6, 2018 6:40 AM
>> To: Linus Torvalds ; vba...@suse.cz; 
>> Davidlohr Bueso 
>> Cc: yang@linux.alibaba.com; Linux Kernel Mailing List 
>> ; Matthew Wilcox ; 
>> mho...@kernel.org; Colin King ; Andrew Morton 
>> ; lduf...@linux.vnet.ibm.com; l...@01.org; 
>> kirill.shute...@linux.intel.com
>> Subject: Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% 
>> regression
>>
>> On 11/05/2018 05:14 PM, Linus Torvalds wrote:
>>> On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka  wrote:
>>>> I didn't spot an obvious mistake in the patch itself, so it looks
>>>> like some bad interaction between scheduler and the mmap downgrade?
>>> I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
>>> the downgrade.
>>>
>>> It looks like the benchmark used to be basically CPU-bound, at about
>>> 800% CPU, and now it's somewhere in the 200% CPU region:
>>>
>>>   will-it-scale.time.percent_of_cpu_this_job_got
>>>
>>>   800 +-+---+
>>>   |.+.+.+.+.+.+.+.  .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
>>>   700 +-+ +.+   |
>>>   | |
>>>   600 +-+   |
>>>   | |
>>>   500 +-+   |
>>>   | |
>>>   400 +-+   |
>>>   | |
>>>   300 +-+   |
>>>   | |
>>>   200 O-O O O O OO  |
>>>   |   O O O  O O O O   O O O O O O O O O O O|
>>>   100 +-+---+
>>>
>>> which sounds like the downgrade really messes with the "spin waiting
>>> for lock" logic.
>>>
>>> I'm thinking it's the "wake up waiter" logic that has some bad
>>> interaction with spinning, and breaks that whole optimization.
>>>
>>> Adding Waiman and Davidlohr to the participants, because they seem to
>>> be the obvious experts in this area.
>>>
>>> Linus
>> Optimistic spinning on rwsem is done only by writers spinning on a
>> writer-owned rwsem. If a write-lock is downgraded to a read-lock, all
>> the spinning waiters will quit. That may explain the drop in CPU
>> utilization. I do have an old patch that enables a certain amount of
>> reader spinning, which may help the situation. I can rebase that and send
>> it out for review if people are interested.
>>
>> Cheers,
>> Longman
>>
>>
>> ___
>> LKP mailing list
>> l...@lists.01.org
>> https://lists.01.org/mailman/listinfo/lkp
> 
> 


RE: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-12-27 Thread Wang, Kemi
Hi, Waiman
   Did you post that patch? Let's see if it helps.

-Original Message-
From: LKP [mailto:lkp-boun...@lists.01.org] On Behalf Of Waiman Long
Sent: Tuesday, November 6, 2018 6:40 AM
To: Linus Torvalds ; vba...@suse.cz; Davidlohr 
Bueso 
Cc: yang@linux.alibaba.com; Linux Kernel Mailing List 
; Matthew Wilcox ; 
mho...@kernel.org; Colin King ; Andrew Morton 
; lduf...@linux.vnet.ibm.com; l...@01.org; 
kirill.shute...@linux.intel.com
Subject: Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% 
regression

On 11/05/2018 05:14 PM, Linus Torvalds wrote:
> On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka  wrote:
>> I didn't spot an obvious mistake in the patch itself, so it looks
>> like some bad interaction between scheduler and the mmap downgrade?
> I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
> the downgrade.
>
> It looks like the benchmark used to be basically CPU-bound, at about
> 800% CPU, and now it's somewhere in the 200% CPU region:
>
>   will-it-scale.time.percent_of_cpu_this_job_got
>
>   800 +-+---+
>   |.+.+.+.+.+.+.+.  .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
>   700 +-+ +.+   |
>   | |
>   600 +-+   |
>   | |
>   500 +-+   |
>   | |
>   400 +-+   |
>   | |
>   300 +-+   |
>   | |
>   200 O-O O O O OO  |
>   |   O O O  O O O O   O O O O O O O O O O O|
>   100 +-+---+
>
> which sounds like the downgrade really messes with the "spin waiting
> for lock" logic.
>
> I'm thinking it's the "wake up waiter" logic that has some bad
> interaction with spinning, and breaks that whole optimization.
>
> Adding Waiman and Davidlohr to the participants, because they seem to
> be the obvious experts in this area.
>
> Linus

Optimistic spinning on rwsem is done only by writers spinning on a
writer-owned rwsem. If a write-lock is downgraded to a read-lock, all
the spinning waiters will quit. That may explain the drop in CPU
utilization. I do have an old patch that enables a certain amount of
reader spinning, which may help the situation. I can rebase that and send
it out for review if people are interested.
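
To make the downgrade scenario concrete, below is a minimal sketch of the
pattern being discussed, assuming the mmap_sem/rwsem API names of that era;
it is an illustration, not the actual 9bc8039e71 patch:

/*
 * Illustrative sketch only: a writer takes mmap_sem for write, then
 * downgrades it to read for the long-running part of the work.  Once the
 * rwsem becomes reader-owned, waiting writers stop optimistically
 * spinning and go to sleep, which matches the drop in CPU utilization.
 */
static void shrink_mapping_sketch(struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);	/* exclusive: update the VMA tree */
	/* ... detach/adjust VMAs ... */
	downgrade_write(&mm->mmap_sem);	/* writer -> reader, atomically */
	/* ... free pages, flush TLBs, etc., under the read lock ... */
	up_read(&mm->mmap_sem);
}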

Cheers,
Longman


___
LKP mailing list
l...@lists.01.org
https://lists.01.org/mailman/listinfo/lkp


Re: [LKP] [lkp-robot] [brd] 316ba5736c: aim7.jobs-per-min -11.2% regression

2018-12-18 Thread kemi
Hi, All
  Do we have a special reason to keep this patch (316ba5736c9: "brd: Mark as 
non-rotational"), which leads to a performance regression when BRD is used as 
a disk on btrfs?
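
For context, the commit in question is tiny: it marks the ramdisk's request
queue as non-rotational, the same hint an SSD gives, so rotational-disk
heuristics are skipped. A rough sketch of what it amounts to, reconstructed
from the commit title rather than the actual diff:

/*
 * Rough sketch only (reconstructed from the commit title, not quoted from
 * the real patch): flag the brd request queue as non-rotational.
 */
static void brd_mark_nonrot_sketch(struct request_queue *q)
{
	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
}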

On 2018/7/10 1:27 PM, kemi wrote:
> Hi, SeongJae
>   Do you have any input for this regression? thanks
> 
> On 2018-06-04 13:52, kernel test robot wrote:
>>
>> Greeting,
>>
>> FYI, we noticed a -11.2% regression of aim7.jobs-per-min due to commit:
>>
>>
>> commit: 316ba5736c9caa5dbcd84085989862d2df57431d ("brd: Mark as 
>> non-rotational")
>> https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git 
>> for-4.18/block
>>
>> in testcase: aim7
>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 
>> 384G memory
>> with following parameters:
>>
>>  disk: 1BRD_48G
>>  fs: btrfs
>>  test: disk_rw
>>  load: 1500
>>  cpufreq_governor: performance
>>
>> test-description: AIM7 is a traditional UNIX system level benchmark suite 
>> which is used to test and measure the performance of multiuser system.
>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>>
>>
>>
>> Details are as below:
>> -->
>>
>> =
>> compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
>>   
>> gcc-7/performance/1BRD_48G/btrfs/x86_64-rhel-7.2/1500/debian-x86_64-2016-08-31.cgz/lkp-ivb-ep01/disk_rw/aim7
>>
>> commit: 
>>   522a777566 ("block: consolidate struct request timestamp fields")
>>   316ba5736c ("brd: Mark as non-rotational")
>>
>> 522a777566f56696 316ba5736c9caa5dbcd8408598 
>>  -- 
>>  %stddev %change %stddev
>>  \  |\  
>>  28321   -11.2%  25147aim7.jobs-per-min
>> 318.19   +12.6% 358.23aim7.time.elapsed_time
>> 318.19   +12.6% 358.23aim7.time.elapsed_time.max
>>1437526 ±  2% +14.6%1646849 ±  2%  
>> aim7.time.involuntary_context_switches
>>  11986   +14.2%  13691aim7.time.system_time
>>  73.06 ±  2%  -3.6%  70.43aim7.time.user_time
>>2449470 ±  2% -25.0%1837521 ±  4%  
>> aim7.time.voluntary_context_switches
>>  20.25 ± 58%   +1681.5% 360.75 ±109%  numa-meminfo.node1.Mlocked
>> 456062   -16.3% 381859softirqs.SCHED
>>   9015 ±  7% -21.3%   7098 ± 22%  meminfo.CmaFree
>>  47.50 ± 58%   +1355.8% 691.50 ± 92%  meminfo.Mlocked
>>   5.24 ±  3%  -1.23.99 ±  2%  mpstat.cpu.idle%
>>   0.61 ±  2%  -0.10.52 ±  2%  mpstat.cpu.usr%
>>  16627   +12.8%  18762 ±  4%  slabinfo.Acpi-State.active_objs
>>  16627   +12.9%  18775 ±  4%  slabinfo.Acpi-State.num_objs
>>  57.00 ±  2% +17.5%  67.00vmstat.procs.r
>>  20936   -24.8%  15752 ±  2%  vmstat.system.cs
>>  45474-1.7%  44681vmstat.system.in
>>   6.50 ± 59%   +1157.7%  81.75 ± 75%  numa-vmstat.node0.nr_mlock
>> 242870 ±  3% +13.2% 274913 ±  7%  numa-vmstat.node0.nr_written
>>   2278 ±  7% -22.6%   1763 ± 21%  numa-vmstat.node1.nr_free_cma
>>   4.75 ± 58%   +1789.5%  89.75 ±109%  numa-vmstat.node1.nr_mlock
>>   88018135 ±  3% -48.9%   44980457 ±  7%  cpuidle.C1.time
>>1398288 ±  3% -51.1% 683493 ±  9%  cpuidle.C1.usage
>>3499814 ±  2% -38.5%2153158 ±  5%  cpuidle.C1E.time
>>  52722 ±  4% -45.6%  28692 ±  6%  cpuidle.C1E.usage
>>9865857 ±  3% -40.1%5905155 ±  5%  cpuidle.C3.time
>>  69656 ±  2% -42.6%  39990 ±  5%  cpuidle.C3.usage
>> 590856 ±  2% -12.3% 517910cpuidle.C6.usage
>>  46160 ±  7% -53.7%  21372 ± 11%  cpuidle.POLL.time
>>   1716 ±  7% -46.6% 916.25 ± 14%  cpuidle.POLL.usage
>> 197656+4.1% 205732proc-vmstat.nr_active_file
>> 191867+4.1% 199647proc-vmstat.nr_dirty
>> 509282+1.6% 517318proc-vmstat.nr_file_pages
>>   2282 ±  8% -24.4%   1725 ± 22%  proc-vmstat.nr_free_cma
>> 357.50   +10.6% 395.25 ±  2%  proc-vmstat.nr_inactive_file
>>  11.50

RE: [LKP] [lkp-robot] [brd] 316ba5736c: aim7.jobs-per-min -11.2% regression

2018-07-26 Thread Wang, Kemi
Hi, SeongJae
Any update or any info you need from my side?

-Original Message-
From: SeongJae Park [mailto:sj38.p...@gmail.com] 
Sent: Wednesday, July 11, 2018 12:53 AM
To: Wang, Kemi 
Cc: Ye, Xiaolong ; ax...@kernel.dk; ax...@fb.com; 
l...@01.org; linux-kernel@vger.kernel.org
Subject: Re: [LKP] [lkp-robot] [brd] 316ba5736c: aim7.jobs-per-min -11.2% 
regression

Oops, I only found this mail now. I will look into this, though it 
will take some time because I will not be in the office this week.


Thanks,
SeongJae Park
On Tue, Jul 10, 2018 at 1:30 AM kemi  wrote:
>
> Hi, SeongJae
>   Do you have any input for this regression? thanks
>
> On 2018-06-04 13:52, kernel test robot wrote:
> >
> > Greeting,
> >
> > FYI, we noticed a -11.2% regression of aim7.jobs-per-min due to commit:
> >
> >
> > commit: 316ba5736c9caa5dbcd84085989862d2df57431d ("brd: Mark as 
> > non-rotational") 
> > https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git 
> > for-4.18/block
> >
> > in testcase: aim7
> > on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 
> > 3.00GHz with 384G memory with following parameters:
> >
> >   disk: 1BRD_48G
> >   fs: btrfs
> >   test: disk_rw
> >   load: 1500
> >   cpufreq_governor: performance
> >
> > test-description: AIM7 is a traditional UNIX system level benchmark suite 
> > which is used to test and measure the performance of multiuser system.
> > test-url: 
> > https://sourceforge.net/projects/aimbench/files/aim-suite7/
> >
> >
> >
> > Details are as below:
> > -->
> >
> > 
> > =
> > compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
> >   
> > gcc-7/performance/1BRD_48G/btrfs/x86_64-rhel-7.2/1500/debian-x86_64-
> > 2016-08-31.cgz/lkp-ivb-ep01/disk_rw/aim7
> >
> > commit:
> >   522a777566 ("block: consolidate struct request timestamp fields")
> >   316ba5736c ("brd: Mark as non-rotational")
> >
> > 522a777566f56696 316ba5736c9caa5dbcd8408598
> >  --
> >  %stddev %change %stddev
> >  \  |\
> >  28321   -11.2%  25147aim7.jobs-per-min
> > 318.19   +12.6% 358.23aim7.time.elapsed_time
> > 318.19   +12.6% 358.23aim7.time.elapsed_time.max
> >1437526 ±  2% +14.6%1646849 ±  2%  
> > aim7.time.involuntary_context_switches
> >  11986   +14.2%  13691aim7.time.system_time
> >  73.06 ±  2%  -3.6%  70.43aim7.time.user_time
> >2449470 ±  2% -25.0%1837521 ±  4%  
> > aim7.time.voluntary_context_switches
> >  20.25 ± 58%   +1681.5% 360.75 ±109%  numa-meminfo.node1.Mlocked
> > 456062   -16.3% 381859softirqs.SCHED
> >   9015 ±  7% -21.3%   7098 ± 22%  meminfo.CmaFree
> >  47.50 ± 58%   +1355.8% 691.50 ± 92%  meminfo.Mlocked
> >   5.24 ±  3%  -1.23.99 ±  2%  mpstat.cpu.idle%
> >   0.61 ±  2%  -0.10.52 ±  2%  mpstat.cpu.usr%
> >  16627   +12.8%  18762 ±  4%  
> > slabinfo.Acpi-State.active_objs
> >  16627   +12.9%  18775 ±  4%  slabinfo.Acpi-State.num_objs
> >  57.00 ±  2% +17.5%  67.00vmstat.procs.r
> >  20936   -24.8%  15752 ±  2%  vmstat.system.cs
> >  45474-1.7%  44681vmstat.system.in
> >   6.50 ± 59%   +1157.7%  81.75 ± 75%  numa-vmstat.node0.nr_mlock
> > 242870 ±  3% +13.2% 274913 ±  7%  numa-vmstat.node0.nr_written
> >   2278 ±  7% -22.6%   1763 ± 21%  numa-vmstat.node1.nr_free_cma
> >   4.75 ± 58%   +1789.5%  89.75 ±109%  numa-vmstat.node1.nr_mlock
> >   88018135 ±  3% -48.9%   44980457 ±  7%  cpuidle.C1.time
> >1398288 ±  3% -51.1% 683493 ±  9%  cpuidle.C1.usage
> >3499814 ±  2% -38.5%2153158 ±  5%  cpuidle.C1E.time
> >  52722 ±  4% -45.6%  28692 ±  6%  cpuidle.C1E.usage
> >9865857 ±  3% -40.1%5905155 ±  5%  cpuidle.C3.time
> >  69656 ±  2% -42.6%  39990 ±  5%  cpuidle.C3.usage
> > 590856 ±  2% -12.3% 517910cpuidle.C6.usage
> >  46160 ±  7% -53.7%  21372 ± 11%  cpuidle.POLL

Re: [LKP] [lkp-robot] [brd] 316ba5736c: aim7.jobs-per-min -11.2% regression

2018-07-09 Thread kemi
Hi, SeongJae
  Do you have any input for this regression? thanks

On 2018-06-04 13:52, kernel test robot wrote:
> 
> Greeting,
> 
> FYI, we noticed a -11.2% regression of aim7.jobs-per-min due to commit:
> 
> 
> commit: 316ba5736c9caa5dbcd84085989862d2df57431d ("brd: Mark as 
> non-rotational")
> https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git 
> for-4.18/block
> 
> in testcase: aim7
> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 
> 384G memory
> with following parameters:
> 
>   disk: 1BRD_48G
>   fs: btrfs
>   test: disk_rw
>   load: 1500
>   cpufreq_governor: performance
> 
> test-description: AIM7 is a traditional UNIX system level benchmark suite 
> which is used to test and measure the performance of multiuser system.
> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
> 
> 
> 
> Details are as below:
> -->
> 
> =
> compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
>   
> gcc-7/performance/1BRD_48G/btrfs/x86_64-rhel-7.2/1500/debian-x86_64-2016-08-31.cgz/lkp-ivb-ep01/disk_rw/aim7
> 
> commit: 
>   522a777566 ("block: consolidate struct request timestamp fields")
>   316ba5736c ("brd: Mark as non-rotational")
> 
> 522a777566f56696 316ba5736c9caa5dbcd8408598 
>  -- 
>  %stddev %change %stddev
>  \  |\  
>  28321   -11.2%  25147aim7.jobs-per-min
> 318.19   +12.6% 358.23aim7.time.elapsed_time
> 318.19   +12.6% 358.23aim7.time.elapsed_time.max
>1437526 ±  2% +14.6%1646849 ±  2%  
> aim7.time.involuntary_context_switches
>  11986   +14.2%  13691aim7.time.system_time
>  73.06 ±  2%  -3.6%  70.43aim7.time.user_time
>2449470 ±  2% -25.0%1837521 ±  4%  
> aim7.time.voluntary_context_switches
>  20.25 ± 58%   +1681.5% 360.75 ±109%  numa-meminfo.node1.Mlocked
> 456062   -16.3% 381859softirqs.SCHED
>   9015 ±  7% -21.3%   7098 ± 22%  meminfo.CmaFree
>  47.50 ± 58%   +1355.8% 691.50 ± 92%  meminfo.Mlocked
>   5.24 ±  3%  -1.23.99 ±  2%  mpstat.cpu.idle%
>   0.61 ±  2%  -0.10.52 ±  2%  mpstat.cpu.usr%
>  16627   +12.8%  18762 ±  4%  slabinfo.Acpi-State.active_objs
>  16627   +12.9%  18775 ±  4%  slabinfo.Acpi-State.num_objs
>  57.00 ±  2% +17.5%  67.00vmstat.procs.r
>  20936   -24.8%  15752 ±  2%  vmstat.system.cs
>  45474-1.7%  44681vmstat.system.in
>   6.50 ± 59%   +1157.7%  81.75 ± 75%  numa-vmstat.node0.nr_mlock
> 242870 ±  3% +13.2% 274913 ±  7%  numa-vmstat.node0.nr_written
>   2278 ±  7% -22.6%   1763 ± 21%  numa-vmstat.node1.nr_free_cma
>   4.75 ± 58%   +1789.5%  89.75 ±109%  numa-vmstat.node1.nr_mlock
>   88018135 ±  3% -48.9%   44980457 ±  7%  cpuidle.C1.time
>1398288 ±  3% -51.1% 683493 ±  9%  cpuidle.C1.usage
>3499814 ±  2% -38.5%2153158 ±  5%  cpuidle.C1E.time
>  52722 ±  4% -45.6%  28692 ±  6%  cpuidle.C1E.usage
>9865857 ±  3% -40.1%5905155 ±  5%  cpuidle.C3.time
>  69656 ±  2% -42.6%  39990 ±  5%  cpuidle.C3.usage
> 590856 ±  2% -12.3% 517910cpuidle.C6.usage
>  46160 ±  7% -53.7%  21372 ± 11%  cpuidle.POLL.time
>   1716 ±  7% -46.6% 916.25 ± 14%  cpuidle.POLL.usage
> 197656+4.1% 205732proc-vmstat.nr_active_file
> 191867+4.1% 199647proc-vmstat.nr_dirty
> 509282+1.6% 517318proc-vmstat.nr_file_pages
>   2282 ±  8% -24.4%   1725 ± 22%  proc-vmstat.nr_free_cma
> 357.50   +10.6% 395.25 ±  2%  proc-vmstat.nr_inactive_file
>  11.50 ± 58%   +1397.8% 172.25 ± 93%  proc-vmstat.nr_mlock
> 970355 ±  4% +14.6%549 ±  8%  proc-vmstat.nr_written
> 197984+4.1% 206034proc-vmstat.nr_zone_active_file
> 357.50   +10.6% 395.25 ±  2%  
> proc-vmstat.nr_zone_inactive_file
> 192282+4.1% 200126
> proc-vmstat.nr_zone_write_pending
>7901465 ±  3% -14.0%6795016 ± 16%  proc-vmstat.pgalloc_movable
> 886101   +10.2% 976329proc-vmstat.pgfault
>  2.169e+12   +15.2%  2.497e+12perf-stat.branch-instructions
>   0.41-0.10.35perf-stat.branch-miss-rate%
>  31.19 ±  2%  +1.6   32.82perf-stat.cache-miss-rate%
>  9.116e+09+8.3%  9.869e+09perf-stat.cache-misses
>  

RE: [PATCH v11 00/26] Speculative page faults

2018-05-28 Thread Wang, Kemi
A full run would take one or two weeks, depending on the resources available. Could 
you pick some of them up, e.g. the ones that show a performance regression?

-Original Message-
From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of 
Laurent Dufour
Sent: Monday, May 28, 2018 4:55 PM
To: Song, HaiyanX 
Cc: a...@linux-foundation.org; mho...@kernel.org; pet...@infradead.org; 
kir...@shutemov.name; a...@linux.intel.com; d...@stgolabs.net; j...@suse.cz; 
Matthew Wilcox ; khand...@linux.vnet.ibm.com; 
aneesh.ku...@linux.vnet.ibm.com; b...@kernel.crashing.org; m...@ellerman.id.au; 
pau...@samba.org; Thomas Gleixner ; Ingo Molnar 
; h...@zytor.com; Will Deacon ; Sergey 
Senozhatsky ; sergey.senozhatsky.w...@gmail.com; 
Andrea Arcangeli ; Alexei Starovoitov 
; Wang, Kemi ; Daniel Jordan 
; David Rientjes ; Jerome 
Glisse ; Ganesh Mahendran ; 
Minchan Kim ; Punit Agrawal ; 
vinayak menon ; Yang Shi ; 
linux-kernel@vger.kernel.org; linux...@kvack.org; ha...@linux.vnet.ibm.com; 
npig...@gmail.com; bsinghar...@gmail.com; paul...@linux.vnet.ibm.com; Tim Chen 
; linuxppc-...@lists.ozlabs.org; x...@kernel.org
Subject: Re: [PATCH v11 00/26] Speculative page faults

On 28/05/2018 10:22, Haiyan Song wrote:
> Hi Laurent,
> 
> Yes, these tests are done on V9 patch.

Do you plan to give this V11 a run ?

> 
> 
> Best regards,
> Haiyan Song
> 
> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
>> On 28/05/2018 07:23, Song, HaiyanX wrote:
>>>
>>> Some regressions and improvements were found by LKP-tools (Linux Kernel 
>>> Performance) on the V9 patch series, tested on an Intel 4-socket Skylake platform.
>>
>> Hi,
>>
>> Thanks for reporting these benchmark results, but you mentioned the 
>> "V9 patch series" while responding to the v11 header series...
>> Were these tests done on v9 or v11?
>>
>> Cheers,
>> Laurent.
>>
>>>
>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 
>>> patch series) Commit id:
>>> base commit: d55f34411b1b126429a823d06c3124c16283231f
>>> head commit: 0355322b3577eeab7669066df42c550a56801110
>>> Benchmark suite: will-it-scale
>>> Download link:
>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests
>>> Metrics:
>>> will-it-scale.per_process_ops=processes/nr_cpu
>>> will-it-scale.per_thread_ops=threads/nr_cpu
>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>> THP: enable / disable
>>> nr_task: 100%
>>>
>>> 1. Regressions:
>>> a) THP enabled:
>>> testcase                        base      change    head      metric
>>> page_fault3/ enable THP         10092     -17.5%    8323      will-it-scale.per_thread_ops
>>> page_fault2/ enable THP         8300      -17.2%    6869      will-it-scale.per_thread_ops
>>> brk1/ enable THP                957.67    -7.6%     885       will-it-scale.per_thread_ops
>>> page_fault3/ enable THP         172821    -5.3%     163692    will-it-scale.per_process_ops
>>> signal1/ enable THP             9125      -3.2%     8834      will-it-scale.per_process_ops
>>>
>>> b) THP disabled:
>>> testcase                        base      change    head      metric
>>> page_fault3/ disable THP        10107     -19.1%    8180      will-it-scale.per_thread_ops
>>> page_fault2/ disable THP        8432      -17.8%    6931      will-it-scale.per_thread_ops
>>> context_switch1/ disable THP    215389    -6.8%     200776    will-it-scale.per_thread_ops
>>> brk1/ disable THP               939.67    -6.6%     877.33    will-it-scale.per_thread_ops
>>> page_fault3/ disable THP        173145    -4.7%     165064    will-it-scale.per_process_ops
>>> signal1/ disable THP            9162      -3.9%     8802      will-it-scale.per_process_ops
>>>
>>> 2. Improvements:
>>> a) THP enabled:
>>> testcase                        base      change    head      metric
>>> malloc1/ enable THP             66.33     +469.8%   383.67    will-it-scale.per_thread_ops
>>> writeseek3/ enable THP          2531      +4.5%     2646      will-it-scale.per_thread_ops
>>> si

Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

2018-03-15 Thread kemi
Hi, Jeff
   Today I deleted the previous kernel images for commit 
3da90b159b146672f830bcd2489dd3a1f4e9e089
and commit c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e, respectively, and re-ran 
the same aim7 
jobs three times for each commit. The aim7 scores for the two commits do 
not show an obvious difference.

Perhaps something weird happened when compiling the kernel. Please ignore this 
report; apologies for the bother.


On 2018-02-25 23:41, Jeff Layton wrote:
> On Sun, 2018-02-25 at 23:05 +0800, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:
>>
>>
>> commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make 
>> inode_cmp_iversion{+raw} return bool instead of s64")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: aim7
>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 
>> 384G memory
>> with following parameters:
>>
>>  disk: 4BRD_12G
>>  md: RAID0
>>  fs: xfs
>>  test: disk_src
>>  load: 3000
>>  cpufreq_governor: performance
>>
>> test-description: AIM7 is a traditional UNIX system level benchmark suite 
>> which is used to test and measure the performance of multiuser system.
>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>>
>>
> 
> I'm a bit suspicious of this result.
> 
> This patch only changes inode_cmp_iversion{+raw} (since renamed to
> inode_eq_iversion{+raw}), and that neither should ever be called from
> xfs. The patch is fairly trivial too, and I wouldn't expect a big
> performance hit.
> 
> Is IMA involved here at all? I didn't see any evidence of it, but the
> kernel config did have it enabled.
> 
> 
>>
>> Details are as below:
>> -->
>>
>>
>> To reproduce:
>>
>> git clone https://github.com/intel/lkp-tests.git
>> cd lkp-tests
>> bin/lkp install job.yaml  # job file is attached in this email
>> bin/lkp run job.yaml
>>
>> =
>> compiler/cpufreq_governor/disk/fs/kconfig/load/md/rootfs/tbox_group/test/testcase:
>>   
>> gcc-7/performance/4BRD_12G/xfs/x86_64-rhel-7.2/3000/RAID0/debian-x86_64-2016-08-31.cgz/lkp-ivb-ep01/disk_src/aim7
>>
>> commit: 
>>   3da90b159b (" f2fs-for-4.16-rc1")
>>   c0cef30e4f ("iversion: make inode_cmp_iversion{+raw} return bool instead 
>> of s64")
>>
>> 3da90b159b146672 c0cef30e4ff0dc025f4a1660b8 
>>  -- 
>>  %stddev %change %stddev
>>  \  |\  
>>  40183   -18.0%  32964aim7.jobs-per-min
>> 448.60   +21.9% 546.68aim7.time.elapsed_time
>> 448.60   +21.9% 546.68aim7.time.elapsed_time.max
>>   5615 ±  5% +33.4%   7489 ±  4%  
>> aim7.time.involuntary_context_switches
>>   3086   +14.0%   3518aim7.time.system_time
>>   19439782-5.6%   18359474
>> aim7.time.voluntary_context_switches
>> 199333   +14.3% 227794 ±  2%  
>> interrupts.CAL:Function_call_interrupts
>>   0.59-0.10.50mpstat.cpu.usr%
>>2839401   +16.0%3293688softirqs.SCHED
>>7600068   +15.1%8747820softirqs.TIMER
>> 118.00 ± 43% +98.7% 234.50 ± 15%  vmstat.io.bo
>>  87840   -22.4%  68154vmstat.system.cs
>> 552798 ±  6% +15.8% 640107 ±  4%  numa-numastat.node0.local_node
>> 557345 ±  6% +15.7% 644666 ±  4%  numa-numastat.node0.numa_hit
>> 528341 ±  7% +21.7% 642933 ±  4%  numa-numastat.node1.local_node
>> 531604 ±  7% +21.6% 646209 ±  4%  numa-numastat.node1.numa_hit
>>  2.147e+09   -12.4%   1.88e+09cpuidle.C1.time
>>   13702041   -14.7%   11683737cpuidle.C1.usage
>>  2.082e+08 ±  4% +28.1%  2.667e+08 ±  5%  cpuidle.C1E.time
>>  4.719e+08 ±  2% +23.1%  5.807e+08 ±  4%  cpuidle.C3.time
>>  1.141e+10   +31.0%  1.496e+10cpuidle.C6.time
>>   15672622   +27.8%   20031028cpuidle.C6.usage
>>   13520572 ±  3% +29.5%   17514398 ±  9%  cpuidle.POLL.time
>> 278.25 ±  5% -46.0% 150.25 ± 73%  numa-vmstat.node0.nr_dirtied
>>   3200 ± 14% -20.6%   2542 ± 19%  numa-vmstat.node0.nr_mapped
>> 277.75 ±  5% -46.2% 149.50 ± 73%  numa-vmstat.node0.nr_written
>>  28.50 ± 52%+448.2% 156.25 ± 70%  numa-vmstat.node1.nr_dirtied
>>   2577 ± 19% +26.3%   3255 ± 15%  numa-vmstat.node1.nr_mapped
>> 634338 ±  4%  +7.8% 683959 ±  4%  numa-vmstat.node1.numa_hit
>> 457411 ±  6% +10.8% 506800 ±  5%  numa-vmstat.node1.numa_local
>>   3734 ±  8% -11.5%   3306 ±  6%  
>> 

Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

2018-03-01 Thread kemi


On 2018-02-28 01:04, Linus Torvalds wrote:
> On Tue, Feb 27, 2018 at 5:43 AM, David Howells  wrote:
>> Is it possible there's a stall between the load of RCX and the subsequent
>> instructions because they all have to wait for RCX to become available?
> 
> No. Modern Intel big-core CPU's simply aren't that fragile. All these
> instructions should do OoO fine for trivial sequences like this, and
> as far as I can tell, the new code sequence should be better.
> 
> And even if it were worse for some odd reason, it would be worse by a cycle.
> 
> This kind of 18% change is something else, it is definitely not about
> instruction scheduling.
> 
> Now, if the change to inode_cmp_iversion() causes some actual
> _behavioral_ changes, and we get more IO, that's more like it. But the
> code really does seem to be equivalent. In both cases it is simply
> comparing 63 bits: the high 63 bits of 0x150(%rbp) - inode->i_version
> - with the low 63 bits of 0x20(%rax) - iint->version.
> 
> The only issue would be if the high bit of 0x20(%rax) was somehow set.
> The new code doesn't shift that bit away an more, but it should never
> be set since it comes from
> 
> i_version = inode_query_iversion(inode);
> ...
> iint->version = i_version;
> 
> and that inode_query_iversion() will have done the version shift.
> 
>> The interleaving between operating on RSI and RCX in the older code might
>> alleviate that.
>>
>> In addition, the load if the 20(%rax) value is now done in the CMP 
>> instruction
>> rather than earlier, so it might not get speculatively loaded in time, 
>> whereas
>> the earlier code explicitly loads it up front.
> 
> No again, OoO cores will generally hide details like that.
> 
> You can see effects of it, but it's hard, and it can go both ways.
> 
> Anyway, I think the _real_ change has nothing to do with instruction
> scheduling, and everything to do with this:
> 
> 107.62 ± 37%+139.1% 257.38 ± 16%  vmstat.io.bo
>  48740 ± 36%+191.4% 142047 ± 16%  proc-vmstat.pgpgout
> 
> (There's fairly big variation in those numbers, but the changes are
> even bigger) or this:
> 
> 258.12  -100.0%   0.00turbostat.Avg_MHz
>  21.48   -21.50.00turbostat.Busy%
> 

This is caused by a limitation in the current turbostat parsing script of LKP. It 
treats a string that includes a wildcard character (e.g. 30.**) in the output of 
the turbostat
monitor as an error and sets all the stats values to 0.

The turbostat monitor itself ran successfully during these tests.

> or this:
> 
>  27397 ±194%  +43598.3%   11972338 ±139%
> latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
>  27942 ±189%  +96489.5%   26989044 ±139%
> latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
> 
> but those all sound like something changed in the setup, not in the kernel.
> 
> Odd.
> 
> Linus
> 
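
To make the comparison described above concrete, here is a rough before/after
sketch of the helper, reconstructed from the commit title and the description
in this thread; the shift constant, parameter names, and exact arithmetic are
assumptions and may not match include/linux/iversion.h:

/*
 * Sketch only: i_version keeps a "queried" flag in bit 0, so the real
 * counter lives in the upper 63 bits and is shifted down before use.
 */
#define I_VERSION_QUERIED_SHIFT	1

/* Old shape: a signed difference of the two 63-bit counters. */
static inline s64 inode_cmp_iversion_sketch(u64 raw_i_version, u64 queried)
{
	return (s64)(raw_i_version >> I_VERSION_QUERIED_SHIFT) - (s64)queried;
}

/* New shape (the commit under test): a plain equality test returning bool. */
static inline bool inode_eq_iversion_sketch(u64 raw_i_version, u64 queried)
{
	return (raw_i_version >> I_VERSION_QUERIED_SHIFT) == queried;
}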


Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

2018-02-26 Thread kemi


On 2018-02-26 20:33, Jeff Layton wrote:
> On Mon, 2018-02-26 at 06:43 -0500, Jeff Layton wrote:
>> On Mon, 2018-02-26 at 16:38 +0800, Ye Xiaolong wrote:
>>> On 02/25, Jeff Layton wrote:
 On Sun, 2018-02-25 at 23:05 +0800, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:
>
>
> commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make 
> inode_cmp_iversion{+raw} return bool instead of s64")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: aim7
> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz 
> with 384G memory
> with following parameters:
>
>   disk: 4BRD_12G
>   md: RAID0
>   fs: xfs
>   test: disk_src
>   load: 3000
>   cpufreq_governor: performance
>
> test-description: AIM7 is a traditional UNIX system level benchmark suite 
> which is used to test and measure the performance of multiuser system.
> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>
>

 I'm a bit suspicious of this result.

 This patch only changes inode_cmp_iversion{+raw} (since renamed to
 inode_eq_iversion{+raw}), and that neither should ever be called from
 xfs. The patch is fairly trivial too, and I wouldn't expect a big
 performance hit.
>>>
>>> I tried to queue 4 more test runs for both commit c0cef30e4f and its 
>>> parent;
>>> the results seem quite stable.
>>>
>>> c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e:
>>>  "aim7.jobs-per-min": [
>>> 32964.01,
>>> 32938.68,
>>> 33068.18,
>>> 32886.32,
>>> 32843.72,
>>> 32798.83,
>>> 32898.34,
>>> 32952.55
>>>   ],
>>>
>>> 3da90b159b146672f830bcd2489dd3a1f4e9e089:
>>>   "aim7.jobs-per-min": [
>>> 40239.65,
>>> 40163.33,
>>> 40353.32,
>>> 39976.9,
>>> 40185.75,
>>> 40411.3,
>>> 40213.58,
>>> 39900.69
>>>   ],
>>>
>>> Any other test data you may need?
>>>

 Is IMA involved here at all? I didn't see any evidence of it, but the
 kernel config did have it enabled.

>>>
>>> Sorry, not quite familiar with IMA, could you tell more about how to check 
>>> it?
>>>
>>
>> Thanks for retesting it, but I'm at a loss for why we're seeing this:
>>
>> IMA is the integrity management subsystem. It will use the iversion
>> field to determine whether to remeasure files during remeasurement.  It
>> looks like the kernel config has it enabled, but it doesn't look like
>> it's in use, based on the info in the initial report.
>>
>> This patch only affects two inlined functions inode_cmp_iversion and
>> inode_cmp_iversion_raw. The patch is pretty trivial (as Linus points
>> out). These functions are only called from IMA and fs-specific code
>> (usually in readdir implementations to detect directory changes).
>>
>> XFS does not call either of these functions however, so I'm a little
>> unclear on how this patch could slow anything down on this test. The
>> only thing I can think to do here would be to profile this and see what
>> stands out.
>>
>> Note that we do need to keep this in perspective too. This 18%
>> regression on this test follows a ~230% improvement that occurred
>> when we merged the bulk of these patches. It should still be quite a
>> bit faster than v4.15 in this regard.
>>
>> Still, it'd be good to understand what's going on here.
>>
>>
> 
> Could we see the dmesg from this boot? It'd be good to confirm that IMA
> is not involved here, as that's the only place that I can see that would
> call into this code at all here.
> 

See the attachment for info on dmesg/perf-profile/compare_result.
Feel free to let Xiaolong or me know if there is anything else you would like to check.

> Thanks,
> Jeff
> 
> 
>>> Thanks,
>>> Xiaolong

>
> Details are as below:
> -->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml  # job file is attached in this email
> bin/lkp run job.yaml
>
> =
> compiler/cpufreq_governor/disk/fs/kconfig/load/md/rootfs/tbox_group/test/testcase:
>   
> gcc-7/performance/4BRD_12G/xfs/x86_64-rhel-7.2/3000/RAID0/debian-x86_64-2016-08-31.cgz/lkp-ivb-ep01/disk_src/aim7
>
> commit: 
>   3da90b159b (" f2fs-for-4.16-rc1")
>   c0cef30e4f ("iversion: make inode_cmp_iversion{+raw} return bool 
> instead of s64")
>
> 3da90b159b146672 c0cef30e4ff0dc025f4a1660b8 
>  -- 
>  %stddev %change %stddev
>  \  |\  
>  40183   -18.0%  32964  


Re: [PATCH v2] buffer: Avoid setting buffer bits that are already set

2018-02-02 Thread kemi
Hi, Jens
  Could you help merge this patch into your tree? Thanks.

On 2017-11-03 10:29, kemi wrote:
> 
> 
> On 2017-10-24 09:16, Kemi Wang wrote:
>> It's expensive to set buffer flags that are already set, because that
>> causes a costly cache line transition.
>>
>> A common case is setting the "verified" flag during ext4 writes.
>> This patch checks for the flag being set first.
>>
>> With the AIM7/creat-clo benchmark running on an ext4 file system on a 48G
>> ramdisk, we see a 3.3% (15431->15936) improvement in aim7.jobs-per-min on
>> a 2-socket Broadwell platform.
>>
>> What the benchmark does is: it forks 3000 processes, and each process does
>> the following:
>> a) open a new file
>> b) close the file
>> c) delete the file
>> until loop=100*1000 times.
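
A minimal standalone sketch of that per-process loop (illustrative only, not
the actual AIM7 creat-clo source; the file-name pattern is made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char name[64];

	for (long i = 0; i < 100 * 1000; i++) {
		snprintf(name, sizeof(name), "tmp.%ld.%ld", (long)getpid(), i);
		int fd = open(name, O_CREAT | O_WRONLY, 0644); /* a) open a new file */
		if (fd >= 0)
			close(fd);                             /* b) close the file */
		unlink(name);                                  /* c) delete the file */
	}
	return 0;
}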
>>
>> The original patch is contributed by Andi Kleen.
>>
>> Signed-off-by: Andi Kleen <a...@linux.intel.com>
>> Signed-off-by: Kemi Wang <kemi.w...@intel.com>
>> Tested-by: Kemi Wang <kemi.w...@intel.com>
>> Reviewed-by: Jens Axboe <ax...@kernel.dk>
>> ---
> 
> Seems that this patch is still not merged. Anything wrong with that? thanks
> 
>>  include/linux/buffer_head.h | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
>> index c8dae55..211d8f5 100644
>> --- a/include/linux/buffer_head.h
>> +++ b/include/linux/buffer_head.h
>> @@ -80,11 +80,14 @@ struct buffer_head {
>>  /*
>>   * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()
>>   * and buffer_foo() functions.
>> + * To avoid reset buffer flags that are already set, because that causes
>> + * a costly cache line transition, check the flag first.
>>   */
>>  #define BUFFER_FNS(bit, name)   
>> \
>>  static __always_inline void set_buffer_##name(struct buffer_head *bh)   
>> \
>>  {   \
>> -set_bit(BH_##bit, &(bh)->b_state);  \
>> +if (!test_bit(BH_##bit, &(bh)->b_state))\
>> +set_bit(BH_##bit, &(bh)->b_state);  \
>>  }   \
>>  static __always_inline void clear_buffer_##name(struct buffer_head *bh) 
>> \
>>  {   \
>>
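
For reference, here is roughly what the patched BUFFER_FNS() macro expands to
for one flag, using "uptodate" as an example (hand-expanded for illustration,
not preprocessor output):

/* Hand expansion of BUFFER_FNS(Uptodate, uptodate) with the patch applied. */
static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
{
	/*
	 * test_bit() is a plain read: if the flag is already set we skip the
	 * atomic RMW in set_bit() and avoid pulling the cache line holding
	 * b_state into exclusive state on this CPU.
	 */
	if (!test_bit(BH_Uptodate, &bh->b_state))
		set_bit(BH_Uptodate, &bh->b_state);
}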


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread kemi


On 2017年12月22日 01:10, Christopher Lameter wrote:
> On Thu, 21 Dec 2017, kemi wrote:
> 
>> Some thinking about that:
>> a) the overhead due to cache bouncing caused by NUMA counter update in fast 
>> path
>> severely increase with more and more CPUs cores
>> b) AFAIK, the typical usage scenario (similar at least)for which this 
>> optimization can
>> benefit is 10/40G NIC used in high-speed data center network of cloud 
>> service providers.
> 
> I think you are fighting a lost battle there. As evident from the timing
> constraints on packet processing in a 10/40G you will have a hard time to
> process data if the packets are of regular ethernet size. And we alrady
> have 100G NICs in operation here.
> 

Not really.
For 10/40G NICs, or even 100G, I admit DPDK is widely used in data center
networks in production environments, rather than the kernel driver.
That's due to the slow page allocator and the long processing pipeline in the
network protocol stack.
That state is not easy to change in a short time, but if we can do something
here to change it a little, why not.

> We can try to get the performance as high as possible but full rate high
> speed networking invariable must use offload mechanisms and thus the
> statistics would only be available from the hardware devices that can do
> wire speed processing.
> 

I think you may be talking about SmartNICs (e.g. Open vSwitch offload + VF
pass-through). Those are usually used in virtualization environments to
eliminate the overhead of device emulation and of packet processing in the
software virtual switch (OVS or the Linux bridge).

What I have done in this patch series is improve page allocator performance;
that is also helpful in an offload environment (for the guest kernel at least), IMHO.


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread kemi


On 2017年12月21日 16:59, Michal Hocko wrote:
> On Thu 21-12-17 16:23:23, kemi wrote:
>>
>>
>> On 2017年12月21日 16:17, Michal Hocko wrote:
> [...]
>>> Can you see any difference with a more generic workload?
>>>
>>
>> I didn't see obvious improvement for will-it-scale.page_fault1
>> Two reasons for that:
>> 1) too long code path
>> 2) server zone lock and lru lock contention (access to buddy system 
>> frequently) 
> 
> OK. So does the patch helps for anything other than a microbenchmark?
> 
>>>> Some thinking about that:
>>>> a) the overhead due to cache bouncing caused by NUMA counter update in 
>>>> fast path 
>>>> severely increase with more and more CPUs cores
>>>
>>> What is an effect on a smaller system with fewer CPUs?
>>>
>>
>> Several CPU cycles can be saved using single thread for that.
>>
>>>> b) AFAIK, the typical usage scenario (similar at least)for which this 
>>>> optimization can 
>>>> benefit is 10/40G NIC used in high-speed data center network of cloud 
>>>> service providers.
>>>
>>> I would expect those would disable the numa accounting altogether.
>>>
>>
>> Yes, but it is still worthy to do some optimization, isn't?
> 
> Ohh, I am not opposing optimizations but you should make sure that they
> are worth the additional code and special casing. As I've said I am not
> convinced special casing numa counters is good. You can play with the
> threshold scaling for larger CPU count but let's make sure that the
> benefit is really measurable for normal workloads. Special ones will
> disable the numa accounting anyway.
> 

Understood. Could you suggest some normal workloads to use? Thanks.
I will give them a try and post the data ASAP.


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread kemi


On 2017年12月21日 16:17, Michal Hocko wrote:
> On Thu 21-12-17 16:06:50, kemi wrote:
>>
>>
>> On 2017年12月20日 18:12, Michal Hocko wrote:
>>> On Wed 20-12-17 13:52:14, kemi wrote:
>>>>
>>>>
>>>> On 2017年12月19日 20:40, Michal Hocko wrote:
>>>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>>>>>> We have seen significant overhead in cache bouncing caused by NUMA 
>>>>>> counters
>>>>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
>>>>>> update NUMA counter threshold size")' for more details.
>>>>>>
>>>>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and 
>>>>>> deals
>>>>>> with global counter update using different threshold size for node page
>>>>>> stats.
>>>>>
>>>>> Again, no numbers.
>>>>
>>>> Compare to vanilla kernel, I don't think it has performance improvement, so
>>>> I didn't post performance data here.
>>>> But, if you would like to see performance gain from enlarging threshold 
>>>> size
>>>> for NUMA stats (compare to the first patch), I will do that later. 
>>>
>>> Please do. I would also like to hear _why_ all counters cannot simply
>>> behave same. In other words why we cannot simply increase
>>> stat_threshold? Maybe calculate_normal_threshold needs a better scaling
>>> for larger machines.
>>>
>>
>> I will add this performance data to changelog in V3 patch series.
>>
>> Test machine: 2-sockets skylake platform (112 CPUs, 62G RAM)
>> Benchmark: page_bench03
>> Description: 112 threads do single page allocation/deallocation in parallel.
>>before   after
>>(enlarge threshold size)   
>> CPU cycles 722  379(-47.5%)
> 
> Please describe the numbers some more. Is this an average?

Yes

> What is the std? 

I increased the loop count to 10M, so the standard deviation is quite small (repeated 3 times)

> Can you see any difference with a more generic workload?
> 

I didn't see an obvious improvement for will-it-scale.page_fault1.
Two reasons for that:
1) the code path is too long
2) severe zone lock and lru lock contention (the buddy system is accessed frequently) 

>> Some thinking about that:
>> a) the overhead due to cache bouncing caused by NUMA counter update in fast 
>> path 
>> severely increase with more and more CPUs cores
> 
> What is an effect on a smaller system with fewer CPUs?
> 

Several CPU cycles can be saved using single thread for that.

>> b) AFAIK, the typical usage scenario (similar at least)for which this 
>> optimization can 
>> benefit is 10/40G NIC used in high-speed data center network of cloud 
>> service providers.
> 
> I would expect those would disable the numa accounting altogether.
> 

Yes, but it is still worthwhile to do some optimization, isn't it?


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread kemi


On 2017年12月20日 18:12, Michal Hocko wrote:
> On Wed 20-12-17 13:52:14, kemi wrote:
>>
>>
>> On 2017年12月19日 20:40, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>>>> We have seen significant overhead in cache bouncing caused by NUMA counters
>>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
>>>> update NUMA counter threshold size")' for more details.
>>>>
>>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
>>>> with global counter update using different threshold size for node page
>>>> stats.
>>>
>>> Again, no numbers.
>>
>> Compare to vanilla kernel, I don't think it has performance improvement, so
>> I didn't post performance data here.
>> But, if you would like to see performance gain from enlarging threshold size
>> for NUMA stats (compare to the first patch), I will do that later. 
> 
> Please do. I would also like to hear _why_ all counters cannot simply
> behave same. In other words why we cannot simply increase
> stat_threshold? Maybe calculate_normal_threshold needs a better scaling
> for larger machines.
> 

I will add this performance data to the changelog in the V3 patch series.

Test machine: 2-socket Skylake platform (112 CPUs, 62G RAM)
Benchmark: page_bench03
Description: 112 threads do single-page allocation/deallocation in parallel.

                            before      after (enlarged threshold size)
CPU cycles                  722         379 (-47.5%)
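
For context, the core of a page_bench03-style measurement is just a tight
single-page allocate/free loop per thread; the sketch below is a simplified
illustration of that loop (function name and structure are mine), not the
actual benchmark source:

#include <linux/gfp.h>

/* Simplified per-thread loop of a single-page alloc/free microbenchmark. */
static void alloc_free_loop(unsigned long loops)
{
	unsigned long i;

	for (i = 0; i < loops; i++) {
		struct page *page = alloc_page(GFP_KERNEL); /* updates NUMA_HIT/MISS */

		if (!page)
			break;
		__free_page(page);
	}
}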

Some thoughts on that:
a) the overhead due to cache bouncing caused by NUMA counter updates in the
fast path increases severely as the number of CPU cores grows;
b) AFAIK, a typical usage scenario (or at least a similar one) that this
optimization can benefit is 10/40G NICs used in the high-speed data center
networks of cloud service providers.


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-20 Thread kemi


On 2017年12月20日 23:58, Christopher Lameter wrote:
> On Wed, 20 Dec 2017, kemi wrote:
> 
>>> You are making numastats special and I yet haven't heard any sounds
>>> arguments for that. But that should be discussed in the respective
>>> patch.
>>>
>>
>> That is because we have much larger threshold size for NUMA counters, that 
>> means larger
>> deviation. So, the number in local cpus may not be simply ignored.
> 
> Some numbers showing the effect of these changes would be helpful. You can
> probably create some in kernel synthetic tests to start with which would
> allow you to see any significant effects of those changes.
> 
> Then run the larger testsuites (f.e. those that Mel has published) and
> benchmarks to figure out how behavior of real apps *may* change?
> 

OK.
I will do that when available.
Let's just drop this patch in this series and consider this issue
in another patch. 


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-20 Thread kemi


On 2017年12月20日 18:06, Michal Hocko wrote:
> On Wed 20-12-17 14:07:35, kemi wrote:
>>
>>
>> On 2017年12月19日 20:43, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:25, Kemi Wang wrote:
>>>> To avoid deviation, this patch uses node_page_state_snapshot instead of
>>>> node_page_state for node page stats query.
>>>> e.g. cat /proc/zoneinfo
>>>>  cat /sys/devices/system/node/node*/vmstat
>>>>  cat /sys/devices/system/node/node*/numastat
>>>>
>>>> As it is a slow path and would not be read frequently, I would worry about
>>>> it.
>>>
>>> The changelog doesn't explain why these counters needs any special
>>> treatment. _snapshot variants where used only for internal handling
>>> where the precision really mattered. We do not have any in-tree user and
>>> Jack has removed this by 
>>> http://lkml.kernel.org/r/20171122094416.26019-1-j...@suse.cz
>>> which is already sitting in the mmotm tree. We can re-add it but that
>>> would really require a _very good_ reason.
>>>
>>
>> Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum 
>> deviation is nr*t.
>> Currently, Skylake platform has hundreds of CPUs numbers and the number is 
>> still 
>> increasing. Also, even the threshold size is kept to 125 at maximum (32765 
>> for NUMA counters now), the deviation is just a little too big as I have 
>> mentioned in 
>> the log. I tend to sum the number in local cpus up when query the global 
>> stats.
> 
> This is a general problem of pcp accounting. So if it needs to be
> addressed then do it the same way for all stats.
> 
>> Also, node_page_state_snapshot is only called in slow path and I don't think 
>> that
>> would be a big problem. 
>>
>> Anyway, it is a matter of taste. I just think it's better to have.
> 
> You are making numastats special and I yet haven't heard any sounds
> arguments for that. But that should be discussed in the respective
> patch.
> 

That is because we have much larger threshold size for NUMA counters, that 
means larger 
deviation. So, the number in local cpus may not be simply ignored.
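
For a concrete sense of scale (using numbers quoted elsewhere in this thread,
not new measurements): with nr = 112 CPUs and t = 32765 for the NUMA counters,
the worst-case drift of a global counter is nr * t = 112 * 32765, roughly 3.67
million events. Since each NUMA event here corresponds to one allocated page,
that is on the order of 14 GiB worth of 4 KiB pages not yet reflected in the
global value until the per-cpu diffs are folded back.
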

>>>> Signed-off-by: Kemi Wang <kemi.w...@intel.com>
>>>> ---
>>>>  drivers/base/node.c | 17 ++---
>>>>  mm/vmstat.c |  2 +-
>>>>  2 files changed, 11 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index a045ea1..cf303f8 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>>>>   "interleave_hit %lu\n"
>>>>   "local_node %lu\n"
>>>>   "other_node %lu\n",
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_HIT),
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_MISS),
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
>>>> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
>>>> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
>>>> + node_page_state_snapshot(NODE_DATA(dev->id),
>>>> + NUMA_FOREIGN),
>>>> + node_page_state_snapshot(NODE_DATA(dev->id),
>>>> + NUMA_INTERLEAVE_HIT),
>>>> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
>>>> + node_page_state_snapshot(NODE_DATA(dev->id),
>>>> + NUMA_OTHER));
>>>>  }
>>>>  
>>>>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>>>> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>>>>for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>>>>n += sprintf(buf+n, "%s %lu\n",
>>>> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>>> -   node_page_state(pgdat, i));
>>>> +   node_page_state_snapshot(pgdat, i));
>>>>  
>>>>return n;
>>>>  }
>>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>>> index 64e08ae..d65f28d 100644
>>>> --- a/mm/vmstat.c
>>>> +++ b/mm/vmstat.c
>>>> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
>>>> pg_data_t *pgdat,
>>>>for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>>>>seq_printf(m, "\n  %-12s %lu",
>>>>vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>>> -  node_page_state(pgdat, i));
>>>> +  node_page_state_snapshot(pgdat, i));
>>>>}
>>>>}
>>>>seq_printf(m,
>>>> -- 
>>>> 2.7.4
>>>>
>>>
> 


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-20 Thread kemi


On 2017年12月20日 18:12, Michal Hocko wrote:
> On Wed 20-12-17 13:52:14, kemi wrote:
>>
>>
>> On 2017年12月19日 20:40, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>>>> We have seen significant overhead in cache bouncing caused by NUMA counters
>>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
>>>> update NUMA counter threshold size")' for more details.
>>>>
>>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
>>>> with global counter update using different threshold size for node page
>>>> stats.
>>>
>>> Again, no numbers.
>>
>> Compare to vanilla kernel, I don't think it has performance improvement, so
>> I didn't post performance data here.
>> But, if you would like to see performance gain from enlarging threshold size
>> for NUMA stats (compare to the first patch), I will do that later. 
> 
> Please do. I would also like to hear _why_ all counters cannot simply
> behave same. In other words why we cannot simply increase
> stat_threshold? Maybe calculate_normal_threshold needs a better scaling
> for larger machines.
> 

Agreed, we may consider that.
But unlike the NUMA counters, which do not affect any system decisions, we need
to think very carefully before increasing stat_threshold for all the counters
on larger machines. BTW, that is another topic, which we may discuss in a
different thread.


Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

2017-12-19 Thread kemi


On 2017年12月20日 01:21, Christopher Lameter wrote:
> On Tue, 19 Dec 2017, Michal Hocko wrote:
> 
>>> Well the reason for s8 was to keep the data structures small so that they
>>> fit in the higher level cpu caches. The large these structures become the
>>> more cachelines are used by the counters and the larger the performance
>>> influence on the code that should not be impacted by the overhead.
>>
>> I am not sure I understand. We usually do not access more counters in
>> the single code path (well, PGALLOC and NUMA counteres is more of an
>> exception). So it is rarely an advantage that the whole array is in the
>> same cache line. Besides that this is allocated by the percpu allocator
>> aligns to the type size rather than cache lines AFAICS.
> 
> I thought we are talking about NUMA counters here?
> 
> Regardless: A typical fault, system call or OS action will access multiple
> zone and node counters when allocating or freeing memory. Enlarging the
> fields will increase the number of cachelines touched.
> 

Yes, theoretically we add one more cache line to the access footprint.
But I don't think it would be a problem.
1) Not all the counters need to be accessed in the fast path of page
allocation; the counters covered by a single cache line are usually enough for
that, so we probably don't need to touch one more cache line. I tend to agree
with Michal's argument.
Besides, in slow paths where the code is protected by the zone lock or lru
lock, accessing one more cache line would not be a big problem, since many
other cache lines are accessed there anyway.

2) Enlarging vm_node_stat_diff from s8 to s16 gives us an opportunity to keep
larger numbers in the local cpus, which makes it possible to reduce the global
counter update frequency. Thus, we gain the benefit of less expensive cache
bouncing.

Well, if you still have some concerns, I can post some data for 
will-it-scale.page_fault1.
What the benchmark does is: it forks nr_cpu processes and then each
process does the following:
1 mmap() 128M anonymous space;
2 writes to each page there to trigger actual page allocation;
3 munmap() it.
in a loop.
https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
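
A minimal userspace sketch of that per-process loop, written from the
description above rather than copied from the will-it-scale source (names are
illustrative):

#include <stddef.h>
#include <sys/mman.h>

#define MEMSIZE (128UL * 1024 * 1024)	/* 128M anonymous mapping */

static void page_fault_loop(void)
{
	for (;;) {
		char *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		unsigned long off;

		if (p == MAP_FAILED)
			break;
		for (off = 0; off < MEMSIZE; off += 4096)
			p[off] = 1;	/* touch each page to force allocation */
		munmap(p, MEMSIZE);
	}
}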

Or you can provide some other benchmarks on which you want to see performance 
impact.

>> Maybe it used to be all different back then when the code has been added
>> but arguing about cache lines seems to be a bit problematic here. Maybe
>> you have some specific workloads which can prove me wrong?
> 
> Run a workload that does some page faults? Heavy allocation and freeing of
> memory?
> 
> Maybe that is no longer relevant since the number of the counters is
> large that the accesses are so sparse that each action pulls in a whole
> cacheline. That would be something we tried to avoid when implementing
> the differentials.
> 
> 


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-19 Thread kemi


On 2017年12月19日 20:43, Michal Hocko wrote:
> On Tue 19-12-17 14:39:25, Kemi Wang wrote:
>> To avoid deviation, this patch uses node_page_state_snapshot instead of
>> node_page_state for node page stats query.
>> e.g. cat /proc/zoneinfo
>>  cat /sys/devices/system/node/node*/vmstat
>>  cat /sys/devices/system/node/node*/numastat
>>
>> As it is a slow path and would not be read frequently, I would worry about
>> it.
> 
> The changelog doesn't explain why these counters needs any special
> treatment. _snapshot variants where used only for internal handling
> where the precision really mattered. We do not have any in-tree user and
> Jack has removed this by 
> http://lkml.kernel.org/r/20171122094416.26019-1-j...@suse.cz
> which is already sitting in the mmotm tree. We can re-add it but that
> would really require a _very good_ reason.
> 

Assume we have *nr* CPUs and a threshold size of *t*; the maximum deviation is
then nr*t.
Current Skylake platforms already have hundreds of CPUs, and the number keeps
increasing. Even though the threshold size is capped at 125 for the regular
stats (and is 32765 for the NUMA counters now), the deviation is just a little
too big, as I mentioned in the changelog. I tend to sum up the numbers in the
local cpus when the global stats are queried.

Also, node_page_state_snapshot is only called in slow paths, so I don't think
that would be a big problem.

Anyway, it is a matter of taste. I just think it's better to have.
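
To illustrate what summing up the local cpus means here, a schematic of a
snapshot-style read, simplified from the pattern mm/vmstat.c uses rather than
the exact kernel code:

#include <linux/mmzone.h>

/* Schematic: fold the per-cpu deltas into the reader's view of the counter. */
static unsigned long node_stat_snapshot(struct pglist_data *pgdat,
					enum node_stat_item item)
{
	long x = atomic_long_read(&pgdat->vm_stat[item]);
	int cpu;

	for_each_online_cpu(cpu)
		x += per_cpu_ptr(pgdat->per_cpu_nodestats,
				 cpu)->vm_node_stat_diff[item];

	return x < 0 ? 0 : x;
}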

>> Signed-off-by: Kemi Wang <kemi.w...@intel.com>
>> ---
>>  drivers/base/node.c | 17 ++---
>>  mm/vmstat.c |  2 +-
>>  2 files changed, 11 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index a045ea1..cf303f8 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>> "interleave_hit %lu\n"
>> "local_node %lu\n"
>> "other_node %lu\n",
>> -   node_page_state(NODE_DATA(dev->id), NUMA_HIT),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_MISS),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
>> +   node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
>> +   node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
>> +   node_page_state_snapshot(NODE_DATA(dev->id),
>> +   NUMA_FOREIGN),
>> +   node_page_state_snapshot(NODE_DATA(dev->id),
>> +   NUMA_INTERLEAVE_HIT),
>> +   node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
>> +   node_page_state_snapshot(NODE_DATA(dev->id),
>> +   NUMA_OTHER));
>>  }
>>  
>>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>>  n += sprintf(buf+n, "%s %lu\n",
>>   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> - node_page_state(pgdat, i));
>> + node_page_state_snapshot(pgdat, i));
>>  
>>  return n;
>>  }
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 64e08ae..d65f28d 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
>> pg_data_t *pgdat,
>>  for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>>  seq_printf(m, "\n  %-12s %lu",
>>  vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> -node_page_state(pgdat, i));
>> +node_page_state_snapshot(pgdat, i));
>>  }
>>  }
>>  seq_printf(m,
>> -- 
>> 2.7.4
>>
> 


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-19 Thread kemi


On 2017年12月19日 20:40, Michal Hocko wrote:
> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>> We have seen significant overhead in cache bouncing caused by NUMA counters
>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
>> update NUMA counter threshold size")' for more details.
>>
>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
>> with global counter update using different threshold size for node page
>> stats.
> 
> Again, no numbers.

Compared to the vanilla kernel, I don't think it brings a performance
improvement, so I didn't post performance data here.
But if you would like to see the performance gain from enlarging the threshold
size for NUMA stats (compared to the first patch), I will do that later.

> To be honest I do not really like the special casing
> here. Why are numa counters any different from PGALLOC which is
> incremented for _every_ single page allocation?
> 

I guess you mean the PGALLOC event.
The count for that event is kept per cpu and summed up (for_each_online_cpu)
when needed. That is similar to the approach I used for the NUMA stats in the
V1 patch series. Good enough.
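
For comparison, the event-style accounting referred to above keeps a plain
free-running per-cpu count and only sums it on the (rare) read side. A
schematic, with a made-up counter name rather than the real mm/vmstat.c code:

#include <linux/percpu.h>

/* Free-running per-cpu count; summed over CPUs only when read. */
DEFINE_PER_CPU(unsigned long, pgalloc_events);	/* name is illustrative */

static inline void count_pgalloc(void)
{
	raw_cpu_inc(pgalloc_events);	/* no shared cache line touched */
}

static unsigned long sum_pgalloc(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_online_cpu(cpu)
		sum += per_cpu(pgalloc_events, cpu);
	return sum;
}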

>> ---
>>  mm/vmstat.c | 13 +++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 9c681cc..64e08ae 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>  
>>  #include "internal.h"
>>  
>> +#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
>> +
>>  #ifdef CONFIG_NUMA
>>  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>>  
>> @@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum 
>> node_stat_item item)
>>  s16 v, t;
>>  
>>  v = __this_cpu_inc_return(*p);
>> -t = __this_cpu_read(pcp->stat_threshold);
>> +if (item >= NR_VM_NUMA_STAT_ITEMS)
>> +t = __this_cpu_read(pcp->stat_threshold);
>> +else
>> +t = VM_NUMA_STAT_THRESHOLD;
>> +
>>  if (unlikely(v > t)) {
>>  s16 overstep = t >> 1;
>>  
>> @@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data 
>> *pgdat,
>>   * Most of the time the thresholds are the same anyways
>>   * for all cpus in a node.
>>   */
>> -t = this_cpu_read(pcp->stat_threshold);
>> +if (item >= NR_VM_NUMA_STAT_ITEMS)
>> +t = this_cpu_read(pcp->stat_threshold);
>> +else
>> +t = VM_NUMA_STAT_THRESHOLD;
>>  
>>  o = this_cpu_read(*p);
>>  n = delta + o;
>> -- 
>> 2.7.4
>>
> 
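
The mechanism both branches above share: each CPU accumulates into its local
diff and only spills into the global atomic once the diff crosses the
threshold. A schematic (simplified, not the literal kernel code; the patch
only changes which threshold is picked for the NUMA items):

#include <linux/mmzone.h>
#include <linux/percpu.h>

static void inc_node_stat(struct pglist_data *pgdat, enum node_stat_item item,
			  s16 __percpu *diff, s16 threshold)
{
	s16 v = __this_cpu_inc_return(*diff);

	if (unlikely(v > threshold)) {
		s16 overstep = threshold >> 1;

		/* one atomic add per roughly 'threshold' local increments */
		atomic_long_add(v + overstep, &pgdat->vm_stat[item]);
		__this_cpu_write(*diff, -overstep);
	}
}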


Re: [PATCH v2 1/5] mm: migrate NUMA stats from per-zone to per-node

2017-12-19 Thread kemi


On 2017年12月19日 20:28, Michal Hocko wrote:
> On Tue 19-12-17 14:39:22, Kemi Wang wrote:
>> There is not really any use to get NUMA stats separated by zone, and
>> current per-zone NUMA stats is only consumed in /proc/zoneinfo. For code
>> cleanup purpose, we move NUMA stats from per-zone to per-node and reuse the
>> existed per-cpu infrastructure.
> 
> Let's hope that nobody really depends on the per-zone numbers. It would
> be really strange as those counters are inherently per-node and that is
> what users should care about but who knows...
> 
> Anyway, I hoped we could get rid of NR_VM_NUMA_STAT_ITEMS but your patch
> keeps it and follow up patches even use it further. I will comment on
> those separately but this still makes these few counters really special
> which I think is wrong.
> 

Well, that's the best balance between performance and simplification I could
come up with. If you have a better idea, please post it and I will certainly
follow it.
 
>> Suggested-by: Andi Kleen <a...@linux.intel.com>
>> Suggested-by: Michal Hocko <mho...@kernel.com>
>> Signed-off-by: Kemi Wang <kemi.w...@intel.com>
> 
> I have to fully grasp the rest of the series before I'll give my Ack,
> but I _really_ like the simplification this adds to the code. I believe
> it can be even simpler.
> 
>> ---
>>  drivers/base/node.c|  23 +++
>>  include/linux/mmzone.h |  27 
>>  include/linux/vmstat.h |  31 -
>>  mm/mempolicy.c |   2 +-
>>  mm/page_alloc.c|  16 +++--
>>  mm/vmstat.c| 177 
>> +
>>  6 files changed, 46 insertions(+), 230 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index ee090ab..a045ea1 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -169,13 +169,14 @@ static ssize_t node_read_numastat(struct device *dev,
>> "interleave_hit %lu\n"
>> "local_node %lu\n"
>> "other_node %lu\n",
>> -   sum_zone_numa_state(dev->id, NUMA_HIT),
>> -   sum_zone_numa_state(dev->id, NUMA_MISS),
>> -   sum_zone_numa_state(dev->id, NUMA_FOREIGN),
>> -   sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
>> -   sum_zone_numa_state(dev->id, NUMA_LOCAL),
>> -   sum_zone_numa_state(dev->id, NUMA_OTHER));
>> +   node_page_state(NODE_DATA(dev->id), NUMA_HIT),
>> +   node_page_state(NODE_DATA(dev->id), NUMA_MISS),
>> +   node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
>> +   node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
>> +   node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
>> +   node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
>>  }
>> +
>>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>>  
>>  static ssize_t node_read_vmstat(struct device *dev,
>> @@ -190,17 +191,9 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
>>   sum_zone_node_page_state(nid, i));
>>  
>> -#ifdef CONFIG_NUMA
>> -for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
>> -n += sprintf(buf+n, "%s %lu\n",
>> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> - sum_zone_numa_state(nid, i));
>> -#endif
>> -
>>  for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>>  n += sprintf(buf+n, "%s %lu\n",
>> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
>> - NR_VM_NUMA_STAT_ITEMS],
>> + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>   node_page_state(pgdat, i));
>>  
>>  return n;
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 67f2e3c..c06d880 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -115,20 +115,6 @@ struct zone_padding {
>>  #define ZONE_PADDING(name)
>>  #endif
>>  
>> -#ifdef CONFIG_NUMA
>> -enum numa_stat_item {
>> -NUMA_HIT,   /* allocated in intended node */
>> -NUMA_MISS,  /* allocated in non intended node */
>> -NUMA_FOREIGN,   /* was intended here, hit elsewhere */
>> -NUMA_INTERLEAVE_HIT, 


Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

2017-12-19 Thread kemi


On 2017-12-19 20:38, Michal Hocko wrote:
> On Tue 19-12-17 14:39:23, Kemi Wang wrote:
>> The type s8 used for vm_diff_nodestat[] as local cpu counters has the
>> limitation of global counters update frequency, especially for those
>> monotone increasing type of counters like NUMA counters with more and more
>> cpus/nodes. This patch extends the type of vm_diff_nodestat from s8 to s16
>> without any functionality change.
>>
>>  before after
>> sizeof(struct per_cpu_nodestat)28 68
> 
> So it is 40B * num_cpus * num_nodes. Nothing really catastrophic IMHO
> but the changelog is a bit silent about any numbers. This is a
> performance optimization so it should better give us some.
>  

This patch does not have any functional change, so there is no performance
gain from it by itself, I suppose.
I guess you are talking about the performance gain from the third patch,
which increases the threshold size of the NUMA counters.

>> Signed-off-by: Kemi Wang <kemi.w...@intel.com>
>> ---
>>  include/linux/mmzone.h |  4 ++--
>>  mm/vmstat.c| 16 
>>  2 files changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index c06d880..2da6b6f 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -289,8 +289,8 @@ struct per_cpu_pageset {
>>  };
>>  
>>  struct per_cpu_nodestat {
>> -s8 stat_threshold;
>> -s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
>> +s16 stat_threshold;
>> +s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
>>  };
>>  
>>  #endif /* !__GENERATING_BOUNDS.H */
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 1dd12ae..9c681cc 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -332,7 +332,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, 
>> enum node_stat_item item,
>>  long delta)
>>  {
>>  struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> -s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> +s16 __percpu *p = pcp->vm_node_stat_diff + item;
>>  long x;
>>  long t;
>>  
>> @@ -390,13 +390,13 @@ void __inc_zone_state(struct zone *zone, enum 
>> zone_stat_item item)
>>  void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>>  {
>>  struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> -s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> -s8 v, t;
>> +s16 __percpu *p = pcp->vm_node_stat_diff + item;
>> +s16 v, t;
>>  
>>  v = __this_cpu_inc_return(*p);
>>  t = __this_cpu_read(pcp->stat_threshold);
>>  if (unlikely(v > t)) {
>> -s8 overstep = t >> 1;
>> +s16 overstep = t >> 1;
>>  
>>  node_page_state_add(v + overstep, pgdat, item);
>>  __this_cpu_write(*p, -overstep);
>> @@ -434,13 +434,13 @@ void __dec_zone_state(struct zone *zone, enum 
>> zone_stat_item item)
>>  void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>>  {
>>  struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> -s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> -s8 v, t;
>> +s16 __percpu *p = pcp->vm_node_stat_diff + item;
>> +s16 v, t;
>>  
>>  v = __this_cpu_dec_return(*p);
>>  t = __this_cpu_read(pcp->stat_threshold);
>>  if (unlikely(v < - t)) {
>> -s8 overstep = t >> 1;
>> +s16 overstep = t >> 1;
>>  
>>  node_page_state_add(v - overstep, pgdat, item);
>>  __this_cpu_write(*p, overstep);
>> @@ -533,7 +533,7 @@ static inline void mod_node_state(struct pglist_data 
>> *pgdat,
>> enum node_stat_item item, int delta, int overstep_mode)
>>  {
>>  struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> -s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> +s16 __percpu *p = pcp->vm_node_stat_diff + item;
>>  long o, n, t, z;
>>  
>>  do {
>> -- 
>> 2.7.4
>>
> 



[PATCH v2 1/5] mm: migrate NUMA stats from per-zone to per-node

2017-12-18 Thread Kemi Wang
There is not really any use for NUMA stats separated by zone, and the
current per-zone NUMA stats are only consumed in /proc/zoneinfo. For code
cleanup purposes, we move the NUMA stats from per-zone to per-node and reuse
the existing per-cpu infrastructure.

Suggested-by: Andi Kleen <a...@linux.intel.com>
Suggested-by: Michal Hocko <mho...@kernel.com>
Signed-off-by: Kemi Wang <kemi.w...@intel.com>
---
 drivers/base/node.c|  23 +++
 include/linux/mmzone.h |  27 
 include/linux/vmstat.h |  31 -
 mm/mempolicy.c |   2 +-
 mm/page_alloc.c|  16 +++--
 mm/vmstat.c| 177 +
 6 files changed, 46 insertions(+), 230 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ee090ab..a045ea1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -169,13 +169,14 @@ static ssize_t node_read_numastat(struct device *dev,
   "interleave_hit %lu\n"
   "local_node %lu\n"
   "other_node %lu\n",
-  sum_zone_numa_state(dev->id, NUMA_HIT),
-  sum_zone_numa_state(dev->id, NUMA_MISS),
-  sum_zone_numa_state(dev->id, NUMA_FOREIGN),
-  sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
-  sum_zone_numa_state(dev->id, NUMA_LOCAL),
-  sum_zone_numa_state(dev->id, NUMA_OTHER));
+  node_page_state(NODE_DATA(dev->id), NUMA_HIT),
+  node_page_state(NODE_DATA(dev->id), NUMA_MISS),
+  node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
+  node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
+  node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
+  node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
 }
+
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
 static ssize_t node_read_vmstat(struct device *dev,
@@ -190,17 +191,9 @@ static ssize_t node_read_vmstat(struct device *dev,
n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
 sum_zone_node_page_state(nid, i));
 
-#ifdef CONFIG_NUMA
-   for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-   n += sprintf(buf+n, "%s %lu\n",
-vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-sum_zone_numa_state(nid, i));
-#endif
-
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n",
-vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
-NR_VM_NUMA_STAT_ITEMS],
+vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
 node_page_state(pgdat, i));
 
return n;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c..c06d880 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,20 +115,6 @@ struct zone_padding {
 #define ZONE_PADDING(name)
 #endif
 
-#ifdef CONFIG_NUMA
-enum numa_stat_item {
-   NUMA_HIT,   /* allocated in intended node */
-   NUMA_MISS,  /* allocated in non intended node */
-   NUMA_FOREIGN,   /* was intended here, hit elsewhere */
-   NUMA_INTERLEAVE_HIT,/* interleaver preferred this zone */
-   NUMA_LOCAL, /* allocation from local node */
-   NUMA_OTHER, /* allocation from other node */
-   NR_VM_NUMA_STAT_ITEMS
-};
-#else
-#define NR_VM_NUMA_STAT_ITEMS 0
-#endif
-
 enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
@@ -151,7 +137,18 @@ enum zone_stat_item {
NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
-   NR_LRU_BASE,
+#ifdef CONFIG_NUMA
+   NUMA_HIT,   /* allocated in intended node */
+   NUMA_MISS,  /* allocated in non intended node */
+   NUMA_FOREIGN,   /* was intended here, hit elsewhere */
+   NUMA_INTERLEAVE_HIT,/* interleaver preferred this zone */
+   NUMA_LOCAL, /* allocation from local node */
+   NUMA_OTHER, /* allocation from other node */
+   NR_VM_NUMA_STAT_ITEMS,
+#else
+#defineNR_VM_NUMA_STAT_ITEMS 0
+#endif
+   NR_LRU_BASE = NR_VM_NUMA_STAT_ITEMS,
NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
NR_ACTIVE_ANON, /*  " " "   "   " */
NR_INACTIVE_FILE,   /*  " " "   "   " */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1779c98..80bf290 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -118,37 +118,8 @@ static inline vo


[PATCH v2 5/5] mm: Rename zone_statistics() to numa_statistics()

2017-12-18 Thread Kemi Wang
zone_statistics() now only updates NUMA counters, and the NUMA statistics
have been separated from the zone statistics framework, so the function name
is confusing. Change the name to numa_statistics() and update its call sites
accordingly.

Signed-off-by: Kemi Wang <kemi.w...@intel.com>
---
 mm/page_alloc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81e8d8f..f7583de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2790,7 +2790,7 @@ int __isolate_free_page(struct page *page, unsigned int 
order)
  *
  * Must be called with interrupts disabled.
  */
-static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
+static inline void numa_statistics(struct zone *preferred_zone, struct zone *z)
 {
 #ifdef CONFIG_NUMA
int preferred_nid = preferred_zone->node;
@@ -2854,7 +2854,7 @@ static struct page *rmqueue_pcplist(struct zone 
*preferred_zone,
page = __rmqueue_pcplist(zone,  migratetype, pcp, list);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-   zone_statistics(preferred_zone, zone);
+   numa_statistics(preferred_zone, zone);
}
local_irq_restore(flags);
return page;
@@ -2902,7 +2902,7 @@ struct page *rmqueue(struct zone *preferred_zone,
  get_pcppage_migratetype(page));
 
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-   zone_statistics(preferred_zone, zone);
+   numa_statistics(preferred_zone, zone);
local_irq_restore(flags);
 
 out:
-- 
2.7.4



[PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-18 Thread Kemi Wang
To avoid deviation, this patch uses node_page_state_snapshot instead of
node_page_state for node page stats queries, e.g.:
 cat /proc/zoneinfo
 cat /sys/devices/system/node/node*/vmstat
 cat /sys/devices/system/node/node*/numastat

As this is a slow path and would not be read frequently, I would not worry
about it.

Signed-off-by: Kemi Wang <kemi.w...@intel.com>
---
 drivers/base/node.c | 17 ++---
 mm/vmstat.c |  2 +-
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index a045ea1..cf303f8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
   "interleave_hit %lu\n"
   "local_node %lu\n"
   "other_node %lu\n",
-  node_page_state(NODE_DATA(dev->id), NUMA_HIT),
-  node_page_state(NODE_DATA(dev->id), NUMA_MISS),
-  node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
-  node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
-  node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
-  node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
+  node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
+  node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
+  node_page_state_snapshot(NODE_DATA(dev->id),
+  NUMA_FOREIGN),
+  node_page_state_snapshot(NODE_DATA(dev->id),
+  NUMA_INTERLEAVE_HIT),
+  node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
+  node_page_state_snapshot(NODE_DATA(dev->id),
+  NUMA_OTHER));
 }
 
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
@@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n",
 vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-node_page_state(pgdat, i));
+node_page_state_snapshot(pgdat, i));
 
return n;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 64e08ae..d65f28d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
pg_data_t *pgdat,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
seq_printf(m, "\n  %-12s %lu",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-   node_page_state(pgdat, i));
+   node_page_state_snapshot(pgdat, i));
}
}
seq_printf(m,
-- 
2.7.4



[PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

2017-12-18 Thread Kemi Wang
The s8 type used for the local per-CPU counters in vm_node_stat_diff[]
limits how far the global counter update frequency can be reduced,
especially for monotonically increasing counters like the NUMA counters as
the number of CPUs/nodes grows. This patch extends the type of
vm_node_stat_diff from s8 to s16 without any functional change.

                                 before    after
sizeof(struct per_cpu_nodestat)  28        68

Signed-off-by: Kemi Wang <kemi.w...@intel.com>
---
 include/linux/mmzone.h |  4 ++--
 mm/vmstat.c| 16 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c06d880..2da6b6f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -289,8 +289,8 @@ struct per_cpu_pageset {
 };
 
 struct per_cpu_nodestat {
-   s8 stat_threshold;
-   s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+   s16 stat_threshold;
+   s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1dd12ae..9c681cc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -332,7 +332,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum 
node_stat_item item,
long delta)
 {
struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-   s8 __percpu *p = pcp->vm_node_stat_diff + item;
+   s16 __percpu *p = pcp->vm_node_stat_diff + item;
long x;
long t;
 
@@ -390,13 +390,13 @@ void __inc_zone_state(struct zone *zone, enum 
zone_stat_item item)
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-   s8 __percpu *p = pcp->vm_node_stat_diff + item;
-   s8 v, t;
+   s16 __percpu *p = pcp->vm_node_stat_diff + item;
+   s16 v, t;
 
v = __this_cpu_inc_return(*p);
t = __this_cpu_read(pcp->stat_threshold);
if (unlikely(v > t)) {
-   s8 overstep = t >> 1;
+   s16 overstep = t >> 1;
 
node_page_state_add(v + overstep, pgdat, item);
__this_cpu_write(*p, -overstep);
@@ -434,13 +434,13 @@ void __dec_zone_state(struct zone *zone, enum 
zone_stat_item item)
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-   s8 __percpu *p = pcp->vm_node_stat_diff + item;
-   s8 v, t;
+   s16 __percpu *p = pcp->vm_node_stat_diff + item;
+   s16 v, t;
 
v = __this_cpu_dec_return(*p);
t = __this_cpu_read(pcp->stat_threshold);
if (unlikely(v < - t)) {
-   s8 overstep = t >> 1;
+   s16 overstep = t >> 1;
 
node_page_state_add(v - overstep, pgdat, item);
__this_cpu_write(*p, overstep);
@@ -533,7 +533,7 @@ static inline void mod_node_state(struct pglist_data *pgdat,
enum node_stat_item item, int delta, int overstep_mode)
 {
struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-   s8 __percpu *p = pcp->vm_node_stat_diff + item;
+   s16 __percpu *p = pcp->vm_node_stat_diff + item;
long o, n, t, z;
 
do {
-- 
2.7.4



[PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-18 Thread Kemi Wang
We have seen significant overhead from cache bouncing caused by NUMA
counter updates in multi-threaded page allocation. See commit 1d90ca897cb0
("mm: update NUMA counter threshold size") for more details.

This patch gives the NUMA counters a fixed threshold of (S16_MAX - 2) and
handles the global counter update with a different threshold size than the
other node page stats.

Signed-off-by: Kemi Wang <kemi.w...@intel.com>
---
 mm/vmstat.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9c681cc..64e08ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,6 +30,8 @@
 
 #include "internal.h"
 
+#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
+
 #ifdef CONFIG_NUMA
 int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
 
@@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum 
node_stat_item item)
s16 v, t;
 
v = __this_cpu_inc_return(*p);
-   t = __this_cpu_read(pcp->stat_threshold);
+   if (item >= NR_VM_NUMA_STAT_ITEMS)
+   t = __this_cpu_read(pcp->stat_threshold);
+   else
+   t = VM_NUMA_STAT_THRESHOLD;
+
if (unlikely(v > t)) {
s16 overstep = t >> 1;
 
@@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data 
*pgdat,
 * Most of the time the thresholds are the same anyways
 * for all cpus in a node.
 */
-   t = this_cpu_read(pcp->stat_threshold);
+   if (item >= NR_VM_NUMA_STAT_ITEMS)
+   t = this_cpu_read(pcp->stat_threshold);
+   else
+   t = VM_NUMA_STAT_THRESHOLD;
 
o = this_cpu_read(*p);
n = delta + o;
-- 
2.7.4



[PATCH v2 0/5] mm: NUMA stats code cleanup and enhancement

2017-12-18 Thread Kemi Wang
The existing implementation of the NUMA counters is per logical CPU, kept in
zone->vm_numa_stat[] separated by zone, plus a global NUMA counter array
vm_numa_stat[]. However, unlike the other vmstat counters, NUMA stats don't
affect the system's decisions and are only consumed when reading from /proc
and /sys. Also, nodes usually have only a single zone, except for node 0,
and there isn't really any use case that needs these hit counts separated
by zone.

Therefore, we can migrate the implementation of the NUMA stats from per-zone
to per-node (as suggested by Andi Kleen), and reuse the existing per-cpu
infrastructure with a little enhancement for NUMA stats. In this way, we get
rid of the special handling for NUMA stats and keep the performance gain at
the same time. With this patch series, about 170 lines of code can be
removed.

The first patch migrates the NUMA stats from per-zone to per-node using the
existing per-cpu infrastructure. There is a small user-visible change when
reading /proc/zoneinfo, listed below:
                 Before                          After
Node 0, zone  DMA                   Node 0, zone  DMA
  per-node stats                      per-node stats
      nr_inactive_anon 7244             *numa_hit 98665086*
      nr_active_anon 177064             *numa_miss 0*
      ...                               *numa_foreign 0*
      nr_bounce    0                    *numa_interleave 21059*
      nr_free_cma  0                    *numa_local 98665086*
     *numa_hit 0*                       *numa_other 0*
     *numa_miss 0*                       nr_inactive_anon 20055
     *numa_foreign 0*                    nr_active_anon 389771
     *numa_interleave 0*                 ...
     *numa_local 0*                      nr_bounce 0
     *numa_other 0*                      nr_free_cma 0

The second patch extends the local cpu counter vm_node_stat_diff from s8 to
s16. It does not have any functional change.

The third patch uses a large, constant threshold size for the NUMA counters
to reduce the update frequency of the global NUMA counters.
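
As a rough, self-contained illustration of the fold pattern this relies on
(a userspace sketch, not kernel code; the names, the single-CPU setup and
the event loop are assumptions made only for this example): the global
counter is touched only when the per-CPU delta crosses the threshold, so a
larger threshold means fewer contended global updates.

#include <stdio.h>

static long global_numa_hit;            /* stands in for the node counter   */
static short cpu_diff;                  /* stands in for vm_node_stat_diff  */
static const short threshold = 32765;   /* S16_MAX - 2, as in the 3rd patch */

static void inc_numa_hit(void)
{
	short v = ++cpu_diff;

	if (v > threshold) {
		short overstep = threshold >> 1;

		global_numa_hit += v + overstep;  /* fold into the global counter */
		cpu_diff = -overstep;             /* leave headroom in both directions */
	}
}

int main(void)
{
	for (int i = 0; i < 100000; i++)
		inc_numa_hit();
	/* the global value plus the pending per-CPU delta always equals
	 * the number of events counted, only the fold frequency changes */
	printf("global %ld, pending %d, total %ld\n",
	       global_numa_hit, cpu_diff, global_numa_hit + cpu_diff);
	return 0;
}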

The fourth patch uses node_page_state_snapshot instead of node_page_state
when querying node stats (e.g. cat /sys/devices/system/node/node*/vmstat).
The only difference is that node_page_state_snapshot also includes the stat
values still pending in the local CPU counters.
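
A minimal userspace sketch of that difference (the names, values and CPU
count below are illustrative, not the kernel implementation): the snapshot
variant simply adds the not-yet-folded per-CPU deltas on top of the global
value.

#include <stdio.h>

#define NR_CPUS 4

static long node_counter = 1000;                  /* value already folded   */
static short cpu_diff[NR_CPUS] = { 3, -1, 7, 2 }; /* pending per-CPU deltas */

static long node_page_state_plain(void)
{
	return node_counter;
}

static long node_page_state_snapshot_sketch(void)
{
	long x = node_counter;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		x += cpu_diff[cpu];               /* include pending deltas */
	return x < 0 ? 0 : x;                     /* never report negative  */
}

int main(void)
{
	printf("state %ld, snapshot %ld\n",
	       node_page_state_plain(), node_page_state_snapshot_sketch());
	return 0;
}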

The last patch renames zone_statistics() to numa_statistics().

Finally, I want to extend my heartfelt appreciation to Michal Hocko for his
suggestion of reusing the existing per-cpu infrastructure, which makes this
series much better than before.

Changelog:
  v1->v2:
  a) enhance the existing per-cpu infrastructure for node page stats by
  extending the local cpu counters vm_node_stat_diff from s8 to s16
  b) reuse the per-cpu infrastructure for NUMA stats

Kemi Wang (5):
  mm: migrate NUMA stats from per-zone to per-node
  mm: Extends local cpu counter vm_diff_nodestat from s8 to s16
  mm: enlarge NUMA counters threshold size
  mm: use node_page_state_snapshot to avoid deviation
  mm: Rename zone_statistics() to numa_statistics()

 drivers/base/node.c|  28 +++
 include/linux/mmzone.h |  31 
 include/linux/vmstat.h |  31 
 mm/mempolicy.c |   2 +-
 mm/page_alloc.c|  22 +++---
 mm/vmstat.c| 206 +
 6 files changed, 74 insertions(+), 246 deletions(-)

-- 
2.7.4



Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

2017-12-14 Thread kemi


On 2017-12-14 15:29, Michal Hocko wrote:
> On Thu 14-12-17 09:40:32, kemi wrote:
>>
>>
>> or sometimes 
>> NUMA stats can't be disabled in their environments.
> 
> why?
> 
>> That's the reason
>> why we spent time to do that optimization other than simply adding a runtime
>> configuration interface.
>>
>> Furthermore, the code we optimized for is the core area of kernel that can
>> benefit most of kernel actions, more or less I think.
>>
>> All right, let's think about it in another way, does a u64 percpu array 
>> per-node
>> for NUMA stats really make code too much complicated and hard to maintain?
>> I'm afraid not IMHO.
> 
> I disagree. The whole numa stat things has turned out to be nasty to
> maintain. For a very limited gain. Now you are just shifting that
> elsewhere. Look, there are other counters taken in the allocator, we do
> not want to treat them specially. We have a nice per-cpu infrastructure
> here so I really fail to see why we should code-around it. If that can
> be improved then by all means let's do it.
> 

Yes, I agree with you that we can improve the current per-cpu
infrastructure.
Could we get a chance to increase the size of vm_node_stat_diff from s8 to
s16 for this "per-cpu infrastructure" (the generic per-cpu counter
infrastructure uses s32)? The range of type s8 seems insufficient with more
and more CPU cores, especially for monotonically increasing counters like
the NUMA counters.

                                 before    after (move NUMA to per_cpu_nodestat
                                            and change s8 to s16)
sizeof(struct per_cpu_nodestat)  28        68

If that is OK, we can also keep the performance improvement in a clean way.
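
A tiny standalone sketch of where those sizes come from (the item counts
below, 27 node stat items before the series and 33 after the six NUMA items
are folded in, are assumptions chosen only to reproduce the 28 -> 68 figures
and are not authoritative kernel values):

#include <stdio.h>

/* illustrative stand-ins for struct per_cpu_nodestat before and after */
struct per_cpu_nodestat_s8  { signed char stat_threshold; signed char diff[27]; };
struct per_cpu_nodestat_s16 { short       stat_threshold; short       diff[33]; };

int main(void)
{
	printf("before (s8,  27 items): %zu bytes\n",
	       sizeof(struct per_cpu_nodestat_s8));
	printf("after  (s16, 33 items): %zu bytes\n",
	       sizeof(struct per_cpu_nodestat_s16));
	return 0;
}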



Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

2017-12-13 Thread kemi


On 2017-12-12 16:11, Michal Hocko wrote:
> On Tue 12-12-17 10:05:26, kemi wrote:
>>
>>
>> On 2017-12-08 16:47, Michal Hocko wrote:
>>> On Fri 08-12-17 16:38:46, kemi wrote:
>>>>
>>>>
>>>> On 2017-11-30 17:45, Michal Hocko wrote:
>>>>> On Thu 30-11-17 17:32:08, kemi wrote:
>>>>
>>>> After thinking about how to optimize our per-node stats more gracefully, 
>>>> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
>>>> we can keep everything in per cpu counter and sum them up when read /proc
>>>> or /sys for numa stats. 
>>>> What's your idea for that? thanks
>>>
>>> I would like to see a strong argument why we cannot make it a _standard_
>>> node counter.
>>>
>>
>> all right. 
>> This issue is first reported and discussed in 2017 MM summit, referred to
>> the topic "Provoking and fixing memory bottlenecks -Focused on the page 
>> allocator presentation" presented by Jesper.
>>
>> http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit
>> 2017-JesperBrouer.pdf (slide 15/16)
>>
>> As you know, page allocator is too slow and has becomes a bottleneck
>> in high-speed network.
>> Jesper also showed some data in that presentation: with micro benchmark 
>> stresses order-0 fast path(per CPU pages), *32%* extra CPU cycles cost 
>> (143->97) comes from CONFIG_NUMA. 
>>
>> When I took a look at this issue, I reproduced this issue and got a
>> similar result to Jesper's. Furthermore, with the help from Jesper, 
>> the overhead is root caused and the real cause of this overhead comes
>> from an extra level of function calls such as zone_statistics() (*10%*,
>> nearly 1/3, including __inc_numa_state), policy_zonelist, get_task_policy(),
>> policy_nodemask and etc (perf profiling cpu cycles).  zone_statistics() 
>> is the biggest one introduced by CONFIG_NUMA in fast path that we can 
>> do something for optimizing page allocator. Plus, the overhead of 
>> zone_statistics() significantly increase with more and more cpu 
>> cores and nodes due to cache bouncing.
>>
>> Therefore, we submitted a patch before to mitigate the overhead of 
>> zone_statistics() by reducing global NUMA counter update frequency 
>> (enlarge threshold size, as suggested by Dave Hansen). I also would
>> like to have an implementation of a "_standard_node counter" for NUMA
>> stats, but I wonder how we can keep the performance gain at the
>> same time.
> 
> I understand all that. But we do have a way to put all that overhead
> away by disabling the stats altogether. I presume that CPU cycle
> sensitive workloads would simply use that option because the stats are
> quite limited in their usefulness anyway IMHO. So we are back to: Do
> normal workloads care all that much to have 3rd way to account for
> events? I haven't heard a sound argument for that.
> 

I'm not a fan of adding code that nobody (or 0.001% of users) cares about.
We can't depend too much on that tunable interface, because our customers
or even kernel hackers may not know about the newly added interface, and
sometimes NUMA stats can't be disabled in their environments. That's the
reason why we spent time on this optimization rather than simply adding a
runtime configuration interface.

Furthermore, the code we optimized is a core area of the kernel that
benefits most kernel activity, more or less, I think.

All right, let's think about it another way: does a per-node u64 percpu
array for NUMA stats really make the code too complicated and hard to
maintain? I'm afraid not, IMHO.





Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

2017-12-11 Thread kemi


On 2017-12-08 16:47, Michal Hocko wrote:
> On Fri 08-12-17 16:38:46, kemi wrote:
>>
>>
>> On 2017-11-30 17:45, Michal Hocko wrote:
>>> On Thu 30-11-17 17:32:08, kemi wrote:
>>
>> After thinking about how to optimize our per-node stats more gracefully, 
>> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
>> we can keep everything in per cpu counter and sum them up when read /proc
>> or /sys for numa stats. 
>> What's your idea for that? thanks
> 
> I would like to see a strong argument why we cannot make it a _standard_
> node counter.
> 

All right.
This issue was first reported and discussed at the 2017 MM summit, in the
topic "Provoking and fixing memory bottlenecks - Focused on the page
allocator" presented by Jesper.

http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf (slide 15/16)

As you know, the page allocator is too slow and has become a bottleneck
in high-speed networking.
Jesper also showed some data in that presentation: with a micro-benchmark
stressing the order-0 fast path (per-CPU pages), *32%* extra CPU cycle cost
(143 -> 97) comes from CONFIG_NUMA.

When I took a look at this issue, I reproduced it and got a result similar
to Jesper's. Furthermore, with help from Jesper, the overhead was root
caused: it comes from an extra level of function calls such as
zone_statistics() (*10%*, nearly 1/3, including __inc_numa_state),
policy_zonelist, get_task_policy(), policy_nodemask and so on (perf
profiling of CPU cycles). zone_statistics() is the biggest contributor
introduced by CONFIG_NUMA in the fast path that we can do something about
to optimize the page allocator. Plus, the overhead of zone_statistics()
increases significantly with more and more CPU cores and nodes due to
cache bouncing.

Therefore, we previously submitted a patch to mitigate the overhead of
zone_statistics() by reducing the global NUMA counter update frequency
(enlarging the threshold size, as suggested by Dave Hansen). I would also
like to have an implementation of a "_standard_ node counter" for NUMA
stats, but I wonder how we can keep the performance gain at the same time.



Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

2017-12-08 Thread kemi


On 2017-11-30 17:45, Michal Hocko wrote:
> On Thu 30-11-17 17:32:08, kemi wrote:

> Do not get me wrong. If we want to make per-node stats more optimal,
> then by all means let's do that. But having 3 sets of counters is just
> way to much.
> 

Hi, Michal
  Apologies for responding late in this email thread.

After thinking about how to optimize our per-node stats more gracefully,
we could add a u64 vm_numa_stat_diff[] array to struct per_cpu_nodestat.
That way, we keep everything in per-CPU counters and only sum them up when
/proc or /sys is read for the NUMA stats.
What's your opinion on that? Thanks.
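
A rough userspace sketch of that idea (the struct layout, item count and
helper names below are illustrative assumptions, not the code that was
eventually merged):

#include <stdint.h>
#include <stdio.h>

enum { NUMA_HIT, NUMA_MISS, NUMA_FOREIGN, NUMA_INTERLEAVE_HIT,
       NUMA_LOCAL, NUMA_OTHER, NR_VM_NUMA_STAT_ITEMS };

#define NR_CPUS 4

/* stand-in for struct per_cpu_nodestat with the proposed u64 array */
struct per_cpu_nodestat_sketch {
	int8_t   stat_threshold;
	int8_t   vm_node_stat_diff[27];                      /* existing */
	uint64_t vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];   /* proposed */
};

static struct per_cpu_nodestat_sketch pcp[NR_CPUS];

/* fast path: bump the local counter only, no global counter is touched */
static void inc_numa(int cpu, int item)
{
	pcp[cpu].vm_numa_stat_diff[item]++;
}

/* slow path (/proc, /sys reads): sum the per-CPU counters on demand */
static uint64_t read_numa(int item)
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += pcp[cpu].vm_numa_stat_diff[item];
	return sum;
}

int main(void)
{
	inc_numa(0, NUMA_HIT);
	inc_numa(1, NUMA_HIT);
	printf("numa_hit %llu\n", (unsigned long long)read_numa(NUMA_HIT));
	return 0;
}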

The motivation for that modification is listed below:
1) Thanks to the 0-day system, a bug was reported against the V1 patch:

[0.00] BUG: unable to handle kernel paging request at 0392b000
[0.00] IP: __inc_numa_state+0x2a/0x34
[0.00] *pdpt =  *pde = f000ff53f000ff53 
[0.00] Oops: 0002 [#1] PREEMPT SMP
[0.00] Modules linked in:
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.0-12996-g81611e2 #1
[0.00] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[0.00] task: cbf56000 task.stack: cbf4e000
[0.00] EIP: __inc_numa_state+0x2a/0x34
[0.00] EFLAGS: 00210006 CPU: 0
[0.00] EAX: 0392b000 EBX:  ECX:  EDX: cbef90ef
[0.00] ESI: cffdb320 EDI: 0004 EBP: cbf4fd80 ESP: cbf4fd7c
[0.00]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[0.00] CR0: 80050033 CR2: 0392b000 CR3: 0c0a8000 CR4: 000406b0
[0.00] DR0:  DR1:  DR2:  DR3: 
[0.00] DR6: fffe0ff0 DR7: 0400
[0.00] Call Trace:
[0.00]  zone_statistics+0x4d/0x5b
[0.00]  get_page_from_freelist+0x257/0x993
[0.00]  __alloc_pages_nodemask+0x108/0x8c8
[0.00]  ? __bitmap_weight+0x38/0x41
[0.00]  ? pcpu_next_md_free_region+0xe/0xab
[0.00]  ? pcpu_chunk_refresh_hint+0x8b/0xbc
[0.00]  ? pcpu_chunk_slot+0x1e/0x24
[0.00]  ? pcpu_chunk_relocate+0x15/0x6d
[0.00]  ? find_next_bit+0xa/0xd
[0.00]  ? cpumask_next+0x15/0x18
[0.00]  ? pcpu_alloc+0x399/0x538
[0.00]  cache_grow_begin+0x85/0x31c
[0.00]  cache_alloc+0x147/0x1e0
[0.00]  ? debug_smp_processor_id+0x12/0x14
[0.00]  kmem_cache_alloc+0x80/0x145
[0.00]  create_kmalloc_cache+0x22/0x64
[0.00]  kmem_cache_init+0xf9/0x16c
[0.00]  start_kernel+0x1d4/0x3d6
[0.00]  i386_start_kernel+0x9a/0x9e
[0.00]  startup_32_smp+0x15f/0x170

That is because the u64 percpu pointer vm_numa_stat is used before it is initialized.

[...]
> +extern u64 __percpu *vm_numa_stat;
[...]
> +#ifdef CONFIG_NUMA
> + size = sizeof(u64) * num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS;
> + align = __alignof__(u64[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS]);
> + vm_numa_stat = (u64 __percpu *)__alloc_percpu(size, align);
> +#endif

The pointer is used in mm_init->kmem_cache_init->create_kmalloc_cache->...->
__alloc_pages() when CONFIG_SLAB/CONFIG_ZONE_DMA is set in kconfig, while
vm_numa_stat is only initialized in setup_per_cpu_pageset(), which runs after
mm_init() is called. The proposal mentioned above can fix this by making the
numa stats counters ready before mm_init() is called
(start_kernel->build_all_zonelists() can help to do that).

2) Compared to the V1 patch, this modification makes the semantics of the
per-node numa stats clearer for review and maintenance.
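
To make the proposal a bit more concrete, a minimal sketch could look like
this (it assumes the existing struct per_cpu_nodestat layout; the
vm_numa_stat_diff[] field and the snapshot helper are only illustrative, not
taken from an actual patch):

struct per_cpu_nodestat {
        s8 stat_threshold;
        s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
#ifdef CONFIG_NUMA
        u64 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];   /* proposed addition */
#endif
};

/* Slow path only: fold the per-cpu counters when /proc or /sys is read. */
static unsigned long node_numa_stat_snapshot(int node, enum numa_stat_item item)
{
        struct pglist_data *pgdat = NODE_DATA(node);
        unsigned long x = 0;
        int cpu;

        for_each_online_cpu(cpu)
                x += per_cpu_ptr(pgdat->per_cpu_nodestats,
                                 cpu)->vm_numa_stat_diff[item];

        return x;
}

If the diff array lives in struct per_cpu_nodestat, it can be made ready from
build_all_zonelists() as suggested above, before mm_init() runs, which also
addresses the 0-day report in 1).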


RE: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

2017-11-30 Thread Wang, Kemi
Of course, we should do that AFAP. Thanks for your comments :)

-Original Message-
From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of 
Michal Hocko
Sent: Thursday, November 30, 2017 5:45 PM
To: Wang, Kemi <kemi.w...@intel.com>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>; Andrew Morton 
<a...@linux-foundation.org>; Vlastimil Babka <vba...@suse.cz>; Mel Gorman 
<mgor...@techsingularity.net>; Johannes Weiner <han...@cmpxchg.org>; 
Christopher Lameter <c...@linux.com>; YASUAKI ISHIMATSU 
<yasu.isim...@gmail.com>; Andrey Ryabinin <aryabi...@virtuozzo.com>; Nikolay 
Borisov <nbori...@suse.com>; Pavel Tatashin <pasha.tatas...@oracle.com>; David 
Rientjes <rient...@google.com>; Sebastian Andrzej Siewior 
<bige...@linutronix.de>; Dave <dave.han...@linux.intel.com>; Kleen, Andi 
<andi.kl...@intel.com>; Chen, Tim C <tim.c.c...@intel.com>; Jesper Dangaard 
Brouer <bro...@redhat.com>; Huang, Ying <ying.hu...@intel.com>; Lu, Aaron 
<aaron...@intel.com>; Li, Aubrey <aubrey...@intel.com>; Linux MM 
<linux...@kvack.org>; Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

On Thu 30-11-17 17:32:08, kemi wrote:
[...]
> Your patch saves more code than mine because the node stats framework 
> is reused for numa stats. But it has a performance regression because 
> of the limitation of threshold size (125 at most, see 
> calculate_normal_threshold() in vmstat.c) in inc_node_state().

But this "regression" would be visible only on those workloads which really 
need to squeeze every single cycle out of the allocation hot path and those are 
supposed to disable the accounting altogether. Or is this visible on a wider 
variety of workloads.

Do not get me wrong. If we want to make per-node stats more optimal, then by 
all means let's do that. But having 3 sets of counters is just way to much.

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to 
majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

2017-11-30 Thread kemi


On 2017年11月30日 16:53, Michal Hocko wrote:
> On Thu 30-11-17 13:56:13, kemi wrote:
>>
>>
>> On 2017年11月29日 20:17, Michal Hocko wrote:
>>> On Tue 28-11-17 14:00:23, Kemi Wang wrote:
>>>> The existed implementation of NUMA counters is per logical CPU along with
>>>> zone->vm_numa_stat[] separated by zone, plus a global numa counter array
>>>> vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
>>>> effect system's decision and are only read from /proc and /sys, it is a
>>>> slow path operation and likely tolerate higher overhead. Additionally,
>>>> usually nodes only have a single zone, except for node 0. And there isn't
>>>> really any use where you need these hits counts separated by zone.
>>>>
>>>> Therefore, we can migrate the implementation of numa stats from per-zone to
>>>> per-node, and get rid of these global numa counters. It's good enough to
>>>> keep everything in a per cpu ptr of type u64, and sum them up when need, as
>>>> suggested by Andi Kleen. That's helpful for code cleanup and enhancement
>>>> (e.g. save more than 130+ lines code).
>>>
>>> I agree. Having these stats per zone is a bit of overcomplication. The
>>> only consumer is /proc/zoneinfo and I would argue this doesn't justify
>>> the additional complexity. Who does really need to know per zone broken
>>> out numbers?
>>>
>>> Anyway, I haven't checked your implementation too deeply but why don't
>>> you simply define static percpu array for each numa node?
>>
>> To be honest, there are another two ways I can think of listed below. but I 
>> don't
>> think they are simpler than my current implementation. Maybe you have better 
>> idea.
>>
>> static u64 __percpu vm_stat_numa[num_possible_nodes() * 
>> NR_VM_NUMA_STAT_ITEMS];
>> But it's not correct.
>>
>> Or we can add an u64 percpu array with size of NR_VM_NUMA_STAT_ITEMS in 
>> struct pglist_data.
>>
>> My current implementation is quite straightforward by combining all of local 
>> counters
>> together, only one percpu array with size of 
>> num_possible_nodes()*NR_VM_NUMA_STAT_ITEMS 
>> is enough for that.
> 
> Well, this is certainly a matter of taste. But let's have a look what we
> have currently. We have per zone, per node and numa stats. That looks one
> way to many to me. Why don't we simply move the whole numa stat thingy
> into per node stats? The code would simplify even more. We are going to
> lose /proc/zoneinfo per-zone data but we are losing those without your
> patch anyway. So I've just scratched the following on your patch and the
> cumulative diff looks even better
> 
>  drivers/base/node.c|  22 ++---
>  include/linux/mmzone.h |  22 ++---
>  include/linux/vmstat.h |  38 +
>  mm/mempolicy.c |   2 +-
>  mm/page_alloc.c|  20 ++---
>  mm/vmstat.c| 221 
> +
>  6 files changed, 30 insertions(+), 295 deletions(-)
> 
> I haven't tested it at all yet. This is just to show the idea.
> ---
> commit 92f8f58d1b6cb5c54a5a197a42e02126a5f7ea1a
> Author: Michal Hocko <mho...@suse.com>
> Date:   Thu Nov 30 09:49:45 2017 +0100
> 
> - move NUMA stats to node stats
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0be5fbdadaac..315156310c99 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -190,17 +190,9 @@ static ssize_t node_read_vmstat(struct device *dev,
>   n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
>sum_zone_node_page_state(nid, i));
>  
> -#ifdef CONFIG_NUMA
> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
> - n += sprintf(buf+n, "%s %lu\n",
> -  vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> -  node_numa_state_snapshot(nid, i));
> -#endif
> -
>   for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>   n += sprintf(buf+n, "%s %lu\n",
> -  vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
> -  NR_VM_NUMA_STAT_ITEMS],
> +  vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>node_page_state(pgdat, i));
>  
>   return n;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b2d264f8c0c6..2c9c8b13c44b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -115,20 +115,6 @@ struct zone_padding {
>  #define ZONE_PADDING(name)
>  

Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

2017-11-29 Thread kemi


On 2017年11月29日 20:17, Michal Hocko wrote:
> On Tue 28-11-17 14:00:23, Kemi Wang wrote:
>> The existed implementation of NUMA counters is per logical CPU along with
>> zone->vm_numa_stat[] separated by zone, plus a global numa counter array
>> vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
>> effect system's decision and are only read from /proc and /sys, it is a
>> slow path operation and likely tolerate higher overhead. Additionally,
>> usually nodes only have a single zone, except for node 0. And there isn't
>> really any use where you need these hits counts separated by zone.
>>
>> Therefore, we can migrate the implementation of numa stats from per-zone to
>> per-node, and get rid of these global numa counters. It's good enough to
>> keep everything in a per cpu ptr of type u64, and sum them up when need, as
>> suggested by Andi Kleen. That's helpful for code cleanup and enhancement
>> (e.g. save more than 130+ lines code).
> 
> I agree. Having these stats per zone is a bit of overcomplication. The
> only consumer is /proc/zoneinfo and I would argue this doesn't justify
> the additional complexity. Who does really need to know per zone broken
> out numbers?
> 
> Anyway, I haven't checked your implementation too deeply but why don't
> you simply define static percpu array for each numa node?

To be honest, there are another two ways I can think of, listed below, but I
don't think they are simpler than my current implementation. Maybe you have a
better idea.

static u64 __percpu vm_stat_numa[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS];
But it's not correct, since num_possible_nodes() is not a compile-time constant.

Or we can add a u64 percpu array of size NR_VM_NUMA_STAT_ITEMS to struct
pglist_data.

My current implementation is quite straightforward: by combining all of the
local counters together, a single percpu array of size
num_possible_nodes()*NR_VM_NUMA_STAT_ITEMS is enough.

> [...]
>> +extern u64 __percpu *vm_numa_stat;
> [...]
>> +#ifdef CONFIG_NUMA
>> +size = sizeof(u64) * num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS;
>> +align = __alignof__(u64[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS]);
>> +vm_numa_stat = (u64 __percpu *)__alloc_percpu(size, align);
>> +#endif
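
For readers following the thread: the allocation quoted above sets up one flat
per-cpu array indexed by node and item. A hedged sketch of the fast-path
update side (the helper name is illustrative, not from the actual patch):

static inline void numa_stat_inc(int node, enum numa_stat_item item)
{
        u64 __percpu *p = vm_numa_stat + node * NR_VM_NUMA_STAT_ITEMS + item;

        /* Only this CPU's slot is written; no shared cache line is touched. */
        this_cpu_inc(*p);
}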


Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

2017-11-28 Thread kemi


On 2017年11月28日 16:09, Vlastimil Babka wrote:
> On 11/28/2017 07:00 AM, Kemi Wang wrote:
>> The existed implementation of NUMA counters is per logical CPU along with
>> zone->vm_numa_stat[] separated by zone, plus a global numa counter array
>> vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
>> effect system's decision and are only read from /proc and /sys, it is a
>> slow path operation and likely tolerate higher overhead. Additionally,
>> usually nodes only have a single zone, except for node 0. And there isn't
>> really any use where you need these hits counts separated by zone.
>>
>> Therefore, we can migrate the implementation of numa stats from per-zone to
>> per-node, and get rid of these global numa counters. It's good enough to
>> keep everything in a per cpu ptr of type u64, and sum them up when need, as
>> suggested by Andi Kleen. That's helpful for code cleanup and enhancement
>> (e.g. save more than 130+ lines code).
> 
> OK.
> 
>> With this patch, we can see 1.8%(335->329) drop of CPU cycles for single
>> page allocation and deallocation concurrently with 112 threads tested on a
>> 2-sockets skylake platform using Jesper's page_bench03 benchmark.
> 
> To be fair, one can now avoid the overhead completely since 4518085e127d
> ("mm, sysctl: make NUMA stats configurable"). But if we can still
> optimize it, sure.
> 

Yes, I did that several months ago. Both Dave Hansen and I thought that
auto tuning would be better, because people probably do not touch this
interface, but Michal had some concerns about that.

This patch aims to clean up the code for numa stats, with a small performance
improvement on top.

>> Benchmark provided by Jesper D Brouer(increase loop times to 1000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> Also, it does not cause obvious latency increase when read /proc and /sys
>> on a 2-sockets skylake platform. Latency shown by time command:
>>base head
>> /proc/vmstatsys 0m0.001s sys 0m0.001s
>>
>> /sys/devices/system/sys 0m0.001s sys 0m0.000s
>> node/node*/numastat
> 
> Well, here I have to point out that the coarse "time" command resolution
> here means the comparison of a single read cannot be compared. You would
> have to e.g. time a loop with enough iterations (which would then be all
> cache-hot, but better than nothing I guess).
> 

It is indeed a coarse comparison, intended only to show that the change does
not cause obvious overhead in a slow path.

All right, I will time a loop of reads to get a more accurate value.
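
Something along these lines should do (an untested sketch; the iteration
count and buffer size are arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        char buf[1 << 16];
        struct timespec t0, t1;
        long i, iters = 100000;
        double ns;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++) {
                int fd = open("/proc/vmstat", O_RDONLY);

                if (fd < 0)
                        return 1;
                while (read(fd, buf, sizeof(buf)) > 0)
                        ;
                close(fd);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.0f ns per read\n", ns / iters);
        return 0;
}

Pointing the same loop at /sys/devices/system/node/node*/numastat before and
after the patch gives comparable per-read numbers.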

>> We would not worry it much as it is a slow path and will not be read
>> frequently.
>>
>> Suggested-by: Andi Kleen <a...@linux.intel.com>
>> Signed-off-by: Kemi Wang <kemi.w...@intel.com>
> 
> ...
> 
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index 1779c98..7383d66 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -118,36 +118,8 @@ static inline void vm_events_fold_cpu(int cpu)
>>   * Zone and node-based page accounting with per cpu differentials.
>>   */
>>  extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
>> -extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>>  extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
>> -
>> -#ifdef CONFIG_NUMA
>> -static inline void zone_numa_state_add(long x, struct zone *zone,
>> - enum numa_stat_item item)
>> -{
>> -atomic_long_add(x, >vm_numa_stat[item]);
>> -atomic_long_add(x, _numa_stat[item]);
>> -}
>> -
>> -static inline unsigned long global_numa_state(enum numa_stat_item item)
>> -{
>> -long x = atomic_long_read(_numa_stat[item]);
>> -
>> -return x;
>> -}
>> -
>> -static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>> -enum numa_stat_item item)
>> -{
>> -long x = atomic_long_read(>vm_numa_stat[item]);
>> -int cpu;
>> -
>> -for_each_online_cpu(cpu)
>> -x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>> -
>> -return x;
>> -}
>> -#endif /* CONFIG_NUMA */
>> +extern u64 __percpu *vm_numa_stat;
>>  
>>  static inline void zone_page_state_add(long x, struct zone *zone,
>>   enum zone_stat_item item)
>> @@ -234,10 +206,39 @@ static inline unsigned long 
>> node_page_state_snapshot(pg_data_t *pgdat,
>>  
>> 

[PATCH 2/2] mm: Rename zone_statistics() to numa_statistics()

2017-11-27 Thread Kemi Wang
The numa statistics have been separated from the zone statistics framework,
but the function that updates the numa counters is still named
zone_statistics(), which is confusing. So, rename it to numa_statistics() and
update its call sites accordingly.

Signed-off-by: Kemi Wang <kemi.w...@intel.com>
---
 mm/page_alloc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 142e1ba..61fa717 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2783,7 +2783,7 @@ int __isolate_free_page(struct page *page, unsigned int 
order)
  *
  * Must be called with interrupts disabled.
  */
-static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
+static inline void numa_statistics(struct zone *preferred_zone, struct zone *z)
 {
 #ifdef CONFIG_NUMA
enum numa_stat_item local_stat = NUMA_LOCAL;
@@ -2845,7 +2845,7 @@ static struct page *rmqueue_pcplist(struct zone 
*preferred_zone,
page = __rmqueue_pcplist(zone,  migratetype, pcp, list);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-   zone_statistics(preferred_zone, zone);
+   numa_statistics(preferred_zone, zone);
}
local_irq_restore(flags);
return page;
@@ -2893,7 +2893,7 @@ struct page *rmqueue(struct zone *preferred_zone,
  get_pcppage_migratetype(page));
 
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-   zone_statistics(preferred_zone, zone);
+   numa_statistics(preferred_zone, zone);
local_irq_restore(flags);
 
 out:
-- 
2.7.4



[PATCH 1/2] mm: NUMA stats code cleanup and enhancement

2017-11-27 Thread Kemi Wang
The existing implementation of NUMA counters is per logical CPU along with
zone->vm_numa_stat[] separated by zone, plus a global numa counter array
vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
affect the system's decisions and are only read from /proc and /sys; that is
a slow path operation which can likely tolerate higher overhead. Additionally,
nodes usually have only a single zone, except for node 0, and there isn't
really any use case where you need these hit counts separated by zone.

Therefore, we can migrate the implementation of numa stats from per-zone to
per-node, and get rid of these global numa counters. It's good enough to
keep everything in a per cpu ptr of type u64, and sum them up when needed, as
suggested by Andi Kleen. That's helpful for code cleanup and enhancement
(e.g. it saves more than 130 lines of code).

With this patch, we can see a 1.8% (335->329) drop in CPU cycles for single
page allocation and deallocation concurrently with 112 threads, tested on a
2-socket Skylake platform using Jesper's page_bench03 benchmark.

Benchmark provided by Jesper D Brouer (increase loop times to 1000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

Also, it does not cause an obvious latency increase when reading /proc and
/sys on a 2-socket Skylake platform. Latency shown by the time command:

                           base            head
/proc/vmstat               sys 0m0.001s    sys 0m0.001s
/sys/devices/system/       sys 0m0.001s    sys 0m0.000s
node/node*/numastat

We would not worry about it much, as it is a slow path and will not be read
frequently.

Suggested-by: Andi Kleen <a...@linux.intel.com>
Signed-off-by: Kemi Wang <kemi.w...@intel.com>
---
 drivers/base/node.c|  14 ++---
 include/linux/mmzone.h |   2 -
 include/linux/vmstat.h |  61 +-
 mm/page_alloc.c|   7 +++
 mm/vmstat.c| 167 -
 5 files changed, 56 insertions(+), 195 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ee090ab..0be5fbd 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -169,12 +169,12 @@ static ssize_t node_read_numastat(struct device *dev,
   "interleave_hit %lu\n"
   "local_node %lu\n"
   "other_node %lu\n",
-  sum_zone_numa_state(dev->id, NUMA_HIT),
-  sum_zone_numa_state(dev->id, NUMA_MISS),
-  sum_zone_numa_state(dev->id, NUMA_FOREIGN),
-  sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
-  sum_zone_numa_state(dev->id, NUMA_LOCAL),
-  sum_zone_numa_state(dev->id, NUMA_OTHER));
+  node_numa_state_snapshot(dev->id, NUMA_HIT),
+  node_numa_state_snapshot(dev->id, NUMA_MISS),
+  node_numa_state_snapshot(dev->id, NUMA_FOREIGN),
+  node_numa_state_snapshot(dev->id, NUMA_INTERLEAVE_HIT),
+  node_numa_state_snapshot(dev->id, NUMA_LOCAL),
+  node_numa_state_snapshot(dev->id, NUMA_OTHER));
 }
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
@@ -194,7 +194,7 @@ static ssize_t node_read_vmstat(struct device *dev,
for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n",
 vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-sum_zone_numa_state(nid, i));
+node_numa_state_snapshot(nid, i));
 #endif
 
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c..b2d264f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -283,7 +283,6 @@ struct per_cpu_pageset {
struct per_cpu_pages pcp;
 #ifdef CONFIG_NUMA
s8 expire;
-   u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
 #endif
 #ifdef CONFIG_SMP
s8 stat_threshold;
@@ -504,7 +503,6 @@ struct zone {
ZONE_PADDING(_pad3_)
/* Zone statistics */
atomic_long_t   vm_stat[NR_VM_ZONE_STAT_ITEMS];
-   atomic_long_t   vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
 } cacheline_internodealigned_in_smp;
 
 enum pgdat_flags {
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1779c98..7383d66 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -118,36 +118,8 @@ static inline void vm_events_fold_cpu(int cpu)
  * Zone and node-based page accounting with per cpu differentials.
  */
 extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
-extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
 extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
-
-#ifdef CONFIG_NUMA
-static inline void zone_numa_state_add(long x, struct zone *zone,
-  
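
The vmstat.h hunk is cut off here in the archive. For context, the read side
that node_read_numastat() switches to folds the new flat per-cpu array; a
hedged sketch of what such a helper can look like (not necessarily the exact
code from this patch):

static inline unsigned long node_numa_state_snapshot(int node,
                                                enum numa_stat_item item)
{
        u64 __percpu *p = vm_numa_stat + node * NR_VM_NUMA_STAT_ITEMS + item;
        unsigned long x = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                x += *per_cpu_ptr(p, cpu);

        return x;
}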

Re: [01/18] x86/asm/64: Remove the restore_c_regs_and_iret label

2017-11-09 Thread kemi
Some performance regressions/improvements are reported by LKP-tools for this
patch series, tested on an Intel Atom processor. So I am posting the data here
for your reference.

Branch:x86/entry_consolidation
Commit id:
 base:50da9d439392fdd91601d36e7f05728265bff262
 head:69af865668fdb86a95e4e948b1f48b2689d60b73
Benchmark suite:will-it-scale
Download link:https://github.com/antonblanchard/will-it-scale/tree/master/tests
Metrics:
 will-it-scale.per_process_ops=processes/nr_cpu
 will-it-scale.per_thread_ops=threads/nr_cpu

tbox:lkp-avoton3(nr_cpu=8,memory=16G)
CPU: Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz
Performance regression with will-it-scale benchmark suite:
testcase         base        change    head        metric
eventfd1         1505677     -5.9%     1416132     will-it-scale.per_process_ops
                 1352716     -3.0%     1311943     will-it-scale.per_thread_ops
lseek2           7306698     -4.3%     6991473     will-it-scale.per_process_ops
                 4906388     -3.6%     4730531     will-it-scale.per_thread_ops
lseek1           7355365     -4.2%     7046224     will-it-scale.per_process_ops
                 4928961     -3.7%     4748791     will-it-scale.per_thread_ops
getppid1         8479806     -4.1%     8129026     will-it-scale.per_process_ops
                 8515252     -4.1%     8162076     will-it-scale.per_thread_ops
lock1            1054249     -3.2%     1020895     will-it-scale.per_process_ops
                 989145      -2.6%     963578      will-it-scale.per_thread_ops
dup1             2675825     -3.0%     2596257     will-it-scale.per_process_ops
futex3           4986520     -2.8%     4846640     will-it-scale.per_process_ops
                 5009388     -2.7%     4875126     will-it-scale.per_thread_ops
futex4           3932936     -2.0%     3854240     will-it-scale.per_process_ops
                 3950138     -2.0%     3872615     will-it-scale.per_thread_ops
futex1           2941886     -1.8%     2888912     will-it-scale.per_process_ops
futex2           2500203     -1.6%     2461065     will-it-scale.per_process_ops
                 1534692     -2.3%     1499532     will-it-scale.per_thread_ops
malloc1          61314       -1.0%     60725       will-it-scale.per_process_ops
                 19996       -1.5%     19688       will-it-scale.per_thread_ops

Performance improvement with will-it-scale benchmark suite:
testcase         base        change    head        metric
context_switch1  176376      +1.6%     179152      will-it-scale.per_process_ops
                 180703      +1.9%     184209      will-it-scale.per_thread_ops
page_fault2      179716      +2.5%     184272      will-it-scale.per_process_ops
                 146890      +2.8%     150989      will-it-scale.per_thread_ops
page_fault3      666953      +3.7%     691735      will-it-scale.per_process_ops
                 464641      +5.0%     487952      will-it-scale.per_thread_ops
unix1            483094      +4.4%     504201      will-it-scale.per_process_ops
                 450055      +7.5%     483637      will-it-scale.per_thread_ops
read2            575887      +5.0%     604440      will-it-scale.per_process_ops
                 500319      +5.2%     526361      will-it-scale.per_thread_ops
poll1            4614597     +5.4%     4864022     will-it-scale.per_process_ops
                 3981551     +5.8%     4213409     will-it-scale.per_thread_ops
pwrite2          383344      +5.7%     405151      will-it-scale.per_process_ops
                 367006      +5.0%     385209      will-it-scale.per_thread_ops
sched_yield      3011191     +6.0%     3191710     will-it-scale.per_process_ops
                 3024171     +6.1%     3208197     will-it-scale.per_thread_ops
pipe1            755487      +6.2%     802622      will-it-scale.per_process_ops
                 705136      +8.8%     766950      will-it-scale.per_thread_ops
pwrite3          422850      +6.6%     450660      will-it-scale.per_process_ops
                 413370      +3.7%     428704      will-it-scale.per_thread_ops
readseek1        972102      +6.7%     1036852     will-it-scale.per_process_ops
                 844877      +6.6%     900686      will-it-scale.per_thread_ops
pwrite1          981310      +6.8%     1047809     will-it-scale.per_process_ops

Re: mm, vmstat: Make sure mutex is a global static

2017-11-08 Thread kemi


On 2017年11月08日 05:38, Kees Cook wrote:
> The mutex in sysctl_vm_numa_stat_handler() needs to be a global static, not
> a stack variable, otherwise it doesn't serve any purpose. Also, reading the
> file with CONFIG_LOCKDEP=y will complain:
> 

It's my mistake. Kees, thanks for catching it.

> [   63.258593] INFO: trying to register non-static key.
> [   63.259113] the code is fine but needs lockdep annotation.
> [   63.259596] turning off the locking correctness validator.
> [   63.260073] CPU: 1 PID: 4102 Comm: perl Not tainted 
> 4.14.0-rc8-next-20171107+ #419
> [   63.260769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   63.261570] Call Trace:
> [   63.261783]  dump_stack+0x5f/0x86
> [   63.262062]  register_lock_class+0xe4/0x550
> [   63.262408]  ? __lock_acquire+0x308/0x1170
> [   63.262746]  __lock_acquire+0x7e/0x1170
> [   63.263063]  lock_acquire+0x9d/0x1d0
> [   63.263363]  ? sysctl_vm_numa_stat_handler+0x8f/0x2d0
> [   63.263777]  ? sysctl_vm_numa_stat_handler+0x8f/0x2d0
> [   63.264192]  __mutex_lock+0xb8/0x9a0
> [   63.264488]  ? sysctl_vm_numa_stat_handler+0x8f/0x2d0
> [   63.264942]  ? sysctl_vm_numa_stat_handler+0x8f/0x2d0
> [   63.265398]  ? sysctl_vm_numa_stat_handler+0x8f/0x2d0
> [   63.265840]  sysctl_vm_numa_stat_handler+0x8f/0x2d0
> [   63.266270]  proc_sys_call_handler+0xe3/0x100
> [   63.266655]  __vfs_read+0x33/0x1b0
> [   63.266957]  vfs_read+0xa6/0x150
> [   63.267244]  SyS_read+0x55/0xc0
> [   63.267525]  do_syscall_64+0x56/0x140
> [   63.267850]  entry_SYSCALL64_slow_path+0x25/0x25
> 
> Fixes: 920d5f77d1a25 ("mm, sysctl: make NUMA stats configurable")
> Cc: Jesper Dangaard Brouer 
> Cc: Dave Hansen 
> Cc: Ying Huang 
> Cc: Vlastimil Babka 
> Cc: Michal Hocko 
> Signed-off-by: Kees Cook 
> ---
>  mm/vmstat.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index e0593434fd58..40b2db6db6b1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -72,11 +72,12 @@ static void invalid_numa_statistics(void)
>   zero_global_numa_counters();
>  }
>  
> +static DEFINE_MUTEX(vm_numa_stat_lock);
> +
>  int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
>   void __user *buffer, size_t *length, loff_t *ppos)
>  {
>   int ret, oldval;
> - DEFINE_MUTEX(vm_numa_stat_lock);
>  
>   mutex_lock(_numa_stat_lock);
>   if (write)
> 


Re: [PATCH v2] buffer: Avoid setting buffer bits that are already set

2017-11-02 Thread kemi


On 2017年10月24日 09:16, Kemi Wang wrote:
> It's expensive to set buffer flags that are already set, because that
> causes a costly cache line transition.
> 
> A common case is setting the "verified" flag during ext4 writes.
> This patch checks for the flag being set first.
> 
> With the AIM7/creat-clo benchmark testing on a 48G ramdisk based-on ext4
> file system, we see 3.3%(15431->15936) improvement of aim7.jobs-per-min on
> a 2-sockets broadwell platform.
> 
> What the benchmark does is: it forks 3000 processes, and each  process do
> the following:
> a) open a new file
> b) close the file
> c) delete the file
> until loop=100*1000 times.
> 
> The original patch is contributed by Andi Kleen.
> 
> Signed-off-by: Andi Kleen <a...@linux.intel.com>
> Signed-off-by: Kemi Wang <kemi.w...@intel.com>
> Tested-by: Kemi Wang <kemi.w...@intel.com>
> Reviewed-by: Jens Axboe <ax...@kernel.dk>
> ---

It seems that this patch has still not been merged. Is anything wrong with it? Thanks.

>  include/linux/buffer_head.h | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index c8dae55..211d8f5 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -80,11 +80,14 @@ struct buffer_head {
>  /*
>   * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()
>   * and buffer_foo() functions.
> + * To avoid reset buffer flags that are already set, because that causes
> + * a costly cache line transition, check the flag first.
>   */
>  #define BUFFER_FNS(bit, name)
> \
>  static __always_inline void set_buffer_##name(struct buffer_head *bh)
> \
>  {\
> - set_bit(BH_##bit, &(bh)->b_state);  \
> + if (!test_bit(BH_##bit, &(bh)->b_state))\
> + set_bit(BH_##bit, &(bh)->b_state);  \
>  }\
>  static __always_inline void clear_buffer_##name(struct buffer_head *bh)  
> \
>  {\
> 
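
To see what the change means at a use site: with the test added above, an
instantiation such as BUFFER_FNS(Uptodate, uptodate) from buffer_head.h
expands to roughly the following (illustrative expansion only):

static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
{
        /* Skip the atomic RMW if the bit is already set. */
        if (!test_bit(BH_Uptodate, &bh->b_state))
                set_bit(BH_Uptodate, &bh->b_state);
}

So a redundant set_buffer_uptodate() call now only performs a read, which
keeps the cache line shared instead of forcing it exclusive.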


Re: [v5,22/22] powerpc/mm: Add speculative page fault

2017-10-26 Thread kemi
Some regressions were found by LKP-tools (Linux Kernel Performance) in this
patch series, tested on Intel 2s/4s Skylake platforms.
The regression results are sorted by the metric will-it-scale.per_process_ops.

Branch:Laurent-Dufour/Speculative-page-faults/20171011-213456(V4 patch series)
Commit id:
 base:9a4b4dd1d8700dd5771f11dd2c048e4363efb493
 head:56a4a8962fb32555a42eefdc9a19eeedd3e8c2e6
Benchmark suite:will-it-scale
Download link:https://github.com/antonblanchard/will-it-scale/tree/master/tests
Metrics:
 will-it-scale.per_process_ops=processes/nr_cpu
 will-it-scale.per_thread_ops=threads/nr_cpu

tbox:lkp-skl-4sp1(nr_cpu=192,memory=768G)
kconfig:CONFIG_TRANSPARENT_HUGEPAGE is not set
testcase         base        change    head        metric

brk1             2251803     -18.1%    1843535     will-it-scale.per_process_ops
                 341101      -17.5%    281284      will-it-scale.per_thread_ops
malloc1          48833       -9.2%     44343       will-it-scale.per_process_ops
                 31555       +2.9%     32473       will-it-scale.per_thread_ops
page_fault3      913019      -8.5%     835203      will-it-scale.per_process_ops
                 233978      -18.1%    191593      will-it-scale.per_thread_ops
mmap2            95892       -6.6%     89536       will-it-scale.per_process_ops
                 90180       -13.7%    77803       will-it-scale.per_thread_ops
mmap1            109586      -4.7%     104414      will-it-scale.per_process_ops
                 104477      -12.4%    91484       will-it-scale.per_thread_ops
sched_yield      4964649     -2.1%     4859927     will-it-scale.per_process_ops
                 4946759     -1.7%     4864924     will-it-scale.per_thread_ops
write1           1345159     -1.3%     1327719     will-it-scale.per_process_ops
                 1228754     -2.2%     1201915     will-it-scale.per_thread_ops
page_fault2      202519      -1.0%     200545      will-it-scale.per_process_ops
                 96573       -10.4%    86526       will-it-scale.per_thread_ops
page_fault1      225608      -0.9%     223585      will-it-scale.per_process_ops
                 105945      +14.4%    121199      will-it-scale.per_thread_ops

tbox:lkp-skl-4sp1(nr_cpu=192,memory=768G)
kconfig:CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
testcase         base        change    head        metric

context_switch1  333780      -23.0%    256927      will-it-scale.per_process_ops
brk1             2263539     -18.8%    1837462     will-it-scale.per_process_ops
                 325854      -15.7%    274752      will-it-scale.per_thread_ops
malloc1          48746       -13.5%    42148       will-it-scale.per_process_ops
mmap1            106860      -12.4%    93634       will-it-scale.per_process_ops
                 98082       -18.9%    79506       will-it-scale.per_thread_ops
mmap2            92468       -11.3%    82059       will-it-scale.per_process_ops
                 80468       -8.9%     73343       will-it-scale.per_thread_ops
page_fault3      900709      -9.1%     818851      will-it-scale.per_process_ops
                 229837      -18.3%    187769      will-it-scale.per_thread_ops
write1           1327409     -1.7%     1305048     will-it-scale.per_process_ops
                 1215658     -1.6%     1196479     will-it-scale.per_thread_ops
writeseek3       300639      -1.6%     295882      will-it-scale.per_process_ops
                 231118      -2.2%     225929      will-it-scale.per_thread_ops
signal1          122011      -1.5%     120155      will-it-scale.per_process_ops
futex1           5123778     -1.2%     5062087     will-it-scale.per_process_ops
page_fault2      202321      -1.0%     200289      will-it-scale.per_process_ops
                 93073       -9.8%     83927       will-it-scale.per_thread_ops

tbox:lkp-skl-2sp2(nr_cpu=112,memory=64G)
kconfig:CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
testcase         base        change    head        metric

brk1             2177903     -20.0%    1742054     will-it-scale.per_process_ops
                 434558      -15.3%    367896      will-it-scale.per_thread_ops
malloc1          64871       -10.3%    58174       will-it-scale.per_process_ops
page_fault3      882435      -9.0%     802892      will-it-scale.per_process_ops
                 299176      -15.7%    252170

Re: [PATCH] buffer: Avoid setting buffer bits that are already set

2017-10-23 Thread kemi


On 2017年10月24日 09:21, Andi Kleen wrote:
> kemi <kemi.w...@intel.com> writes:
>>
>> I'll see if I can find some
>>> time to implement the above in a nice way.
>>
>> Agree. Maybe something like test_and_set_bit() would be more suitable.
> 
> test_and_set_bit is a very different operation for the CPU because
> it is atomic for both. But we want the initial read to not
> be atomic.
> 

I meant to express the idea of testing before setting the bit.
Apologies for the confusion.

> If you add special functions use a different variant that is only
> atomic for the set.
> 
> -Andi
> 


[PATCH v2] buffer: Avoid setting buffer bits that are already set

2017-10-23 Thread Kemi Wang
It's expensive to set buffer flags that are already set, because that
causes a costly cache line transition.

A common case is setting the "verified" flag during ext4 writes.
This patch checks for the flag being set first.

With the AIM7/creat-clo benchmark running on a 48G ramdisk-based ext4
file system, we see a 3.3% (15431->15936) improvement in aim7.jobs-per-min
on a 2-socket Broadwell platform.

What the benchmark does is: it forks 3000 processes, and each process does
the following:
a) open a new file
b) close the file
c) delete the file
until loop=100*1000 times.

The original patch is contributed by Andi Kleen.

Signed-off-by: Andi Kleen <a...@linux.intel.com>
Signed-off-by: Kemi Wang <kemi.w...@intel.com>
Tested-by: Kemi Wang <kemi.w...@intel.com>
Reviewed-by: Jens Axboe <ax...@kernel.dk>
---
 include/linux/buffer_head.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index c8dae55..211d8f5 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -80,11 +80,14 @@ struct buffer_head {
 /*
  * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()
  * and buffer_foo() functions.
+ * To avoid re-setting buffer flags that are already set, which causes a
+ * costly cache line transition, check the flag first.
  */
 #define BUFFER_FNS(bit, name)  \
 static __always_inline void set_buffer_##name(struct buffer_head *bh)  \
 {  \
-   set_bit(BH_##bit, &(bh)->b_state);  \
+   if (!test_bit(BH_##bit, &(bh)->b_state))\
+   set_bit(BH_##bit, &(bh)->b_state);  \
 }  \
 static __always_inline void clear_buffer_##name(struct buffer_head *bh)
\
 {  \
-- 
2.7.4
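
To make the macro's effect concrete, here is roughly how the patched
BUFFER_FNS() above expands for one flag, using BH_Uptodate /
set_buffer_uptodate purely as an illustrative instance:

/*
 * Illustration only: expansion of the patched setter for
 * BUFFER_FNS(Uptodate, uptodate).  The initial test_bit() is a plain read
 * that can be satisfied from a shared cache line; the atomic set_bit() --
 * and the exclusive cache line ownership it requires -- only happens when
 * the flag is not already set.
 */
static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
{
	if (!test_bit(BH_Uptodate, &(bh)->b_state))
		set_bit(BH_Uptodate, &(bh)->b_state);
}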



Re: [PATCH] buffer: Avoid setting buffer bits that are already set

2017-10-23 Thread kemi


On 2017年10月24日 00:19, Jens Axboe wrote:
> On 10/23/2017 10:27 AM, Kemi Wang wrote:
>> It's expensive to set buffer flags that are already set, because that
>> causes a costly cache line transition.
>>
>> A common case is setting the "verified" flag during ext4 writes.
>> This patch checks for the flag being set first.
>>
>> With the AIM7/creat-clo benchmark testing on a 48G ramdisk based-on ext4
>> file system, we see 3.3%(15431->15936) improvement of aim7.jobs-per-min on
>> a 2-sockets broadwell platform.
>>
>> What the benchmark does is: it forks 3000 processes, and each  process do
>> the following:
>> a) open a new file
>> b) close the file
>> c) delete the file
>> until loop=100*1000 times.
>>
>> The original patch is contributed by Andi Kleen.
> 
> We discussed this recently, in reference to this commit:
> 
> commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d
> Author: Jens Axboe <ax...@fb.com>
> Date:   Thu May 22 11:54:16 2014 -0700
> 
> mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT
> 
> which made a massive difference, as the changelog details.
> 
> blk-mq uses this extensively as well, where possible. The problem is
> that it always has to be explained, hence the recent discussion was
> around perhaps adding
> 
> set_bit_if_not_set()
> clear_bit_if_set()
> 
> or similar functions, to document in a single location why this matters.
> Additionally, some archs may be able to implement that in an efficient
> manner.
> 
> You can add my reviewed-by to the below, 

Thanks.

> I'll see if I can find some
> time to implement the above in a nice way.

Agree. Maybe something like test_and_set_bit() would be more suitable.

> In the mean time, you may
> want to consider adding a comment to the function explaining why you
> have done it that way.
> 

Sure.
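
A minimal sketch of the two helpers suggested above; the names
set_bit_if_not_set()/clear_bit_if_set() come from this discussion and are
not an existing kernel API, and an architecture could of course provide a
more efficient implementation:

/*
 * Sketch only.  The initial test_bit() is an ordinary, non-atomic read;
 * the atomic RMW (set_bit/clear_bit) and the cache line ownership transfer
 * it implies are skipped entirely when the bit already has the desired
 * value.
 */
static __always_inline void set_bit_if_not_set(long nr, volatile unsigned long *addr)
{
	if (!test_bit(nr, addr))
		set_bit(nr, addr);
}

static __always_inline void clear_bit_if_set(long nr, volatile unsigned long *addr)
{
	if (test_bit(nr, addr))
		clear_bit(nr, addr);
}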


[PATCH] buffer: Avoid setting buffer bits that are already set

2017-10-23 Thread Kemi Wang
It's expensive to set buffer flags that are already set, because that
causes a costly cache line transition.

A common case is setting the "verified" flag during ext4 writes.
This patch checks for the flag being set first.

With the AIM7/creat-clo benchmark running on a 48G ramdisk-based ext4
file system, we see a 3.3% (15431->15936) improvement in aim7.jobs-per-min
on a 2-socket Broadwell platform.

What the benchmark does is: it forks 3000 processes, and each process does
the following:
a) open a new file
b) close the file
c) delete the file
until loop=100*1000 times.

The original patch is contributed by Andi Kleen.

Signed-off-by: Andi Kleen <a...@linux.intel.com>
Signed-off-by: Kemi Wang <kemi.w...@intel.com>
Tested-by: Kemi Wang <kemi.w...@intel.com>
---
 include/linux/buffer_head.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index c8dae55..e1799f7 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -84,7 +84,8 @@ struct buffer_head {
 #define BUFFER_FNS(bit, name)  \
 static __always_inline void set_buffer_##name(struct buffer_head *bh)  \
 {  \
-   set_bit(BH_##bit, &(bh)->b_state);  \
+   if (!test_bit(BH_##bit, &(bh)->b_state))\
+   set_bit(BH_##bit, &(bh)->b_state);  \
 }  \
 static __always_inline void clear_buffer_##name(struct buffer_head *bh)
\
 {  \
-- 
2.7.4



[PATCH v5] mm, sysctl: make NUMA stats configurable

2017-10-17 Thread Kemi Wang
This is the second step, which introduces a tunable interface that allows
NUMA stats to be configured at runtime in order to optimize
zone_statistics(), as suggested by Dave Hansen and Ying Huang.

=
When page allocation performance becomes a bottleneck and you can tolerate
some possible tool breakage and decreased numa counter precision, you can
do:
echo 0 > /proc/sys/vm/numa_stat
In this case, numa counter update is ignored. We can see about
*4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
cycles per single page allocation and reclaim on Jesper's page_bench03 (88
threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
memory).

Benchmark link provided by Jesper D Brouer(increase loop times to
1000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
bench

=
When page allocation performance is not a bottleneck and you want all
tooling to work, you can do:
echo 1 > /proc/sys/vm/numa_stat
This is system default setting.

Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
for comments to help improve the original patch.

ChangeLog:
  V4->V5
  a) Scope vm_numa_stat_lock into the sysctl handler function, as suggested
  by Michal Hocko;
  b) Only allow the values 0/1 when setting numa_stat from userspace, which
  keeps the possibility of adding an auto mode in the future (e.g. 2 for
  auto mode), as suggested by Michal Hocko.

  V3->V4
  a) Get rid of the auto mode of numa stats; it may be added back if
  necessary, as aligned on before;
  b) Skip NUMA_INTERLEAVE_HIT counter update when numa stats is disabled,
  as reported by Andrey Ryabinin. See commit "de55c8b2519" for details
  c) Remove extern declaration for those clear_numa_ function, and make
  them static in vmstat.c, as suggested by Vlastimil Babka.

  V2->V3:
  a) Propose a better way to use jump label to eliminate the overhead of
  branch selection in zone_statistics(), as inspired by Ying Huang;
  b) Add a paragraph in commit log to describe the way for branch target
  selection;
  c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
  and change the description accordingly, as suggested by Michal Hocko;
  d) Make this functionality NUMA-specific via ifdef

  V1->V2:
  a) Merge to one patch;
  b) Use jump label to eliminate the overhead of branch selection;
  c) Add a single-time log message at boot time to help tell users what
  happened.

Reported-by: Jesper Dangaard Brouer 
Suggested-by: Dave Hansen 
Suggested-by: Ying Huang 
Signed-off-by: Kemi Wang 
---
 Documentation/sysctl/vm.txt | 16 +++
 include/linux/vmstat.h  | 10 +++
 kernel/sysctl.c |  9 ++
 mm/mempolicy.c  |  3 ++
 mm/page_alloc.c |  6 
 mm/vmstat.c | 70 +
 6 files changed, 114 insertions(+)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 9baf66a..f65c5c7 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -58,6 +58,7 @@ Currently, these files are in /proc/sys/vm:
 - percpu_pagelist_fraction
 - stat_interval
 - stat_refresh
+- numa_stat
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
@@ -792,6 +793,21 @@ with no ill effects: errors and warnings on these stats 
are suppressed.)
 
 ==
 
+numa_stat
+
+This interface allows runtime configuration of numa statistics.
+
+When page allocation performance becomes a bottleneck and you can tolerate
+some possible tool breakage and decreased numa counter precision, you can
+do:
+   echo 0 > /proc/sys/vm/numa_stat
+
+When page allocation performance is not a bottleneck and you want all
+tooling to work, you can do:
+   echo 1 > /proc/sys/vm/numa_stat
+
+==
+
 swappiness
 
 This control is used to define how aggressive the kernel will swap
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ade7cb5..c605c94 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -6,9 +6,19 @@
 #include 
 #include 
 #include 
+#include 
 
 extern int sysctl_stat_interval;
 
+#ifdef CONFIG_NUMA
+#define ENABLE_NUMA_STAT   1
+#define DISABLE_NUMA_STAT   0
+extern int sysctl_vm_numa_stat;
+DECLARE_STATIC_KEY_TRUE(vm_numa_stat_key);
+extern int sysctl_vm_numa_stat_handler(struct ctl_table *table,
+   int write, void __user *buffer, size_t *length, loff_t *ppos);
+#endif
+
 #ifdef CONFIG_VM_EVENT_COUNTERS
 /*
  * Light weight per cpu counter implementation.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d9c31bc..8f272db 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysct
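
(The rest of the diff is truncated in this archive.)  As a hedged sketch --
not a quote of the actual patch -- the vm_numa_stat_key jump label described
in the changelog would gate the NUMA counter update in zone_statistics()
roughly like this; function and field details are illustrative:

/*
 * Sketch only.  With the key enabled (the default), static_branch_likely()
 * falls straight through into the accounting code at essentially no cost;
 * when the sysctl flips the key off, the branch is patched at runtime to
 * jump over the per-cpu counter updates.
 */
DEFINE_STATIC_KEY_TRUE(vm_numa_stat_key);

static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
	if (!static_branch_likely(&vm_numa_stat_key))
		return;		/* NUMA stats disabled: skip the update */

	if (zone_to_nid(z) == zone_to_nid(preferred_zone))
		__inc_numa_state(z, NUMA_HIT);
	else
		__inc_numa_state(z, NUMA_MISS);
	/* NUMA_LOCAL/NUMA_OTHER/NUMA_FOREIGN accounting elided for brevity */
}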

Re: [PATCH v4] mm, sysctl: make NUMA stats configurable

2017-10-17 Thread kemi


On 2017年10月17日 15:54, Michal Hocko wrote:
> On Tue 17-10-17 09:20:58, Kemi Wang wrote:
> [...]
> 
> Other than two remarks below, it looks good to me and it also looks
> simpler.
> 
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 4bb13e7..e746ed1 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -32,6 +32,76 @@
>>  
>>  #define NUMA_STATS_THRESHOLD (U16_MAX - 2)
>>  
>> +#ifdef CONFIG_NUMA
>> +int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>> +static DEFINE_MUTEX(vm_numa_stat_lock);
> 
> You can scope this mutex to the sysctl handler function
> 

OK, thanks.

>> +int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
>> +void __user *buffer, size_t *length, loff_t *ppos)
>> +{
>> +int ret, oldval;
>> +
>> +mutex_lock(&vm_numa_stat_lock);
>> +if (write)
>> +oldval = sysctl_vm_numa_stat;
>> +ret = proc_dointvec(table, write, buffer, length, ppos);
>> +if (ret || !write)
>> +goto out;
>> +
>> +if (oldval == sysctl_vm_numa_stat)
>> +goto out;
>> +else if (oldval == DISABLE_NUMA_STAT) {
> 
> So basically any value will enable numa stats. This means that we would
> never be able to extend this interface to e.g. auto mode (say value 2).
> I guess you meant to check sysctl_vm_numa_stat == ENABLE_NUMA_STAT?
> 

I meant to make it more general than ENABLE_NUMA_STAT (any non-zero value
would enable it), but that makes it hard to extend, as you said.
So, it would be like this:
0 -- disable
1 -- enable
any other value is invalid.

We may add value 2 for an auto mode later if necessary :)

>> +static_branch_enable(&vm_numa_stat_key);
>> +pr_info("enable numa statistics\n");
>> +} else if (sysctl_vm_numa_stat == DISABLE_NUMA_STAT) {
>> +static_branch_disable(&vm_numa_stat_key);
>> +invalid_numa_statistics();
>> +pr_info("disable numa statistics, and clear numa counters\n");
>> +}
>> +
>> +out:
>> +mutex_unlock(&vm_numa_stat_lock);
>> +return ret;
>> +}
>> +#endif
>> +
>>  #ifdef CONFIG_VM_EVENT_COUNTERS
>>  DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
>>  EXPORT_PER_CPU_SYMBOL(vm_event_states);
>> -- 
>> 2.7.4
>>
> 
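
Since only the values 0 and 1 are meant to be accepted (leaving room to
later admit 2 for an auto mode), one common way to enforce the range is to
bound the sysctl entry and use proc_dointvec_minmax() in the handler.  A
sketch under those assumptions; the standalone table name is illustrative
(in the kernel this entry sits inside vm_table), and this is not the merged
patch:

/*
 * Sketch only: bound numa_stat to [0, 1] via extra1/extra2.  The bounds
 * are enforced by proc_dointvec_minmax(), so the handler has to call that
 * rather than proc_dointvec().  Widening extra2 later would admit an
 * "auto" value such as 2 without changing the interface.
 */
static int zero;
static int one = 1;

static struct ctl_table numa_stat_table[] = {
	{
		.procname	= "numa_stat",
		.data		= &sysctl_vm_numa_stat,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= sysctl_vm_numa_stat_handler,
		.extra1		= &zero,
		.extra2		= &one,
	},
	{ }
};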


[PATCH v4] mm, sysctl: make NUMA stats configurable

2017-10-16 Thread Kemi Wang
This is the second step, which introduces a tunable interface that allows
NUMA stats to be configured at runtime in order to optimize
zone_statistics(), as suggested by Dave Hansen and Ying Huang.

=
When page allocation performance becomes a bottleneck and you can tolerate
some possible tool breakage and decreased numa counter precision, you can
do:
echo 0 > /proc/sys/vm/numa_stat
In this case, numa counter update is ignored. We can see about
*4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
cycles per single page allocation and reclaim on Jesper's page_bench03 (88
threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
memory).

Benchmark link provided by Jesper D Brouer(increase loop times to
1000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
bench

=
When page allocation performance is not a bottleneck and you want all
tooling to work, you can do:
echo 1 > /proc/sys/vm/numa_stat
This is system default setting.

Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
for comments to help improve the original patch.

ChangeLog:
  V3->V4
  a) Get rid of the auto mode of numa stats; it may be added back if
  necessary, as aligned on before;
  b) Skip NUMA_INTERLEAVE_HIT counter update when numa stats is disabled,
  as reported by Andrey Ryabinin. See commit "de55c8b2519" for details
  c) Remove extern declaration for those clear_numa_ function, and make
  them static in vmstat.c, as suggested by Vlastimil Babka.

  V2->V3:
  a) Propose a better way to use jump label to eliminate the overhead of
  branch selection in zone_statistics(), as inspired by Ying Huang;
  b) Add a paragraph in commit log to describe the way for branch target
  selection;
  c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
  and change the description accordingly, as suggested by Michal Hocko;
  d) Make this functionality NUMA-specific via ifdef

  V1->V2:
  a) Merge to one patch;
  b) Use jump label to eliminate the overhead of branch selection;
  c) Add a single-time log message at boot time to help tell users what
  happened.

Reported-by: Jesper Dangaard Brouer 
Suggested-by: Dave Hansen 
Suggested-by: Ying Huang 
Signed-off-by: Kemi Wang 
---
 Documentation/sysctl/vm.txt | 16 +++
 include/linux/vmstat.h  | 10 +++
 kernel/sysctl.c |  7 +
 mm/mempolicy.c  |  3 ++
 mm/page_alloc.c |  6 
 mm/vmstat.c | 70 +
 6 files changed, 112 insertions(+)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 9baf66a..f65c5c7 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -58,6 +58,7 @@ Currently, these files are in /proc/sys/vm:
 - percpu_pagelist_fraction
 - stat_interval
 - stat_refresh
+- numa_stat
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
@@ -792,6 +793,21 @@ with no ill effects: errors and warnings on these stats 
are suppressed.)
 
 ==
 
+numa_stat
+
+This interface allows runtime configuration of numa statistics.
+
+When page allocation performance becomes a bottleneck and you can tolerate
+some possible tool breakage and decreased numa counter precision, you can
+do:
+   echo 0 > /proc/sys/vm/numa_stat
+
+When page allocation performance is not a bottleneck and you want all
+tooling to work, you can do:
+   echo 1 > /proc/sys/vm/numa_stat
+
+==
+
 swappiness
 
 This control is used to define how aggressive the kernel will swap
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ade7cb5..c605c94 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -6,9 +6,19 @@
 #include 
 #include 
 #include 
+#include 
 
 extern int sysctl_stat_interval;
 
+#ifdef CONFIG_NUMA
+#define ENABLE_NUMA_STAT   1
+#define DISABLE_NUMA_STAT   0
+extern int sysctl_vm_numa_stat;
+DECLARE_STATIC_KEY_TRUE(vm_numa_stat_key);
+extern int sysctl_vm_numa_stat_handler(struct ctl_table *table,
+   int write, void __user *buffer, size_t *length, loff_t *ppos);
+#endif
+
 #ifdef CONFIG_VM_EVENT_COUNTERS
 /*
  * Light weight per cpu counter implementation.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d9c31bc..f6a79a3 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1371,6 +1371,13 @@ static struct ctl_table vm_table[] = {
.mode   = 0644,
.proc_handler   = _mempolicy_sysctl_handler,
},
+   {
+   .procname   = "numa_stat",
+   .data

Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-09 Thread kemi


On 2017年10月10日 13:49, Michal Hocko wrote:
> On Mon 09-10-17 09:55:49, Michal Hocko wrote:
>> I haven't checked closely but what happens (or should happen) when you
>> do a partial read? Should you get an inconsistent results? Or is this
>> impossible?
> 
> Well, after thinking about it little bit more, partial reads are always
> inconsistent so this wouldn't add a new problem.
> 
> Anyway I still stand by my position that this sounds over-engineered and
> a simple 0/1 resp. on/off interface would be both simpler and safer. If
> anybody wants an auto mode it can be added later (as a value 2 resp.
> auto).
> 

It sounds good to me. If Andrew also prefers a simple 0/1 interface, I will
submit a V4 patch for it. Thanks.


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-09 Thread kemi


On 2017年10月03日 17:23, Michal Hocko wrote:
> On Thu 28-09-17 14:11:41, Kemi Wang wrote:
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
>>
>> =
>> When page allocation performance becomes a bottleneck and you can tolerate
>> some possible tool breakage and decreased numa counter precision, you can
>> do:
>>  echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is ignored. We can see about
>> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
>> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
>> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
>> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
>> memory).
>>
>> Benchmark link provided by Jesper D Brouer(increase loop times to
>> 1000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> =
>> When page allocation performance is not a bottleneck and you want all
>> tooling to work, you can do:
>>  echo [S|s]trict > /proc/sys/vm/numa_stats_mode
>>
>> =
>> We recommend automatic detection of numa statistics by system, this is also
>> system default configuration, you can do:
>>  echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is skipped unless it has been read by
>> users at least once, e.g. cat /proc/zoneinfo.
> 
> I am still not convinced the auto mode is worth all the additional code
> and a safe default to use. The whole thing could have been 0/1 with a
> simpler parsing and less code to catch readers.
> 

I understand your concern.
Well, we may get rid of the auto mode if it turns out to have some obvious
disadvantage. For now, I tend to keep it because most people may never touch
this interface, and the auto mode is helpful in that case.

> E.g. why do we have to do static_branch_enable on any read or even
> vmstat_stop? Wouldn't open be sufficient?
> 

NUMA stats are exposed in four files:
/proc/zoneinfo
/proc/vmstat
/sys/devices/system/node/node*/numastat
/sys/devices/system/node/node*/vmstat
In auto mode, each *read* triggers the update of the NUMA counters.
So, we have to make sure the branch target is switched to the one that
updates the NUMA counters once any of these files is read from user space.
The intention of the static_branch_enable() in vmstat_stop() (in the call
path of file->f_op->read) is to cover reads of /proc/vmstat.

I guess *open* means file->f_op->open here, right?
Do you suggest moving static_branch_enable() to file->f_op->open? Thanks.

>> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, 
>> NULL);
>>  static ssize_t node_read_numastat(struct device *dev,
>>  struct device_attribute *attr, char *buf)
>>  {
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(_numa_stats_mode_key);
>>  return sprintf(buf,
>> "numa_hit %lu\n"
>> "numa_miss %lu\n"
>> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  n += sprintf(buf+n, "%s %lu\n",
>>   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>   sum_zone_numa_state(nid, i));
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(&vm_numa_stats_mode_key);
>>  #endif
>>  
>>  for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> [...]
>> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void 
>> *arg)
>>  {
>>  pg_data_t *pgdat = (pg_data_t *)arg;
>>  walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print);
>> +#ifdef CONFIG_NUMA
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(&vm_numa_stats_mode_key);
>> +#endif
>>  return 0;
>>  }
>>  
>> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg)
>>  
>>  static void vmstat_stop(struct seq_file *m, void *arg)
>>  {
>> +#ifdef CONFIG_NUMA
>> +if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +static_branch_enable(&vm_numa_stats_mode_key);
>> +#endif
>>  kfree(m->private);
>>  m->private = NULL;
>>  }
>> -- 
>> 2.7.4
>>
> 


Re: [PATCH v3] mm, sysctl: make NUMA stats configurable

2017-10-08 Thread kemi


On 2017年09月29日 15:03, Vlastimil Babka wrote:
> On 09/28/2017 08:11 AM, Kemi Wang wrote:
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
>>
>> =
>> When page allocation performance becomes a bottleneck and you can tolerate
>> some possible tool breakage and decreased numa counter precision, you can
>> do:
>>  echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is ignored. We can see about
>> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
>> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
>> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
>> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
>> memory).
>>
>> Benchmark link provided by Jesper D Brouer(increase loop times to
>> 1000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> =
>> When page allocation performance is not a bottleneck and you want all
>> tooling to work, you can do:
>>  echo [S|s]trict > /proc/sys/vm/numa_stats_mode
>>
>> =
>> We recommend automatic detection of numa statistics by system, this is also
>> system default configuration, you can do:
>>  echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is skipped unless it has been read by
>> users at least once, e.g. cat /proc/zoneinfo.
>>
>> Branch target selection with jump label:
>> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
>> counters update.
>> b) When numa_stats_mode is changed to *coarse*, return back directly.
>> c) When numa_stats_mode is changed to *auto*, the branch target used in
>> last time is kept, and the branch target is changed to the branch for numa
>> counters update once numa counters are *read* by users.
>>
>> Therefore, with the help of jump label, the page allocation performance is
>> hardly affected when numa counters are updated with a call in
>> zone_statistics(). Meanwhile, the auto mode can give people benefit without
>> manual tuning.
>>
>> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
>> help improve the original patch.
>>
>> ChangeLog:
>>   V2->V3:
>>   a) Propose a better way to use jump label to eliminate the overhead of
>>   branch selection in zone_statistics(), as inspired by Ying Huang;
>>   b) Add a paragraph in commit log to describe the way for branch target
>>   selection;
>>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>>   and change the description accordingly, as suggested by Michal Hocko;
>>   d) Make this functionality NUMA-specific via ifdef
>>
>>   V1->V2:
>>   a) Merge to one patch;
>>   b) Use jump label to eliminate the overhead of branch selection;
>>   c) Add a single-time log message at boot time to help tell users what
>>   happened.
>>
>> Reported-by: Jesper Dangaard Brouer <bro...@redhat.com>
>> Suggested-by: Dave Hansen <dave.han...@intel.com>
>> Suggested-by: Ying Huang <ying.hu...@intel.com>
>> Signed-off-by: Kemi Wang <kemi.w...@intel.com>
>> ---
>>  Documentation/sysctl/vm.txt |  24 +
>>  drivers/base/node.c |   4 ++
>>  include/linux/vmstat.h  |  23 
>>  init/main.c |   3 ++
>>  kernel/sysctl.c |   7 +++
>>  mm/page_alloc.c |  10 
>>  mm/vmstat.c | 129 
>> 
>>  7 files changed, 200 insertions(+)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 9baf66a..e310e69 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>>  - swappiness
>>  - user_reserve_kbytes
>>  - vfs_cache_pressure
>> +- numa_stats_mode
>>  - watermark_scale_factor
>>  - zone_reclaim_mode
>>  
>> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>>  
>>  =
>>  
>
