Re: [developer] Is L2ARC read process single-threaded?

2018-08-30 Thread Wojciech Kruzel via openzfs-developer


   On Thursday, 30 August 2018, 18:38:48 GMT+1, Richard Elling wrote:

flamegraphs sample stacks executing on CPUs. They are useless for the analysis 
you're looking for.
ZFS knows nothing about NVMe, SATA, SCSI, or any other low-level block 
protocol. Nor does it care.
To get to your answer, look at the block interface boundary.
 -- richard

Are you able to tell me some more about this block interface boundary? What 
exactly am I looking for?

Thanks,
Wojciech
 



--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tf62628db027682f7-Ma73ee607efc91c90119f6d74
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Is L2ARC read process single-threaded?

2018-08-30 Thread Richard Elling


> On Aug 30, 2018, at 7:15 AM, w.kruzel via openzfs-developer wrote:
> 
> The flamegraphs are here:
> https://drive.google.com/open?id=1vM-5wy4s-QhV2D3hBVh5bPgaqPqEKsMa 
> 
> 
> There are 11 of them.
> Files out.svg to out4.svg are dtrace flamegraphs of reading when L2ARC has 
> been in use.
> out5.svg to out10.svg are dtrace flamegraphs taken while running the 
> nvmecontrol command in read mode, using either a single thread or multiple 
> threads.
> 
> So, what I noticed is that only when I used nvmecontrol with multiple 
> threads, e.g.:
> # nvmecontrol perftest -n 4 -o read -s 65536 -t 10 nvme0ns1
> can I find the function "kernel`nvme_qpair_process_completions" - just 
> search for nvme in the graph (it's hard to spot in some of them).
> kernel`nvme_qpair_process_completions
> kernel`intr_event_execute_handlers
> kernel`nvme_qpair_submit_request
> kernel`nvme_qpair_complete_tracker
> kernel`nvme_ctrlr_submit_io_request
> 
> Is this the queueing system for access to the NVMe disk? See out10, out6, 
> and out7 for the nvme functions.
> All I know is that when arc_read runs, it does not appear to call into these 
> nvme functions.

flamegraphs sample stacks executing on CPUs. They are useless for the analysis 
you're looking for.
ZFS knows nothing about NVMe, SATA, SCSI, or any other low-level block 
protocol. Nor does it care.
To get to your answer, look at the block interface boundary.
 -- richard
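
One way to look at that boundary is with DTrace's io provider, which fires for 
every request handed to the block layer. What follows is a minimal sketch, 
assuming the field names of the stock io.d translator; it is an illustration, 
not something tested on this particular setup:

#!/usr/sbin/dtrace -s
/*
 * Count block-layer I/Os per device name and issuing thread while the
 * read workload runs.  If the L2ARC cache device only ever shows up
 * under a single tid, its read path is effectively single-threaded at
 * this boundary.
 */
io:::start
{
	@ios[args[1]->dev_name, tid] = count();
}

Note that on asynchronous issue paths the tid recorded here may belong to a 
taskqueue or interrupt thread rather than the original caller.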

> 
> I haven't tried using the nvme as a regular disk and then looking at the 
> output yet, as we are currently using the file server quite extensively.
> 
> I have also tried iostat -x, which is very useful, except that it reports on 
> nvd rather than nvme (a small detail), so it does not show the nvmecontrol 
> reads.
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tf62628db027682f7-M19d34ae166b5cee602311d8b
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Potential bug recently introduced in arc_adjust() that leads to unintended pressure on MFU eventually leading to dramatic reduction in its size

2018-08-30 Thread Richard Elling
Hi Mark,

Yes, this is the change I've tested on ZoL. It is a trivial, low-risk change 
that is needed to restore the previous behaviour.
 -- richard

> On Aug 30, 2018, at 7:40 AM, Mark Johnston  wrote:
> 
> On Thu, Aug 30, 2018 at 09:55:27AM +0300, Paul wrote:
>> 30 August 2018, 00:22:14, by "Mark Johnston" :
>> 
>>> On Wed, Aug 29, 2018 at 12:42:33PM +0300, Paul wrote:
>>>> Hello team,
>>>> 
>>>> 
>>>> It seems like a commit on Mar 23 introduced a bug: if, during execution of 
>>>> arc_adjust(), the target is reached after MRU is evicted, the current code 
>>>> continues evicting MFU. Before said commit, on the step prior to MFU 
>>>> eviction, the target value was recalculated as:
>>>> 
>>>>   target = arc_size - arc_c;
>>>> 
>>>> arc_size here is a global variable that was updated during MRU eviction, 
>>>> so this expression yielded a zero or negative target if MRU eviction was 
>>>> enough to reach the original goal.
>>>> 
>>>> The modern version uses a cached value of arc_size, called asize:
>>>> 
>>>>   target = asize - arc_c;
>>>> 
>>>> Because asize stays constant during the whole body of arc_adjust(), the 
>>>> above expression always evaluates to a value > 0, causing MFU to be 
>>>> evicted every time, even if MRU eviction has already reached the goal. 
>>>> Because of the difference in nature between MFU and MRU, this eventually 
>>>> reduces the amount of MFU in the ARC dramatically.
>>> 
>>> Hi Paul,
>>> 
>>> Your analysis does seem right to me.  I cc'ed the openzfs mailing list
>>> so that an actual ZFS expert can chime in; it looks like this behaviour
>>> is consistent between FreeBSD, illumos and ZoL.
>>> 
>>> Have you already tried the obvious "fix" of subtracting total_evicted
>>> from the MFU target?
>> 
>> We are going to apply the asize patch (plus the ameta, as suggested by 
>> Richard) and reboot one of our production servers tonight or the following 
>> night.
> 
> Just to be explicit, are you testing something equivalent to the patch
> at the end of this email?
> 
>> Then we have to wait a few days and observe the ARC behaviour.
> 
> Thanks!  Please let us know how it goes: we're preparing to release
> FreeBSD 12.0 shortly and I'd like to get this fixed in head/ as soon as
> possible.
> 
> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
> index 1387925c4607..882c04dba50a 100644
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
> @@ -4446,6 +4446,12 @@ arc_adjust(void)
>  		arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
>  	}
>  
> +	/*
> +	 * Re-sum ARC stats after the first round of evictions.
> +	 */
> +	asize = aggsum_value(&arc_size);
> +	ameta = aggsum_value(&arc_meta_used);
> +
>  	/*
>  	 * Adjust MFU size
>  	 *

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T10a105c53bcce15c-Mb937b1ff0ccbad450028c211
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


[developer] Re: Potential bug recently introduced in arc_adjust() that leads to unintended pressure on MFU eventually leading to dramatic reduction in its size

2018-08-30 Thread Mark Johnston
On Thu, Aug 30, 2018 at 09:55:27AM +0300, Paul wrote:
> 30 August 2018, 00:22:14, by "Mark Johnston" :
> 
> > On Wed, Aug 29, 2018 at 12:42:33PM +0300, Paul wrote:
> > > Hello team,
> > > 
> > > 
> > > It seems like a commit on Mar 23 introduced a bug: if, during execution 
> > > of arc_adjust(), the target is reached after MRU is evicted, the current 
> > > code continues evicting MFU. Before said commit, on the step prior to MFU 
> > > eviction, the target value was recalculated as:
> > > 
> > >   target = arc_size - arc_c;
> > > 
> > > arc_size here is a global variable that was updated during MRU eviction, 
> > > so this expression yielded a zero or negative target if MRU eviction was 
> > > enough to reach the original goal.
> > > 
> > > The modern version uses a cached value of arc_size, called asize:
> > > 
> > >   target = asize - arc_c;
> > > 
> > > Because asize stays constant during the whole body of arc_adjust(), the 
> > > above expression always evaluates to a value > 0, causing MFU to be 
> > > evicted every time, even if MRU eviction has already reached the goal. 
> > > Because of the difference in nature between MFU and MRU, this eventually 
> > > reduces the amount of MFU in the ARC dramatically.
> > 
> > Hi Paul,
> > 
> > Your analysis does seem right to me.  I cc'ed the openzfs mailing list
> > so that an actual ZFS expert can chime in; it looks like this behaviour
> > is consistent between FreeBSD, illumos and ZoL.
> > 
> > Have you already tried the obvious "fix" of subtracting total_evicted
> > from the MFU target?
> 
> We are going to apply the asize patch (plus the ameta, as suggested by 
> Richard) and reboot one of our production servers tonight or the following 
> night.

Just to be explicit, are you testing something equivalent to the patch
at the end of this email?

> Then we have to wait a few days and observe the ARC behaviour.

Thanks!  Please let us know how it goes: we're preparing to release
FreeBSD 12.0 shortly and I'd like to get this fixed in head/ as soon as
possible.

diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
index 1387925c4607..882c04dba50a 100644
--- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
+++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
@@ -4446,6 +4446,12 @@ arc_adjust(void)
 		arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
 	}
 
+	/*
+	 * Re-sum ARC stats after the first round of evictions.
+	 */
+	asize = aggsum_value(&arc_size);
+	ameta = aggsum_value(&arc_meta_used);
+
 	/*
 	 * Adjust MFU size
 	 *
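
To verify the change, a small DTrace sketch could log the eviction targets 
arc_adjust() passes down. This assumes arc_adjust_impl() is visible to fbt 
(i.e. not inlined), and takes its argument order from the call site in the 
diff above, arc_adjust_impl(state, spa, target, type):

#!/usr/sbin/dtrace -s
/*
 * Log each eviction pass.  With the stale-asize bug, the MFU passes
 * keep seeing a positive target even when the first round of MRU
 * evictions has already brought the ARC down to arc_c.
 */
fbt::arc_adjust_impl:entry
{
	printf("state=%p target=%lld\n", (void *)arg0, (long long)arg2);
}

With the patch applied, the targets logged for the later passes should fall to 
zero (or the passes be skipped) once MRU eviction alone reaches the goal.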

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T10a105c53bcce15c-M1c45cd09114d2ce2e8c9dd26
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


[developer] Re[2]: Potential bug recently introduced in arc_adjust() that leads to unintended pressure on MFU eventually leading to dramatic reduction in its size

2018-08-30 Thread Paul
30 August 2018, 00:22:14, by "Mark Johnston" :

> On Wed, Aug 29, 2018 at 12:42:33PM +0300, Paul wrote:
> > Hello team,
> > 
> > 
> > It seems like a commit on Mar 23 introduced a bug: if, during execution of 
> > arc_adjust(), the target is reached after MRU is evicted, the current code 
> > continues evicting MFU. Before said commit, on the step prior to MFU 
> > eviction, the target value was recalculated as:
> > 
> >   target = arc_size - arc_c;
> > 
> > arc_size here is a global variable that was updated during MRU eviction, 
> > so this expression yielded a zero or negative target if MRU eviction was 
> > enough to reach the original goal.
> > 
> > The modern version uses a cached value of arc_size, called asize:
> > 
> >   target = asize - arc_c;
> > 
> > Because asize stays constant during the whole body of arc_adjust(), the 
> > above expression always evaluates to a value > 0, causing MFU to be evicted 
> > every time, even if MRU eviction has already reached the goal. Because of 
> > the difference in nature between MFU and MRU, this eventually reduces the 
> > amount of MFU in the ARC dramatically.
> 
> Hi Paul,
> 
> Your analysis does seem right to me.  I cc'ed the openzfs mailing list
> so that an actual ZFS expert can chime in; it looks like this behaviour
> is consistent between FreeBSD, illumos and ZoL.
> 
> Have you already tried the obvious "fix" of subtracting total_evicted
> from the MFU target?

We are going to apply the asize patch (plus the ameta, as suggested by Richard) 
and reboot one of our production servers tonight or the following night. Then 
we have to wait a few days and observe the ARC behaviour.
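
One way to watch that behaviour over time is a small DTrace script that samples 
the kernel's arc_stats counters. This is only a sketch: the arc_stats symbol 
and its field names are taken from arc.c, and it assumes dtrace can find the 
symbol and its CTF type information:

#!/usr/sbin/dtrace -s
/* Print the ARC MRU/MFU sizes, in bytes, once per second. */
tick-1s
{
	printf("mru=%llu mfu=%llu\n",
	    (unsigned long long)`arc_stats.arcstat_mru_size.value.ui64,
	    (unsigned long long)`arc_stats.arcstat_mfu_size.value.ui64);
}

(top(1) shows the same split in its ARC line, as quoted below.)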

> > Servers that run a version of FreeBSD from before that commit show this 
> > picture of the ARC:
> > 
> >    ARC: 369G Total, 245G MFU, 97G MRU, 36M Anon, 3599M Header, 24G Other
> > 
> > As you can see, MFU dominates. This is the nature of our workload: we have 
> > a comparatively small dataset that we use constantly and repeatedly, and a 
> > large dataset that we use rarely.
> > 
> > But on the modern version of FreeBSD the picture is dramatically different: 
> > 
> >    ARC: 360G Total, 50G MFU, 272G MRU, 211M Anon, 7108M Header, 30G Other
> > 
> > This leads to a much heavier burden on the disk subsystem.
> > 
> > 
> > The commit that introduced the bug: 
> > https://github.com/freebsd/freebsd/commit/555f9563c9dc217341d4bb5129f5d233cf1f92b8

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T10a105c53bcce15c-Mf41e3b645cce716aa8863cc6
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Re: Is L2ARC read process single-threaded?

2018-08-30 Thread w.kruzel via openzfs-developer
The flamegraphs are here:
https://drive.google.com/open?id=1vM-5wy4s-QhV2D3hBVh5bPgaqPqEKsMa

There are 11 of them.
Files out.svg to out4.svg are dtrace flamegraphs of reading when L2ARC has been 
in use.
out5.svg to out10.svg are dtrace flamegraphs taken while running the 
nvmecontrol command in read mode, using either a single thread or multiple 
threads.

So, what I noticed is that only when I used nvmecontrol with multiple threads, 
e.g.:
# nvmecontrol perftest -n 4 -o read -s 65536 -t 10 nvme0ns1
can I find the function "kernel`nvme_qpair_process_completions" - just search 
for nvme in the graph (it's hard to spot in some of them).
kernel`nvme_qpair_process_completions
kernel`intr_event_execute_handlers
kernel`nvme_qpair_submit_request
kernel`nvme_qpair_complete_tracker
kernel`nvme_ctrlr_submit_io_request

Is this the queueing system for access to the NVMe disk? See out10, out6, and 
out7 for the nvme functions.
All I know is that when arc_read runs, it does not appear to call into these 
nvme functions.
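
One way to check that directly, rather than inferring it from flamegraphs, is 
to aggregate the kernel stacks that reach the NVMe submit path. A sketch using 
fbt on the function named above (untested):

#!/usr/sbin/dtrace -s
/*
 * Aggregate the kernel stacks leading into NVMe request submission.
 * If L2ARC reads reach the device, arc_read/l2arc frames should show
 * up in these stacks; if only nvmecontrol's path appears, the reads
 * are being satisfied elsewhere.
 */
fbt::nvme_qpair_submit_request:entry
{
	@[stack()] = count();
}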

I haven't tried using the nvme as a regular disk and then looking at the output 
yet, as we are currently using the file server quite extensively.

I have also tried iostat -x, which is very useful, except that it reports on 
nvd rather than nvme (a small detail), so it does not show the nvmecontrol 
reads.

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tf62628db027682f7-Mb0fb155dbf4c71e1b0d4531f
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription