Hi,

On 28/09/18 13:50, Mark Syms wrote:
Hi Bob,

The patches look quite good and would seem to help in the intra-node congestion 
case, which is what our first patch was trying to address. We haven't tried them 
yet, but I'll pull a build together and try to run it over the weekend.

We don't, however, see that they would help in the situation we saw for the 
second patch, where rgrp glocks would get bounced around between hosts at high 
speed and cause lots of state flushing in the process. The stats don't take 
account of anything other than network latency, whereas there is more involved 
with an rgrp glock when state needs to be flushed.

Any thoughts on this?

Thanks,

        Mark.
There are a few points here... the stats measure the latency of the DLM requests. Since some work has to be done in order to release a lock, and the lock is not released until that work is complete, the stats do include that work in their timings.
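To make the timing point concrete, here is a minimal user-space sketch of how a
smoothed round-trip estimate naturally folds in any work done before the lock is
released; the struct, function name and gain constants are illustrative
assumptions, not the actual glock stats code:

#include <inttypes.h>
#include <stdio.h>

struct lk_stats {
	int64_t srtt;     /* smoothed round-trip time (ns) */
	int64_t srttvar;  /* smoothed |deviation| (ns) */
};

/* Fold one completed request's round-trip time into the running averages. */
static void update_stats(struct lk_stats *s, int64_t rtt_ns)
{
	int64_t err = rtt_ns - s->srtt;
	int64_t abserr = err < 0 ? -err : err;

	s->srtt += err / 8;                      /* EWMA with gain 1/8 */
	s->srttvar += (abserr - s->srttvar) / 4; /* EWMA of the deviation */
}

int main(void)
{
	struct lk_stats s = { .srtt = 1000000, .srttvar = 0 };

	/* A demote that had to flush dirty state shows up as a long RTT,
	 * so the cost of the flush is reflected in the smoothed value. */
	update_stats(&s, 9000000);
	printf("srtt=%" PRId64 " ns, srttvar=%" PRId64 " ns\n", s.srtt, s.srttvar);
	return 0;
}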

There are several parts to the complete picture here:

1. Resource group selection for allocation (which is what the current stats-based solution tries to do). Note this will not help deallocation, as then there is no choice in which resource group we use! So the following two items should address deallocation too; a rough sketch of the selection idea follows below.
2. Parallelism of resource group usage within a single node (currently missing, but we hope to add this feature shortly).
3. Reduction in latency when glocks need to be demoted for use on another node (something we plan to address in due course).
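As a rough illustration of item 1, a stats-based selector could look something
like the sketch below; the names, the 2x threshold and the fall-back behaviour
are assumptions rather than the in-tree implementation:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct rgrp_cand {
	int64_t srtt;         /* smoothed DLM round-trip time for this rgrp's glock */
	int64_t srtt_global;  /* smoothed RTT averaged over all rgrp glocks */
};

/* Treat an rgrp as "congested" when its latency is well above the mean;
 * the 2x threshold is purely illustrative. */
static bool rgrp_congested(const struct rgrp_cand *r)
{
	return r->srtt > 2 * r->srtt_global;
}

/* Pick the first candidate that doesn't look congested; if they all do,
 * fall back to the first one rather than refusing to allocate. */
static int pick_rgrp(const struct rgrp_cand *cands, int n)
{
	for (int i = 0; i < n; i++)
		if (!rgrp_congested(&cands[i]))
			return i;
	return 0;
}

int main(void)
{
	struct rgrp_cand cands[] = {
		{ .srtt = 9000000, .srtt_global = 1000000 },  /* busy on another node */
		{ .srtt =  800000, .srtt_global = 1000000 },  /* quiet */
	};

	printf("chose rgrp %d\n", pick_rgrp(cands, 2));
	return 0;
}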

All these things are a part of the overall picture, and we need to be careful not to try and optimise one at the expense of others. It is actually quite easy to get a big improvement in one particular workload, but if we are not careful, it may well be at the expense of another that we've not taken into account. There will always be a trade off between locality and parallelism of course, but we do have to be fairly cautious here too.

We are of course very happy to encourage work in this area, since it should help us gain a greater insight into the various dependencies between these parts, and result in a better overall solution. I hope that helps to give a rough idea of our current thoughts and where we hope to get to in due course,

Steve.

-----Original Message-----
From: Mark Syms
Sent: 28 September 2018 13:37
To: 'Bob Peterson' <rpete...@redhat.com>
Cc: cluster-devel@redhat.com; Tim Smith <tim.sm...@citrix.com>; Ross Lagerwall 
<ross.lagerw...@citrix.com>
Subject: RE: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

Hi Bob,

No, we haven't, but it wouldn't be hard for us to replace our patches in our 
internal patch queue with these and try them. We'll let you know what we find.

We have also seen, what we think is an unrelated issue where we get the 
following backtrace in kern.log and our system stalls

Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
Sep 21 21:19:09 cl15-05 kernel: [21389.462749]       Tainted: G           O    4.4.0+10 #1
Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python          D ffff88019628bc90     0 15480      1 0x00000000
Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  ffff88019628bc90 ffff880198f11c00 ffff88005a509c00 ffff88019628c000
Sep 21 21:19:09 cl15-05 kernel: [21389.462795]  ffffc90040226000 ffff88019628bd80 fffffffffffffe58 ffff8801818da418
Sep 21 21:19:09 cl15-05 kernel: [21389.462799]  ffff88019628bca8 ffffffff815a1cd4 ffff8801818da5c0 ffff88019628bd68
Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [<ffffffff815a1cd4>] schedule+0x64/0x80
Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [<ffffffffa0663624>] find_insert_glock+0x4a4/0x530 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [<ffffffffa0660c20>] ? gfs2_holder_wake+0x20/0x20 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [<ffffffffa06639ed>] gfs2_glock_get+0x3d/0x330 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [<ffffffffa066cff2>] do_flock+0xf2/0x210 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [<ffffffffa0671ad0>] ? gfs2_getattr+0xe0/0xf0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [<ffffffff811ba2fb>] ? cp_new_stat+0x10b/0x120
Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [<ffffffffa066d188>] gfs2_flock+0x78/0xa0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [<ffffffff812021e9>] SyS_flock+0x129/0x170
Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [<ffffffff815a57ee>] entry_SYSCALL_64_fastpath+0x12/0x71

We think there is a possibility, given that this code path only gets entered if 
a glock is being destroyed, that there is a time-of-check/time-of-use issue 
here: by the time schedule gets called, the thing we expect to wake us up has 
already finished dying and therefore won't trigger a wakeup for us. We have 
only seen this a couple of times, in fairly intensive VM stress tests where a 
lot of flocks get used on a small number of lock files (we use them to ensure 
consistent behaviour of disk activation/deactivation and also access to the 
database with the system state), but it's concerning nonetheless. We're looking 
at replacing the call to schedule with schedule_timeout, with a timeout of 
maybe HZ, to ensure that we will always get out of the schedule operation and 
retry. Is this something you think you may have seen, or have any ideas on?
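
For reference, a minimal sketch of what we have in mind is below; the wait
queue, the gone() predicate and the gl pointer are placeholders, not the real
find_insert_glock() internals:

#include <linux/jiffies.h>
#include <linux/sched.h>
#include <linux/types.h>
#include <linux/wait.h>

static void wait_for_dying_glock(wait_queue_head_t *wq,
				 bool (*gone)(void *data), void *gl)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
		if (gone(gl))
			break;
		/* schedule() here would rely entirely on the dying glock's
		 * wakeup; schedule_timeout(HZ) guarantees we wake up and
		 * re-check even if that wakeup has already been and gone. */
		schedule_timeout(HZ);
	}
	finish_wait(wq, &wait);
}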

Thanks,

        Mark.

-----Original Message-----
From: Bob Peterson <rpete...@redhat.com>
Sent: 28 September 2018 13:24
To: Mark Syms <mark.s...@citrix.com>
Cc: cluster-devel@redhat.com; Ross Lagerwall <ross.lagerw...@citrix.com>; Tim Smith 
<tim.sm...@citrix.com>
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

----- Original Message -----
Thanks for that, Bob. We've been watching the changes going in upstream with
interest, but at the moment we're not really in a position to take advantage
of them.

Due to hardware vendor support certification requirements, XenServer can only
very occasionally make big kernel bumps that would affect the ABI the drivers
see, as that would require our hardware partners to recertify.
So we're currently on a 4.4.52 base, but the gfs2 driver is somewhat newer, as
it is essentially self-contained and we can therefore backport changes more
easily. We currently have most of the GFS2 and DLM changes that are in 4.15
backported into the XenServer 7.6 kernel, but we can't take the ones related
to iomap as they are more invasive, and it looks like a number of the more
recent performance-focused changes are also predicated on the iomap framework.

As I mentioned in the covering letter, the intra-host problem would largely be
a non-issue if EX glocks were actually a host-wide thing, with local mutexes
used to share them within the host. I don't know whether this is what your
patch set is trying to achieve or not. It's not so much that the selection of
resource group is "random", just that there is a random chance that we won't
select the first RG that we test; it probably does work out much the same
though.
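
To illustrate what we mean by a host-wide EX glock shared via a local mutex,
here is a small user-space model; the names and the DLM stub are assumptions,
not a real GFS2 or DLM API:

#include <pthread.h>
#include <stdbool.h>

struct hostwide_ex {
	pthread_mutex_t local;  /* serialises users on this host */
	bool cluster_held;      /* has this node already been granted EX? */
};

/* Stub for the real cluster-wide acquisition (a DLM request). */
static void dlm_acquire_ex(struct hostwide_ex *l)
{
	l->cluster_held = true;
}

/* Enter the critical section; only the first local user pays the DLM round
 * trip, later users just take the mutex.  A remote demote request would
 * clear cluster_held (under the mutex) and flush state. */
static void hostwide_lock(struct hostwide_ex *l)
{
	pthread_mutex_lock(&l->local);
	if (!l->cluster_held)
		dlm_acquire_ex(l);
}

/* Leave the critical section but keep the cluster grant cached. */
static void hostwide_unlock(struct hostwide_ex *l)
{
	pthread_mutex_unlock(&l->local);
}

int main(void)
{
	struct hostwide_ex ex = { PTHREAD_MUTEX_INITIALIZER, false };

	hostwide_lock(&ex);    /* first user: DLM round trip */
	hostwide_unlock(&ex);
	hostwide_lock(&ex);    /* second user: mutex only, grant is cached */
	hostwide_unlock(&ex);
	return 0;
}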

The inter-host problem addressed by the second patch seems less amenable to
avoidance, as the hosts don't seem to have a synchronous view of the state of
the resource group locks (for understandable reasons, as I'd expect this to be
very expensive to keep synced). So it seemed reasonable to try to make it
"expensive" to request a resource that someone else is using, and also to
avoid immediately grabbing it back if we've been asked to relinquish it. It
does seem to give a fairer balance to the usage without being massively
invasive.
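
The back-off part of that could be modelled roughly as follows; the field
names and the 100ms holdoff are illustrative assumptions, not the actual
patch:

#include <stdbool.h>
#include <stdint.h>

#define HOLDOFF_NS (100ULL * 1000 * 1000)  /* 100ms, purely illustrative */

struct rgrp_state {
	uint64_t demoted_at_ns;  /* when we last gave this rgrp up */
};

/* Called when another node asks us to demote this rgrp's glock. */
void rgrp_note_demote(struct rgrp_state *rs, uint64_t now_ns)
{
	rs->demoted_at_ns = now_ns;
}

/* Allocation path: prefer other rgrps if we relinquished this one very
 * recently, instead of bouncing the lock straight back. */
bool rgrp_recently_relinquished(const struct rgrp_state *rs, uint64_t now_ns)
{
	return now_ns - rs->demoted_at_ns < HOLDOFF_NS;
}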

We thought we should share these with the community anyway, even if they only
serve as inspiration for more detailed changes, and also to describe the
scenarios where we're seeing issues now that we have completed implementing
the XenServer support for GFS2 that we discussed back in Nuremberg last year.
In our testing they certainly make things better. They probably aren't fully
optimal, as we can't maintain 10G wire speed consistently across the full LUN,
but we're getting about 75%, which is certainly better than what we were
seeing before we started looking at this.

Thanks,

        Mark.
Hi Mark,

I'm really curious whether you guys have tried the two patches I posted here on
17 January 2018 in place of the two patches you posted. We see much better 
throughput with those than with stock.

I know Steve wants a different solution, and in the long run it will be a 
better one, but I've been trying to convince him we should use them as a 
stop-gap measure to mitigate this problem until we get a more proper solution 
in place (which is obviously taking some time, due to unforeseen circumstances).

Regards,

Bob Peterson



