Thanks for that, Bob. We've been watching the changes going in upstream with
interest, but at the moment we're not really in a position to take advantage
of them.

Due to hardware vendor support certification requirements, XenServer can only
very occasionally make big kernel bumps that would affect the ABI the drivers
see, as that would require our hardware partners to recertify. So we're
currently on a 4.4.52 base, but the gfs2 driver is somewhat newer: it is
essentially self-contained, so we can backport changes more easily. We
currently have most of the GFS2 and DLM changes that are in 4.15 backported
into the XenServer 7.6 kernel, but we can't take the ones related to iomap as
they are more invasive, and it looks like a number of the more recent
performance-targeting changes are also predicated on the iomap framework.

As I mentioned in the covering letter, the intra-host problem would largely be
a non-issue if EX glocks were actually a host-wide thing, with local mutexes
used to share them within the host. I don't know whether that is what your
patch set is trying to achieve. It's not so much that the selection of
resource group is "random", just that there is a random chance that we won't
select the first RG that we test; it probably works out much the same though.
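To make that concrete, here's a rough user-space model of the idea (the names,
the skip probability and the structure are purely illustrative, not the actual
patch): each resource group that passes the fitness test may still be skipped
with some probability, so concurrent allocators tend to spread out rather than
all piling onto the first suitable RG.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NR_RGRPS    8
#define SKIP_CHANCE 50   /* percent chance of skipping a usable RG */

struct rgrp {
        int id;
        unsigned long free_blocks;
};

/* Stands in for the real "does this RG fit the request?" test. */
static bool rgrp_fits(const struct rgrp *rg, unsigned long needed)
{
        return rg->free_blocks >= needed;
}

/*
 * Walk the RG list. The first usable RG is remembered as a fallback, but
 * each usable RG is skipped with SKIP_CHANCE probability so that concurrent
 * allocators don't all converge on the same one.
 */
static struct rgrp *pick_rgrp(struct rgrp *rgs, int n, unsigned long needed)
{
        struct rgrp *fallback = NULL;

        for (int i = 0; i < n; i++) {
                if (!rgrp_fits(&rgs[i], needed))
                        continue;
                if (!fallback)
                        fallback = &rgs[i];
                if (rand() % 100 >= SKIP_CHANCE)
                        return &rgs[i];
                /* otherwise skip this RG and keep looking */
        }
        return fallback;   /* nothing better: use the first usable RG */
}

int main(void)
{
        struct rgrp rgs[NR_RGRPS];

        srand(time(NULL));
        for (int i = 0; i < NR_RGRPS; i++) {
                rgs[i].id = i;
                rgs[i].free_blocks = 1000;
        }

        for (int i = 0; i < 5; i++) {
                struct rgrp *rg = pick_rgrp(rgs, NR_RGRPS, 100);
                printf("allocator %d -> rgrp %d\n", i, rg ? rg->id : -1);
        }
        return 0;
}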

The inter-host problem addressed by the second patch seems less amenable to
avoidance, as the hosts don't appear to have a synchronous view of the state
of the resource group locks (for understandable reasons, as I'd expect this to
be very expensive to keep in sync). So it seemed reasonable to make it
"expensive" to request a resource group that someone else is using, and also
to avoid immediately grabbing one back if we've just been asked to relinquish
it. This does seem to give a fairer balance to the usage without being
massively invasive.
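Again as a rough user-space model (the cost values and the hold-off window are
made up for illustration and are not what the patch uses): each resource group
remembers when we last gave it up on a remote request, and the selector treats
remotely contended or recently relinquished RGs as expensive rather than
grabbing them straight back.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define HOLDOFF_SECS 2   /* illustrative hold-off window */

struct rgrp {
        int id;
        time_t relinquished_at;   /* when we last dropped it on a remote demote */
        bool remote_contended;    /* another host currently appears to hold it */
};

/* Called from the (modelled) demote path when another host asks for the RG. */
static void rgrp_note_relinquish(struct rgrp *rg)
{
        rg->relinquished_at = time(NULL);
}

/* Higher cost == less attractive to the allocator. */
static int rgrp_cost(const struct rgrp *rg)
{
        int cost = 0;

        if (rg->remote_contended)
                cost += 10;   /* "expensive" to fight another host for it */
        if (rg->relinquished_at &&
            time(NULL) - rg->relinquished_at < HOLDOFF_SECS)
                cost += 10;   /* don't grab back what we just gave up */
        return cost;
}

/* Pick the cheapest RG; ties go to the first one found. */
static struct rgrp *pick_rgrp(struct rgrp *rgs, int n)
{
        struct rgrp *best = NULL;

        for (int i = 0; i < n; i++)
                if (!best || rgrp_cost(&rgs[i]) < rgrp_cost(best))
                        best = &rgs[i];
        return best;
}

int main(void)
{
        struct rgrp rgs[3] = {
                { .id = 0, .remote_contended = true },
                { .id = 1 },
                { .id = 2 },
        };

        rgrp_note_relinquish(&rgs[1]);   /* we just gave rgrp 1 away */
        printf("picked rgrp %d\n", pick_rgrp(rgs, 3)->id);   /* expect rgrp 2 */
        return 0;
}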

We thought we should share these with the community anyway, even if they only
serve as inspiration for more detailed changes, and also to describe the
scenarios where we're seeing issues now that we have completed implementing
the XenServer support for GFS2 that we discussed back in Nuremberg last year.
In our testing they certainly make things better. They probably aren't fully
optimal, as we can't maintain 10Gb wire speed consistently across the full
LUN, but we're getting about 75% of it, which is certainly better than what we
were seeing before we started looking at this.

Thanks,

        Mark.

-----Original Message-----
From: Bob Peterson <[email protected]> 
Sent: 20 September 2018 18:18
To: Mark Syms <[email protected]>
Cc: [email protected]; Ross Lagerwall <[email protected]>; Tim 
Smith <[email protected]>
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

----- Original Message -----
> While testing GFS2 as a storage repository for virtual machines we 
> discovered a number of scenarios where the performance was being 
> pathologically poor.
> 
> The scenarios are simplified to the following -
> 
>   * On a single host in the cluster grow a number of files to a
>     significant proportion of the filesystem's LUN size, exceeding the
>     host's preferred resource group allocation. This can be replicated
>     by using fio and writing to 20 different files with a script like

Hi Mark, Tim and all,

The performance problems with rgrp contention are well known, and have been for 
a very long time.

In rhel6 it's not as big a problem because rhel6 gfs2 uses "try locks", which
distribute different processes to unique rgrps, thus keeping them from
contending. However, it results in file system fragmentation that tends to
catch up with you later.

I posted a different patch set that solved the problem another way, by trying
to keep track of both inter-node and intra-node contention and redistributing
rgrps accordingly. It was similar to your first patch, but used a more
predictable distribution, whereas yours is random.
It worked very well, but it ultimately got rejected by Steve Whitehouse in 
favor of a better approach:

Our current plan is to allow rgrps to be shared among many processes on a 
single node. This alleviates the contention, improves throughput and 
performance, and fixes the "favoritism" problems gfs2 has today.
In other words, it's better than just redistributing the rgrps.

I did a proof-of-concept set of patches and saw pretty good performance numbers 
and "fairness" among simultaneous writers. I posted that a few months ago.

Your patch would certainly work, and random distribution of rgrps would
definitely gain performance, just as the Orlov algorithm does; however, I
still want to pursue what Steve suggested.

My patch set for this still needs some work because I found some bugs with how 
things are done, so it'll take time to get working properly.

Regards,

Bob Peterson
Red Hat File Systems
