Hi,

----- Original Message -----
| > This patch adds logic to function gfs2_rgrp_congested to check
| > inter-node and intra-node congestion. It checks for intra-node
| > (local node process) congestion first, since that's faster.
| >
| > If there are other processes actively using this rgrp (i.e. if
| > they have reservations or are holding its glock) it's considered
| > congested in the intra-node sense.
| My comments in relation to the previous patch still stand. This is still
| the wrong answer to the problem. If we have lock contention here, then
| we should be looking to fix the locking and not trying to avoid the
| issue by choosing a different rgrp.
| 
| If we are deallocating blocks, we have no choice over which rgrp we
| lock, since we must lock the rgrp that contains the blocks we want to
| deallocate. The same problem exists there too, so lets fix this properly
| as per my previous comments,
| 
| Steve.

Some of the relevant comments from the previous patch were:

| This is the wrong solution I think. We should not be choosing other
| rgrps for allocation due to local contention. That is just likely to
| land up creating more fragmentation by spreading things out across the
| disk. The issue is that we can't currently separate the locking into dlm
| (exclusive) and local (shared) but important data structures spinlock
| protected, so that we land up with lock contention on the rgrp and
| everything serialized via the glock. At the moment the glock state
| machine only supports everything (dlm & local) shared or everything
| exclusive.
| 
| If we want to resolve this issue, then we need to deal with that issue
| first and allow multiple local users of the same rgrp to co-exist with
| minimal lock contention. That requires the glock state machine to
| support that particular lock mode first though,
| 
| Steve.

I'd like to hear more about how you envision this working (see below).

If you study the problem with glocktop, you will see that in the
intra-node congestion case (for example, ten processes on one node
all running iozone, dd, or some other heavy user of the block
allocator), most of the processes are stuck waiting for various
rgrp glocks. That's the problem.

My proposed solution was to reduce or eliminate that contention by
letting each process stick to a unique rgrp for block assignments.
This is pretty much what RHEL5 and RHEL6 did when they were using
the "try locks" for rgrps, and that's exactly why those older
releases performed significantly faster with those tests than
RHEL7 and upstream.

It's true that the block deallocator suffers from the same issue,
and my proposed solution alleviates that too by spreading the blocks
across multiple rgrps. Blocks may be allocated and deallocated
simultaneously because each rgrp has its own glock.

This is also not too different from the performance boosts we
gained with the Orlov algorithm, which assigns unique rgrps to
groups of files and likewise spreads the writes across the disk.

This will not increase file fragmentation because that is all
mitigated by the multi-block allocation algorithms we have
in place.

The file system itself may incur more fragmentation, since files
may now span a greater distance with writes going to unique rgrps,
but that is no different from RHEL5, RHEL6, and older releases,
which acquired their rgrp glocks with "try locks" and yielded
basically the same results.

I'm not at all concerned about "head bouncing" performance issues
because those are pretty much absorbed by almost all modern
storage arrays. That's what the performance guys, and my own
testing, tell me. Head-bounce problems are also mitigated by
many modern hard drives as well. And let's face it: very few
people are going to use GFS2 with hard drives that have head-
bounce performance problems; GFS2 is mostly used on storage
arrays.

That's not to say there isn't a better solution. We've talked in
the past about possibly allowing multiple processes to share the
same rgrp for block allocations while the rgrp is exclusively
locked (held) on that node. We can, for example, have unique
reservations, and as long as the reservations don't conflict,
we can safely assign blocks from them simultaneously. However,
we will still get into sticky situations and need some kind of
process-exclusive lock on the rgrp itself. For example, only one
process may assign fields like rg_free, rg_inodes, and so forth.
Maybe we can mitigate that with atomic variables and cpu boundaries.
For the deallocation path, we can, for example, pre-construct a
"multi-block reservation" structure for an existing file, then
deallocate from that with the knowledge that other processes will
avoid blocks touched by the reservations. But at some point, we
will need a process-exclusive lock (spin_lock, rwsem, mutex, etc.)
and that may prove to have a similar contention problem. This
may require a fair amount of design to achieve both correctness
and decent performance.

I've also thought about assigning a unique glock to each bitmap
of the rgrp rather than to the entire rgrp. We could then assign
each writer a unique bitmap, but that's probably not good enough
for two reasons: (1) writers to non-zero-index bitmaps still
need to write to the rgrp block itself to adjust its used, free,
and dinode counts. (2) this would only scale to the number of
bitmaps, which is typically 5 or so, so the ten-writer tests
would still contend in a lot of cases. Still, it would help
"large (e.g. 2GB) rgrp" performance where you typically have 13
or so bitmaps.

Since the glocks are kind of like a layer of caching for dlm
locks, the idea of possibly having more than one "EX" state
for a glock, both of which map to the same DLM state, is intriguing.
After all, rgrp glocks (like all glocks) have unique "glops"
that we might be able to leverage for that purpose.
I'll have to give that some thought, although the "multi-block
reservations" concept still seems like a better fit.

Perhaps we should do this in two phases:

(1) Implement my current design as a short-term solution which
has been proven to increase performance by 20-50 percent for
some use cases and gets the block allocator performance back
to where it was in RHEL6. (And by the same token, helps the
deallocator for the same reasons.)

This is a stop-gap measure while we work on a longer-term
solution that's likely to involve some design work and therefore
take more time to perfect.

(2) Design and implement a longer-term solution that allows
for multiple processes to share an rgrp while its glock is held
in EXclusive mode.

As always, I'm open to more ideas on how to improve this.

Regards,

Bob Peterson
