Hi,
On 03/07/17 21:31, Bob Peterson wrote:
----- Original Message -----
| Hi,
|
|
| On 30/06/17 19:18, Bob Peterson wrote:
| > Hi,
| >
| > This patch changes GFS2's LRU list from a single linked list with
| > a much-contended spin_lock to a hash table of linked lists, thus
| > having more locks with less contention. The key for the table is
| > set round-robin to evenly distribute the glocks into the buckets.
| >
| > Signed-off-by: Bob Peterson <[email protected]>
| This is a bit confused I think. There needs to be some more work done
| here to figure out exactly what the problem is, but I'm not very keen on
| this solution for a number of reasons...
|
| - It still claims to be LRU, but in fact it isn't any more
| - The list might have been broken up, but the counter is still global,
| so that there is still the possibility for contention on that
| - There is no performance data to say whether or not there is any
| improvement or not
| - There should not in general be huge numbers of LRU operations on
| glocks anyway, so why is the LRU list contended in the first place?
| - It introduces a new global counter at glock creation time, so
| something else that might land up being a contention point
|
| The important question is what causes the contention on the LRU list in
| the first place - ideally we'd look at that issue in order to avoid so
| many list operations, rather than making it very complicated and non-LRU
| - at the very least there should be a justification for the new algorithm,
|
| Steve.
Hi,
Let me see if I can address some of the issues you pointed out.
| There needs to be some more work done
| here to figure out exactly what the problem is
The problem, as I see it, is contention on the lru_lock, which is a single
point of contention between all GFS2 glock users as documented in this
bug: https://bugzilla.redhat.com/show_bug.cgi?id=1456886.
Before I posted the patch, I asked our performance group to run specsfs
on an instrumented kernel. The instrumented kernel basically tells me
how much total time was being spent in each 30-second interval holding the
lru_lock, and how much time was spent waiting for it.
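The instrumented kernel itself isn't shown in this thread, but the idea — totalling, per call site, how long callers waited for the lock and how long they held it — can be modeled in userspace. The following Python sketch is purely illustrative (the real patch is kernel C, and the class and site names here are made up for the example):

```python
import threading
import time
from collections import defaultdict

# Toy model of the instrumentation described above: for each call site,
# accumulate how long callers waited to acquire lru_lock ("wait") and
# how long they held it ("held").
class TimedLock:
    def __init__(self):
        self._lock = threading.Lock()
        self.wait_s = defaultdict(float)   # site name -> total wait time
        self.held_s = defaultdict(float)   # site name -> total hold time
        self._acquired_at = 0.0

    def lock(self, site):
        t0 = time.monotonic()
        self._lock.acquire()
        t1 = time.monotonic()
        self.wait_s[site] += t1 - t0       # time spent spinning/blocking
        self._acquired_at = t1             # safe to store: we hold the lock

    def unlock(self, site):
        self.held_s[site] += time.monotonic() - self._acquired_at
        self._lock.release()

lru_lock = TimedLock()
lru_lock.lock("gfs2_glock_add_to_lru")
time.sleep(0.01)                           # stand-in for real work
lru_lock.unlock("gfs2_glock_add_to_lru")
```

A periodic reporter that prints and resets these totals every 30 seconds would produce output shaped like the "held:/wait:" lines quoted below.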
GFS2 only locks the lru_lock in five places, which I numbered 1 through 5.
The five are:
(1) gfs2_glock_add_to_lru
(2) gfs2_glock_remove_from_lru
(3) gfs2_glock_put
(4) gfs2_scan_glock_lru
The goal was to determine if the lru scanning code was holding the lru
lock for long periods of time. I examined the results in detail for the
duration of the specsfs run. Here is a very small excerpt of the output
that shows one of the anomalies:
Jun 14 15:21:34 dellfsvr1 kernel: held: 0 23 22 0 0 wait: 0 191 187 0 0
Jun 14 15:21:35 dellfsvr1 kernel: held: 0 110 115 0 0 wait: 0 563 523 1 0
Jun 14 15:21:50 dellfsvr1 kernel: held: 0 1541 1686 7 0 wait: 0 10596 10478 30 0
Jun 14 15:21:50 dellfsvr1 kernel: held: 0 67 74 1 0 wait: 0 412 453 0 0
Jun 14 15:21:51 dellfsvr1 kernel: held: 0 49 68 0 0 wait: 0 232 248 0 0
As in this excerpt, the "held" times imply that no function is holding
the lru_lock for very long. In this excerpt, the longest time the lru lock
was held was 1686 milliseconds from remove_from_lru, followed closely by
add_to_lru. (In other words, the total time spent in add_to_lru was 1541 ms
during that 30 second reporting period.) Functions glock_put and
scan_glock_lru were so small they're barely a blip.
That is a very long time for the lock to be held, assuming that is a
single call to the lock. That implies that something is wrong, since
that function does very little. If that includes the wait time for the
lock, then it is just an artifact of the contention and doesn't by
itself tell us anything that we didn't already know.
The "wait" times show that during a 30 second period, add_to_lru waited
a total of 10596 milliseconds, and remove_from_lru waited a total of 10478
milliseconds.
As an experiment, I tried disabling interrupts while the lru lock was
held. The result: Absolutely no change in performance, so the problem
is apparently not made worse by context switches.
I'm not sure I understand; there would never be any context switches
while spinlocks are held anyway.
My conclusion was that nobody is holding the lru lock for long; the
contention is just based on the sheer number of calls to lock it.
That's why I broke the lock into a hash table to split it up.
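The shape of that change can be sketched as follows. This is a userspace Python model of the idea (one lock and one list per bucket, buckets assigned round-robin at creation), not the actual patch; the bucket count and all names are invented for illustration:

```python
import threading
from collections import OrderedDict
from itertools import count

NR_BUCKETS = 16          # illustrative; the patch's bucket count may differ

class BucketedLRU:
    """One lock + one list per bucket instead of a single lru_lock.

    Each glock is assigned a bucket round-robin at creation, per the
    patch description, so operations spread across NR_BUCKETS locks.
    Note the trade-off raised in this thread: eviction order is only
    LRU *within* a bucket, not globally, and the counter stays global.
    """
    def __init__(self):
        self.locks = [threading.Lock() for _ in range(NR_BUCKETS)]
        self.lists = [OrderedDict() for _ in range(NR_BUCKETS)]
        self._next = count()             # round-robin bucket source
        self.count = 0                   # still global; real code would
                                         # need an atomic (or per-cpu) here

    def new_bucket(self):
        return next(self._next) % NR_BUCKETS

    def add(self, glock, bucket):
        with self.locks[bucket]:
            self.lists[bucket][glock] = True   # newest entries at the end
            self.count += 1

    def remove(self, glock, bucket):
        with self.locks[bucket]:
            if self.lists[bucket].pop(glock, None) is not None:
                self.count -= 1

    def scan_oldest(self, bucket):
        """Demote the least-recently added entry of one bucket."""
        with self.locks[bucket]:
            if self.lists[bucket]:
                glock, _ = self.lists[bucket].popitem(last=False)
                self.count -= 1
                return glock
        return None
```

The sketch makes the objection concrete: `scan_oldest` only sees one bucket, so the glock it picks may be far from the globally least-recently-used one.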
I'm going to ask the perf team to try a patched version with specsfs
because I can't seem to find the results of the last test.
(1) It still claims to be LRU, but in fact it isn't any more.
This is true. I don't know how much of a problem that is, but we
might be able to rearrange things to make it more like one.
The new algorithm is more or less selecting random glocks to demote. We
know from previous experience that this doesn't work well, so I'd rather
not substitute one bug for a different one.
(2) The list might have been broken up, but the counter is still global.
The counter is just an atomic, so it seems unlikely to me to contend
enough to make a difference. Worst case, it would be no worse than what
we've got now. Hopefully the performance numbers will reflect it.
Whether it is atomic or not has nothing to do with whether it will
contend. The issue is not the operation itself but whether it is likely
to be sharing cache lines with other threads on different cpus. So at
the very least it should have been made a per_cpu counter, assuming that
this really is the right solution.
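The per-cpu suggestion amounts to sharding the counter so that writers touch independent cache lines and only readers pay the cost of summing. A rough userspace model of that split (in the kernel this would be a percpu_counter; the shard count and names here are arbitrary):

```python
import threading

class ShardedCounter:
    """Model of a per-cpu counter: each shard is updated independently
    (in the kernel, per CPU, with no shared cache line between writers);
    reading sums all shards."""
    def __init__(self, nr_shards=8):
        self.shards = [0] * nr_shards
        self.locks = [threading.Lock() for _ in range(nr_shards)]

    def add(self, shard, delta=1):
        with self.locks[shard]:        # kernel: a local per-cpu update,
            self.shards[shard] += delta  # no cross-cpu lock at all

    def read(self):
        return sum(self.shards)        # readers walk all shards

c = ShardedCounter()
c.add(0)
c.add(3)
c.add(3, -1)
```

Updates scale with the number of shards; the trade-off is that an exact read is now O(shards), which is why such counters suit frequently-written, rarely-read totals like an LRU length.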
(3) There is no performance data to say whether or not there is any improvement
or not.
Again, I'll try to get those numbers.
(4) There should not in general be huge numbers of LRU operations on
glocks anyway, so why is the LRU list contended in the first place?
The glocks are being added to and taken off the lru list a lot.
Yes, and we need to get to the bottom of why that is happening. This is
the missing piece of the puzzle. What glock types are implicated in
this? What operations are causing this to happen? Can we avoid doing
that in the first place?
If there aren't huge numbers of lru operations, there is no need for
bz#1456886. The "why is it contended in the first place" is explained
above: It's the sheer number of processes hitting that lock at any
given time.
(5) It introduces a new global counter at glock creation time, so
something else that might land up being a contention point.
Point taken. The performance numbers should hopefully reflect whether
that's true.
Stay tuned and I'll see if I can get some performance statistics.
Regards,
Bob Peterson
Red Hat File Systems
There is already a generic NUMA aware LRU (or nearly LRU) implementation
which we are using. It would be better to simply make the changes to the
glock shrinker so that it can use the NUMA aware implementation, as per
the quota shrinker. I seem to remember there was an outstanding issue
which meant that this was not so easy to do, and I don't remember what
it was off the top of my head. At least if we are going to break up that
list, then let's make sure that we do it in a way that makes sense,
Steve.