On Wed, May 01, 2019 at 09:41:08AM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 7:52 AM Jan Glauber <jglau...@marvell.com> wrote:
> >
> > It turned out the issue we have on ThunderX2 is the file open-close
> > sequence with small read sizes. If the used files are opened read-only,
> > the lockref code (enabled by ARCH_USE_CMPXCHG_LOCKREF) is used.
> >
> > The lockref CMPXCHG_LOOP uses an unbounded (as long as the associated
> > spinlock isn't taken) while loop to change the lock count. This
> > behaves badly under heavy contention.
>
> Ok, excuse me when I rant a bit.
>
> Since you're at Marvell, maybe you can forward this rant to the proper
> guilty parties?
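[For readers outside the thread: the fast path under discussion follows the pattern below. This is a minimal userspace sketch using C11 atomics in place of the kernel's cmpxchg primitives; all names are illustrative, not the kernel's. It shows why failed updates retry without bound while the embedded spinlock stays free.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace sketch of lockref's fast path, NOT the kernel code:
 * a 32-bit spinlock word and a 32-bit reference count packed into
 * one 64-bit value that is updated with compare-and-swap. */
struct lockref_sketch {
	_Atomic uint64_t lock_count; /* low 32 bits: lock, high 32: count */
};

#define SKETCH_LOCK_MASK 0xffffffffull

/* Lockless increment: keeps retrying the CAS as long as the lock
 * half reads as unlocked.  Under heavy contention many CPUs fail
 * the CAS and restart, which is the unbounded loop described above. */
static bool lockref_get_sketch(struct lockref_sketch *ref)
{
	uint64_t old = atomic_load(&ref->lock_count);

	while ((old & SKETCH_LOCK_MASK) == 0) {    /* spinlock not taken */
		uint64_t new = old + (1ull << 32); /* bump the count */
		if (atomic_compare_exchange_strong(&ref->lock_count,
						   &old, new))
			return true; /* fast path succeeded */
		/* CAS failed: 'old' was refreshed; loop again, no limit */
	}
	return false; /* lock held: caller must take the spinlock */
}
```

[In the kernel the failure path falls through to taking the spinlock; the sketch simply reports that the fast path lost.]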
Sure :)

> Who was the absolute *GENIUS* who went
>
> Step 1: "Oh, we have a middling CPU that isn't world-class on its own"
>
> Step 2: "BUT! We can put a lot of them on a die, because that's 'easy'"
>
> Step 3: "But let's make sure the interconnect isn't all that special,
> because that would negate the whole 'easy' part, and really strong
> interconnects are even harder than CPU's and use even more power, so
> that wouldn't work"
>
> Step 4: "I wonder why this thing scales badly?"
>
> Seriously. Why are you guys doing this? Has nobody ever looked at the
> fundamental thought process above and gone "Hmm"?
>
> If you try to compensate for a not-great core by putting twice the
> number of them in a system, you need a cache system and interconnect
> between them that is more than twice as good as the competition.
>
> And honestly, from everything that I hear, you don't have it. The
> whole chip is designed for "throughput when there is no contention".
> Is it really a huge surprise that it then falls flat on its face when
> there's something fancy going on?

I'll see how x86 runs the same test case; I'd have thought that playing
cacheline ping-pong is not the optimal use case for any CPU. My
assumption was that x86 probably doesn't suffer as much because
cpu_relax() expands to the pause instruction, which could slow down the
retry rate.

> So now you want to penalize everybody else in the ARM community
> because you have a badly balanced system?

Not really, which is why I intentionally did not include a patch and
sent this as an RFC.

> Ok, rant over.
>
> The good news is that we can easily fix _this_ particular case by just
> limiting the CMPXCHG_LOOP to a maximum number of retries, since the
> loop is already designed to fail quickly if the spin lock value isn't
> unlocked, and all the lockref code is already organized to fall back
> to spinlocks.
>
> So the attached three-liner patch may just work for you.
> Once _one_ thread hits the maximum retry case and goes into the
> spinlocked case, everybody else will also fall back to spinlocks
> because they now see that the lockref is contended. So the "retry"
> value probably isn't all that important, but let's make it big enough
> that it probably never happens on a well-balanced system.

Agreed, your patch would solve the issue for ThunderX2. Limiting the
retry attempts was one of the things I tried, besides extending the
number of NOPs in cpu_relax().

> But seriously: the whole "let's just do lots of CPU cores because it's
> easy" needs to stop. It's fine if you have a network processor and
> you're doing independent things, but it's not a GP processor approach.
>
> Your hardware people need to improve on your CPU core (maybe the
> server version of Cortex A76 is starting to approach being good
> enough?) and your interconnect (seriously!) instead of just slapping
> 32 cores on a die and calling it a day.
>
>            Linus "not a fan of the flock of chickens" Torvalds

>  lib/lockref.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/lib/lockref.c b/lib/lockref.c
> index 3d468b53d4c9..a6762f8f45c9 100644
> --- a/lib/lockref.c
> +++ b/lib/lockref.c
> @@ -9,6 +9,7 @@
>   * failure case.
>   */
>  #define CMPXCHG_LOOP(CODE, SUCCESS) do {				\
> +	int retry = 15; /* Guaranteed random number */			\
>  	struct lockref old;						\
>  	BUILD_BUG_ON(sizeof(old) != 8);					\
>  	old.lock_count = READ_ONCE(lockref->lock_count);		\
> @@ -21,6 +22,8 @@
>  		if (likely(old.lock_count == prev.lock_count)) {	\
>  			SUCCESS;					\
>  		}							\
> +		if (!--retry)						\
> +			break;						\
>  		cpu_relax();						\
>  	}								\
>  } while (0)
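[The patch above can be mirrored in a userspace sketch, again with C11 atomics standing in for the kernel's cmpxchg and with illustrative names: after a fixed number of failed CAS attempts the fast path gives up so the caller falls back to the spinlock. The per-architecture relax hint is included because the thread's point about x86 pause throttling the retry rate is part of why the unbounded loop hurts less there.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the patched loop, NOT the kernel implementation.
 * Low 32 bits of lock_count act as the spinlock word, high 32 bits
 * as the reference count. */
struct lockref_sketch {
	_Atomic uint64_t lock_count;
};

#define SKETCH_LOCK_MASK 0xffffffffull

/* Relax hint between retries; on x86 the pause instruction also
 * slows the retry rate, while arm64's yield is a much weaker hint. */
static inline void cpu_relax_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__("pause");
#elif defined(__aarch64__)
	__asm__ __volatile__("yield");
#endif
}

/* Bounded fast path mirroring the three-liner: give up after
 * 'max_retries' failed CAS attempts (15 in the patch, the
 * "guaranteed random number") so the caller takes the spinlock. */
static bool lockref_get_bounded_sketch(struct lockref_sketch *ref,
				       int max_retries)
{
	uint64_t old = atomic_load(&ref->lock_count);
	int retry = max_retries;

	while ((old & SKETCH_LOCK_MASK) == 0) {
		uint64_t new = old + (1ull << 32);
		if (atomic_compare_exchange_strong(&ref->lock_count,
						   &old, new))
			return true;
		if (!--retry)
			break; /* too contended: use the slow path */
		cpu_relax_sketch();
	}
	return false; /* caller takes the spinlock instead */
}
```

[Once one thread holds the spinlock, the lock half reads non-zero and every other thread's fast path bails out on its first check, which is why the exact retry value matters little, as noted above.]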