** Changed in: ubuntu-power-systems
Status: In Progress => Fix Committed
--
https://bugs.launchpad.net/bugs/1788097
Title:
performance drop with ATS enabled
Status in The Ubuntu-power-systems project:
Fix Committed
Status in linux package in Ubuntu:
Fix Committed
Status in linux source package in Bionic:
Fix Committed
Bug description:
== Comment: #0 - Michael Ranweiler <[email protected]> - 2018-08-16
09:58:02 ==
Witherspoon cluster now has ATS enabled with driver 396.42, CUDA
version 9.2.148. They are running the CORAL benchmark LULESH with and
without ATS, and they see a significant performance drop with ATS
enabled.
========
Below is the run with ATS:
Run completed:
Problem size = 160
MPI tasks = 8
Iteration count = 100
Final Origin Energy = 1.605234e+09
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 2.384186e-07
TotalAbsDiff = 5.300015e-07
MaxRelDiff = 1.631916e-12
Elapsed time = 153.00 (s)
Grind time (us/z/c) = 0.37352393 (per dom) (0.046690491 overall)
FOM = 21417.637 (z/s)
========
Here is the run without ATS:
Run completed:
Problem size = 160
MPI tasks = 8
Iteration count = 100
Final Origin Energy = 1.605234e+09
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 2.384186e-07
TotalAbsDiff = 5.300015e-07
MaxRelDiff = 1.631916e-12
Elapsed time = 13.27 (s)
Grind time (us/z/c) = 0.032394027 (per dom) (0.0040492534 overall)
FOM = 246959.11 (z/s)
========
On a single node, enabling ATS slows the OpenACC version down by more
than 10x; for the version with OpenMP 4.5 and managed memory, they
observe a 2x slowdown.
Last comment from NVIDIA (Javier Cabezas - 07/29/2018 11:30 AM):
We think we have found where the issue is.
This behavior reproduces for any two concurrent processes that create
CUDA contexts on the GPUs and heavily unmap memory (no need to launch
any work on the GPUs). When the problem repros, perf shows that most
of the time is spent in mmio_invalidate. However, this only happens
when processes register GPUs attached to the same NPU. Thus, if
process A initializes GPU 0 and/or 1, and process B initializes GPU
2 and/or 3, we don't see the slowdown. This makes sense, because ATSDs
on different NPUs are issued independently.
After some code inspection in npu-dma.c (powerpc backend in the Linux
kernel), Mark noticed that the problem could be in the use of
test_and_set_bit_lock in get_mmio_atsd_reg. The implementation of
test_and_set_bit_lock in powerpc relies on the ldarx/stdcx.
instructions (PPC_LLARX/PPC_STLCX in the snippet below):
  #define DEFINE_TESTOP(fn, op, prefix, postfix, eh)  \
  static __inline__ unsigned long fn(                 \
                  unsigned long mask,                 \
                  volatile unsigned long *_p)         \
  {                                                   \
          unsigned long old, t;                       \
          unsigned long *p = (unsigned long *)_p;     \
          __asm__ __volatile__ (                      \
          prefix                                      \
  "1:"    PPC_LLARX(%0,0,%3,eh) "\n"                  \
          stringify_in_c(op) "%1,%0,%2\n"             \
          PPC405_ERR77(0,%3)                          \
          PPC_STLCX "%1,0,%3\n"                       \
          "bne- 1b\n"                                 \
          postfix                                     \
          : "=&r" (old), "=&r" (t)                    \
          : "r" (mask), "r" (p)                       \
          : "cc", "memory");                          \
          return (old & mask);                        \
  }
According to the PowerPC manual, ldarx creates a memory reservation
and a subsequent stdcx. instruction from the same processor ensures an
atomic read-modify-write operation. However, the reservation can be
lost if a different processor executes any store instruction to the
same address. That's why "bne- 1b" checks whether stdcx. was successful
and jumps back to retry otherwise. Since DEFINE_TESTOP doesn't
implement any back-off mechanism, two different processors trying to
get an ATSD register can starve each other.
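For readers who want to see the shape of that retry loop outside the
kernel, here is a rough userspace analogue using C11 atomics instead of
inline assembly. The name demo_test_and_set_bit is hypothetical and
this is only a sketch, not the kernel code: the weak compare-exchange
stands in for the ldarx/stdcx. pair (compilers for POWER typically
lower it to a load-reserve/store-conditional sequence), and it retries
immediately with no back-off, as DEFINE_TESTOP does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical analogue of test_and_set_bit_lock(): set bit `nr` in *word
 * and report whether it was already set.
 *
 * Note that the word is written even when the bit is already set, so every
 * probe of a busy slot is a store that can disturb other CPUs' attempts.
 */
static bool demo_test_and_set_bit(unsigned nr, _Atomic unsigned long *word)
{
    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_load_explicit(word, memory_order_relaxed);

    /*
     * A failed weak compare-exchange (another writer got in first, or a
     * spurious failure) refreshes `old` with the current value; we then
     * retry straight away, without any back-off.
     */
    while (!atomic_compare_exchange_weak_explicit(word, &old, old | mask,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        ;

    return (old & mask) != 0;
}
```

The first call for a given bit returns false (the caller now owns the
bit); further calls return true until the bit is cleared again.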
Mark compiled a custom kernel which surrounds the calls to
test_and_set_bit_lock in get_mmio_atsd_reg with a spinlock and I
verified that it solves the issue. These are the execution times for
LULESH:
ATS OFF
Elapsed time = 16.87 (s)
ATS ON
Elapsed time = 215.56 (s)
ATS ON + Spinlock
Elapsed time = 18.14 (s)
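To put the three timings above side by side, a small helper can express
each run as a slowdown relative to the ATS OFF baseline. The function
names are made up for illustration; the numbers are the elapsed times
quoted above. With those inputs, ATS ON comes out at roughly 12.8x the
baseline, and ATS ON + Spinlock at about 1.08x.

```c
#include <stdio.h>

/* Slowdown of a run relative to a baseline elapsed time (both in seconds). */
static double slowdown(double elapsed_s, double baseline_s)
{
    return elapsed_s / baseline_s;
}

/* Print the slowdowns for the LULESH timings quoted in this report. */
static void print_lulesh_slowdowns(void)
{
    const double ats_off = 16.87;          /* ATS OFF */
    const double ats_on = 215.56;          /* ATS ON */
    const double ats_on_spinlock = 18.14;  /* ATS ON + Spinlock */

    printf("ATS ON:            %.2fx\n", slowdown(ats_on, ats_off));
    printf("ATS ON + Spinlock: %.2fx\n", slowdown(ats_on_spinlock, ats_off));
}
```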
Fixed with the following patch in the powerpc tree:
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=9eab9901b015
== Comment: #1 - Michael Ranweiler <[email protected]> - 2018-08-20
14:56:52 ==
This is now in mainline, too:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/powernv/npu-dma.c?id=9eab9901b015f489199105c470de1ffc337cfabb
It applies to 4.15.0.32-35 with some small fuzz:
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 6c8e168e6571..18226895681e 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -434,8 +434,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
 	int i;
 
 	for (i = 0; i < npu->mmio_atsd_count; i++) {
-		if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
-			return i;
+		if (!test_bit(i, &npu->mmio_atsd_usage))
+			if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
+				return i;
 	}
 
 	return -ENOSPC;
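The idea in the patch above (check the bit with a plain read before
attempting the atomic test-and-set) can be sketched in userspace with
C11 atomics. This is an illustration only, not the kernel code:
DEMO_ATSD_COUNT, demo_get_reg and demo_put_reg are hypothetical names,
and the register count of 8 is an arbitrary assumption.

```c
#include <stdatomic.h>

#define DEMO_ATSD_COUNT 8   /* hypothetical number of registers */

/* Claim a free slot in the usage bitmask; returns its index, or -1 if none. */
static int demo_get_reg(_Atomic unsigned long *usage)
{
    for (int i = 0; i < DEMO_ATSD_COUNT; i++) {
        unsigned long mask = 1UL << i;

        /* Read-only test first: busy slots are skipped without any store. */
        if (atomic_load_explicit(usage, memory_order_relaxed) & mask)
            continue;

        /* The slot looked free: claim it atomically and check we won. */
        if (!(atomic_fetch_or_explicit(usage, mask, memory_order_acquire) & mask))
            return i;
    }
    return -1;
}

/* Release a slot previously returned by demo_get_reg(). */
static void demo_put_reg(_Atomic unsigned long *usage, int i)
{
    atomic_fetch_and_explicit(usage, ~(1UL << i), memory_order_release);
}
```

Because slots that are already taken are only read, a thread scanning
past them no longer writes to the shared word, which is what let
contending processors keep cancelling each other's reservations.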