Re: [hwloc-users] AMD EPYC topology

2017-12-29 Thread Matthew Scutter
Following up on this: indeed, with a recent kernel the error message goes away.
The poor performance remains, though (only a few percent difference between
4.13 and 4.15-rc5), and I'm at a loss as to whether it's related to MPI or not.
I see oddities such as locking the job to the first 12 cores yielding 100%
greater performance than locking it to the last 12 cores, which I can't explain
but can only suspect is related to some kind of MPI cache-partitioning issue.
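
The kind of pinning I'm describing can be reproduced with something like the
minimal hwloc sketch below (hwloc 1.11 C API; the hard-coded range of twelve
cores and the omitted error handling are purely illustrative):

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Accumulate the PUs of cores 0..11; change the range to test the last 12. */
    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    for (int i = 0; i < 12 && i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        hwloc_bitmap_or(set, set, core->cpuset);
    }

    /* Bind the current process (and its threads) to that cpuset. */
    hwloc_set_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);

    char *str;
    hwloc_bitmap_asprintf(&str, set);
    printf("bound to cpuset %s\n", str);
    free(str);

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}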


On Sat, Dec 30, 2017 at 8:59 AM, Brice Goglin  wrote:

>
>
> On 29/12/2017 at 23:15, Bill Broadley wrote:
> >
> >
> > Very interesting, I was running parallel finite element code and was
> > seeing great performance compared to Intel in most cases, but on larger
> > runs it was 20x slower.  This would explain it.
> >
> > Do you know which commit, or anything else that might help find any
> > related discussion?  I tried a few google searches without luck.
> >
> > Is it specific to the 24-core?  The slowdown I described happened on a
> > 32 core Epyc single socket as well as a dual socket 24 core AMD Epyc
> > system.
>
> Hello
>
> Yes, it's 24-core specific (that's the only core count that doesn't have
> 8 cores per Zeppelin module).
>
> The commit in Linux git master is 2b83809a5e6d619a780876fcaf68cdc42b50d28c
>
> Brice
>
>
> commit 2b83809a5e6d619a780876fcaf68cdc42b50d28c
> Author: Suravee Suthikulpanit 
> Date:   Mon Jul 31 10:51:59 2017 +0200
>
> x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask
>
> For systems with X86_FEATURE_TOPOEXT, current logic uses the APIC ID
> to calculate shared_cpu_map. However, APIC IDs are not guaranteed to
> be contiguous for cores across different L3s (e.g. family17h system
> w/ downcore configuration). This breaks the logic, and results in an
> incorrect L3 shared_cpu_map.
>
> Instead, always use the previously calculated cpu_llc_shared_mask of
> each CPU to derive the L3 shared_cpu_map.
>

Re: [hwloc-users] AMD EPYC topology

2017-12-29 Thread Brice Goglin


On 29/12/2017 at 23:15, Bill Broadley wrote:
>
>
> Very interesting, I was running parallel finite element code and was seeing
> great performance compared to Intel in most cases, but on larger runs it was
> 20x slower.  This would explain it.
>
> Do you know which commit, or anything else that might help find any related
> discussion?  I tried a few google searches without luck.
>
> Is it specific to the 24-core?  The slowdown I described happened on a 32 core
> Epyc single socket as well as a dual socket 24 core AMD Epyc system.

Hello

Yes, it's 24-core specific (that's the only core count that doesn't have
8 cores per Zeppelin module).

The commit in Linux git master is 2b83809a5e6d619a780876fcaf68cdc42b50d28c

Brice


commit 2b83809a5e6d619a780876fcaf68cdc42b50d28c
Author: Suravee Suthikulpanit 
Date:   Mon Jul 31 10:51:59 2017 +0200

x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask

For systems with X86_FEATURE_TOPOEXT, current logic uses the APIC ID
to calculate shared_cpu_map. However, APIC IDs are not guaranteed to
be contiguous for cores across different L3s (e.g. family17h system
w/ downcore configuration). This breaks the logic, and results in an
incorrect L3 shared_cpu_map.

Instead, always use the previously calculated cpu_llc_shared_mask of
each CPU to derive the L3 shared_cpu_map.
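
For anyone who wants to check whether a given kernel is affected without
rebuilding anything, a small userspace sketch like the one below prints each
CPU's L3 sharing list from sysfs (it assumes the usual x86 layout where cache
index3 is the L3 leaf); on an affected 24-core Epyc the reported groups do not
match the real CCX layout:

#include <stdio.h>

int main(void)
{
    char path[128], line[256];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;  /* no more CPUs (or no L3 cache leaf exposed) */
        if (fgets(line, sizeof(line), f))
            printf("cpu%d L3 shared_cpu_list: %s", cpu, line);
        fclose(f);
    }
    return 0;
}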


Re: [hwloc-users] AMD EPYC topology

2017-12-29 Thread Bill Broadley



Very interesting, I was running parallel finite element code and was seeing
great performance compared to Intel in most cases, but on larger runs it was 20x
slower.  This would explain it.

Do you know which commit, or anything else that might help find any related
discussion?  I tried a few google searches without luck.

Is it specific to the 24-core?  The slowdown I described happened on a 32 core
Epyc single socket as well as a dual socket 24 core AMD Epyc system.


Re: [hwloc-users] AMD EPYC topology

2017-12-24 Thread Brice Goglin
Hello
Make sure you use a very recent Linux kernel. There was a bug regarding L3
caches on 24-core Epyc processors; it has been fixed in 4.14 and backported
to 4.13.x (and maybe to distro kernels too).
However, that would likely not cause a huge performance difference unless your
application depends heavily on the L3 cache.
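
If you want to double-check what the kernel exposes through hwloc, a minimal
sketch along these lines (hwloc 1.11 C API, most error handling omitted) lists
every L3 together with its cpuset; on a fixed kernel a 24-core Epyc should
report 8 L3 objects, one per 3-core CCX:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* In the hwloc 1.x API, caches are HWLOC_OBJ_CACHE objects; this helper
     * returns the topology depth of the level-3 unified caches. */
    int depth = hwloc_get_cache_type_depth(topo, 3, HWLOC_OBJ_CACHE_UNIFIED);
    if (depth < 0) {
        fprintf(stderr, "no L3 level found\n");
        hwloc_topology_destroy(topo);
        return 1;
    }

    int n = hwloc_get_nbobjs_by_depth(topo, depth);
    for (int i = 0; i < n; i++) {
        hwloc_obj_t l3 = hwloc_get_obj_by_depth(topo, depth, i);
        char set[256];
        hwloc_bitmap_snprintf(set, sizeof(set), l3->cpuset);
        printf("L3 #%d cpuset %s\n", i, set);
    }

    hwloc_topology_destroy(topo);
    return 0;
}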
Brice


On 24 December 2017 at 12:46:01 GMT+01:00, Matthew Scutter wrote:
>I'm getting poor performance on OpenMPI tasks on a new AMD 7401P EPYC
>server. I suspect hwloc providing a poor topology may have something to do
>with it as I receive this warning below when creating a job.
>Requested data files available at http://static.skysight.io/out.tgz
>Cheers,
>Matthew
>
>* hwloc 1.11.8 has encountered what looks like an error from the operating system.
>*
>* L3 (cpuset 0x6060) intersects with NUMANode (P#0 cpuset 0x3f3f nodeset 0x0001) without inclusion!
>* Error occurred in topology.c line 1088
>*
>* The following FAQ entry in the hwloc documentation may help:
>*   What should I do when hwloc reports "operating system" warnings?
>* Otherwise please report this error message to the hwloc user's mailing list,
>* along with the files generated by the hwloc-gather-topology script.
>
>depth 0:            1 Machine (type #1)
> depth 1:           1 Package (type #3)
>  depth 2:          4 NUMANode (type #2)
>   depth 3:         10 L3Cache (type #4)
>    depth 4:        24 L2Cache (type #4)
>     depth 5:       24 L1dCache (type #4)
>      depth 6:      24 L1iCache (type #4)
>       depth 7:     24 Core (type #5)
>        depth 8:    48 PU (type #6)
>Special depth -3:   12 Bridge (type #9)
>Special depth -4:   9 PCI Device (type #10)
>Special depth -5:   4 OS Device (type #11)