Thank you for reporting back on this, Thomas!
It's good to hear that the issue can be resolved by using a newer
version of OpenBLAS, but it's also frustrating...
This is clearly a bug in OpenBLAS that could have been prevented.
I haven't studied this issue in detail myself yet, but I have seen
comments pass by from OpenBLAS maintainers who say they don't have
Skylake hardware to test on.
That makes me wonder how well the rest of OpenBLAS is tested, which is a
bit infuriating for such an important library.
On the EasyBuild side, I think we have a couple of options for mitigation:
1) Add easyconfigs for the latest version of OpenBLAS to the next
EasyBuild release (v3.9.1), which can be used to swap out the OpenBLAS
included in recent foss toolchains.
I suspect simply doing a "module swap" to the newer OpenBLAS version is
sufficient in most cases (if OpenBLAS was not statically linked, and if
RPATH is not used).
2) Modify the toolchain definition of foss/2018b (and foss/2019a?) to
use the newer OpenBLAS version.
I'm not sure whether this is too drastic, but it would be up to each
site to decide whether to update their existing foss modules to pick
up this change.
3) Collect test programs/scripts/benchmarks in a central repository
(easybuild-testing?), so we can assess the stability of future OpenBLAS
versions that we consider for inclusion in the 'foss' toolchains.
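For option 3, here is a minimal sketch of the kind of correctness check such a repository could collect. It is written in plain Python as an assumption (no such script exists yet), and all names (check_matmul, naive_matmul) are hypothetical. It compares a candidate matrix product against a slow triple-loop reference on random inputs and reports the sizes where they disagree.

```python
import random
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def naive_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference triple-loop product: slow, but independent of any BLAS library."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def max_abs_error(x: Matrix, y: Matrix) -> float:
    """Largest element-wise absolute difference between two equally shaped matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


def check_matmul(candidate: Callable[[Matrix, Matrix], Matrix],
                 sizes: Sequence[int] = (1, 7, 16, 33, 64),
                 tol: float = 1e-9,
                 seed: int = 42) -> List[int]:
    """Return the square sizes n for which `candidate` deviates from the reference.

    Odd sizes are included on the assumption that edge-handling code paths
    in optimized kernels are a likely place for bugs.
    """
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    failures = []
    for n in sizes:
        a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
        b = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
        result = candidate(a, b)
        wrong_shape = len(result) != n or any(len(row) != n for row in result)
        if wrong_shape or max_abs_error(result, naive_matmul(a, b)) > tol:
            failures.append(n)
    return failures
```

To exercise OpenBLAS itself, one could pass a candidate that goes through NumPy, assuming NumPy is linked against the OpenBLAS module under test: `check_matmul(lambda a, b: (np.array(a) @ np.array(b)).tolist())`. An empty list only means no mismatch was found for the tested sizes; small sizes like these may not reach every optimized kernel, so a real test should probably add larger ones.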
You could argue that this isn't our 'job', but if the OpenBLAS
maintainers are not able to properly test their releases on recent
hardware, then I guess it's our duty to try and catch problems like this
ourselves before they blow up in our faces weeks (or months) later.
Is anyone up for helping out with this?
For now we should definitely focus on covering this OpenBLAS issue well,
but I can see this growing into another central repository where we
pool our testing/benchmarking efforts for modules installed with
EasyBuild...
I'm a bit surprised that these problems didn't arise earlier...
foss/2018b was defined quite a while ago (early July 2018), and this
toolchain has been picked up quite widely (based on incoming
contributions).
So why did these problems only start surfacing in recent weeks? Does
anyone have a plausible explanation?
Note that I'm genuinely wondering here; I'm not trying to insinuate
anything...
regards,
Kenneth
On 08/05/2019 13:38, Thomas Eylenbosch wrote:
Hi Jurij
I have installed the latest version of OpenBLAS (0.3.6), and it seems the
matrix calculations are correct now.
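When switching OpenBLAS versions (for example via a module swap), it can be worth confirming which libopenblas a process actually loaded, since static linking or RPATH can keep the old one in use. Below is a minimal Linux-only sketch, assuming /proc is available; the function names are hypothetical and the filename pattern is an assumption about how OpenBLAS libraries are usually named.

```python
import re
from typing import Iterable, List

# Assumed naming: matches paths such as .../libopenblas.so.0 or .../libopenblasp-r0.3.6.so
_OPENBLAS_RE = re.compile(r"(/\S*libopenblas\S*\.so\S*)")


def openblas_paths(maps_lines: Iterable[str]) -> List[str]:
    """Return the unique libopenblas paths found in /proc/<pid>/maps lines, in order."""
    seen: List[str] = []
    for line in maps_lines:
        match = _OPENBLAS_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def loaded_openblas(pid: str = "self") -> List[str]:
    """Linux-only: list the libopenblas files mapped into process `pid` (default: this one)."""
    with open(f"/proc/{pid}/maps") as fh:
        return openblas_paths(fh)
```

For instance, after `import numpy` in the same Python process, `loaded_openblas()` should show whether the path points into the newer OpenBLAS installation or still into the one from the original toolchain. For a compiled binary, `ldd` gives a similar answer without running the program.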
FYI: I am using EasyBuild 3.9.0, so I did not need the
https://github.com/easybuilders/easybuild-easyconfigs/pull/8227 patch.
Kind regards,
Thomas Eylenbosch
Ext: Gluo N.V.
BASF Agricultural Solutions Belgium NV
Technologiepark 101
B-9052 Ghent (Zwijnaarde)
BELGIUM
-----Original Message-----
From: Jure Pecar (via the easybuild mailing list)
Sent: Wednesday 8 May 2019 11:03
To: easybuild mailing list
Subject: Re: [easybuild] Openblas(foss) matrix issue
On Tue, 7 May 2019 19:10:10 +0200, Åke Sandgren wrote:
> Parts of it may be SkylakeX related.
I recently checked the OpenBLAS changelog and saw specific AVX-512
kernels being added, removed, re-added and removed again.
I'd suggest trying the latest OpenBLAS release to see if that fixes the
issues you're seeing. Alternatively, find a way to force OpenBLAS to use
only AVX2 kernels and see if that helps.
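One way to force AVX2-only kernels without rebuilding is the OPENBLAS_CORETYPE environment variable, e.g. set to Haswell (an AVX2 target). Note that this is only honoured by OpenBLAS builds made with DYNAMIC_ARCH=1; a build optimized for a single CPU ignores it and would instead need rebuilding with something like TARGET=HASWELL. The variable is read when the library is loaded, so the minimal sketch below (helper names are hypothetical) sets it for a child process rather than for the already running interpreter.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional


def coretype_env(coretype: str = "Haswell",
                 base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` (default: the current environment) with OPENBLAS_CORETYPE set.

    Assumption: only OpenBLAS builds made with DYNAMIC_ARCH=1 honour this variable.
    """
    env = dict(os.environ if base is None else base)
    env["OPENBLAS_CORETYPE"] = coretype
    return env


def run_with_coretype(python_args: List[str], coretype: str = "Haswell") -> int:
    """Run `python <python_args>` in a child process so OpenBLAS sees the
    variable when it is first loaded; return the child's exit code."""
    cmd = [sys.executable, *python_args]
    return subprocess.run(cmd, env=coretype_env(coretype)).returncode
```

For example, `run_with_coretype(["my_matrix_test.py"])` (a hypothetical script name) would run an existing reproducer with the Haswell kernels. If the wrong results disappear, that points towards the AVX-512 code paths.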