On 5/8/19 7:56 PM, Kenneth Hoste wrote:
> Thank you for reporting back on this Thomas!
>
> It's good to hear that the issue can be resolved by using a newer
> version of OpenBLAS, but it's also frustrating...
>
> This is clearly a bug in OpenBLAS that could have been prevented.
> I haven't studied this issue in detail myself yet, but I have seen
> comments pass by from OpenBLAS maintainers who say they don't have
> Skylake hardware to test on.
> That makes me wonder how well the rest of OpenBLAS is tested, which is a
> bit infuriating for a library that important.
>
> On the EasyBuild side, I think we have a couple of options for mitigation:
>
> 1) Add easyconfigs for the latest version of OpenBLAS to the next
> EasyBuild release (v3.9.1) which can be used to swap out the OpenBLAS
> included in recent foss toolchains.
> I suspect simply doing a "module swap" to the newer OpenBLAS version is
> sufficient in most cases (if OpenBLAS was not statically linked, and if
> RPATH is not used).
>
> 2) Modify the toolchain definition of foss/2018b (and foss/2019a?) to
> use the newer OpenBLAS version.
> I'm not sure if this is too drastic or not, but it would be up to each
> site to decide whether or not they want to update their already installed
> foss modules to pick up on this.
>
> 3) Collect test programs/scripts/benchmarks in a central repository
> (easybuild-testing?), so we can assess the stability of future OpenBLAS
> versions that we consider for inclusion in the 'foss' toolchains.
>
> You could state that this isn't our 'job', but if the OpenBLAS
> maintainers are not capable of properly testing their releases on recent
> hardware, then I guess it's our duty to try and catch problems like this
> ourselves before they blow up in our faces weeks (or months) later.
>
> Anyone who would be up for helping out with this?
> For now we should definitely focus on covering this OpenBLAS issue well,
> but I can see this growing into another central repo where we
> pool together efforts done on testing/benchmarking on top of modules
> installed with EasyBuild...
>
>
> I'm a bit surprised that these problems didn't arise earlier...
> foss/2018b was defined a fairly long time ago (early July 2018),
> and this toolchain has been in use for quite a while (based on incoming
> contributions).
> So why did these problems only start surfacing in recent weeks? Does
> anyone have a plausible explanation?
> Note that I'm genuinely wondering here, I'm not trying to insinuate
> anything...
One reason it hasn't shown up earlier is that it mainly affects
slightly larger matrices, and maybe also only in certain circumstances.
That's the impression I got from the user here who had problems:
small to medium sizes had no problem, only fairly large ones showed any
problems...
And we would definitely help in testing...
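
A minimal sketch of the kind of size-dependent check that could go into
such a test collection, assuming NumPy is linked against the OpenBLAS
under test. The function name, the value range and the matrix sizes are
hypothetical choices for illustration, not taken from the actual bug
report. It relies on NumPy computing integer matrix products with its own
loops rather than BLAS, so the int64 product can serve as an exact
reference for the float64 product:

```python
import numpy as np


def max_dgemm_error(n, seed=0):
    """Largest absolute difference between a float64 product (normally
    dispatched to the BLAS dgemm that NumPy is linked against) and an
    int64 product of the same matrices (NumPy's own loops, no BLAS)."""
    rng = np.random.default_rng(seed)
    # Small integer entries keep every partial sum exactly representable
    # in float64, so a correct dgemm must reproduce the reference exactly.
    a = rng.integers(-4, 5, size=(n, n))
    b = rng.integers(-4, 5, size=(n, n))
    reference = a @ b                                     # integer path
    result = a.astype(np.float64) @ b.astype(np.float64)  # BLAS path
    return float(np.max(np.abs(result - reference)))


if __name__ == "__main__":
    # Hypothetical sizes; raise the upper end to reach the "fairly large"
    # matrices where the problem was seen (the integer reference is slow).
    for n in (64, 256, 512):
        print(n, max_dgemm_error(n))
```

Any non-zero value would point at a wrong result from the BLAS library
for that size. Running it under different thread counts may also be
worthwhile, since the problem may only appear in certain circumstances.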
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [email protected] Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se