Re: [Beowulf] Frontier Announcement
On 07/05/2019 17.32, Prentice Bisbal via Beowulf wrote:
> ORNL's Frontier system has been announced:
> https://www.hpcwire.com/2019/05/07/cray-amd-exascale-frontier-at-oak-ridge/
>
> Here's the spec sheet:
> https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet.pdf

The use of AMD GPUs is interesting. AFAIU, AMD GPUs are per se roughly competitive with NVIDIA, but the reason NVIDIA has a near-monopoly in the GPGPU and ML markets is the software ecosystem. AMD has a lot of catching up to do on the software side, but if they deliver, customers will win big. People over here have steam coming out of their ears when they hear the prices for Tesla GPUs, and about the recent NVIDIA data center licensing changes.

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
[Beowulf] MLNX_OFED vs. rdma-core vs. MLNX_OFED with rdma-core
Hello,

scouring the release notes for the latest MLNX_OFED (version 4.6-1.0.1.1, and no, still no RHEL 7.7 support), I read a note about an upcoming API change at https://docs.mellanox.com/display/MLNXOFEDv461000/Changes+and+New+Features

"As of MLNX_OFED v5.0 release (Q1 of the year 2020), the following MLNX_OFED Verbs API will migrate from the legacy version of user space verbs libraries (libibverbs, libmlx5, etc.) to the Upstream version rdma-core. For further details on how to install Upstream rdma-core libraries, refer to Installing Upstream rdma-core Libraries section in the User Manual."

And in the linked instructions on how to install with upstream rdma-core, one should use either

    ./mlnxofedinstall --upstream-libs

or, if using the yum repo, the RPMS_UPSTREAM_LIBS subdirectory rather than RPMS. Looking at the RPMS_UPSTREAM_LIBS subdirectory, it seems to contain only a small subset of the components available in the full installation, apparently with the expectation that the rest will be installed from the upstream (distro) repos.

Does anybody know more about this? Do I read it correctly that RPMS_UPSTREAM_LIBS will be the only option as of MLNX_OFED 5.0, and that MLNX_OFED will thus become a much "thinner" add-on than it is currently?

Has anyone tested these different configurations, and whether there's any difference in performance and/or functionality?

1. Distro RDMA stack (rdma-core)
2. MLNX_OFED full
3. MLNX_OFED RPMS_UPSTREAM_LIBS

-- Janne Blomqvist
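[As an aside, a minimal sketch of how one might point a yum .repo file at the right subdirectory, depending on which variant is wanted. The mount path and version string below are illustrative assumptions, not anything from the Mellanox docs.]

```shell
# Hypothetical helper: pick the MLNX_OFED yum repo subdirectory for either
# the legacy Mellanox verbs libraries or the upstream rdma-core variant.
# MOFED_ROOT is an assumed extraction path, adjust for your site.
MOFED_ROOT="/opt/MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.6-x86_64"

mofed_repo_dir () {
    # $1: "legacy" or "upstream"
    case "$1" in
        legacy)   echo "$MOFED_ROOT/RPMS" ;;
        upstream) echo "$MOFED_ROOT/RPMS_UPSTREAM_LIBS" ;;
        *)        echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

# A .repo file would then use something like:
#   baseurl=file://$(mofed_repo_dir upstream)
```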
Re: [Beowulf] MLNX_OFED vs. rdma-core vs. MLNX_OFED with rdma-core
On 23/09/2019 11.16, Janne Blomqvist wrote:
> [...]
>
> Has anyone tested these different configurations, and whether there's
> any difference in performance and/or functionality?
>
> 1. Distro RDMA stack (rdma-core)
> 2. MLNX_OFED full
> 3. MLNX_OFED RPMS_UPSTREAM_LIBS

Not to be outdone, the recently released MLNX_OFED 4.7 has reorganized this slightly, and now there's yet another option:

1. Distro RDMA stack (rdma-core)
2. MLNX_OFED full (directory RPMS/MLNX_LIBS)
3. MLNX_OFED upstream libs 1 (directory RPMS/UPSTREAM_LIBS)
4. MLNX_OFED upstream libs 2 (directory RPMS_UPSTREAM_LIBS)

The difference between options 3 and 4 is that 4 is more bare-bones; e.g.
3 includes the MLNX opensm, whereas with 4 you're expected to use the distro opensm.

-- Janne Blomqvist
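[Editorial addendum: a hedged sketch of how a node could guess which of the variants above it is running, by classifying installed package names. The package names used here (mlnx-ofa_kernel, rdma-core) are what I'd expect on MLNX_OFED / RHEL 7, but treat them as assumptions; this also can't tell options 3 and 4 apart, since both ship upstream rdma-core.]

```shell
# Classify the RDMA stack from a newline-separated list of installed
# package names (e.g. captured with: rpm -qa --qf '%{NAME}\n').
classify_rdma_stack () {
    pkgs="$1"
    if echo "$pkgs" | grep -qx 'mlnx-ofa_kernel'; then
        if echo "$pkgs" | grep -qx 'rdma-core'; then
            # MLNX_OFED kernel bits plus upstream userspace libs
            echo "MLNX_OFED with upstream rdma-core"
        else
            echo "MLNX_OFED full (MLNX_LIBS)"
        fi
    elif echo "$pkgs" | grep -qx 'rdma-core'; then
        echo "distro rdma-core"
    else
        echo "no RDMA stack found"
    fi
}
```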
Re: [Beowulf] [EXTERNAL] Re: Is Crowd Computing the Next Big Thing?
On 27/11/2019 22.56, Lux, Jim (US 337K) via Beowulf wrote:
> With respect to "free cycles" in desktop computers - back in the day, 10-15
> years ago, a bunch of folks made measurements on cluster nodes of one sort
> or another. As I recall, there *is* a power consumption change between full
> load and not, but there's a significant "background load" that is more than
> 50% of the total power consumption.

With current hardware, there is a significant difference (at least, assuming the "ipmi-dcmi --get-system-power-statistics" output is correct). On our Skylake nodes (standard 2-socket CPU nodes), idle power is about 50 W; running flat out, about 400 W. Somewhat older hardware is less good at saving power when idle: our Westmere nodes consume about 100 W idling, and IIRC our old Istanbul Opterons (which we have already thrown away, so I can't double-check) consumed about 170 W when idle.

-- Janne Blomqvist
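[Editorial addendum: a small sketch of extracting the wattage from the ipmi-dcmi output mentioned above. The "Current Power" field label matches what freeipmi's ipmi-dcmi prints on our machines, but verify it against your own hardware's output before trusting it.]

```shell
# Read "ipmi-dcmi --get-system-power-statistics" output on stdin and
# print just the instantaneous power draw as a bare number of watts.
current_power_watts () {
    # "$2 + 0" turns e.g. "52 Watts" into 52
    awk -F': ' '/^Current Power/ { print $2 + 0 }'
}

# Typical use (needs root / IPMI access):
#   ipmi-dcmi --get-system-power-statistics | current_power_watts
```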
Re: [Beowulf] Is Crowd Computing the Next Big Thing?
On 28/11/2019 06.44, Jonathan Aquilina wrote:
> I am going to be blunt here but its not worth that 2 cents per hour
> given phones are key to everyones day to day work etc youll have to
> carry quite a number of battery packs to get you through your day.

Agreed, I wouldn't accept 2 c/h for something that runs at full power and drains the battery. But if the thing runs only when the phone is connected to a charger (and wifi, if it needs to communicate), then it's slightly less nonsensical. It could also be a use for old phones that would otherwise just gather dust in a drawer or be thrown in the trash. Of course, if they pay for compute performance (for some suitable definition of performance) rather than just CPU-hours, old phones, being less power efficient, will be a worse deal.

-- Janne Blomqvist
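[Editorial addendum: back-of-the-envelope arithmetic for the charger scenario above, with made-up numbers. A phone drawing 5 W flat out and household power at 20 c/kWh are assumptions, not measurements.]

```shell
# Does 2 cents/hour even cover the electricity?
payout_cents_per_h=2
phone_watts=5            # assumed full-load draw of a phone SoC
price_cents_per_kwh=20   # assumed household electricity price

# cost per hour in cents = watts / 1000 * price-per-kWh
cost_cents_per_h=$(awk -v w="$phone_watts" -v p="$price_cents_per_kwh" \
    'BEGIN { printf "%.1f", w / 1000 * p }')
echo "electricity: ${cost_cents_per_h} c/h vs payout: ${payout_cents_per_h} c/h"
```

Under these assumptions the electricity itself costs only about a tenth of a cent per hour, so the raw energy bill isn't the objection; battery wear and the nuisance are.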
Re: [Beowulf] alpine linux
On Sat, Jan 30, 2021 at 7:57 AM Jonathan Aquilina via Beowulf wrote:
> Recently been giving alpine linux a try. I am super impressed on
> virtualized version of it what a low ram footprint it has 60mb starting.
> What do you guys think when it comes to alpine linux in the HPC space?

AFAIU the libm in musl hasn't received anywhere near the same level of attention to correctness and performance as the glibc one. Further, depending on how you provision accounts in your cluster, the lack of NSS in musl might be a problem.

-- Janne Blomqvist
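[Editorial addendum: a toy illustration of the NSS point above. glibc's getpwnam() consults whatever nsswitch.conf lists (flat files, LDAP, SSSD, ...), while musl only reads the flat files, so users provisioned centrally simply don't resolve. This sketch mimics the flat-file-only lookup; the user names and file path are made up.]

```shell
# Look a user up in a passwd-format file only, the way musl does;
# print the uid if found, fail otherwise.
flatfile_lookup () {
    # $1: user name, $2: passwd-format file
    awk -F: -v u="$1" '$1 == u { print $3; found=1 } END { exit !found }' "$2"
}

# A node whose accounts live only in LDAP has a nearly empty /etc/passwd:
printf 'root:x:0:0:root:/root:/bin/sh\n' > /tmp/passwd.demo

flatfile_lookup root /tmp/passwd.demo        # prints 0 (root's uid)
flatfile_lookup ldapuser /tmp/passwd.demo || echo "ldapuser: not found"
```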
Re: [Beowulf] [beowulf] nfs vs parallel filesystems
On Sat, Sep 18, 2021 at 8:21 PM Lohit Valleru via Beowulf wrote:
>
> Hello Everyone,
>
> I am trying to find answers to an age-old question of NFS vs parallel
> file systems, specifically Isilon OneFS vs parallel filesystems, and
> specifically looking for any technical articles or papers that can help
> me understand what exactly will not work on OneFS.
> I understand that in the end it all depends on workloads. But at what
> level of metadata IO, or for which particular IO pattern, is NFS bad?
> Would just getting a beefy Isilon NFS HDD-based storage resolve most of
> the issues?
> I am trying to find sources that say that no matter how beefy an NFS
> server can get with HDDs as backend, it will not be as good as parallel
> filesystems for such-and-such workload. If possible, can anyone point me
> to experiences or technical papers that say such-and-such does not work
> with NFS?
>
> Does it have to be that in the end I will have to test my workloads
> across both NFS/OneFS and parallel file systems and then see what does
> not work?
>
> I am concerned that any test case might not be valid, compared to real
> shared workloads where performance might lag once the storage reaches
> PBs in scale and millions of files.

For one thing, NFS is not cache coherent, but rather implements a looser form of consistency called close-to-open consistency. See e.g. the spec at https://datatracker.ietf.org/doc/html/rfc7530#section-1.4.6

One case in which this matters is a workload where multiple nodes concurrently write to a shared file. E.g. with the ever-popular IOR benchmarking tool, a slurm batch file like

#SBATCH -N 2                  # 2 nodes
#SBATCH --ntasks-per-node=1   # 1 MPI task per node

SEGMENTCOUNT=100
# Offset must be equal to ntasks-per-node
OFFSET=1

srun IOR -a POSIX -t 1000 -b 1000 -s $SEGMENTCOUNT -C -Q $OFFSET -e -i 5 -d 10 -v -w -r -W -R -g -u -q -o testfile

This should fail due to corruption within minutes if the testfile is on NFS.
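[Editorial addendum: a hedged sketch of the access pattern behind the IOR run above, reduced to plain dd. Two "tasks" write disjoint 1000-byte blocks into one shared file, then cross-verify each other's blocks, which is roughly what IOR's -C task reordering forces. On a coherent filesystem, as here on a local disk, the verification always passes; with NFS close-to-open semantics, concurrent writers on different clients get no such guarantee. File path and block contents are made up for the demo.]

```shell
F="${F:-/tmp/ior-demo.dat}"
BS=1000

# "task 0" writes block 0 as A's, "task 1" writes block 1 as B's
# (in real IOR these run concurrently on different nodes)
printf '%*s' "$BS" '' | tr ' ' 'A' | dd of="$F" bs="$BS" seek=0 conv=notrunc status=none
printf '%*s' "$BS" '' | tr ' ' 'B' | dd of="$F" bs="$BS" seek=1 conv=notrunc status=none

# cross-verification: each task reads the *other* task's block;
# deleting the expected byte should leave nothing behind
blk1="$(dd if="$F" bs="$BS" skip=1 count=1 status=none | tr -d 'B')"
[ -z "$blk1" ] && echo "block 1 OK"
```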
Not that every parallel filesystem will handle this correctly either, but some will.

-- Janne Blomqvist