On 22/08/2023 11:52, Alec wrote:

> I wouldn't want to use GPFS if I didn't want my nodes to be able to go nuts; why bother, to be frank.


Because there are multiple users on the system. Do you want to be the one explaining to 50 other users that they can't use the system today because John from Chemistry is pounding the filesystem to death for his jobs? Didn't think so.

There is not an infinite amount of money available, and no reasonable budget buys a file system against which every node can max out its network connection at once.
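
To put rough numbers on it (the figures below are purely illustrative assumptions, not our actual procurement):

# Back-of-the-envelope sketch: why the backend cannot be sized so that
# every node maxes out its NIC at once. All figures are assumptions.
nodes = 300                    # assumed node count
nic_gbps = 10                  # per-node Ethernet link to storage
backend_gb_per_s = 25          # assumed aggregate backend throughput

aggregate_demand_gb_per_s = nodes * nic_gbps / 8   # Gbps -> GB/s
oversubscription = aggregate_demand_gb_per_s / backend_gb_per_s

print(f"Potential demand: {aggregate_demand_gb_per_s:.0f} GB/s")
print(f"Backend capacity: {backend_gb_per_s} GB/s")
print(f"Oversubscription: {oversubscription:.0f}x")
# ~375 GB/s of potential demand against 25 GB/s of backend, i.e. 15x
# oversubscribed; that gap is what some form of sharing has to manage.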

> I had tested a configuration with a single x86 box and 4 x 100GbE adapters talking to an ESS; that thing did amazing performance, in excess of 25 GB/s over Ethernet. If you have a node that needs that performance, build to it. Spend more time configuring QoS to fair-share your bandwidth than baking bottlenecks into your configuration.


There are finite budgets and compromises have to be made. The compromises we made back in 2017 when the specification was written and put out to tender have held up really well.

> The reasoning of holding end nodes to a smaller bandwidth than the backend doesn't make sense. You want to clear "the work" as efficiently as possible, rather than keep IT from ever having constraints pop up. That's what leads to endless dithering and diluting of infrastructure until no one can figure out how to get real performance.


It does, because a small number of jobs can hold the system to ransom for lots of other users. I have to balance things across a large number of nodes. There is only a finite amount of bandwidth to the storage, and it has to be shared out fairly. I could attempt to do that with QoS on the switches, or I could say sod that for a lark, 10Gbps is all you get, and keep it simple. Like I said, today it would be 25Gbps, but this specification was written six years ago, when 25Gbps Ethernet was rather exotic and too expensive.
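
To illustrate why the flat per-node cap is not as restrictive as it sounds (again, illustrative numbers rather than measurements from our system):

# Sketch: per-node fair share of the backend when several nodes hit the
# filesystem at once, compared with a flat 10Gbps cap. Assumed figures.
backend_gb_per_s = 25          # assumed aggregate backend throughput
cap_gb_per_s = 10 / 8          # a 10Gbps NIC expressed in GB/s

for busy_nodes in (5, 10, 20, 50):
    fair_share = backend_gb_per_s / busy_nodes
    effective = min(fair_share, cap_gb_per_s)
    print(f"{busy_nodes:2d} busy nodes: fair share {fair_share:.2f} GB/s, "
          f"with cap {effective:.2f} GB/s")
# Once ~20 nodes are hammering the storage, the fair share drops below
# what a 10Gbps link can carry anyway, so the cap mostly just stops a
# handful of nodes from starving everyone else.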

> So yeah, 95% of the workloads don't care about their performance and can live on dithered and diluted infrastructure that costs a zillion times more money than what the 5% of workloads that do care about bandwidth need to spend to actually deliver.


They do care about performance; they just don't need to max out the allotted performance per node. However, if the performance of the file system is bad, the performance of their jobs will also be bad and the total FLOPS I get from the system will plummet through the floor.
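
A toy model of that effect (the I/O fraction and slowdown factors are made-up assumptions, purely for illustration):

# Toy model: how delivered compute drops when filesystem response
# degrades. io_fraction and the slowdown factors are assumptions.
io_fraction = 0.05    # assumed share of wall time a typical job spends
                      # in I/O when the filesystem is healthy

for slowdown in (1, 2, 5, 10):
    new_walltime = (1 - io_fraction) + io_fraction * slowdown
    delivered = 1 / new_walltime   # useful compute per unit of wall time
    print(f"I/O {slowdown:2d}x slower: jobs deliver {delivered:.0%} "
          f"of their normal throughput")
# Even a job that only does 5% I/O loses roughly a third of its
# effective throughput if the filesystem becomes ten times slower.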

Note it is more like 0.1% of jobs that peg the 10Gbps network interface for any length of time at all.

> Build your storage infrastructure with as much bandwidth per node as possible, because compared to all the other costs it's a drop in the bucket... Don't cheap out on "cables".

No it's not. The Omnipath network (which, by the way, is deliberately reserved for MPI) cost a *LOT* of money. We are having serious conversations that, with current core counts per node, an InfiniBand/Omnipath network doesn't make sense any more, and that 25Gbps Ethernet will do just fine for a standard compute node.

Around 85% of our jobs run on 40 cores (i.e. one node) or fewer. If you go to 128 cores a node, it's more like 95% of all jobs; at 192 cores it's about 98%. The maximum job size we currently allow is 400 cores.

The current thinking is that it is better to ditch the expensive interconnect and use the hundreds of thousands of dollars saved to buy more compute nodes. The 2% of users will just have longer runtimes, but there will be a lot more FLOPS available in total, and they rarely have just one job in the queue, so it will all balance out in the wash and be positive for most users.
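
As a back-of-the-envelope (the prices and counts below are placeholder assumptions, not our actual quotes):

# Rough trade-off sketch: drop the dedicated interconnect and spend the
# saving on extra nodes. All prices and counts are placeholder values.
interconnect_saving = 300_000    # assumed cost of adapters, switches, cables
cost_per_node = 10_000           # assumed cost of a standard compute node
current_nodes = 300              # assumed current node count

extra_nodes = interconnect_saving // cost_per_node
flops_gain = extra_nodes / current_nodes

print(f"Extra nodes bought with the saving: {extra_nodes}")
print(f"Aggregate FLOPS increase: ~{flops_gain:.0%}")
# Roughly 10% more aggregate FLOPS for everyone, traded against slower
# multi-node jobs for the ~2% of work that genuinely spans nodes.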

In consultation, the users are on board with this direction of travel. From our perspective, if a user absolutely needs more than 192 cores on a modern system, it would not be unreasonable to direct them to a national facility that can handle the really huge jobs. We are an institutional HPC facility, after all; we don't claim to be able to handle a 1000-core job, for example.


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
