On 22/08/2023 11:52, Alec wrote:

> I wouldn't want to use GPFS if I didn't want my nodes to be able to go nuts; why bother, to be frank.


Because there are multiple users on the system. Do you want to be the one explaining to 50 other users that they can't use the system today because John from Chemistry is pounding the filesystem to death for his jobs? Didn't think so.

There is not an infinite amount of money available, and no reasonable budget buys a file system against which every node can max out its network connection at once.
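
To put rough numbers on it (the figures below are purely illustrative assumptions, not our actual procurement):

# Back-of-the-envelope sketch: why the backend cannot be sized so that
# every node maxes out its NIC at once. All figures are assumptions.
nodes = 300                    # assumed node count
nic_gbps = 10                  # per-node Ethernet link to storage
backend_gb_per_s = 25          # assumed aggregate backend throughput

aggregate_demand_gb_per_s = nodes * nic_gbps / 8   # Gbps -> GB/s
oversubscription = aggregate_demand_gb_per_s / backend_gb_per_s

print(f"Potential demand: {aggregate_demand_gb_per_s:.0f} GB/s")
print(f"Backend capacity: {backend_gb_per_s} GB/s")
print(f"Oversubscription: {oversubscription:.0f}x")
# ~375 GB/s of potential demand against 25 GB/s of backend, i.e. 15x
# oversubscribed; that gap is what some form of sharing has to manage.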

> I had tested a configuration with a single x86 box and 4 x 100GbE adapters talking to an ESS; that thing did amazing performance, in excess of 25 GB/s over Ethernet. If you have a node that needs that performance, build to it. Spend more time configuring QoS to fair-share your bandwidth than baking bottlenecks into your configuration.


There are finite budgets and compromises have to be made. The compromises we made back in 2017 when the specification was written and put out to tender have held up really well.

> The reasoning of holding end nodes to a smaller bandwidth than the backend doesn't make sense. You want to clear "the work" as efficiently as possible, rather than keep IT from ever having constraints pop up. That's what leads to endless dithering and diluting of infrastructure until no one can figure out how to get real performance.


It does, because a small number of jobs can hold the system to ransom for lots of other users. I have to balance things across a large number of nodes. There is only a finite amount of bandwidth to the storage, and it has to be shared out fairly. I could attempt to do that with QoS on the switches, or I could say sod that for a lark, 10Gbps is all you get, and keep it simple. Like I said, today it would be 25Gbps, but this specification was written six years ago, when 25Gbps Ethernet was rather exotic and too expensive.
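
To illustrate why the flat per-node cap is not as restrictive as it sounds (again, illustrative numbers rather than measurements from our system):

# Sketch: per-node fair share of the backend when several nodes hit the
# filesystem at once, compared with a flat 10Gbps cap. Assumed figures.
backend_gb_per_s = 25          # assumed aggregate backend throughput
cap_gb_per_s = 10 / 8          # a 10Gbps NIC expressed in GB/s

for busy_nodes in (5, 10, 20, 50):
    fair_share = backend_gb_per_s / busy_nodes
    effective = min(fair_share, cap_gb_per_s)
    print(f"{busy_nodes:2d} busy nodes: fair share {fair_share:.2f} GB/s, "
          f"with cap {effective:.2f} GB/s")
# Once ~20 nodes are hammering the storage, the fair share drops below
# what a 10Gbps link can carry anyway, so the cap mostly just stops a
# handful of nodes from starving everyone else.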

> So yeah, 95% of the workloads don't care about their performance and can live on dithered and diluted infrastructure that costs a zillion times more money than what the 5% of workloads that do care about bandwidth need to spend to actually deliver.


They do care about performance; they just don't need to max out the allotted performance per node. However, if the performance of the file system is bad, the performance of their jobs will also be bad and the total FLOPS I get from the system will plummet through the floor.
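
A toy model of that effect (the I/O fraction and slowdown factors are made-up assumptions, purely for illustration):

# Toy model: how delivered compute drops when filesystem response
# degrades. io_fraction and the slowdown factors are assumptions.
io_fraction = 0.05    # assumed share of wall time a typical job spends
                      # in I/O when the filesystem is healthy

for slowdown in (1, 2, 5, 10):
    new_walltime = (1 - io_fraction) + io_fraction * slowdown
    delivered = 1 / new_walltime   # useful compute per unit of wall time
    print(f"I/O {slowdown:2d}x slower: jobs deliver {delivered:.0%} "
          f"of their normal throughput")
# Even a job that only does 5% I/O loses roughly a third of its
# effective throughput if the filesystem becomes ten times slower.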

Note it is more like 0.1% of jobs that peg the 10Gbps network interface for any length of time at all.

> Build your storage infrastructure with as much bandwidth per node as possible, because compared to all the other costs it's a drop in the bucket... Don't cheap out on "cables".

No it's not. The Omnipath network (which, by the way, is deliberately reserved for MPI) cost a *LOT* of money. We are having serious conversations that, with current core counts per node, an InfiniBand/Omnipath network doesn't make sense any more, and that 25Gbps Ethernet will do just fine for a standard compute node.

Around 85% of our jobs run on 40 cores (i.e. one node) or fewer. If you go to 128 cores a node, it's more like 95% of all jobs; at 192 cores it's about 98%. The maximum job size we currently allow is 400 cores.

The current thinking is that it is better to ditch the expensive interconnect and use the hundreds of thousands of dollars saved to buy more compute nodes. The 2% of users will just have longer runtimes, but there will be a lot more FLOPS available in total, and they rarely have just one job in the queue, so it will all balance out in the wash and be positive for most users.
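
As a back-of-the-envelope (the prices and counts below are placeholder assumptions, not our actual quotes):

# Rough trade-off sketch: drop the dedicated interconnect and spend the
# saving on extra nodes. All prices and counts are placeholder values.
interconnect_saving = 300_000    # assumed cost of adapters, switches, cables
cost_per_node = 10_000           # assumed cost of a standard compute node
current_nodes = 300              # assumed current node count

extra_nodes = interconnect_saving // cost_per_node
flops_gain = extra_nodes / current_nodes

print(f"Extra nodes bought with the saving: {extra_nodes}")
print(f"Aggregate FLOPS increase: ~{flops_gain:.0%}")
# Roughly 10% more aggregate FLOPS for everyone, traded against slower
# multi-node jobs for the ~2% of work that genuinely spans nodes.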

In consultation, the users are on board with this direction of travel. From our perspective, if a user absolutely needs more than 192 cores on a modern system, it would not be unreasonable to direct them to a national facility that can handle the really huge jobs. We are an institutional HPC facility, after all; we don't claim to be able to handle a 1000-core job, for example.


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
