So why not use the built-in QoS features of Spectrum Scale to adjust the performance of a particular fileset? That way you can ensure everyone gets appropriate bandwidth:

https://www.ibm.com/docs/en/storage-scale/5.1.1?topic=reference-mmqos-command
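Something like the below is the pool-level form I've used (the mmqos command linked above adds finer-grained, per-fileset classes on top of this, which I haven't spelled out here). The filesystem name, node class and numbers are made up for illustration, so check the docs before copying:

    # Enable QoS on filesystem "gpfs0" (name is made up) and throttle the
    # "maintenance" class so restripes/rebalances/policy scans leave
    # headroom for user I/O; "other" (normal user I/O) stays unlimited.
    mmchqos gpfs0 --enable pool=system,maintenance=300IOPS,other=unlimited

    # See what each QoS class is actually consuming.
    mmlsqos gpfs0

    # For the per-node throughput limit I mention below (untested as a
    # hard cap), maxMBpS via mmchconfig is the knob I'd look at first;
    # "compute_nodes" is a hypothetical node class.
    mmchconfig maxMBpS=2000 -N compute_nodes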
What you're saying is that you don't want to build a system to meet John's demands because you're worried about Tom not having bandwidth for his process, when in fact there is a way to guarantee a minimum quality of service for every user and still allow the system to perform exceptionally well for those that need or want it. You can also set hard caps if you want. I haven't tested it, but you should also be able to set a max bandwidth per node so that it won't exceed a certain limit if you really need to. Not sure if you're using LSF, but you can even tie LSF queues to Spectrum Scale QoS; I didn't really try it, but I thought it has some great possibilities.

I would say don't hurt John to keep Tom happy. Make both of them happy. In this scenario you don't have to intimately know the CPU vs I/O characteristics of a job. You just need to know that reserving 1GB/s of I/O per filesystem is fair, and letting jobs consume max I/O when available is efficient. In Linux you have other mechanisms such as cgroups to refine workload distribution within the node (rough sketch below the quoted message).

Another way to think about it is that in a system that is trying to get work done, any unused capacity is costing someone somewhere something. At the same time, if a system can't perform reliably and predictably, that is a problem, but QoS is there to solve that problem.

Alec

On Thu, Aug 24, 2023, 8:28 AM Jonathan Buzzard <[email protected]> wrote:

> On 22/08/2023 11:52, Alec wrote:
> > I wouldn't want to use GPFS if I didn't want my nodes to be able to go nuts, why bother to be frank.
>
> Because there are multiple users to the system. Do you want to be the one explaining to 50 other users that they can't use the system today because John from Chemistry is pounding the filesystem to death for his jobs? Didn't think so.
>
> There is not an infinite amount of money available and it is not possible, with a reasonable amount of money, to make a file system that all the nodes can max out their network connection at once.
>
> > I had tested a configuration with a single x86 box and 4 x 100GbE adapters talking to an ESS; that thing did amazing performance, in excess of 25 GB/s over Ethernet. If you have a node that needs that performance, build to it. Spend more time configuring QoS to fair-share your bandwidth than baking bottlenecks into your configuration.
>
> There are finite budgets and compromises have to be made. The compromises we made back in 2017 when the specification was written and put out to tender have held up really well.
>
> > The reasoning of holding end nodes to a smaller bandwidth than the backend doesn't make sense. You want to clear "the work" as efficiently as possible, more than keep IT from having any constraints popping up. That's what leads to just endless dithering and diluting of infrastructure until no one can figure out how to get real performance.
>
> It does, because a small number of jobs can hold the system to ransom for lots of other users. I have to balance things across a large number of nodes. There is only a finite amount of bandwidth to the storage and it has to be shared out fairly. I could attempt to do it with QoS on the switches, or I could go "sod that for a lark, 10Gbps is all you get" and keep it simple. Though like I said, today it would be 25Gbps, but this was a specification written six years ago when 25Gbps Ethernet was rather exotic and too expensive.
> > So yeah, 95% of the workloads don't care about their performance and can live on dithered and diluted infrastructure that costs a zillion times more money than what the 5% of workload that does care about bandwidth needs to spend to actually deliver.
>
> They do care about performance, they just don't need to max out the allotted performance per node. However, if performance of the file system is bad, the performance of their jobs will also be bad and the total FLOPS I get from the system will plummet through the floor.
>
> Note it is more like 0.1% of jobs that peg the 10Gbps network interface for any period of time at all.
>
> > Build your infrastructure storage as high bandwidth as possible per node because compared to all the other costs it's a drop in the bucket... Don't cheap out on "cables".
>
> No it's not. The Omnipath network (which by the way is reserved deliberately for MPI) cost a *LOT* of money. We are having serious conversations that with current core counts per node an Infiniband/Omnipath network doesn't make sense any more, and that 25Gbps Ethernet will do just fine for a standard compute node.
>
> Around 85% of our jobs run on 40 cores (aka one node) or less. If you go to 128 cores a node it's more like 95% of all jobs. If you go to 192 cores it's about 98% of all jobs. The maximum job size we allow currently is 400 cores.
>
> Better to ditch the expensive interconnect and use the hundreds of thousands of dollars saved to buy more compute nodes is the current thinking. The 2% of users can just have longer runtimes, but hey, there will be a lot more FLOPS available in total, and they rarely have just one job in the queue, so it will all balance out in the wash and be positive for most users.
>
> In consultation the users are on board with this direction of travel. From our perspective, if a user absolutely needs more than 192 cores on a modern system it would not be unreasonable to direct them to a national facility that can handle the really huge jobs. We are an institutional HPC facility after all. We don't claim to be able to handle a 1000 core job, for example.
>
> JAB.
>
> --
> Jonathan A. Buzzard                         Tel: +44141-5483420
> HPC System Administrator, ARCHIE-WeSt.
> University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
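P.S. On the cgroups point above, here is a minimal cgroup v2 sketch of what I mean by refining things within the node. The group name, limits and $JOBPID are all invented for illustration, and note this shapes CPU and memory on the node rather than Scale client I/O:

    # Put a greedy job in its own cgroup and cap it (assumes cgroup v2
    # mounted at /sys/fs/cgroup; values are illustrative only).
    mkdir /sys/fs/cgroup/greedy_job
    echo "200000 100000" > /sys/fs/cgroup/greedy_job/cpu.max   # roughly 2 CPUs
    echo "32G" > /sys/fs/cgroup/greedy_job/memory.high         # soft memory cap
    echo "$JOBPID" > /sys/fs/cgroup/greedy_job/cgroup.procs    # $JOBPID = the job's PID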
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
