There is QoS in Lustre: the NRS (Network Request Scheduler) feature.
It allows setting different request-scheduling policies on the servers.
Would it address the issue?

The manual has an entry on it, and there have been a few presentations at LUG/LAD.
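For illustration, the manual's TBF (Token Bucket Filter) NRS policy can rate-limit RPCs per client on the OSS bulk I/O service, which is the closest existing knob for this kind of imbalance. The sketch below is untested; the rule name and NID range are made up for the example, and the exact parameter paths and rule syntax vary between Lustre versions, so check the manual for your release:

```shell
# Hedged sketch, to be run on each OSS. "cap_jobx" and the NID range are
# hypothetical placeholders; syntax differs across Lustre versions.

# Show which NRS policies are available/active for the ost_io service:
lctl get_param ost.OSS.ost_io.nrs_policies

# Activate the TBF policy, classifying requests by client NID:
lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"

# Example rule: cap RPCs from clients matching 192.168.1.*@tcp at
# 100 RPCs/s, so a single job cannot monopolize the OSS:
lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start cap_jobx nid={192.168.1.*@tcp} rate=100"

# Remove the rule again later:
lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop cap_jobx"
```

Note that TBF shapes RPC rates on the server side, not bytes per second end to end, so it can mitigate the oversubscription described below but does not directly balance aggregate throughput against a single client's link speed.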

I have not used NRS myself, but I would like to learn.
Alex.

> On Jul 7, 2023, at 06:48, Anna Fuchs via lustre-discuss 
> <[email protected]> wrote:
> 
> Dear all,
> 
> I have some questions regarding the following scenario:
> - A large HPC system.
> - Let's assume Job X is running on a single compute node and reading a very 
> large file with a stripe count >> 1, possibly -1 (striped across all OSTs). 
> Alternatively, tons of files are read at once, each with a smaller stripe 
> count but distributed across all OSS/OSTs.
> - The compute node is connected with, say, a 100 Gb/s link, while there 
> are 50 servers, each with a 200 Gb/s link. The servers can generate up to 
> 50 x 200 Gb/s of traffic, all of which must drain through the single 
> 100 Gb/s link.
> - Job Y, which requires the same network and potentially doesn't even perform 
> I/O, suffers a lot as a result.
> 
> Does this scenario sound familiar to you?
> Is the sequence of events correct?
> What could be done in this situation?
> 
> To avoid:
> a) having such single/few-nodes jobs
> b) striping large files with up to -1
> c) reading millions of files at once
> One could try, but I am concerned that users would keep doing it, either 
> intentionally or accidentally, and it would only shift the problem rather 
> than solve it.
> One could tweak the network design, reconfigure it, or separate I/O from 
> communication traffic, but that would hardly optimize all use cases. 
> Virtual lanes could potentially be a solution as well, though they might 
> not help if Job Y also involves some I/O.
> 
> Wouldn't it be better if Lustre somehow recognized this imbalance between 
> incoming and outgoing network traffic and delivered the file(s)/data 
> gradually rather than all at once, saturating (or only slightly 
> overloading) the consumer's 100 Gb/s link instead of oversubscribing it by 
> a factor of 100? Does this sound reasonable, and is there already a 
> solution for it?
> I would appreciate any opinions.
> 
> Best regards
> Anna
> 
> --
> Anna Fuchs
> Universität Hamburg
> https://wr.informatik.uni-hamburg.de/people/anna_fuchs
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
