On 05/10/2020 07:27, Jordi Caubet Serrabou wrote:

> Coming to the routing point, is there any reason why you need it? I
> mean, is this because GPFS is trying to connect between compute nodes,
> or for a reason outside the GPFS scope?
> If the reason is GPFS, imho the best approach - without knowledge of the
> licensing you have - would be to use separate clusters: a storage
> cluster and two compute clusters.

The issue is that individual nodes want to talk to one another on the data interface, which caught me by surprise as the cluster is set to admin mode central.

The admin interface runs over ethernet for all nodes on a specific VLAN which is given 802.1p priority 5 (that's Voice: < 10 ms latency and jitter). That saved a bunch of switching and cabling, as you don't need an extra physical interface for the admin traffic. The cabling already significantly restricts airflow in a compute rack as it is, without adding a whole bunch more for a barely used admin interface.
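
For anyone wanting to check how their own cluster has this set up, something
along these lines shows the adminMode setting and the admin versus daemon node
names. A rough sketch only, assuming the standard mm* commands are on the PATH:

#!/usr/bin/env python3
# Rough sketch: show adminMode and the admin vs daemon node names.
# Assumes the GPFS/Spectrum Scale mm* commands are installed and on the PATH.
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout

# adminMode should come back as 'central' on a setup like ours
print(run(["mmlsconfig", "adminMode"]))

# mmlscluster lists both the daemon node name and the admin node name,
# which is where you can see which interface each kind of traffic uses
print(run(["mmlscluster"]))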

To be frank, it's as if the people who wrote the best practice about a separate interface for admin traffic know very little about networking. That is all last-century technology.

The nodes for undergraduate teaching only have a couple of 1Gb ethernet ports, which would suck for storage. However, they also have QDR Infiniband: even though undergraduates can't run multinode jobs, the Lustre storage on the old cluster was delivered over Infiniband, so those nodes got Infiniband cards anyway.

> Both compute clusters join the storage cluster using a multicluster
> setup. There is no need for the compute clusters to see each other; they
> only need to see the storage cluster. One of the clusters uses the
> 10G interface, the other the IPoIB interface.
> You need at least three quorum nodes in each compute cluster, but if
> licensing is per drive on the DSS, that is covered.
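
For reference, the multicluster join being described is roughly the usual
mmauth / mmremotecluster / mmremotefs sequence. A sketch of it below; the
cluster names, contact nodes and filesystem name are made up, and since the
two halves run on different clusters the script just prints the sequence:

#!/usr/bin/env python3
# Rough sketch of a GPFS multicluster join; all names are hypothetical.
# Nothing is executed because the two halves run on different clusters.

on_storage_cluster = [
    "mmauth genkey new",
    "mmauth update . -l AUTHONLY",
    # exchange the public keys out of band, then:
    "mmauth add compute1.example.ac.uk -k /tmp/compute1_id_rsa.pub",
    "mmauth grant compute1.example.ac.uk -f gpfs0",
]

on_each_compute_cluster = [
    "mmauth genkey new",
    "mmremotecluster add storage.example.ac.uk -n dssg1,dssg2 "
    "-k /tmp/storage_id_rsa.pub",
    "mmremotefs add gpfs0 -f gpfs0 -C storage.example.ac.uk -T /gpfs",
    "mmmount gpfs0 -a",
]

for label, cmds in (("storage cluster", on_storage_cluster),
                    ("each compute cluster", on_each_compute_cluster)):
    print(f"# on the {label}:")
    for cmd in cmds:
        print("   ", cmd)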

Three clusters is starting to get complicated from an admin perspective. The biggest issue is coordinating maintenance and keeping sufficient quorum nodes up.

Maintenance on compute nodes is done via the job scheduler. I know some people think this is crazy, but it is in reality extremely elegant.

We can schedule a reboot on a node as soon as the current job has finished (usually used for firmware upgrades). Or we can schedule a job to run as root (usually for applying updates) as soon as the current job has finished. As such we have no way of knowing when that will be for a given node, and there is a potential for all three quorum nodes to be down at once.
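
In practice that boils down to something like the sketch below, which assumes
a Slurm-style scheduler; the node name and update script path are made up.
The scontrol reboot ASAP call does the reboot-when-free part, and the sbatch
call queues an update job behind whatever is currently running:

#!/usr/bin/env python3
# Rough sketch, assuming a Slurm-style scheduler. The node name and the
# update script path are hypothetical.
import subprocess
import sys

node = sys.argv[1] if len(sys.argv) > 1 else "node001"

# Reboot the node as soon as the current job finishes (firmware upgrades)
# and put it back into service automatically afterwards.
subprocess.run(["scontrol", "reboot", "ASAP", "nextstate=RESUME",
                "reason=firmware upgrade", node], check=True)

# Or queue an update job (run as root, assuming root jobs are permitted)
# that takes the whole node as soon as it is free.
subprocess.run(["sbatch", "--nodelist=" + node, "--exclusive",
                "--job-name=maintenance", "/root/run_updates.sh"],
               check=True)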

Using this scheme we can seamlessly upgrade the nodes safe in the knowledge that a node is either busy and it's running on the current configuration or it has been upgraded and is running the new configuration. Consequently multinode jobs are guaranteed to have all nodes in the job running on the same configuration.

The alternative is to drain the node, but there is only a 23% chance the node becomes free during working hours, which leads to a significant loss of compute time when doing maintenance. Under our existing scheme the loss of compute time is only as long as the upgrade takes to install. Pretty much the only time we have idle nodes is when the scheduler is reserving nodes ready to schedule a multinode job.
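
As a back-of-envelope check (not how the 23% was measured, just assuming job
end times are spread fairly evenly through the week), that sort of figure is
what you would expect, since working hours are only a small fraction of the
168 hours in a week:

# Back-of-envelope only; assumes job end times are uniform over the week.
working_hours = 5 * 8     # hours per week inside working hours
week_hours = 7 * 24       # hours in a week
print(f"{working_hours / week_hours:.0%}")   # ~24%, same ballpark as 23%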

Right now we have a single cluster with the quorum nodes being the two DSS-G nodes and the node used for backup. It is easy to ensure that quorum is maintained on these; they also all run real RHEL, whereas the compute nodes run CentOS.
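
If anyone wants to sanity-check their own quorum layout, a quick sketch along
these lines (again assuming mmlscluster is on the PATH) pulls out the nodes
that carry a quorum designation:

#!/usr/bin/env python3
# Rough sketch: list the nodes carrying a quorum designation.
# Assumes the GPFS mmlscluster command is installed and on the PATH.
import subprocess

out = subprocess.run(["mmlscluster"], capture_output=True, text=True,
                     check=True).stdout
for line in out.splitlines():
    # the Designation column shows 'quorum' or 'quorum-manager'
    if "quorum" in line:
        print(line)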


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
