On 05/10/2020 07:27, Jordi Caubet Serrabou wrote:

> Coming to the routing point, is there any reason why you need it? I
> mean, is this because GPFS is trying to connect between compute nodes,
> or for a reason outside the GPFS scope?
> If the reason is GPFS, imho the best approach - without knowledge of the
> licensing you have - would be to use separate clusters: a storage
> cluster and two compute clusters.

The issue is that individual nodes want to talk to one another on the data interface, which caught me by surprise as the cluster is set to admin mode central.

The admin interface runs over ethernet for all nodes on a specific VLAN which is given 802.1p priority 5 (that's Voice: < 10 ms latency and jitter). That saved a bunch of switching and cabling, as you don't need an extra physical interface for the admin traffic. The cabling already significantly restricts airflow in a compute rack as it is, without adding a whole bunch more for a barely used admin interface.
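
For anyone wanting to check how their own cluster has this set up, something
along these lines shows the adminMode setting and the admin versus daemon node
names. A rough sketch only, assuming the standard mm* commands are on the PATH:

#!/usr/bin/env python3
# Rough sketch: show adminMode and the admin vs daemon node names.
# Assumes the GPFS/Spectrum Scale mm* commands are installed and on the PATH.
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout

# adminMode should come back as 'central' on a setup like ours
print(run(["mmlsconfig", "adminMode"]))

# mmlscluster lists both the daemon node name and the admin node name,
# which is where you can see which interface each kind of traffic uses
print(run(["mmlscluster"]))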

To be frank, it's as if the people who wrote the best practice about a separate interface for admin traffic know very little about networking. That is all last-century technology.

The nodes for undergraduate teaching only have a couple of 1Gb ethernet ports, which would suck for storage. However, they also have QDR Infiniband: even though undergraduates can't run multinode jobs, the Lustre storage on the old cluster was delivered over Infiniband, so those nodes got Infiniband cards anyway.

> Both compute clusters join the storage cluster using a multicluster
> setup. There is no need for the compute clusters to see each other; they
> only need to see the storage cluster. One of the clusters uses the
> 10G interface, the other the IPoIB interface.
> You need at least three quorum nodes in each compute cluster, but if
> licensing is per drive on the DSS, that is covered.
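
For reference, the multicluster join being described is roughly the usual
mmauth / mmremotecluster / mmremotefs sequence. A sketch of it below; the
cluster names, contact nodes and filesystem name are made up, and since the
two halves run on different clusters the script just prints the sequence:

#!/usr/bin/env python3
# Rough sketch of a GPFS multicluster join; all names are hypothetical.
# Nothing is executed because the two halves run on different clusters.

on_storage_cluster = [
    "mmauth genkey new",
    "mmauth update . -l AUTHONLY",
    # exchange the public keys out of band, then:
    "mmauth add compute1.example.ac.uk -k /tmp/compute1_id_rsa.pub",
    "mmauth grant compute1.example.ac.uk -f gpfs0",
]

on_each_compute_cluster = [
    "mmauth genkey new",
    "mmremotecluster add storage.example.ac.uk -n dssg1,dssg2 "
    "-k /tmp/storage_id_rsa.pub",
    "mmremotefs add gpfs0 -f gpfs0 -C storage.example.ac.uk -T /gpfs",
    "mmmount gpfs0 -a",
]

for label, cmds in (("storage cluster", on_storage_cluster),
                    ("each compute cluster", on_each_compute_cluster)):
    print(f"# on the {label}:")
    for cmd in cmds:
        print("   ", cmd)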

Three clusters is starting to get complicated from an admin perspective. The biggest issue is coordinating maintenance and keeping sufficient quorum nodes up.

Maintenance on compute nodes is done via the job scheduler. I know some people think this is crazy, but it is in reality extremely elegant.

We can schedule a reboot on a node as soon as the current job has finished (usually used for firmware upgrades). Or we can schedule a job to run as root (usually for applying updates) as soon as the current job has finished. As such we have no way of knowing when that will be for a given node, and there is a potential for all three quorum nodes to be down at once.
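
In practice that boils down to something like the sketch below, which assumes
a Slurm-style scheduler; the node name and update script path are made up.
The scontrol reboot ASAP call does the reboot-when-free part, and the sbatch
call queues an update job behind whatever is currently running:

#!/usr/bin/env python3
# Rough sketch, assuming a Slurm-style scheduler. The node name and the
# update script path are hypothetical.
import subprocess
import sys

node = sys.argv[1] if len(sys.argv) > 1 else "node001"

# Reboot the node as soon as the current job finishes (firmware upgrades)
# and put it back into service automatically afterwards.
subprocess.run(["scontrol", "reboot", "ASAP", "nextstate=RESUME",
                "reason=firmware upgrade", node], check=True)

# Or queue an update job (run as root, assuming root jobs are permitted)
# that takes the whole node as soon as it is free.
subprocess.run(["sbatch", "--nodelist=" + node, "--exclusive",
                "--job-name=maintenance", "/root/run_updates.sh"],
               check=True)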

Using this scheme we can seamlessly upgrade the nodes safe in the knowledge that a node is either busy and it's running on the current configuration or it has been upgraded and is running the new configuration. Consequently multinode jobs are guaranteed to have all nodes in the job running on the same configuration.

The alternative is to drain the node, but there is only a 23% chance the node becomes free during working hours, which leads to a significant loss of compute time when doing maintenance. Under our existing scheme the loss of compute time is only as long as the upgrade takes to install. Pretty much the only time we have idle nodes is when the scheduler is reserving nodes ready to schedule a multinode job.
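
As a back-of-envelope check (not how the 23% was measured, just assuming job
end times are spread fairly evenly through the week), that sort of figure is
what you would expect, since working hours are only a small fraction of the
168 hours in a week:

# Back-of-envelope only; assumes job end times are uniform over the week.
working_hours = 5 * 8     # hours per week inside working hours
week_hours = 7 * 24       # hours in a week
print(f"{working_hours / week_hours:.0%}")   # ~24%, same ballpark as 23%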

Right now we have a single cluster with the quorum nodes being the two DSS-G nodes and the node used for backup. It is easy to ensure that quorum is maintained on these; they also all run real RHEL, whereas the compute nodes run CentOS.
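
If anyone wants to sanity-check their own quorum layout, a quick sketch along
these lines (again assuming mmlscluster is on the PATH) pulls out the nodes
that carry a quorum designation:

#!/usr/bin/env python3
# Rough sketch: list the nodes carrying a quorum designation.
# Assumes the GPFS mmlscluster command is installed and on the PATH.
import subprocess

out = subprocess.run(["mmlscluster"], capture_output=True, text=True,
                     check=True).stdout
for line in out.splitlines():
    # the Designation column shows 'quorum' or 'quorum-manager'
    if "quorum" in line:
        print(line)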


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
