Re: [gpfsug-discuss] Services on DSS/ESS nodes
On 05/10/2020 09:40, Simon Thompson wrote:
>> I now need to check IBM are not going to throw a wobbler down the line
>> if I need to get support before deploying it to the DSS-G nodes :-)
>
> I know there were a lot of other emails about this ... I think you maybe
> want to be careful doing this. Whilst it might work when you set up the
> DSS-G like this, remember that the memory usage you are seeing at this
> point in time may not be what you always need. For example, if you fail
> over the recovery groups you need to have enough free memory to handle
> this - e.g. a node failure or, more likely, upgrading the building blocks.

I think there is a lack of understanding of exactly how lightweight keepalived is. It's the same code as on my routers, which admittedly have different CPUs (MIPS, to be precise), but memory usage (taking out shared memory - libc, for example, is loaded anyway) is under 200KB. A bash shell uses more memory...

> Personally I wouldn't run other things like this on my DSS-G storage
> nodes. We do run e.g. nrpe monitoring to collect and report faults, but
> this is pretty lightweight compared to everything else. They even removed
> support for running the gui packages on the IO nodes - the early DSS-G
> builds used the IO nodes for this, but now you need separate systems for
> this.

And keepalived is in the same range as nrpe, which you do run :-)

I have seen nrpe get out of hand and consume significant amounts of resources on a machine; the machine ground to a halt because of nrpe. One of the standard plugins was failing and sitting there busy-waiting, and every five minutes it ran again. It of course decided to wait until ~7pm on a Friday to go wonky. By mid-morning on Saturday the machine was virtually unresponsive - it took several minutes to get a shell...

I would note that you can run keepalived quite happily on a Ubiquiti EdgeRouter X, which has a dual-core 880 MHz MIPS CPU and 256MB of RAM. Mikrotik have models with similar specs that run it too.
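[The sub-200KB figure is easy to check on any Linux box. A sketch - the helper function name is mine, and it assumes a kernel with /proc/<pid>/smaps_rollup (4.14 or later):]

```shell
# Sum a process's private (unshared) resident memory in KiB.
# smaps_rollup totals Private_Clean + Private_Dirty across all mappings,
# which excludes shared pages such as libc text that is loaded anyway.
private_rss_kib() {
    awk '/^Private_(Clean|Dirty):/ { kib += $2 } END { print kib + 0 }' \
        "/proc/$1/smaps_rollup"
}

# e.g. private_rss_kib "$(pidof keepalived)"
private_rss_kib "$$"   # demo against the current shell
```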
On a dual Xeon Gold 6142 machine the RAM and CPU usage of keepalived is noise.

JAB.

--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
Re: [gpfsug-discuss] Services on DSS/ESS nodes
>> Mixing DSS and ESS in the same cluster is not a supported configuration.
>
> I know, it means you can never ever migrate your storage from DSS to ESS
> without a full backup and restore. Who with any significant amount of
> storage is going to want to do that? The logic behind this escapes me,
> or perhaps in that scenario IBM might relax the rules for the migration
> period.

We do indeed relax the rules temporarily for a migration.

The reasoning behind this rule is support. Many Scale support issues - often the toughest ones - are not about a single node, but about the cluster or network as a whole. So if you have a mix of IBM systems and systems supported by an OEM (this applies to any OEM, by the way, not just Lenovo) and a cluster-wide issue, who are you going to call? (Well, in practice you're going to call IBM, and we'll do our best to help you despite the limits on our knowledge of the OEM systems...)

--CZ

Carl Zetie
Program Director, Offering Management, Spectrum Scale
(919) 473 3318 ][ Research Triangle Park
ca...@us.ibm.com
Re: [gpfsug-discuss] Services on DSS/ESS nodes
Jordi wrote:
> Both compute clusters join using multicluster setup the storage cluster.
> There is no need both compute clusters see each other, they only need to
> see the storage cluster. One of the clusters using the 10G, the other
> cluster using the IPoIB interface. You need at least three quorum nodes
> in each compute cluster but if licensing is per drive on the DSS, it is
> covered.

As a side note: one of the reasons we designed capacity (per Disk or per TB) licensing the way we did was specifically so that you could make this kind of architectural decision on its own merits, without worrying about a licensing penalty.

Carl Zetie
Program Director, Offering Management, Spectrum Scale
(919) 473 3318 ][ Research Triangle Park
ca...@us.ibm.com
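[The multicluster arrangement Jordi describes is configured with GPFS's mmauth/mmremotecluster/mmremotefs machinery. A rough sketch of the flow - all cluster names, node names, filesystem name, and key-file paths below are invented for illustration:]

```shell
# On the owning (storage) cluster - names here are hypothetical:
mmauth genkey new                  # generate the cluster's key pair
mmauth update . -l AUTHONLY        # require authentication from remote clusters
mmauth add compute1.example -k /tmp/compute1_id_rsa.pub
mmauth grant compute1.example -f gpfs01

# On each accessing (compute) cluster:
mmremotecluster add storage.example -n dssg1,dssg2 -k /tmp/storage_id_rsa.pub
mmremotefs add gpfs01 -f gpfs01 -C storage.example -T /gpfs/gpfs01
mmmount gpfs01 -a
```

[Each compute cluster only exchanges keys with the storage cluster, which is why the two compute clusters never need to see each other.]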
Re: [gpfsug-discuss] Services on DSS/ESS nodes
On 05/10/2020 07:27, Jordi Caubet Serrabou wrote:
> Coming to the routing point, is there any reason why you need it ? I
> mean, is this because GPFS trying to connect between compute nodes or
> a reason outside GPFS scope ?
> If the reason is GPFS, imho best approach - without knowledge of the
> licensing you have - would be to use separate clusters: a storage
> cluster and two compute clusters.

The issue is that individual nodes want to talk to one another on the data interface, which caught me by surprise as the cluster is set to admin mode central.

The admin interface runs over ethernet for all nodes on a specific VLAN which is given 802.1p priority 5 (that's Voice: under 10ms latency and jitter). That saved a bunch of switching and cabling, as you don't need the extra interface for the admin traffic. The cabling already significantly restricts airflow in a compute rack as it is, without adding a whole bunch more for a barely used admin interface. To be frank, it's as if the people who wrote the best practice about a separate interface for the admin traffic know very little about networking. This is all last-century technology.

The nodes for undergraduate teaching only have a couple of 1Gb ethernet ports, which would suck for storage use. However, they also have QDR Infiniband. That is because, even though undergraduates can't run multinode jobs, on the old cluster the Lustre storage was delivered over Infiniband, so they got Infiniband cards.

> Both compute clusters join using multicluster setup the storage
> cluster. There is no need both compute clusters see each other, they
> only need to see the storage cluster. One of the clusters using the
> 10G, the other cluster using the IPoIB interface.
> You need at least three quorum nodes in each compute cluster but if
> licensing is per drive on the DSS, it is covered.

Three clusters is starting to get complicated from an admin perspective.
The biggest issue is coordinating maintenance and keeping sufficient quorum nodes up. Maintenance on compute nodes is done via the job scheduler. I know some people think this is crazy, but in reality it is extremely elegant. We can schedule a reboot of a node as soon as the current job has finished (usually used for firmware upgrades), or we can schedule a job to run as root (usually for applying updates) as soon as the current job has finished. As such we have no way of knowing when that will be for a given node, and there is the potential for all three quorum nodes to be down at once.

Using this scheme we can seamlessly upgrade the nodes, safe in the knowledge that a node is either busy and running on the current configuration, or has been upgraded and is running the new configuration. Consequently multinode jobs are guaranteed to have all their nodes running on the same configuration.

The alternative is to drain the node, but there is only a 23% chance the node will become available during working hours, leading to a significant loss of compute time when doing maintenance compared to our existing scheme, where the loss of compute time is only as long as the upgrade takes to install. Pretty much the only time we have idle nodes is when the scheduler is reserving nodes ready to schedule a multinode job.

Right now we have a single cluster, with the quorum nodes being the two DSS-G nodes and the node used for backup. It is easy to ensure that quorum is maintained on these; they also all run real RHEL, whereas the compute nodes run CentOS.

JAB.

--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
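[For what it's worth, the "act as soon as the current job finishes" scheme maps directly onto scheduler features. A sketch assuming Slurm - the thread does not say which scheduler is in use, and the node names are invented:]

```shell
# Hypothetical Slurm example. Each listed node reboots only once it
# becomes idle (i.e. its current job has finished), then automatically
# returns to service - no drain, no idle wait:
scontrol reboot ASAP nextstate=RESUME reason="firmware upgrade" node[001-099]

# For package updates rather than a reboot, a per-node exclusive job
# queued behind the current one serves the same purpose:
sbatch --nodelist=node001 --exclusive --wrap "sudo dnf -y update"
```

[A job is thus always bracketed by a consistent configuration: the node either has not yet been touched or has already been fully updated.]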
Re: [gpfsug-discuss] Services on DSS/ESS nodes
> I now need to check IBM are not going to throw a wobbler down the line
> if I need to get support before deploying it to the DSS-G nodes :-)

I know there were a lot of other emails about this ... I think you maybe want to be careful doing this. Whilst it might work when you set up the DSS-G like this, remember that the memory usage you are seeing at this point in time may not be what you always need. For example, if you fail over the recovery groups you need to have enough free memory to handle this - e.g. a node failure or, more likely, upgrading the building blocks.

Personally I wouldn't run other things like this on my DSS-G storage nodes. We do run e.g. nrpe monitoring to collect and report faults, but this is pretty lightweight compared to everything else. They even removed support for running the gui packages on the IO nodes - the early DSS-G builds used the IO nodes for this, but now you need separate systems for this.

Simon
Re: [gpfsug-discuss] Services on DSS/ESS nodes
Coming to the routing point, is there any reason why you need it? I mean, is this because GPFS is trying to connect between compute nodes, or a reason outside the GPFS scope?

If the reason is GPFS, imho the best approach - without knowledge of the licensing you have - would be to use separate clusters: a storage cluster and two compute clusters. Both compute clusters join the storage cluster using a multicluster setup. There is no need for the compute clusters to see each other; they only need to see the storage cluster. One of the clusters uses the 10G network, the other the IPoIB interface. You need at least three quorum nodes in each compute cluster, but if licensing is per drive on the DSS, it is covered.

--
Jordi Caubet Serrabou
IBM Software Defined Infrastructure (SDI) and Flash Technical Sales Specialist
Technical Computing and HPC IT Specialist and Architect
Ext. Phone: (+34) 679.79.17.84 (internal 55834)
E-mail: jordi.cau...@es.ibm.com

> On 5 Oct 2020, at 08:19, Olaf Weiser wrote:
> [...]
Re: [gpfsug-discuss] Services on DSS/ESS nodes
let me add a few comments from some very successful large installations in Europe

# InterOP

Even though (as Luis pointed to) there is no support statement for running intermixed DSS/ESS in general, it was, is, and will be allowed for short-term purposes, such as e.g. migration. The reason for not supporting mixed DSS/ESS configurations in general is simply driven by the fact that different release versions of DSS/ESS potentially (not in every release, but sometimes) come with different driver levels (e.g. MOFED), OS, RDMA settings, GPFS tuning, etc... Those changes can have multiple impacts and therefore we do not support that in general. Of course - and this would be the advice for everyone - if you are faced with the need to run a mixed configuration, e.g. for a migration and/or because you need to temporarily provide space etc., contact your IBM representative and plan it accordingly. There will likely be some additional requirements/dependencies defined, like driver versions, OS, and/or Scale versions, but you'll get a chance to run a mixed configuration - temporarily, limited to your specific scenario.

# Monitoring

No doubt, monitoring is essential and absolutely needed - and/but IBM wants customers to be very sensitive about what kind of additional software (=workload) gets installed on the ESS IO servers. BTW, this rule applies as well to any other important GPFS node with special roles (e.g. any other NSD server etc.). But given the fact that customers usually manage and monitor their server farms from a central point of control (some 3rd-party software), it is common/best practice that additional monitoring software (clients/endpoints) has to run on GPFS nodes, and so on ESS nodes too.

If that level of acceptance applies to DSS too, you may want to double-check with Lenovo?!

# Additional GW functions

It would be a hot iron to allow routing on IO nodes in general.
Similar to the mixed-support approach, the field variety for such a statement would be hard (==impossible) to manage. As we all agree, additional network traffic can (and in fact will) impact GPFS. In your special case the expected data rates seem to me more than OK, and acceptable to go with your suggested config (as long as workloads remain at that level / monitor it accordingly, as you obviously already are). Again, to be on the safe side... contact your IBM representative and I'm sure you'll find a way.

kind regards
olaf

- Original message -
From: Jonathan Buzzard
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: gpfsug-discuss@spectrumscale.org
Cc:
Subject: [EXTERNAL] Re: [gpfsug-discuss] Services on DSS/ESS nodes
Date: Sun, Oct 4, 2020 12:17 PM

On 04/10/2020 10:29, Luis Bolinches wrote:
> Hi
>
> As stated on the same link you can do remote mounts from each other and
> be a supported setup.
>
> "You can use the remote mount feature of IBM Spectrum Scale to share
> file system data across clusters."

You can, but imagine I have a DSS-G cluster with 2PB of storage on it, which is quite modest in 2020. It is now end of life, and for whatever reason I decide I want to move to ESS instead.

What any sane storage admin wants to do at this stage is set up the ESS, add the ESS nodes to the existing cluster on the DSS-G, then do a bit of mmadddisk/mmdeldisk and sit back while the data is seamlessly moved from the DSS-G to the ESS. Admittedly this might take a while :-)

Then, once all the data is moved, a bit of mmdelnode and bingo, the storage has been migrated from DSS-G to ESS with zero downtime.

As that is not allowed for what I presume are commercial reasons (you could do it in reverse, and presumably that is what IBM doesn't want), then once you are down the rabbit hole of one type of storage you are not going to switch to a different one.

You need to look at it from the perspective of the users. They frankly could not give a monkeys what storage solution you are using. All they care about is having usable storage, and large amounts of downtime to switch from one storage type to another is not really acceptable.

JAB.

--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
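[The zero-downtime migration Jonathan describes would, in outline, be the standard add-disk/delete-disk dance. A sketch only - the filesystem name, stanza file, NSD names, and node names are all invented, and (as discussed above) mixing DSS and ESS in one cluster needs IBM's blessing first:]

```shell
# 1. Add the new ESS NSDs to the existing filesystem (stanza file lists
#    the new disks; all names here are hypothetical):
mmadddisk gpfs01 -F ess_disks.stanza

# 2. Delete the old DSS-G disks; GPFS migrates their data onto the
#    remaining (ESS) disks as part of the deletion, with the filesystem
#    mounted and in use throughout:
mmdeldisk gpfs01 "dss_nsd_01;dss_nsd_02"

# 3. Once the old building block holds no data, remove its servers:
mmdelnode -N dssg-io1,dssg-io2
```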