Hah! I've been deep into SGE (user, trainer, consultant) for years.

Our setups are pretty similar, but I'm hoping to use the AWS cfnCluster stack (https://github.com/awslabs/cfncluster) because it is officially blessed by AWS and, since it's a CloudFormation template at the end of the day, it's easy to both support and extend. It also does all the hard work (auto-scaling etc.) that I don't want to have to code myself via ansible. Since I'm a consultant, I need to hand off something that my users can easily operate going forward without me.

Your experience using IPA in an HPC environment is very helpful. We also use ansible to automate "ipa-client-install --unattended ..." so scripting the install and remove commands should be pretty straightforward.

Just trying to compare my appreciation for IPA with what I saw on the ground at massive HPC installations, where the operators jumped through hoops to remove network services that could break user info or affect stability. I lost count of the sites where I saw people dumping NIS maps and LDAP directories into plaintext files every 4-6 hours and spreading them across the cluster, simply to remove any chance that a failed NIS/LDAP query could mess up a node, user or job.

Thanks!

Chris

April 13, 2017 at 9:21 AM

Hi Chris,

we face a similar use case day to day, but have moved from AWS to
another cloud provider. Our setup works on both, so I will refer to
AWS here.

We decided...

...to use SGE for our HPC infrastructure
...to recycle network ranges for 100 static IP addresses + 100 static
hostnames
...to use scripts & cronjobs & ansible (depending on "qstat" and "qhost"
output) on the cluster head node to determine how many additional
cluster nodes have to be created as an additional reserve for
"What-if-we-need-more-nodes?" scenarios
...to create cluster nodes via ansible-playbook on AWS from a
pre-defined image, do software installation & configuration via
ansible-playbook, do the IPA domain join via ansible-playbook
("ipa-client-install --domain=<DOMAIN> --mkhomedir
--hostname=<FreeIPA-Client>.<DOMAIN> --ip-address=<FreeIPA-Client IP
address> -p <Join User> -w <Join User's password> --unattended")
...to destroy cluster nodes in two steps via ansible-playbook: 1) run
"ipa-client-install --uninstall" on the node, 2) destroy the cluster
node on AWS via the API
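
The join and removal steps above could be wrapped roughly like this. It is only a sketch: the flags are the ones quoted in the list, while the function names and the example values (domain, host name, IP address, join user) are hypothetical placeholders, and adding "--unattended" to the uninstall call is an assumption so that it does not prompt.

```python
import shlex


def join_command(domain, short_hostname, ip_address, join_user, join_password):
    """Build the argv for the unattended IPA client enrollment quoted above."""
    return [
        "ipa-client-install",
        f"--domain={domain}",
        "--mkhomedir",
        f"--hostname={short_hostname}.{domain}",
        f"--ip-address={ip_address}",
        "-p", join_user,
        "-w", join_password,
        "--unattended",
    ]


def leave_command():
    """Build the argv for step 1 of node removal (IPA unenrollment)."""
    # "--unattended" is an assumption here, added to avoid interactive prompts.
    return ["ipa-client-install", "--uninstall", "--unattended"]


# Hypothetical values, for illustration only.
print(shlex.join(leave_command()))
print(shlex.join(join_command("example.test", "node001", "10.0.0.11",
                              "joinuser", "CHANGE_ME")))
```

Building the command as a list keeps the arguments intact if a value contains spaces, whether the list is handed to subprocess.run() or rendered into a playbook task.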

(Right now, I am working on a bulk creation script for IPA users/groups
so that we can expand our single HPC cluster into several, where we have
the same set of users (~65-100) with a differing suffix in the username,
e.g. "it_ops01", "it_ops20", etc.)
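
The naming part of such a bulk script might look like the sketch below. It assumes the suffix is a zero-padded, two-digit cluster number, which is only a guess based on the "it_ops01" / "it_ops20" examples; the function names and the --first/--last placeholder values are hypothetical.

```python
def cluster_usernames(base_names, cluster_number, width=2):
    """Append a zero-padded cluster number to every base user name."""
    suffix = f"{cluster_number:0{width}d}"
    return [f"{base}{suffix}" for base in base_names]


def user_add_commands(base_names, cluster_number):
    """Build one 'ipa user-add' argv per user of the given cluster."""
    commands = []
    for base in base_names:
        login = cluster_usernames([base], cluster_number)[0]
        # First/last names are placeholders; real values would come
        # from wherever the user list is maintained.
        commands.append([
            "ipa", "user-add", login,
            f"--first={base}",
            f"--last=cluster{cluster_number:02d}",
        ])
    return commands


# Hypothetical base names, for illustration only.
for argv in user_add_commands(["it_ops", "it_dev"], 20):
    print(" ".join(argv))
```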

We're using 2x IPA servers (ESXi VMs, 4GB RAM, 2 CPU) in replication
with another 2x IPA servers (same dimensions) in our main physical
datacenter. We didn't see much impact on the IPA servers during
enrollment/removal of domain hosts. So far, after three months of
operation, we have had several "bad box" scenarios, all of them caused
by problems with SGE. We solved these problems manually, by
removing/adding cluster nodes via SGE commands.

As you can see, I lean toward [Option 1], since it does all the magic
with pre-defined software commands (sge, ansible, ipa cli) instead of
juggling additional scripts for work that the "built-in" commands can
already do. For us, this works best.

Regards,

Gerald

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project
