Maybe give a vote for this one: https://ideas.ibm.com/ideas/GPFS-I-652
> Encryption - tool to check health status of all configured encryption servers
>
> When encryption is configured on a file system, the key server must be available to allow user file access. When the key server fails, data access is lost. We need a tool that can be run to check key server health, check retrieval of keys, and communication health. This should be independent of mmfsd. Inclusion in mmhealth would be ideal.

Planned for future release...

-jf

On Fri, Aug 18, 2023 at 11:11 AM Alec <[email protected]> wrote:

Hmm... IBM mentions in the 5.1.2 documentation that for performance we could just rotate the order of the key servers to load-balance key requests; however, because of server maintenance I would imagine all the nodes end up on the same server eventually.

But I think I see a solution. If I just define 4 additional RKM configs, each one with a single key server, and don't do anything else with them, I am guessing that GPFS is going to monitor and complain about them if they go down. And that is easy to test...

So RKM.conf with:

RKM_PROD {
  kmipServerUri1 = node1
  kmipServerUri2 = node2
  kmipServerUri3 = node3
  kmipServerUri4 = node4
}
RKM_PROD_T1 {
  kmipServerUri = node1
}
RKM_PROD_T2 {
  kmipServerUri = node2
}
RKM_PROD_T3 {
  kmipServerUri = node3
}
RKM_PROD_T4 {
  kmipServerUri = node4
}

I could then define 4 files, each with a key from one of the test RKM_PROD_T? groups, to monitor the availability of the individual key servers.

Call it Alec's trust-but-verify HA.

On Fri, Aug 18, 2023, 1:51 AM Alec <[email protected]> wrote:

Okay, so how do you know the backup key servers are actually functioning until you try to fail over to them? We need a way to know they are actually working.

Setting encryptionKeyCacheExpiration to 0 would actually help, in that we shouldn't go down once we are up. But it would suck if we bounce and then find out none of the key servers are working; then we have the same disaster, just a different time to experience it.

Spectrum Scale honestly needs an option to probe and complain about the backup RKM servers. Or, if we could run a command to validate that all keys are visible on all key servers, that could work as well.

Alec

On Fri, Aug 18, 2023, 12:22 AM Jan-Frode Myklebust <[email protected]> wrote:

If a key server goes offline, Scale will just go to the next one in the list -- and give a warning/error about it in mmhealth. Nothing should happen to file system access. Also, you can tune how often Scale needs to refresh the keys from the key server with encryptionKeyCacheExpiration. Setting it to 0 means that your nodes will only need to fetch the key when they mount the file system, or when you change policy.

-jf

On Thu, Aug 17, 2023 at 5:54 PM Alec <[email protected]> wrote:

Yesterday I proposed treating the replicated key servers as 2 different sets of servers: having Scale address two of the RKM servers by one rkmid/tenant/devicegrp/client name, and having a second rkmid/tenant/devicegrp/client name for the 2nd set of servers.

So, define the same cluster of key management servers in two separate stanzas of RKM.conf, an upper and a lower half.

If we do that and the key management team takes one set offline, everything should work, but Scale would think one set of keys is offline and scream.

I think we need an IBM ticket to help vet all that out.

Alec
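A minimal sketch of that split-stanza layout, assuming an SKLM-style "regular setup" (every host name, port, key store path, certificate label, and tenant below is a placeholder, and the exact set of required RKM.conf fields should be checked against the Scale documentation for your release):

    RKM_PROD_A {
      type = ISKLM
      kmipServerUri  = tls://keysrv1.example.com:5696
      kmipServerUri2 = tls://keysrv2.example.com:5696
      keyStore = /var/mmfs/etc/RKMcerts/prod.p12
      passphrase = changeMe
      clientCertLabel = scaleclient
      tenantName = PROD
    }
    RKM_PROD_B {
      type = ISKLM
      kmipServerUri  = tls://keysrv3.example.com:5696
      kmipServerUri2 = tls://keysrv4.example.com:5696
      keyStore = /var/mmfs/etc/RKMcerts/prod.p12
      passphrase = changeMe
      clientCertLabel = scaleclient
      tenantName = PROD
    }

The idea is that the same replicated key material is reachable through two independent RKM IDs, so if the key management team takes one half down for maintenance, the other half keeps serving keys while Scale complains about the unreachable RKM ID.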
On Thu, Aug 17, 2023, 8:11 AM Jan-Frode Myklebust <[email protected]> wrote:

Your second KMIP server doesn't need to have an active replication relationship with the first one — it just needs to contain the same MEK. So you could do a one-time replication/copy between them, and they would not have to see each other anymore.

I don't think having them host different keys will work, as you won't be able to fetch the second key from the one server your client is connected to, and will then be unable to encrypt with that key.

From what I've seen of KMIP setups with Scale, it's a stupidly trivial service. It's just a server that will tell you the key when asked, plus some access control to make sure no one else gets it. Also, MEKs never change… unless you actively change them in the file system policy, and then you could just post the new key to all/both of your independent key servers when you make the change.

-jf

On Wed, Aug 16, 2023 at 23:25 Alec <[email protected]> wrote:

Ed,
Thanks for the response, I wasn't aware of those two commands. I will see if that unlocks a solution. I kind of need the test to work in a production environment, so it can't just be a matter of adding spare nodes to the cluster and fiddling with file systems.

Unfortunately the logs don't indicate when a node has returned to health, only that it's in trouble, and as we patch often we see these messages regularly.

For the second question: we would add a 2nd MEK to each file so that two independent keys from two different RKM pools would be able to unlock any file. This would give us two wholly independent paths to encrypt and decrypt a file.

So I'm looking for a best-practice example from IBM that recommends this, so we don't have a dependency on a single RKM environment.

Alec
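A rough sketch of how that dual-MEK idea could be expressed in the file system policy, assuming two RKM stanzas named RKM_PROD_A and RKM_PROD_B as above (the key IDs are invented placeholders, and the exact KEYS syntax and limits should be verified against the encryption policy rule documentation):

    RULE 'dualKeyEnc' ENCRYPTION 'E_DUAL' IS
        ALGO 'DEFAULTNISTSP800131A'
        KEYS('KEY-0001:RKM_PROD_A', 'KEY-0002:RKM_PROD_B')
    RULE 'applyEnc' SET ENCRYPTION 'E_DUAL'
        WHERE NAME LIKE '%'

Installed with mmchpolicy, a rule like this wraps each new file's FEK with both MEKs, so either RKM environment on its own should be enough to open the file; it only takes effect for files created after the policy is applied.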
On Wed, Aug 16, 2023, 2:02 PM Wahl, Edward <[email protected]> wrote:

> How can we verify that a key server is up and running when there are multiple key servers in an RKM pool serving a single key?

Pretty simple.

- Grab a compute node/client (and mark it offline if needed) and unmount all encrypted file systems.
- Hack the RKM.conf to point to JUST the server you want to test (and maybe a backup).
- Clear all keys: /usr/lpp/mmfs/bin/tsctl encKeyCachePurge all
- Reload the RKM.conf: /usr/lpp/mmfs/bin/tsloadikm run (this is a great command if you need to load new certificates too).
- Attempt to mount the encrypted FS, and then cat a few files.

If you've not set up a 2nd server in your test, you will see quarantine messages in the logs for a bad KMIP server. If it works, you can clear the keys again and see how many were retrieved.

> Is there any documentation or diagram officially from IBM that recommends having 2 keys from independent RKM environments for high availability as best practice that I could refer to?

I am not an IBM-er… but I'm also not 100% sure what you are asking here. Two unrelated SKLM setups? How would you sync the keys? How would this be better than multiple replicated servers?

Ed Wahl
Ohio Supercomputer Center

From: gpfsug-discuss <[email protected]> On Behalf Of Alec
Sent: Wednesday, August 16, 2023 3:33 PM
To: gpfsug main discussion list <[email protected]>
Subject: [gpfsug-discuss] RKM resilience questions testing and best practice

Hello, we are using a remote key server with GPFS, and I have two questions.

First question: How can we verify that a key server is up and running when there are multiple key servers in an RKM pool serving a single key? The scenario is that after maintenance, or periodically, we want to verify that all members of the pool are in service.

Second question: Is there any documentation or diagram officially from IBM that recommends having 2 keys from independent RKM environments for high availability as a best practice that I could refer to?

Alec
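For what it's worth, Ed's single-server check can be strung together into a short command sequence on a sacrificial client node. This is a sketch, not a tested procedure: the file system name (fs_enc), mount point, test file, and the /var/mmfs/etc/RKM.conf location for a regular (non-mmkeyserv) setup are assumptions to adjust for your environment.

    mmumount fs_enc                                # unmount the encrypted FS on this node only
    vi /var/mmfs/etc/RKM.conf                      # leave only the key server under test in the stanza
    /usr/lpp/mmfs/bin/tsctl encKeyCachePurge all   # drop all cached MEKs
    /usr/lpp/mmfs/bin/tsloadikm run                # re-read RKM.conf (and any new certificates)
    mmmount fs_enc
    cat /fs_enc/canary/testfile > /dev/null        # force a key fetch
    grep -i quarantin /var/adm/ras/mmfs.log.latest # any KMIP quarantine messages?
    mmhealth node show                             # any key-server related events?

If the server under test is healthy, the read succeeds and no quarantine messages show up; afterwards, restore the original RKM.conf and reload it the same way before putting the node back into service.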
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
