Maybe give a vote for this one: https://ideas.ibm.com/ideas/GPFS-I-652
> Encryption - tool to check health status of all configured encryption servers
>
> When encryption is configured on a file system, the key server must be available to allow user file access. When the key server fails, data access is lost. We need a tool that can be run to check key server health, check retrieval of keys, and communication health. This should be independent of mmfsd. Inclusion in mmhealth would be ideal.

Planned for future release...

-jf

On Fri, Aug 18, 2023 at 11:11 AM Alec <[email protected]> wrote:

Hmm... IBM mentions in the 5.1.2 documentation that for performance we could just rotate the order of the key servers to load-balance key requests; however, because of server maintenance I would imagine all the nodes end up on the same server eventually.

But I think I see a solution. If I just define 4 additional RKM configs, each one with a single key server, and don't do anything else with them, I am guessing that GPFS is going to monitor and complain about them if they go down. And that is easy to test...

So RKM.conf with:

RKM_PROD {
  kmipServerUri1 = node1
  kmipServerUri2 = node2
  kmipServerUri3 = node3
  kmipServerUri4 = node4
}
RKM_PROD_T1 {
  kmipServerUri = node1
}
RKM_PROD_T2 {
  kmipServerUri = node2
}
RKM_PROD_T3 {
  kmipServerUri = node3
}
RKM_PROD_T4 {
  kmipServerUri = node4
}

I could then define 4 files, each with a key from one of the test RKM_PROD_T? groups, to monitor the availability of the individual key servers.

Call it Alec's trust-but-verify HA.

On Fri, Aug 18, 2023, 1:51 AM Alec <[email protected]> wrote:

Okay, so how do you know the backup key servers are actually functioning until you try to fail over to them? We need a way to know they are actually working.

Setting encryptionKeyCacheExpiration to 0 would actually help, in that we shouldn't go down once we are up. But it would suck if we bounce and then find out none of the key servers are working; then we have the same disaster, just a different time to experience it.

Spectrum Scale honestly needs an option to probe and complain about the backup RKM servers. Or, if we could run a command to validate that all keys are visible on all key servers, that could work as well.

Alec

On Fri, Aug 18, 2023, 12:22 AM Jan-Frode Myklebust <[email protected]> wrote:

If a key server goes offline, Scale will just go to the next one in the list -- and give a warning/error about it in mmhealth. Nothing should happen to file system access. Also, you can tune how often Scale needs to refresh the keys from the key server with encryptionKeyCacheExpiration. Setting it to 0 means that your nodes will only need to fetch the key when they mount the file system, or when you change policy.

-jf

On Thu, Aug 17, 2023 at 5:54 PM Alec <[email protected]> wrote:

Yesterday I proposed treating the replicated key servers as 2 different sets of servers: having Scale address two of the RKM servers by one rkmid/tenant/devicegrp/client name, and having a second rkmid/tenant/devicegrp/client name for the 2nd set of servers.

So, define the same cluster of key management servers in two separate stanzas of RKM.conf, an upper and a lower half.

If we do that and the key management team takes one set offline, everything should work, but Scale would think one set of keys is offline and scream.

I think we need an IBM ticket to help vet all that out.

Alec
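A minimal sketch of that split-stanza layout, assuming an SKLM-style "regular setup" (every host name, port, key store path, certificate label, and tenant below is a placeholder, and the exact set of required RKM.conf fields should be checked against the Scale documentation for your release):

    RKM_PROD_A {
      type = ISKLM
      kmipServerUri  = tls://keysrv1.example.com:5696
      kmipServerUri2 = tls://keysrv2.example.com:5696
      keyStore = /var/mmfs/etc/RKMcerts/prod.p12
      passphrase = changeMe
      clientCertLabel = scaleclient
      tenantName = PROD
    }
    RKM_PROD_B {
      type = ISKLM
      kmipServerUri  = tls://keysrv3.example.com:5696
      kmipServerUri2 = tls://keysrv4.example.com:5696
      keyStore = /var/mmfs/etc/RKMcerts/prod.p12
      passphrase = changeMe
      clientCertLabel = scaleclient
      tenantName = PROD
    }

The idea is that the same replicated key material is reachable through two independent RKM IDs, so if the key management team takes one half down for maintenance, the other half keeps serving keys while Scale complains about the unreachable RKM ID.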
On Thu, Aug 17, 2023, 8:11 AM Jan-Frode Myklebust <[email protected]> wrote:

Your second KMIP server doesn't need to have an active replication relationship with the first one — it just needs to contain the same MEK. So you could do a one-time replication/copy between them, and they would not have to see each other anymore.

I don't think having them host different keys will work, as you won't be able to fetch the second key from the one server your client is connected to, and will then be unable to encrypt with that key.

From what I've seen of KMIP setups with Scale, it's a stupidly trivial service. It's just a server that will tell you the key when asked, plus some access control to make sure no one else gets it. Also, MEKs never change… unless you actively change them in the file system policy, and then you could just post the new key to all/both of your independent key servers when you make the change.

-jf

On Wed, Aug 16, 2023 at 23:25 Alec <[email protected]> wrote:

Ed,
Thanks for the response, I wasn't aware of those two commands. I will see if that unlocks a solution. I kind of need the test to work in a production environment, so it can't just be a matter of adding spare nodes to the cluster and fiddling with file systems.

Unfortunately the logs don't indicate when a node has returned to health, only that it's in trouble, and as we patch often we see these messages regularly.

For the second question: we would add a 2nd MEK to each file so that two independent keys from two different RKM pools would be able to unlock any file. This would give us two wholly independent paths to encrypt and decrypt a file.

So I'm looking for a best-practice example from IBM that recommends this, so we don't have a dependency on a single RKM environment.

Alec
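A rough sketch of how that dual-MEK idea could be expressed in the file system policy, assuming two RKM stanzas named RKM_PROD_A and RKM_PROD_B as above (the key IDs are invented placeholders, and the exact KEYS syntax and limits should be verified against the encryption policy rule documentation):

    RULE 'dualKeyEnc' ENCRYPTION 'E_DUAL' IS
        ALGO 'DEFAULTNISTSP800131A'
        KEYS('KEY-0001:RKM_PROD_A', 'KEY-0002:RKM_PROD_B')
    RULE 'applyEnc' SET ENCRYPTION 'E_DUAL'
        WHERE NAME LIKE '%'

Installed with mmchpolicy, a rule like this wraps each new file's FEK with both MEKs, so either RKM environment on its own should be enough to open the file; it only takes effect for files created after the policy is applied.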
On Wed, Aug 16, 2023, 2:02 PM Wahl, Edward <[email protected]> wrote:

> How can we verify that a key server is up and running when there are multiple key servers in an RKM pool serving a single key?

Pretty simple.

- Grab a compute node/client (and mark it offline if needed) and unmount all encrypted file systems.
- Hack the RKM.conf to point to JUST the server you want to test (and maybe a backup).
- Clear all keys: /usr/lpp/mmfs/bin/tsctl encKeyCachePurge all
- Reload the RKM.conf: /usr/lpp/mmfs/bin/tsloadikm run (this is a great command if you need to load new certificates too).
- Attempt to mount the encrypted FS, and then cat a few files.

If you've not set up a 2nd server in your test, you will see quarantine messages in the logs for a bad KMIP server. If it works, you can clear the keys again and see how many were retrieved.

> Is there any documentation or diagram officially from IBM that recommends having 2 keys from independent RKM environments for high availability as best practice that I could refer to?

I am not an IBM-er… but I'm also not 100% sure what you are asking here. Two unrelated SKLM setups? How would you sync the keys? How would this be better than multiple replicated servers?

Ed Wahl
Ohio Supercomputer Center

From: gpfsug-discuss <[email protected]> On Behalf Of Alec
Sent: Wednesday, August 16, 2023 3:33 PM
To: gpfsug main discussion list <[email protected]>
Subject: [gpfsug-discuss] RKM resilience questions testing and best practice

Hello, we are using a remote key server with GPFS, and I have two questions.

First question: How can we verify that a key server is up and running when there are multiple key servers in an RKM pool serving a single key? The scenario is that after maintenance, or periodically, we want to verify that all members of the pool are in service.

Second question: Is there any documentation or diagram officially from IBM that recommends having 2 keys from independent RKM environments for high availability as a best practice that I could refer to?

Alec
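For what it's worth, Ed's single-server check can be strung together into a short command sequence on a sacrificial client node. This is a sketch, not a tested procedure: the file system name (fs_enc), mount point, test file, and the /var/mmfs/etc/RKM.conf location for a regular (non-mmkeyserv) setup are assumptions to adjust for your environment.

    mmumount fs_enc                                # unmount the encrypted FS on this node only
    vi /var/mmfs/etc/RKM.conf                      # leave only the key server under test in the stanza
    /usr/lpp/mmfs/bin/tsctl encKeyCachePurge all   # drop all cached MEKs
    /usr/lpp/mmfs/bin/tsloadikm run                # re-read RKM.conf (and any new certificates)
    mmmount fs_enc
    cat /fs_enc/canary/testfile > /dev/null        # force a key fetch
    grep -i quarantin /var/adm/ras/mmfs.log.latest # any KMIP quarantine messages?
    mmhealth node show                             # any key-server related events?

If the server under test is healthy, the read succeeds and no quarantine messages show up; afterwards, restore the original RKM.conf and reload it the same way before putting the node back into service.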
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
