fapifta commented on PR #4194:
URL: https://github.com/apache/ozone/pull/4194#issuecomment-1442141428

   You're welcome!
   In the meantime one thought does bother me a bit, and would like to 
understand the intent for that...
   
   In the SecretKeyManager and the SecretKeyState, I have some doubts around 
the mechanism that generates a new key and does the rotation... it is hard to 
tell what exactly, it is just a bad feeling so far, so let me think out loud 
here:
   If I understand correctly, SecretKeyManager will decide it should rotate, 
once the current key is older than 1 day.
   Once the rotation happens eventually after 1d, the 
SecretKeyManager#checkAndRotate() method creates a new list of secret keys, it 
filters the oldest expired key from the list, and SecretKeyState#updateKeys is 
called with the keys in a sorted order, where the newest key is the first item 
in the list.
   This call to updateKeys is replicated to the other SCMs, so all SCMs 
eventually will have the new set of keys.
   
   So far this is clean on its own, the confusion on my side comes after this. 
The way of implementation suggests that the second proposal the pull model will 
be used on the DNs.
   How this will look like?
   I guess if the DN runs into a token, that is newer then the newest key it 
has, it will pull the list of the keys from SCM. While OMs will pull 
periodically from SCM.
   If my understanding is correct, then operations wise we are good, and 
probably my mind stuck with the push model and the pre-generated and 
distributed next key.
   
   
   
   During the thought experiment around this, I started to feel that it might 
not be correct to have a SecretKeyState that gets its internal representation 
from outside, and also misses proper encapsulation... With the current code 
this might not be a problem, but here are the concerns:
   1. the internal list is guarded by locking, but its reference is leaking via 
the getSortedKeys() method.
   2. updateKeys can override the full internal state
   
   1. is easy to fix, and probably it is just some automatism where usually I 
don't think about this concern either for the first sight
   2. on the other hand there might be a more interesting question... Let's do 
some thought experiment:
   Let's say we have one SCM (SCM1) that is stopped for an extended period of 
time, and its raft logs are so old that when it gets back in service it 
downloads a new snapshot, and since that snapshot there was no rotation 
happening. At this time SCM2 is the leader, but after SCM1 comes up and 
syncronizes its data and becomes a proper follower, SCM2 goes down before there 
was a rotation of secret keys, and during election SCM1 becomes the elected 
leader.
   
   In this scenario will SCM1 have the secretKeys properly? As based on the 
code my assumption is that in this case SCM1 will have 1 secret key in its list 
generated at startup (let's assume every key that was persisted has expired 
already), with that it will issue tokens, and DNs will be able to validate 
those tokens, but once keys are rotated, the list of old keys will be 
overridden in all SCMs.
   
   After this has happened, let's assume a DN goes down, and comes up due to a 
restart for any reason. In this case if there is a client that has a token 
issued by SCM2 back when it was the leader, was not be able to read data from 
the restarted DN, as that DN will not be able to fetch the formerly valid token 
from anywhere, while those DNs that have the token in memory will work fine 
further causing intermittent read failures.
   
   Do I miss something, or this is a legit scenario with this approach of 
managing the secret keys?
   I see this happening because the keys are managed in a file, not in rocksDB, 
and so when SCM1 downloads the snapshot of the data to replay logs from that 
snapshot, it might not replay the transaction that updates the file, and the 
file is not in the snapshot, but putting the secret keys to rocksDB with that 
to the snapshot will hurt us later when it comes to conforming with FISMA  for 
example.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to