ArafatKhan2198 commented on PR #5445:
URL: https://github.com/apache/ozone/pull/5445#issuecomment-1766631471

   @Pochatkin @sadanand48 thanks for the comments I have mentioned out the 
problem in a detailed manner along with the proposed fix :- 
   ### Cause of the problem ➖
   
   - TestSecureOzoneCluster ➖
       - This test class has a method which is failing intermittently, called 
as `testGetSetRevokeS3Secret` which tests out the revoke feature of Ozone S3 
secrets for a certain user. 
       
https://github.com/apache/ozone/blob/377d6a674f76805d6ed4df640a87159757ed3398/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/TestSecureOzoneCluster.java#L698
       - The code flow for the test method and the problem encountered is going 
like this ➖
           1. A user accessID is created 
           2. It attempts to get an S3 secret for the user accessID. 
           3. It revokes the existing S3 secret.
           4. It attempts to `set` a new S3 secret for the user, which should 
fail and throw an error *`ACCESS_ID_NOT_FOUND`* because the access ID 
(username) has been revoked earlier for that user. In most of the cases it 
throws an error but in a few of them it does not because somehow even though 
the user has been revoked it is still somehow able to access the secret key for 
that user and change it. 
           5. The reason for that was investigated by adding logs to the method 
called as S3SecretManager method `getSecret` which is responsible for checking 
both the cache and the DB for the secret the user requires. 
https://github.com/apache/ozone/blob/377d6a674f76805d6ed4df640a87159757ed3398/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/S3SecretManagerImpl.java#L54
           6. Now when we tried to set a new s3 secret for an earlier revoked 
user. It first proceeded to check the cache but it could find one and returned 
null because the earlier revoke method had invalidated the user accessID and 
S3Secret from the cache hearby completely removing it.➖
               
               `S3SecretValue cacheValue = s3SecretCache.get(kerberosID);`
               
           7. Afterwards it decided to check the DB S3SecretKeyTable for the 
secret for the user and for some reason it was able to fetch one and returned 
it back and was later changed without throwing an error *`ACCESS_ID_NOT_FOUND`* 
like it shoul. The problem here happened was that the DB was not informed about 
the changes in cache that had happened yet, hence this presented an 
inconsistency between the cache and DB. 
           `S3SecretValue result = s3SecretStore.getSecret(kerberosID);`
           8. The reason for that lag was due to the fact that the background 
Flush Thread had not yet commited the changes of the Cache to the DB. This was 
confirmed by the logs I added where we checked the cache and the db for the 
revoked secret key for the user. 
                ```
                  09:44:46,260 INFO  om.S3SecretManagerImpl 
(S3SecretManagerImpl.java:getSecret(59)) - Get secret for 
kerberosID:om/@EXAMPLE.COM from cache:null
                  09:44:46,260 INFO  om.S3SecretManagerImpl 
(S3SecretManagerImpl.java:getSecret(66)) - Get secret for 
kerberosID:om/@EXAMPLE.COM from 
store:ae367331576fa164eaa43fe906014b579e2df3698ef4c744c5bdc88c8dec538f
                ```
   
   ### Fix for the Problem ➖
   
   - Previously, in the earlier RocksDB cache implementation, when a secret was 
revoked, the system retained the access ID record, but the respective secret 
was stored as null. This null value served as an indicator for the 
SecretManager that a specific access ID had been revoked, eliminating the need 
to check the database for that user. However, with the implementation of a 
newer cache using the Guava Google library, the entire record of the access ID 
and the secret key is removed from the cache upon revocation. This change poses 
a challenge for the SecretManager because it can no longer distinguish whether 
a requested secret has been revoked or not.
   - To address this issue, we propose a fix that enables the SecretManager to 
maintain the record even after revocation. Additionally, an identifier, 
**`boolean isDeleted;`**, will be introduced in the S3SecretValue. This 
identifier will be set to `true` whenever a secret has been revoked, and the 
value of the secret itself will be set to null. Consequently, in the future, if 
we attempt to access or set a revoked secret, the cache will promptly notify us 
that the secret has been revoked. This way, we can avoid unnecessary database 
checks, especially when the database has not been updated with the latest data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to