[jira] [Commented] (HDDS-7874) Intermittent timeout in TestHddsSecureDatanodeInit.testCertificateRotation

Jira Fri, 10 Mar 2023 06:05:05 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698948#comment-17698948
 ]


István Fajth commented on HDDS-7874:
------------------------------------

I think I have figured it out...

It seems that even though the CertificateClient interface extends Closeable, 
we are not closing any of the clients anywhere in the code.

So far this was not much of a problem, as only the test cases instantiated the 
client more than once. After HDDS-7453 we started to have 2 instances in the 
services if the certificate was renewed at startup. Though this might cause 
some excess resource usage, it has no or negligible impact.

With HDDS-7339 we introduced a single-threaded scheduled thread pool executor. 
Ideally it should have just one task, scheduled to run once the certificate 
lifetime approaches its end and enters the grace period. Thereafter the task 
is scheduled more often, but if the certificate has been renewed successfully 
it does nothing, and at the next restart the execution is delayed again. (We 
might need to deal with this and reset the schedule, so that the task is 
scheduled only when it is necessary, but for now it seemed to be a good enough 
solution.)
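
To illustrate the pattern described above, here is a minimal sketch. It 
assumes a hypothetical SketchClient class that owns a single-threaded 
scheduled executor and stops it in close(). It is not the actual Ozone 
CertificateClient, and the names, the delay, and the empty renew task are 
made up for the example.

```java
import java.io.Closeable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ClosableClientSketch {

  /** Hypothetical stand-in for a certificate client; not the Ozone class. */
  static class SketchClient implements Closeable {
    // One executor (and one worker thread) per client instance.
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();

    SketchClient(long delayMillis, Runnable renewTask) {
      // Schedule the (hypothetical) renew task for a later point in time.
      executor.schedule(renewTask, delayMillis, TimeUnit.MILLISECONDS);
    }

    boolean isStopped() {
      return executor.isShutdown();
    }

    @Override
    public void close() {
      // Without this call the worker thread stays alive, and it keeps
      // the executor and the pending task reachable.
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // A client that nobody closes keeps its executor running.
    SketchClient leaked = new SketchClient(60_000, () -> { });

    // try-with-resources closes the client when the block ends.
    SketchClient closed;
    try (SketchClient c = new SketchClient(60_000, () -> { })) {
      closed = c;
    }

    System.out.println("leaked stopped: " + leaked.isStopped());
    System.out.println("closed stopped: " + closed.isStopped());

    // Clean up so that the JVM can exit.
    leaked.close();
  }
}
```

In the sketch every client instance that is not closed leaves one more 
executor thread behind, which is the multiplication effect that shows up when 
a test class creates many clients.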

On the other hand, as the client is not closed, this mechanism starts to 
multiply the scheduled threads, especially in this test class. The 
RSAKeyGenerator does not generate the keys fast enough if multiple threads are 
trying to generate keys at once (possibly depleting entropy in the test 
environment), and because it has to wait for entropy, the test times out. 
Having multiple instances of the thread also means that the clients are not 
garbage collected. This intermittent test failure also proved that our locking 
within the renew task is incorrect: multiple clients have multiple separate 
lock instances to lock on, and multiple executors executing these tasks, so at 
the end of the day they can concurrently access the keys and certificate 
material of the service.
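
The locking problem can be shown with a small sketch. The Renewer class and 
the canRunConcurrently helper are hypothetical and only model the situation, 
they are not Ozone code. Sharing one lock between the instances is just one 
possible way to get mutual exclusion, and not necessarily the approach that 
the fix will take.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class RenewLockSketch {

  /** Hypothetical renew task owner that guards its work with a lock. */
  static class Renewer {
    private final Lock lock;

    Renewer(Lock lock) {
      this.lock = lock;
    }

    Lock lock() {
      return lock;
    }
  }

  /**
   * Holds a's lock in the current thread, then checks from another
   * thread whether b's lock can still be acquired. A result of true
   * means that the two renewers do not exclude each other.
   */
  static boolean canRunConcurrently(Renewer a, Renewer b)
      throws InterruptedException {
    a.lock().lock();
    try {
      boolean[] acquired = new boolean[1];
      Thread other = new Thread(() -> {
        if (b.lock().tryLock()) {
          acquired[0] = true;
          b.lock().unlock();
        }
      });
      other.start();
      other.join(); // join makes the write to acquired[0] visible here
      return acquired[0];
    } finally {
      a.lock().unlock();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Separate lock instances: both renewers can be active at once.
    System.out.println(canRunConcurrently(
        new Renewer(new ReentrantLock()),
        new Renewer(new ReentrantLock()))); // prints "true"

    // One shared lock: the second renewer has to wait.
    Lock shared = new ReentrantLock();
    System.out.println(canRunConcurrently(
        new Renewer(shared), new Renewer(shared))); // prints "false"
  }
}
```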

I am working on fixing this by revisiting the execution scheduling, the 
locking, and the client closure pieces of the problem together under HDDS-8134.

> Intermittent timeout in TestHddsSecureDatanodeInit.testCertificateRotation
> --------------------------------------------------------------------------
>
>                 Key: HDDS-7874
>                 URL: https://issues.apache.org/jira/browse/HDDS-7874
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Security, test
>    Affects Versions: 1.4.0
>            Reporter: Attila Doroszlai
>            Assignee: István Fajth
>            Priority: Major
>              Labels: pki
>
> {code}
> Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 35.17 s <<< 
> FAILURE! - in org.apache.hadoop.ozone.TestHddsSecureDatanodeInit
> org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.testCertificateRotation  
> Time elapsed: 16.492 s  <<< ERROR!
> java.util.concurrent.TimeoutException: 
> ...
>       at 
> org.apache.ozone.test.GenericTestUtils.waitFor(GenericTestUtils.java:231)
>       at 
> org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.testCertificateRotation(TestHddsSecureDatanodeInit.java:330)
> {code}
> * 
> https://github.com/adoroszlai/ozone-build-results/blob/master/2023/01/06/19380/unit/hadoop-hdds/container-service/org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.txt
> * 
> https://github.com/adoroszlai/ozone-build-results/blob/master/2023/01/12/19500/unit/hadoop-hdds/container-service/org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.txt
> * 
> https://github.com/adoroszlai/ozone-build-results/blob/master/2023/02/02/19864/unit/hadoop-hdds/container-service/org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.txt
> Also:
> {code}
>       at 
> org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.testCertificateRotationRecoverableFailure(TestHddsSecureDatanodeInit.java:432)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
