[ https://issues.apache.org/jira/browse/HDDS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698948#comment-17698948 ]
István Fajth commented on HDDS-7874:
------------------------------------
I think I have figured it out.
Even though the CertificateClient interface extends Closeable, we never close
any of the client instances anywhere in the code.
This was not much of a problem so far: at first only test cases instantiated
the client more than once, and after HDDS-7453 the services started to have
two instances if the certificate was renewed at startup. This may cause some
excess resource usage, but its impact is negligible at most.
HDDS-7339 then introduced a single-threaded scheduled thread pool executor.
Ideally it has just one task, scheduled to run once the certificate approaches
the end of its lifetime and enters the grace period. From then on the task
runs more often, but it does nothing if the certificate has already been
renewed successfully, and after the next restart the execution is delayed
again. (We might need to revisit this so that the task is scheduled only when
it is necessary, but for now it seemed to be a good enough solution.)
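
To make the intended scheduling concrete, here is a minimal sketch of how the
initial delay of such a task could be derived. The class and method names are
hypothetical and are not taken from the Ozone code base; the only assumption
is that the grace period is a fixed duration before the certificate's
expiration time.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Ozone: computes how long a renewal check
// should wait before its first run.
final class RenewalDelay {
  private RenewalDelay() { }

  /**
   * Returns the time remaining until the grace period starts, or zero if
   * the certificate is already inside (or past) the grace period.
   */
  static Duration untilGracePeriod(Instant now, Instant notAfter,
      Duration gracePeriod) {
    Instant graceStart = notAfter.minus(gracePeriod);
    return now.isBefore(graceStart)
        ? Duration.between(now, graceStart)
        : Duration.ZERO;
  }
}
```

The result could serve as the initial delay of a periodic task, with a
shorter interval for the repeated checks inside the grace period.
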
Because the client is never closed, however, this mechanism multiplies the
scheduled threads, especially in this test class. The RSAKeyGenerator does not
generate keys fast enough when multiple threads try to generate keys at once
(possibly depleting entropy in the test environment), and because it has to
wait for entropy, the test times out. Having multiple live executor threads
also means that the clients are not garbage collected. This intermittent test
failure also showed that our locking within the renew task is incorrect:
multiple clients have separate lock instances and separate executors running
these tasks, so in the end they can access the keys and certificate material
of the service concurrently.
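
The two problems described above (executors that are never shut down, and one
lock per client) can be illustrated with a small sketch. This is not the
actual CertificateClient implementation and not necessarily what the fix will
look like: the class, the thread name, and the static lock are assumptions
made purely for illustration.

```java
import java.io.Closeable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a certificate client that owns a renewal task.
final class SketchCertClient implements Closeable {
  // Assumption: one lock shared by all instances, so that renewal tasks of
  // different clients cannot touch the key material concurrently.
  private static final ReentrantLock RENEW_LOCK = new ReentrantLock();

  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "cert-renewal-sketch");
        t.setDaemon(true);
        return t;
      });

  SketchCertClient(long initialDelayMs, long intervalMs) {
    executor.scheduleAtFixedRate(this::renewIfNeeded,
        initialDelayMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  private void renewIfNeeded() {
    RENEW_LOCK.lock();
    try {
      // Placeholder: check the certificate lifetime and renew if needed.
    } finally {
      RENEW_LOCK.unlock();
    }
  }

  boolean isStopped() {
    return executor.isShutdown();
  }

  // Without this call the executor thread stays alive and keeps the client
  // reachable, so the instance is never garbage collected.
  @Override
  public void close() {
    executor.shutdownNow();
  }
}
```

With a close() like this, a test can create the client in a
try-with-resources block, so that each test case leaves no renewal thread
behind.
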
I am working on a fix under HDDS-8134 that revisits the execution scheduling,
the locking, and the client closure pieces of the problem together.
> Intermittent timeout in TestHddsSecureDatanodeInit.testCertificateRotation
> --------------------------------------------------------------------------
>
> Key: HDDS-7874
> URL: https://issues.apache.org/jira/browse/HDDS-7874
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Security, test
> Affects Versions: 1.4.0
> Reporter: Attila Doroszlai
> Assignee: István Fajth
> Priority: Major
> Labels: pki
>
> {code}
> Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 35.17 s <<< FAILURE! - in org.apache.hadoop.ozone.TestHddsSecureDatanodeInit
> org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.testCertificateRotation Time elapsed: 16.492 s <<< ERROR!
> java.util.concurrent.TimeoutException:
> ...
> at org.apache.ozone.test.GenericTestUtils.waitFor(GenericTestUtils.java:231)
> at org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.testCertificateRotation(TestHddsSecureDatanodeInit.java:330)
> {code}
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2023/01/06/19380/unit/hadoop-hdds/container-service/org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.txt
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2023/01/12/19500/unit/hadoop-hdds/container-service/org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.txt
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2023/02/02/19864/unit/hadoop-hdds/container-service/org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.txt
> Also:
> {code}
> at org.apache.hadoop.ozone.TestHddsSecureDatanodeInit.testCertificateRotationRecoverableFailure(TestHddsSecureDatanodeInit.java:432)
> {code}