Ethan Rose created HDDS-13896:
---------------------------------
Summary: Slow failure of metadata volume can cause datanode
startup to hang indefinitely without logging
Key: HDDS-13896
URL: https://issues.apache.org/jira/browse/HDDS-13896
Project: Apache Ozone
Issue Type: Bug
Components: Ozone Datanode
Reporter: Ethan Rose
Assignee: Ethan Rose
A {{RunningDatanodeState}} instance does not use the same {{ExecutorService}}
and {{CompletionService}} across its lifetime. This causes a bug where a
{{RuntimeException}} thrown out of {{VersionEndpointTask}} can be dropped
without logging if the heartbeat timeout has elapsed and a new
{{RunningDatanodeState}} + {{CompletionService}} is being polled instead of the
previous instance whose task threw the exception. One example we observed:
* After a disk hang, Ratis is unable to read logs from the metadata directory
while starting the Ratis server.
* Ratis throws an unchecked {{IllegalStateException}} or similar when this happens.
* This exception, which took longer than the heartbeat timeout to show up due
to the disk stall, exits {{OzoneContainer#start}} but is not logged.
** Due to the locking mechanism in {{OzoneContainer#start}}, no retries can
make progress in the method.
** Jstacks will show all SCM heartbeat threads in the datanode blocked at the
top of {{OzoneContainer#start}}, but the system will not log any errors.
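The dropped exception can be illustrated outside Ozone with plain {{java.util.concurrent}}. The following is only a minimal sketch, not datanode code: {{LostExceptionDemo}}, {{slowFailingStartup}}, {{pollForFailure}}, {{simulate}} and the timings are all made up for illustration. It assumes the failure mode is the generic one where a task is submitted through one {{CompletionService}} but a different one is polled after the timeout.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone demo; none of these names come from Ozone.
class LostExceptionDemo {

  // Stands in for a startup task that fails only after a long stall.
  static Void slowFailingStartup() throws InterruptedException {
    Thread.sleep(200);
    throw new IllegalStateException("cannot read logs from metadata dir");
  }

  // Returns the failure cause seen through this service, or null if no
  // completed task shows up within the timeout.
  static Throwable pollForFailure(CompletionService<Void> service,
      long timeoutMs) {
    try {
      Future<Void> done = service.poll(timeoutMs, TimeUnit.MILLISECONDS);
      if (done == null) {
        return null;
      }
      done.get();
      return null;
    } catch (ExecutionException e) {
      return e.getCause();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  // Returns what is observed: {before the timeout, via a new service,
  // via the original service}.
  static Throwable[] simulate() {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      CompletionService<Void> first = new ExecutorCompletionService<>(executor);
      first.submit(LostExceptionDemo::slowFailingStartup);
      // The "heartbeat timeout" (50 ms here) elapses before the task fails.
      Throwable beforeTimeout = pollForFailure(first, 50);

      // A fresh service has its own completion queue, so the failure of the
      // task submitted through the first service never appears in it.
      CompletionService<Void> second = new ExecutorCompletionService<>(executor);
      Throwable viaSecond = pollForFailure(second, 400);

      // Only the original service still holds the failed future.
      Throwable viaFirst = pollForFailure(first, 400);
      return new Throwable[] {beforeTimeout, viaSecond, viaFirst};
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    Throwable[] seen = simulate();
    System.out.println("before timeout:       " + seen[0]);
    System.out.println("via new service:      " + seen[1]);
    System.out.println("via original service: " + seen[2]);
  }
}
```

In this sketch the exception is only observable through the original {{CompletionService}}. A caller that has switched to the new one has nothing to log.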
We should treat all exceptions thrown from {{OzoneContainer#start}} as fatal,
since the operations being done there, like starting servers, are not idempotent.
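One way the "treat as fatal" idea could look is sketched below. This is an assumption-laden illustration, not the proposed patch: {{FatalStartSketch}}, {{fatalOnFailure}} and the {{onFatal}} callback are hypothetical names, and a real datanode would presumably log through its own logger and terminate the process rather than print and record the error.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch; these names do not exist in Ozone.
class FatalStartSketch {

  // Wraps a non-idempotent start action so that any failure is reported
  // at the point where it is thrown, independent of who polls the result.
  static <T> Callable<T> fatalOnFailure(Callable<T> start,
      Consumer<Throwable> onFatal) {
    return () -> {
      try {
        return start.call();
      } catch (Throwable t) {
        // Log first so the failure is visible even if nobody calls get().
        System.err.println("Fatal error during startup: " + t);
        // In a real service this callback would terminate the process.
        onFatal.accept(t);
        throw t;
      }
    };
  }

  public static void main(String[] args) {
    AtomicReference<Throwable> recorded = new AtomicReference<>();
    Callable<Void> guarded = fatalOnFailure(() -> {
      throw new IllegalStateException("server failed to start");
    }, recorded::set);
    try {
      guarded.call();
    } catch (Exception e) {
      System.out.println("handler saw: " + recorded.get());
    }
  }
}
```

Because the handler runs inside the task itself, the failure is reported regardless of which {{CompletionService}} instance is being polled when it finally surfaces.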