Ethan Rose created HDDS-13896:
---------------------------------

             Summary: Slow failure of metadata volume can cause datanode 
startup to hang indefinitely without logging
                 Key: HDDS-13896
                 URL: https://issues.apache.org/jira/browse/HDDS-13896
             Project: Apache Ozone
          Issue Type: Bug
          Components: Ozone Datanode
            Reporter: Ethan Rose
            Assignee: Ethan Rose


A {{RunningDatanodeState}} instance does not use the same {{ExecutorService}} 
and {{CompletionService}} across its lifetime. This causes a bug where a 
{{RuntimeException}} thrown out of {{VersionEndpointTask}} can be dropped 
without logging if the heartbeat timeout has elapsed and a different 
{{RunningDatanodeState}} + {{CompletionService}} is being polled than the 
instance that threw the exception. One example we observed:
* After a disk hang, Ratis is unable to read logs from the metadata directory 
while starting the Ratis server.
* Ratis throws an unchecked {{IllegalStateException}} or similar when this happens.
* This exception, which took longer than the heartbeat timeout to show up due 
to the disk stall, exits {{OzoneContainer#start}} but is not logged.
** Due to the locking mechanism in {{OzoneContainer#start}}, no retries can 
make progress in the method.
** Jstacks will show all SCM heartbeat threads in the datanode blocked at the 
top of {{OzoneContainer#start}}, but the system will not log any errors.
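
The dropped-exception mechanism described above can be reproduced in isolation. The following is a minimal sketch using only {{java.util.concurrent}}, not the actual Ozone code: {{StateInstance}}, {{failureObserved}}, and the timings are hypothetical stand-ins chosen for illustration. A task fails after the simulated heartbeat timeout has elapsed, and the caller then polls either the original {{CompletionService}} or a freshly created one.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class DroppedExceptionSketch {

    /** Hypothetical stand-in for one state instance owning its own services. */
    static final class StateInstance {
        final ExecutorService executor = Executors.newSingleThreadExecutor();
        final CompletionService<String> completions =
            new ExecutorCompletionService<>(executor);
    }

    /** Returns true only if the poller actually sees the task's exception. */
    static boolean failureObserved(boolean pollSameInstance) {
        StateInstance first = new StateInstance();
        StateInstance second = new StateInstance();
        try {
            // The task fails slowly, like a stalled disk read (assumed 200 ms).
            first.completions.submit(() -> {
                Thread.sleep(200);
                throw new IllegalStateException("simulated slow disk failure");
            });

            // Simulated heartbeat timeout (assumed 50 ms): nothing has finished yet.
            first.completions.poll(50, TimeUnit.MILLISECONDS);

            // Next round: poll either the original service or a new one.
            StateInstance polled = pollSameInstance ? first : second;
            Future<String> done = polled.completions.poll(500, TimeUnit.MILLISECONDS);
            if (done == null) {
                // The failed Future sits in the other service's queue, unseen.
                return false;
            }
            try {
                done.get();
                return false;
            } catch (ExecutionException e) {
                // The task's exception surfaces only through Future#get.
                return e.getCause() instanceof IllegalStateException;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.executor.shutdownNow();
            second.executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("same instance polled: " + failureObserved(true));
        System.out.println("new instance polled:  " + failureObserved(false));
    }
}
```

In this sketch the exception is only visible when the same {{CompletionService}} that accepted the task is polled again. When a new one is polled, the failed {{Future}} is never retrieved, so nothing is logged.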

We should treat all exceptions thrown from {{OzoneContainer#start}} as fatal, 
since the operations performed there, such as starting servers, are not 
idempotent.
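
One possible shape for this, as a rough sketch rather than the proposed patch: wrap the start action so that any throwable is logged immediately and handed to a fatal handler instead of being left for a later poll or retry. {{startOrFail}} and the {{fatalHandler}} callback are hypothetical names. In a real datanode the handler would presumably terminate the process, while here it only records the error.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

final class FatalStartSketch {

    /**
     * Runs a non-idempotent start action once. Any throwable is logged and
     * passed to the fatal handler; the action is never retried.
     * Returns true if the start action completed without throwing.
     */
    static boolean startOrFail(Runnable startAction, Consumer<Throwable> fatalHandler) {
        try {
            startAction.run();
            return true;
        } catch (Throwable t) {
            // Log at the point of failure so the error cannot be dropped later.
            System.err.println("Fatal error during start: " + t);
            fatalHandler.accept(t);
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> recorded = new AtomicReference<>();
        boolean started = startOrFail(
            () -> { throw new IllegalStateException("simulated Ratis startup failure"); },
            recorded::set); // a real handler might terminate the JVM instead
        System.out.println("started=" + started + ", recorded=" + recorded.get());
    }
}
```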



