Ethan Rose created HDDS-9323:
--------------------------------

             Summary: Better datanode exclude list handling for long-lived clients
                 Key: HDDS-9323
                 URL: https://issues.apache.org/jira/browse/HDDS-9323
             Project: Apache Ozone
          Issue Type: Bug
          Components: Ozone Client
            Reporter: Ethan Rose


Currently, a long-lived client can add most or all of the nodes in a small 
cluster to its exclude list, after which further writes using that client 
instance will fail. There are two ways this can be improved:
 # Add a timeout after which nodes are removed from the exclude list so that 
they can be retried. For EC, this exists and is configured to 10 minutes by 
default. Ratis does not currently have this, but it should be added.
 # Allow the write to fall back to nodes in the exclude list if those are the 
only nodes available. This could be implemented on the server side, or as a 
retry from the client based on the server's initial response.
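
To make the two options above concrete, here is a minimal sketch of how they 
could combine on the client side. It is not the existing Ozone exclude list 
code: the class ExpiringExcludeList, its methods, and the use of plain string 
node IDs are all hypothetical names chosen for illustration, and timestamps 
are passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not the Ozone client's actual exclude list. */
class ExpiringExcludeList {
  // How long a node stays excluded before it may be retried.
  private final long expiryMillis;
  // Node ID -> time (in ms) at which the node was last excluded.
  private final Map<String, Long> excludedAt = new HashMap<>();

  ExpiringExcludeList(long expiryMillis) {
    this.expiryMillis = expiryMillis;
  }

  /** Records that a write to this node failed at the given time. */
  void exclude(String nodeId, long nowMillis) {
    excludedAt.put(nodeId, nowMillis);
  }

  /** Option 1: an entry stops counting once it is older than the timeout. */
  boolean isExcluded(String nodeId, long nowMillis) {
    Long since = excludedAt.get(nodeId);
    if (since == null) {
      return false;
    }
    if (nowMillis - since >= expiryMillis) {
      excludedAt.remove(nodeId); // expired, so the node can be retried
      return false;
    }
    return true;
  }

  /**
   * Option 2: prefer nodes that are not excluded, but fall back to the
   * full node list if every node is currently excluded.
   */
  List<String> selectCandidates(List<String> allNodes, long nowMillis) {
    List<String> preferred = new ArrayList<>();
    for (String node : allNodes) {
      if (!isExcluded(node, nowMillis)) {
        preferred.add(node);
      }
    }
    return preferred.isEmpty() ? new ArrayList<>(allNodes) : preferred;
  }
}
```

With a 10 minute timeout (600000 ms, matching the EC default mentioned 
above), a client that has excluded every node of a three-node cluster would 
still get all three nodes back from selectCandidates, rather than an empty 
list that causes the write to fail.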

These issues are especially relevant for the S3 gateway, which uses a single 
persistent Ozone client to connect to the cluster for as long as it is up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
