sodonnel opened a new pull request, #3723:
URL: https://github.com/apache/ozone/pull/3723
## What changes were proposed in this pull request?

For Ratis, the number of replicas which must remain available when a node goes into maintenance is a simple integer, `hdds.scm.replication.maintenance.replica.minimum`, defaulting to 2. This means that for a Ratis container, one of the 3 nodes can be offline without any replication happening. It can be set to 1, letting two nodes go offline, or to 3, ensuring full redundancy and hence replication whenever any node is taken offline. It could be argued that 1 would be a better default here: with the default placement of 2 replicas on one rack and 1 on another, that should allow a full rack to be taken offline without replication.

For EC, it's a little more tricky. Aside from Ratis ONE containers, which are rarely used in practice, EC can tolerate 2 replicas offline (for 3-2), 3 (for 6-3) or 4 (for 10-4). If we use the same default of 2, replication will always be required for 3-2 containers. Also, the "number of replicas online" doesn't make as much sense for EC, as each replica is not identical. EC has a further wrinkle: when any of the data replicas are offline, online reconstruction must be used to read the data, causing a performance penalty, but that cannot be avoided.

If we take the Ratis default of 2: when two replicas out of 3 are online, we have a remaining redundancy of 1, i.e. we can afford to lose one more copy and still read the data. If we change the Ratis setting to 1, the remaining redundancy is 0, because the loss of another replica renders the data unreadable.

For EC, if we default the setting to a "remaining redundancy" of 1, this would mean we can tolerate the loss of 1 more replica and still read the data. This would allow 3-2 to have 1 replica offline, 6-3 to have 2 and 10-4 to have 3 without any replication. In all cases the data redundancy is the same as Ratis with 2 of its 3 replicas online. Additionally, it's highly likely online recovery will be needed to read the data; e.g. if 1 replica is offline in 10-4 there is a 10 in 14 (5 in 7) chance it's a data replica, so trying to keep more replicas online for larger EC groups is probably not going to help performance much.

In a large cluster, EC containers will ideally be spread across racks such that there is only 1 replica per rack, so taking a full rack offline would only reduce the redundancy by 1, meaning even 3-2 containers could tolerate a rack going into maintenance.

In summary, I believe the simplest solution is to have an EC setting hdds.scm.replication.maintenance.ec.remaining.redundancy = 1 which we use for maintenance of EC containers and is basically equivalent to the Ratis default of 2. It may make sense to call the new parameter hdds.scm.replication.maintenance.remaining.redundancy and use the same value for both Ratis and EC, deprecating the old value. For now, this change adds "hdds.scm.replication.maintenance.remaining.redundancy" and notes in the comments / docs that it is for EC only. We should consider how to deprecate the old parameter and bring the two together in another Jira. I am reluctant to call this one `hdds.scm.replication.ec.maintenance.remaining.redundancy`, as then we would have to deprecate two parameters in the future.
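To make the two rules concrete, here is a minimal sketch of the checks described above. This is not the ReplicationManager code from this patch; the class and method names are illustrative, and only the two config keys and their defaults come from the description:

```java
/**
 * Illustrative sketch only, not an Ozone API. Shows how the Ratis
 * absolute-count rule differs from the EC remaining-redundancy rule.
 */
final class MaintenanceRedundancySketch {

  /**
   * Ratis rule: hdds.scm.replication.maintenance.replica.minimum
   * (default 2) is an absolute count of replicas that must stay online.
   */
  static boolean ratisNeedsReplication(int onlineReplicas, int replicaMinimum) {
    return onlineReplicas < replicaMinimum;
  }

  /**
   * EC rule: hdds.scm.replication.maintenance.remaining.redundancy
   * (default 1) is how many further replicas the group must still be
   * able to lose and reconstruct the data. For a d-p group with
   * `offline` replicas offline, the redundancy left is p - offline.
   */
  static boolean ecNeedsReplication(int parity, int offline,
      int remainingRedundancy) {
    return (parity - offline) < remainingRedundancy;
  }

  public static void main(String[] args) {
    // Ratis 3-way with the default minimum of 2: one node in maintenance
    // (2 online) is fine; two in maintenance (1 online) triggers replication.
    System.out.println(ratisNeedsReplication(2, 2)); // false
    System.out.println(ratisNeedsReplication(1, 2)); // true

    // EC with remaining redundancy 1: 3-2 tolerates 1 offline,
    // 6-3 tolerates 2, 10-4 tolerates 3, as described above.
    System.out.println(ecNeedsReplication(2, 1, 1)); // false (3-2, 1 offline)
    System.out.println(ecNeedsReplication(2, 2, 1)); // true  (3-2, 2 offline)
    System.out.println(ecNeedsReplication(4, 3, 1)); // false (10-4, 3 offline)
  }
}
```

With a remaining redundancy of 1, the EC check always leaves exactly one further failure tolerable before the data becomes unreadable, matching the Ratis default of keeping 2 of 3 replicas online.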
## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6975

## How was this patch tested?

Existing tests cover the maintenance counts in the new RM-related classes. I also modified the decommission and maintenance tests to include some EC data, and hence fully test the decommission and maintenance flows with EC data in place.
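Purely as an illustration of how the new key might be overridden in a test, assuming the standard OzoneConfiguration API (the key string is the one introduced by this change; the class name and chosen value are hypothetical):

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

// Hypothetical example, not part of this patch: overriding the new
// EC maintenance redundancy key before starting a test cluster.
public class MaintenanceConfigExample {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();
    // Demand two spare EC replicas during maintenance instead of the
    // default of 1 introduced by this change.
    conf.setInt("hdds.scm.replication.maintenance.remaining.redundancy", 2);
    // Reads back 2; the second argument is the fallback default.
    System.out.println(conf.getInt(
        "hdds.scm.replication.maintenance.remaining.redundancy", 1));
  }
}
```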
