errose28 commented on PR #3360: URL: https://github.com/apache/ozone/pull/3360#issuecomment-1308066256
Hi @JacksonYao287 thanks for continuing the work on this. I do still have some concerns with this proposal though. > first, in this patch, DELETED container is removed from scm by default(this can be configured). and for those unknown container , no matter it is empty or not, we will deleted it by default(this can be configured). This doesn't address my [original point](https://github.com/apache/ozone/pull/3360#issuecomment-1287340191) that these two configs conflict for some values. I also do not think that config keys should be used to fragment core internal operations like deletion. Ozone only needs one delete path and we should be confident that the one we choose works. If we are not confident in this delete container proposal and are trying to hide it behind a config then that is a sign we should not go forward with the change. > although Deleting a non-empty unknown container by default is an aggressive operation, i think it makes sense here. in development environment, we can set the default operation to log WARN, so it give us a chance to investigate the problem. This assumes the issue shows up in dev first, which may not be the case. There's so many variables in an Ozone deployment it is possible that the dev environments work fine, but the perfect storm of issues does not happen until some production deployment. Having debugged multiple dev and prod Ozone deployments I can confirm this does happen for corner case bugs. This is why I have been advocating for a more cautious approach. > my idea here is after container is closed, only deleteblockLog can shrink the container size, not calculateUsage in containerReport handler. the size of container is reduced only after the deleteTransaction is committed in deleteblockLog. so even if a container replica has some orphan blocks and a bigger size, scm has a definite size of the container. when the size of a container in scm is shrinked to zero, scm will delete all the replicas of the container. We should use block count to determine if the container is empty, as used bytes is not exact like you mentioned. Block count is currently used by SCM to determine if a container is empty. There are some invariants that I believe make it impossible to solve this problem without using Recon: - Deleted block log entries must eventually be deleted. - SCM cannot know all possible replicas at a given time. An older one could be resurrected at any time. - SCM's delete block API used by OM must be idempotent. Given these properties I believe we cannot know the definite size of the container once blocks start getting deleted. We can only know the current size of each replica. For example, say a block is deleted from all known container replicas and SCM's "definite" block count decreases by 1. SCM removes the entry from the block log. Now maybe OM missed the ack for that delete so it resends it. SCM would repeat the same steps, incorrectly decrementing the "definite" block count again. It has no permanent record of which blocks were already deleted. Since old datanodes can be resurrected, they are also not a source of truth for which block deletions have already been issued by SCM. In summary, I feel that my [proposal](https://github.com/apache/ozone/pull/3360#issuecomment-1297759990) has the following advantages: - No config changes Existing config keys remain with their original defaults and functionality. The only difference is that `hdds.scm.unknown-container.action` now only applies to unknown *and empty* containers. No new configs need to be added. No need to define behavior for unintuitive configuration combinations. - Only confirmed empty containers are deleted. This minimizes the risk of data loss due to current code bugs or future regressions that could for any reason cause SCM to "forget" a container, or datanodes to misreport a container. The disadvantage of my proposal is that for the [block deletion corner case](https://github.com/apache/ozone/pull/3360#issuecomment-1119204766) discussed earlier, containers with those blocks will remain in the system. However this is true even today with the current container delete flow, since all replicas must have zero block count for deletion to happen. The orphan block problem is a block problem, not a container problem. IMO this means it should not be solved as an extension of container deletion by deleting non-empty containers. I think orphan block cleanup with Recon is the correct answer to this, even if it may not be implemented in the near future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
