Gargi-jais11 commented on PR #10109: URL: https://github.com/apache/ozone/pull/10109#issuecomment-4427909615
@sodonnel @ChenSammi Yesterday night I tried implementing above diskbalancer flow but it seems to be complicated and bug prone. **(save old paths → readUnlock → writeLock → update volume/metadata/chunks/db paths → unlock → cleanup Disk1 from saved locals)**. _@sodonnel direction is right in principle: one live Container object means no more updateContainer-style staleness for anything that always goes through getContainer(id) and the container lock._ In practice I found it quite easy to get wrong, especially **Phase C (marking/deleting the old replica):** the live object must never be used for Disk1 cleanup, V3 DB/cache handling needs care, crash recovery between phases has to be spelled out, and the r**eadUnlock → writeLock** gap still needs a post-lock state check . Separately, @ChenSammi already called out the bigger issue: many call sites read `KeyValueContainerData` paths without holding the container lock. In-place updates to several fields are atomic under the write lock, but not automatically safe for those **unsynchronized readers** unless we address visibility and torn reads (e.g. volatile/immutable snapshot) and review a lot of DN code — so the “fix in one place” story is not complete without that wider work. I’m leaning to keep the re-fetch/retry fix for this staleness issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
