Hi Tsz Wo Sze, Thank you for the review! I have tried to answer a few of the questions, hoping it is clear enough.
*- Why does SCM need a DiskBalancerManager? Should the DiskBalancerService just be a local service in Datanode?* *SCM's* *DiskBalancerManager* provides essential *centralized coordination *that local DataNode services cannot achieve. SCM can identify which DataNodes most urgently need disk balancing by calculating volume density sums and ranking them. In HDFS Disk Balancer, the schedule is split into two steps: PLAN and EXECUTE. Users first get the balance plan, then according to the plan, the requests are sent to datanodes to do the real balance job. These are two commands for the users to run, which is a little complex. So in Ozone, Client will retrieve the node information from SCM and decide the balance of which node or the whole cluster. Single point to start/stop/configure disk balancing across multiple DataNodes simultaneously. *DataNode:* Handles actual container movement (local execution) *SCM:* Provides coordination and cluster-wide management. *- In DiskBalancerTask, it holds the container read lock. Would it be possible that a file gets deleted when another thread is reading it and results in an exception?* No, this scenario won't happen. - DiskBalancerTask explicitly acquires a container read lock before any file operations to prevent concurrent container state modifications like block deletion. - All deletion operations require a write lock on the container (BlockDeletingTask, markContainerForDelete, etc.). - ReadWriteLock semantics guarantee mutual exclusion between read and write locks - while DiskBalancerTask holds the read lock during copy operations, no deletion thread can acquire the required write lock. - The copy operation (copyContainerData) safely reads all container files under the protection of the read lock, ensuring no concurrent deletions can interfere. The locking design specifically addresses this concern, as evidenced by the comment in the code: 'hold read lock on the container first, to avoid other threads to update the container state, such as block deletion.' *- In DefaultContainerChoosingPolicy, it has a CACHE for caching container iterators. Would it cause ConcurrentModificationException if the underlying containerMap is modified?* *No*, CACHE will NOT cause ConcurrentModificationException when the underlying containerMap is modified. Here's why: - The underlying *containerMap* is a *ConcurrentSkipListMap*, which provides fail-safe iterators that don't throw ConcurrentModificationException. - The *cached iterator* operates on a snapshot created by the stream operations (filter + sort), not directly on the live map. This snapshot is immune to concurrent modifications. - The existing test *testContainerDeletionAfterIteratorGeneration()* specifically validates this scenario by removing containers after iterator creation and confirms no exceptions occur. - The cache invalidates exhausted iterators, ensuring fresh snapshots for future requests. * - In DefaultVolumeChoosingPolicy.chooseVolume(..), getCurrentUsage() is called multiple times so it can return different values.* Yes, a valid concurrency issue can happen. The current implementation under the *ReentrantLock* prevents concurrent calls to *chooseVolume()* but doesn't prevent the underlying usage values from changing during execution. *Agreed, this part needs to be fixed.* *- Is DiskBalancerService.DISK_BALANCER_TMP_DIR the same as StorageVolume.TMP_DIR_NAME? If yes, use StorageVolume.TMP_DIR_NAME instead of adding DiskBalancerService.DISK_BALANCER_TMP_DIR.* Yes, they both use the same value "tmp". *Thank you for pointing to the below refactorings, we will do the changes.* - DatanodeDiskBalancerInfoProto should use DatanodeIDProto instead of DatanodeDetailsProto - Remove DatanodeDiskBalancerInfoType since it is never used. - Remove the new Container.importContainerData(Path containerPath) method since it is never used. - Rename Container.copyContainerData to copyContainerDirectory. Otherwise, it sounds like copy/import the ContainerData class. - In DiskBalancerService, use ContainerID instead of Long for container id. - Do not use Optional for parameters; see https://www.reddit.com/r/java/comments/sat1j4/opinions_on_using_optional_as_parameter/ - Use Objects.requireNonNull(..) instead of Preconditions.checkNotNull(..). On Sat, Aug 9, 2025 at 12:19 AM Tsz Wo Sze <szets...@apache.org> wrote: > Hi Sammi and others, > > Thanks for working on the Disk Balancer feature! It is going to be very > useful. > > Some comments: > > - Why does SCM need a DiskBalancerManager? Should the DiskBalancerService > just be a local service in Datanode? > > - In DiskBalancerTask, it holds the container read lock. Would it be > possible that a file gets deleted when another thread is reading it and > results in an exception? > > - In DefaultContainerChoosingPolicy, it has a CACHE for caching container > iterators. Would it cause ConcurrentModificationException if the > underlying containerMap is modified? > > - In DefaultVolumeChoosingPolicy.chooseVolume(..), getCurrentUsage() is > called multiple times so it can return different values. > > - Is DiskBalancerService.DISK_BALANCER_TMP_DIR the same as > StorageVolume.TMP_DIR_NAME? If yes, use StorageVolume.TMP_DIR_NAME instead > of adding DiskBalancerService.DISK_BALANCER_TMP_DIR. > > - Do not throw RuntimeException -- how to handle it when it happens? > > - Since this is a new feature, how about not supporting SCHEMA_V1/V2 ? If > we want to support them, we should add some tests. > > - DatanodeDiskBalancerInfoProto should use DatanodeIDProto instead of > DatanodeDetailsProto > > - Remove DatanodeDiskBalancerInfoType since it is never used. > > - Remove the new Container.importContainerData(Path containerPath) method > since it is never used. > > - Rename Container.copyContainerData to copyContainerDirectory. Otherwise, > it sounds like copy/import the ContainerData class. > > - In DiskBalancerService, use ContainerID instead of Long for container id. > > - Do not use Optional for parameters; see > > https://www.reddit.com/r/java/comments/sat1j4/opinions_on_using_optional_as_parameter/ > > - Use Objects.requireNonNull(..) instead of Preconditions.checkNotNull(..). > > Regards, > Tsz-Wo > > > On Fri, Aug 8, 2025 at 9:57 AM Ethan Rose <er...@apache.org> wrote: > > > Hi Sammi, thanks for working on this feature. > > > > There should not be any incompatible changes introduced with this > feature. > > > > This section needs more information because it only talks about a feature > > flag, which is orthogonal to any upgrade/downgrade concerns because > feature > > flags do not prevent downgrades like finalization does. > > > > - What happens if the feature flag is enabled, and later disabled? > > - If an in-progress container move is interrupted by a downgrade, what > > does the old version do? > > - Are components persisting any state about the ongoing operations, > and > > if so, what happens to that during a downgrade then upgrade flow? > > > > The disk balancer has a feature flag that currently defaults to false. In > > this > > comment < > https://github.com/apache/ozone/pull/8869#issuecomment-3130009094 > > > > > I explained why feature flags for features that are only triggered by > > admins are redundant. I’ve heard other devs from HDFS also express > concern > > over everything in that project being behind feature flags and trying to > > avoid sending Ozone down that path as well. IMO if we want to indicate > that > > this is not ready for prod use, it would be better to hide the CLI > command > > <https://picocli.info/quick-guide.html#_hidden_options_and_parameters> > and > > add a disclaimer to the help message indicating this. The current config > > flag approach provides no user facing information as to why it is > disabled > > by default or what the implications of enabling it are. Additionally, why > > is this set tofalse by default but considered ready to merge with master? > > > > Ethan > > > > On Fri, Aug 8, 2025 at 4:53 AM Sammi Chen <sammic...@apache.org> wrote: > > > > > Hi Ozone developers, > > > > > > I would like to propose merging the HDDS-5713 Disk Balancer feature > > branch > > > into master. > > > > > > HDDS-5713 adds the support of the disk volume utilization balancer > > function > > > by selecting and moving containers from high utilized data volume to > > lower > > > utilized data volume, to achieve an all disk volumes' utilization even > > > state. > > > > > > > > > Feature Jira Link: > > > https://issues.apache.org/jira/browse/HDDS-5713 > > > > > > > > > Checklist for feature branch merge: > > > > > > > > > https://cwiki.apache.org/confluence/display/OZONE/Disk+Balancer+For+Datanode+-+HDDS-5713 > > > < > > > > > > https://cwiki.apache.org/confluence/display/OZONE/Disk+Balancer+For+Datanode+-+HDDS-5713 > > > > > > > > > > > > > This vote will be open for at least a week. > > > > > > Thanks, > > > Sammi Chen > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org > > > For additional commands, e-mail: dev-h...@ozone.apache.org > > > > > >