Re: [VOTE] Merge HDDS-5713 Disk Balancer into master

Gargi Jaiswal Mon, 11 Aug 2025 05:47:19 -0700

Hi Tsz Wo Sze,
Thank you for the review!
I have tried to answer a few of the questions, hoping it is clear enough.


*- Why does SCM need a DiskBalancerManager?  Should the DiskBalancerService
just be a local service in Datanode?*
*SCM's* *DiskBalancerManager* provides essential *centralized
coordination *that
local DataNode services cannot achieve. SCM can identify which DataNodes
most urgently need disk balancing by calculating volume density sums and
ranking them.

In HDFS Disk Balancer, the schedule is split into two steps: PLAN and
EXECUTE. Users first get the balance plan, then according to the plan, the
requests are sent to datanodes to do the real balance job. These are two
commands for the users to run, which is a little complex. So in Ozone, Client
will retrieve the node information from SCM and decide the balance of
which node or the whole cluster.
Single point to start/stop/configure disk balancing across multiple
DataNodes simultaneously.
*DataNode:* Handles actual container movement (local execution)
*SCM:* Provides coordination and cluster-wide management.

*- In DiskBalancerTask, it holds the container read lock.  Would it be
possible that a file gets deleted when another thread is reading it and
results in an exception?*
No, this scenario won't happen.

   - DiskBalancerTask explicitly acquires a container read lock before any
   file operations to prevent concurrent container state modifications like
   block deletion.
   - All deletion operations require a write lock on the container
   (BlockDeletingTask, markContainerForDelete, etc.).
   - ReadWriteLock semantics guarantee mutual exclusion between read and
   write locks - while DiskBalancerTask holds the read lock during copy
   operations, no deletion thread can acquire the required write lock.
   - The copy operation (copyContainerData) safely reads all container
   files under the protection of the read lock, ensuring no concurrent
   deletions can interfere.

The locking design specifically addresses this concern, as evidenced by the
comment in the code: 'hold read lock on the container first, to avoid other
threads to update the container state, such as block deletion.'

*- In DefaultContainerChoosingPolicy, it has a CACHE for caching container
iterators.  Would it cause ConcurrentModificationException if the
underlying containerMap is modified?*
*No*, CACHE will NOT cause ConcurrentModificationException when the
underlying containerMap is modified. Here's why:

   - The underlying *containerMap* is a *ConcurrentSkipListMap*, which
   provides fail-safe iterators that don't throw
   ConcurrentModificationException.
   - The *cached iterator* operates on a snapshot created by the stream
   operations (filter + sort), not directly on the live map. This snapshot is
   immune to concurrent modifications.
   - The existing test
*testContainerDeletionAfterIteratorGeneration()* specifically
   validates this scenario by removing containers after iterator creation and
   confirms no exceptions occur.
   - The cache invalidates exhausted iterators, ensuring fresh snapshots
   for future requests.

* - In DefaultVolumeChoosingPolicy.chooseVolume(..), getCurrentUsage() is
called multiple times so it can return different values.*
Yes, a valid concurrency issue can happen. The current implementation under
the *ReentrantLock*  prevents concurrent calls to *chooseVolume()* but
doesn't prevent the underlying usage values from changing during execution.
*Agreed, this part needs to be fixed.*

*- Is DiskBalancerService.DISK_BALANCER_TMP_DIR the same as
StorageVolume.TMP_DIR_NAME?  If yes, use StorageVolume.TMP_DIR_NAME instead
of adding DiskBalancerService.DISK_BALANCER_TMP_DIR.*
 Yes, they both use the same value "tmp".

*Thank you for pointing to the below refactorings, we will do the changes.*
- DatanodeDiskBalancerInfoProto should use DatanodeIDProto instead of
DatanodeDetailsProto

- Remove DatanodeDiskBalancerInfoType since it is never used.

- Remove the new Container.importContainerData(Path containerPath) method
since it is never used.

- Rename Container.copyContainerData to copyContainerDirectory.  Otherwise,
it sounds like copy/import the ContainerData class.

- In DiskBalancerService, use ContainerID instead of Long for container id.

- Do not use Optional for parameters; see
https://www.reddit.com/r/java/comments/sat1j4/opinions_on_using_optional_as_parameter/

- Use Objects.requireNonNull(..) instead of Preconditions.checkNotNull(..).

On Sat, Aug 9, 2025 at 12:19 AM Tsz Wo Sze <szets...@apache.org> wrote:

> Hi Sammi and others,
>
> Thanks for working on the Disk Balancer feature!  It is going to be very
> useful.
>
> Some comments:
>
> - Why does SCM need a DiskBalancerManager?  Should the DiskBalancerService
> just be a local service in Datanode?
>
> - In DiskBalancerTask, it holds the container read lock.  Would it be
> possible that a file gets deleted when another thread is reading it and
> results in an exception?
>
> - In DefaultContainerChoosingPolicy, it has a CACHE for caching container
> iterators.  Would it cause ConcurrentModificationException if the
> underlying containerMap is modified?
>
> - In DefaultVolumeChoosingPolicy.chooseVolume(..), getCurrentUsage() is
> called multiple times so it can return different values.
>
> - Is DiskBalancerService.DISK_BALANCER_TMP_DIR the same as
> StorageVolume.TMP_DIR_NAME?  If yes, use StorageVolume.TMP_DIR_NAME instead
> of adding DiskBalancerService.DISK_BALANCER_TMP_DIR.
>
> - Do not throw RuntimeException -- how to handle it when it happens?
>
> - Since this is a new feature, how about not supporting SCHEMA_V1/V2 ?  If
> we want to support them, we should add some tests.
>
> - DatanodeDiskBalancerInfoProto should use DatanodeIDProto instead of
> DatanodeDetailsProto
>
> - Remove DatanodeDiskBalancerInfoType since it is never used.
>
> - Remove the new Container.importContainerData(Path containerPath) method
> since it is never used.
>
> - Rename Container.copyContainerData to copyContainerDirectory.  Otherwise,
> it sounds like copy/import the ContainerData class.
>
> - In DiskBalancerService, use ContainerID instead of Long for container id.
>
> - Do not use Optional for parameters; see
>
> https://www.reddit.com/r/java/comments/sat1j4/opinions_on_using_optional_as_parameter/
>
> - Use Objects.requireNonNull(..) instead of Preconditions.checkNotNull(..).
>
> Regards,
> Tsz-Wo
>
>
> On Fri, Aug 8, 2025 at 9:57 AM Ethan Rose <er...@apache.org> wrote:
>
> > Hi Sammi, thanks for working on this feature.
> >
> > There should not be any incompatible changes introduced with this
> feature.
> >
> > This section needs more information because it only talks about a feature
> > flag, which is orthogonal to any upgrade/downgrade concerns because
> feature
> > flags do not prevent downgrades like finalization does.
> >
> >    - What happens if the feature flag is enabled, and later disabled?
> >    - If an in-progress container move is interrupted by a downgrade, what
> >    does the old version do?
> >    - Are components persisting any state about the ongoing operations,
> and
> >    if so, what happens to that during a downgrade then upgrade flow?
> >
> > The disk balancer has a feature flag that currently defaults to false. In
> > this
> > comment <
> https://github.com/apache/ozone/pull/8869#issuecomment-3130009094
> > >
> > I explained why feature flags for features that are only triggered by
> > admins are redundant. I’ve heard other devs from HDFS also express
> concern
> > over everything in that project being behind feature flags and trying to
> > avoid sending Ozone down that path as well. IMO if we want to indicate
> that
> > this is not ready for prod use, it would be better to hide the CLI
> command
> > <https://picocli.info/quick-guide.html#_hidden_options_and_parameters>
> and
> > add a disclaimer to the help message indicating this. The current config
> > flag approach provides no user facing information as to why it is
> disabled
> > by default or what the implications of enabling it are. Additionally, why
> > is this set tofalse by default but considered ready to merge with master?
> >
> > Ethan
> >
> > On Fri, Aug 8, 2025 at 4:53 AM Sammi Chen <sammic...@apache.org> wrote:
> >
> > > Hi Ozone developers,
> > >
> > > I would like to propose merging the HDDS-5713 Disk Balancer feature
> > branch
> > > into master.
> > >
> > > HDDS-5713 adds the support of the disk volume utilization balancer
> > function
> > > by selecting and moving containers from high utilized data volume to
> > lower
> > > utilized data volume, to achieve an all disk volumes' utilization even
> > > state.
> > >
> > >
> > > Feature Jira Link:
> > > https://issues.apache.org/jira/browse/HDDS-5713
> > >
> > >
> > > Checklist for feature branch merge:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/OZONE/Disk+Balancer+For+Datanode+-+HDDS-5713
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/OZONE/Disk+Balancer+For+Datanode+-+HDDS-5713
> > > >
> > >
> > >
> > > This vote will be open for at least a week.
> > >
> > > Thanks,
> > > Sammi Chen
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> > > For additional commands, e-mail: dev-h...@ozone.apache.org
> > >
> >
>

Re: [VOTE] Merge HDDS-5713 Disk Balancer into master

Reply via email to