Thanks Ethan and Tsz-Wo for the detailed feedback. Although we believe this new feature is mature enough to merge back to the master branch, to be conservative and to take others' suggestions into account, a feature flag has been introduced and defaults to false. We will position it as an experimental feature in the next Ozone release (possibly 2.1.0).
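For reference, such a flag would sit in ozone-site.xml like any other HDDS config. The property name below is purely illustrative — please look up the actual key in ozone-default.xml:

```xml
<!-- Hypothetical key shown for illustration only; check ozone-default.xml
     for the real property name before use. -->
<property>
  <name>hdds.datanode.disk.balancer.enabled</name>
  <value>false</value> <!-- experimental feature, off by default -->
</property>
```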
- What happens if the feature flag is enabled, and later disabled?
Since rolling upgrade is not supported, updating the feature flag requires a service restart. After the restart, the disk balancer will not be started.

- If an in-progress container move is interrupted by a downgrade, what does the old version do?
If an in-progress container move is interrupted, there could be a residual container directory under the destination volume's tmp directory, or a RECOVERING or original-state container replica under the destination volume. When the DN is downgraded, the old-version DN is responsible for cleaning up these directories.

- Are components persisting any state about the ongoing operations, and if so, what happens to that during a downgrade then upgrade flow?
Once the disk balancer has started, there is a "diskBalancer.info" file under the DN's metadata directory. It persists the disk balancer configurations set by the user through the CLI. All container move task information is kept in memory, not persisted. If the DN is downgraded, "diskBalancer.info" is left in the DN metadata directory; if the DN is later upgraded again, the disk balancer will reuse this "diskBalancer.info".

- Why does it need centralized coordination? Put another way, why does balancing the disks in datanode A have anything to do with datanode B?
The disk balancer doesn't need SCM coordination. It uses SCM as a proxy to relay commands sent by the user through the CLI to the datanodes, and to query disk balancer status information gathered by SCM from DN heartbeats. In HDFS, the user has to send CLI commands one by one to each datanode to generate a plan, execute the plan, query plan execution status, and cancel the plan. If there are 1000 datanodes, the user has to send the commands to 1000 datanodes one by one (in that case, of course, I believe they would leverage bash scripts to do it).
In Ozone, we leverage SCM to pass commands around and gather information (no information is persisted in SCM), which from a cluster maintainer's point of view is easy to use. Besides, this feature's first big patch was contributed by symious from Shopee. Shopee has run the feature in their production cluster for quite a long time; they maintain both HDFS and Ozone clusters and have shared their experience with the HDFS disk balancer, and the Ozone disk balancer design is based on that.

- Also, the disk balancer is different from the data scanner.
What they have in common is that both consume significant disk IO and compete with the cluster's daily workload, which can slow down daily data read/write jobs. That's why the data scanner has not been enabled internally for a long time. The difference is that the data scanner detects corrupt replicas to guarantee data integrity, which is important and urgent to run, while the disk balancer, although also useful, is not that urgent. Since it competes for IO resources, it's better for the cluster admin to start the service manually; the admin will then be aware that the disk balancer is running on that DN, so a slightly degraded user read/write performance will not cause confusion.

- In DiskBalancerTask, it holds the container read lock. Would it be possible that a file gets deleted when another thread is reading it and results in an exception?
The current chunk reader doesn't hold any container lock. It locates the file to read and then opens it for reading. There is a small window between the block file being located and being opened in which the block file can be deleted because its replica is deleted (by the disk balancer, by container replica deletion due to an excessive replica count, or by unhealthy replica deletion), which would cause the chunk read to fail. Once the block file has been opened, the read will still succeed even if the file is deleted; the OS guarantees that.
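To illustrate the OS guarantee mentioned above — on POSIX systems a deleted file remains readable through an already-open handle, because the inode is only reclaimed when the last open handle closes — here is a small self-contained sketch (the file name and class name are arbitrary):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadAfterUnlink {
    public static void main(String[] args) throws IOException {
        // Stand-in for a block file; write 4 known bytes.
        Path block = Files.createTempFile("block", ".data");
        Files.write(block, new byte[]{1, 2, 3, 4});

        try (FileInputStream in = new FileInputStream(block.toFile())) {
            // Delete the file while a read handle is still open.
            Files.delete(block);

            // On POSIX the inode survives until the last open handle is
            // closed, so the read still sees all 4 bytes.
            int n = in.readAllBytes().length;
            System.out.println("bytes read after delete: " + n);
        }
    }
}
```

This prints "bytes read after delete: 4" on Linux; the race in the reply above is only in the window before open(), not after.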
So regardless of whether DiskBalancerTask holds the container read lock or write lock, this can happen, even when the disk balancer service is not enabled. The reason this read failure doesn't bubble up to the client side is the client's retry mechanism: the client tries the next container replica on the next datanode, so the read will finally succeed. BTW, neither chunk read nor chunk write holds the container lock; only container metadata changes use it. That makes sense from a concurrency point of view, since otherwise reading block A would block writing block B, or writing block A would block writing block B.

For the rest of the feedback, Gargi will help make the changes.

Bests, Sammi Chen

On Tue, 12 Aug 2025 at 01:18, Tsz-Wo Nicholas Sze <szets...@gmail.com> wrote: > Hi Gargi, > > Thanks a lot for the detailed explanation! > > > ... SCM's* *DiskBalancerManager* provides essential *centralized > coordination* ... > > Why does it need centralized coordination? Put it in another way, why does > balancing the disks in datanode A have anything to do with datanode B? > > It seems that DiskBalancer should be similar to the data scanner. It > should just keep working locally. > > > DiskBalancerTask explicitly acquires a container read lock before any > file operations to prevent concurrent container state modifications like > block deletion. > > But the DiskBalancerTask itself will delete the blocks when it is holding > only the readlock. This seems like a problem as below: > 1) DiskBalancerTask has acquired and started copying a container from an > old dir to a new dir. > 2) A reader has retrieved the old dir file path of a block but has not yet > opened the file. > 3) DiskBalancerTask has finished copying. > 4) DiskBalancerTask has deleted the old dir. > 5) The reader opens the file using the old dir file path and gets > FileNotFoundException.
> > Tsz-Wo > > > On Mon, Aug 11, 2025 at 5:47 AM Gargi Jaiswal > <gargi.jais...@cloudera.com.invalid> wrote: > > > Hi Tsz Wo Sze, > > Thank you for the review! > > I have tried to answer a few of the questions, hoping it is clear enough. > > > > *- Why does SCM need a DiskBalancerManager? Should the > DiskBalancerService > > just be a local service in Datanode?* > > *SCM's* *DiskBalancerManager* provides essential *centralized > > coordination *that > > local DataNode services cannot achieve. SCM can identify which DataNodes > > most urgently need disk balancing by calculating volume density sums and > > ranking them. > > > > In HDFS Disk Balancer, the schedule is split into two steps: PLAN and > > EXECUTE. Users first get the balance plan, then according to the plan, > the > > requests are sent to datanodes to do the real balance job. These are two > > commands for the users to run, which is a little complex. So in Ozone, > > Client > > will retrieve the node information from SCM and decide the balance of > > which node or the whole cluster. > > Single point to start/stop/configure disk balancing across multiple > > DataNodes simultaneously. > > *DataNode:* Handles actual container movement (local execution) > > *SCM:* Provides coordination and cluster-wide management. > > > > *- In DiskBalancerTask, it holds the container read lock. Would it be > > possible that a file gets deleted when another thread is reading it and > > results in an exception?* > > No, this scenario won't happen. > > > > - DiskBalancerTask explicitly acquires a container read lock before > any > > file operations to prevent concurrent container state modifications > like > > block deletion. > > - All deletion operations require a write lock on the container > > (BlockDeletingTask, markContainerForDelete, etc.). 
> > - ReadWriteLock semantics guarantee mutual exclusion between read and > > write locks - while DiskBalancerTask holds the read lock during copy > > operations, no deletion thread can acquire the required write lock. > > - The copy operation (copyContainerData) safely reads all container > > files under the protection of the read lock, ensuring no concurrent > > deletions can interfere. > > > > The locking design specifically addresses this concern, as evidenced by > the > > comment in the code: 'hold read lock on the container first, to avoid > other > > threads to update the container state, such as block deletion.' > > > > *- In DefaultContainerChoosingPolicy, it has a CACHE for caching > container > > iterators. Would it cause ConcurrentModificationException if the > > underlying containerMap is modified?* > > *No*, CACHE will NOT cause ConcurrentModificationException when the > > underlying containerMap is modified. Here's why: > > > > - The underlying *containerMap* is a *ConcurrentSkipListMap*, which > > provides fail-safe iterators that don't throw > > ConcurrentModificationException. > > - The *cached iterator* operates on a snapshot created by the stream > > operations (filter + sort), not directly on the live map. This > snapshot > > is > > immune to concurrent modifications. > > - The existing test > > *testContainerDeletionAfterIteratorGeneration()* specifically > > validates this scenario by removing containers after iterator creation > > and > > confirms no exceptions occur. > > - The cache invalidates exhausted iterators, ensuring fresh snapshots > > for future requests. > > > > * - In DefaultVolumeChoosingPolicy.chooseVolume(..), getCurrentUsage() is > > called multiple times so it can return different values.* > > Yes, a valid concurrency issue can happen. The current implementation > under > > the *ReentrantLock* prevents concurrent calls to *chooseVolume()* but > > doesn't prevent the underlying usage values from changing during > execution. 
> > *Agreed, this part needs to be fixed.* > > > > *- Is DiskBalancerService.DISK_BALANCER_TMP_DIR the same as > > StorageVolume.TMP_DIR_NAME? If yes, use StorageVolume.TMP_DIR_NAME > instead > > of adding DiskBalancerService.DISK_BALANCER_TMP_DIR.* > > Yes, they both use the same value "tmp". > > > > *Thank you for pointing to the below refactorings, we will do the > changes.* > > - DatanodeDiskBalancerInfoProto should use DatanodeIDProto instead of > > DatanodeDetailsProto > > > > - Remove DatanodeDiskBalancerInfoType since it is never used. > > > > - Remove the new Container.importContainerData(Path containerPath) method > > since it is never used. > > > > - Rename Container.copyContainerData to copyContainerDirectory. > Otherwise, > > it sounds like copy/import the ContainerData class. > > > > - In DiskBalancerService, use ContainerID instead of Long for container > id. > > > > - Do not use Optional for parameters; see > > > > > https://www.reddit.com/r/java/comments/sat1j4/opinions_on_using_optional_as_parameter/ > > > > - Use Objects.requireNonNull(..) instead of > Preconditions.checkNotNull(..). > > > > On Sat, Aug 9, 2025 at 12:19 AM Tsz Wo Sze <szets...@apache.org> wrote: > > > > > Hi Sammi and others, > > > > > > Thanks for working on the Disk Balancer feature! It is going to be > very > > > useful. > > > > > > Some comments: > > > > > > - Why does SCM need a DiskBalancerManager? Should the > > DiskBalancerService > > > just be a local service in Datanode? > > > > > > - In DiskBalancerTask, it holds the container read lock. Would it be > > > possible that a file gets deleted when another thread is reading it and > > > results in an exception? > > > > > > - In DefaultContainerChoosingPolicy, it has a CACHE for caching > container > > > iterators. Would it cause ConcurrentModificationException if the > > > underlying containerMap is modified? 
> > > > > > - In DefaultVolumeChoosingPolicy.chooseVolume(..), getCurrentUsage() is > > > called multiple times so it can return different values. > > > > > > - Is DiskBalancerService.DISK_BALANCER_TMP_DIR the same as > > > StorageVolume.TMP_DIR_NAME? If yes, use StorageVolume.TMP_DIR_NAME > > instead > > > of adding DiskBalancerService.DISK_BALANCER_TMP_DIR. > > > > > > - Do not throw RuntimeException -- how to handle it when it happens? > > > > > > - Since this is a new feature, how about not supporting SCHEMA_V1/V2 ? > > If > > > we want to support them, we should add some tests. > > > > > > - DatanodeDiskBalancerInfoProto should use DatanodeIDProto instead of > > > DatanodeDetailsProto > > > > > > - Remove DatanodeDiskBalancerInfoType since it is never used. > > > > > > - Remove the new Container.importContainerData(Path containerPath) > method > > > since it is never used. > > > > > > - Rename Container.copyContainerData to copyContainerDirectory. > > Otherwise, > > > it sounds like copy/import the ContainerData class. > > > > > > - In DiskBalancerService, use ContainerID instead of Long for container > > id. > > > > > > - Do not use Optional for parameters; see > > > > > > > > > https://www.reddit.com/r/java/comments/sat1j4/opinions_on_using_optional_as_parameter/ > > > > > > - Use Objects.requireNonNull(..) instead of > > Preconditions.checkNotNull(..). > > > > > > Regards, > > > Tsz-Wo > > > > > > > > > On Fri, Aug 8, 2025 at 9:57 AM Ethan Rose <er...@apache.org> wrote: > > > > > > > Hi Sammi, thanks for working on this feature. > > > > > > > > There should not be any incompatible changes introduced with this > > > feature. > > > > > > > > This section needs more information because it only talks about a > > feature > > > > flag, which is orthogonal to any upgrade/downgrade concerns because > > > feature > > > > flags do not prevent downgrades like finalization does. > > > > > > > > - What happens if the feature flag is enabled, and later disabled? 
> > > > - If an in-progress container move is interrupted by a downgrade, > > what > > > > does the old version do? > > > > - Are components persisting any state about the ongoing > operations, > > > and > > > > if so, what happens to that during a downgrade then upgrade flow? > > > > > > > > The disk balancer has a feature flag that currently defaults to > false. > > In > > > > this > > > > comment < > > > https://github.com/apache/ozone/pull/8869#issuecomment-3130009094 > > > > > > > > > I explained why feature flags for features that are only triggered by > > > > admins are redundant. I’ve heard other devs from HDFS also express > > > concern > > > > over everything in that project being behind feature flags and trying > > to > > > > avoid sending Ozone down that path as well. IMO if we want to > indicate > > > that > > > > this is not ready for prod use, it would be better to hide the CLI > > > command > > > > < > https://picocli.info/quick-guide.html#_hidden_options_and_parameters> > > > and > > > > add a disclaimer to the help message indicating this. The current > > config > > > > flag approach provides no user facing information as to why it is > > > disabled > > > > by default or what the implications of enabling it are. Additionally, > > why > > > > is this set tofalse by default but considered ready to merge with > > master? > > > > > > > > Ethan > > > > > > > > On Fri, Aug 8, 2025 at 4:53 AM Sammi Chen <sammic...@apache.org> > > wrote: > > > > > > > > > Hi Ozone developers, > > > > > > > > > > I would like to propose merging the HDDS-5713 Disk Balancer feature > > > > branch > > > > > into master. > > > > > > > > > > HDDS-5713 adds the support of the disk volume utilization balancer > > > > function > > > > > by selecting and moving containers from high utilized data volume > to > > > > lower > > > > > utilized data volume, to achieve an all disk volumes' utilization > > even > > > > > state. 
> > > > > > > > > > > > > > > Feature Jira Link: > > > > > https://issues.apache.org/jira/browse/HDDS-5713 > > > > > > > > > > > > > > > Checklist for feature branch merge: > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/OZONE/Disk+Balancer+For+Datanode+-+HDDS-5713 > > > > > < > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/OZONE/Disk+Balancer+For+Datanode+-+HDDS-5713 > > > > > > > > > > > > > > > > > > > > > This vote will be open for at least a week. > > > > > > > > > > Thanks, > > > > > Sammi Chen > > > > > > > > > > > --------------------------------------------------------------------- > > > > > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org > > > > > For additional commands, e-mail: dev-h...@ozone.apache.org > > > > > > > > > > > > > > >