Re: [VOTE] Merge HDDS-5713 Disk Balancer into master

Tsz Wo Sze Wed, 20 Aug 2025 11:01:51 -0700

Hi Sammi and others,

Sammi, thanks for filing the new JIRAs!


For HDDS-13599 <https://issues.apache.org/jira/browse/HDDS-13599>, it is a
problem specific to Disk Balancer since it is only moving but not really
deleting blocks.  Balancing should not make the data unavailable.  For the
real delete cases, it is okay since the read can be considered after
delete.  Let's continue the discussion in the JIRA.

We should remove the unused code before merge.  Otherwise, we have to start
unnecessarily maintaining them.
- Remove DatanodeDiskBalancerInfoType since it is never used.
- Remove the new Container.importContainerData(Path containerPath) method
since it is never used.
- The SlidingWindow class is unused except in tests.  Let's remove it.

The other issues can be addressed after the merge.

Tsz-Wo



On Wed, Aug 20, 2025 at 2:23 AM Sammi Chen <[email protected]> wrote:

> Hi Tsz-Wo,
>
> > ... We should have some ways to run disk  balancer without SCM.
> Filed HDDS-13598 <https://issues.apache.org/jira/browse/HDDS-13598>
>
> > ... The case for disk balancer is different since it indeed is moving but
> not deleting the block.  Suppose the datanode contains the only replica of
> the
> container. If a reader fails to read it while the disk balancer is moving
> it.  It will not be able to retry and have to declare read failure.
>
> If the data has only one replica, and the timing is perfect, then this time
> read will fail,  next time read will succeed.  The only way to solve this
> corner case, is to acquire the write lock of all the block files under the
> container directory before deleting all the block files.  Since it will
> generally benefit data read from DN, filed a general JIRA HDDS-13599
> <https://issues.apache.org/jira/browse/HDDS-13599>.
>
>
> On Tue, 19 Aug 2025 at 01:17, Tsz-Wo Nicholas Sze <[email protected]>
> wrote:
>
> > Hi Sammi,
> >
> > Thanks  a lot for your detailed explanation!
> >
> > > ... Shopee has adopted the feature in their production cluster for
> quite
> > a long time ...
> >
> > Good to hear that!  It seems that running disk balancer via SCM should be
> > an optional convenient feature.  We should have some ways to run disk
> > balancer without SCM.
> >
> > > ... While disk balancer, which is also useful, but not that urgent. ...
> >
> > Agree.  It should run as a low priority background service.  It would be
> > great if it could detect and run only if the machine is idle.
> >
> > > ... There is a smaller window between block file found and block file
> > open for read that block file is deleted due to replica is deleted(disk
> > balancer, or container replica deletion due
> > to excessive replica count, or unhealthy replica deletion) which will
> cause
> > chunk read failure. ...
> >
> > The case for disk balancer is different since it indeed is moving but not
> > deleting the block.  Suppose the datanode contains the only replica of
> the
> > container. If a reader fails to read it while the disk balancer is moving
> > it.  It will not be able to retry and have to declare read failure.
> >
> > > ... After the block file is opened, although the block file is deleted,
> > the block data read will still succeed, OS will guarantee that. ...
> >
> > This depends on the OS.  Windows probably will fail the delete.
> >
> > BTW, just found that
> > - the SlidingWindow class is unused except in tests.  Let's remove it.
> >
> > Good work on the feature.  I look forward to merging it.
> >
> > Tsz-Wo
> >
> >
> > On Fri, Aug 15, 2025 at 12:30 AM Sammi Chen <[email protected]>
> wrote:
> >
> > > Thanks Ethan and Tsz-Wo for so detailed feedback.
> > >
> > > As this is a new feature, although we believe it's mature enough to
> merge
> > > back to master branch, from the conservative point of view, and
> > > also considered others' suggestions, the feature flag is introduced and
> > set
> > > to false.  We will position it as an experimental feature in the next
> > Ozone
> > > release(could be 2.1.0).
> > >
> > >    - What happens if the feature flag is enabled, and later disabled?
> > -
> > > Since rolling upgrade is not supported, updating the feature flag
> > requires
> > > service restart, after restart, disk balancer will not be started.
> > >    - If an in-progress container move is interrupted by a downgrade,
> what
> > > does the old version do?  -  If an in-progress container move is
> > > interrupted, there could be residual container directory under
> > destination
> > > volume tmp directory, or RECOVERING or original state container replica
> > > under destination volume.  Then DN is downgraded, the old version DN is
> > > responsible to clean up these directories.
> > >    - Are components persisting any state about the ongoing operations,
> > and
> > > if so, what happens to that during a downgrade then upgrade flow? -
> > There
> > > will be a "diskBalancer.info" under DN's metadata directory once the
> disk
> > > balancer has started. This "diskBalancer.info" persists disk balancer
> > > configurations set by the user through CLI.  All container move task
> > > information is in memory, not persisted.  If DN is downgraded,
> > > "diskBalancer.info" will be left on DN metadata directory.  If DN is
> > > further upgraded,  disk balancer will reuse this "diskBalancer.info.
> > >
> > > Why does it need centralized coordination?  Put it in another way, why
> > does
> > > balancing the disks in datanode A have anything to do with datanode B?
> > >  -   Disk balancer doesn't need SCM coordination.  It uses SCM as a
> proxy
> > > to relay commands sent by user through CLI command to datanode, and
> query
> > > disk balancer status information gathered by SCM from DN heartbeat.  In
> > > HDFS, the user has to send CLI commands one by one to each datanode to
> > > generate a plan, execute plan, query plan execution status, and cancel
> > the
> > > plan. If there are 1000 datanodes, the user has to send the command to
> > 1000
> > > Datanodes one by one(of course, in this case, I believe they will
> > leverage
> > > bash scripts to do that).  In Ozone, we leverage SCM to help pass
> around
> > > commands and gather information(there is no information persisted in
> > SCM),
> > > which from cluster maintainer point of view, is easy to use.  Besides,
> > this
> > > feature's first big patch is contributed by symious from Shopee. Shopee
> > has
> > > adopted the feature in their production cluster for quite a long time,
> > and
> > > they are also responsible for maintaining their HDFS and Ozone
> clusters,
> > > they also shared their experience about HDFS disk balancer, so based on
> > > that, we have this Ozone disk balancer design.
> > > - Also the disk balancer is different from the data scanner. The same
> > part
> > > is both will consume and compete a lot disk IO with cluster's daily
> job,
> > > could slow down the daily data read/write job.  That's why the disk
> > scanner
> > > is not enabled internally for a long time. The different part is that
> the
> > > data scanner detects the corrupt replica to guarantee the data
> integrity,
> > > which is very important and urgent to run.  While disk balancer, which
> is
> > > also useful, but not that urgent.  Considering it will compete the IO
> > > resource, it's better to start the service manually by cluster admin,
> so
> > > that admin will be aware that disk balancer is running on that DN,
> then a
> > > slightly degraded user data read/write performance will not cause
> > confusion
> > > to admin.
> > >
> > >
> > > - In DiskBalancerTask, it holds the container read lock.  Would it be
> > > possible that a file gets deleted when another thread is reading it and
> > > results in an exception?
> > >
> > > Current chunk reader doesn't hold any container lock.  It finds the
> read
> > > file, and then opens the file for read.  There is a smaller window
> > between
> > > block file found and block file open for read that block file is
> deleted
> > > due to replica is deleted(disk balancer, or container replica deletion
> > due
> > > to excessive replica count, or unhealthy replica deletion) which will
> > cause
> > > chunk read failure. After the block file is opened, although the block
> > file
> > > is deleted, the block data read will still succeed, OS will guarantee
> > > that.  So no matter DiskBalancerTask holds read lock or write lock of
> > > container, this will happen, even when disk balancer service is not
> > > enabled.  The reason why this read failure doesn't bubble up on the
> > client
> > > side, is that the client has a retry mechanism, client will try the
> next
> > > container replica on the next datanode, so the read will finally
> succeed.
> > > BTW, both chunk read and chunk write don't hold the container lock.
> Only
> > > the container metadata change will use the container lock.  It makes
> > sense
> > > from concurrency point of view, otherwise, read block A will block
> write
> > > block B, or write block A will block write block B.
> > >
> > > For the rest of the feedback, Gargi will help to do the changes.
> > >
> > >
> > > Bests,
> > > Sammi Chen
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Tue, 12 Aug 2025 at 01:18, Tsz-Wo Nicholas Sze <[email protected]>
> > > wrote:
> > >
> > > > Hi Gargi,
> > > >
> > > > Thanks a lot for the detailed explanation!
> > > >
> > > > > ... SCM's* *DiskBalancerManager* provides essential *centralized
> > > > coordination* ...
> > > >
> > > > Why does it need centralized coordination?  Put it in another way,
> why
> > > does
> > > > balancing the disks in datanode A have anything to do with datanode
> B?
> > > >
> > > > It seems that DiskBalancer should be similar to the data scanner.  It
> > > > should just keep working locally.
> > > >
> > > > > DiskBalancerTask explicitly acquires a container read lock before
> any
> > > > file operations to prevent concurrent container state modifications
> > like
> > > > block deletion.
> > > >
> > > > But the DiskBalancerTask itself will delete the blocks when it is
> > holding
> > > > only the readlock.  This seems like a problem as below:
> > > > 1) DiskBalancerTask has acquired and started copying a container from
> > an
> > > > old dir to a new dir.
> > > > 2) A reader has retrieved the old dir file path of a block but has
> not
> > > yet
> > > > opened the file.
> > > > 3) DiskBalancerTask has finished copying.
> > > > 4) DiskBalancerTask has deleted the old dir.
> > > > 5) The reader opens the file using the old dir file path and gets
> > > > FileNotFoundException.
> > > >
> > > > Tsz-Wo
> > > >
> > > >
> > > > On Mon, Aug 11, 2025 at 5:47 AM Gargi Jaiswal
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi Tsz Wo Sze,
> > > > > Thank you for the review!
> > > > > I have tried to answer a few of the questions, hoping it is clear
> > > enough.
> > > > >
> > > > > *- Why does SCM need a DiskBalancerManager?  Should the
> > > > DiskBalancerService
> > > > > just be a local service in Datanode?*
> > > > > *SCM's* *DiskBalancerManager* provides essential *centralized
> > > > > coordination *that
> > > > > local DataNode services cannot achieve. SCM can identify which
> > > DataNodes
> > > > > most urgently need disk balancing by calculating volume density
> sums
> > > and
> > > > > ranking them.
> > > > >
> > > > > In HDFS Disk Balancer, the schedule is split into two steps: PLAN
> and
> > > > > EXECUTE. Users first get the balance plan, then according to the
> > plan,
> > > > the
> > > > > requests are sent to datanodes to do the real balance job. These
> are
> > > two
> > > > > commands for the users to run, which is a little complex. So in
> > Ozone,
> > > > > Client
> > > > > will retrieve the node information from SCM and decide the balance
> of
> > > > > which node or the whole cluster.
> > > > > Single point to start/stop/configure disk balancing across multiple
> > > > > DataNodes simultaneously.
> > > > > *DataNode:* Handles actual container movement (local execution)
> > > > > *SCM:* Provides coordination and cluster-wide management.
> > > > >
> > > > > *- In DiskBalancerTask, it holds the container read lock.  Would it
> > be
> > > > > possible that a file gets deleted when another thread is reading it
> > and
> > > > > results in an exception?*
> > > > > No, this scenario won't happen.
> > > > >
> > > > >    - DiskBalancerTask explicitly acquires a container read lock
> > before
> > > > any
> > > > >    file operations to prevent concurrent container state
> > modifications
> > > > like
> > > > >    block deletion.
> > > > >    - All deletion operations require a write lock on the container
> > > > >    (BlockDeletingTask, markContainerForDelete, etc.).
> > > > >    - ReadWriteLock semantics guarantee mutual exclusion between
> read
> > > and
> > > > >    write locks - while DiskBalancerTask holds the read lock during
> > copy
> > > > >    operations, no deletion thread can acquire the required write
> > lock.
> > > > >    - The copy operation (copyContainerData) safely reads all
> > container
> > > > >    files under the protection of the read lock, ensuring no
> > concurrent
> > > > >    deletions can interfere.
> > > > >
> > > > > The locking design specifically addresses this concern, as
> evidenced
> > by
> > > > the
> > > > > comment in the code: 'hold read lock on the container first, to
> avoid
> > > > other
> > > > > threads to update the container state, such as block deletion.'
> > > > >
> > > > > *- In DefaultContainerChoosingPolicy, it has a CACHE for caching
> > > > container
> > > > > iterators.  Would it cause ConcurrentModificationException if the
> > > > > underlying containerMap is modified?*
> > > > > *No*, CACHE will NOT cause ConcurrentModificationException when the
> > > > > underlying containerMap is modified. Here's why:
> > > > >
> > > > >    - The underlying *containerMap* is a *ConcurrentSkipListMap*,
> > which
> > > > >    provides fail-safe iterators that don't throw
> > > > >    ConcurrentModificationException.
> > > > >    - The *cached iterator* operates on a snapshot created by the
> > stream
> > > > >    operations (filter + sort), not directly on the live map. This
> > > > snapshot
> > > > > is
> > > > >    immune to concurrent modifications.
> > > > >    - The existing test
> > > > > *testContainerDeletionAfterIteratorGeneration()* specifically
> > > > >    validates this scenario by removing containers after iterator
> > > creation
> > > > > and
> > > > >    confirms no exceptions occur.
> > > > >    - The cache invalidates exhausted iterators, ensuring fresh
> > > snapshots
> > > > >    for future requests.
> > > > >
> > > > > * - In DefaultVolumeChoosingPolicy.chooseVolume(..),
> > getCurrentUsage()
> > > is
> > > > > called multiple times so it can return different values.*
> > > > > Yes, a valid concurrency issue can happen. The current
> implementation
> > > > under
> > > > > the *ReentrantLock*  prevents concurrent calls to *chooseVolume()*
> > but
> > > > > doesn't prevent the underlying usage values from changing during
> > > > execution.
> > > > > *Agreed, this part needs to be fixed.*
> > > > >
> > > > > *- Is DiskBalancerService.DISK_BALANCER_TMP_DIR the same as
> > > > > StorageVolume.TMP_DIR_NAME?  If yes, use StorageVolume.TMP_DIR_NAME
> > > > instead
> > > > > of adding DiskBalancerService.DISK_BALANCER_TMP_DIR.*
> > > > >  Yes, they both use the same value "tmp".
> > > > >
> > > > > *Thank you for pointing to the below refactorings, we will do the
> > > > changes.*
> > > > > - DatanodeDiskBalancerInfoProto should use DatanodeIDProto instead
> of
> > > > > DatanodeDetailsProto
> > > > >
> > > > > - Remove DatanodeDiskBalancerInfoType since it is never used.
> > > > >
> > > > > - Remove the new Container.importContainerData(Path containerPath)
> > > method
> > > > > since it is never used.
> > > > >
> > > > > - Rename Container.copyContainerData to copyContainerDirectory.
> > > > Otherwise,
> > > > > it sounds like copy/import the ContainerData class.
> > > > >
> > > > > - In DiskBalancerService, use ContainerID instead of Long for
> > container
> > > > id.
> > > > >
> > > > > - Do not use Optional for parameters; see
> > > > >
> > > > >
> > > >
> > >
> >
> https://www.reddit.com/r/java/comments/sat1j4/opinions_on_using_optional_as_parameter/
> > > > >
> > > > > - Use Objects.requireNonNull(..) instead of
> > > > Preconditions.checkNotNull(..).
> > > > >
> > > > > On Sat, Aug 9, 2025 at 12:19 AM Tsz Wo Sze <[email protected]>
> > > wrote:
> > > > >
> > > > > > Hi Sammi and others,
> > > > > >
> > > > > > Thanks for working on the Disk Balancer feature!  It is going to
> be
> > > > very
> > > > > > useful.
> > > > > >
> > > > > > Some comments:
> > > > > >
> > > > > > - Why does SCM need a DiskBalancerManager?  Should the
> > > > > DiskBalancerService
> > > > > > just be a local service in Datanode?
> > > > > >
> > > > > > - In DiskBalancerTask, it holds the container read lock.  Would
> it
> > be
> > > > > > possible that a file gets deleted when another thread is reading
> it
> > > and
> > > > > > results in an exception?
> > > > > >
> > > > > > - In DefaultContainerChoosingPolicy, it has a CACHE for caching
> > > > container
> > > > > > iterators.  Would it cause ConcurrentModificationException if the
> > > > > > underlying containerMap is modified?
> > > > > >
> > > > > > - In DefaultVolumeChoosingPolicy.chooseVolume(..),
> > getCurrentUsage()
> > > is
> > > > > > called multiple times so it can return different values.
> > > > > >
> > > > > > - Is DiskBalancerService.DISK_BALANCER_TMP_DIR the same as
> > > > > > StorageVolume.TMP_DIR_NAME?  If yes, use
> StorageVolume.TMP_DIR_NAME
> > > > > instead
> > > > > > of adding DiskBalancerService.DISK_BALANCER_TMP_DIR.
> > > > > >
> > > > > > - Do not throw RuntimeException -- how to handle it when it
> > happens?
> > > > > >
> > > > > > - Since this is a new feature, how about not supporting
> > SCHEMA_V1/V2
> > > ?
> > > > > If
> > > > > > we want to support them, we should add some tests.
> > > > > >
> > > > > > - DatanodeDiskBalancerInfoProto should use DatanodeIDProto
> instead
> > of
> > > > > > DatanodeDetailsProto
> > > > > >
> > > > > > - Remove DatanodeDiskBalancerInfoType since it is never used.
> > > > > >
> > > > > > - Remove the new Container.importContainerData(Path
> containerPath)
> > > > method
> > > > > > since it is never used.
> > > > > >
> > > > > > - Rename Container.copyContainerData to copyContainerDirectory.
> > > > > Otherwise,
> > > > > > it sounds like copy/import the ContainerData class.
> > > > > >
> > > > > > - In DiskBalancerService, use ContainerID instead of Long for
> > > container
> > > > > id.
> > > > > >
> > > > > > - Do not use Optional for parameters; see
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://www.reddit.com/r/java/comments/sat1j4/opinions_on_using_optional_as_parameter/
> > > > > >
> > > > > > - Use Objects.requireNonNull(..) instead of
> > > > > Preconditions.checkNotNull(..).
> > > > > >
> > > > > > Regards,
> > > > > > Tsz-Wo
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 8, 2025 at 9:57 AM Ethan Rose <[email protected]>
> > wrote:
> > > > > >
> > > > > > > Hi Sammi, thanks for working on this feature.
> > > > > > >
> > > > > > > There should not be any incompatible changes introduced with
> this
> > > > > > feature.
> > > > > > >
> > > > > > > This section needs more information because it only talks
> about a
> > > > > feature
> > > > > > > flag, which is orthogonal to any upgrade/downgrade concerns
> > because
> > > > > > feature
> > > > > > > flags do not prevent downgrades like finalization does.
> > > > > > >
> > > > > > >    - What happens if the feature flag is enabled, and later
> > > disabled?
> > > > > > >    - If an in-progress container move is interrupted by a
> > > downgrade,
> > > > > what
> > > > > > >    does the old version do?
> > > > > > >    - Are components persisting any state about the ongoing
> > > > operations,
> > > > > > and
> > > > > > >    if so, what happens to that during a downgrade then upgrade
> > > flow?
> > > > > > >
> > > > > > > The disk balancer has a feature flag that currently defaults to
> > > > false.
> > > > > In
> > > > > > > this
> > > > > > > comment <
> > > > > >
> https://github.com/apache/ozone/pull/8869#issuecomment-3130009094
> > > > > > > >
> > > > > > > I explained why feature flags for features that are only
> > triggered
> > > by
> > > > > > > admins are redundant. I’ve heard other devs from HDFS also
> > express
> > > > > > concern
> > > > > > > over everything in that project being behind feature flags and
> > > trying
> > > > > to
> > > > > > > avoid sending Ozone down that path as well. IMO if we want to
> > > > indicate
> > > > > > that
> > > > > > > this is not ready for prod use, it would be better to hide the
> > CLI
> > > > > > command
> > > > > > > <
> > > > https://picocli.info/quick-guide.html#_hidden_options_and_parameters
> >
> > > > > > and
> > > > > > > add a disclaimer to the help message indicating this. The
> current
> > > > > config
> > > > > > > flag approach provides no user facing information as to why it
> is
> > > > > > disabled
> > > > > > > by default or what the implications of enabling it are.
> > > Additionally,
> > > > > why
> > > > > > > is this set tofalse by default but considered ready to merge
> with
> > > > > master?
> > > > > > >
> > > > > > > Ethan
> > > > > > >
> > > > > > > On Fri, Aug 8, 2025 at 4:53 AM Sammi Chen <
> [email protected]>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Ozone developers,
> > > > > > > >
> > > > > > > > I would like to propose merging the HDDS-5713 Disk Balancer
> > > feature
> > > > > > > branch
> > > > > > > > into master.
> > > > > > > >
> > > > > > > > HDDS-5713 adds the support of the disk volume utilization
> > > balancer
> > > > > > > function
> > > > > > > > by selecting and moving containers from high utilized data
> > volume
> > > > to
> > > > > > > lower
> > > > > > > > utilized data volume, to achieve an all disk volumes'
> > utilization
> > > > > even
> > > > > > > > state.
> > > > > > > >
> > > > > > > >
> > > > > > > > Feature Jira Link:
> > > > > > > > https://issues.apache.org/jira/browse/HDDS-5713
> > > > > > > >
> > > > > > > >
> > > > > > > > Checklist for feature branch merge:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/OZONE/Disk+Balancer+For+Datanode+-+HDDS-5713
> > > > > > > > <
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/OZONE/Disk+Balancer+For+Datanode+-+HDDS-5713
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > This vote will be open for at least a week.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Sammi Chen
> > > > > > > >
> > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > > > For additional commands, e-mail: [email protected]
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [VOTE] Merge HDDS-5713 Disk Balancer into master

Reply via email to