Hi Liebing and all,

This is a good FIP for resolving bottlenecks in remote storage. Thanks for
your effort. The design looks good to me, and the discussion above has
already covered some of the concerns I had.

I have a few further considerations:

1. What happens if a path goes down?
Right now, there's no automatic failover. If one S3 bucket (or HDFS path)
dies, every table or partition assigned to it simply fails. Could we add
simple health checks, so that the remote dir selector temporarily skips a
path that looks unhealthy until it recovers?
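To make the idea concrete, a health check could be as simple as a wrapper
around the weighted selector. This is only an illustrative Java sketch, not
Fluss's actual API; the class name `HealthAwarePathSelector` and its methods
are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a weighted round-robin path selector that
// temporarily skips paths a background health checker has marked down.
public class HealthAwarePathSelector {
    private final List<String> paths;
    private final List<Integer> weights;
    private final boolean[] healthy;
    private int cursor = 0; // position in the expanded weighted sequence

    public HealthAwarePathSelector(List<String> paths, List<Integer> weights) {
        this.paths = paths;
        this.weights = weights;
        this.healthy = new boolean[paths.size()];
        Arrays.fill(healthy, true);
    }

    // Called by a background health checker when a probe fails or succeeds.
    public void markHealthy(int pathIndex, boolean up) {
        healthy[pathIndex] = up;
    }

    // Pick the next path, honoring weights and skipping unhealthy paths.
    public String next() {
        List<Integer> expanded = new ArrayList<>();
        for (int i = 0; i < paths.size(); i++) {
            if (!healthy[i]) continue; // skip paths that failed health checks
            for (int w = 0; w < weights.get(i); w++) {
                expanded.add(i);
            }
        }
        if (expanded.isEmpty()) {
            throw new IllegalStateException("no healthy remote path available");
        }
        int idx = expanded.get(cursor % expanded.size());
        cursor++;
        return paths.get(idx);
    }
}
```

With weights [1, 2] and the first path marked down, every pick falls through
to the second path until the checker marks the first path healthy again.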

2. New paths don't always help old data.
Routing only happens when a new table or a new partition is created, and
its effect depends on the partition strategy.
- If a table uses time-based partitions (e.g., daily), adding new paths
works well because new data goes into new partitions on the new paths.
- But for non-partitioned tables, or for tables that keep writing to old
partitions, the new paths sit idle and the traffic never shifts over.
So developers need to think about the partition strategy and the shape of
the input data when adding remote dirs.

3. Managing weights manually is tricky for developers/maintainers.
Since the weighted round-robin is static:
    - Developers/maintainers have to determine the right weights based on
current traffic.
    - If you skew the weights to favor a path, you have to remember to
rebalance them later, or that path stays overloaded. For example, if two
paths start with weights [1, 2] to offload the higher traffic on the first
path, developers/maintainers must remember to change the weights back to
[1, 1] once the traffic is balanced between the two paths; otherwise, the
traffic on the second path will keep growing.
    - Also, setting a weight to 0 behaves differently depending on the
partition type: time-based paths eventually go quiet, but field-based ones
like "country=US" keep writing to the same path forever.
Instead of manual tuning, could we eventually make this dynamic? Let the
system adjust the weights based on real-time latency or throttling metrics.
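As a rough illustration of what dynamic weighting could look like (again, a
hypothetical Java sketch rather than a concrete proposal; the class
`DynamicWeights` and its methods do not exist in Fluss): keep a smoothed
average of per-path upload latency and derive the weights inversely from it,
so a path that is twice as slow gets roughly half the weight.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: derive per-path weights from observed upload
// latencies instead of static configuration.
public class DynamicWeights {
    private final Map<String, Double> avgLatencyMs = new HashMap<>();
    private static final double ALPHA = 0.2; // smoothing factor for the EWMA

    // Record a latency sample for a path (e.g., from an upload request).
    public void record(String path, double latencyMs) {
        avgLatencyMs.merge(path, latencyMs,
                (old, cur) -> (1 - ALPHA) * old + ALPHA * cur);
    }

    // Compute integer weights: inverse of the smoothed latency, normalized
    // so the slowest path still gets weight 1.
    public Map<String, Integer> weights() {
        double maxLatency = avgLatencyMs.values().stream()
                .mapToDouble(Double::doubleValue).max().orElse(1.0);
        Map<String, Integer> result = new HashMap<>();
        avgLatencyMs.forEach((path, lat) ->
                result.put(path, (int) Math.max(1, Math.round(maxLatency / lat))));
        return result;
    }
}
```

The selector could then periodically pull `weights()` instead of reading a
static config value, which would also address the manual rebalancing concern
above.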

The points above are future operational considerations regarding failover
and maintenance after this solution is deployed. I don't think they should
block this FIP, and we don't need to fix them right now; I just want to
bring them into the discussion.

Regards,
Yang Guo

On Fri, Feb 27, 2026 at 5:53 PM Liebing Yu <[email protected]> wrote:

> Hi Lorenzo, sorry for the late reply.
>
> Thanks for the AWS example! This further solidifies the case for multi-path
> support.
>
> Regarding your question about multi-cloud support:
> Our current design naturally supports multi-cloud object storage systems.
> Since the implementation is built upon a multi-schema filesystem
> abstraction (supporting schemes like s3://, oss://, abfs://, etc.), the
> system is inherently "cloud-agnostic."
>
> Best regards,
> Liebing Yu
>
>
> On Wed, 4 Feb 2026 at 23:37, Lorenzo Affetti via dev <[email protected]
> >
> wrote:
>
> > This is quite an interesting FIP and I think it is a significant
> > enhancement, especially for large-scale clusters.
> >
> > I think you can also add the AWS case in your motivation:
> >
> >
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-high-request-rate
> > AWS automatically scales if requests exceed 5,500 per second for the same
> > prefix, which results in transient 503 errors.
> > Your approach would eliminate this problem by providing another bucket.
> >
> > I was wondering if it might also provide the possibility of configuring
> the
> > same Fluss cluster for multi-cloud object storage systems.
> > From a design perspective, nothing should prevent me from storing remote
> > data on both Azure and AWS at the same time, probably resulting in
> > different performance numbers for different partitions/tables.
> > Should the design force the use of only 1 filesystem implementation?
> >
> > Thank you again!
> >
> > On Fri, Jan 30, 2026 at 7:59 AM Liebing Yu <[email protected]> wrote:
> >
> > > Hi Yuxia, thanks for the thoughtful response. Let me go through your
> > > questions one by one.
> > >
> > > 1. I think after we support `remote.data.dirs`, different schemes
> > > will be supported naturally.
> > > 2. Yes, I think we should change from `PbTablePath` to
> > > `PbPhysicalTablePath`.
> > > 3. Thanks for the reminder. I'll poc authentication in
> > > https://github.com/apache/fluss/issues/2518. But it doesn't block the
> > > multiple-paths implementation in Fluss server in
> > > https://github.com/apache/fluss/issues/2517.
> > > 4. For a partitioned table, the table itself has a remote data dir
> > > for metadata (such as lake offset), and each partition has its own
> > > remote dir for table data (e.g. kv or log data).
> > > 5. Legacy clients can access data in the new cluster.
> > >
> > >    - If the permissions of the paths specified in `remote.data.dirs`
> > >    on the new cluster match those configured in `remote.data.dir`,
> > >    seamless access is achievable.
> > >    - If the permissions are inconsistent, access permissions must be
> > >    explicitly configured. For example, when using OSS, a policy
> > >    granting access permissions to the account identified by
> > >    `fs.oss.roleArn` must be configured for each bucket specified in
> > >    `remote.data.dirs`.
> > >
> > >
> > > Best regards,
> > > Liebing Yu
> > >
> > >
> > > On Thu, 29 Jan 2026 at 10:07, Yuxia Luo <[email protected]> wrote:
> > >
> > > > Hi, Liebing
> > > >
> > > > Thanks for the detailed FIP. I have a few questions:
> > > > 1. Does `remote.data.dirs` support paths with different schemes? For
> > > > example:
> > > > ```
> > > > remote.data.dirs: oss://bucket1/fluss-data, s3://bucket2/fluss-data
> > > > ```
> > > >
> > > > 2. Should `GetFileSystemSecurityTokenRequest` include partition?
> > > > The FIP adds `table_path` to the request, but since different
> > > > partitions may reside on different remote paths (and require
> > > > different tokens), should the request also include partition
> > > > information?
> > > >
> > > > 3. Just a reminder that `DefaultSecurityTokenManager` will become
> > > > more complex...
> > > > This is not a blocker, but worth a PoC to recognize any complexity.
> > > >
> > > > 4. I want to confirm my understanding: for a partitioned table, does
> > > > the table itself have a remote dir, AND each partition also has its
> > > > own remote dir?
> > > >
> > > > Or is it:
> > > > - Non-partitioned table → table-level remote dir
> > > > - Partitioned table → only partition-level remote dirs (no
> > > > table-level)?
> > > >
> > > > 5. Can old clients (without table path in token request) still read
> > > > data from new clusters?
> > > > One possible solution is: for RPCs without table information, the
> > > > server returns a token for the first dir in `remote.data.dirs`. Or
> > > > other ways that allow users to configure the cluster to keep
> > > > compatibility.
> > > >
> > > >
> > > >
> > > > On 2026/01/21 03:52:29 Zhe Wang wrote:
> > > > > Thanks for your response, now it looks good to me.
> > > > >
> > > > > Best regards,
> > > > > Zhe Wang
> > > > >
> > > > > Liebing Yu <[email protected]> wrote on Tue, 20 Jan 2026 at 14:29:
> > > > >
> > > > > > Hi Zhe, sorry for the late reply.
> > > > > >
> > > > > > The primary focus of this FIP is not to address read/write issues
> > > > > > at the table or partition level, but rather to overcome
> > > > > > limitations at the cluster level. Given the current capabilities
> > > > > > of object storage, read/write performance for a single table or
> > > > > > partition is unlikely to be a bottleneck; however, for a
> > > > > > large-scale Fluss cluster, it can easily become one. Therefore,
> > > > > > the core objective here is to distribute the cluster-wide
> > > > > > read/write traffic across multiple remote storage systems.
> > > > > >
> > > > > > Best regards,
> > > > > > Liebing Yu
> > > > > >
> > > > > >
> > > > > > On Wed, 14 Jan 2026 at 16:07, Zhe Wang <[email protected]>
> > > wrote:
> > > > > >
> > > > > > > Hi Liebing, Thanks for the clarification.
> > > > > > > >1. To clarify, the data is currently split by partition
> > > > > > > >level for partitioned tables and by table for
> > > > > > > >non-partitioned tables.
> > > > > > >
> > > > > > > Therefore, the main aim of this FIP is improving the speed of
> > > > > > > reading data from different partitions; the write speed may
> > > > > > > still be limited for a single system?
> > > > > > >
> > > > > > > Best,
> > > > > > > Zhe Wang
> > > > > > >
> > > > > > > Liebing Yu <[email protected]> wrote on Tue, 13 Jan 2026 at 19:11:
> > > > > > >
> > > > > > > > Hi Zhe, Thanks for the questions!
> > > > > > > >
> > > > > > > > 1. To clarify, the data is currently split by partition
> > > > > > > > level for partitioned tables and by table for
> > > > > > > > non-partitioned tables.
> > > > > > > >
> > > > > > > > 2. Regarding RemoteStorageCleaner, you are absolutely right.
> > > > > > > > Supporting remote.data.dirs there is necessary for a complete
> > > > > > > > cleanup when a table is dropped.
> > > > > > > >
> > > > > > > > Thanks for pointing that out!
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Liebing Yu
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, 12 Jan 2026 at 17:02, Zhe Wang <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Liebing,
> > > > > > > > >
> > > > > > > > > Thanks for driving this, I think it's a really useful
> > feature.
> > > > > > > > > I have two small questions:
> > > > > > > > > 1. What's the scope for splitting data across dirs? I see
> > > > > > > > > there's a partitionId in the ZK data, so will the data be
> > > > > > > > > split by partition into different directories, or by bucket?
> > > > > > > > > 2. Maybe it needs to support remote.data.dirs in
> > > > > > > > > RemoteStorageCleaner? So we can delete all remote storage
> > > > > > > > > when deleting a table.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Zhe Wang
> > > > > > > > >
> > > > > > > > > Liebing Yu <[email protected]> wrote on Thu, 8 Jan 2026 at 20:10:
> > > > > > > > >
> > > > > > > > > > Hi devs,
> > > > > > > > > >
> > > > > > > > > > I propose initiating discussion on FIP-25[1]. Fluss
> > > > > > > > > > leverages remote storage systems—such as Amazon S3, HDFS,
> > > > > > > > > > and Alibaba Cloud OSS—to deliver a cost-efficient, highly
> > > > > > > > > > available, and fault-tolerant storage solution compared
> > > > > > > > > > to local disk. *However, in production environments, we
> > > > > > > > > > often find that the bandwidth of a single remote storage
> > > > > > > > > > becomes a bottleneck.* Taking OSS[2] as an example, the
> > > > > > > > > > typical upload bandwidth limit for a single account is
> > > > > > > > > > 20 Gbit/s (Internal) and 10 Gbit/s (Public). So I
> > > > > > > > > > initiated this FIP, which aims to introduce support for
> > > > > > > > > > multiple remote storage paths and enables the dynamic
> > > > > > > > > > addition of new storage paths without service
> > > > > > > > > > interruption.
> > > > > > > > > >
> > > > > > > > > > Any feedback and suggestions on this proposal are
> > > > > > > > > > welcome!
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLUSS/FIP-25%3A+Support+Multi-Location+for+Remote+Storage
> > > > > > > > > > [2]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://www.alibabacloud.com/help/en/oss/user-guide/limits?spm=a2c63.l28256.help-menu-31815.d_0_0_5.2ac34d06oZYFvK
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Liebing Yu
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Lorenzo Affetti
> > Senior Software Engineer @ Flink Team
> > Ververica <http://www.ververica.com>
> >
>
