Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg

Mehul Batra Mon, 07 Jul 2025 13:26:47 -0700

Hi Yuxia, Cheng,

Thank you both for the insights.


From a user’s perspective, I believe our goal should be to abstract away as
much operational complexity as possible. For example, TableFlow handles
both data writing and maintenance seamlessly for the user, which avoids the
burden of running separate processes.
https://docs.confluent.io/cloud/current/topics/tableflow/overview.html#table-maintenance-and-optimizations

In the Fluss integration, if users are expected to run a separate maintenance
job (e.g., for snapshot expiration or orphan file cleanup), there's a real
risk of the two jobs overlapping and failing, especially from optimistic
concurrency conflicts when tiering and maintenance both try to commit around
the same time.
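
To make the race concrete, here is a toy Python model of optimistic
concurrency on table metadata. It is only a sketch: ToyTable and
CommitConflict are hypothetical names, not Iceberg or Fluss APIs, and real
Iceberg commits are retried and can often be rebased, so this shows the shape
of the conflict rather than how often it would fail in practice.

```python
# Toy model of optimistic concurrency on table metadata.
# Hypothetical names; this is NOT Iceberg or Fluss code.


class CommitConflict(Exception):
    """Raised when the table changed since the writer read its base version."""


class ToyTable:
    def __init__(self):
        self.version = 0
        self.log = []

    def commit(self, base_version, change):
        # Compare-and-swap: succeed only if nobody committed since base_version.
        if base_version != self.version:
            raise CommitConflict(
                f"{change}: based on v{base_version}, table is at v{self.version}"
            )
        self.version += 1
        self.log.append(change)
        return self.version


table = ToyTable()

# Both jobs plan their work against the same base version.
tiering_base = table.version
maintenance_base = table.version

# The tiering job commits first and moves the table forward.
table.commit(tiering_base, "tiering: append files")

# The separate maintenance job now commits against a stale base.
try:
    table.commit(maintenance_base, "maintenance: expire snapshots")
    conflicted = False
except CommitConflict:
    conflicted = True

# If one job runs both steps in sequence, the second step always sees
# the latest version and cannot collide with the first.
table.commit(table.version, "maintenance: expire snapshots")
```

In this model the stale maintenance commit is rejected, while the same
maintenance step succeeds once it is sequenced after the tiering commit.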

Yuxia, you mentioned that the LakeCommitter will respect the
history.expire.max-snapshot-age-ms property (similar to Paimon). I just
wanted to clarify: the property only sets the retention policy, so the
snapshot expiration action still has to be triggered explicitly. Do we
envision Fluss's tiering job playing that role?
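
A small Python sketch of the distinction I mean, under loud assumptions:
ToyTable, expire_snapshots and tiering_commit are hypothetical stand-ins, not
the Fluss LakeCommitter or the Iceberg API. Only the property name comes from
the discussion above. Both tables carry the same retention setting, but only
the one whose committer actually runs expiration sheds old snapshots.

```python
# Toy model: a retention property is only policy; something must run expiration.
# Hypothetical names; this is NOT the Fluss LakeCommitter or the Iceberg API.

MAX_AGE_PROP = "history.expire.max-snapshot-age-ms"


class ToyTable:
    def __init__(self, properties):
        self.properties = dict(properties)
        self.snapshots = []  # list of (snapshot_id, commit_time_ms)
        self._next_id = 1

    def append_snapshot(self, now_ms):
        self.snapshots.append((self._next_id, now_ms))
        self._next_id += 1

    def expire_snapshots(self, now_ms):
        # Drop snapshots older than the configured age, always keeping the latest.
        cutoff = now_ms - int(self.properties[MAX_AGE_PROP])
        latest = self.snapshots[-1:]
        kept = [s for s in self.snapshots[:-1] if s[1] >= cutoff]
        self.snapshots = kept + latest


def tiering_commit(table, now_ms, expire_after_commit):
    # One tiering round: commit new data, then optionally trigger expiration.
    table.append_snapshot(now_ms)
    if expire_after_commit:
        table.expire_snapshots(now_ms)


passive = ToyTable({MAX_AGE_PROP: "1000"})  # property set, nothing triggers it
active = ToyTable({MAX_AGE_PROP: "1000"})   # committer triggers expiration

for now_ms in (0, 600, 1200, 1800, 2400):
    tiering_commit(passive, now_ms, expire_after_commit=False)
    tiering_commit(active, now_ms, expire_after_commit=True)

passive_ids = [s[0] for s in passive.snapshots]
active_ids = [s[0] for s in active.snapshots]
```

With these inputs the passive table still holds all five snapshots, while the
active table retains only the two that fall inside the 1000 ms window.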

If so, that would be a great win: it could automate snapshot expiration
and, as a side effect, clean up orphan files, making things much
smoother for users.

Please correct me if I’ve misunderstood anything.

Best regards,
Mehul

On Mon, Jul 7, 2025 at 11:32 AM Wang Cheng <[email protected]> wrote:

> Hi Mehul,
>
>
> I agree with Yuxia's point. We should leave table maintenance work
> such as expiring snapshots and deleting orphan files to Iceberg users
> rather than relying on the Fluss tiering job.
>
> Regards,
> Cheng
>
> ------------------ Original ------------------
> From: "dev" <[email protected]>;
> Date: Sat, Jul 5, 2025 11:38 PM
> To: "dev" <[email protected]>;
> Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
>
> Hi Yuxia,
> Great, that sounds good to me and will help users get better read
> latency.
> What about snapshot expiration (to keep metadata in check) and removing
> orphan files (files that are no longer referenced, or dangling files left
> by failed tasks)?
> Are we planning to introduce them as part of automated maintenance provided
> by the Fluss cluster?
> Warm regards,
> Mehul Batra
>
> On Fri, Jul 4, 2025 at 5:02 PM yuxia <[email protected]> wrote:
>
> > Hi, Mehul.
> > Thanks for your attention. I think we don't need to introduce an extra
> > post-commit hook to manage small files. In the design, all files that
> > belong to the same bucket (in Iceberg, the same partition) are
> > distributed to the same task for writing, so that task can also compact
> > the small files for the partition.
> > As the FIP says, when an IcebergLakeWriter is created in one round of
> > tiering, the writer can scan the manifest to learn which files are in
> > its bucket. If it finds that compaction is possible, it can compact
> > those files while writing new ones. We have similar logic for tiering
> > to Paimon.
> >
> > Best regards,
> > Yuxia
> >
> > ----- Original Message -----
> > From: "Mehul Batra" <[email protected]>
> > To: "dev" <[email protected]>
> > Sent: Thursday, July 3, 2025, 5:04:18 PM
> > Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> >
> > +1. This will help us address the missing table format and provide
> > better ecosystem interoperability. Iceberg's growing adoption in the
> > data lakehouse space makes this a valuable addition to Fluss's tiering
> > capabilities.
> > Are there any plans to integrate maintenance services into tiering
> > itself, as a post-commit hook to manage small files?
> > Warm regards,
> > Mehul Batra
> >
> > On Thu, Jul 3, 2025 at 2:24 PM yuxia <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Fluss currently supports tiering data to Apache Paimon, enabling
> > > cost-effective storage management for warm/cold data. However, the
> > > lack of native Iceberg tiering support limits flexibility and
> > > ecosystem integration for users who rely on Iceberg’s open table
> > > format.
> > >
> > > To address this gap, I’d like to propose FIP-3: Support Tiering Fluss
> > > Data to Iceberg [1], which aims to integrate Iceberg into Fluss’s
> > > tiering capabilities.
> > >
> > > I welcome your feedback and suggestions on this proposal and look
> > > forward to a productive discussion!
> > >
> > > [1]:
> > > https://cwiki.apache.org/confluence/display/FLUSS/FIP-3%3A+Support+tiering+Fluss+data+to+Iceberg
> > >
> > > Best regards,
> > > Yuxia
