Re: [SPAM]Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg

Mehul Batra Tue, 08 Jul 2025 00:24:06 -0700

Hi Yuxia,

Thanks for the clarification.


It's good to know that compaction/expiring-snapshots will be addressed in
the initial version, and I completely understand your point on the
complexity of orphan file cleanup. I agree it's better to avoid over-design
at this stage and evolve as needed based on future usage patterns and
feedback.
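
For my own understanding, the check you describe (list all files, then compare them with the files referenced in the Iceberg manifests) boils down to a set difference. Below is a minimal sketch in plain Python. The names find_orphan_files, listed_files, referenced_files and older_than are hypothetical and are not Fluss or Iceberg APIs. The age guard is my own assumption, added so that files still being written by an in-flight tiering round are not flagged.

```python
from typing import Dict, Iterable, Set


def find_orphan_files(
    listed_files: Dict[str, float],
    referenced_files: Iterable[str],
    older_than: float,
) -> Set[str]:
    """Return paths present in storage but not referenced by any manifest.

    listed_files maps each path found by a full storage listing to its
    last-modified timestamp (hypothetical input shape).
    referenced_files is every path reachable from the table's manifests.
    older_than is a timestamp cutoff; newer files are skipped (assumption).
    """
    referenced = set(referenced_files)
    return {
        path
        for path, modified_at in listed_files.items()
        if path not in referenced and modified_at < older_than
    }
```

Even in this toy form it needs a full storage listing plus a scan of every manifest, which is why deferring it seems reasonable to me.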

Also, thanks for confirming that snapshot expiration will be triggered
explicitly via the LakeCommitter. That clears things up for me.
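
To make sure I have the semantics right, here is a rough sketch of how I picture the expiration decision, assuming the committer compares each snapshot's timestamp against history.expire.max-snapshot-age-ms after a commit. Snapshot, snapshots_to_expire and min_to_keep are hypothetical names for illustration only, not the LakeCommitter or Iceberg API.

```python
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    snapshot_id: int
    timestamp_ms: int


def snapshots_to_expire(
    snapshots: List[Snapshot],
    now_ms: int,
    max_age_ms: int,
    min_to_keep: int = 1,
) -> List[int]:
    """Pick snapshot ids older than the retention window.

    The newest min_to_keep snapshots are always retained, even if they
    are older than the cutoff (an assumption of this sketch).
    """
    newest_first = sorted(snapshots, key=lambda s: s.timestamp_ms, reverse=True)
    cutoff_ms = now_ms - max_age_ms
    candidates = newest_first[min_to_keep:]
    return [s.snapshot_id for s in candidates if s.timestamp_ms < cutoff_ms]
```

Since the decision only looks at snapshot metadata, it matches your point that this is a cheap step to run from the committer.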

Looking forward to working on this with you and the community, and to
seeing how this evolves!

Best regards,
Mehul

On Tue, Jul 8, 2025 at 7:16 AM yuxia <[email protected]> wrote:

> Hi, Mehul
>
> "Tableflow automates table maintenance by compacting and cleaning up small
> files generated by continuous streaming data in object storage."
> It seems Tableflow supports compaction, which is covered in this FIP.
> I haven't seen orphan file cleanup in Tableflow.
> Orphan file cleanup is not straightforward and is somewhat complex: it
> requires listing all files and comparing them with the files in the
> Iceberg manifests to find the orphan files.
> I still prefer not to introduce that complexity in the first version of
> Iceberg support, since it would lead to over-design. Let's see what
> happens in the future.
>
> As for snapshot expiration: yes, the LakeCommitter should trigger the
> snapshot expiration action explicitly. It's a lightweight operation.
>
> Best regards,
> Yuxia
>
> ----- Original Message -----
> From: "Mehul Batra" <[email protected]>
> To: "dev" <[email protected]>
> Sent: Tuesday, July 8, 2025, 4:26:19 AM
> Subject: [SPAM]Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
>
> Hi Yuxia, Cheng,
>
> Thank you both for the insights.
>
> From a user’s perspective, I believe our goal should be to abstract away as
> much operational complexity as possible. For example, TableFlow handles
> both data writing and maintenance seamlessly for the user, which avoids the
> burden of running separate processes.
>
> https://docs.confluent.io/cloud/current/topics/tableflow/overview.html#table-maintenance-and-optimizations
>
> In the Fluss integration, if users are expected to run a separate
> maintenance job (e.g., for snapshot expiration or orphan file cleanup),
> there's a real risk of job overlap and failure, especially due to
> optimistic concurrency conflicts when both jobs (tiering and maintenance)
> try to commit at around the same time.
>
> Yuxia, you mentioned that the LakeCommitter will respect the
> history.expire.max-snapshot-age-ms property (similar to Paimon). I just
> wanted to clarify: while the property sets the retention policy, we still
> need to trigger the snapshot expiration action explicitly. Do we envision
> Fluss's tiering job playing that role?
>
> If so, that would be a great win: it could help automate snapshot
> expiration and indirectly clean up orphan files, making things much
> smoother for users.
>
> Please correct me if I’ve misunderstood anything.
>
> Best regards,
> Mehul
>
> On Mon, Jul 7, 2025 at 11:32 AM Wang Cheng <[email protected]>
> wrote:
>
> > Hi Mehul,
> >
> >
> > I agree with Yuxia's point. We should leave table maintenance work
> > such as expiring snapshots and deleting orphan files to Iceberg users
> > rather than relying on the Fluss tiering job.
> >
> > Regards,
> > Cheng
> >
> > ------------------ Original ------------------
> > From: "dev" <[email protected]>
> > Date: Sat, Jul 5, 2025 11:38 PM
> > To: "dev" <[email protected]>
> > Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> >
> > Hi Yuxia,
> > Great, that sounds good to me and will help users get better read
> > latency.
> > How about snapshot expiration (to regulate metadata) and removing
> > orphan files (files that are no longer referenced, or dangling files
> > from failed tasks)?
> > Are we planning to introduce them as part of automated maintenance
> > provided by the Fluss cluster?
> > Warm regards,
> > Mehul Batra
> >
> > On Fri, Jul 4, 2025 at 5:02 PM yuxia <[email protected]>
> > wrote:
> >
> > > Hi, Mehul.
> > > Thanks for your attention. I think we don't need to introduce an
> > > extra post-commit hook to manage small files. In the design, all
> > > files that belong to the same bucket (in Iceberg, the same partition)
> > > are distributed to the same task to write, so that task can then
> > > compact the small files for the partition.
> > > As the FIP says, when creating the IcebergLakeWriter in one round of
> > > tiering, the writer can scan the manifest to learn which files are in
> > > this bucket. If it finds that compaction is possible, it can compact
> > > these files while writing new files. We have similar logic for
> > > tiering to Paimon.
> > >
> > > Best regards,
> > > Yuxia
> > >
> > > ----- Original Message -----
> > > From: "Mehul Batra" <[email protected]>
> > > To: "dev" <[email protected]>
> > > Sent: Thursday, July 3, 2025, 5:04:18 PM
> > > Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> > >
> > > +1 This will help us to address the missing table format and provide
> > > better ecosystem interoperability. Iceberg's growing adoption in the
> > > data lakehouse space makes this a valuable addition to Fluss's tiering
> > > capabilities.
> > > Are there any plans to integrate the Maintenance services as part of
> > > tiering itself as a post-commit hook to manage small files?
> > > Warm regards,
> > > Mehul Batra
> > >
> > > On Thu, Jul 3, 2025 at 2:24 PM yuxia <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Fluss currently supports tiering data to Apache Paimon, enabling
> > > > cost-effective storage management for warm/cold data. However, the
> > > > lack of native Iceberg tiering support limits flexibility and
> > > > ecosystem integration for users who rely on Iceberg’s open table
> > > > format.
> > > >
> > > > To address this gap, I’d like to propose FIP-3: Support Tiering
> > > > Fluss Data to Iceberg[1] which aims to integrate Iceberg into
> > > > Fluss’s tiering capabilities.
> > > >
> > > > Welcome your feedback and suggestions on this proposal. Looking
> > > > forward to a productive discussion!
> > > >
> > > > [1]:
> > > > https://cwiki.apache.org/confluence/display/FLUSS/FIP-3%3A+Support+tiering+Fluss+data+to+Iceberg
> > > >
> > > > Best regards,
> > > > Yuxia
> > > >
>
