> If your position now is [...]

My position has not changed.

> a simple solution like this, https://github.com/apache/polaris/pull/3415.

Have you tested this solution? It seems to me that the space complexity in this code has not changed at all...
--
Pierre

On Sun, Jan 11, 2026 at 8:57 PM Yufei Gu <[email protected]> wrote:

> it applies regardless of the final answer to the base64 question.

Pierre, this is exactly the point where I disagree. The statement that all manifest files are materialized at once being "the cause of the reported heap pressure" is a claim that has been made in the issue https://github.com/apache/polaris/issues/2365. I don't see how any misunderstanding could arise from that. If your position now is that heap pressure exists regardless of base64 encoding of all manifest files, due to the inherent cost of traversing large metadata histories, then that is a different explanation and the original attribution should be corrected. We should be explicit about it in the issue and on the dev mailing list. Treating these as equivalent is not accurate.

BTW, for the second issue claimed, we could have a simple solution like this: https://github.com/apache/polaris/pull/3415.

Yufei

On Fri, Jan 9, 2026 at 7:02 AM Adam Christian <[email protected]> wrote:

So, I wanted to bring all discussions into this thread, as we are jumping between the issue filed, this thread, and the GitHub PR.

In regards to Yufei's comment on the PR [1], sorry if I wasn't clear when Yufei, Prashant, and I had our conversation. From what I remember, there were two points brought up:
1. I wanted to check out Yufei's comment on the memory consumption analysis. I did NOT recommend pausing the work. I did want to do an investigation on my side into Yufei's comment on the GitHub issue.
2. I did want to see what other options there were for implementation, but those can be done as a follow-up. Prashant had mentioned leveraging AllEntriesTable [2].

Let me address these one by one.

On the memory pressure analysis:
1. I agree with Yufei that the encoding is just on the ManifestFile metadata object, not the manifest file content. That alone should not explain significant memory pressure.
2. However, this is NOT what I read the ticket to be about, since the ticket is titled "Purge table task implementation prone to OOMs" and, as Pierre pointed out, there is the line "Although some Java stream handling is being used, all manifest-files of the table to purge are materialized at once on the Java heap."
3. I agree with Pierre's analysis of the memory consumption patterns of TableCleanupTaskHandler, and I believe that our current implementation can be improved.
4. I believe that Robert's PR improves our current implementation and, because it is an incremental way to get us to a more stable product, we should get it in.
5. I believe that improvements to the current codebase should not be hindered by future concerns like the asynchronous task or delegation service work, as long as they do not block those future decisions. This PR is orthogonal to those decisions. It helps improve existing functionality and can be used however we decide to move forward in the future.

On the other options for implementation:
1. I agree with Prashant that it would be great to reuse existing functionality when it applies to our use cases. I did not know about the AllEntriesTable functionality in Iceberg before, and it could be fruitful for someone to do that analysis in the future (see the sketch after this list).
2. That being said, I do NOT believe that we need to wait for someone to do that analysis before moving forward. The beauty of having the separation between the File Operations APIs and the implementation modules is that we can change the implementation when we find a better one.
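As a starting point for that analysis, here is a minimal sketch of scanning the all_entries metadata table incrementally. This is illustrative only and is not code from the PR; `table` is assumed to be an Iceberg `Table` handle, and `processEntries` is a hypothetical consumer:
---
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.MetadataTableType;
import org.apache.iceberg.MetadataTableUtils;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

// Wrap the base table as its "all_entries" metadata table.
Table allEntries = MetadataTableUtils.createMetadataTableInstance(
    table, MetadataTableType.ALL_ENTRIES);

// planFiles() yields scan tasks over the underlying manifests, so entries
// can be consumed a manifest at a time instead of all at once.
try (CloseableIterable<FileScanTask> tasks = allEntries.newScan().planFiles()) {
  for (FileScanTask task : tasks) {
    processEntries(task); // hypothetical: read and handle one manifest's entries
  }
}
---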
So, personally, I would like to get this work in and start alleviating the community of any potential issues with the current purge table operation.

[1] https://github.com/apache/polaris/pull/3256#issuecomment-3725228243
[2] https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/AllEntriesTable.java

Go community,

Adam

On Fri, Jan 9, 2026 at 4:33 AM Pierre Laporte <[email protected]> wrote:

Oh, I see where the misunderstanding is. At this time, what you are questioning is this sentence from the #2365 original bug report:

> The base64 encoded full binary Iceberg manifest files are included in the task entities

Now, consider the sentence that immediately follows:

> Although some Java stream handling is being used, all manifest-files of the table to purge are materialized at once on the Java heap.

This is indeed what my analysis is about, as it is what defines the memory consumption pattern of the current logic. Because it is the cause of the reported heap pressure, it applies regardless of the final answer to the base64 question.

--
Pierre

On Fri, Jan 9, 2026 at 5:58 AM Yufei Gu <[email protected]> wrote:

> The getMetadataFileBatches method has a space complexity of `O(PM + S + ST + PST)`.
> Same thing for the getMetadataTaskStream method.
> The getManifestTaskStream method has a space complexity of `O(UM)`.
> The handleTask method has a space complexity of `O(UM + PM + S + ST + PST + T)`.

Pierre, thanks for the analysis. I think this response is addressing a slightly different issue than the one originally raised. The initial concern was that embedding base64 encoded manifest files causes excessive heap usage. The new complexity breakdown instead argues that operations scale with table history and activity. If the concern is really about the cost of operating on large tables with long metadata history, that seems like a broader, pre-existing characteristic rather than a defect in this specific approach. In that case, it may be clearer to track this as a separate issue or PR. For reference, there were discussions of moving heavy workloads like this out of the Polaris instance to a delegation service.

Yufei

On Thu, Jan 8, 2026 at 2:04 AM Pierre Laporte <[email protected]> wrote:

As far as I can tell, here is the space complexity for each method. The names used correspond to:

* PM = number of previous metadata files
* S = number of snapshots
* ST = number of statistics files
* PST = number of partition statistics files
* UM = number of unique manifest files across all snapshots
* T = total number of created TaskEntities

The getMetadataFileBatches method has a space complexity of `O(PM + S + ST + PST)`.
Same thing for the getMetadataTaskStream method.
The getManifestTaskStream method has a space complexity of `O(UM)`.
The handleTask method has a space complexity of `O(UM + PM + S + ST + PST + T)`.

Based on those elements, it is clear that the current implementation will run into heap pressure for tables with many snapshots and frequent updates, or tables with long metadata history.
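To make the shape of that pressure concrete, here is a schematic sketch, not the actual TableCleanupTaskHandler code, contrasting full materialization with bounded batches; `snapshots`, `io`, `BATCH`, and `processBatch` are stand-ins:
---
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.Snapshot;

// Materialized at once: peak heap grows with O(UM).
List<ManifestFile> all = snapshots.stream()
    .flatMap(snap -> snap.allManifests(io).stream())
    .distinct()
    .collect(Collectors.toList());

// Bounded batches: peak heap for ManifestFile objects stays at O(BATCH),
// though de-duplication still needs a set of seen manifest paths.
Set<String> seen = new HashSet<>();
List<ManifestFile> batch = new ArrayList<>(BATCH);
for (Snapshot snap : snapshots) {
  for (ManifestFile mf : snap.allManifests(io)) {
    if (seen.add(mf.path())) {
      batch.add(mf);
      if (batch.size() == BATCH) {
        processBatch(batch); // hypothetical downstream task creation
        batch.clear();
      }
    }
  }
}
if (!batch.isEmpty()) {
  processBatch(batch);
}
---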
On Wed, Jan 7, 2026 at 9:55 PM Yufei Gu <[email protected]> wrote:

Hi all,

After taking a closer look, I am not sure the issue as currently described is actually valid.

The base64 encoded manifest objects [1] being discussed are not the manifest files themselves. They are objects representing manifest files, which can be reconstructed from the manifest entries stored in the manifest list files. As a result, the in-memory footprint should be roughly equivalent to the size of a single manifest list file per snapshot, plus some additional base64 encoding overhead. That overhead does not seem significant enough on its own to explain large heap pressure.

This pattern is also handled in practice today. For example, multiple Spark procedures/actions and Spark planning process these manifest representations within a single node, typically the driver, without materializing full manifest files in memory. One concrete example is here:

https://github.com/apache/iceberg/blob/bf1074ff373c929614a3f92dd4e46780028ac1ca/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java#L290

Given this, I am not convinced that embedding these manifest representations is inherently problematic from a memory perspective. If there are concrete scenarios where this still leads to excessive memory usage, it would be helpful to clarify where the amplification happens beyond what is already expected from manifest list processing.

Happy to be corrected if I am missing something, but wanted to share this observation before we anchor further design decisions on this assumption.

1. https://github.com/apache/polaris/blob/c9efc6c1af202686945efe2e19125e8f116a0206/runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java#L194

Yufei
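For reference, the distinction is visible in the Iceberg API itself. A minimal sketch (illustrative only; `snapshot` and `io` are assumed handles):
---
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.io.CloseableIterable;

// allManifests() reads only the manifest list: each ManifestFile is a small
// descriptor (path, length, partition summaries, entry counts), not the
// manifest's contents.
for (ManifestFile mf : snapshot.allManifests(io)) {
  System.out.printf("%s (%d bytes on disk)%n", mf.path(), mf.length());

  // Only this step would pull the actual manifest entries onto the heap:
  try (CloseableIterable<DataFile> entries = ManifestFiles.read(mf, io)) {
    // iterate entries lazily if needed ...
  }
}
---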
On Tue, Jan 6, 2026 at 8:46 PM Yufei Gu <[email protected]> wrote:

Thanks Adam and Robert for the replies.

Just to make sure I am understanding this correctly: the core issue the proposal is addressing is described in https://github.com/apache/polaris/issues/2365. Today, the full binary Iceberg manifest files, base64 encoded, are embedded in the task entities. As a result, when a purge runs, all manifests for a table end up being materialized in memory at once. This behavior creates significant heap pressure and may lead to out-of-memory failures during purge operations.

Please let me know if this matches your intent, or if I am missing anything.

Yufei
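Schematically, the pattern described above looks like the following; the names are hypothetical and do not match the actual Polaris entity model:
---
import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;

record TaskEntity(String payloadBase64) {}

// All encoded payloads are built up and held at the same time, so peak heap
// usage is the sum of every serialized manifest representation, plus the
// roughly 33% size overhead that base64 adds.
List<TaskEntity> tasks = serializedManifests.stream() // assumed List<byte[]>
    .map(bytes -> new TaskEntity(Base64.getEncoder().encodeToString(bytes)))
    .collect(Collectors.toList());
---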
On Sat, Dec 20, 2025 at 4:53 AM Robert Stupp <[email protected]> wrote:

Hi all,

Thanks Adam! You're spot on!

I wanted to separate the discussions about this functionality and the related async & reliable tasks proposal [1].

The functionality of this one is generally intended for long(er) running operations against object stores, and it already provides the necessary functionality to fix the existing OOM issue.

Robert

[1] https://lists.apache.org/thread/kqm0w38p7bnojq455yz7d2vdfp6ky1h7

On Fri, Dec 19, 2025 at 3:43 PM Adam Christian <[email protected]> wrote:

Howdy Robert,

I reviewed the PR and it is very clean. I really enjoy the simplicity of the FileOperations interface. I think that's going to be a great extension point.

One of the things I was wondering about was what to do with the current purge implementation. I understand that it has a high likelihood of running into an out-of-memory error [1]. I'm wondering if your idea was to build this, then replace the current implementation. I'd love to ensure that we have a plan for one clean, stable implementation.

[1] https://github.com/apache/polaris/issues/2365

Go community,

Adam

On Tue, Dec 16, 2025 at 10:25 AM Adam Christian <[email protected]> wrote:

Hi Yufei,

Great questions! From what I can see in the PR, here are the answers to your questions:
1. The first major scenario is improving the memory concerns with purge. That's important to stabilize a core use case in the service.
2. These are related specifically to file operations. I cannot see a way that it would be broader than that.

Go community,

Adam

On Mon, Dec 15, 2025, 3:20 PM Yufei Gu <[email protected]> wrote:

Hi Robert,

Thanks for sharing the proposal and the PR. Before diving deeper into the API shape, I was hoping to better understand the intended use cases you have in mind:

1. What concrete scenarios are you primarily targeting with these long-running object store operations?
2. Are these mostly expected to be file/object-level maintenance tasks (e.g. purge, cleanup), or do you envision broader categories of operations leveraging the same abstraction?

Having a clearer picture of the motivating use cases would help evaluate the right level of abstraction and where this should live architecturally.

Looking forward to the discussion.

Yufei

On Fri, Dec 12, 2025 at 3:48 AM Robert Stupp <[email protected]> wrote:

Hi all,

I'd like to propose an API and corresponding implementation [1] for (long-running) object store operations.

It provides a CPU- and heap-friendly API and implementation to work against object stores. It is built in a way to provide "pluggable" functionality. What I mean is this (Java pseudo code):
---
FileOperations fileOps =
    fileOperationsFactory.createFileOperations(fileIoInstance);
Stream<FileSpec> allIcebergTableFiles =
    fileOps.identifyIcebergTableFiles(metadataLocation);
PurgeStats purged = fileOps.purge(allIcebergTableFiles);
// or simpler:
PurgeStats purged = fileOps.purgeIcebergTable(metadataLocation);
// or similarly for Iceberg views
PurgeStats purged = fileOps.purgeIcebergView(metadataLocation);
// or to purge all files underneath a prefix
PurgeStats purged = fileOps.purge(fileOps.findFiles(prefix));
---

Not mentioned in the pseudo code is the ability to rate-limit the number of purged files or batch-deletions, and to configure the deletion batch-size.
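As one possible shape for that, not the PR's actual implementation, a sketch assuming Guava's RateLimiter and Iceberg's SupportsBulkOperations for batch deletes:
---
import com.google.common.collect.Iterators;
import com.google.common.util.concurrent.RateLimiter;
import java.util.Iterator;
import java.util.List;
import org.apache.iceberg.io.SupportsBulkOperations;

// Delete files in fixed-size batches, throttled to a target deletion rate.
void purgeInBatches(SupportsBulkOperations io, Iterator<String> paths,
                    int batchSize, double deletesPerSecond) {
  RateLimiter limiter = RateLimiter.create(deletesPerSecond);
  Iterator<List<String>> batches = Iterators.partition(paths, batchSize);
  while (batches.hasNext()) {
    List<String> batch = batches.next();
    limiter.acquire(batch.size()); // block until the rate budget allows
    io.deleteFiles(batch);         // one bulk delete per batch
  }
}
---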
The PR already contains tests against an on-heap object store mock, plus integration tests against S3/GCS/Azure emulators.

More details can be found in the README [2] included in the PR and, of course, in the code in the PR.

Robert

[1] https://github.com/apache/polaris/pull/3256
[2] https://github.com/snazy/polaris/blob/obj-store-ops/storage/files/README.md
