danielcweeks commented on code in PR #14264:
URL: https://github.com/apache/iceberg/pull/14264#discussion_r2446353511
##########
core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java:
##########
@@ -133,13 +158,473 @@ private static Map<Long, Integer> computeSnapshotOrdinals(Deque<Snapshot> snapsh
return snapshotOrdinals;
}
+ /**
+ * Builds a delete file index for existing deletes that were present before the start snapshot.
+ * These deletes should be applied to data files but should not generate DELETE changelog rows.
+ * Uses manifest pruning and caching to optimize performance.
+ */
+ private DeleteFileIndex buildExistingDeleteIndex(
+ Long fromSnapshotIdExclusive, Map<Long, DeleteFileIndex> addedDeletesBySnapshot) {
+ if (fromSnapshotIdExclusive == null) {
+ return DeleteFileIndex.builderFor(ImmutableList.of()).build();
+ }
+
+ // Check if we need existingDeleteIndex for equality deletes
+ boolean needsExistingDeleteIndex = false;
Review Comment:
It feels like we're doing two passes over the deletes. Can't we integrate
this into the first pass?
I also feel like this needs to be lazy (as written, it isn't really lazy),
since we may need to build the index as we pass through the data files to
determine whether any of them have been removed.
I may be wrong, but we could potentially build this at two different
points:
1. building the existing delete index for the snapshots in the scan range, if
any equality deletes are found
2. building the delete index while processing the manifest files, if there
are any removed files
I feel like this should be possible just-in-time. A lot of CDF-type use
cases are append-only workloads, and those would avoid building the delete
index entirely.
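To make the just-in-time idea concrete, here is a minimal sketch. It is not Iceberg code and not a proposed patch: `Lazy`, `Entry`, and `countIndexBuilds` are hypothetical names, and the list of strings only stands in for a real delete index. It assumes the index is needed only when a removed data file or an equality delete shows up during the pass.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class LazyDeleteIndexSketch {

  /** Computes a value on first access and caches it afterwards. */
  static final class Lazy<T> implements Supplier<T> {
    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    Lazy(Supplier<T> loader) {
      this.loader = loader;
    }

    @Override
    public synchronized T get() {
      if (!loaded) {
        value = loader.get();
        loaded = true;
      }
      return value;
    }
  }

  /** Hypothetical stand-in for a manifest entry seen during planning. */
  static final class Entry {
    final String path;
    final boolean removedDataFile;
    final boolean equalityDelete;

    Entry(String path, boolean removedDataFile, boolean equalityDelete) {
      this.path = path;
      this.removedDataFile = removedDataFile;
      this.equalityDelete = equalityDelete;
    }
  }

  /** Makes one pass over the entries and returns how many times the index was built. */
  static int countIndexBuilds(List<Entry> entries) {
    AtomicInteger builds = new AtomicInteger();
    Lazy<List<String>> existingDeletes =
        new Lazy<>(
            () -> {
              builds.incrementAndGet(); // stands in for reading delete manifests
              return Arrays.asList("existing-delete.parquet");
            });

    for (Entry entry : entries) {
      // Only touch the index when this entry actually needs it.
      if (entry.removedDataFile || entry.equalityDelete) {
        existingDeletes.get();
      }
    }
    return builds.get();
  }

  public static void main(String[] args) {
    List<Entry> appendOnly =
        Arrays.asList(new Entry("a.parquet", false, false), new Entry("b.parquet", false, false));
    List<Entry> withRemoval =
        Arrays.asList(
            new Entry("a.parquet", false, false),
            new Entry("b.parquet", true, false),
            new Entry("eq.parquet", false, true));

    System.out.println("append-only builds: " + countIndexBuilds(appendOnly));
    System.out.println("with-removal builds: " + countIndexBuilds(withRemoval));
  }
}
```

With this shape, an append-only scan never pays for the index, and a scan that hits several removed files or equality deletes still builds it only once.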