Hi

My understanding is that the problem here arises during the "transition" period when upgrading from V2 to V3. The reader/writer can check format-version to see what to expect in terms of DVs (metrics).
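
As a rough illustration of that check, here is a minimal Python sketch. It assumes the table's metadata.json has already been parsed into a dict and uses only the spec field names `format-version`, `current-snapshot-id`, `snapshots`, `summary`, and `total-position-deletes`. The function name and the two labels it returns are hypothetical, not part of any Iceberg library. It is not how iceberg-java plans scans; it only shows what the metadata alone can and cannot tell you.

```python
def position_delete_kinds(metadata: dict) -> set:
    """Return the position-delete mechanisms a reader may have to handle.

    Hypothetical helper: the labels "position-delete-files" and
    "deletion-vectors" are illustrative names, not Iceberg identifiers.
    """
    version = int(metadata["format-version"])
    if version < 2:
        # Row-level deletes were introduced in format version 2.
        return set()

    # Conservative answer when the summary cannot rule anything out.
    # DVs only exist in V3+, so a V2 table can only hold delete files.
    possible = {"position-delete-files"}
    if version >= 3:
        possible.add("deletion-vectors")

    current_id = metadata.get("current-snapshot-id")
    snapshot = next(
        (s for s in metadata.get("snapshots", [])
         if s.get("snapshot-id") == current_id),
        None,
    )
    if snapshot is None:
        # No current snapshot means there is nothing to scan.
        return set()

    total = snapshot.get("summary", {}).get("total-position-deletes")
    if total is None:
        # The metric is optional; its absence proves nothing.
        return possible
    if int(total) == 0:
        return set()

    # total > 0: unambiguous in V2, but in V3 the count does not say
    # whether it comes from legacy delete files, DVs, or both.
    return possible
```

For a V2 table with `total-position-deletes` > 0 this returns only the legacy delete-file label. For a V3 table it returns both, which is the ambiguity Jordan describes.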

Regards
JB

On Mon, Jul 28, 2025 at 6:43 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>
> Hi Jordan,
>
> FYI, Anton explained his rationale for not adding total-dvs in the original
> PR [1].
> You may also refer to iceberg-java's implementation [2] for scan planning,
> which looks straightforward in how it handles both position deletes and
> deletion vectors.
>
> I'm curious which language you are building your engine in. I think all
> implementations need to handle this and you don't need to build your own.
>
> 1. https://github.com/apache/iceberg/pull/11464/files#r1828388869
> 2.
> https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeleteFileIndex.java
>
> Regards,
> Manu
>
> On Fri, Jul 25, 2025 at 12:13 AM Jordano Mark <jordanom...@gmail.com> wrote:
>>
>> Hi everyone, below I describe an observation I’ve made, in the hope of
>> discussing it with the community.
>>
>>
>> Context:
>>
>> Some query engines construct scan plans dynamically based on the metrics
>> provided in an Iceberg table's metadata.json. For example, when an engine
>> encounters a table with equality deletes, it may rely on the
>> 'total-equality-deletes' metric (as defined in the Iceberg specification
>> here: https://iceberg.apache.org/spec/#metrics) to determine whether
>> equality delete handling logic needs to be engaged during scan planning.
>>
>> A similar approach is commonly taken for position deletes. Engines may use
>> the 'total-position-deletes' metric to decide whether position deletes need
>> to be accounted for. However, with the introduction of Deletion Vectors (DV)
>> in Iceberg V3, the interpretation of the 'total-position-deletes' field
>> becomes ambiguous.
>>
>>
>> Problem:
>>
>> The core issue is this: when total-position-deletes > 0 in a V3 table, it
>> may indicate that:
>>
>>   - Legacy position delete files (V2) exist
>>
>>   - Deletion vectors (V3) exist
>>
>>   - Or both
>>
>> This ambiguity introduces complexity in scan planning. In cases where the
>> physical plan for reading legacy position deletes differs meaningfully from
>> reading deletion vectors, engines must conservatively assume both mechanisms
>> might be in play, even if only one is present. This can lead to unnecessarily
>> complex or suboptimal planning.
>>
>> I’ve noticed there is an 'added-dvs' metric, but no 'total-dvs' equivalent
>> listed in the Iceberg spec’s Metrics section. As a result,
>> total-position-deletes appears to serve as a catch-all for both V2 and V3
>> position deletes. For engines that rely solely on snapshot-level metrics,
>> this becomes a blind spot. The issue extends beyond the transition period
>> between V2 and V3 too: even after migrating fully to V3, a table might
>> still retain legacy delete files. Currently, there appears to be no
>> consistent, guaranteed way to prove at the metadata level that only V3
>> deletion vectors are in use. Some inference is possible by walking the
>> snapshot history and aggregating metrics, but this is fragile and
>> case-specific.
>>
>> It is not viable to perform manifest scans at runtime to infer delete
>> formats.
>>
>> I’m curious whether others in the community have encountered this challenge
>> and, if so, how you’re addressing it. Is there an established pattern to help
>> distinguish V2 vs V3 deletes at the metadata level, without relying on
>> manifest/file-level inspection?
>>
>>
>> Looking forward to hearing your thoughts.
>>
>> Best,
>>
>> Jordan