Re: [DISCUSS] Ambiguity over 'total-position-deletes' as V2 (Legacy) or V3 (Deletion Vectors) in Scan Planning

Manu Zhang Sun, 27 Jul 2025 21:43:37 -0700

Hi Jordan,

FYI, Anton explained his rationale for not adding total-dvs in the
original PR [1].
You may also refer to iceberg-java's implementation [2] for scan planning,
which handles both position deletes and deletion vectors in a
straightforward way.
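
As a rough illustration of why handling both can be straightforward once a
planner is looking at delete file entries rather than summary metrics: as far
as I understand the V3 spec, a deletion vector is a position delete stored in
a Puffin file, so the two kinds can be told apart per entry. The sketch below
is not iceberg-java's code. `DeleteEntry`, `is_dv`, and
`split_position_deletes` are hypothetical names, and the content codes and the
Puffin check are assumptions based on my reading of the spec.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Assumed content code for position deletes in a delete file entry.
POSITION_DELETES = 1


@dataclass(frozen=True)
class DeleteEntry:
    """Hypothetical, simplified view of one delete file entry."""
    path: str
    content: int      # assumed: 1 = position deletes, 2 = equality deletes
    file_format: str  # e.g. "parquet", "puffin"


def is_dv(entry: DeleteEntry) -> bool:
    # Assumption: a deletion vector is a position delete stored in Puffin.
    return (
        entry.content == POSITION_DELETES
        and entry.file_format.lower() == "puffin"
    )


def split_position_deletes(
    entries: Iterable[DeleteEntry],
) -> Tuple[List[DeleteEntry], List[DeleteEntry]]:
    """Return (deletion_vectors, legacy_position_delete_files)."""
    dvs: List[DeleteEntry] = []
    legacy: List[DeleteEntry] = []
    for entry in entries:
        if entry.content != POSITION_DELETES:
            continue  # equality deletes are out of scope for this sketch
        (dvs if is_dv(entry) else legacy).append(entry)
    return dvs, legacy
```

Note that this classification needs the delete file entries themselves, which
is exactly the manifest-level information the summary metrics do not carry.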


I'm curious which language you are building your engine in. I think all
implementations need to handle this and you don't need to build your own.

1. https://github.com/apache/iceberg/pull/11464/files#r1828388869
2. https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeleteFileIndex.java

Regards,
Manu

On Fri, Jul 25, 2025 at 12:13 AM Jordano Mark <[email protected]> wrote:

> Hi everyone, below I describe an observation that I would like to
> discuss with the community.
>
>
> *Context:*
>
> Some query engines construct scan plans dynamically based on the metrics
> provided in an Iceberg table's metadata.json. For example, when an engine
> encounters a table with equality deletes, it may rely on the
> 'total-equality-deletes' metric (as defined in the Iceberg specification
> here: https://iceberg.apache.org/spec/#metrics) to determine whether
> equality delete handling logic needs to be engaged during scan planning.
>
> A similar approach is commonly taken for position deletes. Engines may use
> the 'total-position-deletes' metric to decide whether position deletes
> need to be accounted for. However, with the introduction of Deletion
> Vectors (DV) in Iceberg V3, this interpretation of the
> 'total-position-deletes' field becomes more ambiguous.
>
>
> *Problem:*
>
> The core issue is this: when total-position-deletes > 0 in a V3 table, it
> may indicate:
>
>
>    - Legacy position delete files (V2) exist
>    - Deletion vectors (V3) exist
>    - Or both
>
> This ambiguity introduces complexity in scan planning. In cases where the
> physical plan for reading legacy position deletes differs meaningfully from
> reading deletion vectors, *engines must conservatively assume both
> mechanisms might be in play*, even if only one is present. This can lead
> to unnecessarily complex or suboptimal planning.
>
> I’ve noticed there is an 'added-dvs' metric, but no 'total-dvs' equivalent
> listed in the Iceberg spec’s Metrics
> <https://iceberg.apache.org/spec/#metrics> section. As a result,
> total-position-deletes appears to serve as a catch-all for both V2 and V3
> position deletes. For engines that rely solely on snapshot-level metrics,
> this becomes a blind spot. The issue extends beyond the transition period
> between V2 and V3, too: even after migrating fully to V3, a table might
> still retain legacy delete files. Currently, there appears to be no
> consistent, guaranteed way to prove at the metadata level that only V3
> deletion vectors are in use. Some inference is possible by walking the
> snapshot history and aggregating metrics, but this is fragile and
> case-specific.
>
> It is not viable to perform manifest scans at runtime to infer delete
> formats.
>
> I’m curious if others in the community have encountered this challenge
> and, if so, how you’re addressing it. Is there an established pattern to
> help distinguish V2 vs V3 deletes at the metadata level, without relying on
> manifest/file-level inspection?
>
>
> Looking forward to hearing your thoughts.
>
> Best,
>
> *Jordan*
>
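
The ambiguity described in the quoted message can be sketched as a small
decision function over the snapshot summary. This is only an illustration of
the conservative reasoning an engine is forced into. The function name and
the labels are hypothetical, and it assumes summary values arrive as strings
and that a missing metric cannot be treated as zero.

```python
from typing import Mapping, Set

LEGACY = "position-delete-files"  # hypothetical label for V2-style files
DV = "deletion-vectors"           # hypothetical label for V3 deletion vectors


def possible_position_delete_kinds(
    format_version: int, summary: Mapping[str, str]
) -> Set[str]:
    """Which position delete mechanisms the planner must be ready for."""
    if format_version < 2:
        return set()  # row-level deletes were introduced in format version 2
    raw = summary.get("total-position-deletes")
    if raw is not None and int(raw) == 0:
        return set()  # the metric is present and says there are none
    if format_version == 2:
        return {LEGACY}
    # V3 and later: the metric alone cannot separate the two mechanisms,
    # so both have to be assumed.
    return {LEGACY, DV}
```

For a V3 table with a non-zero 'total-position-deletes', the result always
contains both mechanisms, which is the blind spot raised above.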
