Re: [DISCUSS] Ambiguity over 'total-position-deletes' as V2 (Legacy) or V3 (Deletion Vectors) in Scan Planning

Jean-Baptiste Onofré Sun, 27 Jul 2025 22:58:48 -0700

Hi

My understanding is that the problem arises during the "transition"
period when upgrading from V2 to V3. The reader/writer can check
format-version to see what to expect in terms of DVs (metrics).
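
As a rough sketch of that check (Python; the function and enum names are
hypothetical and not part of any Iceberg library, and treating a missing
counter as zero is an assumption), a planner could combine format-version
with the snapshot summary like this:

```python
from enum import Enum


class PositionDeleteKinds(Enum):
    """Hypothetical classification of what a snapshot might contain."""
    NONE = "none"
    POSITION_DELETE_FILES = "position-delete-files"
    DVS_OR_POSITION_DELETE_FILES = "dvs-or-position-delete-files"


def possible_position_deletes(format_version: int, summary: dict) -> PositionDeleteKinds:
    """Classify which position-delete mechanisms a snapshot might contain.

    `summary` is the snapshot summary map from metadata.json, whose values
    are strings.
    """
    # Assumption: a missing counter is treated as zero. A stricter planner
    # could treat it as "unknown" and plan conservatively instead.
    total = int(summary.get("total-position-deletes", "0"))
    if total == 0:
        return PositionDeleteKinds.NONE
    if format_version < 3:
        # DVs do not exist before V3, so only position delete files are possible.
        return PositionDeleteKinds.POSITION_DELETE_FILES
    # In V3 the counter alone cannot separate DVs from legacy delete files.
    return PositionDeleteKinds.DVS_OR_POSITION_DELETE_FILES
```

For a V2 table the answer is unambiguous. For a V3 table the summary alone
cannot tell DVs apart from legacy position delete files, which is the case
raised in the original message.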


Regards
JB

On Mon, Jul 28, 2025 at 6:43 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>
> Hi Jordan,
>
> FYI, Anton explained his rationale for not adding total-dvs in the original 
> PR [1].
> You may also refer to iceberg-java's implementation [2] for scan planning, 
> which handles both position deletes and deletion vectors in a 
> straightforward way.
>
> I'm curious which language you are building your engine in. I think all 
> implementations need to handle this and you don't need to build your own.
>
> 1. https://github.com/apache/iceberg/pull/11464/files#r1828388869
> 2. https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeleteFileIndex.java
>
> Regards,
> Manu
>
> On Fri, Jul 25, 2025 at 12:13 AM Jordano Mark <jordanom...@gmail.com> wrote:
>>
>> Hi everyone, below I describe an observation that I would like to discuss 
>> with the community.
>>
>>
>> Context:
>>
>> Some query engines construct scan plans dynamically based on the metrics 
>> provided in an Iceberg table's metadata.json. For example, when an engine 
>> encounters a table with equality deletes, it may rely on the 
>> 'total-equality-deletes' metric (as defined in the Iceberg specification 
>> here: https://iceberg.apache.org/spec/#metrics) to determine whether 
>> equality delete handling logic needs to be engaged during scan planning.
>>
>> A similar approach is commonly taken for position deletes. Engines may use 
>> the 'total-position-deletes' metric to decide whether position deletes need 
>> to be accounted for. However, with the introduction of Deletion Vectors (DV) 
>> in Iceberg V3, this interpretation of the 'total-position-deletes' field 
>> becomes more ambiguous.
>>
>>
>> Problem:
>>
>> The core issue is this: when total-position-deletes > 0 in a V3 table, it 
>> may indicate that:
>>
>> - Legacy position delete files (V2) exist
>> - Deletion vectors (V3) exist
>> - Or both
>>
>> This ambiguity introduces complexity in scan planning. In cases where the 
>> physical plan for reading legacy position deletes differs meaningfully from 
>> reading deletion vectors, engines must conservatively assume both mechanisms 
>> might be in play—even if only one is present. This can lead to unnecessarily 
>> complex or suboptimal planning.
>>
>> I’ve noticed there is an 'added-dvs' metric, but no 'total-dvs' equivalent 
>> listed in the Iceberg spec’s Metrics section. As a result, 
>> total-position-deletes appears to serve as a catch-all for both V2 and V3 
>> position deletes. For engines that rely solely on snapshot-level metrics, 
>> this becomes a blind spot. The issue extends beyond the transition period 
>> between V2 and V3 too: even after migrating fully to V3, a table might 
>> still retain legacy delete files. Currently, there appears to be no 
>> consistent, guaranteed way to prove at the metadata level that only V3 
>> deletion vectors are in use. Some inference is possible by walking the 
>> snapshot history and aggregating metrics, but this is fragile and 
>> case-specific.
>>
>> It is not viable to perform manifest scans at runtime to infer delete 
>> formats.
>>
>> I’m curious if others in the community have encountered this challenge — and 
>> if so, how you’re addressing it. Is there an established pattern to help 
>> distinguish V2 vs V3 deletes at the metadata level, without relying on 
>> manifest/file-level inspection?
>>
>>
>> Looking forward to hearing your thoughts.
>>
>> Best,
>>
>> Jordan
