Re: [DISCUSS] Partition tuples in v4

Micah Kornfield Wed, 06 May 2026 15:43:25 -0700

Was there discussion on whether the tuple or the stats will be used for
identity partition columns in column projection [1]? This is an edge case
we support for migrated Hive tables.


Thanks,
Micah

[1] https://iceberg.apache.org/spec/#column-projection

On Wed, May 6, 2026 at 1:46 PM Steven Wu <[email protected]> wrote:

> I watched the recording. Ryan's arguments make sense (especially on where
> we spend the effort). I am on board with keeping the partition tuple for now.
>
> I also agree with Russell's point about limiting partition tuples only to
> equality deletes in v4 and extending the stats approach to cover
> non-monotonic bucketing transforms and multi-arg transforms for pruning.
>
> On Mon, May 4, 2026 at 2:51 PM Russell Spitzer <[email protected]>
> wrote:
>
>> As we discussed in the community sync, I recommend we keep the partition
>> tuple for now. It's the simplest way to maintain the guarantees needed for
>> equality deletes.
>>
>> Going forward, we shouldn't rely on these values for filtering (imho) and
>> should instead work to extend the stats struct approach to cover bucketing,
>> non-range-preserving, and multi-arg transforms. To this end, I would try to
>> make sure none of our v4 planning code interacts with the tuple directly,
>> except when falling back for v3-based logic. Isolating tuple access this
>> way means we can cleanly remove it later without reworking v4 planning
>> paths.
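
One way to picture the isolation described above is a small interface that planning code depends on, with the tuple hidden behind a fallback implementation. This is a rough sketch only: `PruningSource`, `StatsSource`, `TupleFallbackSource`, and `may_match_eq` are hypothetical names for illustration, not Iceberg classes, and null partition values are simply treated as "unknown" here.

```python
from typing import Any, Mapping, Optional, Protocol, Tuple

Bounds = Tuple[Optional[Any], Optional[Any]]


class PruningSource(Protocol):
    """Hypothetical interface: the only thing planning code is allowed to see."""

    def bounds(self, field_id: int) -> Bounds: ...


class StatsSource:
    """Assumed v4 path: bounds come from per-field stats."""

    def __init__(self, stats: Mapping[int, Bounds]) -> None:
        self._stats = stats

    def bounds(self, field_id: int) -> Bounds:
        # A field with no stats yields (None, None), i.e. unknown.
        return self._stats.get(field_id, (None, None))


class TupleFallbackSource:
    """Assumed v3 fallback: the single place that touches the partition tuple."""

    def __init__(self, partition_tuple: Mapping[int, Any]) -> None:
        self._tuple = partition_tuple

    def bounds(self, field_id: int) -> Bounds:
        # An exact partition value is exposed as a degenerate range [v, v].
        value = self._tuple.get(field_id)
        return (value, value)


def may_match_eq(source: PruningSource, field_id: int, literal: Any) -> bool:
    """Planning-side check for `field == literal`; never reads the tuple itself."""
    lower, upper = source.bounds(field_id)
    if lower is not None and literal < lower:
        return False
    if upper is not None and literal > upper:
        return False
    return True
```

With this shape, removing the tuple later would mean deleting `TupleFallbackSource` while `may_match_eq` stays untouched.
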
>>
>> In my ideal world we drop the tuple and equality deletes, but this seems
>> like the way to make progress now while leaving the door open to remove the
>> tuple before v4 is finalized.
>>
>> On Mon, May 4, 2026 at 10:00 AM Anoop Johnson <[email protected]> wrote:
>>
>>> Amogh,
>>>
>>> That is a good point. But the partition and stats-based evaluation paths
>>> are typically separate. For partition evaluation, we compare against an
>>> exact value, and for stats-based pruning, we look at the range of values in
>>> the column stats.
>>>
>>> Even if we store partition values in the content stats, it would follow
>>> the partition evaluation path. The new V4 manifest reader would just need
>>> to look at the partition value's lower_bound in the content stats instead
>>> of an explicit partition tuple field. The partition evaluator itself will
>>> be unchanged.
>>>
>>> This is conceptually no different than the current partition tuple.
>>> Storing it in content_stats with only lower_bound preserves the same
>>> semantics, but aligns with how the rest of the column stats are stored.
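
The two evaluation paths described above can be sketched roughly as follows. This is an illustration under assumptions, not Iceberg code: `ColumnStats`, `partition_value`, `partition_matches`, and `range_may_match` are hypothetical names, and the convention that the partition value lives in `lower_bound` is the proposal being discussed, not settled behavior.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical stand-in for one field's entry in the content stats."""

    lower_bound: Optional[Any] = None
    upper_bound: Optional[Any] = None


def partition_value(stats: ColumnStats) -> Any:
    # Assumed convention from the proposal: the partition value is stored
    # in lower_bound only, replacing the explicit tuple field.
    return stats.lower_bound


def partition_matches(stats: ColumnStats, literal: Any) -> bool:
    # Partition path: compare against one exact value.
    return partition_value(stats) == literal


def range_may_match(stats: ColumnStats, literal: Any) -> bool:
    # Stats path: the file may contain `literal` unless a known bound
    # excludes it. A missing bound is unknown and cannot be used to prune.
    if stats.lower_bound is not None and literal < stats.lower_bound:
        return False
    if stats.upper_bound is not None and literal > stats.upper_bound:
        return False
    return True
```

Only `partition_value` changes if the storage location moves; `partition_matches` stays the same, which matches the point that the partition evaluator itself would be unchanged.
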
>>>
>>> But let's discuss the tradeoffs of the various options. Looking forward
>>> to the discussion in an hour.
>>>
>>> Best,
>>> Anoop
>>>
>>> On Sun, May 3, 2026 at 6:45 PM Amogh Jahagirdar <[email protected]>
>>> wrote:
>>>
>>>> I realized I gave a poor example of the semantic issue with removing the
>>>> upper bound for partition outputs. The crux is that, under that model,
>>>> the stats on partition outputs would be treated specially: a null upper
>>>> bound would mean "partitioned" rather than "unknown", which is
>>>> inconsistent with the other stats.
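
The inconsistency can be made concrete with a small sketch. Both function names below are hypothetical, and the "null upper bound means exact value" rule is the modeling being questioned here, not an existing Iceberg behavior. The same stored stats give different answers depending on which reading the evaluator applies.

```python
from typing import Any, Optional


def may_contain_generic(lower: Optional[Any], upper: Optional[Any], value: Any) -> bool:
    """Usual reading of stats: a null bound is unknown, so it cannot exclude anything."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


def may_contain_partition_output(lower: Optional[Any], upper: Optional[Any], value: Any) -> bool:
    """Special-cased reading: a null upper bound means lower_bound is the one exact value."""
    if upper is None:
        return value == lower
    return may_contain_generic(lower, upper, value)


# Same stats (lower=5, upper=None), same probe value, two different answers.
generic_answer = may_contain_generic(5, None, 9)
special_answer = may_contain_partition_output(5, None, 9)
```

Here `generic_answer` is `True` (the file cannot be pruned) while `special_answer` is `False` (the file is pruned), so a reader has to know which fields are partition outputs before it can interpret a null upper bound.
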