+1 (non-binding) On Thu, Apr 17, 2025 at 5:14 PM Aihua Xu <aihu...@gmail.com> wrote:
> + (non-binding). > > On Thu, Apr 17, 2025 at 11:22 AM Steven Wu <stevenz...@gmail.com> wrote: > >> +1 (binding) >> >> On Thu, Apr 17, 2025 at 11:09 AM Amogh Jahagirdar <2am...@gmail.com> >> wrote: >> >>> +1 (binding) >>> >>> On Thu, Apr 17, 2025 at 11:54 AM Szehon Ho <szehon.apa...@gmail.com> >>> wrote: >>> >>>> +1 (binding) Seems cleaner to me. >>>> >>>> Thanks >>>> Szehon >>>> >>>> On Thu, Apr 17, 2025 at 10:31 AM Russell Spitzer < >>>> russell.spit...@gmail.com> wrote: >>>> >>>>> +1 >>>>> >>>>> On Thu, Apr 17, 2025 at 12:30 PM Ryan Blue <rdb...@gmail.com> wrote: >>>>> >>>>>> Adding my own +1. >>>>>> >>>>>> On Thu, Apr 17, 2025 at 10:19 AM Daniel Weeks <dwe...@apache.org> >>>>>> wrote: >>>>>> >>>>>>> +1 (binding) >>>>>>> >>>>>>> I think this update really helps ensure row ids will be present and >>>>>>> reliable for upgraded tables. Thanks Ryan! >>>>>>> >>>>>>> On Wed, Apr 16, 2025 at 4:09 PM Ryan Blue <rdb...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi everyone, >>>>>>>> >>>>>>>> I’d like to start a vote to incorporate the spec changes in PR >>>>>>>> 12781 <https://github.com/apache/iceberg/pull/12781>. >>>>>>>> >>>>>>>> There are two main changes. First, the current language says that >>>>>>>> upgrading a table to v3 leaves all row IDs null and they are assigned >>>>>>>> when >>>>>>>> the rows are rewritten for the first time (either to move or modify the >>>>>>>> row). The problem with this is that row IDs are missing until the >>>>>>>> entire >>>>>>>> table is rewritten, which means that the feature is unreliable. >>>>>>>> Instead, I >>>>>>>> propose that row IDs are assigned in the first write after upgrading >>>>>>>> to v3. >>>>>>>> >>>>>>>> In addition to making row IDs more useful, the change to how we >>>>>>>> upgrade tables allows us to simplify the spec with statements like “any >>>>>>>> added or existing data file without first_row_id should be >>>>>>>> assigned one via inheritance” and “any manifest without a >>>>>>>> first_row_id must be assigned one when writing a manifest list”. I >>>>>>>> think this sets clearer expectations. >>>>>>>> >>>>>>>> Second, I found some issues with the strict way that first_row_id >>>>>>>> is inherited and assigned in the metadata tree. The current wording >>>>>>>> would >>>>>>>> prevent writers from assigning row IDs to existing data files because >>>>>>>> assignment was strict and only accounted for added files. Instead, I >>>>>>>> propose changing the wording to “must be greater than or equal to” so >>>>>>>> that >>>>>>>> there is some flexibility, and giving simple examples that are safe, >>>>>>>> like first_row_id >>>>>>>> = last_assigned.first_row_id + last_assigned.added_rows_count + >>>>>>>> last_assigned.existing_rows_count. >>>>>>>> >>>>>>>> Please take a look at the PR and vote in the next 72 hours. >>>>>>>> >>>>>>>> [ ] +1 Add these changes to the spec for v3 row lineage >>>>>>>> [ ] +0 >>>>>>>> [ ] -1 I have questions and/or concerns >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Ryan >>>>>>>> >>>>>>>