On 18.02.25 17:10, Fokko Driesprong wrote:

    Reality is that Iceberg did write '-1' into current-snapshot-id
    (and other "non-exist" marker values for schema/spec/sort) instead
    of omitting the field.


Yes, but this is wrong. The spec dictates <https://iceberg.apache.org/spec/#table-metadata-fields> under current-snapshot-id:

    long ID of the current table snapshot; must be the same as the
    current ID of the main branch in refs.

True, it's wrong. (Or: Iceberg violated its own spec ;) )

In the Java 1.7.1 code, we don't see this entry in refs <https://github.com/apache/iceberg/blob/4a432839233f2343a9eae8255532f911f06358ef/core/src/test/java/org/apache/iceberg/rest/responses/TestLoadTableResponseParser.java#L110-L111>. My point is, we need to exclude -1 from that as well. My concern is that this is a slippery slope, and start encoding all kinds of historical bugs into the spec.

Quite slippery, yes. But the -1 values are out there in the wild. The (special) meaning of that value needs to be in the spec IMHO.

    I don't think there is any reason to change the behavior of
    writing null / missing unless that's really a point of
    confusion for folks.


This is laid out in the reader/writer requirements <https://iceberg.apache.org/spec/#writer-requirements>, and should be updated if it is confusing.

    Is there a JSON library folks are using which distinguishes
    between missing/null?


Yes, pydantic allows you to control this. If you define a field, and it is missing, it will throw an error when you don't have a default value defined. The Java TableMetadataV2ValidMinimalTest <https://github.com/apache/iceberg/blob/main/core/src/test/resources/TableMetadataV2ValidMinimal.json> omits current-snapshot-id.

Jackson as well BTW.


How about reverting #11560 <https://github.com/apache/iceberg/pull/11560> for 1.8.1, and then reinstating this for 2.0.0?

SGTM!



Kind regards,
Fokko





Op di 18 feb 2025 om 16:55 schreef Russell Spitzer <russell.spit...@gmail.com>:

    The only thing I think I agree with is defining that -1 has meant
    "no current snapshot" in the past and
    is equivalent to a missing/nuil (if we have to specify that) .

    I don't think there is any reason to change the behavior of
    writing null / missing unless
    that's really a point of confusion for folks. Is there a JSON
    library folks are using
     which distinguishes between missing/null?

    Would having field:null be the same as missing have avoided the
    issue because it seems like libraries
    that only handled non-null would also not handle "missing"
    current-snapshot-id as well?

    I'd really like to hear from other implementers here since
    changing all this again would be a lot of
    work and I thought we had understanding of Optional == Nullable as
    a valid thing in the spec.

    On Tue, Feb 18, 2025 at 7:35 AM Robert Stupp <sn...@snazy.de> wrote:

        Correcting myself: schema/spec/sort seem to be always present
        - please ignore that part in my previous email. The valid
        values for those fields however should be defined.

        On 18.02.25 14:29, Robert Stupp wrote:

        Reality is that Iceberg did write '-1' into
        current-snapshot-id (and other "non-exist" marker values for
        schema/spec/sort) instead of omitting the field.

        Side note: the table-spec says that these fields are
        optional, but nothing about whether it is nullable.

        The spec should at least be amended to explicitly define the
        valid values (-1 for current-snapshot-id for no snapshot is
        just a fact now that it's there). IMO that field being 'null'
        isn't defined in the spec, but the absence of the field is.

        Proposal for the implementation:

        * revert https://github.com/apache/iceberg/pull/11560 in
        1.8.1 and on main

        Proposal for the spec (for current-snapshot-id):

        * Define the valid value ranges for the field
        * Define that the absence of the field means "no current
        snapshot"
        * Define that the value -1 of the field means "no current
        snapshot"
        * Define that the value 'null' of the field means "no current
        snapshot"
        * Define that new implementations must not write the field if
        there's no current snapshot

        Proposal for the spec (for schema/spec/sort):

        * Define the valid value ranges for the fields
        * Define the "schema/spec/sort not present" values (the
        fields are optional for v1 but required for v2+v3).
        * OR Define that "schema/spec/sort must be absent" if there
        is no current schema/spec/sort.

        WDYT?


        On 17.02.25 21:07, Russell Spitzer wrote:
        It sounds like the argument here is that we should change
        the Spec for V1, V2, and V3 to mark current-snapshot-id
        as required. Then we should change all other implementations
        to follow this new standard. I'm not sure that
        is a good solution going forwards but I'm not sure of how we
        can support catalogs/engines that cannot handle a null
        correctly in this situation otherwise. Perhaps we should
        source out to see if any other implementers worked off the
        assumption of a non-optional "current-snapshot-id" and if we
        get a critical mass we can try to make that change?
        Because of how wide that change would be, I think we would
        need pretty broad consensus to do so.

        We could possibly also have a flag to allow the old behavior
        but that also feels wrong to me, we have often gone with a
        motto of
        read "wrong" write "correct" for things like this in the
        past and continuing to write "wrong" is a disservice to any
        new implementers . When we do have a contradiction between
        our implementation and the spec I think we have
        to trust that implementers followed the spec and fix the
        core library.

        Are there any other solutions here?

        On Mon, Feb 17, 2025 at 11:45 AM Fokko Driesprong
        <fo...@apache.org> wrote:

            Hey Robert,

                The thing is, that -1 cannot "go away".


            Yes, I agree, but that's also the case for null, as the
            field is optional in the spec
            
<https://iceberg.apache.org/spec/?column-projection#table-metadata-fields>.
            Therefore we support both in PyIceberg
            
<https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/table/metadata.py#L71-L77>,
            Iceberg-Rust
            
<https://github.com/apache/iceberg-rust/blob/752d69041e0461989c48dd1ca79bcff577776f5d/crates/iceberg/src/spec/table_metadata.rs#L500>,
            Iceberg-Java
            
<https://github.com/apache/iceberg/blob/bcbbd0344623ffea5b092e2de5debb0bc12892a1/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L458-L462>,
            and Iceberg-Go
            
<https://github.com/apache/iceberg-go/blob/ada5480954d9b41d2f8eb4c765523614fad65e1a/table/metadata.go#L837-L841>.
            On the write side, they all produce null instead of -1.
            Therefore, I was surprised that it comes up now, and not
            earlier.

                I'd prefer to keep the previous behavior - otherwise
                implementations may fall back to 0, which is
                definitely wrong.


            I'm not seeing why it would fall back to 0, and I agree,
            that's wrong.

                Would be better IMHO not to break existing
                implementations / render existing setups
                incompatible with Iceberg 1.8.


            In my opinion, if this had been caught in an RC, it
            would be open for discussion, but that ship has sailed.
            Let's hear what others think.

            Kind regards,
            Fokko

            Op ma 17 feb 2025 om 18:16 schreef Robert Stupp
            <sn...@snazy.de>:

                Hi Fokko,

                sure, in general "absent" or "null" would be
                cleaner. But now we have two representations for the
                same case - I suspect most went with the "reference
                behavior".

                The thing is, that -1 cannot "go away".

                I'd prefer to keep the previous behavior - otherwise
                implementations may fall back to 0, which is
                definitely wrong. Would be better IMHO not to break
                existing implementations / render existing setups
                incompatible with Iceberg 1.8.


                On 17.02.25 15:49, Fokko Driesprong wrote:
                Hey Robert,

                Thanks for raising this.

                    snapshot-ID -1 isn't per-se invalid, because
                    the valid values are not defined in the spec.


                For me, this is invalid, since there is no snapshot
                with -1 in the snapshots property. In the tests
                with the PR, you can see that there are no
                snapshots
                
<https://github.com/apache/iceberg/pull/11560/files#diff-41bdfb6698d2aa7b47ff7d5fabc558a5a64f8b7496fe1bcd8f8ecb69b2afc128R112>.
                A year ago we had a similar discussion on PyIceberg
                
<https://py.iceberg.apache.org/configuration/#backward-compatibility>
                around this and this ended up in adding a flag to
                fall back to the old behavior. I do agree that we
                should have communicated this more clearly with
                the release.

                Kind regards,
                Fokko



                Op ma 17 feb 2025 om 12:25 schreef Robert Stupp
                <sn...@snazy.de>:

                    Feels like
                    https://github.com/apache/iceberg/pull/11560
                    introduced a behavior change.

                    snapshot-ID -1 isn't per-se invalid, because
                    the valid values are not defined in the spec.

                    Previous Iceberg-Java versions always produced
                    -1 if there's no current snapshot - 1.8
                    produces `null` in that case. So there are now
                    two _different_ values for "no current snapshot".

                    Implementations that rely on the behavior of
                    the "reference implementation" (Iceberg-Java)
                    do now fail in case there's no current snapshot.

                    On 13.02.25 10:09, Amogh Jahagirdar wrote:
                    I'm pleased to announce the release of Apache
                    Iceberg 1.8.0!

                    Apache Iceberg is an open table format for
                    huge analytic datasets. Iceberg
                    delivers high query performance for tables
                    with tens of petabytes of data,
                    along with atomic commits, concurrent writes,
                    and SQL-compatible table
                    evolution.

                    This release can be downloaded from:
                    
https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.8.0/apache-iceberg-1.8.0.tar.gz


                    Release notes:
                    https://iceberg.apache.org/releases/#180-release
                    <https://iceberg.apache.org/releases/180-release>

                    Java artifacts are available from Maven Central.

                    Thanks to everyone for contributing!

-- Robert Stupp
                    @snazy

-- Robert Stupp
                @snazy

-- Robert Stupp
        @snazy

-- Robert Stupp
        @snazy

--
Robert Stupp
@snazy

Reply via email to