Correcting myself: schema/spec/sort seem to be always present - please ignore that part in my previous email. The valid values for those fields however should be defined.

On 18.02.25 14:29, Robert Stupp wrote:

Reality is that Iceberg did write '-1' into current-snapshot-id (and other "non-exist" marker values for schema/spec/sort) instead of omitting the field.

Side note: the table-spec says that these fields are optional, but nothing about whether it is nullable.

The spec should at least be amended to explicitly define the valid values (-1 for current-snapshot-id for no snapshot is just a fact now that it's there). IMO that field being 'null' isn't defined in the spec, but the absence of the field is.

Proposal for the implementation:

* revert https://github.com/apache/iceberg/pull/11560 in 1.8.1 and on main

Proposal for the spec (for current-snapshot-id):

* Define the valid value ranges for the field
* Define that the absence of the field means "no current snapshot"
* Define that the value -1 of the field means "no current snapshot"
* Define that the value 'null' of the field means "no current snapshot"
* Define that new implementations must not write the field if there's no current snapshot

Proposal for the spec (for schema/spec/sort):

* Define the valid value ranges for the fields
* Define the "schema/spec/sort not present" values (the fields are optional for v1 but required for v2+v3). * OR Define that "schema/spec/sort must be absent" if there is no current schema/spec/sort.

WDYT?


On 17.02.25 21:07, Russell Spitzer wrote:
It sounds like the argument here is that we should change the Spec for V1, V2, and V3 to mark current-snapshot-id as required. Then we should change all other implementations to follow this new standard. I'm not sure that is a good solution going forwards but I'm not sure of how we can support catalogs/engines that cannot handle a null correctly in this situation otherwise. Perhaps we should source out to see if any other implementers worked off the assumption of a non-optional "current-snapshot-id" and if we get a critical mass we can try to make that change? Because of how wide that change would be, I think we would need pretty broad consensus to do so.

We could possibly also have a flag to allow the old behavior but that also feels wrong to me, we have often gone with a motto of read "wrong" write "correct" for things like this in the past and continuing to write "wrong" is a disservice to any new implementers . When we do have a contradiction between our implementation and the spec I think we have
to trust that implementers followed the spec and fix the core library.

Are there any other solutions here?

On Mon, Feb 17, 2025 at 11:45 AM Fokko Driesprong <fo...@apache.org> wrote:

    Hey Robert,

        The thing is, that -1 cannot "go away".


    Yes, I agree, but that's also the case for null, as the field is
    optional in the spec
    <https://iceberg.apache.org/spec/?column-projection#table-metadata-fields>.
    Therefore we support both in PyIceberg
    
<https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/table/metadata.py#L71-L77>,
    Iceberg-Rust
    
<https://github.com/apache/iceberg-rust/blob/752d69041e0461989c48dd1ca79bcff577776f5d/crates/iceberg/src/spec/table_metadata.rs#L500>,
    Iceberg-Java
    
<https://github.com/apache/iceberg/blob/bcbbd0344623ffea5b092e2de5debb0bc12892a1/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L458-L462>,
    and Iceberg-Go
    
<https://github.com/apache/iceberg-go/blob/ada5480954d9b41d2f8eb4c765523614fad65e1a/table/metadata.go#L837-L841>.
    On the write side, they all produce null instead of -1.
    Therefore, I was surprised that it comes up now, and not earlier.

        I'd prefer to keep the previous behavior - otherwise
        implementations may fall back to 0, which is definitely wrong.


    I'm not seeing why it would fall back to 0, and I agree, that's
    wrong.

        Would be better IMHO not to break existing implementations /
        render existing setups incompatible with Iceberg 1.8.


    In my opinion, if this had been caught in an RC, it would be open
    for discussion, but that ship has sailed. Let's hear what others
    think.

    Kind regards,
    Fokko

    Op ma 17 feb 2025 om 18:16 schreef Robert Stupp <sn...@snazy.de>:

        Hi Fokko,

        sure, in general "absent" or "null" would be cleaner. But now
        we have two representations for the same case - I suspect
        most went with the "reference behavior".

        The thing is, that -1 cannot "go away".

        I'd prefer to keep the previous behavior - otherwise
        implementations may fall back to 0, which is definitely
        wrong. Would be better IMHO not to break existing
        implementations / render existing setups incompatible with
        Iceberg 1.8.


        On 17.02.25 15:49, Fokko Driesprong wrote:
        Hey Robert,

        Thanks for raising this.

            snapshot-ID -1 isn't per-se invalid, because the valid
            values are not defined in the spec.


        For me, this is invalid, since there is no snapshot with -1
        in the snapshots property. In the tests with the PR, you can
        see that there are no snapshots
        
<https://github.com/apache/iceberg/pull/11560/files#diff-41bdfb6698d2aa7b47ff7d5fabc558a5a64f8b7496fe1bcd8f8ecb69b2afc128R112>.
        A year ago we had a similar discussion on PyIceberg
        <https://py.iceberg.apache.org/configuration/#backward-compatibility>
        around this and this ended up in adding a flag to fall back
        to the old behavior. I do agree that we should have
        communicated this more clearly with the release.

        Kind regards,
        Fokko



        Op ma 17 feb 2025 om 12:25 schreef Robert Stupp
        <sn...@snazy.de>:

            Feels like https://github.com/apache/iceberg/pull/11560
            introduced a behavior change.

            snapshot-ID -1 isn't per-se invalid, because the valid
            values are not defined in the spec.

            Previous Iceberg-Java versions always produced -1 if
            there's no current snapshot - 1.8 produces `null` in
            that case. So there are now two _different_ values for
            "no current snapshot".

            Implementations that rely on the behavior of the
            "reference implementation" (Iceberg-Java) do now fail in
            case there's no current snapshot.

            On 13.02.25 10:09, Amogh Jahagirdar wrote:
            I'm pleased to announce the release of Apache Iceberg
            1.8.0!

            Apache Iceberg is an open table format for huge
            analytic datasets. Iceberg
            delivers high query performance for tables with tens of
            petabytes of data,
            along with atomic commits, concurrent writes, and
            SQL-compatible table
            evolution.

            This release can be downloaded from:
            
https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.8.0/apache-iceberg-1.8.0.tar.gz


            Release notes:
            https://iceberg.apache.org/releases/#180-release
            <https://iceberg.apache.org/releases/180-release>

            Java artifacts are available from Maven Central.

            Thanks to everyone for contributing!

-- Robert Stupp
            @snazy

-- Robert Stupp
        @snazy

--
Robert Stupp
@snazy

--
Robert Stupp
@snazy

Reply via email to