Re: [ANNOUNCE] Apache Iceberg release 1.8.0

Robert Stupp Tue, 18 Feb 2025 08:53:47 -0800


On 18.02.25 17:10, Fokko Driesprong wrote:


    Reality is that Iceberg did write '-1' into current-snapshot-id
    (and other "non-exist" marker values for schema/spec/sort) instead
    of omitting the field.

Yes, but this is wrong. The spec dictates<https://iceberg.apache.org/spec/#table-metadata-fields> undercurrent-snapshot-id:


    long ID of the current table snapshot; must be the same as the
    current ID of the main branch in refs.

True, it's wrong. (Or: Iceberg violated its own spec ;) )

In the Java 1.7.1 code, we don't see this entry in refs<https://github.com/apache/iceberg/blob/4a432839233f2343a9eae8255532f911f06358ef/core/src/test/java/org/apache/iceberg/rest/responses/TestLoadTableResponseParser.java#L110-L111>.My point is, we need to exclude -1 from that as well. My concern isthat this is a slippery slope, and start encoding all kinds ofhistorical bugs into the spec.

Quite slippery, yes. But the -1 values are out there in the wild. The(special) meaning of that value needs to be in the spec IMHO.

    I don't think there is any reason to change the behavior of
    writing null / missing unless that's really a point of
    confusion for folks.
This is laid out in the reader/writer requirements<https://iceberg.apache.org/spec/#writer-requirements>, and should beupdated if it is confusing.
    Is there a JSON library folks are using which distinguishes
    between missing/null?
Yes, pydantic allows you to control this. If you define a field, andit is missing, it will throw an error when you don't have a defaultvalue defined. The Java TableMetadataV2ValidMinimalTest<https://github.com/apache/iceberg/blob/main/core/src/test/resources/TableMetadataV2ValidMinimal.json>omits current-snapshot-id.


Jackson as well BTW.

How about reverting #11560<https://github.com/apache/iceberg/pull/11560> for 1.8.1, and thenreinstating this for 2.0.0?


SGTM!


Kind regards,
Fokko

Op di 18 feb 2025 om 16:55 schreef Russell Spitzer<russell.spit...@gmail.com>:


    The only thing I think I agree with is defining that -1 has meant
    "no current snapshot" in the past and
    is equivalent to a missing/nuil (if we have to specify that) .

    I don't think there is any reason to change the behavior of
    writing null / missing unless
    that's really a point of confusion for folks. Is there a JSON
    library folks are using
     which distinguishes between missing/null?

    Would having field:null be the same as missing have avoided the
    issue because it seems like libraries
    that only handled non-null would also not handle "missing"
    current-snapshot-id as well?

    I'd really like to hear from other implementers here since
    changing all this again would be a lot of
    work and I thought we had understanding of Optional == Nullable as
    a valid thing in the spec.

    On Tue, Feb 18, 2025 at 7:35 AM Robert Stupp <sn...@snazy.de> wrote:

        Correcting myself: schema/spec/sort seem to be always present
        - please ignore that part in my previous email. The valid
        values for those fields however should be defined.

        On 18.02.25 14:29, Robert Stupp wrote:


        Reality is that Iceberg did write '-1' into
        current-snapshot-id (and other "non-exist" marker values for
        schema/spec/sort) instead of omitting the field.

        Side note: the table-spec says that these fields are
        optional, but nothing about whether it is nullable.

        The spec should at least be amended to explicitly define the
        valid values (-1 for current-snapshot-id for no snapshot is
        just a fact now that it's there). IMO that field being 'null'
        isn't defined in the spec, but the absence of the field is.

        Proposal for the implementation:

        * revert https://github.com/apache/iceberg/pull/11560 in
        1.8.1 and on main

        Proposal for the spec (for current-snapshot-id):

        * Define the valid value ranges for the field
        * Define that the absence of the field means "no current
        snapshot"
        * Define that the value -1 of the field means "no current
        snapshot"
        * Define that the value 'null' of the field means "no current
        snapshot"
        * Define that new implementations must not write the field if
        there's no current snapshot

        Proposal for the spec (for schema/spec/sort):

        * Define the valid value ranges for the fields
        * Define the "schema/spec/sort not present" values (the
        fields are optional for v1 but required for v2+v3).
        * OR Define that "schema/spec/sort must be absent" if there
        is no current schema/spec/sort.

        WDYT?


        On 17.02.25 21:07, Russell Spitzer wrote:

It sounds like the argument here is that we should change
the Spec for V1, V2, and V3 to mark current-snapshot-id
as required. Then we should change all other implementations
to follow this new standard. I'm not sure that
is a good solution going forwards but I'm not sure of how we
can support catalogs/engines that cannot handle a null
correctly in this situation otherwise. Perhaps we should
source out to see if any other implementers worked off the
assumption of a non-optional "current-snapshot-id" and if we
get a critical mass we can try to make that change?
Because of how wide that change would be, I think we would
need pretty broad consensus to do so.

We could possibly also have a flag to allow the old behavior
but that also feels wrong to me, we have often gone with a
motto of
read "wrong" write "correct" for things like this in the
past and continuing to write "wrong" is a disservice to any
new implementers . When we do have a contradiction between
our implementation and the spec I think we have
to trust that implementers followed the spec and fix the
core library.

Are there any other solutions here?

On Mon, Feb 17, 2025 at 11:45 AM Fokko Driesprong
<fo...@apache.org> wrote:

Hey Robert,

The thing is, that -1 cannot "go away".

Yes, I agree, but that's also the case for null, as the
field is optional in the spec

<https://iceberg.apache.org/spec/?column-projection#table-metadata-fields>.
Therefore we support both in PyIceberg

<https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/table/metadata.py#L71-L77>,
Iceberg-Rust

<https://github.com/apache/iceberg-rust/blob/752d69041e0461989c48dd1ca79bcff577776f5d/crates/iceberg/src/spec/table_metadata.rs#L500>,
Iceberg-Java

<https://github.com/apache/iceberg/blob/bcbbd0344623ffea5b092e2de5debb0bc12892a1/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L458-L462>,
and Iceberg-Go

<https://github.com/apache/iceberg-go/blob/ada5480954d9b41d2f8eb4c765523614fad65e1a/table/metadata.go#L837-L841>.
On the write side, they all produce null instead of -1.
Therefore, I was surprised that it comes up now, and not
earlier.

I'd prefer to keep the previous behavior - otherwise
implementations may fall back to 0, which is
definitely wrong.

I'm not seeing why it would fall back to 0, and I agree,
that's wrong.

Would be better IMHO not to break existing
implementations / render existing setups
incompatible with Iceberg 1.8.

In my opinion, if this had been caught in an RC, it
would be open for discussion, but that ship has sailed.
Let's hear what others think.

Kind regards,
Fokko

Op ma 17 feb 2025 om 18:16 schreef Robert Stupp
<sn...@snazy.de>:

Hi Fokko,

sure, in general "absent" or "null" would be
cleaner. But now we have two representations for the
same case - I suspect most went with the "reference
behavior".

The thing is, that -1 cannot "go away".

I'd prefer to keep the previous behavior - otherwise
implementations may fall back to 0, which is
definitely wrong. Would be better IMHO not to break
existing implementations / render existing setups
incompatible with Iceberg 1.8.

On 17.02.25 15:49, Fokko Driesprong wrote:

Hey Robert,

Thanks for raising this.

snapshot-ID -1 isn't per-se invalid, because
the valid values are not defined in the spec.

For me, this is invalid, since there is no snapshot
with -1 in the snapshots property. In the tests
with the PR, you can see that there are no
snapshots

<https://github.com/apache/iceberg/pull/11560/files#diff-41bdfb6698d2aa7b47ff7d5fabc558a5a64f8b7496fe1bcd8f8ecb69b2afc128R112>.
A year ago we had a similar discussion on PyIceberg

<https://py.iceberg.apache.org/configuration/#backward-compatibility>
around this and this ended up in adding a flag to
fall back to the old behavior. I do agree that we
should have communicated this more clearly with
the release.

Kind regards,
Fokko

Op ma 17 feb 2025 om 12:25 schreef Robert Stupp
<sn...@snazy.de>:

Feels like
https://github.com/apache/iceberg/pull/11560
introduced a behavior change.

snapshot-ID -1 isn't per-se invalid, because
the valid values are not defined in the spec.

Previous Iceberg-Java versions always produced
-1 if there's no current snapshot - 1.8
produces `null` in that case. So there are now
two _different_ values for "no current snapshot".

Implementations that rely on the behavior of
the "reference implementation" (Iceberg-Java)
do now fail in case there's no current snapshot.

On 13.02.25 10:09, Amogh Jahagirdar wrote:

                    I'm pleased to announce the release of Apache
                    Iceberg 1.8.0!

                    Apache Iceberg is an open table format for
                    huge analytic datasets. Iceberg
                    delivers high query performance for tables
                    with tens of petabytes of data,
                    along with atomic commits, concurrent writes,
                    and SQL-compatible table
                    evolution.

                    This release can be downloaded from:
                    
https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.8.0/apache-iceberg-1.8.0.tar.gz


                    Release notes:
                    https://iceberg.apache.org/releases/#180-release
                    <https://iceberg.apache.org/releases/180-release>

                    Java artifacts are available from Maven Central.

                    Thanks to everyone for contributing!

--Robert Stupp

                    @snazy

--Robert Stupp

                @snazy

--Robert Stupp

        @snazy

--Robert Stupp

        @snazy

--
Robert Stupp
@snazy

Re: [ANNOUNCE] Apache Iceberg release 1.8.0

Reply via email to