Re: [ANNOUNCE] Apache Iceberg release 1.8.0

Robert Stupp Tue, 18 Feb 2025 05:35:23 -0800

Correcting myself: schema/spec/sort seem to be always present - pleaseignore that part in my previous email. The valid values for those fieldshowever should be defined.


On 18.02.25 14:29, Robert Stupp wrote:

Reality is that Iceberg did write '-1' into current-snapshot-id (andother "non-exist" marker values for schema/spec/sort) instead ofomitting the field.

Side note: the table-spec says that these fields are optional, butnothing about whether it is nullable.

The spec should at least be amended to explicitly define the validvalues (-1 for current-snapshot-id for no snapshot is just a fact nowthat it's there). IMO that field being 'null' isn't defined in thespec, but the absence of the field is.


Proposal for the implementation:

* revert https://github.com/apache/iceberg/pull/11560 in 1.8.1 and on main

Proposal for the spec (for current-snapshot-id):

* Define the valid value ranges for the field
* Define that the absence of the field means "no current snapshot"
* Define that the value -1 of the field means "no current snapshot"
* Define that the value 'null' of the field means "no current snapshot"

* Define that new implementations must not write the field if there'sno current snapshot


Proposal for the spec (for schema/spec/sort):

* Define the valid value ranges for the fields

* Define the "schema/spec/sort not present" values (the fields areoptional for v1 but required for v2+v3).* OR Define that "schema/spec/sort must be absent" if there is nocurrent schema/spec/sort.


WDYT?


On 17.02.25 21:07, Russell Spitzer wrote:

It sounds like the argument here is that we should change the Specfor V1, V2, and V3 to mark current-snapshot-idas required. Then we should change all other implementations tofollow this new standard. I'm not sure thatis a good solution going forwards but I'm not sure of how we cansupport catalogs/engines that cannot handle a nullcorrectly in this situation otherwise. Perhaps we should source outto see if any other implementers worked off theassumption of a non-optional "current-snapshot-id" and if we get acritical mass we can try to make that change?Because of how wide that change would be, I think we would needpretty broad consensus to do so.

We could possibly also have a flag to allow the old behavior but thatalso feels wrong to me, we have often gone with a motto ofread "wrong" write "correct" for things like this in the past andcontinuing to write "wrong" is a disservice to anynew implementers . When we do have a contradiction between ourimplementation and the spec I think we have

to trust that implementers followed the spec and fix the core library.

Are there any other solutions here?

On Mon, Feb 17, 2025 at 11:45 AM Fokko Driesprong <fo...@apache.org>wrote:

Hey Robert,

The thing is, that -1 cannot "go away".

Yes, I agree, but that's also the case for null, as the field is
optional in the spec
<https://iceberg.apache.org/spec/?column-projection#table-metadata-fields>.
Therefore we support both in PyIceberg

<https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/table/metadata.py#L71-L77>,
Iceberg-Rust

<https://github.com/apache/iceberg-rust/blob/752d69041e0461989c48dd1ca79bcff577776f5d/crates/iceberg/src/spec/table_metadata.rs#L500>,
Iceberg-Java

<https://github.com/apache/iceberg/blob/bcbbd0344623ffea5b092e2de5debb0bc12892a1/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L458-L462>,
and Iceberg-Go

<https://github.com/apache/iceberg-go/blob/ada5480954d9b41d2f8eb4c765523614fad65e1a/table/metadata.go#L837-L841>.
On the write side, they all produce null instead of -1.
Therefore, I was surprised that it comes up now, and not earlier.

I'd prefer to keep the previous behavior - otherwise
implementations may fall back to 0, which is definitely wrong.

I'm not seeing why it would fall back to 0, and I agree, that's
wrong.

Would be better IMHO not to break existing implementations /
render existing setups incompatible with Iceberg 1.8.

In my opinion, if this had been caught in an RC, it would be open
for discussion, but that ship has sailed. Let's hear what others
think.

Kind regards,
Fokko

Op ma 17 feb 2025 om 18:16 schreef Robert Stupp <sn...@snazy.de>:

Hi Fokko,

sure, in general "absent" or "null" would be cleaner. But now
we have two representations for the same case - I suspect
most went with the "reference behavior".

The thing is, that -1 cannot "go away".

I'd prefer to keep the previous behavior - otherwise
implementations may fall back to 0, which is definitely
wrong. Would be better IMHO not to break existing
implementations / render existing setups incompatible with
Iceberg 1.8.

On 17.02.25 15:49, Fokko Driesprong wrote:

Hey Robert,

Thanks for raising this.

snapshot-ID -1 isn't per-se invalid, because the valid
values are not defined in the spec.

For me, this is invalid, since there is no snapshot with -1
in the snapshots property. In the tests with the PR, you can
see that there are no snapshots

<https://github.com/apache/iceberg/pull/11560/files#diff-41bdfb6698d2aa7b47ff7d5fabc558a5a64f8b7496fe1bcd8f8ecb69b2afc128R112>.
A year ago we had a similar discussion on PyIceberg
<https://py.iceberg.apache.org/configuration/#backward-compatibility>
around this and this ended up in adding a flag to fall back
to the old behavior. I do agree that we should have
communicated this more clearly with the release.

Kind regards,
Fokko

Op ma 17 feb 2025 om 12:25 schreef Robert Stupp
<sn...@snazy.de>:

Feels like https://github.com/apache/iceberg/pull/11560
introduced a behavior change.

snapshot-ID -1 isn't per-se invalid, because the valid
values are not defined in the spec.

Previous Iceberg-Java versions always produced -1 if
there's no current snapshot - 1.8 produces `null` in
that case. So there are now two _different_ values for
"no current snapshot".

Implementations that rely on the behavior of the
"reference implementation" (Iceberg-Java) do now fail in
case there's no current snapshot.

On 13.02.25 10:09, Amogh Jahagirdar wrote:

            I'm pleased to announce the release of Apache Iceberg
            1.8.0!

            Apache Iceberg is an open table format for huge
            analytic datasets. Iceberg
            delivers high query performance for tables with tens of
            petabytes of data,
            along with atomic commits, concurrent writes, and
            SQL-compatible table
            evolution.

            This release can be downloaded from:
            
https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.8.0/apache-iceberg-1.8.0.tar.gz


            Release notes:
            https://iceberg.apache.org/releases/#180-release
            <https://iceberg.apache.org/releases/180-release>

            Java artifacts are available from Maven Central.

            Thanks to everyone for contributing!

--Robert Stupp

            @snazy

--Robert Stupp

        @snazy

--
Robert Stupp
@snazy


--
Robert Stupp
@snazy

Re: [ANNOUNCE] Apache Iceberg release 1.8.0

Reply via email to