Correcting myself: schema/spec/sort seem to be always present - please
ignore that part in my previous email. The valid values for those fields
however should be defined.
On 18.02.25 14:29, Robert Stupp wrote:
Reality is that Iceberg did write '-1' into current-snapshot-id (and
other "non-exist" marker values for schema/spec/sort) instead of
omitting the field.
Side note: the table-spec says that these fields are optional, but
nothing about whether it is nullable.
The spec should at least be amended to explicitly define the valid
values (-1 for current-snapshot-id for no snapshot is just a fact now
that it's there). IMO that field being 'null' isn't defined in the
spec, but the absence of the field is.
Proposal for the implementation:
* revert https://github.com/apache/iceberg/pull/11560 in 1.8.1 and on main
Proposal for the spec (for current-snapshot-id):
* Define the valid value ranges for the field
* Define that the absence of the field means "no current snapshot"
* Define that the value -1 of the field means "no current snapshot"
* Define that the value 'null' of the field means "no current snapshot"
* Define that new implementations must not write the field if there's
no current snapshot
Proposal for the spec (for schema/spec/sort):
* Define the valid value ranges for the fields
* Define the "schema/spec/sort not present" values (the fields are
optional for v1 but required for v2+v3).
* OR Define that "schema/spec/sort must be absent" if there is no
current schema/spec/sort.
WDYT?
On 17.02.25 21:07, Russell Spitzer wrote:
It sounds like the argument here is that we should change the Spec
for V1, V2, and V3 to mark current-snapshot-id
as required. Then we should change all other implementations to
follow this new standard. I'm not sure that
is a good solution going forwards but I'm not sure of how we can
support catalogs/engines that cannot handle a null
correctly in this situation otherwise. Perhaps we should source out
to see if any other implementers worked off the
assumption of a non-optional "current-snapshot-id" and if we get a
critical mass we can try to make that change?
Because of how wide that change would be, I think we would need
pretty broad consensus to do so.
We could possibly also have a flag to allow the old behavior but that
also feels wrong to me, we have often gone with a motto of
read "wrong" write "correct" for things like this in the past and
continuing to write "wrong" is a disservice to any
new implementers . When we do have a contradiction between our
implementation and the spec I think we have
to trust that implementers followed the spec and fix the core library.
Are there any other solutions here?
On Mon, Feb 17, 2025 at 11:45 AM Fokko Driesprong <fo...@apache.org>
wrote:
Hey Robert,
The thing is, that -1 cannot "go away".
Yes, I agree, but that's also the case for null, as the field is
optional in the spec
<https://iceberg.apache.org/spec/?column-projection#table-metadata-fields>.
Therefore we support both in PyIceberg
<https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/table/metadata.py#L71-L77>,
Iceberg-Rust
<https://github.com/apache/iceberg-rust/blob/752d69041e0461989c48dd1ca79bcff577776f5d/crates/iceberg/src/spec/table_metadata.rs#L500>,
Iceberg-Java
<https://github.com/apache/iceberg/blob/bcbbd0344623ffea5b092e2de5debb0bc12892a1/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L458-L462>,
and Iceberg-Go
<https://github.com/apache/iceberg-go/blob/ada5480954d9b41d2f8eb4c765523614fad65e1a/table/metadata.go#L837-L841>.
On the write side, they all produce null instead of -1.
Therefore, I was surprised that it comes up now, and not earlier.
I'd prefer to keep the previous behavior - otherwise
implementations may fall back to 0, which is definitely wrong.
I'm not seeing why it would fall back to 0, and I agree, that's
wrong.
Would be better IMHO not to break existing implementations /
render existing setups incompatible with Iceberg 1.8.
In my opinion, if this had been caught in an RC, it would be open
for discussion, but that ship has sailed. Let's hear what others
think.
Kind regards,
Fokko
Op ma 17 feb 2025 om 18:16 schreef Robert Stupp <sn...@snazy.de>:
Hi Fokko,
sure, in general "absent" or "null" would be cleaner. But now
we have two representations for the same case - I suspect
most went with the "reference behavior".
The thing is, that -1 cannot "go away".
I'd prefer to keep the previous behavior - otherwise
implementations may fall back to 0, which is definitely
wrong. Would be better IMHO not to break existing
implementations / render existing setups incompatible with
Iceberg 1.8.
On 17.02.25 15:49, Fokko Driesprong wrote:
Hey Robert,
Thanks for raising this.
snapshot-ID -1 isn't per-se invalid, because the valid
values are not defined in the spec.
For me, this is invalid, since there is no snapshot with -1
in the snapshots property. In the tests with the PR, you can
see that there are no snapshots
<https://github.com/apache/iceberg/pull/11560/files#diff-41bdfb6698d2aa7b47ff7d5fabc558a5a64f8b7496fe1bcd8f8ecb69b2afc128R112>.
A year ago we had a similar discussion on PyIceberg
<https://py.iceberg.apache.org/configuration/#backward-compatibility>
around this and this ended up in adding a flag to fall back
to the old behavior. I do agree that we should have
communicated this more clearly with the release.
Kind regards,
Fokko
Op ma 17 feb 2025 om 12:25 schreef Robert Stupp
<sn...@snazy.de>:
Feels like https://github.com/apache/iceberg/pull/11560
introduced a behavior change.
snapshot-ID -1 isn't per-se invalid, because the valid
values are not defined in the spec.
Previous Iceberg-Java versions always produced -1 if
there's no current snapshot - 1.8 produces `null` in
that case. So there are now two _different_ values for
"no current snapshot".
Implementations that rely on the behavior of the
"reference implementation" (Iceberg-Java) do now fail in
case there's no current snapshot.
On 13.02.25 10:09, Amogh Jahagirdar wrote:
I'm pleased to announce the release of Apache Iceberg
1.8.0!
Apache Iceberg is an open table format for huge
analytic datasets. Iceberg
delivers high query performance for tables with tens of
petabytes of data,
along with atomic commits, concurrent writes, and
SQL-compatible table
evolution.
This release can be downloaded from:
https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.8.0/apache-iceberg-1.8.0.tar.gz
Release notes:
https://iceberg.apache.org/releases/#180-release
<https://iceberg.apache.org/releases/180-release>
Java artifacts are available from Maven Central.
Thanks to everyone for contributing!
--
Robert Stupp
@snazy
--
Robert Stupp
@snazy
--
Robert Stupp
@snazy
--
Robert Stupp
@snazy