On 18.02.25 17:10, Fokko Driesprong wrote:
Reality is that Iceberg did write '-1' into current-snapshot-id
(and other "non-exist" marker values for schema/spec/sort) instead
of omitting the field.
Yes, but this is wrong. The spec dictates
<https://iceberg.apache.org/spec/#table-metadata-fields> under
current-snapshot-id:
long ID of the current table snapshot; must be the same as the
current ID of the main branch in refs.
True, it's wrong. (Or: Iceberg violated its own spec ;) )
In the Java 1.7.1 code, we don't see this entry in refs
<https://github.com/apache/iceberg/blob/4a432839233f2343a9eae8255532f911f06358ef/core/src/test/java/org/apache/iceberg/rest/responses/TestLoadTableResponseParser.java#L110-L111>.
My point is, we need to exclude -1 from that as well. My concern is
that this is a slippery slope, and start encoding all kinds of
historical bugs into the spec.
Quite slippery, yes. But the -1 values are out there in the wild. The
(special) meaning of that value needs to be in the spec IMHO.
I don't think there is any reason to change the behavior of
writing null / missing unless that's really a point of
confusion for folks.
This is laid out in the reader/writer requirements
<https://iceberg.apache.org/spec/#writer-requirements>, and should be
updated if it is confusing.
Is there a JSON library folks are using which distinguishes
between missing/null?
Yes, pydantic allows you to control this. If you define a field, and
it is missing, it will throw an error when you don't have a default
value defined. The Java TableMetadataV2ValidMinimalTest
<https://github.com/apache/iceberg/blob/main/core/src/test/resources/TableMetadataV2ValidMinimal.json>
omits current-snapshot-id.
Jackson as well BTW.
How about reverting #11560
<https://github.com/apache/iceberg/pull/11560> for 1.8.1, and then
reinstating this for 2.0.0?
SGTM!
Kind regards,
Fokko
Op di 18 feb 2025 om 16:55 schreef Russell Spitzer
<russell.spit...@gmail.com>:
The only thing I think I agree with is defining that -1 has meant
"no current snapshot" in the past and
is equivalent to a missing/nuil (if we have to specify that) .
I don't think there is any reason to change the behavior of
writing null / missing unless
that's really a point of confusion for folks. Is there a JSON
library folks are using
which distinguishes between missing/null?
Would having field:null be the same as missing have avoided the
issue because it seems like libraries
that only handled non-null would also not handle "missing"
current-snapshot-id as well?
I'd really like to hear from other implementers here since
changing all this again would be a lot of
work and I thought we had understanding of Optional == Nullable as
a valid thing in the spec.
On Tue, Feb 18, 2025 at 7:35 AM Robert Stupp <sn...@snazy.de> wrote:
Correcting myself: schema/spec/sort seem to be always present
- please ignore that part in my previous email. The valid
values for those fields however should be defined.
On 18.02.25 14:29, Robert Stupp wrote:
Reality is that Iceberg did write '-1' into
current-snapshot-id (and other "non-exist" marker values for
schema/spec/sort) instead of omitting the field.
Side note: the table-spec says that these fields are
optional, but nothing about whether it is nullable.
The spec should at least be amended to explicitly define the
valid values (-1 for current-snapshot-id for no snapshot is
just a fact now that it's there). IMO that field being 'null'
isn't defined in the spec, but the absence of the field is.
Proposal for the implementation:
* revert https://github.com/apache/iceberg/pull/11560 in
1.8.1 and on main
Proposal for the spec (for current-snapshot-id):
* Define the valid value ranges for the field
* Define that the absence of the field means "no current
snapshot"
* Define that the value -1 of the field means "no current
snapshot"
* Define that the value 'null' of the field means "no current
snapshot"
* Define that new implementations must not write the field if
there's no current snapshot
Proposal for the spec (for schema/spec/sort):
* Define the valid value ranges for the fields
* Define the "schema/spec/sort not present" values (the
fields are optional for v1 but required for v2+v3).
* OR Define that "schema/spec/sort must be absent" if there
is no current schema/spec/sort.
WDYT?
On 17.02.25 21:07, Russell Spitzer wrote:
It sounds like the argument here is that we should change
the Spec for V1, V2, and V3 to mark current-snapshot-id
as required. Then we should change all other implementations
to follow this new standard. I'm not sure that
is a good solution going forwards but I'm not sure of how we
can support catalogs/engines that cannot handle a null
correctly in this situation otherwise. Perhaps we should
source out to see if any other implementers worked off the
assumption of a non-optional "current-snapshot-id" and if we
get a critical mass we can try to make that change?
Because of how wide that change would be, I think we would
need pretty broad consensus to do so.
We could possibly also have a flag to allow the old behavior
but that also feels wrong to me, we have often gone with a
motto of
read "wrong" write "correct" for things like this in the
past and continuing to write "wrong" is a disservice to any
new implementers . When we do have a contradiction between
our implementation and the spec I think we have
to trust that implementers followed the spec and fix the
core library.
Are there any other solutions here?
On Mon, Feb 17, 2025 at 11:45 AM Fokko Driesprong
<fo...@apache.org> wrote:
Hey Robert,
The thing is, that -1 cannot "go away".
Yes, I agree, but that's also the case for null, as the
field is optional in the spec
<https://iceberg.apache.org/spec/?column-projection#table-metadata-fields>.
Therefore we support both in PyIceberg
<https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/table/metadata.py#L71-L77>,
Iceberg-Rust
<https://github.com/apache/iceberg-rust/blob/752d69041e0461989c48dd1ca79bcff577776f5d/crates/iceberg/src/spec/table_metadata.rs#L500>,
Iceberg-Java
<https://github.com/apache/iceberg/blob/bcbbd0344623ffea5b092e2de5debb0bc12892a1/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L458-L462>,
and Iceberg-Go
<https://github.com/apache/iceberg-go/blob/ada5480954d9b41d2f8eb4c765523614fad65e1a/table/metadata.go#L837-L841>.
On the write side, they all produce null instead of -1.
Therefore, I was surprised that it comes up now, and not
earlier.
I'd prefer to keep the previous behavior - otherwise
implementations may fall back to 0, which is
definitely wrong.
I'm not seeing why it would fall back to 0, and I agree,
that's wrong.
Would be better IMHO not to break existing
implementations / render existing setups
incompatible with Iceberg 1.8.
In my opinion, if this had been caught in an RC, it
would be open for discussion, but that ship has sailed.
Let's hear what others think.
Kind regards,
Fokko
Op ma 17 feb 2025 om 18:16 schreef Robert Stupp
<sn...@snazy.de>:
Hi Fokko,
sure, in general "absent" or "null" would be
cleaner. But now we have two representations for the
same case - I suspect most went with the "reference
behavior".
The thing is, that -1 cannot "go away".
I'd prefer to keep the previous behavior - otherwise
implementations may fall back to 0, which is
definitely wrong. Would be better IMHO not to break
existing implementations / render existing setups
incompatible with Iceberg 1.8.
On 17.02.25 15:49, Fokko Driesprong wrote:
Hey Robert,
Thanks for raising this.
snapshot-ID -1 isn't per-se invalid, because
the valid values are not defined in the spec.
For me, this is invalid, since there is no snapshot
with -1 in the snapshots property. In the tests
with the PR, you can see that there are no
snapshots
<https://github.com/apache/iceberg/pull/11560/files#diff-41bdfb6698d2aa7b47ff7d5fabc558a5a64f8b7496fe1bcd8f8ecb69b2afc128R112>.
A year ago we had a similar discussion on PyIceberg
<https://py.iceberg.apache.org/configuration/#backward-compatibility>
around this and this ended up in adding a flag to
fall back to the old behavior. I do agree that we
should have communicated this more clearly with
the release.
Kind regards,
Fokko
Op ma 17 feb 2025 om 12:25 schreef Robert Stupp
<sn...@snazy.de>:
Feels like
https://github.com/apache/iceberg/pull/11560
introduced a behavior change.
snapshot-ID -1 isn't per-se invalid, because
the valid values are not defined in the spec.
Previous Iceberg-Java versions always produced
-1 if there's no current snapshot - 1.8
produces `null` in that case. So there are now
two _different_ values for "no current snapshot".
Implementations that rely on the behavior of
the "reference implementation" (Iceberg-Java)
do now fail in case there's no current snapshot.
On 13.02.25 10:09, Amogh Jahagirdar wrote:
I'm pleased to announce the release of Apache
Iceberg 1.8.0!
Apache Iceberg is an open table format for
huge analytic datasets. Iceberg
delivers high query performance for tables
with tens of petabytes of data,
along with atomic commits, concurrent writes,
and SQL-compatible table
evolution.
This release can be downloaded from:
https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.8.0/apache-iceberg-1.8.0.tar.gz
Release notes:
https://iceberg.apache.org/releases/#180-release
<https://iceberg.apache.org/releases/180-release>
Java artifacts are available from Maven Central.
Thanks to everyone for contributing!
--
Robert Stupp
@snazy
--
Robert Stupp
@snazy
--
Robert Stupp
@snazy
--
Robert Stupp
@snazy
--
Robert Stupp
@snazy