I had a couple of small fixes that would be great to get into 1.8.1: https://github.com/apache/iceberg/pull/12305 https://github.com/apache/iceberg/pull/12224
I added those to the GitHub 1.8.1 milestone in case this is possible. Thanks, Bryan > On Feb 18, 2025, at 8:53 AM, Robert Stupp <sn...@snazy.de> wrote: > > > On 18.02.25 17:10, Fokko Driesprong wrote: >>> Reality is that Iceberg did write '-1' into current-snapshot-id (and other >>> "non-exist" marker values for schema/spec/sort) instead of omitting the >>> field. >> >> Yes, but this is wrong. The spec dictates >> <https://iceberg.apache.org/spec/#table-metadata-fields> under >> current-snapshot-id: >> >>> long ID of the current table snapshot; must be the same as the current ID >>> of the main branch in refs. > True, it's wrong. (Or: Iceberg violated its own spec ;) ) > >> In the Java 1.7.1 code, we don't see this entry in refs >> <https://github.com/apache/iceberg/blob/4a432839233f2343a9eae8255532f911f06358ef/core/src/test/java/org/apache/iceberg/rest/responses/TestLoadTableResponseParser.java#L110-L111>. >> My point is, we need to exclude -1 from that as well. My concern is that >> this is a slippery slope, and start encoding all kinds of historical bugs >> into the spec. > Quite slippery, yes. But the -1 values are out there in the wild. The > (special) meaning of that value needs to be in the spec IMHO. > >>> I don't think there is any reason to change the behavior of writing null / >>> missing unless that's really a point of confusion for folks. >> >> This is laid out in the reader/writer requirements >> <https://iceberg.apache.org/spec/#writer-requirements>, and should be >> updated if it is confusing. >> >>> Is there a JSON library folks are using which distinguishes between >>> missing/null? >> >> Yes, pydantic allows you to control this. If you define a field, and it is >> missing, it will throw an error when you don't have a default value defined. >> The Java TableMetadataV2ValidMinimalTest >> <https://github.com/apache/iceberg/blob/main/core/src/test/resources/TableMetadataV2ValidMinimal.json> >> omits current-snapshot-id. > Jackson as well BTW. > > > >> How about reverting #11560 <https://github.com/apache/iceberg/pull/11560> >> for 1.8.1, and then reinstating this for 2.0.0? > SGTM! > > > >> >> Kind regards, >> Fokko >> >> >> >> >> >> Op di 18 feb 2025 om 16:55 schreef Russell Spitzer >> <russell.spit...@gmail.com <mailto:russell.spit...@gmail.com>>: >>> The only thing I think I agree with is defining that -1 has meant "no >>> current snapshot" in the past and >>> is equivalent to a missing/nuil (if we have to specify that) . >>> >>> I don't think there is any reason to change the behavior of writing null / >>> missing unless >>> that's really a point of confusion for folks. Is there a JSON library folks >>> are using >>> which distinguishes between missing/null? >>> >>> Would having field:null be the same as missing have avoided the issue >>> because it seems like libraries >>> that only handled non-null would also not handle "missing" >>> current-snapshot-id as well? >>> >>> I'd really like to hear from other implementers here since changing all >>> this again would be a lot of >>> work and I thought we had understanding of Optional == Nullable as a valid >>> thing in the spec. >>> >>> On Tue, Feb 18, 2025 at 7:35 AM Robert Stupp <sn...@snazy.de >>> <mailto:sn...@snazy.de>> wrote: >>>> Correcting myself: schema/spec/sort seem to be always present - please >>>> ignore that part in my previous email. The valid values for those fields >>>> however should be defined. >>>> >>>> On 18.02.25 14:29, Robert Stupp wrote: >>>>> Reality is that Iceberg did write '-1' into current-snapshot-id (and >>>>> other "non-exist" marker values for schema/spec/sort) instead of omitting >>>>> the field. >>>>> >>>>> Side note: the table-spec says that these fields are optional, but >>>>> nothing about whether it is nullable. >>>>> >>>>> The spec should at least be amended to explicitly define the valid values >>>>> (-1 for current-snapshot-id for no snapshot is just a fact now that it's >>>>> there). IMO that field being 'null' isn't defined in the spec, but the >>>>> absence of the field is. >>>>> >>>>> Proposal for the implementation: >>>>> >>>>> * revert https://github.com/apache/iceberg/pull/11560 in 1.8.1 and on main >>>>> >>>>> Proposal for the spec (for current-snapshot-id): >>>>> >>>>> * Define the valid value ranges for the field >>>>> * Define that the absence of the field means "no current snapshot" >>>>> * Define that the value -1 of the field means "no current snapshot" >>>>> * Define that the value 'null' of the field means "no current snapshot" >>>>> * Define that new implementations must not write the field if there's no >>>>> current snapshot >>>>> >>>>> Proposal for the spec (for schema/spec/sort): >>>>> * Define the valid value ranges for the fields >>>>> * Define the "schema/spec/sort not present" values (the fields are >>>>> optional for v1 but required for v2+v3). >>>>> * OR Define that "schema/spec/sort must be absent" if there is no current >>>>> schema/spec/sort. >>>>> >>>>> WDYT? >>>>> >>>>> >>>>> On 17.02.25 21:07, Russell Spitzer wrote: >>>>>> It sounds like the argument here is that we should change the Spec for >>>>>> V1, V2, and V3 to mark current-snapshot-id >>>>>> as required. Then we should change all other implementations to follow >>>>>> this new standard. I'm not sure that >>>>>> is a good solution going forwards but I'm not sure of how we can support >>>>>> catalogs/engines that cannot handle a null >>>>>> correctly in this situation otherwise. Perhaps we should source out to >>>>>> see if any other implementers worked off the >>>>>> assumption of a non-optional "current-snapshot-id" and if we get a >>>>>> critical mass we can try to make that change? >>>>>> Because of how wide that change would be, I think we would need pretty >>>>>> broad consensus to do so. >>>>>> >>>>>> We could possibly also have a flag to allow the old behavior but that >>>>>> also feels wrong to me, we have often gone with a motto of >>>>>> read "wrong" write "correct" for things like this in the past and >>>>>> continuing to write "wrong" is a disservice to any >>>>>> new implementers . When we do have a contradiction between our >>>>>> implementation and the spec I think we have >>>>>> to trust that implementers followed the spec and fix the core library. >>>>>> >>>>>> Are there any other solutions here? >>>>>> >>>>>> On Mon, Feb 17, 2025 at 11:45 AM Fokko Driesprong <fo...@apache.org >>>>>> <mailto:fo...@apache.org>> wrote: >>>>>>> Hey Robert, >>>>>>> >>>>>>>> The thing is, that -1 cannot "go away". >>>>>>> >>>>>>> Yes, I agree, but that's also the case for null, as the field is >>>>>>> optional in the spec >>>>>>> <https://iceberg.apache.org/spec/?column-projection#table-metadata-fields>. >>>>>>> Therefore we support both in PyIceberg >>>>>>> <https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/table/metadata.py#L71-L77>, >>>>>>> Iceberg-Rust >>>>>>> <https://github.com/apache/iceberg-rust/blob/752d69041e0461989c48dd1ca79bcff577776f5d/crates/iceberg/src/spec/table_metadata.rs#L500>, >>>>>>> Iceberg-Java >>>>>>> <https://github.com/apache/iceberg/blob/bcbbd0344623ffea5b092e2de5debb0bc12892a1/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L458-L462>, >>>>>>> and Iceberg-Go >>>>>>> <https://github.com/apache/iceberg-go/blob/ada5480954d9b41d2f8eb4c765523614fad65e1a/table/metadata.go#L837-L841>. >>>>>>> On the write side, they all produce nullinstead of -1. Therefore, I >>>>>>> was surprised that it comes up now, and not earlier. >>>>>>> >>>>>>>> I'd prefer to keep the previous behavior - otherwise implementations >>>>>>>> may fall back to 0, which is definitely wrong. >>>>>>> >>>>>>> I'm not seeing why it would fall back to 0, and I agree, that's wrong. >>>>>>> >>>>>>>> Would be better IMHO not to break existing implementations / render >>>>>>>> existing setups incompatible with Iceberg 1.8. >>>>>>> >>>>>>> In my opinion, if this had been caught in an RC, it would be open for >>>>>>> discussion, but that ship has sailed. Let's hear what others think. >>>>>>> >>>>>>> Kind regards, >>>>>>> Fokko >>>>>>> >>>>>>> Op ma 17 feb 2025 om 18:16 schreef Robert Stupp <sn...@snazy.de >>>>>>> <mailto:sn...@snazy.de>>: >>>>>>>> Hi Fokko, >>>>>>>> >>>>>>>> sure, in general "absent" or "null" would be cleaner. But now we have >>>>>>>> two representations for the same case - I suspect most went with the >>>>>>>> "reference behavior". >>>>>>>> >>>>>>>> The thing is, that -1 cannot "go away". >>>>>>>> >>>>>>>> I'd prefer to keep the previous behavior - otherwise implementations >>>>>>>> may fall back to 0, which is definitely wrong. Would be better IMHO >>>>>>>> not to break existing implementations / render existing setups >>>>>>>> incompatible with Iceberg 1.8. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 17.02.25 15:49, Fokko Driesprong wrote: >>>>>>>>> Hey Robert, >>>>>>>>> >>>>>>>>> Thanks for raising this. >>>>>>>>> >>>>>>>>>> snapshot-ID -1 isn't per-se invalid, because the valid values are >>>>>>>>>> not defined in the spec. >>>>>>>>> >>>>>>>>> For me, this is invalid, since there is no snapshot with -1 in the >>>>>>>>> snapshots property. In the tests with the PR, you can see that there >>>>>>>>> are no snapshots >>>>>>>>> <https://github.com/apache/iceberg/pull/11560/files#diff-41bdfb6698d2aa7b47ff7d5fabc558a5a64f8b7496fe1bcd8f8ecb69b2afc128R112>. >>>>>>>>> A year ago we had a similar discussion on PyIceberg >>>>>>>>> <https://py.iceberg.apache.org/configuration/#backward-compatibility> >>>>>>>>> around this and this ended up in adding a flag to fall back to the >>>>>>>>> old behavior. I do agree that we should have communicated this more >>>>>>>>> clearly with the release. >>>>>>>>> >>>>>>>>> Kind regards, >>>>>>>>> Fokko >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Op ma 17 feb 2025 om 12:25 schreef Robert Stupp <sn...@snazy.de >>>>>>>>> <mailto:sn...@snazy.de>>: >>>>>>>>>> Feels like https://github.com/apache/iceberg/pull/11560 introduced a >>>>>>>>>> behavior change. >>>>>>>>>> >>>>>>>>>> snapshot-ID -1 isn't per-se invalid, because the valid values are >>>>>>>>>> not defined in the spec. >>>>>>>>>> >>>>>>>>>> Previous Iceberg-Java versions always produced -1 if there's no >>>>>>>>>> current snapshot - 1.8 produces `null` in that case. So there are >>>>>>>>>> now two _different_ values for "no current snapshot". >>>>>>>>>> >>>>>>>>>> Implementations that rely on the behavior of the "reference >>>>>>>>>> implementation" (Iceberg-Java) do now fail in case there's no >>>>>>>>>> current snapshot. >>>>>>>>>> >>>>>>>>>> On 13.02.25 10:09, Amogh Jahagirdar wrote: >>>>>>>>>>> I'm pleased to announce the release of Apache Iceberg 1.8.0! >>>>>>>>>>> >>>>>>>>>>> Apache Iceberg is an open table format for huge analytic datasets. >>>>>>>>>>> Iceberg >>>>>>>>>>> delivers high query performance for tables with tens of petabytes >>>>>>>>>>> of data, >>>>>>>>>>> along with atomic commits, concurrent writes, and SQL-compatible >>>>>>>>>>> table >>>>>>>>>>> evolution. >>>>>>>>>>> >>>>>>>>>>> This release can be downloaded from: >>>>>>>>>>> https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.8.0/apache-iceberg-1.8.0.tar.gz >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Release notes: https://iceberg.apache.org/releases/#180-release >>>>>>>>>>> <https://iceberg.apache.org/releases/180-release> >>>>>>>>>>> >>>>>>>>>>> Java artifacts are available from Maven Central. >>>>>>>>>>> >>>>>>>>>>> Thanks to everyone for contributing! >>>>>>>>>> -- >>>>>>>>>> Robert Stupp >>>>>>>>>> @snazy >>>>>>>> -- >>>>>>>> Robert Stupp >>>>>>>> @snazy >>>>> -- >>>>> Robert Stupp >>>>> @snazy >>>> -- >>>> Robert Stupp >>>> @snazy > -- > Robert Stupp > @snazy