I see it's been merged, but I don't think it is a good idea to enforce this. The spec can and should require the `operation` but we want to be careful about creating situations where bad metadata can needlessly break a table. I would be much more permissive here, which is why this probably wasn't enforced in the first place.
On Fri, Oct 25, 2024 at 2:36 PM Kevin Liu <kevin.jq....@gmail.com> wrote: > Thanks, everyone! The PR[1] has been merged > > Best, > Kevin Liu > > [1] https://github.com/apache/iceberg/pull/11354 > > > On Fri, Oct 25, 2024 at 1:02 PM Kevin Liu <kevin.jq....@gmail.com> wrote: > >> Thanks, Ryan! That makes sense. >> >> I want to follow up on the original issue. I've made a PR [1] to enforce >> that the Snapshot `summary` map must have an `operation` key. Please take a >> look. Thank you @nastra for the comments and reviews. >> >> Best, >> Kevin Liu >> >> [1] https://github.com/apache/iceberg/pull/11354 >> >> >> >> On Tue, Oct 22, 2024 at 4:06 PM rdb...@gmail.com <rdb...@gmail.com> >> wrote: >> >>> > For example, the `Snapshot` `summary` field is optional in V1 but >>> required in V2. Therefore, the REST spec definition should mark the >>> `summary` field as optional to support both versions. >>> >>> Yeah, this is technically true. But as I said in my first email, unless >>> you have tables that are 5 years old, it's unlikely that this is going to >>> be a problem. A failure here is more likely with newer implementations that >>> have a bug. So I'd argue there's value in leaving it as required. >>> >>> On Mon, Oct 21, 2024 at 9:41 AM Kevin Liu <kevin.jq....@gmail.com> >>> wrote: >>> >>>> > No. They were introduced at the same time. >>>> Great! Since the `summary` field and the `operation` key were >>>> introduced together, we should enforce the rule that the `summary` >>>> field must always have an accompanying `operation` key. This has been >>>> addressed in PR 11354 [1]. >>>> >>>> > I am strongly against this. The REST spec should be independent of >>>> the table versions. >>>> That makes sense. For the REST spec to support both V1 and V2 tables, >>>> it should "accept" the least common denominator between the two versions. >>>> For example, the `Snapshot` `summary` field is optional in V1 but required >>>> in V2. Therefore, the REST spec definition should mark the `summary` field >>>> as optional to support both versions. However, the current REST spec leans >>>> towards the V2 table spec; fields that are optional in V1 and required in >>>> V2 are marked as required in the spec, such as `TableMetadata.table-uuid` >>>> [2][3] and `Snapshot.summary` [4][5]. >>>> >>>> Would love to get other people's thoughts on this. >>>> >>>> Best, >>>> Kevin Liu >>>> >>>> [1] https://github.com/apache/iceberg/pull/11354 >>>> [2] >>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2414 >>>> [3] https://iceberg.apache.org/spec/#table-metadata-fields >>>> [4] >>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2325 >>>> [5] https://iceberg.apache.org/spec/#snapshots >>>> >>>> On Sun, Oct 20, 2024 at 11:24 AM rdb...@gmail.com <rdb...@gmail.com> >>>> wrote: >>>> >>>>> Was it ever valid to have a summary field without the operation key? >>>>> >>>>> No. They were introduced at the same time. >>>>> >>>>> Would it be helpful to create alternative versions of the REST spec >>>>> specifically for referencing V1 and V2 tables? >>>>> >>>>> I am strongly against this. The REST spec should be independent of the >>>>> table versions. Any table format version can be passed and the table >>>>> format >>>>> should be the canonical reference for what is allowed. We want to avoid >>>>> cases where there are discrepancies. The table spec is canonical for table >>>>> metadata, and the REST spec allows passing it. >>>>> >>>>> On Sun, Oct 20, 2024 at 11:18 AM Kevin Liu <kevin.jq....@gmail.com> >>>>> wrote: >>>>> >>>>>> Hey folks, >>>>>> >>>>>> Thanks, everyone for the discussion, and thanks Ryan for providing >>>>>> the historical context. >>>>>> Enforce the `operation` key in Snapshot’s `summary` field >>>>>> >>>>>> When serializing the `Snapshot` object from JSON, the Java >>>>>> implementation does not enforce that the `summary` field must contain an >>>>>> `operation` key. In the V1 spec, the `summary` field is optional, while >>>>>> in >>>>>> the V2 spec, it is required. However, in both versions, if a `summary` >>>>>> field is present, it must include an `operation` key. Any `summary` field >>>>>> lacking an `operation` key should be considered invalid. >>>>>> >>>>>> I’ve addressed this issue in PR 11354 [1] by adding this constraint >>>>>> when parsing JSON. >>>>>> >>>>>> > We initially did not have the snapshot summary or operation. When I >>>>>> added the summary, the operation was intended to be required in cases >>>>>> where >>>>>> the summary is present. It should always be there if the summary is and >>>>>> the >>>>>> summary should always be there unless you wrote the metadata.json >>>>>> file way back in 2017 or 2018. >>>>>> >>>>>> @Ryan, does this constraint also apply to `metadata.json` files from >>>>>> 2017/2018? Was it ever valid to have a `summary` field without the >>>>>> `operation` key? >>>>>> >>>>>> > Well, the spec says nothing about a top-level `operation` field in >>>>>> JSON [1]. Yet the Java implementation produces it [2] and removes the >>>>>> operation from the summary map. This seems inconsistent? >>>>>> >>>>>> @Anton, the Java `Snapshot` object includes both the `summary` and >>>>>> `operation` fields. When serializing to JSON, the `operation` field is >>>>>> included in the `summary` map [2], rather than as a top-level field. >>>>>> During >>>>>> deserialization from JSON, the `operation` field is extracted from the >>>>>> `summary` map [3]. >>>>>> >>>>>> I believe this is consistent with the table spec, which defines the >>>>>> JSON output, not how the `Snapshot` object is implemented in Java. >>>>>> On REST spec and Table spec >>>>>> >>>>>> Thanks, Yufei, for highlighting the difference between the REST spec >>>>>> and the table spec. I mistakenly used the REST spec >>>>>> (`rest-catalog-open-api.yaml` [4]) as the source of truth for V2 tables. >>>>>> >>>>>> Looking at the REST spec file, it can be challenging to determine how >>>>>> a REST server should handle V1 versus V2 tables. Even for V2 tables, the >>>>>> current version of the file combines features from V2, along with >>>>>> additional changes made in preparation for the upcoming V3 spec. >>>>>> >>>>>> Would it be helpful to create alternative versions of the REST spec >>>>>> specifically for referencing V1 and V2 tables? The goal would be to have >>>>>> a >>>>>> "frozen" version of the REST spec dedicated to V1 tables and another for >>>>>> V2 >>>>>> tables while allowing the current REST spec file to evolve as needed. >>>>>> >>>>>> Taking a step back, I think we need more documentation on the REST >>>>>> spec, including support for different table versions and guidance on >>>>>> upgrading from one version to another. I’d love to hear everyone’s >>>>>> thoughts >>>>>> on this. >>>>>> >>>>>> >>>>>> Best, >>>>>> >>>>>> Kevin Liu >>>>>> >>>>>> >>>>>> [1] https://github.com/apache/iceberg/pull/11354 >>>>>> >>>>>> [2] >>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63-L66 >>>>>> >>>>>> [3] >>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L137 >>>>>> >>>>>> [4] >>>>>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml >>>>>> >>>>>> >>>>>> On Sat, Oct 19, 2024 at 7:48 PM Sung Yun <sun...@apache.org> wrote: >>>>>> >>>>>>> Hi Ryan, thank you for your response! >>>>>>> >>>>>>> That detailed context is very helpful in allowing me to >>>>>>> understanding why the REST catalog spec has evolved the way it has, and >>>>>>> how >>>>>>> the Table Spec and the REST Catalog Spec should each be referenced in >>>>>>> the >>>>>>> sub-communities (like in PyIceberg). I'll keep those motivations in >>>>>>> mind as >>>>>>> we discuss those Specs in the future. >>>>>>> >>>>>>> Also, here's a small PR to specify more explicitly that the >>>>>>> operation field should be a required field in the summary field: >>>>>>> https://github.com/apache/iceberg/pull/11355 >>>>>>> >>>>>>> Sung >>>>>>> >>>>>>> On 2024/10/19 22:14:59 "rdb...@gmail.com" wrote: >>>>>>> > I can provide some historical context here about how the table >>>>>>> spec evolved >>>>>>> > and how the REST spec works with respect to table versions. >>>>>>> > >>>>>>> > We initially did not have the snapshot summary or operation. When >>>>>>> I added >>>>>>> > the summary, the operation was intended to be required in cases >>>>>>> where the >>>>>>> > summary is present. It should always be there if the summary is >>>>>>> and the >>>>>>> > summary should always be there unless you wrote the metadata.json >>>>>>> file way >>>>>>> > back in 2017 or 2018. It looks like the spec could be more clear >>>>>>> that the >>>>>>> > operation is required when summary is present. Anyone want to open >>>>>>> a PR? >>>>>>> > >>>>>>> > Anton, I don't think there is a top-level operation field. The Java >>>>>>> > Snapshot class tracks the operation as top-level, but it is always >>>>>>> stored >>>>>>> > in the summary. I think this is consistent with the spec. >>>>>>> > >>>>>>> > For the REST spec, I think that it should be strictly optional to >>>>>>> support >>>>>>> > v1 tables written with no summary, but it should always be >>>>>>> present. I'd >>>>>>> > probably leave it required since it already is and is catching a >>>>>>> valuable >>>>>>> > error case here. >>>>>>> > >>>>>>> > When building the REST spec, I took the same approach as the Java >>>>>>> parser >>>>>>> > (which is also to parse table metadata coming from REST servers). >>>>>>> That is, >>>>>>> > it is compatible with v1 metadata; fields that were not required >>>>>>> in v1 are >>>>>>> > optional. This fits with the principle of "be liberal in what you >>>>>>> accept >>>>>>> > and strict in what you produce". The REST spec allows passing >>>>>>> metadata for >>>>>>> > any table version so that the specs are not tightly coupled. The >>>>>>> table >>>>>>> > version is passed when loading and clients should reject table >>>>>>> versions >>>>>>> > that are newer than what they can support. The REST protocol just >>>>>>> needs to >>>>>>> > facilitate passing the table metadata. >>>>>>> > >>>>>>> > Most v2 structures, like the `schemas` list, are introduced as >>>>>>> optional in >>>>>>> > v1 and made required in v2. That way, it's possible to add >>>>>>> metadata to >>>>>>> > existing format versions and make the upgrade path easier. >>>>>>> Maintaining the >>>>>>> > newer structures even though there are different writer versions >>>>>>> deployed >>>>>>> > is one of the reasons why the REST spec changes to a change-based >>>>>>> model. >>>>>>> > New metadata only needs to be supported by the service maintaining >>>>>>> the >>>>>>> > metadata.json files and any writers that want to update it. >>>>>>> > >>>>>>> > I see some points about being able to remove old table versions. I >>>>>>> don't >>>>>>> > think that the REST protocol itself is the place to do this. The >>>>>>> protocol >>>>>>> > is format-agnostic. Implementations are free to reject requests to >>>>>>> create >>>>>>> > tables with older versions, or to update the table to a new >>>>>>> version. >>>>>>> > >>>>>>> > Ryan >>>>>>> > >>>>>>> > On Fri, Oct 18, 2024 at 6:42 AM Sung Yun <sun...@apache.org> >>>>>>> wrote: >>>>>>> > >>>>>>> > > Folks, thank you for your responses! These area really helpful >>>>>>> insights. >>>>>>> > > >>>>>>> > > > I agree that the REST spec should aim to support v1, v2, and >>>>>>> potentially >>>>>>> > > the upcoming v3. In practice, it seems like the choice of table >>>>>>> spec might >>>>>>> > > ultimately be dictated by the REST catalog implementation. >>>>>>> > > >>>>>>> > > > A best practice would be for the server to strive to support >>>>>>> all Iceberg >>>>>>> > > versions, but the REST spec itself should remain flexible enough >>>>>>> to >>>>>>> > > accommodate less strict table specs. >>>>>>> > > >>>>>>> > > Yufei, yes that makes sense, and I agree that the server should >>>>>>> strive to >>>>>>> > > support all format versions, because otherwise the an older >>>>>>> client >>>>>>> > > application, may just not be compatible with a REST Catalog >>>>>>> running on a >>>>>>> > > higher version of table spec. I think we have two choices here >>>>>>> in ensuring >>>>>>> > > that the REST Catalog server is able to support multiple >>>>>>> versions of the >>>>>>> > > Table Spec: >>>>>>> > > >>>>>>> > > 1. We could create single components that are common >>>>>>> denominators of all >>>>>>> > > existing table specs to accommodate the less table specs. The >>>>>>> REST Catalog >>>>>>> > > Spec currently falls short in this approach, and I've put up >>>>>>> this PR to >>>>>>> > > show what this change would look like just for the Snapshot >>>>>>> component: >>>>>>> > > https://github.com/apache/iceberg/pull/11353 - My take on this >>>>>>> is that, >>>>>>> > > this approach will make the REST catalog spec more confusing and >>>>>>> difficult >>>>>>> > > to manage as we add more Table Spec versions moving forward. The >>>>>>> discussion >>>>>>> > > on this mail list thread is I think a great demonstration of >>>>>>> that confusion >>>>>>> > > :) >>>>>>> > > 2. We could instead create separate Table Spec version specific >>>>>>> components >>>>>>> > > on the REST Catalog Open API Spec. For example, a Snapshot >>>>>>> component could >>>>>>> > > be anyOf SnapshotV1 and SnapshotV2, which match the Table Spec >>>>>>> V1 and V2 >>>>>>> > > definitions. I think creating explicit components that match the >>>>>>> spec >>>>>>> > > definitions will work in our favor when we continue to introduce >>>>>>> more Spec >>>>>>> > > changes and manage their lifecycles. And perhaps, maybe we could >>>>>>> also >>>>>>> > > indicate what format-versions the REST Catalog Server supports >>>>>>> through an >>>>>>> > > endpoint, and communicate it to a client application. >>>>>>> > > >>>>>>> > > I'd love to hear the community's opinion on suggestion (2)! I'm >>>>>>> very >>>>>>> > > curious to hear if we've considered it before. >>>>>>> > > >>>>>>> > > Sung >>>>>>> > > >>>>>>> > > On 2024/10/18 05:13:15 Péter Váry wrote: >>>>>>> > > > Hi Team, >>>>>>> > > > Apart from fixing this current issue by relaxing the current >>>>>>> spec >>>>>>> > > > constraints, to support both v1 and v2 specifications, we >>>>>>> should think >>>>>>> > > > about how to handle table spec evolution for the long term. >>>>>>> > > > >>>>>>> > > > What are the base factors we can start from (please add your >>>>>>> own ideas >>>>>>> > > if I >>>>>>> > > > have missed something): >>>>>>> > > > - We evolve the specifications in a way that is backwards >>>>>>> compatible (v1 >>>>>>> > > > table could be read by v2 reader) but not forwards compatible >>>>>>> (v2 table >>>>>>> > > > could not be read by an old reader) >>>>>>> > > > - The rest spec ideally should conform to the currently used >>>>>>> table spec >>>>>>> > > > schema/constraints >>>>>>> > > > - REST catalogs sooner-or-later would like to drop support for >>>>>>> older >>>>>>> > > table >>>>>>> > > > spec. We need to avoid the situation of Hive Metastore, where >>>>>>> the >>>>>>> > > decisions >>>>>>> > > > made 10 years ago prevented enhancing the APIs as the old >>>>>>> specifications >>>>>>> > > > were supported forever. >>>>>>> > > > >>>>>>> > > > Probably (when the spec difference becomes big enough) a >>>>>>> composit request >>>>>>> > > > (version + different content spec) or a different endpoint >>>>>>> will be >>>>>>> > > required. >>>>>>> > > > >>>>>>> > > > Thanks, Peter >>>>>>> > > > >>>>>>> > > > On Thu, Oct 17, 2024, 23:11 Yufei Gu <flyrain...@gmail.com> >>>>>>> wrote: >>>>>>> > > > >>>>>>> > > > > Hi Sung, >>>>>>> > > > > >>>>>>> > > > > It seems we are running to issues related to a mismatch >>>>>>> between the >>>>>>> > > REST >>>>>>> > > > > spec and table specifications. Currently, there's no clear >>>>>>> definition >>>>>>> > > of >>>>>>> > > > > how the REST spec is meant to support different table specs. >>>>>>> The >>>>>>> > > closest >>>>>>> > > > > reference I found is this statement >>>>>>> > > > > < >>>>>>> > > >>>>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L30-L30 >>>>>>> > > > >>>>>>> > > > > in the REST spec. >>>>>>> > > > > >>>>>>> > > > > Implementations should ideally support both Iceberg table >>>>>>> specs v1 and >>>>>>> > > v2, >>>>>>> > > > >> with priority given to v2. >>>>>>> > > > > >>>>>>> > > > > >>>>>>> > > > > I agree that the REST spec should aim to support v1, v2, and >>>>>>> > > potentially >>>>>>> > > > > the upcoming v3. In practice, it seems like the choice of >>>>>>> table spec >>>>>>> > > might >>>>>>> > > > > ultimately be dictated by the REST catalog implementation. >>>>>>> > > > > >>>>>>> > > > > >>>>>>> > > > > A best practice would be for the server to strive to support >>>>>>> all >>>>>>> > > Iceberg >>>>>>> > > > > versions, but the REST spec itself should remain flexible >>>>>>> enough to >>>>>>> > > > > accommodate less strict table specs. For the case you >>>>>>> mentioned, it >>>>>>> > > should >>>>>>> > > > > be fine to make sequence number optional since the spec has >>>>>>> to support >>>>>>> > > v1 >>>>>>> > > > > table spec. It does feel confusing though. >>>>>>> > > > > >>>>>>> > > > > >>>>>>> > > > > WDYT? >>>>>>> > > > > >>>>>>> > > > > >>>>>>> > > > > Yufei >>>>>>> > > > > >>>>>>> > > > > >>>>>>> > > > > On Thu, Oct 17, 2024 at 1:56 PM Anton Okolnychyi < >>>>>>> > > aokolnyc...@gmail.com> >>>>>>> > > > > wrote: >>>>>>> > > > > >>>>>>> > > > >> Well, the spec says nothing about a top-level `operation` >>>>>>> field in >>>>>>> > > JSON >>>>>>> > > > >> [1]. Yet the Java implementation produces it [2] and >>>>>>> removes the >>>>>>> > > operation >>>>>>> > > > >> from the summary map. This seems inconsistent? >>>>>>> > > > >> >>>>>>> > > > >> - Anton >>>>>>> > > > >> >>>>>>> > > > >> [1] - https://iceberg.apache.org/spec/#snapshots >>>>>>> > > > >> [2] - >>>>>>> > > > >> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 >>>>>>> > > > >> >>>>>>> > > > >> >>>>>>> > > > >> чт, 17 жовт. 2024 р. о 10:06 Sung Yun <sun...@apache.org> >>>>>>> пише: >>>>>>> > > > >> >>>>>>> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` file >>>>>>> [2] in the >>>>>>> > > > >>> Iceberg repo contains the latest version of the spec. >>>>>>> > > > >>> >>>>>>> > > > >>> I think more clarity on this would be helpful. Is it >>>>>>> really the case >>>>>>> > > > >>> that the Open API spec contains the latest version of the >>>>>>> spec? For >>>>>>> > > > >>> example, I'm noticing a discrepancy between >>>>>>> sequence-number in the >>>>>>> > > Table >>>>>>> > > > >>> Spec and in the Open API Spec... >>>>>>> > > > >>> >>>>>>> > > > >>> In the table spec, it's required for V2, but it's optional >>>>>>> in the >>>>>>> > > REST >>>>>>> > > > >>> API Spec: >>>>>>> > > > >>> >>>>>>> > > > >>> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L2319-L2335 >>>>>>> > > > >>> >>>>>>> > > > >>> On 2024/10/17 16:58:17 Kevin Liu wrote: >>>>>>> > > > >>> > > Based on the example metadata, that looks like it is >>>>>>> not to >>>>>>> > > spec, so >>>>>>> > > > >>> it's >>>>>>> > > > >>> > reasonable that python would reject it. If the java >>>>>>> > > implementation is >>>>>>> > > > >>> > allowing for that, it's likely that we're being too >>>>>>> relaxed >>>>>>> > > (possibly a >>>>>>> > > > >>> > holdover from v1 parsing). >>>>>>> > > > >>> > I believe the Java implementation is relaxing the >>>>>>> constraint. I'll >>>>>>> > > > >>> create a >>>>>>> > > > >>> > PR with test cases and the necessary changes. >>>>>>> > > > >>> > >>>>>>> > > > >>> > > Do you know what produced the metadata? >>>>>>> > > > >>> > It was created by Snowflake [1]. After verifying this, >>>>>>> I'll look >>>>>>> > > into >>>>>>> > > > >>> > raising the issue with them. >>>>>>> > > > >>> > >>>>>>> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` file >>>>>>> [2] in the >>>>>>> > > > >>> Iceberg >>>>>>> > > > >>> > repo contains the latest version of the spec. As we're >>>>>>> continuing >>>>>>> > > to >>>>>>> > > > >>> evolve >>>>>>> > > > >>> > to spec for V3, would it be helpful to create a frozen >>>>>>> version >>>>>>> > > > >>> representing >>>>>>> > > > >>> > both the V1 and V2 specs for reference, possibly as a >>>>>>> separate >>>>>>> > > file? >>>>>>> > > > >>> > >>>>>>> > > > >>> > Best, >>>>>>> > > > >>> > Kevin Liu >>>>>>> > > > >>> > >>>>>>> > > > >>> > [1] >>>>>>> > > > >>> > >>>>>>> > > > >>> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg-python/issues/1106#issuecomment-2312108455 >>>>>>> > > > >>> > [2] >>>>>>> > > > >>> > >>>>>>> > > > >>> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml >>>>>>> > > > >>> > >>>>>>> > > > >>> > On Thu, Oct 17, 2024 at 9:20 AM Daniel Weeks < >>>>>>> dwe...@apache.org> >>>>>>> > > > >>> wrote: >>>>>>> > > > >>> > >>>>>>> > > > >>> > > Sung, >>>>>>> > > > >>> > > >>>>>>> > > > >>> > > I was thinking of v1, so you're right that >>>>>>> manifest-list and >>>>>>> > > summary >>>>>>> > > > >>> are >>>>>>> > > > >>> > > required as of v2. The REST Spec seems to follow the >>>>>>> v2 >>>>>>> > > definition, >>>>>>> > > > >>> so I >>>>>>> > > > >>> > > think we're somewhat implicitly requiring those fields >>>>>>> via REST. >>>>>>> > > > >>> > > >>>>>>> > > > >>> > > Kevin, >>>>>>> > > > >>> > > >>>>>>> > > > >>> > > Based on the example metadata, that looks like it is >>>>>>> not to >>>>>>> > > spec, so >>>>>>> > > > >>> it's >>>>>>> > > > >>> > > reasonable that python would reject it. If the java >>>>>>> > > implementation >>>>>>> > > > >>> is >>>>>>> > > > >>> > > allowing for that, it's likely that we're being too >>>>>>> relaxed >>>>>>> > > > >>> (possibly a >>>>>>> > > > >>> > > holdover from v1 parsing). >>>>>>> > > > >>> > > >>>>>>> > > > >>> > > Do you know what produced the metadata? >>>>>>> > > > >>> > > >>>>>>> > > > >>> > > -Dan >>>>>>> > > > >>> > > >>>>>>> > > > >>> > > On Thu, Oct 17, 2024 at 9:02 AM Kevin Liu < >>>>>>> > > kevin.jq....@gmail.com> >>>>>>> > > > >>> wrote: >>>>>>> > > > >>> > > >>>>>>> > > > >>> > >> Thanks for the additional context. >>>>>>> > > > >>> > >> >>>>>>> > > > >>> > >> My understanding is that if a Snapshot has a >>>>>>> `summary` field, it >>>>>>> > > > >>> must >>>>>>> > > > >>> > >> also have a corresponding `operation` key in the >>>>>>> summary map. Is >>>>>>> > > > >>> that >>>>>>> > > > >>> > >> correct? Based on the `SnapshotParser`, this is not >>>>>>> enforced >>>>>>> > > [1]. >>>>>>> > > > >>> > >> >>>>>>> > > > >>> > >> The underlying issue in #1106 [2] is the missing >>>>>>> `operation` >>>>>>> > > field >>>>>>> > > > >>> when >>>>>>> > > > >>> > >> the `summary` field is present. >>>>>>> > > > >>> > >> For example, >>>>>>> > > > >>> > >> ``` >>>>>>> > > > >>> > >> "summary" : { >>>>>>> > > > >>> > >> "manifests-created" : "8", >>>>>>> > > > >>> > >> "total-records" : "26508666891", >>>>>>> > > > >>> > >> "added-files-size" : "3927895626752", >>>>>>> > > > >>> > >> "manifests-kept" : "0", >>>>>>> > > > >>> > >> "total-files-size" : "3927895626752", >>>>>>> > > > >>> > >> "added-records" : "26508666891", >>>>>>> > > > >>> > >> "added-data-files" : "231513", >>>>>>> > > > >>> > >> "manifests-replaced" : "0", >>>>>>> > > > >>> > >> "total-data-files" : "231513" >>>>>>> > > > >>> > >> } >>>>>>> > > > >>> > >> ``` >>>>>>> > > > >>> > >> >>>>>>> > > > >>> > >> It could be the case that this particular >>>>>>> `metadata.json` was >>>>>>> > > > >>> generated >>>>>>> > > > >>> > >> not according to the spec. >>>>>>> > > > >>> > >> >>>>>>> > > > >>> > >> Best, >>>>>>> > > > >>> > >> Kevin Liu >>>>>>> > > > >>> > >> >>>>>>> > > > >>> > >> >>>>>>> > > > >>> > >> [1] >>>>>>> > > > >>> > >> >>>>>>> > > > >>> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142 >>>>>>> > > > >>> > >> [2] >>>>>>> https://github.com/apache/iceberg-python/issues/1106 >>>>>>> > > > >>> > >> >>>>>>> > > > >>> > >> >>>>>>> > > > >>> > >> On Thu, Oct 17, 2024 at 8:47 AM Sung Yun < >>>>>>> sun...@apache.org> >>>>>>> > > wrote: >>>>>>> > > > >>> > >> >>>>>>> > > > >>> > >>> Thank you for the clarification Daniel, and thank >>>>>>> you Kevin for >>>>>>> > > > >>> raising >>>>>>> > > > >>> > >>> this issue! >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> > >>> Does that mean that we are creating component >>>>>>> schemas that are >>>>>>> > > the >>>>>>> > > > >>> > >>> superset of the V1 and V2 schemas? And if so, should >>>>>>> we remove >>>>>>> > > > >>> summary and >>>>>>> > > > >>> > >>> manifest-list from the required properties, and add >>>>>>> manifests >>>>>>> > > > >>> optional >>>>>>> > > > >>> > >>> property to the Snapshot schema to support both V1 >>>>>>> and V2 >>>>>>> > > Summary >>>>>>> > > > >>> specs? >>>>>>> > > > >>> > >>> https://iceberg.apache.org/spec/#snapshots >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> > >>> Or would creating separate component schemas for >>>>>>> V1/V2 be a >>>>>>> > > > >>> cleaner way >>>>>>> > > > >>> > >>> to align the REST spec with the table spec? >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> > >>> Sung >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> > >>> On 2024/10/17 15:19:23 Daniel Weeks wrote: >>>>>>> > > > >>> > >>> > I'm not convinced this is incorrect behavior >>>>>>> (table spec or >>>>>>> > > > >>> > >>> > implementation), but it does lend to some >>>>>>> confusion. The >>>>>>> > > > >>> 'summary' >>>>>>> > > > >>> > >>> field >>>>>>> > > > >>> > >>> > is optional, which means that if a summary is not >>>>>>> provided, >>>>>>> > > you >>>>>>> > > > >>> do not >>>>>>> > > > >>> > >>> have >>>>>>> > > > >>> > >>> > an associated 'operation' field. The 'operation' >>>>>>> field is >>>>>>> > > only >>>>>>> > > > >>> > >>> required in >>>>>>> > > > >>> > >>> > the context of the summary, so it's actually >>>>>>> possible for the >>>>>>> > > > >>> > >>> > implementation (i.e. the tests you reference) to >>>>>>> not have an >>>>>>> > > > >>> operation. >>>>>>> > > > >>> > >>> > >>>>>>> > > > >>> > >>> > I think what is wrong here is that the REST spec >>>>>>> marked the >>>>>>> > > > >>> summary as >>>>>>> > > > >>> > >>> > required >>>>>>> > > > >>> > >>> > < >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2040 >>>>>>> > > > >>> > >>> >, >>>>>>> > > > >>> > >>> > which is inconsistent with the table spec. >>>>>>> > > > >>> > >>> > >>>>>>> > > > >>> > >>> > On Wed, Oct 16, 2024 at 3:52 PM Anton Okolnychyi < >>>>>>> > > > >>> > >>> aokolnyc...@gmail.com> >>>>>>> > > > >>> > >>> > wrote: >>>>>>> > > > >>> > >>> > >>>>>>> > > > >>> > >>> > > Based on [1], we never persisted the operation >>>>>>> in the >>>>>>> > > summary >>>>>>> > > > >>> map. >>>>>>> > > > >>> > >>> > > Instead, we persisted it as a top-level field in >>>>>>> Java, >>>>>>> > > which is >>>>>>> > > > >>> > >>> actually >>>>>>> > > > >>> > >>> > > NOT what the spec says. Does anyone remember >>>>>>> cases when the >>>>>>> > > > >>> > >>> operation was >>>>>>> > > > >>> > >>> > > unknown? I personally don't. >>>>>>> > > > >>> > >>> > > >>>>>>> > > > >>> > >>> > > [1] - >>>>>>> > > > >>> > >>> > > >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 >>>>>>> > > > >>> > >>> > > >>>>>>> > > > >>> > >>> > > >>>>>>> > > > >>> > >>> > > ср, 16 жовт. 2024 р. о 12:42 Kevin Liu < >>>>>>> > > kevin.jq....@gmail.com >>>>>>> > > > >>> > >>>>>>> > > > >>> > >>> пише: >>>>>>> > > > >>> > >>> > > >>>>>>> > > > >>> > >>> > >> Hey folks, >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> > >> I’ve noticed a discrepancy between the Iceberg >>>>>>> > > specification >>>>>>> > > > >>> and >>>>>>> > > > >>> > >>> the Java >>>>>>> > > > >>> > >>> > >> implementation regarding the `operation` key in >>>>>>> the >>>>>>> > > `Snapshot` >>>>>>> > > > >>> > >>> `summary` >>>>>>> > > > >>> > >>> > >> field. >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> > >> The `Snapshot` object's `summary` dictionary >>>>>>> includes a >>>>>>> > > > >>> *required* >>>>>>> > > > >>> > >>> key >>>>>>> > > > >>> > >>> > >> named `operation`, as outlined in the spec >>>>>>> describing >>>>>>> > > Table >>>>>>> > > > >>> > >>> Metadata and >>>>>>> > > > >>> > >>> > >> Snapshots [1] and the generated OpenAPI YAML >>>>>>> [2]. >>>>>>> > > However, in >>>>>>> > > > >>> the >>>>>>> > > > >>> > >>> Java >>>>>>> > > > >>> > >>> > >> implementation [3], `operation` is treated as >>>>>>> optional. In >>>>>>> > > > >>> > >>> contrast, it >>>>>>> > > > >>> > >>> > >> remains a required field in the Python >>>>>>> implementation [4]. >>>>>>> > > > >>> > >>> > >> I also found that Java tests for >>>>>>> `SnapshotParser` assert >>>>>>> > > that >>>>>>> > > > >>> the >>>>>>> > > > >>> > >>> > >> `operation` field is null. [5] >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> > >> Due to this discrepancy, a user reported [6] >>>>>>> that the >>>>>>> > > > >>> > >>> `metadata.json` >>>>>>> > > > >>> > >>> > >> file generated for an Iceberg table could not >>>>>>> be read by >>>>>>> > > > >>> PyIceberg, >>>>>>> > > > >>> > >>> though >>>>>>> > > > >>> > >>> > >> it is readable using the Iceberg Java library. >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> > >> How should we proceed from here? Should the >>>>>>> Java library >>>>>>> > > > >>> enforce >>>>>>> > > > >>> > >>> this >>>>>>> > > > >>> > >>> > >> requirement? Additionally, how should we handle >>>>>>> existing >>>>>>> > > > >>> > >>> `metadata.json` >>>>>>> > > > >>> > >>> > >> files that were generated without this field? >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> > >> Best, >>>>>>> > > > >>> > >>> > >> Kevin Liu >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> > >> [1] >>>>>>> > > > >>> >>>>>>> https://iceberg.apache.org/spec/#table-metadata-and-snapshots >>>>>>> > > > >>> > >>> > >> [2] >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060 >>>>>>> > > > >>> > >>> > >> [3] >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124 >>>>>>> > > > >>> > >>> > >> [4] >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182 >>>>>>> > > > >>> > >>> > >> [5] >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> >>>>>>> > > >>>>>>> https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52 >>>>>>> > > > >>> > >>> > >> [6] >>>>>>> https://github.com/apache/iceberg-python/issues/1106 >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> > >> >>>>>>> > > > >>> > >>> > >>>>>>> > > > >>> > >>> >>>>>>> > > > >>> > >> >>>>>>> > > > >>> > >>>>>>> > > > >>> >>>>>>> > > > >> >>>>>>> > > > >>>>>>> > > >>>>>>> > >>>>>>> >>>>>>