Thanks for everyone joining the discussion. I think we have reached a consensus on the approach:
Read liberally and write strictly Follow the above principle in different language implementations. I'll fire a pr following Fokko's suggestion about the statement that the schemas and current-schema-id take priority over schema . On Wed, Mar 19, 2025 at 3:04 AM Kevin Liu <kevinjq...@apache.org> wrote: > I agree with Fokko to add the current behavior to the Implementation Notes > section. With schemas vs schema and partitions_specs vs partition_spec, it > is the logical fallback when the required field for the V1 table is > missing. > > > Read liberally and write strictly. > > This has been mentioned a couple times on the devlist. It's the > general approach the java (reference) implementation has taken. I'm in > favor of following this approach in other implementations. In order to fix > the invalid table metadata, we'd need to be able to read it first. > > Best, > Kevin Liu > > On Tue, Mar 18, 2025 at 7:02 AM Russell Spitzer <russell.spit...@gmail.com> > wrote: > >> Our general policy in the past has been to accept potentially invalid >> metadata as long as we can reason about it or until we get to a point where >> we have to throw an error. >> >> Read liberally and write strictly. >> >> I wouldn’t want to change the spec to match an incorrect writer but there >> is nothing wrong with changing a reader imho if it’s primarily to fix >> tables. I think the only thing we really want to avoid is making implicit >> changes to the spec by normalizing completely out of spec metadata. >> >> On Tue, Mar 18, 2025 at 4:46 AM Renjie Liu <liurenjie2...@gmail.com> >> wrote: >> >>> I think there are two problems here: >>> >>> 1. As Fokko and Eduard mentioned, Glue has produced invalid v1 tables. >>> 2. The java implementation didn't do a sanity check about table metadata >>> for required fields. >>> >>> We could follow java implementation in other languages, but this seems a >>> workaround for dealing with invalid table metadata. I'm not sure if this is >>> desired approach. >>> >>> >>> On Tue, Mar 18, 2025 at 5:26 PM Eduard Tudenhöfner < >>> etudenhoef...@apache.org> wrote: >>> >>>> I also believe that the Java impl and the spec are correct for V1. >>>> The TableMetadataParser reads *schemas >>>> <https://github.com/apache/iceberg/blob/f3b3ee40871d38083ed095215fffa91acb2c8a45/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L356-L380>* >>>> / *partition_specs >>>> <https://github.com/apache/iceberg/blob/f3b3ee40871d38083ed095215fffa91acb2c8a45/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L393-L409>* >>>> first. If those are present, then it must be a V2 table and therefore >>>> *schema >>>> <https://github.com/apache/iceberg/blob/f3b3ee40871d38083ed095215fffa91acb2c8a45/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L382-L387>* >>>> / *partition_spec >>>> <https://github.com/apache/iceberg/blob/f3b3ee40871d38083ed095215fffa91acb2c8a45/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L412-L421>* >>>> don't >>>> have to be read. >>>> That being said, *schema* / *partition_spec* are still required fields >>>> for V1, so I wouldn't change anything in the spec other than maybe adding >>>> that implementation note of the Java impl. >>>> >>>> As Fokko already mentioned and looking at >>>> https://github.com/apache/iceberg-rust/issues/1088 it seems that the >>>> problem is with how Glue produces V1 metadata. >>>> >>>> >>>> Eduard >>>> >>>> On Tue, Mar 18, 2025 at 9:53 AM Fokko Driesprong <fo...@apache.org> >>>> wrote: >>>> >>>>> Hi everyone, >>>>> >>>>> Thanks for raising this. I believe the Java implementation and the >>>>> spec are still in sync since for V1 we always write schema >>>>> <https://github.com/apache/iceberg/blob/8f6ebb5b36a0263edfcb04e0c104b26225f95b07/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L176-L182> >>>>> and partition-spec >>>>> <https://github.com/apache/iceberg/blob/8f6ebb5b36a0263edfcb04e0c104b26225f95b07/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L192-L196>. >>>>> According to the PR <https://github.com/apache/iceberg-rust/pull/1087>, >>>>> it looks like the problem lies in Glue. I believe making the fields in the >>>>> spec optional instead of required isn't the right approach, since this >>>>> leads us to a path where we would update the spec for an >>>>> implementation that is not adhering to it. Since the field is deprecated, >>>>> I >>>>> think this is a special case and we could add a section to Implementation >>>>> Notes >>>>> <https://iceberg.apache.org/spec/#appendix-f-implementation-notes> that >>>>> states that the schemas and current-schema-id take priority over >>>>> schema. >>>>> >>>>> Curious to hear what others think. >>>>> >>>>> Kind regards, >>>>> Fokko >>>>> >>>>> Op di 18 mrt 2025 om 04:20 schreef Kevin Liu <kevinjq...@apache.org>: >>>>> >>>>>> Hi Renjie, >>>>>> >>>>>> Thanks for raising this! For context, here's the iceberg-rust PR >>>>>> <https://github.com/apache/iceberg-rust/pull/1087> where we found >>>>>> the inconsistency. >>>>>> I am also in favor of approach 2 for backwards compatibility reasons. >>>>>> Perhaps, we can include this behavior in "Appendix E: Format version >>>>>> changes" <https://iceberg.apache.org/spec/#version-2>, under the "Reading >>>>>> v1 metadata for v2" section. >>>>>> >>>>>> Best, >>>>>> Kevin Liu >>>>>> >>>>>> >>>>>> On Mon, Mar 17, 2025 at 7:17 PM Renjie Liu <liurenjie2...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hi: >>>>>>> >>>>>>> We found an inconsistency between java implementation and spec about >>>>>>> partition-spec and schema in v1 table. >>>>>>> >>>>>>> In spec, it says in v1 table partition-spec and schema are required >>>>>>> but deprecated: https://iceberg.apache.org/spec/#table-metadata >>>>>>> >>>>>>> While in java implementation, they are both optional. For schema, it >>>>>>> checks for `schemas` first, and looks up the current table schema by >>>>>>> `schema-id`. Otherwise it looks for a `schema` field. Similar things >>>>>>> happen for `partition spec`. >>>>>>> >>>>>>> We found this problem when implementing iceberg-rust. To >>>>>>> resolve this inconsistency, there are two approaches: >>>>>>> >>>>>>> 1. Modify java implementation to match spec, e.g. force checking >>>>>>> `schema` and `partition spec` field. >>>>>>> 2. Update spec to claim that both `schema` and `partition spec` are >>>>>>> optional for v1 table. >>>>>>> >>>>>>> Personally I'm in favor of approach 2 as it keeps backward >>>>>>> compatibility and seems a more reasonable solution to me. >>>>>>> >>>>>>> Looking forward to hearing from you! >>>>>>> >>>>>>