Re: [DISCUSS] Inconsistency between java implementation and spec about partition-spec and schema in v1 table

Renjie Liu Wed, 19 Mar 2025 05:46:51 -0700

Thanks for everyone joining the discussion. I think we have reached a
consensus on the approach:


 Read liberally and write strictly


Follow the above principle in different language implementations.

I'll fire a pr following Fokko's suggestion about the statement that the
schemas and current-schema-id take priority over schema .

On Wed, Mar 19, 2025 at 3:04 AM Kevin Liu <kevinjq...@apache.org> wrote:

> I agree with Fokko to add the current behavior to the Implementation Notes
> section. With schemas vs schema and partitions_specs vs partition_spec, it
> is the logical fallback when the required field for the V1 table is
> missing.
>
> > Read liberally and write strictly.
>
> This has been mentioned a couple times on the devlist. It's the
> general approach the java (reference) implementation has taken. I'm in
> favor of following this approach in other implementations. In order to fix
> the invalid table metadata, we'd need to be able to read it first.
>
> Best,
> Kevin Liu
>
> On Tue, Mar 18, 2025 at 7:02 AM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> Our general policy in the past has been to accept potentially invalid
>> metadata as long as we can reason about it or until we get to a point where
>> we have to throw an error.
>>
>> Read liberally and write strictly.
>>
>> I wouldn’t want to change the spec to match an incorrect writer but there
>> is nothing wrong with changing a reader imho if it’s primarily to fix
>> tables. I think the only thing we really want to avoid is making implicit
>> changes to the spec by normalizing completely out of spec metadata.
>>
>> On Tue, Mar 18, 2025 at 4:46 AM Renjie Liu <liurenjie2...@gmail.com>
>> wrote:
>>
>>> I think there are two problems here:
>>>
>>> 1. As Fokko and Eduard mentioned, Glue has produced invalid v1 tables.
>>> 2. The java implementation didn't do a sanity check about table metadata
>>> for required fields.
>>>
>>> We could follow java implementation in other languages, but this seems a
>>> workaround for dealing with invalid table metadata. I'm not sure if this is
>>> desired approach.
>>>
>>>
>>> On Tue, Mar 18, 2025 at 5:26 PM Eduard Tudenhöfner <
>>> etudenhoef...@apache.org> wrote:
>>>
>>>> I also believe that the Java impl and the spec are correct for V1.
>>>> The TableMetadataParser reads *schemas
>>>> <https://github.com/apache/iceberg/blob/f3b3ee40871d38083ed095215fffa91acb2c8a45/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L356-L380>*
>>>> / *partition_specs
>>>> <https://github.com/apache/iceberg/blob/f3b3ee40871d38083ed095215fffa91acb2c8a45/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L393-L409>*
>>>> first. If those are present, then it must be a V2 table and therefore 
>>>> *schema
>>>> <https://github.com/apache/iceberg/blob/f3b3ee40871d38083ed095215fffa91acb2c8a45/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L382-L387>*
>>>> / *partition_spec
>>>> <https://github.com/apache/iceberg/blob/f3b3ee40871d38083ed095215fffa91acb2c8a45/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L412-L421>*
>>>>  don't
>>>> have to be read.
>>>> That being said, *schema* / *partition_spec* are still required fields
>>>> for V1, so I wouldn't change anything in the spec other than maybe adding
>>>> that implementation note of the Java impl.
>>>>
>>>> As Fokko already mentioned and looking at
>>>> https://github.com/apache/iceberg-rust/issues/1088 it seems that the
>>>> problem is with how Glue produces V1 metadata.
>>>>
>>>>
>>>> Eduard
>>>>
>>>> On Tue, Mar 18, 2025 at 9:53 AM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Thanks for raising this. I believe the Java implementation and the
>>>>> spec are still in sync since for V1 we always write schema
>>>>> <https://github.com/apache/iceberg/blob/8f6ebb5b36a0263edfcb04e0c104b26225f95b07/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L176-L182>
>>>>> and partition-spec
>>>>> <https://github.com/apache/iceberg/blob/8f6ebb5b36a0263edfcb04e0c104b26225f95b07/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L192-L196>.
>>>>> According to the PR <https://github.com/apache/iceberg-rust/pull/1087>,
>>>>> it looks like the problem lies in Glue. I believe making the fields in the
>>>>> spec optional instead of required isn't the right approach, since this
>>>>> leads us to a path where we would update the spec for an
>>>>> implementation that is not adhering to it. Since the field is deprecated, 
>>>>> I
>>>>> think this is a special case and we could add a section to Implementation
>>>>> Notes
>>>>> <https://iceberg.apache.org/spec/#appendix-f-implementation-notes> that
>>>>> states that the schemas and current-schema-id take priority over
>>>>> schema.
>>>>>
>>>>> Curious to hear what others think.
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>> Op di 18 mrt 2025 om 04:20 schreef Kevin Liu <kevinjq...@apache.org>:
>>>>>
>>>>>> Hi Renjie,
>>>>>>
>>>>>> Thanks for raising this! For context, here's the iceberg-rust PR
>>>>>> <https://github.com/apache/iceberg-rust/pull/1087> where we found
>>>>>> the inconsistency.
>>>>>> I am also in favor of approach 2 for backwards compatibility reasons.
>>>>>> Perhaps, we can include this behavior in "Appendix E: Format version
>>>>>> changes" <https://iceberg.apache.org/spec/#version-2>, under the "Reading
>>>>>> v1 metadata for v2" section.
>>>>>>
>>>>>> Best,
>>>>>> Kevin Liu
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 17, 2025 at 7:17 PM Renjie Liu <liurenjie2...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi:
>>>>>>>
>>>>>>> We found an inconsistency between java implementation and spec about
>>>>>>> partition-spec and schema in v1 table.
>>>>>>>
>>>>>>> In spec, it says in v1 table partition-spec and schema are required
>>>>>>> but deprecated: https://iceberg.apache.org/spec/#table-metadata
>>>>>>>
>>>>>>> While in java implementation, they are both optional. For schema, it
>>>>>>> checks for `schemas` first, and looks up the current table schema by
>>>>>>> `schema-id`.  Otherwise it looks for a `schema` field. Similar things
>>>>>>> happen for `partition spec`.
>>>>>>>
>>>>>>> We found this problem when implementing iceberg-rust. To
>>>>>>> resolve this inconsistency, there are two approaches:
>>>>>>>
>>>>>>> 1. Modify java implementation to match spec, e.g. force checking
>>>>>>> `schema` and `partition spec` field.
>>>>>>> 2. Update spec to claim that both `schema` and `partition spec` are
>>>>>>> optional for v1 table.
>>>>>>>
>>>>>>> Personally I'm in favor of approach 2 as it keeps backward
>>>>>>> compatibility and seems a more reasonable solution to me.
>>>>>>>
>>>>>>> Looking forward to hearing from you!
>>>>>>>
>>>>>>

Re: [DISCUSS] Inconsistency between java implementation and spec about partition-spec and schema in v1 table

Reply via email to