Re: pre-proposal: schema_id on DataFile

rdb...@gmail.com Fri, 14 Feb 2025 12:08:13 -0800

We've considered this in the past and I'm undecided on it. There is some
benefit, like being able to prune files during planning if the file didn't
contain a column that is used in a filter that cannot match nulls (e.g.
`new_data_column IN ("a", "b")`).


On the other hand, we don't want data files that were written with older
schemas to prevent removing a schema from table metadata. The current
schema can always be used to read, and we don't want to compromise that,
have to keep too many schemas around, or have to scan all metadata to
remove schemas.

I think my preference is to instead include the highest field ID in the
schema used to write a file. That enables the `new_data_column` filter
logic above, but never requires keeping schemas around.
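
A rough sketch of that check, to make the idea concrete. This is not the
Iceberg API: `DataFileInfo`, `max_field_id`, `can_skip_file`, and
`plan_files` are hypothetical names. It assumes field IDs only grow as
columns are added and that a column missing from a file reads as all nulls
(i.e. no default value applies).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileInfo:
    """Hypothetical stand-in for a data file entry, not Iceberg's DataFile."""
    path: str
    # Assumed new field: highest field ID in the schema used to write the file.
    max_field_id: int


def can_skip_file(file, filter_field_id, filter_accepts_null=False):
    """Return True if no row in the file can match the filter.

    Assumes field IDs are assigned in increasing order, so a column whose ID
    is above the file's highest field ID did not exist when the file was
    written and reads as null for every row.
    """
    column_missing = filter_field_id > file.max_field_id
    # A missing column is all nulls, so only filters that cannot match null
    # (such as `col IN ("a", "b")`) allow the file to be skipped.
    return column_missing and not filter_accepts_null


def plan_files(files, filter_field_id, filter_accepts_null=False):
    """Keep only the files that might contain matching rows."""
    return [
        f for f in files
        if not can_skip_file(f, filter_field_id, filter_accepts_null)
    ]
```

For example, if `new_data_column` has field ID 4, a file written with a
highest field ID of 3 is skipped for `new_data_column IN ("a", "b")` but
kept for `new_data_column IS NULL`. No schema lookup is needed for either
decision.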

As Fokko said, this probably depends on your use case. I'm talking about
cases that I've thought about but if you have a different one in mind,
we're open to adding this.

Ryan

On Fri, Feb 14, 2025 at 11:42 AM Devin Smith
<devinsm...@deephaven.io.invalid> wrote:

> Thanks for the info, it is very helpful. I see it debugging down through
> `org.apache.iceberg.ManifestReader#readMetadata`. It wasn't obvious to me
> that this sort of data would be in the avro metadata as opposed to the
> org.apache.iceberg.ManifestFile object. I may have some questions later
> about the writing side of the equation in these regards...
>
> BTW, it looks like either the spec is incorrect, or the java
> implementation is incorrect; I see `schema` being written to the manifest
> header metadata, but not `schema-id`.
>
>
> https://github.com/apache/iceberg/blob/apache-iceberg-1.8.0/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L346-L355
>
>
> https://github.com/apache/iceberg/blob/apache-iceberg-1.8.0/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L312-L321
>
>
>
> On Fri, Feb 14, 2025 at 10:26 AM Fokko Driesprong <fo...@apache.org>
> wrote:
>
>> Hi Devin,
>>
>> The schema-id is stored in the Manifest Avro header:
>> https://iceberg.apache.org/spec/#manifests Also the schema itself is
>> stored there. Would that help your situation? I think this makes adding it
>> to the data file redundant.
>>
>> Kind regards,
>> Fokko
>>
>> On Fri, Feb 14, 2025 at 17:56 Devin Smith
>> <devinsm...@deephaven.io.invalid> wrote:
>>
>>> I want to make sure I'm not missing something that already exists;
>>> otherwise, hoping to get a quick thumbs up / thumbs down on a potential
>>> proposal before spending more time on it.
>>>
>>> It would be nice to know what Iceberg schema a writer used (/assumed)
>>> when writing a DataFile. Oftentimes, this information is written into the
>>> parquet file's metadata, but it would be great if Iceberg provided this
>>> directly. A schema_id on DataFile would be nice, I think.
>>>
>>> Thanks,
>>> -Devin
>>>
>>
