I'm coming at this from a mental model where producers to a given Table are
tightly coupled to a specific Schema. That is, even as the Table's Schema
is evolved, a producer's logic is unchanged: it produces parquet files that
have the same parquet metadata and columns. (This model may place
additional restrictions on how a Schema may be evolved, but I do think this
is a pretty common approach.)

Ultimately, I'd like to know the producer's Schema for each DataFile (/
ManifestFile) for query planning purposes, with the intention of not
needing to read the parquet (meta)data unless absolutely necessary.

As mentioned in the earlier response, the manifest avro header contains a
Schema, but unfortunately, the typical write pattern we are using in Java
(and that I assume others are using as well) does not preserve the
producer's Schema as I might have expected. That is, using
`DataFiles#builder(PartitionSpec)` with a constant `spec` for the producer,
along with `AppendFiles#appendFile`, does not persist `spec.schema()` to
the ManifestFile. (Digging into the object models, this makes some sense,
but it seems like a rough edge of using `PartitionSpec`: "does the
PartitionSpec's Schema matter for this API or not?")

It seems like we need to use the lower-level API
`AppendFiles#appendManifest` (i.e., take over writing the manifest
ourselves) if we actually want to preserve the producer's Schema... but
even then, the specification may not technically allow preserving the
_producer's_ Schema, as it does seem to explicitly call for the current
table schema:

> the table schema at the time the manifest was written

Which somewhat leads me back to the original question, maybe a bit more
targeted: is there a way to capture the producer's specific Schema as
Iceberg metadata?

It seems like the answer today may be "no". I understand the concern that
storing a schema_id would prevent the removal of old schemas from table
metadata; I wonder if adding an optional "producer's Schema" object to
DataFile or ManifestFile would be appropriate? (Or maybe some additional
auxiliary structure / file for Schemas, instead of needing to store them
directly in the table metadata?)

I think storing the highest field ID solves a subset of query planning use
cases, but it's less precise... for example, the question "does this
DataFile have this field ID?" would be answered with a "No" / "Maybe" (as
opposed to "No" / "Yes" if you knew the exact Schema).
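To make the precision difference concrete, here is a small Python sketch (hypothetical helper names, not Iceberg code) of the two answers a planner could give to "does this DataFile have this field ID?", depending on whether it knows the producer's exact Schema or only the highest field ID:

```python
def has_field_exact_schema(schema_field_ids, field_id):
    """With the producer's exact Schema, the answer is definitive."""
    return "Yes" if field_id in schema_field_ids else "No"

def has_field_highest_id(highest_field_id, field_id):
    """With only the highest field ID, the answer is No / Maybe:
    a field ID above the highest definitely wasn't written, but an ID
    at or below it may or may not be present in the producer's Schema."""
    return "Maybe" if field_id <= highest_field_id else "No"

# Suppose the producer's Schema had field IDs {1, 2, 5} (highest = 5),
# and the query filters on field ID 3 (added to the table later, never
# written by this producer).
print(has_field_exact_schema({1, 2, 5}, 3))  # "No": the file can be pruned
print(has_field_highest_id(5, 3))            # "Maybe": the file must be read
```

The highest-field-ID approach still prunes correctly for IDs above the watermark (e.g. field ID 7 here yields "No" either way); it only loses precision for IDs at or below it.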

On Fri, Feb 14, 2025 at 12:10 PM rdb...@gmail.com <rdb...@gmail.com> wrote:

> We've considered this in the past and I'm undecided on it. There is some
> benefit, like being able to prune files during planning if the file didn't
> contain a column that is used in a non-null filter (i.e. `new_data_column
> IN ("a", "b")`).
>
> On the other hand, we don't want data files that were written with older
> schemas to prevent removing a schema from table metadata. The current
> schema can always be used to read and we don't want to compromise that,
> need to keep around too many schemas, or need to scan all metadata to
> remove schemas.
>
> I think my preference is to instead include the highest field ID in the
> schema used to write a file. That enables the `new_data_column` filter
> logic above, but never requires keeping schemas around.
>
> As Fokko said, this probably depends on your use case. I'm talking about
> cases that I've thought about but if you have a different one in mind,
> we're open to adding this.
>
> Ryan
>
> On Fri, Feb 14, 2025 at 11:42 AM Devin Smith
> <devinsm...@deephaven.io.invalid> wrote:
>
>> Thanks for the info, it is very helpful. I see it debugging down through
>> `org.apache.iceberg.ManifestReader#readMetadata`. It wasn't obvious to me
>> that this sort of data would be in the avro metadata as opposed to the
>> org.apache.iceberg.ManifestFile object. I may have some questions later
>> about the writing side of the equation in these regards...
>>
>> BTW, it looks like either the spec is incorrect, or the java
>> implementation is incorrect; I see `schema` being written to the manifest
>> header metadata, but not `schema-id`.
>>
>>
>> https://github.com/apache/iceberg/blob/apache-iceberg-1.8.0/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L346-L355
>>
>>
>> https://github.com/apache/iceberg/blob/apache-iceberg-1.8.0/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L312-L321
>>
>>
>>
>> On Fri, Feb 14, 2025 at 10:26 AM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>>> Hi Devin,
>>>
>>> The schema-id is stored in the Manifest Avro header:
>>> https://iceberg.apache.org/spec/#manifests Also the schema itself is
>>> stored there. Would that help your situation? I think this makes adding it
>>> to the data file redundant.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> Op vr 14 feb 2025 om 17:56 schreef Devin Smith
>>> <devinsm...@deephaven.io.invalid>:
>>>
>>>> I want to make sure I'm not missing something that already exists;
>>>> otherwise, hoping to get a quick thumbs up / thumbs down on a potential
>>>> proposal before spending more time on it.
>>>>
>>>> It would be nice to know what Iceberg schema a writer used (/assumed)
>>>> when writing a DataFile. Oftentimes, this information is written into the
>>>> parquet file's metadata, but it would be great if Iceberg provided this
>>>> directly. A schema_id on DataFile would be nice, I think.
>>>>
>>>> Thanks,
>>>> -Devin
>>>>
>>>