I'm coming at this from a mental model where the producer(s) for a given Table are tightly coupled to a specific Schema. That is, even as the Table's Schema evolves, the producers' logic is unchanged: they produce parquet files that have the same parquet metadata and columns. (This model may place additional restrictions on how a Schema may be evolved... but I do think this is a pretty common approach.)
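To make that model concrete, here's a minimal, self-contained sketch; the schema versions and field IDs are hypothetical, and this is plain Java modeling the idea, not the Iceberg API:

```java
import java.util.Set;

public class ProducerSchemaModel {
    public static void main(String[] args) {
        // Hypothetical field IDs: the producer was built against schema v1
        // and always writes exactly these columns.
        Set<Integer> producerFieldIds = Set.of(1, 2, 3);

        // The Table's Schema later evolves additively to v2 (field 4 added).
        Set<Integer> tableSchemaV2FieldIds = Set.of(1, 2, 3, 4);

        // The producer's files remain readable under the evolved schema:
        // every field it writes still exists, and the new field 4 simply
        // reads as null for its files.
        boolean producerStillValid = tableSchemaV2FieldIds.containsAll(producerFieldIds);
        System.out.println(producerStillValid); // prints "true"
    }
}
```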
Ultimately, I'd like to know the producer's Schema for each DataFile (/ ManifestFile) for query planning purposes, with the intention of not needing to read the parquet (meta)data unless absolutely necessary.

As mentioned in the earlier response, the manifest avro header contains a Schema, but unfortunately, the typical write patterns we are using in Java (and, I assume, others may be using) do not seem to preserve the producer's Schema as I might have expected. That is, using `DataFiles#builder(PartitionSpec)` with a constant `spec` for the producer, along with `AppendFiles#appendFile`, does not persist `spec.schema()` to the ManifestFile. (Digging into the object models, this makes some sense, but it seems like a rough edge of using `PartitionSpec`: "does the PartitionSpec's Schema matter for this API or not?") It seems like we need to use the lower-level API `AppendFiles#appendManifest` (i.e., take over writing the manifest yourself) if we actually want to preserve the producer's Schema... but even then, the specification may not technically allow the preservation of the _producer's_ Schema, as it does seem to explicitly call out using the latest schema:

> the table schema at the time the manifest was written

Which somewhat leads me back to the original question, maybe a bit more targeted: is there a way to capture the producer's specific Schema as Iceberg metadata? It seems like the answer today may be "no".

I understand the concern that storing a schema_id could prevent the removal of old schemas from table metadata; I wonder if adding an optional "producer's Schema" object to DataFile or ManifestFile would be appropriate? (Or maybe some additional auxiliary structure / file for Schemas, instead of needing to store them directly in the table metadata?)

I think storing the highest field ID solves a subset of query planning use cases, but it's less precise... for example, the question "does this DataFile have this field ID?"
would be answered with a "No" / "Maybe" (as opposed to a "No" / "Yes" if you knew the exact Schema).

On Fri, Feb 14, 2025 at 12:10 PM rdb...@gmail.com <rdb...@gmail.com> wrote:

> We've considered this in the past and I'm undecided on it. There is some
> benefit, like being able to prune files during planning if the file didn't
> contain a column that is used in a non-null filter (i.e. `new_data_column
> IN ("a", "b")`).
>
> On the other hand, we don't want data files that were written with older
> schemas to prevent removing a schema from table metadata. The current
> schema can always be used to read and we don't want to compromise that,
> need to keep around too many schemas, or need to scan all metadata to
> remove schemas.
>
> I think my preference is to instead include the highest field ID in the
> schema used to write a file. That enables the `new_data_column` filter
> logic above, but never requires keeping schemas around.
>
> As Fokko said, this probably depends on your use case. I'm talking about
> cases that I've thought about but if you have a different one in mind,
> we're open to adding this.
>
> Ryan
>
> On Fri, Feb 14, 2025 at 11:42 AM Devin Smith
> <devinsm...@deephaven.io.invalid> wrote:
>
>> Thanks for the info, it is very helpful. I see it debugging down through
>> `org.apache.iceberg.ManifestReader#readMetadata`. It wasn't obvious to me
>> that this sort of data would be in the avro metadata as opposed to the
>> org.apache.iceberg.ManifestFile object. I may have some questions later
>> about the writing side of the equation in these regards...
>>
>> BTW, it looks like either the spec is incorrect, or the java
>> implementation is incorrect; I see `schema` being written to the manifest
>> header metadata, but not `schema-id`.
>>
>> https://github.com/apache/iceberg/blob/apache-iceberg-1.8.0/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L346-L355
>> https://github.com/apache/iceberg/blob/apache-iceberg-1.8.0/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L312-L321
>>
>> On Fri, Feb 14, 2025 at 10:26 AM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>>> Hi Devin,
>>>
>>> The schema-id is stored in the Manifest Avro header:
>>> https://iceberg.apache.org/spec/#manifests Also the schema itself is
>>> stored there. Would that help your situation? I think this makes adding it
>>> to the data file redundant.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> Op vr 14 feb 2025 om 17:56 schreef Devin Smith
>>> <devinsm...@deephaven.io.invalid>:
>>>
>>>> I want to make sure I'm not missing something that already exists;
>>>> otherwise, hoping to get a quick thumbs up / thumbs down on a potential
>>>> proposal before spending more time on it.
>>>>
>>>> It would be nice to know what Iceberg schema a writer used (/assumed)
>>>> when writing a DataFile. Oftentimes, this information is written into the
>>>> parquet file's metadata, but it would be great if Iceberg provided this
>>>> directly. A schema_id on DataFile would be nice, I think.
>>>>
>>>> Thanks,
>>>> -Devin
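P.S. To make the highest-field-ID trade-off above concrete, here's a self-contained sketch of the pruning answers each approach can give to "does this DataFile contain field X?". This is plain Java modeling the idea, not the Iceberg API, and the field IDs are hypothetical:

```java
import java.util.Set;

public class FieldPruning {
    /**
     * With only the highest field ID recorded for a file, absence is
     * certain for IDs above it, but presence is never certain: an ID at
     * or below the highest may not have been in the writer's schema.
     */
    static String hasFieldGivenHighestId(int fieldId, int highestFieldId) {
        return fieldId > highestFieldId ? "No" : "Maybe";
    }

    /** With the producer's exact schema, the answer is always definite. */
    static String hasFieldGivenSchema(int fieldId, Set<Integer> producerFieldIds) {
        return producerFieldIds.contains(fieldId) ? "Yes" : "No";
    }

    public static void main(String[] args) {
        // Hypothetical: the writer's schema had fields {1, 2, 5}.
        Set<Integer> producerFieldIds = Set.of(1, 2, 5);
        int highestFieldId = 5;

        // Field 7 exceeds the highest ID: prunable either way.
        System.out.println(hasFieldGivenHighestId(7, highestFieldId)); // prints "No"
        // Field 3 is below the highest ID but was never written:
        // highest-ID metadata can't prune, the exact schema can.
        System.out.println(hasFieldGivenHighestId(3, highestFieldId)); // prints "Maybe"
        System.out.println(hasFieldGivenSchema(3, producerFieldIds));  // prints "No"
    }
}
```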