I agree with Dan that type promotion should be well-defined. If it's a grey area, then we should clarify it in the spec.
How it works today is that schema evolution always produces a schema that can read files written with any older schema. When a type is promoted, the new schema can read any older data file, but readers may need to promote values the way the [int-to-long reader](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L546-L560) does. You aren't guaranteed to be able to read new data using an older schema, so the latest schema should always be used, or you should use the schema attached to a snapshot.

Because files with older schemas can always be read, it is safe to write files with an older schema. As Steven noted, this happens fairly regularly in cases where a writer has a fixed schema and is long-running.

Ryan

On Thu, Aug 21, 2025 at 5:37 PM Steven Wu <stevenz...@gmail.com> wrote:
>
> > This means that you can have writers using different schemas to write
> > (use cases include different partitioning or "out-of-date" writers), but
> > the data is still valid.
>
> +1 on Dan's point. Both batch and streaming writers can have stale schemas;
> long-running streaming jobs may stay stale for extended periods before
> picking up the new schema during a restart.
>
> On Wed, Aug 20, 2025 at 2:50 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> I think I'm going to disagree and argue that it's not really a gray area.
>>
>> Having strict schema evolution rules, along with how schemas are tracked,
>> means that there is independence between writer and reader schemas, which
>> remain compatible due to the evolution rules.
>>
>> This means that you can have writers using different schemas to write (use
>> cases include different partitioning or "out-of-date" writers), but the
>> data is still valid.
>>
>> How you promote the physical representation during a read/scan operation
>> results in a consistent presentation with the read schema.
>>
>> All of the representations are technically valid.
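[Editor's note: the promotion Ryan and Dan describe lives entirely on the read path: files keep their physical values, and the reader widens them to whatever the read schema asks for. A minimal sketch of that idea in Python; the names here are hypothetical illustrations, not Iceberg's actual API.]

```python
# Illustrative sketch only (hypothetical names, not Iceberg's actual API).
# It mirrors the idea behind Iceberg's Java int-to-long Parquet reader: the
# file keeps its physical values, and the reader widens them to the type the
# read schema asks for. Only spec-allowed widenings are defined.

PROMOTIONS = {
    ("int", "long"): int,       # every 32-bit int is exactly representable as a 64-bit long
    ("float", "double"): float, # every float is exactly representable as a double
}

def reader_for(file_type, read_type):
    """Return a function that converts a physical file value to the read type."""
    if file_type == read_type:
        return lambda v: v      # same type: no promotion needed
    promote = PROMOTIONS.get((file_type, read_type))
    if promote is None:
        # e.g. ("long", "int"): narrowing is not a legal promotion
        raise TypeError(f"cannot promote {file_type} to {read_type}")
    return promote
```

Going the other direction (a long file value through an int read schema) raises, which matches Ryan's point that an older schema is not guaranteed to be able to read newer data.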
>> -Dan
>>
>> On Mon, Aug 18, 2025 at 7:46 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> +1 to what Micah said :) sorry about the typo
>>>
>>> On Mon, Aug 18, 2025 at 9:45 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> +1 to what Micaah said. We have never really written rules about what is
>>>> "allowed" in this particular context, but since a reader needs to be able
>>>> to handle both int and long values for the column, there isn't really any
>>>> danger in writing new files with the narrower type. If a reader couldn't
>>>> handle this, then type promotion would be impossible.
>>>>
>>>> I would include all columns in the file; the space requirements for an
>>>> all-null column (or an all-constant column) should be very small. I
>>>> believe the reason we originally wrote those rules in was to avoid folks
>>>> doing the Hive-style implicit columns from the partition tuple (although
>>>> we also have handling for this).
>>>>
>>>> On Sun, Aug 17, 2025 at 11:15 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>
>>>>> Hi Nic,
>>>>>
>>>>> This is IMO a gray area.
>>>>>
>>>>>> However, is it allowed to commit *new* parquet files with the old
>>>>>> types (int) and commit them to the table with a table schema where
>>>>>> types are promoted (long)?
>>>>>
>>>>> IMO I would expect writers to write files that are consistent with the
>>>>> current metadata, so ideally they would not be written with int if it
>>>>> is now long. In general, though, I think most readers are robust to
>>>>> reading type-promoted files. We should probably clarify this in the
>>>>> specification.
>>>>>
>>>>>> Also, is it allowed to commit parquet files, in general, which contain
>>>>>> only a subset of the columns of the table schema? I.e., if I know a
>>>>>> column is all NULLs, can we just skip writing it?
>>>>> As currently worded, the spec on writing data files
>>>>> (https://iceberg.apache.org/spec/#writing-data-files) says files should
>>>>> include all columns. Based on the column projection rules
>>>>> (https://iceberg.apache.org/spec/#column-projection), however, failing
>>>>> to do so should also not cause problems.
>>>>>
>>>>> Cheers,
>>>>> Micah
>>>>>
>>>>> On Fri, Aug 15, 2025 at 8:45 AM Nicolae Vartolomei <n...@nvartolomei.com.invalid> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm implementing an Iceberg writer[^1] and have a question about what
>>>>>> type promotion actually means as part of the schema evolution rules.
>>>>>>
>>>>>> The Iceberg spec [specifies][spec-evo] which type promotions are
>>>>>> allowed. No confusion there.
>>>>>>
>>>>>> The confusion on my end arises when it comes to actually writing, e.g.,
>>>>>> parquet data. Let's take the int-to-long promotion as an example. What
>>>>>> is actually allowed under this promotion rule? Let me try to show what
>>>>>> I mean.
>>>>>>
>>>>>> Obviously, if I have a schema-id N with field A of type int, and table
>>>>>> snapshots with this schema, then it is possible to update the table to
>>>>>> a schema-id > N where field A now has type long, and this new schema
>>>>>> can read parquet files with the old type.
>>>>>>
>>>>>> However, is it allowed to write *new* parquet files with the old type
>>>>>> (int) and commit them to the table with a table schema where the type
>>>>>> is promoted (long)?
>>>>>>
>>>>>> Also, is it allowed to commit parquet files, in general, which contain
>>>>>> only a subset of the columns of the table schema? I.e., if I know a
>>>>>> column is all NULLs, can we just skip writing it?
>>>>>>
>>>>>> I appreciate you taking the time to look at this,
>>>>>> Nic
>>>>>>
>>>>>> [spec-evo]: https://iceberg.apache.org/spec/#schema-evolution
>>>>>> [^1]: This is for Redpanda's native Iceberg integration
>>>>>> (https://github.com/redpanda-data/redpanda).
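[Editor's note: for Nic's second question, the column projection rules Micah links to are what make an omitted all-NULL column safe: read-schema fields are resolved against file columns by field id, and a field id missing from the file surfaces as NULL. A small Python sketch of that resolution follows; the names are hypothetical, not a real Iceberg API, and it shows only the null case, not the spec's default-value handling for newly added fields.]

```python
# Sketch of Iceberg-style column projection (hypothetical names, not a real
# Iceberg API): columns are resolved by field id rather than by position or
# name, and a field id absent from the data file is read as NULL. This is why
# a file that skips an all-NULL column is still readable under the full schema.

def project_row(file_row, read_schema):
    """file_row maps field id -> value; read_schema maps field id -> column name."""
    return {name: file_row.get(field_id)  # absent field id -> None (NULL)
            for field_id, name in read_schema.items()}

read_schema = {1: "id", 2: "payload", 3: "comment"}  # full table schema
file_row = {1: 42, 2: "hello"}                       # "comment" was never written

# project_row(file_row, read_schema)
# -> {'id': 42, 'payload': 'hello', 'comment': None}
```

Resolving by field id rather than name is also what keeps this safe across renames: a renamed column still matches its file data, while a truly absent column falls through to NULL.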