I opened https://github.com/apache/iceberg/pull/13936 as a draft proposal to capture the conversation.
BTW, I think this brings up one area that I don't think the specification handles: changing between nullable and non-nullable fields. Outdated schemas have implications in these cases as well.

Cheers,
Micah

On Tue, Aug 26, 2025 at 10:13 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I think the original question is ambiguous. We should probably split
> it into two questions:
>
> 1. Is it OK to write out an "int" instead of a "long" if the writer's
> schema says the value is a long?
>
> I think the answer here is that we recommend not doing so, even though it
> would likely work.
>
> 2. Is it OK to use an older schema for writing?
>
> The consensus on the thread seems to be yes. I'll note that this can
> cause confusing results when the "write-default" [1] value for a column
> changes. We should probably have an implementation note to clarify:
> a. Using a stale schema is allowed.
> b. It might cause inconsistent results in the face of multiple writers
> when default values are used.
>
> Thoughts?
>
> Thanks,
> Micah
>
> On Mon, Aug 25, 2025 at 4:59 PM Ryan Blue <rdb...@gmail.com> wrote:
>
>> I agree with Dan that type promotion should be well-defined. If it's a
>> grey area then we should clarify it in the spec.
>>
>> How it works today is that schema evolution always produces a schema that
>> can read files written with any older schema. When a type is promoted, the
>> new schema can read any older data file, but readers may need to promote
>> values like the [int-to-long reader](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L546-L560)
>> does. You aren't guaranteed to be able to read new data using an older
>> schema, so the latest schema should always be used, or you should use the
>> schema attached to a snapshot.
>>
>> Because files with older schemas can always be read, it is safe to write
>> files with an older schema.
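To make the read-time widening concrete, here is a minimal, hypothetical sketch of a reader whose file column is physically int but whose read schema says long, in the spirit of the linked ParquetValueReaders code. The class and method names are illustrative assumptions, not Iceberg APIs:

```java
// Hypothetical sketch (illustrative names, not Iceberg APIs) of the
// read-time widening described above: the data file's physical type is
// int, the read schema says long, and values are promoted as they are
// read, in the spirit of Iceberg's int-to-long ParquetValueReader.
public class IntAsLongReader {
    private final int[] physicalValues; // column values as stored in the file
    private int position = 0;

    public IntAsLongReader(int[] physicalValues) {
        this.physicalValues = physicalValues;
    }

    // Widen each stored int to the long the read schema expects.
    public long read() {
        return (long) physicalValues[position++];
    }
}
```

The key point is that promotion happens on read, which is why a file written with the narrower type stays readable under the evolved schema.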
>> This happens fairly regularly, as Steven noted,
>> in cases where a writer has a fixed schema and is long-running.
>>
>> Ryan
>>
>> On Thu, Aug 21, 2025 at 5:37 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> > This means that you can have writers using different schemas to write
>>> > (use cases include different partitioning or "out-of-date" writers), but
>>> > the data is still valid.
>>>
>>> +1 on Dan's point. Both batch and streaming writers can have stale
>>> schemas. Long-running streaming jobs may stay stale for extended periods
>>> before picking up the new schema during restart.
>>>
>>> On Wed, Aug 20, 2025 at 2:50 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> I think I'm going to disagree and argue that it's not really a gray
>>>> area.
>>>>
>>>> Having strict schema evolution rules and tracking how schemas evolve
>>>> means that there is independence between writer and reader schemas,
>>>> which remain compatible due to the evolution rules.
>>>>
>>>> This means that you can have writers using different schemas to write
>>>> (use cases include different partitioning or "out-of-date" writers), but
>>>> the data is still valid.
>>>>
>>>> Promoting the physical representation during a read/scan operation
>>>> results in a presentation that is consistent with the read schema.
>>>>
>>>> All of the representations are technically valid.
>>>>
>>>> -Dan
>>>>
>>>> On Mon, Aug 18, 2025 at 7:46 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> +1 to what Micah said :) sorry about the typo
>>>>>
>>>>> On Mon, Aug 18, 2025 at 9:45 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> +1 to what Micaah said. We have never really written rules about what
>>>>>> is "allowed" in this particular context, but since a reader needs to
>>>>>> be able to handle both int and long values for the column, there isn't
>>>>>> really any danger in writing new files with the narrower type.
>>>>>> If a reader couldn't handle this, then type promotion would be
>>>>>> impossible.
>>>>>>
>>>>>> I would include all columns in the file; the space requirements for
>>>>>> an all-null column (or an all-constant column) should be very small.
>>>>>> I believe the reason we originally wrote those rules was to avoid
>>>>>> folks doing the Hive-style implicit columns from the partition tuple
>>>>>> (although we also have handling for this).
>>>>>>
>>>>>> On Sun, Aug 17, 2025 at 11:15 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Nic,
>>>>>>> This is IMO a gray area.
>>>>>>>
>>>>>>>> However, is it allowed to commit *new* parquet files with the old
>>>>>>>> types (int) and commit them to the table with a table schema where
>>>>>>>> types are promoted (long)?
>>>>>>>
>>>>>>> IMO I would expect writers to write files that are consistent with
>>>>>>> the current metadata, so ideally they would not be written with int
>>>>>>> if it is now long. In these cases, though, I think most readers are
>>>>>>> robust to reading type-promoted files. We should probably clarify
>>>>>>> this in the specification.
>>>>>>>
>>>>>>>> Also, is it allowed to commit parquet files, in general, which
>>>>>>>> contain only a subset of the columns of the table schema? I.e. if I
>>>>>>>> know a column is all NULLs, can we just skip writing it?
>>>>>>>
>>>>>>> As currently worded, the spec on writing data files
>>>>>>> (https://iceberg.apache.org/spec/#writing-data-files) says to include
>>>>>>> all columns. Based on the column projection rules
>>>>>>> (https://iceberg.apache.org/spec/#column-projection), however,
>>>>>>> failing to do so should also not cause problems.
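The column projection rules Micah links can be sketched roughly as follows: columns are resolved by field id, and a field id present in the read schema but missing from the data file is filled with null (or the field's default, when one is defined). This is a minimal, hypothetical sketch; the names are illustrative, not Iceberg APIs:

```java
import java.util.Map;

// Hypothetical sketch (illustrative names, not Iceberg APIs) of the
// column projection rule: columns are matched by field id, and a field id
// that is absent from the data file is read as null, or as the field's
// default when one is defined.
public class ProjectColumn {
    // fileColumns maps field id -> value for the columns physically
    // present in the data file.
    public static Object projectById(Map<Integer, Object> fileColumns,
                                     int fieldId, Object defaultValue) {
        if (fileColumns.containsKey(fieldId)) {
            return fileColumns.get(fieldId);
        }
        return defaultValue; // null when the field has no default
    }
}
```

This is why an omitted all-null column is recoverable on read, even though the spec's wording on writing data files asks for all columns to be present.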
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Micah
>>>>>>>
>>>>>>> On Fri, Aug 15, 2025 at 8:45 AM Nicolae Vartolomei <n...@nvartolomei.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm implementing an Iceberg writer[^1] and have a question about
>>>>>>>> what type promotion actually means as part of the schema evolution
>>>>>>>> rules.
>>>>>>>>
>>>>>>>> The Iceberg spec [specifies][spec-evo] which type promotions are
>>>>>>>> allowed. No confusion there.
>>>>>>>>
>>>>>>>> The confusion on my end arises when it comes to actually writing,
>>>>>>>> e.g., parquet data. Let's take the int-to-long promotion as an
>>>>>>>> example. What is actually allowed under this promotion rule? Let me
>>>>>>>> try to show what I mean.
>>>>>>>>
>>>>>>>> Obviously, if I have a schema-id N with field A of type int and
>>>>>>>> table snapshots with this schema, then it is possible to update the
>>>>>>>> table schema-id to > N where field A now has type long, and this
>>>>>>>> new schema can read parquet files with the old type.
>>>>>>>>
>>>>>>>> However, is it allowed to commit *new* parquet files with the old
>>>>>>>> types (int) and commit them to the table with a table schema where
>>>>>>>> types are promoted (long)?
>>>>>>>>
>>>>>>>> Also, is it allowed to commit parquet files, in general, which
>>>>>>>> contain only a subset of the columns of the table schema? I.e. if I
>>>>>>>> know a column is all NULLs, can we just skip writing it?
>>>>>>>>
>>>>>>>> I appreciate you taking the time to look at this,
>>>>>>>> Nic
>>>>>>>>
>>>>>>>> [spec-evo]: https://iceberg.apache.org/spec/#schema-evolution
>>>>>>>> [^1]: This is for the Redpanda-to-Iceberg native integration
>>>>>>>> (https://github.com/redpanda-data/redpanda).
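As an illustration of the write-default concern Micah raises earlier in the thread: if two writers hold different schema versions and both omit a column, each fills it from its own version's write-default, leaving inconsistent values in the table. A minimal, hypothetical sketch, with illustrative names only:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (illustrative names, not Iceberg APIs) of the
// write-default hazard: a writer fills any column omitted from an
// incoming record using its own schema version's write-default, so
// writers holding different schema versions can materialize different
// values for the same omitted column.
public class DefaultDrift {
    public static Map<String, Object> write(Map<String, Object> record,
                                            Map<String, Object> writeDefaults) {
        Map<String, Object> row = new HashMap<>(record);
        // Only columns missing from the record get the default.
        for (Map.Entry<String, Object> d : writeDefaults.entrySet()) {
            row.putIfAbsent(d.getKey(), d.getValue());
        }
        return row;
    }
}
```

For example, a stale writer whose schema defaults a `status` column to `"unknown"` and a fresh writer whose schema defaults it to `"new"` would materialize different rows for identical input records, which is the inconsistency the proposed implementation note would warn about.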