> This means that you can have writers using different schemas to write (use cases include different partitioning or "out-of-date" writers), but the data is still valid.
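That independence is easiest to see as a read-time resolution keyed by field id: widen promoted types, and null-fill columns the file lacks. A minimal, purely illustrative sketch (not a real Iceberg API; schemas are reduced to field-id-to-type maps, and real resolution also covers nested types, defaults, and identity-partition metadata):

```python
# What the table's current schema declares (field id -> type).
READ_SCHEMA = {1: "long", 2: "string", 3: "double"}

# What a stale writer actually put in one data file (field 3 is absent).
FILE_SCHEMA = {1: "int", 2: "string"}

# Two of the widening promotions the spec allows.
ALLOWED_PROMOTIONS = {("int", "long"), ("float", "double")}

def plan_read(read_schema, file_schema):
    """Decide, per field id, how to materialize each column of the read schema."""
    plan = {}
    for field_id, read_type in read_schema.items():
        file_type = file_schema.get(field_id)
        if file_type is None:
            # Column projection: a field id missing from the file reads as nulls.
            plan[field_id] = "fill with nulls"
        elif file_type == read_type:
            plan[field_id] = "read as-is"
        elif (file_type, read_type) in ALLOWED_PROMOTIONS:
            # Promotion happens at scan time; the file itself is untouched.
            plan[field_id] = f"widen {file_type} -> {read_type}"
        else:
            raise TypeError(f"field {field_id}: cannot read {file_type} as {read_type}")
    return plan

print(plan_read(READ_SCHEMA, FILE_SCHEMA))
# -> {1: 'widen int -> long', 2: 'read as-is', 3: 'fill with nulls'}
```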
+1 on Dan's point. Both batch and streaming writers can have a stale schema. Long-running streaming jobs may stay stale for extended periods before picking up the new schema during a restart.
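For concreteness, here is roughly what such a stale writer's output looks like at the Parquet level, sketched with pyarrow (the column name and field id are made up; pyarrow maps the field metadata key b"PARQUET:field_id" to the Parquet field id that Iceberg readers use for column resolution):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# The table schema now says field id 1 is long, but this writer still
# produces int32 because it has not picked up the new schema yet.
stale_schema = pa.schema([
    pa.field("event_count", pa.int32(), metadata={b"PARQUET:field_id": b"1"}),
])
table = pa.table({"event_count": [1, 2, 3]}, schema=stale_schema)
pq.write_table(table, "stale-writer.parquet")

# A promotion-aware reader widens this column to int64 at scan time,
# since int -> long is a lossless widening.
```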
On Wed, Aug 20, 2025 at 2:50 PM Daniel Weeks <dwe...@apache.org> wrote:

> I think I'm going to disagree and argue that it's not really a gray area.
>
> Having strict schema evolution rules, and the way schemas are tracked, means that there is independence between writer and reader schemas, which remain compatible due to the evolution rules.
>
> This means that you can have writers using different schemas to write (use cases include different partitioning or "out-of-date" writers), but the data is still valid.
>
> How you promote the physical representation during a read/scan operation results in a presentation consistent with the read schema.
>
> All of the representations are technically valid.
>
> -Dan
>
> On Mon, Aug 18, 2025 at 7:46 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> +1 to what Micah said :) sorry about the typo
>>
>> On Mon, Aug 18, 2025 at 9:45 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> +1 to what Micaah, We have never really written rules about what is "allowed" in this particular context, but since a reader needs to be able to handle both int and long values for the column, there isn't really any danger in writing new files with the narrower type. If a reader couldn't handle this, then type promotion would be impossible.
>>>
>>> I would include all columns in the file; the space requirements for an all-null column (or an all-constant column) should be very small. I believe the reason we originally wrote those rules was to avoid folks doing Hive-style implicit columns from the partition tuple (although we also have handling for this).
>>>
>>> On Sun, Aug 17, 2025 at 11:15 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>>> Hi Nic,
>>>>
>>>> This is IMO a gray area.
>>>>
>>>>> However, is it allowed to write *new* Parquet files with the old type (int) and commit them to the table with a table schema where the type is promoted (long)?
>>>>
>>>> IMO, I would expect writers to write files that are consistent with the current metadata, so ideally they would not be written with int if the column is now long. In general, though, in these cases I think most readers are robust to reading type-promoted files. We should probably clarify this in the specification.
>>>>
>>>>> Also, is it allowed to commit Parquet files, in general, which contain only a subset of the columns of the table schema? I.e., if I know a column is all NULLs, can we just skip writing it?
>>>>
>>>> As currently worded, the spec on writing data files (https://iceberg.apache.org/spec/#writing-data-files) says to include all columns. Based on the column projection rules (https://iceberg.apache.org/spec/#column-projection), however, failing to do so should also not cause problems.
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>> On Fri, Aug 15, 2025 at 8:45 AM Nicolae Vartolomei <n...@nvartolomei.com.invalid> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm implementing an Iceberg writer[^1] and have a question about what type promotion actually means as part of the schema evolution rules.
>>>>>
>>>>> The Iceberg spec [specifies][spec-evo] which type promotions are allowed. No confusion there.
>>>>>
>>>>> The confusion on my end arises when it comes to actually writing data, i.e., Parquet files. Let's take, for example, the int-to-long promotion. What is actually allowed under this promotion rule? Let me try to show what I mean.
>>>>>
>>>>> Obviously, if I have schema-id N with field A of type int and table snapshots with this schema, then it is possible to update the table to a schema-id > N where field A now has type long, and this new schema can read Parquet files with the old type.
>>>>>
>>>>> However, is it allowed to write *new* Parquet files with the old type (int) and commit them to the table with a table schema where the type is promoted (long)?
>>>>>
>>>>> Also, is it allowed to commit Parquet files, in general, which contain only a subset of the columns of the table schema? I.e., if I know a column is all NULLs, can we just skip writing it?
>>>>>
>>>>> Appreciate you taking the time to look at this,
>>>>> Nic
>>>>>
>>>>> [spec-evo]: https://iceberg.apache.org/spec/#schema-evolution
>>>>>
>>>>> [^1]: This is for Redpanda's native Iceberg integration (https://github.com/redpanda-data/redpanda).
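For reference, the promotion step that sets up Nic's scenario could look like this with PyIceberg's schema-evolution API. This is a sketch; the catalog, table, and column names are assumptions for illustration:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.types import LongType

catalog = load_catalog("default")
table = catalog.load_table("db.events")

# int -> long is one of the widening promotions the spec allows; data files
# already written with int stay valid and are widened at read time.
with table.update_schema() as update:
    update.update_column("event_count", field_type=LongType())
```

Existing int files remain readable after such a commit; the widening happens at scan time, which is why type promotion can work at all.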