+1 to what Micah said :) Sorry about the typo.

On Mon, Aug 18, 2025 at 9:45 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:
> +1 to what Micaah said. We have never really written rules about what is
> "allowed" in this particular context, but since a reader needs to be able
> to handle both int and long values for the column, there isn't really any
> danger in writing new files with the narrower type. If a reader couldn't
> handle this, then type promotion would be impossible.
>
> I would include all columns in the file; the space requirements for an
> all-null column (or all-constant column) should be very small. I believe
> the reason we originally wrote those rules in was to avoid folks doing the
> Hive-style implicit columns from the partition tuple (although we also
> have handling for this).
>
> On Sun, Aug 17, 2025 at 11:15 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Nic,
>>
>> This is IMO a gray area.
>>
>>> However, is it allowed to commit *new* parquet files with the old
>>> types (int) and commit them to the table with a table schema where
>>> types are promoted (long)?
>>
>> IMO I would expect writers to be writing files that are consistent with
>> the current metadata, so ideally they would not be written with int if it
>> is now long. In general, though, in these cases I think most readers are
>> robust to reading type-promoted files. We should probably clarify this in
>> the specification.
>>
>>> Also, is it allowed to commit parquet files, in general, which contain
>>> only a subset of the columns of the table schema? I.e., if I know a
>>> column is all NULLs, can we just skip writing it?
>>
>> As currently worded, the spec on writing data files
>> (https://iceberg.apache.org/spec/#writing-data-files) says writers should
>> include all columns. Based on the column projection rules
>> (https://iceberg.apache.org/spec/#column-projection), however, failing to
>> do so should also not cause problems.
>>
>> Cheers,
>> Micah
>>
>> On Fri, Aug 15, 2025 at 8:45 AM Nicolae Vartolomei
>> <n...@nvartolomei.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> I'm implementing an Iceberg writer[^1] and have a question about what
>>> type promotion actually means as part of the schema evolution rules.
>>>
>>> The Iceberg spec [specifies][spec-evo] which type promotions are
>>> allowed. No confusion there.
>>>
>>> The confusion on my end arises when it comes to actually writing, e.g.,
>>> parquet data. Let's take the int to long promotion as an example. What
>>> is actually allowed under this promotion rule? Let me try to show what
>>> I mean.
>>>
>>> Obviously, if I have a schema-id N with field A of type int and table
>>> snapshots with this schema, then it is possible to update the table to
>>> a schema-id > N where field A now has type long, and this new schema
>>> can read parquet files with the old type.
>>>
>>> However, is it allowed to commit *new* parquet files with the old
>>> types (int) and commit them to the table with a table schema where the
>>> types are promoted (long)?
>>>
>>> Also, is it allowed to commit parquet files, in general, which contain
>>> only a subset of the columns of the table schema? I.e., if I know a
>>> column is all NULLs, can we just skip writing it?
>>>
>>> Appreciate you taking the time to look at this,
>>> Nic
>>>
>>> [spec-evo]: https://iceberg.apache.org/spec/#schema-evolution
>>> [^1]: This is for Redpanda's native Iceberg integration
>>> (https://github.com/redpanda-data/redpanda).
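Russell's point about why writing the narrower type is safe can be sketched in plain Python. This is an illustrative model only, not PyIceberg's or any real library's API: the `ALLOWED_PROMOTIONS` table and `resolve_column` helper are hypothetical names. A reader resolves each file column against the current table schema and accepts any file type the spec allows to be read as the table type, which is exactly why a file committed with `int` stays readable after the column is promoted to `long`:

```python
# Illustrative model of Iceberg type promotion on the read path.
# ALLOWED_PROMOTIONS and resolve_column are hypothetical, for exposition only.

# File type -> set of table types it may be read as (per the spec's
# promotion rules: int -> long, float -> double).
ALLOWED_PROMOTIONS = {
    "int": {"int", "long"},
    "float": {"float", "double"},
}

def resolve_column(file_type: str, table_type: str, values: list) -> list:
    """Read a file column under the (possibly promoted) table schema."""
    readable_as = ALLOWED_PROMOTIONS.get(file_type, {file_type})
    if table_type not in readable_as:
        raise TypeError(f"cannot read {file_type} column as {table_type}")
    # Widening int -> long loses no information, so files committed with
    # the narrower type remain readable under the promoted schema.
    return values

# A file written with the old type 'int' is still readable after the
# table schema promoted the column to 'long'.
print(resolve_column("int", "long", [1, 2, 3]))  # [1, 2, 3]
```

Note that the narrowing direction (`long` read as `int`) raises, which mirrors why the spec only allows promotions that widen.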
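Similarly, the column projection rule Micah links to can be sketched as follows. Again an illustrative model (the `project_row` helper is hypothetical; real resolution matches on field IDs and honors default values, both omitted here): any table-schema column absent from the data file is simply materialized as NULL by the reader, which is why skipping an all-NULL column should not break readers even though the spec says to write all columns:

```python
def project_row(table_columns: list, file_row: dict) -> dict:
    """Project a file row onto the table schema (hypothetical helper).

    Columns missing from the file (e.g. an all-NULL column the writer
    skipped) read as None; columns not in the table schema are dropped.
    """
    return {col: file_row.get(col) for col in table_columns}

# The file was written without column "b", but readers still produce it.
row = project_row(["a", "b"], {"a": 42})
print(row)  # {'a': 42, 'b': None}
```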