> This means that you can have writers using different schemas to write (use
> cases include different partitioning or "out-of-date" writers), but the
> data is still valid.

+1 on Dan's point. Both batch and streaming writers can have stale
schemas. Long-running streaming jobs may stay stale for extended periods
before picking up the new schema on restart.
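
To make the promotion step concrete, here is a minimal sketch using the
Iceberg Java API (the table handle and the column name "a" are
assumptions for illustration, not from this thread):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.types.Types;

    class PromoteColumn {
        // Assumes `table` was loaded from some catalog and column "a"
        // currently has type int.
        static void widenToLong(Table table) {
            table.updateSchema()
                .updateColumn("a", Types.LongType.get()) // int -> long
                .commit();
            // A writer holding the old schema keeps producing int32
            // files until it refreshes; readers widen at scan time.
        }
    }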

On Wed, Aug 20, 2025 at 2:50 PM Daniel Weeks <dwe...@apache.org> wrote:

> I think I'm going to disagree and argue that it's not really a gray area.
>
> Having strict schema evolution rules, and tracking schemas explicitly,
> means that writer and reader schemas are independent of each other while
> remaining compatible thanks to the evolution rules.
>
> This means that you can have writers using different schemas to write (use
> cases include different partitioning or "out-of-date" writers), but the
> data is still valid.
>
> Promoting the physical representation during a read/scan operation
> results in a presentation that is consistent with the read schema.
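>
> As a rough illustration (a hypothetical reader-side helper in Java, not
> Iceberg's actual implementation): the scan inspects the file's physical
> type and widens to the read schema's type, so int-written and
> long-written files surface identically:
>
>     // Hypothetical: widen a decoded value to the read schema's long.
>     // Old files decode values as Integer, new files as Long.
>     static long readAsLong(Object physicalValue) {
>         if (physicalValue instanceof Integer) {
>             return ((Integer) physicalValue).longValue(); // int32 file
>         }
>         return (Long) physicalValue;                      // int64 file
>     }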
>
> All of the representations are technically valid.
>
> -Dan
>
> On Mon, Aug 18, 2025 at 7:46 AM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> +1 to what Micah said :) sorry about the typo
>>
>> On Mon, Aug 18, 2025 at 9:45 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> +1 to what Micaah. We have never really written rules about what is
>>> "allowed" in this particular context, but since a reader needs to be
>>> able to handle both int and long values for the column, there isn't
>>> really any danger in writing new files with the narrower type. If a
>>> reader couldn't handle this, type promotion would be impossible.
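>>>
>>> For illustration, a sketch (Java with parquet-mr; the schema and names
>>> are assumed, not from this thread) of a file schema a stale writer
>>> might still produce: field id 1 stays INT32 even though the table
>>> column is now long, and readers match on the field id and widen:
>>>
>>>     import org.apache.parquet.schema.MessageType;
>>>     import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
>>>     import org.apache.parquet.schema.Types;
>>>
>>>     class StaleFileSchema {
>>>         // Field id 1 is written as INT32; the Iceberg read schema
>>>         // says long, and the promotion rule makes widening lossless.
>>>         static final MessageType FILE_SCHEMA = Types.buildMessage()
>>>             .optional(PrimitiveTypeName.INT32).id(1).named("a")
>>>             .named("table");
>>>     }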
>>>
>>> I would include all columns in the file; the space requirements for
>>> an all-null column (or all-constant column) should be very small. I
>>> believe the reason we originally wrote those rules was to avoid folks
>>> relying on Hive-style implicit columns from the partition tuple
>>> (although we also have handling for this).
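>>>
>>> (For a rough sense of scale, assuming typical Parquet encoding and
>>> not measured: an all-null column stores no value data at all, only
>>> definition levels, and a run of, say, a million zeros RLE-encodes to
>>> a handful of bytes per page plus the column chunk metadata.)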
>>>
>>> On Sun, Aug 17, 2025 at 11:15 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Hi Nic,
>>>> This is, IMO, a gray area.
>>>>
>>>>> However, is it allowed to write *new* Parquet files with the old
>>>>> type (int) and commit them to the table with a schema where the
>>>>> type is promoted (long)?
>>>>
>>>>
>>>> IMO, I would expect writers to write files that are consistent with
>>>> the current metadata, so ideally they would not be written with int
>>>> if the column is now long. In general, though, I think most readers
>>>> are robust to reading type-promoted files in these cases. We should
>>>> probably clarify this in the specification.
>>>>
>>>>
>>>>> Also, is it allowed, in general, to commit Parquet files which
>>>>> contain only a subset of the table schema's columns? I.e., if I
>>>>> know a column is all NULLs, can we just skip writing it?
>>>>
>>>>
>>>> As currently worded, the spec on writing data files (
>>>> https://iceberg.apache.org/spec/#writing-data-files) says files
>>>> should include all columns. Based on the column projection rules
>>>> <https://iceberg.apache.org/spec/#column-projection>, however,
>>>> failing to do so should also not cause problems.
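>>>>
>>>> As a sketch of how that projection behaves (hypothetical Java, not
>>>> the spec's wording): columns are resolved by field id, and an id
>>>> missing from the data file simply yields nulls:
>>>>
>>>>     import java.util.Map;
>>>>
>>>>     class ProjectionSketch {
>>>>         // Hypothetical: look up a table column's value by field id;
>>>>         // ids absent from the file project as null (identity
>>>>         // partition fields would be filled from the partition tuple).
>>>>         static Object project(Map<Integer, Object> fileValuesById,
>>>>                               int fieldId) {
>>>>             return fileValuesById.get(fieldId); // null when absent
>>>>         }
>>>>     }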
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>> On Fri, Aug 15, 2025 at 8:45 AM Nicolae Vartolomei
>>>> <n...@nvartolomei.com.invalid> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm implementing an Iceberg writer[^1] and have a question about what
>>>>> type promotion actually means as part of schema evolution rules.
>>>>>
>>>>> The Iceberg spec [specifies][spec-evo] which type promotions are allowed.
>>>>> No confusion there.
>>>>>
>>>>> The confusion on my end arises when it comes to actually writing
>>>>> the data, e.g. Parquet files. Let's take the int to long promotion
>>>>> as an example. What exactly is allowed under this promotion rule?
>>>>> Let me try to show what I mean.
>>>>>
>>>>> Obviously, if I have a schema with id N where field A has type int,
>>>>> and table snapshots using this schema, then it is possible to update
>>>>> the table to a schema with id > N where field A now has type long,
>>>>> and this new schema can read Parquet files written with the old type.
>>>>>
>>>>> However, is it allowed to write *new* Parquet files with the old
>>>>> type (int) and commit them to the table with a schema where the
>>>>> type is promoted (long)?
>>>>>
>>>>> Also, is it allowed, in general, to commit Parquet files which
>>>>> contain only a subset of the table schema's columns? I.e., if I
>>>>> know a column is all NULLs, can we just skip writing it?
>>>>>
>>>>> I appreciate you taking the time to look at this,
>>>>> Nic
>>>>>
>>>>> [spec-evo]: https://iceberg.apache.org/spec/#schema-evolution
>>>>> [^1]: This is for Redpanda to Iceberg native integration
>>>>> (https://github.com/redpanda-data/redpanda).
>>>>>
>>>>
