BTW, it was pointed out to me that the unknown type is also useful for SQL systems where somebody writes a query like:

"SELECT null as someColName"

In this case someColName would have an unknown type as well.
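As a purely illustrative sketch of how that can play out (assuming a Spark-SQL-style engine creating an Iceberg v3 table; the table and column names are hypothetical and the exact inference behavior is engine-specific):

    -- The engine never sees a typed value for extra_col, only an untyped
    -- null, so an engine that supports Iceberg v3's unknown type could
    -- record extra_col as unknown in the table schema instead of guessing
    -- a physical type.
    CREATE TABLE demo.events AS
    SELECT id, null AS extra_col
    FROM demo.staging_events;

A column recorded this way can later be promoted to a concrete type as a metadata-only change, which is the property discussed below.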
"SELECT null as someColName" In this case someColName would have an unknown type as well. On Fri, Nov 28, 2025 at 12:59 AM Joana Hrotkó <[email protected]> wrote: > Thanks, Micah! > > On Thu, Nov 27, 2025 at 4:48 AM Micah Kornfield <[email protected]> > wrote: > >> Hi Joana, >> Here are my thoughts, which are by no means the definitive answer here. >> >> >>> 1. Given that variant can store any data type (both structured and >>> primitive), I'm unclear when unknown would be preferred as similar >>> behavior could be achieved by adding nullable variant columns? It seems >>> like variant could handle most schema evolution scenarios. Are there >>> specific situations where unknown is the better choice? >> >> >> I think the point of the type is to not impose on a system the need have >> to use a nullable variant column if it can't infer the type. The variant >> type has more overhead and can't easily be narrowed solely based on a >> metadata operation to other types (but a NullType can easily be widened to >> any type as a metadata operation). >> >> The null type is generally meant from moving from schema-less systems to >> ones with a schema. e.g. A CSV file that has an empty value for every >> field in a particular column. I think Parquet's description of its >> analogous type [1] is a good illustration: >> >> "Sometimes when discovering the schema of existing data, values are >> always null and the physical type can't be determined. This annotation >> signals the case where the physical type was guessed from all null values." >> >> That being said I don't think it is necessarily a bad idea if a system >> wants to use Nullable variants for this use-case. >> >> 2. Also, is unknown intended for explicit use in DDL? Meaning, should >>> users write DDL like: >> >> >> In general, I don't think there is much of a use-case for allowing users >> to set this through DDL, other than perhaps cloning it from an existing >> table. As you pointed out if someone wishing to keep there options open is >> likely better off using variant, or a type that can be widened later. >> >> There are probably multiple ways of handling evolution but two possible >> workable alternatives (I don't think these belong in the iceberg spec): >> 1. Automatically evolve the schema based on the first inserted non-null >> value for the column. >> 2. Block insertions that try to insert a non-null values in the column >> until user explicitly alters the column to a specific type. >> >> Cheers, >> Micah >> >> [1] >> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L330 >> >> On Tue, Nov 18, 2025 at 4:45 AM Joana Hrotkó >> <[email protected]> wrote: >> >>> Hi Iceberg Community, >>> >>> I'm working with Iceberg v3 and trying to understand the practical use >>> cases for the unknown type, especially in relation to the variant type. >>> >>> The variant type handles both semi-structured data (JSON, nested >>> objects/arrays) and primitive types (strings, integers, booleans, dates, >>> timestamps, etc.) with efficient binary encoding. It supports schema >>> evolution and provides good query performance. >>> >>> The unknown type is described as being for "evolving schemas without >>> forcing immediate resolution" and must always default to null. >>> >>> 1. Given that variant can store any data type (both structured and >>> primitive), I'm unclear when unknown would be preferred as similar >>> behavior could be achieved by adding nullable variant columns? 
>> On Tue, Nov 18, 2025 at 4:45 AM Joana Hrotkó <[email protected]> wrote:
>>
>>> Hi Iceberg Community,
>>>
>>> I'm working with Iceberg v3 and trying to understand the practical use
>>> cases for the unknown type, especially in relation to the variant type.
>>>
>>> The variant type handles both semi-structured data (JSON, nested
>>> objects/arrays) and primitive types (strings, integers, booleans, dates,
>>> timestamps, etc.) with efficient binary encoding. It supports schema
>>> evolution and provides good query performance.
>>>
>>> The unknown type is described as being for "evolving schemas without
>>> forcing immediate resolution" and must always default to null.
>>>
>>> 1. Given that variant can store any data type (both structured and
>>> primitive), I'm unclear when unknown would be preferred, as similar
>>> behavior could be achieved by adding nullable variant columns? It seems
>>> like variant could handle most schema evolution scenarios. Are there
>>> specific situations where unknown is the better choice?
>>>
>>> 2. Also, is unknown intended for explicit use in DDL? Meaning, should
>>> users write DDL like:
>>>
>>> CREATE TABLE foo (col1 unknown)
>>> ALTER TABLE foo ADD COLUMN col2 unknown
>>>
>>> Or is unknown an internal type that engines use automatically during
>>> schema evolution?
>>>
>>> Cheers,
>>>
>>> Joana Hrotkó
