Re: [PR] Spec: Add spec for expressions [iceberg]

via GitHub Fri, 05 Jun 2026 15:32:12 -0700


stevenzwu commented on code in PR #16652:
URL: https://github.com/apache/iceberg/pull/16652#discussion_r3359578394



##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+---
+title: "Expressions Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg Expressions
+
+This document defines the structure and behavior of expressions for use in 
Iceberg specifications. The purpose is to define a common structure that 
enables simple expressions to be stored and exchanged.
+
+Stored expressions are needed for use cases like data validations (`CHECK` 
constraints) and default values (for instance, `current_timestamp()`). 
Expressions are exchanged in use cases like server-side scan planning in the 
catalog protocol.
+
+
+## Overview
+
+The goal of this specification is to define a simple expression structure and 
avoid complexity.
+
+To remain simple, the expressions that can be represented are deliberately 
constrained. Value expressions are constants, field references, or function 
calls with value expression arguments. Predicates are comparisons of value 
expressions that produce true or false.
+
+This approach is intended to keep focus on the logical structure of 
expressions. Complexity is pushed to the functions that are called, which can 
be a limited set of well-defined and portable functions (like Iceberg partition 
transforms) or could be user-defined functions that can use the full range of 
SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs 
that are specific to an engine, rather than importing and duplicating dialects 
in Iceberg expressions.
+
+This is consistent with Iceberg's conservative approach in other specs. 
Expressions and predicates are an important part of Iceberg implementation 
APIs, but have been deliberately limited in specifications. For example, sort 
orders and partition fields are strictly limited to a small set of transforms 
over well-defined inputs (source field IDs). This spec is widening what can be 
expressed, but depends on function calls for complex tasks.
+
+This specification covers the structure of Iceberg expressions and includes 
appendices that specify serialization as JSON and a set of portable functions 
defined by Iceberg specifications.
+
+
+## Structure
+
+Iceberg expressions have two types:
+
+* **Value expressions** represent data values and transformations of values 
(function calls) that produce any Iceberg type
+* **Predicates** represent comparisons of value expressions and boolean logic 
that produce `true` or `false`
+
+
+### Value expressions
+
+A value expression is an expression that produces a typed value.
+
+Value expressions can be one of three types: a constant value, a field 
reference, or a function applied to zero or more value expressions.
+
+
+#### Constant values
+
+A constant or literal is the simplest type of value expression that represents 
a specific typed value.
+
+
+#### Field reference
+
+A field reference represents the value of a specific field in a row. When an 
expression is evaluated on a row, it returns the value of the field.
+
+Field references may be named references (unbound) or ID references (bound). 
ID references identify a field by field ID from a schema. Named references 
identify a field by name that must be resolved to an ID (bound to a schema) to 
access the field.
+
+ID references are used for stored expressions, where the identity of the 
column is determined when the stored expression is created. For example, column 
constraints are tied to field ID so that renaming a column does not drop its 
stored constraint.
+
+Named references are used when the identity of the column is determined when 
the expression is evaluated. For example, query filters are resolved each time 
a query runs, so server-side planning uses unbound named references.
+
+The context in which an expression is used determines the type of references 
that are valid. Iceberg specifications should document whether ID references, 
named references, or both are allowed.
+
+
+#### Apply function
+
+An apply expression represents the result of a function applied to (or called 
on) zero or more values produced by child value expressions.
+
+Functions are identified by catalog, namespace, and name.
+
+* Function name is always required
+* Namespace is optional and is assumed to be empty ([]) if it is not present 
or is null
+* Catalog is optional and is assumed to be the catalog in which the 
referencing object is stored if it is not present or is null
+
+The catalog name is used to identify the catalog where the function definition 
can be loaded or it identifies a reserved function set. As in the view and UDF 
specs, catalog names represent connection configurations that may differ across 
environments. Omitting catalog names is recommended to avoid depending on 
consistent environments. For example, if a table has a CHECK constraint that 
references a UDF without a catalog name (missing or null), the UDF should be 
loaded from the table’s catalog.
+
+Reserved function set names are:
+
+* `sql_functions` is used for functions defined by the SQL standard
+* `iceberg_functions` is used for functions defined in this specification
+
+Engines may document and use a catalog name to identify their built-in 
functions that are not part of the SQL spec, like 
`spark_builtin_functions.to_utc_timestamp`.
+
+Producers are responsible for resolving catalog, namespace, and name if the 
environment is relevant. For example, if a SQL engine uses its current catalog 
and namespace to find a function, the resolved catalog and namespace must be 
used to produce an unambiguous function identifier.
+
+
+#### Value expression types
+
+The type produced by a value expression may change. For example, an ID 
reference may produce a widened type after the underlying column's type is 
promoted.
+
+Function calls may produce different types when function definitions change, 
and type changes may change the definition that is resolved for a function 
name. For example, `identity(int) -> int` will change to `identity(long) -> 
long` when an input field is promoted from `int` to `long`.
+
+A value expression's type is determined when it is bound to a specific input 
schema.
+
+If types are incompatible at runtime, implementations binding or evaluating 
expressions may apply type promotion to align types for predicates and to 
resolve functions. Implementations may choose when to promote values to 
accommodate engines that differ in casting behavior. However, implementations 
must fail rather than insert "unsafe" casts. 
+
+
+### Predicates
+
+A predicate is a boolean expression that produces true or false.
+
+Predicates can be constants (true or false), comparisons or tests of value 
expressions, or logical combinations of predicates (AND, OR, NOT).
+
+If value expression types in a predicate are incompatible, implementations 
should align types using type promotion. For instance, `int_col > 5.0` should 
promote int values to float. If the types cannot be aligned according to type 
promotion rules, the predicate must evaluate to false. For instance, `"goats" > 
-Infinity` should always be `false`.
+
+Value expressions are not valid predicates, even when the expression is 
expected to return a boolean value. Value expressions must be compared or 
tested to produce a predicate. For example, `is_empty("")` is not a valid 
predicate, but `is_empty("") = true` is a valid predicate.
+
+
+#### Comparisons
+
+Comparisons are predicates that compare two value expressions with the same 
primitive type. Comparisons are:
+
+| Comparison  | Description |
+|-------------|-------------|
+| `=`         | Is equal |
+| `!=`        | Is not equal |
+| `<`         | Less than |
+| `<=`        | Less than or equal |
+| `>`         | Greater than |
+| `>=`        | Greater than or equal |
+
+Primitive types are compared using natural order, except for the following 
types:
+
+* `false` is less than `true` for `boolean`
+* `fixed` and `binary` use unsigned byte-wise comparison
+* `string` uses unsigned byte-wise comparison of the UTF-8 representation
+* `uuid` uses unsigned byte-wise comparison of the UUID bytes
+* `float` and `double` use IEEE 754 total order after normalizing NaN to the 
canonical NaN (sign bit 0, exponent bits all 1, matissa msb 1 followed by all 0)

Review Comment:
   Two issues: a typo and a framing one.
   
   **Typo:** `matissa` → `mantissa`.
   
   **Framing:** would the version below be clearer?
   
   > `float` and `double` are compared by treating all NaN bit patterns as 
equal to each other and greater than every non-NaN value. Implementations may 
achieve this by canonicalizing NaN bit patterns to a single value (sign bit 0, 
exponent bits all 1, mantissa msb 1 followed by all 0) before applying IEEE 754 
total order.
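
To make the suggested wording concrete, here is a minimal sketch of one way an implementation could do this for `double` values. The names `total_order_key` and `compare_doubles` are hypothetical and are not taken from the spec or from any Iceberg library. The sketch canonicalizes every NaN bit pattern to the single value described above and then orders values by the IEEE 754 total order of their bits, so `-0.0` sorts before `0.0`.

```python
import math
import struct

# Canonical NaN for a double: sign bit 0, exponent bits all 1,
# mantissa msb 1 followed by all 0.
CANONICAL_NAN_BITS = 0x7FF8000000000000
SIGN_BIT = 1 << 63
ALL_BITS = (1 << 64) - 1


def total_order_key(value: float) -> int:
    """Map a double to an unsigned integer whose natural order matches
    IEEE 754 total order after NaN canonicalization (hypothetical helper)."""
    if math.isnan(value):
        # Every NaN bit pattern, including negative NaNs, collapses to one value.
        bits = CANONICAL_NAN_BITS
    else:
        (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    if bits & SIGN_BIT:
        # Negative values: flip all bits so larger magnitudes sort first.
        return bits ^ ALL_BITS
    # Non-negative values: set the sign bit so they sort after all negatives.
    return bits | SIGN_BIT


def compare_doubles(left: float, right: float) -> int:
    """Return -1, 0, or 1 (hypothetical helper)."""
    lk, rk = total_order_key(left), total_order_key(right)
    return (lk > rk) - (lk < rk)
```

With this key, any two NaN values compare as equal and every non-NaN value, including `+Infinity`, compares as less than NaN, which matches both the original bullet and the proposed rewording.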
   
   



##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+Iceberg expressions have two types:
+
+* **Value expressions** represent data values and transformations of values 
(function calls) that produce any Iceberg type
+* **Predicates** represent comparisons of value expressions and boolean logic 
that produce `true` or `false`

Review Comment:
   This phrase is ambiguous: "comparisons of value expressions and boolean 
logic" can be misread as "comparisons of (value expressions and boolean logic)" 
— i.e., comparing value expressions against boolean logic. Suggest splitting:
   
   > **Predicates** are formed from comparisons of value expressions, combined 
with boolean logic, and produce `true` or `false`.
   



##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+Primitive types are compared using natural order, except for the following 
types:
+
+* `false` is less than `true` for `boolean`
+* `fixed` and `binary` use unsigned byte-wise comparison
+* `string` uses unsigned byte-wise comparison of the UTF-8 representation

Review Comment:
   This rule is not derived from an IEEE or RFC standard. Should we add a short 
clarifying note so readers do not look for a normative reference? e.g.:
   
   > `string` uses unsigned byte-wise comparison of the UTF-8 representation. 
This preserves Unicode code-point order and is independent of locale; it is 
**not** the Unicode Collation Algorithm 
([UTS#10](https://www.unicode.org/reports/tr10/)).
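
As an illustration of why the note may be worth adding, here is a small sketch with hypothetical helper names. It compares strings by their UTF-8 bytes and contrasts that with a comparison over UTF-16 code units, which is what Java's `String.compareTo` does. The two orders disagree as soon as a supplementary character is involved.

```python
def compare_strings(left: str, right: str) -> int:
    """Unsigned byte-wise comparison of the UTF-8 representation
    (hypothetical helper). Python compares bytes objects as unsigned bytes."""
    lb, rb = left.encode("utf-8"), right.encode("utf-8")
    return (lb > rb) - (lb < rb)


def compare_utf16_units(left: str, right: str) -> int:
    """Comparison by UTF-16 code units, shown only for contrast.
    Big-endian bytes compare the same way as the 16-bit units."""
    lb, rb = left.encode("utf-16-be"), right.encode("utf-16-be")
    return (lb > rb) - (lb < rb)


fullwidth_tilde = "\uFF5E"    # U+FF5E, UTF-8 bytes EF BD 9E
grinning_face = "\U0001F600"  # U+1F600, UTF-8 bytes F0 9F 98 80

# UTF-8 byte order follows code-point order: U+FF5E sorts before U+1F600.
utf8_result = compare_strings(fullwidth_tilde, grinning_face)

# UTF-16 code units put the surrogate pair D83D DE00 before FF5E.
utf16_result = compare_utf16_units(fullwidth_tilde, grinning_face)
```

Here `utf8_result` is `-1` and `utf16_result` is `1`, so an implementation that relies on a language's default string ordering may need an explicit byte-wise comparator to follow this rule.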
   



##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+Primitive types are compared using natural order, except for the following 
types:
+
+* `false` is less than `true` for `boolean`
+* `fixed` and `binary` use unsigned byte-wise comparison
+* `string` uses unsigned byte-wise comparison of the UTF-8 representation
+* `uuid` uses unsigned byte-wise comparison of the UUID bytes
+* `float` and `double` use IEEE 754 total order after normalizing NaN to the 
canonical NaN (sign bit 0, exponent bits all 1, matissa msb 1 followed by all 0)
+    * `NaN = NaN` is true for any two NaN values
+    * `val < NaN` is true for all non-NaN values
+
+Note type alignment produces `decimal` values with the same scale so that 
comparison is equivalent to the natural order of the unscaled numeric value.
+
+Tests are predicates that test a single value expression, optionally using a 
constant or set of constants. Constants must have the same type and must be 
non-null. Tests are:
+
+| Test                    | Allowed types | Constant type | Description |
+|-------------------------|---------------|---------------|-------------|
+| `IS NULL`               | any           |               | true iff the value 
is null |
+| `IS NOT NULL`           | any           |               | true iff the value 
is not null |
+| `IS NaN`                | float, double |               | true iff the value 
is an IEEE 754 NaN |
+| `IS NOT NaN`            | float, double |               | true iff the value 
is not an IEEE 754 NaN |
+| `STARTS WITH const`     | string        | string        | true iff the 
constant is a prefix of the value |
+| `NOT STARTS WITH const` | string        | string        | true iff the 
constant is not a prefix of the value |
+| `IN (constant set)`     | any           | same as value | true iff the value 
is equal to any constant |
+| `NOT IN (constant set)` | any           | same as value | true iff the value 
is not equal to all constants |
+
+
+#### Boolean logic
+
+Predicates must use 2-valued boolean logic. Evaluation of all predicates must 
produce `true` or `false`.
+
+Engines that implement SQL 3-valued boolean logic must add `IS NULL` and `IS NOT 
NULL` to produce the 2-valued equivalent. This avoids bugs in engines and 
languages that do not natively implement 3-valued logic. For example, the SQL 
predicate `x < 10` should be passed as `x < 10 AND x IS NOT NULL` for a SQL 
`WHERE` condition (or `x < 10`; see null-safe comparisons below). For a `CHECK` 
constraint, the expression is passed as `x < 10 OR x IS NULL`. This ensures 
that implementations will make the correct determination, rather than depending 
on context to interpret a null result (`WHERE` vs `CHECK`).
+
+Logical combinations are boolean operators applied to predicates. `AND` and 
`OR` are binary operations and `NOT` is a unary operation.
+
+Comparisons must be null-safe. For example:
+
+* `null = null` is `true`
+* `34 = null` is `false`
+* `null != null` is `false`
+* `34 != null` is `true`
+* `null < null` is `false`
+* `null <= null` is `true`
+* `34 < null` is `false`
+
+Comparisons must handle null values when value expressions evaluate to null. 
However, value expressions used to define a predicate should not directly 
contain null constants and may reject them. For example, `x = get_item(map, 
"key")` is valid although `get_item` may return a null value, but `x = null` 
should be rejected because `x IS NULL` is the recommended unambiguous predicate.
+
+
+### Compatibility with REST catalog expressions
+
+Older clients use more restrictive forms of predicates and references that 
used a "term" for specific transforms and named references. These expressions 
should be supported for backward compatibility to allow older clients to 
interact with newer REST catalog services.
+
+Prior to this spec, deprecated expressions were passed in the REST API in 3 
places:
+
+* As `filter` passed to server-side scan planning
+* As `filter` passed to the service in `ScanReport`
+* As `residual` passed to the client with a scan task
+
+Both server-side scan planning and the report endpoint can continue to accept 
filters from older clients without issues by parsing term-based expressions 
(see [Appendix B: JSON serialization](#appendix-b-json-serialization)).
+
+Residuals passed from services back to clients that do not use the new syntax 
would cause clients to fail, but services are allowed to omit the residual so 
that it is calculated on the client side (intended to avoid duplicating large 
IN filters). For compatibility, REST services should detect client versions and 
produce deprecated predicates, or omit residuals from tasks.
+
+
+## Appendix A: Iceberg functions
+
+This section defines the functions in the `iceberg_functions` reserved catalog 
name.
+
+* `if_else(condition: predicate, when_true: T, when_false: T) -> T`: returns 
the value of `when_true` when `condition` is true and `when_false` otherwise
+
+### Partition transforms
+
+Iceberg partition transforms are also defined as functions (other than `void`).
+
+All partition transforms produce `null` for a `null` input value.
+
+| Function name     | Description                                              
    | Source types                                                         | 
Result type |
+|-------------------|--------------------------------------------------------------|----------------------------------------------------------------------|-------------|
+| `identity(value)` | Source value, unmodified                                 
    | Any primitive except for `geometry`, `geography`, and `variant`      | 
Source type |
+| `year(value)`     | Extract a date or timestamp year, as years from 1970     
    | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | 
`int`       |
+| `month(value)`    | Extract a date or timestamp month, as months from 
1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, 
`timestamptz_ns` | `int`       |
+| `day(value)`      | Extract a date or timestamp day, as days from 1970-01-01 
    | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | 
`date`      |
+| `hour(value)`     | Extract a timestamp hour, as hours from 1970-01-01 
00:00:00  | `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`        
 | `int`       |
+
+Note that `year`, `month`, and `hour` transforms produce ordinal values and 
not human-readable values. For example, `year(2018-05-13)` produces `48`, not 
`2018`.
+
+Parameterized functions are called as 2-argument functions. The first argument 
is an `int` parameter (`N` or `W` from the table spec) and the second argument 
is the value to transform. For example, `bucket(256, id)` calls `bucket[256]`.
+
+| Parameterized function name | Description                                   
| Source types                                                                  
               | Result type |
+|-----------------------------|-----------------------------------------------|----------------------------------------------------------------------------------------------|-------------|
+| `bucket(N, value)`          | Hash of value, mod `N` (see table spec)       
| Any primitive except for `geometry`, `geography`, `variant`, `boolean`, 
`float`, or `double` | `int`       |
+| `truncate(W, value)`        | Value truncated to width `W` (see table spec) 
| `int`, `long`, `decimal`, `string`, `binary`                                  
               | Source type |
+
+
+## Appendix B: JSON serialization
+
+Iceberg expressions are serialized as JSON objects in table, view, and UDF 
metadata, and in the REST protocol for catalogs.
+
+### Value expressions
+
+```
+EXPR: LITERAL | REFERENCE | APPLY
+
+LITERAL: VALUE
+    | { "type": "literal", "value": VALUE }
+    | { "type": "literal", "value": VALUE, "data-type": DATA_TYPE }
+LITERALS: [ LITERAL* ]
+
+REFERENCE: BOUND_REF | UNBOUND_REF
+BOUND_REF: ID | { "type": "reference", "id": ID }
+UNBOUND_REF: NAME | { "type": "reference", "name": NAME }
+
+APPLY: { "type": "apply", "func-name": FUNC_ID, "arguments": [ EXPR* ] }
+FUNC_ID: NAME
+    | { "catalog": NAME, "namespace": [ NAME* ], "name": NAME }

Review Comment:
   Now that the flat list [CatalogObjectIdentifier 
PR](https://github.com/apache/iceberg/pull/16144) has merged, we should 
probably update `FUNC_ID` to a flat list too. 
   
   The [Functions spec PR](https://github.com/apache/iceberg/pull/15180) has 
also been updated.



##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+---
+title: "Expressions Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg Expressions
+
+This document defines the structure and behavior of expressions for use in 
Iceberg specifications. The purpose is to define a common structure that 
enables simple expressions to be stored and exchanged.
+
+Stored expressions are needed for use cases like data validations (`CHECK` 
constraints) and default values (for instance, `current_timestamp()`). 
Expressions are exchanged in use cases like server-side scan planning in the 
catalog protocol.
+
+
+## Overview
+
+The goal of this specification is to define a simple expression structure and 
avoid complexity.
+
+To remain simple, the expressions that can be represented are deliberately 
constrained. Value expressions are constants, field references, or function 
calls with value expression arguments. Predicates are comparisons of value 
expressions that produce true or false.
+
+This approach is intended to keep focus on the logical structure of 
expressions. Complexity is pushed to the functions that are called, which can 
be a limited set of well-defined and portable functions (like Iceberg partition 
transforms) or could be user-defined functions that can use the full range of 
SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs 
that are specific to an engine, rather than importing and duplicating dialects 
in Iceberg expressions.
+
+This is consistent with Iceberg's conservative approach in other specs. 
Expressions and predicates are an important part of Iceberg implementation 
APIs, but have been deliberately limited in specifications. For example, sort 
orders and partition fields are strictly limited to a small set of transforms 
over well-defined inputs (source field IDs). This spec is widening what can be 
expressed, but depends on function calls for complex tasks.
+
+This specification covers the structure of Iceberg expressions and includes 
appendices that specify serialization as JSON and a set of portable functions 
defined by Iceberg specifications.
+
+
+## Structure
+
+Iceberg expressions have two types:
+
+* **Value expressions** represent data values and transformations of values 
(function calls) that produce any Iceberg type
+* **Predicates** represent comparisons of value expressions and boolean logic 
that produce `true` or `false`
+
+
+### Value expressions
+
+A value expression is an expression that produces a typed value.
+
+Value expressions can be one of three types: a constant value, a field 
reference, or a function applied to zero or more value expressions.
+
+
+#### Constant values
+
+A constant or literal is the simplest type of value expression that represents 
a specific typed value.
+
+
+#### Field reference
+
+A field reference represents the value of a specific field in a row. When an 
expression is evaluated on a row, it returns the value of the field.
+
+Field references may be named references (unbound) or ID references (bound). 
ID references identify a field by field ID from a schema. Named references 
identify a field by name that must be resolved to an ID (bound to a schema) to 
access the field.
+
+ID references are used for stored expressions, where the identity of the 
column is determined when the stored expression is created. For example, column 
constraints are tied to field ID so that renaming a column does not drop its 
stored constraint.
+
+Named references are used when the identity of the column is determined when 
the expression is evaluated. For example, query filters are resolved each time 
a query runs, so server-side planning uses unbound named references.
+
+The context in which an expression is used determines the type of references 
that are valid. Iceberg specifications should document whether ID references, 
named references, or both are allowed.
+
+
+#### Apply function
+
+An apply expression represents the result of a function applied to (or called 
on) zero or more values produced by child value expressions.
+
+Functions are identified by catalog, namespace, and name.
+
+* Function name is always required
+* Namespace is optional and is assumed to be empty ([]) if it is not present 
or is null
+* Catalog is optional and is assumed to be the catalog in which the 
referencing object is stored if it is not present or is null
+
+The catalog name is used to identify the catalog where the function definition 
can be loaded or it identifies a reserved function set. As in the view and UDF 
specs, catalog names represent connection configurations that may differ across 
environments. Omitting catalog names is recommended to avoid depending on 
consistent environments. For example, if a table has a CHECK constraint that 
references a UDF without a catalog name (missing or null), the UDF should be 
loaded from the table’s catalog.
+
+Reserved function set names are:
+
+* `sql_functions` is used for functions defined by the SQL standard
+* `iceberg_functions` is used for functions defined in this specification
+
+Engines may document and use a catalog name to identify their built-in 
functions that are not part of the SQL spec, like 
`spark_builtin_functions.to_utc_timestamp`.
+
+Producers are responsible for resolving catalog, namespace, and name if the 
environment is relevant. For example, if a SQL engine uses its current catalog 
and namespace to find a function, the resolved catalog and namespace must be 
used to produce an unambiguous function identifier.
+
+
+#### Value expression types
+
+The type produced by a value expression may change. For example, an ID 
reference may produce a widened type after the underlying column's type is 
promoted.
+
+Function calls may produce different types when function definitions change, 
and type changes may change the definition that is resolved for a function 
name. For example, `identity(int) -> int` will change to `identity(long) -> 
long` when an input field is promoted from `int` to `long`.
+
+A value expression's type is determined when it is bound to a specific input 
schema.
+
+If types are incompatible at runtime, implementations binding or evaluating 
expressions may apply type promotion to align types for predicates and to 
resolve functions. Implementations may choose when to promote values to 
accommodate engines that differ in casting behavior. However, implementations 
must fail rather than insert "unsafe" casts. 
+
+
+### Predicates
+
+A predicate is a boolean expression that produces true or false.
+
+Predicates can be constants (true or false), comparisons or tests of value 
expressions, or logical combinations of predicates (AND, OR, NOT).
+
+If value expression types in a predicate are incompatible, implementations 
should align types using type promotion. For instance, `int_col > 5.0` should 
promote int values to float. If the types cannot be aligned according to type 
promotion rules, the predicate must evaluate to false. For instance, `"goats" > 
-Infinity` should always be `false`.
+
+Value expressions are not valid predicates, even when the expression is 
expected to return a boolean value. Value expressions must be compared or 
tested to produce a predicate. For example, `is_empty("")` is not a valid 
predicate, but `is_empty("") = true` is a valid predicate.
+
+
+#### Comparisons
+
+Comparisons are predicates that compare two value expressions with the same 
primitive type. Comparisons are:
+
+| Comparison  | Description |
+|-------------|-------------|
+| `=`         | Is equal |
+| `!=`        | Is not equal |
+| `<`         | Less than |
+| `<=`        | Less than or equal |
+| `>`         | Greater than |
+| `>=`        | Greater than or equal |
+
+Primitive types are compared using natural order, except for the following 
types:
+
+* `false` is less than `true` for `boolean`
+* `fixed` and `binary` use unsigned byte-wise comparison
+* `string` uses unsigned byte-wise comparison of the UTF-8 representation
+* `uuid` uses unsigned byte-wise comparison of the UUID bytes
+* `float` and `double` use IEEE 754 total order after normalizing NaN to the 
canonical NaN (sign bit 0, exponent bits all 1, mantissa msb 1 followed by all 0)
+    * `NaN = NaN` is true for any two NaN values
+    * `val < NaN` is true for all non-NaN values
+
+Note type alignment produces `decimal` values with the same scale so that 
comparison is equivalent to the natural order of the unscaled numeric value.
+
+Tests are predicates that test a single value expression, optionally using a 
constant or set of constants. Constants must have the same type and must be 
non-null. Tests are:
+
+| Test                    | Allowed types | Constant type | Description |
+|-------------------------|---------------|---------------|-------------|
+| `IS NULL`               | any           |               | true iff the value 
is null |
+| `IS NOT NULL`           | any           |               | true iff the value 
is not null |
+| `IS NaN`                | float, double |               | true iff the value 
is an IEEE 754 NaN |
+| `IS NOT NaN`            | float, double |               | true iff the value 
is not an IEEE 754 NaN |
+| `STARTS WITH const`     | string        | string        | true iff the 
constant is a prefix of the value |
+| `NOT STARTS WITH const` | string        | string        | true iff the 
constant is not a prefix of the value |
+| `IN (constant set)`     | any           | same as value | true iff the value 
is equal to any constant |
+| `NOT IN (constant set)` | any           | same as value | true iff the value 
is not equal to all constants |
+
+
+#### Boolean logic
+
+Predicates must use 2-valued boolean logic. Evaluation of all predicates must 
produce `true` or `false`.
+
+Engines that implement SQL 3-valued boolean logic must add `IS NULL` and `IS NOT 
NULL` to produce the 2-valued equivalent. This avoids bugs in engines and 
languages that do not natively implement 3-valued logic. For example, the SQL 
predicate `x < 10` should be passed as `x < 10 AND x IS NOT NULL` for a SQL 
`WHERE` condition (or `x < 10`; see null-safe comparisons below). For a `CHECK` 
constraint, the expression is passed as `x < 10 OR x IS NULL`. This ensures 
that implementations will make the correct determination, rather than depending 
on context to interpret a null result (`WHERE` vs `CHECK`).
+
+Logical combinations are boolean operators applied to predicates. `AND` and 
`OR` are binary operations and `NOT` is a unary operation.
+
+Comparisons must be null-safe. For example:

Review Comment:
   The bullets below are exhaustive for `=`/`!=` but only partial for ordering 
operators (e.g. `null < 34`, `null <= 34`, `null > 34`, `null >= 34`, `34 > 
null`, `34 >= null`, `null > null`, `null >= null` are missing). 
   
   Suggest replacing the bullets with explicit rules, then keeping a few 
illustrative examples:
   
   > Comparisons must be null-safe. For any two operands `a` and `b`:
   > - `a = b` is true if both are null, or both are non-null and equal; 
otherwise false.
   > - `a != b` is the boolean negation of `a = b`.
   > - `a < b` and `a > b` are false whenever either operand is null; otherwise 
they use the natural order defined above.
   > - `a <= b` is `(a = b) OR (a < b)`; `a >= b` is `(a = b) OR (a > b)`. Both 
are true when both operands are null and false when exactly one operand is null.
   
   Examples (now derivable from the rules):
   > - `null = null` → true; `34 = null` → false
   > - `null != null` → false; `34 != null` → true
   > - `null < null` → false; `34 < null` → false
   > - `null <= null` → true; `34 <= null` → false; `null <= 34` → false
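   
   A minimal Python sketch of the rules proposed above, assuming `None` stands in for null and Python's built-in ordering stands in for the spec's natural order (it does not implement the NaN or byte-wise ordering rules). The function names are hypothetical and not part of the spec.

```python
from typing import Any


def null_safe_eq(a: Any, b: Any) -> bool:
    # True if both are null, or both are non-null and equal; otherwise false.
    if a is None or b is None:
        return a is None and b is None
    return a == b


def null_safe_ne(a: Any, b: Any) -> bool:
    # Boolean negation of null-safe equality.
    return not null_safe_eq(a, b)


def null_safe_lt(a: Any, b: Any) -> bool:
    # False whenever either operand is null.
    if a is None or b is None:
        return False
    return a < b


def null_safe_gt(a: Any, b: Any) -> bool:
    # False whenever either operand is null.
    if a is None or b is None:
        return False
    return a > b


def null_safe_le(a: Any, b: Any) -> bool:
    # (a = b) OR (a < b): true for two nulls, false for exactly one null.
    return null_safe_eq(a, b) or null_safe_lt(a, b)


def null_safe_ge(a: Any, b: Any) -> bool:
    # (a = b) OR (a > b): true for two nulls, false for exactly one null.
    return null_safe_eq(a, b) or null_safe_gt(a, b)
```

   With these definitions, the cases missing from the original bullets (such as `null >= 34` or `null > null`) follow from the rules rather than needing to be listed one by one.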



##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+---
+title: "Expressions Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg Expressions
+
+This document defines the structure and behavior of expressions for use in 
Iceberg specifications. The purpose is to define a common structure that 
enables simple expressions to be stored and exchanged.
+
+Stored expressions are needed for use cases like data validations (`CHECK` 
constraints) and default values (for instance, `current_timestamp()`). 
Expressions are exchanged in use cases like server-side scan planning in the 
catalog protocol.
+
+
+## Overview
+
+The goal of this specification is to define a simple expression structure and 
avoid complexity.
+
+To remain simple, the expressions that can be represented are deliberately 
constrained. Value expressions are constants, field references, or function 
calls with value expression arguments. Predicates are comparisons of value 
expressions that produce true or false.
+
+This approach is intended to keep focus on the logical structure of 
expressions. Complexity is pushed to the functions that are called, which can 
be a limited set of well-defined and portable functions (like Iceberg partition 
transforms) or could be user-defined functions that can use the full range of 
SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs 
that are specific to an engine, rather than importing and duplicating dialects 
in Iceberg expressions.
+
+This is consistent with Iceberg's conservative approach in other specs. 
Expressions and predicates are an important part of Iceberg implementation 
APIs, but have been deliberately limited in specifications. For example, sort 
orders and partition fields are strictly limited to a small set of transforms 
over well-defined inputs (source field IDs). This spec is widening what can be 
expressed, but depends on function calls for complex tasks.
+
+This specification covers the structure of Iceberg expressions and includes 
appendices that specify serialization as JSON and a set of portable functions 
defined by Iceberg specifications.
+
+
+## Structure
+
+Iceberg expressions have two types:
+
+* **Value expressions** represent data values and transformations of values 
(function calls) that produce any Iceberg type
+* **Predicates** represent comparisons of value expressions and boolean logic 
that produce `true` or `false`
+
+
+### Value expressions
+
+A value expression is an expression that produces a typed value.
+
+Value expressions can be one of three types: a constant value, a field 
reference, or a function applied to zero or more value expressions.
+
+
+#### Constant values
+
+A constant or literal is the simplest type of value expression that represents 
a specific typed value.
+
+
+#### Field reference
+
+A field reference represents the value of a specific field in a row. When an 
expression is evaluated on a row, it returns the value of the field.
+
+Field references may be named references (unbound) or ID references (bound). 
ID references identify a field by field ID from a schema. Named references 
identify a field by name that must be resolved to an ID (bound to a schema) to 
access the field.
+
+ID references are used for stored expressions, where the identity of the 
column is determined when the stored expression is created. For example, column 
constraints are tied to field ID so that renaming a column does not drop its 
stored constraint.
+
+Named references are used when the identity of the column is determined when 
the expression is evaluated. For example, query filters are resolved each time 
a query runs, so server-side planning uses unbound named references.
+
+The context in which an expression is used determines the type of references 
that are valid. Iceberg specifications should document whether ID references, 
named references, or both are allowed.
+
+
+#### Apply function
+
+An apply expression represents the result of a function applied to (or called 
on) zero or more values produced by child value expressions.
+
+Functions are identified by catalog, namespace, and name.
+
+* Function name is always required
+* Namespace is optional and is assumed to be empty ([]) if it is not present 
or is null
+* Catalog is optional and is assumed to be the catalog in which the 
referencing object is stored if it is not present or is null
+
+The catalog name is used to identify the catalog where the function definition 
can be loaded or it identifies a reserved function set. As in the view and UDF 
specs, catalog names represent connection configurations that may differ across 
environments. Omitting catalog names is recommended to avoid depending on 
consistent environments. For example, if a table has a CHECK constraint that 
references a UDF without a catalog name (missing or null), the UDF should be 
loaded from the table’s catalog.
+
+Reserved function set names are:
+
+* `sql_functions` is used for functions defined by the SQL standard
+* `iceberg_functions` is used for functions defined in this specification
+
+Engines may document and use a catalog name to identify their built-in 
functions that are not part of the SQL spec, like 
`spark_builtin_functions.to_utc_timestamp`.
+
+Producers are responsible for resolving catalog, namespace, and name if the 
environment is relevant. For example, if a SQL engine uses its current catalog 
and namespace to find a function, the resolved catalog and namespace must be 
used to produce an unambiguous function identifier.
+
+
+#### Value expression types
+
+The type produced by a value expression may change. For example, an ID 
reference may produce a widened type after the underlying column's type is 
promoted.
+
+Function calls may produce different types when function definitions change, 
and type changes may change the definition that is resolved for a function 
name. For example, `identity(int) -> int` will change to `identity(long) -> 
long` when an input field is promoted from `int` to `long`.
+
+A value expression's type is determined when it is bound to a specific input 
schema.
+
+If types are incompatible at runtime, implementations binding or evaluating 
expressions may apply type promotion to align types for predicates and to 
resolve functions. Implementations may choose when to promote values to 
accommodate engines that differ in casting behavior. However, implementations 
must fail rather than insert "unsafe" casts. 
+
+
+### Predicates
+
+A predicate is a boolean expression that produces true or false.
+
+Predicates can be constants (true or false), comparisons or tests of value 
expressions, or logical combinations of predicates (AND, OR, NOT).
+
+If value expression types in a predicate are incompatible, implementations 
should align types using type promotion. For instance, `int_col > 5.0` should 
promote int values to float. If the types cannot be aligned according to type 
promotion rules, the predicate must evaluate to false. For instance, `"goats" > 
-Infinity` should always be `false`.

Review Comment:
   Two issues with this paragraph:
   
   **1. "type promotion rules" is undefined here.** The Iceberg type promotion 
rules live in the table spec; please add a cross-reference so readers know 
which exact rules govern alignment.
   
   **2. Conflict with L104.** This paragraph says incompatible types `must 
evaluate to false`, but the [Value expression types](#value-expression-types) 
section four lines up says implementations `must fail rather than insert 
"unsafe" casts`. Both rules apply to the example given here.
   
   Take `"goats" > -Infinity`: string and float have no promotion path — there 
is no common type to align them to.
   - Following L113, this predicate must return `false`.
   - Following L104, the implementation must fail because no safe cast exists.
   
   The same expression has two contradictory required behaviors, four lines 
apart. Two engines given the same data and predicate will silently diverge: one 
throws and the other returns zero rows.
   
   The spec probably has two distinct cases in mind that should be split:
   
   - **No promotion path exists at all** (e.g., `string` vs `float`) → 
short-circuit the predicate to `false`. Comparisons across unrelated type 
families just do not match.
   - **A promotion path exists but the cast would be lossy/unsafe** (e.g., 
`long` → `int` truncation, high-precision `decimal` → `float`) → fail rather 
than silently produce wrong results.
   
   That is a coherent model, but it is not in the text. Either pick one rule 
and apply it uniformly to type-incompatible expressions, or distinguish the two 
cases explicitly.
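   
   One possible reading of the two-case split, sketched in Python. The promotion table and type families below are hypothetical placeholders chosen for illustration; they are not the table spec's actual promotion rules, and `align_types`, `evaluate_gt`, and `UnsafeCastError` are made-up names.

```python
from typing import Any, Optional

# Hypothetical subset of safe promotions (from_type, to_type), for illustration only.
SAFE_PROMOTIONS = {("int", "long"), ("float", "double")}

# Hypothetical grouping of types into families, for illustration only.
TYPE_FAMILY = {
    "int": "numeric",
    "long": "numeric",
    "float": "numeric",
    "double": "numeric",
    "string": "string",
}


class UnsafeCastError(Exception):
    """Raised when types are related but can only be aligned with a lossy cast."""


def align_types(left: str, right: str) -> Optional[str]:
    """Return the common type, or None when no promotion path exists at all."""
    if left == right:
        return left
    if (left, right) in SAFE_PROMOTIONS:
        return right
    if (right, left) in SAFE_PROMOTIONS:
        return left
    if TYPE_FAMILY[left] != TYPE_FAMILY[right]:
        # Case 1: unrelated type families, so the predicate short-circuits to false.
        return None
    # Case 2: related types, but alignment would need an unsafe cast, so fail.
    raise UnsafeCastError(f"cannot safely align {left} and {right}")


def evaluate_gt(left: Any, left_type: str, right: Any, right_type: str) -> bool:
    """Evaluate `left > right` for non-null operands under the two-case model."""
    if align_types(left_type, right_type) is None:
        return False
    return left > right
```

   Under these assumptions, `"goats" > -Infinity` evaluates to `false` without raising, while a comparison that would need a lossy cast raises instead of returning a possibly wrong answer.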



##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+---
+title: "Expressions Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg Expressions
+
+This document defines the structure and behavior of expressions for use in 
Iceberg specifications. The purpose is to define a common structure that 
enables simple expressions to be stored and exchanged.
+
+Stored expressions are needed for use cases like data validations (`CHECK` 
constraints) and default values (for instance, `current_timestamp()`). 
Expressions are exchanged in use cases like server-side scan planning in the 
catalog protocol.
+
+
+## Overview
+
+The goal of this specification is to define a simple expression structure and 
avoid complexity.
+
+To remain simple, the expressions that can be represented are deliberately 
constrained. Value expressions are constants, field references, or function 
calls with value expression arguments. Predicates are comparisons of value 
expressions that produce true or false.
+
+This approach is intended to keep focus on the logical structure of 
expressions. Complexity is pushed to the functions that are called, which can 
be a limited set of well-defined and portable functions (like Iceberg partition 
transforms) or could be user-defined functions that can use the full range of 
SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs 
that are specific to an engine, rather than importing and duplicating dialects 
in Iceberg expressions.
+
+This is consistent with Iceberg's conservative approach in other specs. 
Expressions and predicates are an important part of Iceberg implementation 
APIs, but have been deliberately limited in specifications. For example, sort 
orders and partition fields are strictly limited to a small set of transforms 
over well-defined inputs (source field IDs). This spec is widening what can be 
expressed, but depends on function calls for complex tasks.
+
+This specification covers the structure of Iceberg expressions and includes 
appendices that specify serialization as JSON and a set of portable functions 
defined by Iceberg specifications.
+
+
+## Structure
+
+Iceberg expressions have two types:
+
+* **Value expressions** represent data values and transformations of values 
(function calls) that produce any Iceberg type
+* **Predicates** represent comparisons of value expressions and boolean logic 
that produce `true` or `false`
+
+
+### Value expressions
+
+A value expression is an expression that produces a typed value.
+
+Value expressions can be one of three types: a constant value, a field 
reference, or a function applied to zero or more value expressions.
+
+
+#### Constant values
+
+A constant or literal is the simplest type of value expression that represents 
a specific typed value.
+
+
+#### Field reference
+
+A field reference represents the value of a specific field in a row. When an 
expression is evaluated on a row, it returns the value of the field.
+
+Field references may be named references (unbound) or ID references (bound). 
ID references identify a field by field ID from a schema. Named references 
identify a field by name that must be resolved to an ID (bound to a schema) to 
access the field.
+
+ID references are used for stored expressions, where the identity of the 
column is determined when the stored expression is created. For example, column 
constraints are tied to field ID so that renaming a column does not drop its 
stored constraint.
+
+Named references are used when the identity of the column is determined when 
the expression is evaluated. For example, query filters are resolved each time 
a query runs, so server-side planning uses unbound named references.
+
+The context in which an expression is used determines the type of references 
that are valid. Iceberg specifications should document whether ID references, 
named references, or both are allowed.
+
+
+#### Apply function
+
+An apply expression represents the result of a function applied to (or called 
on) zero or more values produced by child value expressions.
+
+Functions are identified by catalog, namespace, and name.
+
+* Function name is always required
+* Namespace is optional and is assumed to be empty ([]) if it is not present 
or is null
+* Catalog is optional and is assumed to be the catalog in which the 
referencing object is stored if it is not present or is null
+
+The catalog name is used to identify the catalog where the function definition 
can be loaded or it identifies a reserved function set. As in the view and UDF 
specs, catalog names represent connection configurations that may differ across 
environments. Omitting catalog names is recommended to avoid depending on 
consistent environments. For example, if a table has a CHECK constraint that 
references a UDF without a catalog name (missing or null), the UDF should be 
loaded from the table’s catalog.
+
+Reserved function set names are:
+
+* `sql_functions` is used for functions defined by the SQL standard
+* `iceberg_functions` is used for functions defined in this specification
+
+Engines may document and use a catalog name to identify their built-in 
functions that are not part of the SQL spec, like 
`spark_builtin_functions.to_utc_timestamp`.
+
+Producers are responsible for resolving catalog, namespace, and name if the 
environment is relevant. For example, if a SQL engine uses its current catalog 
and namespace to find a function, the resolved catalog and namespace must be 
used to produce an unambiguous function identifier.
+
+
+#### Value expression types
+
+The type produced by a value expression may change. For example, an ID 
reference may produce a widened type after the underlying column's type is 
promoted.
+
+Function calls may produce different types when function definitions change, 
and type changes may change the definition that is resolved for a function 
name. For example, `identity(int) -> int` will change to `identity(long) -> 
long` when an input field is promoted from `int` to `long`.
+
+A value expression's type is determined when it is bound to a specific input 
schema.
+
+If types are incompatible at runtime, implementations binding or evaluating 
expressions may apply type promotion to align types for predicates and to 
resolve functions. Implementations may choose when to promote values to 
accommodate engines that differ in casting behavior. However, implementations 
must fail rather than insert "unsafe" casts. 
+
+
+### Predicates
+
+A predicate is a boolean expression that produces true or false.
+
+Predicates can be constants (true or false), comparisons or tests of value 
expressions, or logical combinations of predicates (AND, OR, NOT).
+
+If value expression types in a predicate are incompatible, implementations 
should align types using type promotion. For instance, `int_col > 5.0` should 
promote int values to float. If the types cannot be aligned according to type 
promotion rules, the predicate must evaluate to false. For instance, `"goats" > 
-Infinity` should always be `false`.
+
+Value expressions are not valid predicates, even when the expression is 
expected to return a boolean value. Value expressions must be compared or 
tested to produce a predicate. For example, `is_empty("")` is not a valid 
predicate, but `is_empty("") = true` is a valid predicate.
+
+
+#### Comparisons
+
+Comparisons are predicates that compare two value expressions with the same 
primitive type. Comparisons are:
+
+| Comparison  | Description |
+|-------------|-------------|
+| `=`         | Is equal |
+| `!=`        | Is not equal |
+| `<`         | Less than |
+| `<=`        | Less than or equal |
+| `>`         | Greater than |
+| `>=`        | Greater than or equal |
+
+Primitive types are compared using natural order, except for the following 
types:
+
+* `false` is less than `true` for `boolean`
+* `fixed` and `binary` use unsigned byte-wise comparison
+* `string` uses unsigned byte-wise comparison of the UTF-8 representation
+* `uuid` uses unsigned byte-wise comparison of the UUID bytes
+* `float` and `double` use IEEE 754 total order after normalizing NaN to the 
canonical NaN (sign bit 0, exponent bits all 1, mantissa msb 1 followed by all 0)
+    * `NaN = NaN` is true for any two NaN values
+    * `val < NaN` is true for all non-NaN values
+
+Note type alignment produces `decimal` values with the same scale so that 
comparison is equivalent to the natural order of the unscaled numeric value.
+
+Tests are predicates that test a single value expression, optionally using a 
constant or set of constants. Constants must have the same type and must be 
non-null. Tests are:
+
+| Test                    | Allowed types | Constant type | Description |
+|-------------------------|---------------|---------------|-------------|
+| `IS NULL`               | any           |               | true iff the value 
is null |
+| `IS NOT NULL`           | any           |               | true iff the value 
is not null |
+| `IS NaN`                | float, double |               | true iff the value 
is an IEEE 754 NaN |
+| `IS NOT NaN`            | float, double |               | true iff the value 
is not an IEEE 754 NaN |
+| `STARTS WITH const`     | string        | string        | true iff the 
constant is a prefix of the value |
+| `NOT STARTS WITH const` | string        | string        | true iff the 
constant is not a prefix of the value |
+| `IN (constant set)`     | any           | same as value | true iff the value 
is equal to any constant |
+| `NOT IN (constant set)` | any           | same as value | true iff the value 
is not equal to any constant |
+
+
+#### Boolean logic
+
+Predicates must use 2-valued boolean logic. Evaluation of all predicates must 
produce `true` or `false`.
+
+Engines that implement SQL 3-valued boolean logic must add `IS NULL` and 
`IS NOT NULL` to produce the 2-valued equivalent. This avoids bugs in engines and 
languages that do not natively implement 3-valued logic. For example, the SQL 
predicate `x < 10` should be passed as `x < 10 AND x IS NOT NULL` for a SQL 
`WHERE` condition (or `x < 10`; see null-safe comparisons below). For a `CHECK` 
constraint, the expression is passed as `x < 10 OR x IS NULL`. This ensures 
that implementations will make the correct determination, rather than 
depending on context to interpret a null result (`WHERE` vs `CHECK`).
+
+Logical combinations are boolean operators applied to predicates. `AND` and 
`OR` are binary operations and `NOT` is a unary operation.
+
+Comparisons must be null-safe. For example:
+
+* `null = null` is `true`
+* `34 = null` is `false`
+* `null != null` is `false`
+* `34 != null` is `true`
+* `null < null` is `false`
+* `null <= null` is `true`
+* `34 < null` is `false`
+
+Comparisons must handle null values when value expressions evaluate to null. 
However, value expressions used to define a predicate should not directly 
contain null constants and may reject them. For example, `x = get_item(map, 
"key")` is valid although `get_item` may return a null value, but `x = null` 
should be rejected because `x IS NULL` is the recommended unambiguous predicate.
+
+
+### Compatibility with REST catalog expressions
+
+Older clients use more restrictive forms of predicates and references that 
used a "term" for specific transforms and named references. These expressions 
should be supported for backward compatibility to allow older clients to 
interact with newer REST catalog services.
+
+Prior to this spec, deprecated expressions were passed in the REST API in 3 
places:
+
+* As `filter` passed to server-side scan planning
+* As `filter` passed to the service in `ScanReport`
+* As `residual` passed to the client with a scan task
+
+Both server-side scan planning and the report endpoint can continue to accept 
filters from older clients without issues by parsing term-based expressions 
(see [Appendix B: JSON serialization](#appendix-b-json-serialization)).
+
+Residuals passed from services back to clients that do not use the new syntax 
would cause clients to fail, but services are allowed to omit the residual so 
that it is calculated on the client side (intended to avoid duplicating large 
IN filters). For compatibility, REST services should detect client versions and 
produce deprecated predicates, or omit residuals from tasks.
+
+
+## Appendix A: Iceberg functions
+
+This section defines the functions in the `iceberg_functions` reserved catalog 
name.
+
+* `if_else(condition: predicate, when_true: T, when_false: T) -> T`: returns 
the value of `when_true` when `condition` is true and `when_false` otherwise
+
+### Partition transforms
+
+Iceberg partition transforms are also defined as functions (other than `void`).
+
+All partition transforms produce `null` for a `null` input value.
+
+| Function name     | Description                                              
    | Source types                                                         | 
Result type |
+|-------------------|--------------------------------------------------------------|----------------------------------------------------------------------|-------------|
+| `identity(value)` | Source value, unmodified                                 
    | Any primitive except for `geometry`, `geography`, and `variant`      | 
Source type |
+| `year(value)`     | Extract a date or timestamp year, as years from 1970     
    | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | 
`int`       |
+| `month(value)`    | Extract a date or timestamp month, as months from 
1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, 
`timestamptz_ns` | `int`       |
+| `day(value)`      | Extract a date or timestamp day, as days from 1970-01-01 
    | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | 
`date`      |
+| `hour(value)`     | Extract a timestamp hour, as hours from 1970-01-01 
00:00:00  | `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`        
 | `int`       |
+
+Note that `year`, `month`, and `hour` transforms produce ordinal values and 
not human-readable values. For example, `year(2018-05-13)` produces `48`, not 
`2018`.
+
+Parameterized functions are called as 2-argument functions. The first argument 
is an `int` parameter (`N` or `W` from the table spec) and the second argument 
is the value to transform. For example, `bucket(256, id)` calls `bucket[256]`.
+
+| Parameterized function name | Description                                   
| Source types                                                                  
               | Result type |
+|-----------------------------|-----------------------------------------------|----------------------------------------------------------------------------------------------|-------------|
+| `bucket(N, value)`          | Hash of value, mod `N` (see table spec)       
| Any primitive except for `geometry`, `geography`, `variant`, `boolean`, 
`float`, or `double` | `int`       |
+| `truncate(W, value)`        | Value truncated to width `W` (see table spec) 
| `int`, `long`, `decimal`, `string`, `binary`                                  
               | Source type |
+
+
+## Appendix B: JSON serialization
+
+Iceberg expressions are serialized as JSON objects in table, view, and UDF 
metadata, and in the REST protocol for catalogs.
+
+### Value expressions
+
+```
+EXPR: LITERAL | REFERENCE | APPLY
+
+LITERAL: VALUE
+    | { "type": "literal", "value": VALUE }
+    | { "type": "literal", "value": VALUE, "data-type": DATA_TYPE }
+LITERALS: [ LITERAL* ]
+
+REFERENCE: BOUND_REF | UNBOUND_REF
+BOUND_REF: ID | { "type": "reference", "id": ID }
+UNBOUND_REF: NAME | { "type": "reference", "name": NAME }
+
+APPLY: { "type": "apply", "func-name": FUNC_ID, "arguments": [ EXPR* ] }

Review Comment:
   The grammar and Appendix A disagree on whether `if_else` can be encoded.
   
   The grammar restricts `APPLY` arguments to value expressions: `APPLY: { 
"type": "apply", "func-name": FUNC_ID, "arguments": [ EXPR* ] }`, where `EXPR` 
is `LITERAL | REFERENCE | APPLY`. Predicates are explicitly *not* `EXPR` (see 
L115: "Value expressions are not valid predicates, even when the expression is 
expected to return a boolean value").
   
   But Appendix A defines `if_else(condition: predicate, when_true: T, 
when_false: T) -> T`, with the first argument typed as a predicate.
   
   Concretely, how would `if_else(x > 5, "big", "small")` be serialized?
   
   ```json
   {
     "type": "apply",
     "func-name": "if_else",
     "arguments": [
       { "type": "gt", "left": "x", "right": 5 },
       "big",
       "small"
     ]
   }
   ```
   
   The first argument is a `CMP_OP` predicate, but `APPLY.arguments` requires 
every element to be `EXPR`. The grammar has no valid encoding for `if_else`’s 
condition.
   
   Two viable fixes:
   
   1. **Broaden `APPLY` arguments to allow `PREDICATE`** — `arguments: [ (EXPR 
| PREDICATE)* ]`, with function signatures declaring which positions accept 
which. Cost: muddies the value/predicate split that L115 enforces.
   2. **Give `if_else` a dedicated JSON form** rather than treating it as an 
`apply` — e.g. `{ "type": "if-else", "condition": PREDICATE, "when-true": EXPR, 
"when-false": EXPR }`. Cost: `if_else` is no longer "just a function"; the 
iceberg_functions table needs a special case.
   
   Either is fine, but the grammar and the function table need to agree — the 
spec’s own named example function is currently unrepresentable under its JSON 
grammar.
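
To make the mismatch easy to reproduce, here is a minimal sketch of a checker for the `EXPR` production quoted above. It is not taken from the PR or from any Iceberg implementation. The names `is_expr`, `looks_like_predicate`, and `allow_predicate_args` are hypothetical, and because the predicate grammar is not reproduced in this comment, the sketch assumes a predicate is any JSON object whose `type` is not `literal`, `reference`, or `apply`.

```python
# Hypothetical checker for: EXPR: LITERAL | REFERENCE | APPLY
VALUE_TYPES = {"literal", "reference", "apply"}


def looks_like_predicate(node):
    # Assumption: any typed object outside the value-expression types
    # (for example {"type": "gt", ...}) is treated as a predicate.
    return (
        isinstance(node, dict)
        and isinstance(node.get("type"), str)
        and node["type"] not in VALUE_TYPES
    )


def is_expr(node, allow_predicate_args=False):
    """Return True if node matches EXPR under the quoted grammar.

    allow_predicate_args=True models fix 1, where APPLY arguments
    may be EXPR or PREDICATE.
    """
    if not isinstance(node, dict):
        # Bare JSON strings, numbers, and booleans stand for the short
        # forms of LITERAL and REFERENCE (VALUE, ID, NAME).
        return isinstance(node, (str, int, float, bool))
    kind = node.get("type")
    if kind == "literal":
        return "value" in node
    if kind == "reference":
        # Exactly one of "id" (bound) or "name" (unbound) is expected.
        return ("id" in node) != ("name" in node)
    if kind == "apply":
        args = node.get("arguments")
        if "func-name" not in node or not isinstance(args, list):
            return False
        return all(
            is_expr(arg, allow_predicate_args)
            or (allow_predicate_args and looks_like_predicate(arg))
            for arg in args
        )
    return False


# The if_else call from the example above.
if_else_call = {
    "type": "apply",
    "func-name": "if_else",
    "arguments": [
        {"type": "gt", "left": "x", "right": 5},
        "big",
        "small",
    ],
}

# A partition transform call with only value-expression arguments.
bucket_call = {"type": "apply", "func-name": "bucket", "arguments": [256, "id"]}
```

Under these assumptions, `is_expr(if_else_call)` rejects the call because its first argument is not an `EXPR`, while `is_expr(if_else_call, allow_predicate_args=True)` accepts it. `bucket_call` is accepted either way, so only functions that take a predicate argument are affected.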



##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+---
+title: "Expressions Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg Expressions
+
+This document defines the structure and behavior of expressions for use in 
Iceberg specifications. The purpose is to define a common structure that 
enables simple expressions to be stored and exchanged.
+
+Stored expressions are needed for use cases like data validations (`CHECK` 
constraints) and default values (for instance, `current_timestamp()`). 
Expressions are exchanged in use cases like server-side scan planning in the 
catalog protocol.
+
+
+## Overview
+
+The goal of this specification is to define a simple expression structure and 
avoid complexity.
+
+To remain simple, the expressions that can be represented are deliberately 
constrained. Value expressions are constants, field references, or function 
calls with value expression arguments. Predicates are comparisons of value 
expressions that produce true or false.
+
+This approach is intended to keep focus on the logical structure of 
expressions. Complexity is pushed to the functions that are called, which can 
be a limited set of well-defined and portable functions (like Iceberg partition 
transforms) or could be user-defined functions that can use the full range of 
SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs 
that are specific to an engine, rather than importing and duplicating dialects 
in Iceberg expressions.
+
+This is consistent with Iceberg's conservative approach in other specs. 
Expressions and predicates are an important part of Iceberg implementation 
APIs, but have been deliberately limited in specifications. For example, sort 
orders and partition fields are strictly limited to a small set of transforms 
over well-defined inputs (source field IDs). This spec is widening what can be 
expressed, but depends on function calls for complex tasks.
+
+This specification covers the structure of Iceberg expressions and includes 
appendices that specify serialization as JSON and a set of portable functions 
defined by Iceberg specifications.
+
+
+## Structure
+
+Iceberg expressions have two types:
+
+* **Value expressions** represent data values and transformations of values 
(function calls) that produce any Iceberg type
+* **Predicates** represent comparisons of value expressions and boolean logic 
that produce `true` or `false`
+
+
+### Value expressions
+
+A value expression is an expression that produces a typed value.
+
+Value expressions can be one of three types: a constant value, a field 
reference, or a function applied to zero or more value expressions.
+
+
+#### Constant values
+
+A constant or literal is the simplest type of value expression that represents 
a specific typed value.
+
+
+#### Field reference
+
+A field reference represents the value of a specific field in a row. When an 
expression is evaluated on a row, it returns the value of the field.
+
+Field references may be named references (unbound) or ID references (bound). 
ID references identify a field by field ID from a schema. Named references 
identify a field by name that must be resolved to an ID (bound to a schema) to 
access the field.
+
+ID references are used for stored expressions, where the identity of the 
column is determined when the stored expression is created. For example, column 
constraints are tied to field ID so that renaming a column does not drop its 
stored constraint.
+
+Named references are used when the identity of the column is determined when 
the expression is evaluated. For example, query filters are resolved each time 
a query runs, so server-side planning uses unbound named references.
+
+The context in which an expression is used determines the type of references 
that are valid. Iceberg specifications should document whether ID references, 
named references, or both are allowed.
+
+
+#### Apply function
+
+An apply expression represents the result of a function applied to (or called 
on) zero or more values produced by child value expressions.
+
+Functions are identified by catalog, namespace, and name.
+
+* Function name is always required
+* Namespace is optional and is assumed to be empty ([]) if it is not present 
or is null
+* Catalog is optional and is assumed to be the catalog in which the 
referencing object is stored if it is not present or is null
+
+The catalog name is used to identify the catalog where the function definition 
can be loaded or it identifies a reserved function set. As in the view and UDF 
specs, catalog names represent connection configurations that may differ across 
environments. Omitting catalog names is recommended to avoid depending on 
consistent environments. For example, if a table has a CHECK constraint that 
references a UDF without a catalog name (missing or null), the UDF should be 
loaded from the table’s catalog.
+
+Reserved function set names are:
+
+* `sql_functions` is used for functions defined by the SQL standard
+* `iceberg_functions` is used for functions defined in this specification
+
+Engines may document and use a catalog name to identify their built-in 
functions that are not part of the SQL spec, like 
`spark_builtin_functions.to_utc_timestamp`.
+
+Producers are responsible for resolving catalog, namespace, and name if the 
environment is relevant. For example, if a SQL engine uses its current catalog 
and namespace to find a function, the resolved catalog and namespace must be 
used to produce an unambiguous function identifier.
+
+
+#### Value expression types
+
+The type produced by a value expression may change. For example, an ID 
reference may produce a widened type after the underlying column's type is 
promoted.
+
+Function calls may produce different types when function definitions change, 
and type changes may change the definition that is resolved for a function 
name. For example, `identity(int) -> int` will change to `identity(long) -> 
long` when an input field is promoted from `int` to `long`.
+
+A value expression's type is determined when it is bound to a specific input 
schema.
+
+If types are incompatible at runtime, implementations binding or evaluating 
expressions may apply type promotion to align types for predicates and to 
resolve functions. Implementations may choose when to promote values to 
accommodate engines that differ in casting behavior. However, implementations 
must fail rather than insert "unsafe" casts. 
+
+
+### Predicates
+
+A predicate is a boolean expression that produces true or false.
+
+Predicates can be constants (true or false), comparisons or tests of value 
expressions, or logical combinations of predicates (AND, OR, NOT).
+
+If value expression types in a predicate are incompatible, implementations 
should align types using type promotion. For instance, `int_col > 5.0` should 
promote int values to float. If the types cannot be aligned according to type 
promotion rules, the predicate must evaluate to false. For instance, `"goats" > 
-Infinity` should always be `false`.
+
+Value expressions are not valid predicates, even when the expression is 
expected to return a boolean value. Value expressions must be compared or 
tested to produce a predicate. For example, `is_empty("")` is not a valid 
predicate, but `is_empty("") = true` is a valid predicate.
+
+
+#### Comparisons
+
+Comparisons are predicates that compare two value expressions with the same 
primitive type. Comparisons are:
+
+| Comparison  | Description |
+|-------------|-------------|
+| `=`         | Is equal |
+| `!=`        | Is not equal |
+| `<`         | Less than |
+| `<=`        | Less than or equal |
+| `>`         | Greater than |
+| `>=`        | Greater than or equal |
+
+Primitive types are compared using natural order, except for the following 
types:
+
+* `false` is less than `true` for `boolean`
+* `fixed` and `binary` use unsigned byte-wise comparison
+* `string` uses unsigned byte-wise comparison of the UTF-8 representation
+* `uuid` uses unsigned byte-wise comparison of the UUID bytes
+* `float` and `double` use IEEE 754 total order after normalizing NaN to the 
canonical NaN (sign bit 0, exponent bits all 1, mantissa msb 1 followed by all 0)
+    * `NaN = NaN` is true for any two NaN values
+    * `val < NaN` is true for all non-NaN values
+
+Note type alignment produces `decimal` values with the same scale so that 
comparison is equivalent to the natural order of the unscaled numeric value.
+
+Tests are predicates that test a single value expression, optionally using a 
constant or set of constants. Constants must have the same type and must be 
non-null. Tests are:
+
+| Test                    | Allowed types | Constant type | Description |
+|-------------------------|---------------|---------------|-------------|
+| `IS NULL`               | any           |               | true iff the value 
is null |
+| `IS NOT NULL`           | any           |               | true iff the value 
is not null |
+| `IS NaN`                | float, double |               | true iff the value 
is an IEEE 754 NaN |
+| `IS NOT NaN`            | float, double |               | true iff the value 
is not an IEEE 754 NaN |
+| `STARTS WITH const`     | string        | string        | true iff the 
constant is a prefix of the value |
+| `NOT STARTS WITH const` | string        | string        | true iff the 
constant is not a prefix of the value |
+| `IN (constant set)`     | any           | same as value | true iff the value 
is equal to any constant |
+| `NOT IN (constant set)` | any           | same as value | true iff the value 
is not equal to any constant |
+
+
+#### Boolean logic
+
+Predicates must use 2-valued boolean logic. Evaluation of all predicates must 
produce `true` or `false`.
+
+Engines that implement SQL 3-valued boolean logic must add `IS NULL` and 
`IS NOT NULL` to produce the 2-valued equivalent. This avoids bugs in engines and 
languages that do not natively implement 3-valued logic. For example, the SQL 
predicate `x < 10` should be passed as `x < 10 AND x IS NOT NULL` for a SQL 
`WHERE` condition (or `x < 10`; see null-safe comparisons below). For a `CHECK` 
constraint, the expression is passed as `x < 10 OR x IS NULL`. This ensures 
that implementations will make the correct determination, rather than 
depending on context to interpret a null result (`WHERE` vs `CHECK`).
+
+Logical combinations are boolean operators applied to predicates. `AND` and 
`OR` are binary operations and `NOT` is a unary operation.
+
+Comparisons must be null-safe. For example:
+
+* `null = null` is `true`
+* `34 = null` is `false`
+* `null != null` is `false`
+* `34 != null` is `true`
+* `null < null` is `false`
+* `null <= null` is `true`
+* `34 < null` is `false`
+
+Comparisons must handle null values when value expressions evaluate to null. 
However, value expressions used to define a predicate should not directly 
contain null constants and may reject them. For example, `x = get_item(map, 
"key")` is valid although `get_item` may return a null value, but `x = null` 
should be rejected because `x IS NULL` is the recommended unambiguous predicate.
+
+
+### Compatibility with REST catalog expressions
+
+Older clients use more restrictive forms of predicates and references that 
used a "term" for specific transforms and named references. These expressions 
should be supported for backward compatibility to allow older clients to 
interact with newer REST catalog services.
+
+Prior to this spec, deprecated expressions were passed in the REST API in 3 
places:

Review Comment:
   "deprecated expressions" is awkward here for two reasons:
   
   1. **Anachronism.** "Prior to this spec, deprecated expressions were 
passed…" — these forms were not deprecated prior to this spec; they were the 
only form. They are being deprecated *by* this spec.
   
   2. **Forward reference without a hook.** A reader hitting "deprecated 
expressions" at this point has no idea what is being deprecated until they 
reach `DEPRECATED_PREDICATE` / `DEPRECATED_REF` in [Appendix 
B](#appendix-b-json-serialization). The form should be named here so the 
section stands on its own.
   
   Suggested rewrite:
   
   > Prior to this spec, REST APIs used a more restrictive, term-based form of 
predicates and references in three places. Those forms are now deprecated (see 
[Backward compatibility](#backward-compatibility) in Appendix B):
   
   This also lets you drop the redundant first paragraph of the section, since 
the term-based form is now named directly.



##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+---
+title: "Expressions Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg Expressions
+
+This document defines the structure and behavior of expressions for use in 
Iceberg specifications. The purpose is to define a common structure that 
enables simple expressions to be stored and exchanged.
+
+Stored expressions are needed for use cases like data validations (`CHECK` 
constraints) and default values (for instance, `current_timestamp()`). 
Expressions are exchanged in use cases like server-side scan planning in the 
catalog protocol.
+
+
+## Overview
+
+The goal of this specification is to define a simple expression structure and 
avoid complexity.
+
+To remain simple, the expressions that can be represented are deliberately 
constrained. Value expressions are constants, field references, or function 
calls with value expression arguments. Predicates are comparisons of value 
expressions that produce true or false.
+
+This approach is intended to keep focus on the logical structure of 
expressions. Complexity is pushed to the functions that are called, which can 
be a limited set of well-defined and portable functions (like Iceberg partition 
transforms) or could be user-defined functions that can use the full range of 
SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs 
that are specific to an engine, rather than importing and duplicating dialects 
in Iceberg expressions.
+
+This is consistent with Iceberg's conservative approach in other specs. 
Expressions and predicates are an important part of Iceberg implementation 
APIs, but have been deliberately limited in specifications. For example, sort 
orders and partition fields are strictly limited to a small set of transforms 
over well-defined inputs (source field IDs). This spec is widening what can be 
expressed, but depends on function calls for complex tasks.
+
+This specification covers the structure of Iceberg expressions and includes 
appendicies that specify serialization as JSON and a set of portable functions 
defined by Iceberg specifications.
+
+
+## Structure
+
+Iceberg expressions have two types:
+
+* **Value expressions** represent data values and transformations of values 
(function calls) that produce any Iceberg type
+* **Predicates** represent comparisons of value expressions and boolean logic 
that produce `true` or `false`
+
+
+### Value expressions
+
+A value expression is an expression that produces a typed value
+
+Value expressions can be one of three types: a constant value, a field 
reference, or a function applied to zero or more value expressions.
+
+
+#### Constant values
+
+A constant or literal is the simplest type of value expression that represents 
a specific typed value.
+
+
+#### Field reference
+
+A field reference represents the value of a specific field in a row. When an 
expression is evaluated on a row, it returns the value of the field.
+
+Field references may be named references (unbound) or ID references (bound). 
ID references identify a field by field ID from a schema. Named references 
identify a field by name that must be resolved to an ID (bound to a schema) to 
access the field.
+
+ID references are used for stored expressions, where the identity of the 
column is determined when the stored expression is created. For example, column 
constraints are tied to field ID so that renaming a column does not drop its 
stored constraint.
+
+Named references are used when the identity of the column is determined when 
the expression is evaluated. For example, query filters are resolved each time 
a query runs so servers-side planning uses unbound named references.
+
+The context in which an expression is used determines the type of references 
that are valid. Iceberg specifications should document whether ID references, 
named references, or both are allowed.
+
+
+#### Apply function
+
+An apply expression represents the result of a function applied to (or called 
on) zero or more values produced by child value expressions.
+
+Functions are identified by catalog, namespace, and name.
+
+* Function name is always required
+* Namespace is optional and is assumed to be empty ([]) if it is not present 
or is null
+* Catalog is optional and is assumed to be the catalog in which the 
referencing object is stored if it is not present or is null
+
+The catalog name is used to identify the catalog where the function definition 
can be loaded or it identifies a reserved function set. As in the view and UDF 
specs, catalog names represent connection configurations that may differ across 
environments. Omitting catalog names is recommended to avoid depending on 
consistent environments. For example, if a table has a CHECK constraint that 
references a UDF without a catalog name (missing or null), the UDF should be 
loaded from the table’s catalog.
+
+Reserved function set names are:
+
+* `sql_functions` is used for functions defined by the SQL standard
+* `iceberg_functions` is used for functions defined in this specification
+
+Engines may document and use a catalog name to identify their built-in 
functions that are not part of the SQL spec, like 
`spark_builtin_functions.to_utc_timestamp`.
+
+Producers are responsible for resolving catalog, namespace, and name if the 
environment is relevant. For example, if a SQL engine uses its current catalog 
and namespace to find a function, the resolved catalog and namespace must be 
used to produce an unambiguous function identifier.
+
+
+#### Value expression types
+
+The type produced by a value expression may change. For example, an ID 
reference may produce a widened type after the underlying column's type is 
promoted.
+
+Function calls may produce different types when function definitions change, 
and type changes may change the definition that is resolved for a function 
name. For example, `identity(int) -> int` will change to `identity(long) -> 
long` when an input field is promoted from `int` to `long`.
+
+A value expression's type is determined when it is bound to a specific input 
schema.
+
+If types are incompatible at runtime, implementations binding or evaluating 
expressions may apply type promotion to align types for predicates and to 
resolve functions. Implementations may choose when to promote values to 
accomodate engines that differ in casting behavior. However, implementations 
must fail rather than insert "unsafe" casts. 
+
+
+### Predicates
+
+A predicate is a boolean expression that produces true or false.
+
+Predicates can be constants (true or false), comparisons or tests of value 
expressions, or logical combinations of predicates (AND, OR, NOT).
+
+If value expression types in a predicate are incompatible, implementations 
should align types using type promotion. For instance, `int_col > 5.0` should 
promote int values to float. If the types cannot be aligned according to type 
promotion rules, the predicate must evaluate to false. For instance, `"goats" > 
-Infinity` should always be `false`.
+
+Value expressions are not valid predicates, even when the expression is 
expected to return a boolean value. Value expressions must be compared or 
tested to produce a predicate. For example, `is_empty("")` is not a valid 
predicate, but `is_empty("") = true` is a valid predicate.
+
+
+#### Comparisons
+
+Comparisons are predicates that compare two value expressions with the same 
primitive type. Comparisons are:
+
+| Comparison  | Description |
+|-------------|-------------|
+| `=`         | Is equal |
+| `!=`        | Is not equal |
+| `<`         | Less than |
+| `<=`        | Less than or equal |
+| `>`         | Greater than |
+| `>=`        | Greater than or equal |
+
+Primitive types are compared using natural order, except for the following 
types:
+
+* `false` is less than `true` for `boolean`
+* `fixed` and `binary` use unsigned byte-wise comparison
+* `string` uses unsigned byte-wise comparison of the UTF-8 representation
+* `uuid` uses unsigned byte-wise comparison of the UUID bytes
+* `float` and `double` use IEEE 754 total order after normalizing NaN to the 
canonical NaN (sign bit 0, exponent bits all 1, matissa msb 1 followed by all 0)
+    * `NaN = NaN` is true for any two NaN values
+    * `val < NaN` is true for all non-NaN values
+
+Note type alignment produces `decimal` values with the same scale so that 
comparison is equivalent to the natural order of the unscaled numeric value.
+
+Tests are predicates that test a single value expression, optionally using a 
constant or set of constants. Constants must have the same type and must be 
non-null. Tests are:
+
+| Test                    | Allowed types | Constant type | Description |
+|-------------------------|---------------|---------------|-------------|
+| `IS NULL`               | any           |               | true iff the value 
is null |
+| `IS NOT NULL`           | any           |               | true iff the value 
is not null |
+| `IS NaN`                | float, double |               | true iff the value 
is an IEEE 754 NaN |
+| `IS NOT NaN`            | float, double |               | true iff the value 
is not an IEEE 754 NaN |
+| `STARTS WITH const`     | string        | string        | true iff the 
constant is a prefix of the value |
+| `NOT STARTS WITH const` | string        | string        | true iff the 
constant is not a prefix of the value |
+| `IN (constant set)`     | any           | same as value | true iff the value 
is equal to any constant |
+| `NOT IN (constant set)` | any           | same as value | true iff the value 
is not equal to all constants |

Review Comment:
   Two coverage gaps in the tests table:
   
   1. **Allowed types `any` for `IN`/`NOT IN`** — but the comparison rules 
above restrict equality to primitives, and the equality semantics for complex 
types (struct/list/map) are not defined anywhere in this spec. Either narrow 
the allowed types to primitive (matching `=`), or define how element equality 
works for non-primitive types.
   
   2. **Null on the value side for tests is unspecified.** The null-safe rules 
below cover comparisons (`=`, `<`, etc.) but not tests. What does `null IS NaN` 
evaluate to? `null IS NOT NaN`? `null STARTS WITH 'foo'`? `null IN (1, 2, 3)`? 
Reading the rows literally ("true iff value is not NaN"), `null IS NOT NaN` is 
true, which contradicts the null-safe spirit elsewhere. A row in the null-safe 
section stating the rule for tests on null values — e.g., "tests on a null 
value evaluate to false except for `IS NULL` (true) and `IS NOT NULL` (false)" 
— would close this.
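   
   To make the second gap concrete, here is a minimal Python sketch. It is not taken from the PR: the function names are hypothetical, `None` stands in for a null value, and `null_safe` encodes only the rule suggested above, not anything the spec currently states.

```python
import math
from typing import Callable, Iterable, Optional


def is_not_nan_literal(value: Optional[float]) -> bool:
    # Literal reading of the table row: "true iff the value is not an IEEE 754 NaN".
    # A null is not a NaN, so this returns True for None.
    return not (value is not None and math.isnan(value))


def not_in_literal(value, constants: Iterable) -> bool:
    # Literal reading: "true iff the value is not equal to all constants".
    return all(value != c for c in constants)


def null_safe(test: Callable[..., bool]) -> Callable[..., bool]:
    # Assumed rule (suggested in this comment, not in the spec): every test
    # other than IS NULL / IS NOT NULL evaluates to false on a null value.
    def wrapped(value, *args) -> bool:
        return False if value is None else test(value, *args)
    return wrapped


is_not_nan = null_safe(is_not_nan_literal)
not_in = null_safe(not_in_literal)
```

   Under the literal reading, `is_not_nan_literal(None)` and `not_in_literal(None, [1, 2, 3])` are both true. Under the suggested rule, both are false, and results for non-null inputs are unchanged.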



##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+#### Boolean logic
+
+Predicates must use 2-valued boolean logic. Evaluation of all predicates must 
produce `true` or `false`.
+
+Engines that implement SQL 3-valued boolean logic must add `IS NULL` and `NOT 
NULL` to produce the 2-valued equivalent. This avoids bugs in engines and 
languages that do not natively implement 3-valued logic. For example, the SQL 
predicate `x < 10` should be passed as `x < 10 AND x IS NOT NULL` for a SQL 
`WHERE` condition (or `x < 10`; see null-safe comparisons below). For a `CHECK` 
constraint, the expression is passed as `x < 10 OR x IS NULL`. This ensures 
that implementations will make the correct determination, rather than depending 
depending on context to interpret a null result (`WHERE` vs `CHECK`).

Review Comment:
   typo: `depending depending`.
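   
   For readers following this paragraph, the `WHERE` vs `CHECK` rewrite it describes can be sketched in Python as below. This is an illustration only: the function names are hypothetical, `None` stands in for SQL null / UNKNOWN, and the bound of 10 comes from the example in the paragraph.

```python
from typing import Optional


def sql_less_than(x: Optional[int], bound: int = 10) -> Optional[bool]:
    # SQL 3-valued `x < bound`: a null input yields UNKNOWN (None here).
    return None if x is None else x < bound


def where_filter(x: Optional[int], bound: int = 10) -> bool:
    # 2-valued form for a WHERE condition: `x < 10 AND x IS NOT NULL`.
    return x is not None and x < bound


def check_constraint(x: Optional[int], bound: int = 10) -> bool:
    # 2-valued form for a CHECK constraint: `x < 10 OR x IS NULL`.
    return x is None or x < bound
```

   A null input is rejected by `where_filter` and accepted by `check_constraint`, so the consumer never has to interpret an UNKNOWN result based on context.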


