This is an automated email from the ASF dual-hosted git repository.
yufei pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/main by this push:
new b35624e0d0 Spec: Introduce SQL UDF specification (#14117)
b35624e0d0 is described below
commit b35624e0d0b45f6dcee7a8c157c47c05593566dc
Author: Yufei Gu <[email protected]>
AuthorDate: Thu Feb 5 15:54:38 2026 -0800
Spec: Introduce SQL UDF specification (#14117)
---
format/udf-spec.md | 352 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 352 insertions(+)
diff --git a/format/udf-spec.md b/format/udf-spec.md
new file mode 100644
index 0000000000..86cb2bc284
--- /dev/null
+++ b/format/udf-spec.md
@@ -0,0 +1,352 @@
+---
+title: "SQL UDF Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg UDF Spec
+
+## Background and Motivation
+
+A SQL user-defined function (UDF or UDTF) is a callable routine that accepts
input parameters and executes a function body.
+Depending on the function type, the result can be:
+
+- **Scalar function (UDF)** – returns a single value, which may be a primitive
type (e.g., `int`, `string`) or a non-primitive type (e.g., `struct`, `list`).
+- **Table function (UDTF)** – returns a table with zero or more rows of
columns with a uniform schema.
+
+This specification introduces a standardized metadata format for UDFs in
Iceberg.
+
+## Goals
+
+* Define a portable metadata format for both scalar and table SQL UDFs. The
metadata is self-contained and can be moved across catalogs.
+* Support function evolution through versioning and rollback.
+* Provide consistent semantics for representing UDFs across engines.
+
+## Overview
+
+UDF metadata follows the same design principles as Iceberg table and view
metadata: each function is represented by a
+**self-contained metadata file**. Metadata captures definitions, parameters,
return types, documentation, security,
+properties, and engine-specific representations.
+
+* UDF metadata files are immutable. Any modification (new definition, updated
representation, changed properties, etc.)
+ creates a new metadata file. Catalogs use an atomic swap to update the
metadata file linked to a catalog identifier.
+* Each metadata file includes recent definition versions, enabling rollbacks
without external state.
+
+## Specification
+
+### UDF Metadata
+The UDF metadata file has the following fields:
+
+| Requirement | Field name | Type | Description
|
+|-------------|------------------|------------------------|---------------------------------------------------------------------------|
+| *required* | `function-uuid` | `string` | A UUID that
identifies this UDF, generated once at creation. |
+| *required* | `format-version` | `int` | UDF specification
format version (must be `1`). |
+| *required* | `definitions` | `list<definition>` | List of function
[definition](#definition) entities. |
+| *required* | `definition-log` | `list<definition-log>` | History of
[versions within the function's definitions](#definition-log). |
+| *optional* | `location` | `string` | The function's
base location; used to create metadata file locations. |
+| *optional* | `properties` | `map<string,string>` | A string-to-string
map of properties. |
+| *optional* | `secure` | `boolean` | Whether it is a
secure function. Default: `false`. |
+| *optional* | `doc` | `string` | Documentation
string. |
+
+Notes:
+1. When `secure` is set to `true`, engines should prevent leakage of sensitive
information to end users. Each engine
+ may have its own security definition and mechanisms. It is the
administrator's responsibility to ensure that
+ UDFs marked as secure are properly configured and protected in their
environment.
+2. Entries in `properties` are treated as hints, not strict rules.
+3. UDF names are not stored in metadata. It's the catalog's responsibility to
map UDF names to metadata file locations.
+
+### Definition
+
+Each `definition` represents one function signature (e.g., `add_one(int)` vs
`add_one(float)`). A definition is uniquely
+identified by its signature (the ordered list of parameter types). There can
be only one definition for a given signature.
+All versions within a definition must accept the same signature as specified
in the definition's `parameters` field and
+must produce values of the declared `return-type`.
+
+| Requirement | Field name | Type |
Description
|
+|-------------|----------------------|--------------------------------|---------------------------------------------------------------------------------------------------|
+| *required* | `definition-id` | `string` | An
identifier derived from canonical parameter-type tuple (see [Definition
ID](#definition-id)). |
+| *required* | `parameters` | `list<parameter>` |
Ordered list of [function parameters](#parameter). Invocation order **must**
match this list. |
+| *required* | `return-type` | `string` or `object` |
Declared return type using [Types](#types).
|
+| *optional* | `return-nullable` | `boolean` | A hint
to indicate whether the return value is nullable or not. Default: `true`.
|
+| *required* | `versions` | `list<definition-version>` |
[Versioned implementations](#definition-version) of this definition.
|
+| *required* | `current-version-id` | `int` |
Identifier of the current version for this definition.
|
+| *required* | `function-type` | `string` (`"udf"` or `"udtf"`) | If
`"udtf"`, `return-type` must be a `struct` (see [Types](#types)) describing the
output schema. |
+| *optional* | `doc` | `string` |
Documentation string.
|
+
+### Parameter
+| Requirement | Field | Type | Description
|
+|-------------|--------|----------|--------------------------------------------|
+| *required* | `type` | `string` | Parameter data type (see [Types](#types)).
|
+| *required* | `name` | `string` | Parameter name.
|
+| *optional* | `doc` | `string` | Parameter documentation.
|
+
+Notes:
+1. Variadic (vararg) parameters are not supported. Each definition must
declare a fixed number of parameters.
+2. Parameterized signatures are not supported (e.g., `<E> array_agg(E)`).
+3. All types, including parameters and return types, are considered nullable.
+
+#### Types
+[iceberg-type-json]:
https://iceberg.apache.org/spec/#appendix-c-json-serialization
+Types are based on the [Iceberg
Type](https://iceberg.apache.org/spec/#schemas-and-data-types).
+
+Primitive and semi-structured type strings are encoded based on [Iceberg Type
JSON Representation][iceberg-type-json]
+(e.g., `int`, `string`, `timestamp`, `decimal(9,2)`, `variant`). Type strings
must contain no spaces or quote characters.
+
+Nested types (`struct`, `list`, `map`) use the [Iceberg Type JSON
Representation][iceberg-type-json] with the
+following fields required. Any other fields must be ignored.
+* `list` requires `type` and `element`, e.g., `{ "type": "list", "element":
"string" }`
+* `map` requires `type`, `key`, and `value`, e.g., `{ "type": "map", "key":
"string", "value": "int" }`
+* `struct` requires `type` and `fields`, where each field requires `name` and
`type`,
+ e.g., `{ "type": "struct", "fields": [ { "name": "id", "type": "int" }, {
"name": "name", "type": "string" } ] }`
+
+#### Definition ID
+The `definition-id` is a canonical string derived from the parameter types,
formatted as a comma-separated list with no
+spaces. Each type uses the following string representation:
+
+* Primitives and semi-structured: the type name (e.g., `int`, `variant`)
+* List: `list<element-type>` (e.g., `list<int>`)
+* Map: `map<key-type,value-type>` (e.g., `map<string,int>`)
+* Struct: `struct<name1:type1,name2:type2,...>` with field names and types
(e.g., `struct<id:int,name:string>`)
+
+Examples of complete definition-id signatures:
+* `int` – single int parameter
+* `int,string` – two parameters: int and string
+* `int,list<int>,struct<id:int,name:string>` – three parameters: an int, a
list and a struct
+
+### Definition Version
+
+Each definition can evolve over time by introducing new versions.
+A `definition version` represents a specific implementation of that definition
at a given point in time.
+
+| Requirement | Field name | Type
| Description |
+|-------------|-------------------|----------------------------------------------------------|----------------------------------------------------------------|
+| *required* | `version-id` | `int`
| Monotonically increasing identifier of the definition version. |
+| *required* | `representations` | `list<representation>`
| [UDF implementations](#representation). |
+| *optional* | `deterministic` | `boolean` (default `false`)
| Whether the function is deterministic. |
+| *optional* | `on-null-input` | `string` (`"return-null"` or `"call"`,
default `"call"`) | Defines how the UDF behaves when any input parameter is
NULL. |
+| *required* | `timestamp-ms` | `long` (unix epoch millis)
| Creation timestamp of this version. |
+
+
+#### Null Input Handling
+`on-null-input` provides an optimization hint for query engines:
+1. If set to `return-null`, the function always returns `NULL` if any input
argument is `NULL`. This allows engines to
+ apply predicate pushdown or skip function evaluation for rows with `NULL`
inputs. For a function `f(x, y) = x + y`,
+ the engine can safely rewrite `WHERE f(a,b) > 0` as `WHERE a IS NOT NULL
AND b IS NOT NULL AND f(a,b) > 0`.
+2. If set to `call`, the function may handle `NULL`s internally (e.g.,
`COALESCE`, `NVL`, `IFNULL`), so the engine must
+ execute the function even if some inputs are `NULL`.
+
+### Representation
+Each representation is an object with at least one common field, `type`, that
is one of the following:
+* `sql`: a SQL expression that defines the function body
+
+Representations further define metadata for each type.
+
+#### SQL Representation
+
+A definition version can have multiple SQL representations of different
dialects, but only one SQL representation per dialect.
+The SQL representation stores the function body as a SQL expression, with
metadata such as the SQL dialect.
+
+| Requirement | Field name | Type | Description
|
+|-------------|------------|----------|------------------------------------------------------|
+| *required* | `type` | `string` | Must be `"sql"`
|
+| *required* | `dialect` | `string` | SQL dialect identifier (e.g.,
`"spark"`, `"trino"`). |
+| *required* | `sql` | `string` | SQL expression text.
|
+
+Notes:
+1. The `sql` must reference parameters using the names declared in the
definition's `parameters` field.
+
+### Definition Log
+| Requirement | Field name | Type
| Description
|
+|-------------|-----------------------|-----------------------------------------------------|-------------------------------------------------------------------------|
+| *required* | `timestamp-ms` | `long` (unix epoch millis)
| Timestamp when the function was updated to use the definition
versions. |
+| *required* | `definition-versions` |
`list<struct<definition-id:string,version-id:int>>` | Mapping of each
definition to its selected version at this time. |
+
+## Function Call Convention and Resolution in Engines
+Selecting the definition of a function to use is delegated to engines, which
may apply their own casting rules. However, engines should:
+
+1. Prefer exact parameter matches over safe (widening) or unsafe casts.
+2. Safely widen types as needed to avoid failing to find a matching definition.
+3. Require explicit casts for unsafe or non-obvious conversions.
+4. Use definitions with the same number of arguments as the input.
+5. Pass positional arguments in the same position as the input.
+6. Use definitions with the same set of field names as named input arguments.
+
+## Appendix A: Example – Overloaded Scalar Function
+
+SQL statement:
+```sql
+-- Trino SQL
+CREATE FUNCTION add_one(x INT COMMENT 'Input integer')
+COMMENT 'Add one to the input integer'
+RETURNS INT
+RETURN x + 1;
+
+-- Trino SQL
+CREATE FUNCTION add_one(x FLOAT COMMENT 'Input float')
+COMMENT 'Add one to the input float'
+RETURNS FLOAT
+RETURN x + 1.0;
+
+-- Spark SQL
+CREATE OR REPLACE FUNCTION add_one(x FLOAT)
+RETURNS FLOAT
+RETURN x + 1.0;
+```
+
+```json
+{
+ "function-uuid": "42fd3f91-bc10-41c1-8a52-92b57dd0a9b2",
+ "format-version": 1,
+ "definitions": [
+ {
+ "definition-id": "int",
+ "parameters": [
+ {
+ "name": "x", "type": "int", "doc": "Input integer"
+ }
+ ],
+ "return-type": "int",
+ "function-type": "udf",
+ "doc": "Add one to the input integer",
+ "versions": [
+ {
+ "version-id": 1,
+ "deterministic": true,
+ "representations": [
+ { "type": "sql", "dialect": "trino", "sql": "x + 2" }
+ ],
+ "timestamp-ms": 1734507000123
+ },
+ {
+ "version-id": 2,
+ "deterministic": true,
+ "representations": [
+ { "type": "sql", "dialect": "trino", "sql": "x + 1" },
+ { "type": "sql", "dialect": "spark", "sql": "x + 1" }
+ ],
+ "timestamp-ms": 1735507000124
+ }
+ ],
+ "current-version-id": 2
+ },
+ {
+ "definition-id": "float",
+ "parameters": [
+ {
+ "name": "x", "type": "float", "doc": "Input float"
+ }
+ ],
+ "return-type": "float",
+ "function-type": "udf",
+ "doc": "Add one to the input float",
+ "versions": [
+ {
+ "version-id": 1,
+ "deterministic": true,
+ "representations": [
+ { "type": "sql", "dialect": "trino", "sql": "x + 1.0" }
+ ],
+ "timestamp-ms": 1734507001123
+ }
+ ],
+ "current-version-id": 1
+ }
+ ],
+ "definition-log": [
+ {
+ "timestamp-ms": 1734507000123,
+ "definition-versions": [
+ { "definition-id": "int", "version-id": 1 }
+ ]
+ },
+ {
+ "timestamp-ms": 1734507001123,
+ "definition-versions": [
+ { "definition-id": "int", "version-id": 1 },
+ { "definition-id": "float", "version-id": 1 }
+ ]
+ },
+ {
+ "timestamp-ms": 1735507000124,
+ "definition-versions": [
+ { "definition-id": "int", "version-id": 2 },
+ { "definition-id": "float", "version-id": 1 }
+ ]
+ }
+ ],
+ "doc": "Overloaded scalar UDF for integer and float inputs",
+ "secure": false
+}
+```
+
+## Appendix B: UDTF Example
+SQL statement:
+
+```sql
+CREATE FUNCTION fruits_by_color(c VARCHAR COMMENT 'Color of fruits')
+ COMMENT 'Return fruits of specific color from fruits table'
+RETURNS TABLE (name VARCHAR, color VARCHAR)
+RETURN SELECT name, color FROM fruits WHERE color = c;
+```
+
+```json
+{
+ "function-uuid": "8a7fa39a-6d8f-4a2f-9d8d-3f3a8f3c2a10",
+ "format-version": 1,
+ "definitions": [
+ {
+ "definition-id": "string",
+ "parameters": [
+ {
+ "name": "c", "type": "string", "doc": "Color of fruits"
+ }
+ ],
+ "return-type": {
+ "type": "struct",
+ "fields": [
+ { "name": "name", "type": "string" },
+ { "name": "color", "type": "string" }
+ ]
+ },
+ "function-type": "udtf",
+ "doc": "Return fruits of a specific color from the fruits table",
+ "versions": [
+ {
+ "version-id": 1,
+ "deterministic": true,
+ "representations": [
+ { "type": "sql", "dialect": "trino", "sql": "SELECT name, color
FROM fruits WHERE color = c" },
+ { "type": "sql", "dialect": "spark", "sql": "SELECT name, color
FROM fruits WHERE color = c" }
+ ],
+ "timestamp-ms": 1734508000123
+ }
+ ],
+ "current-version-id": 1
+ }
+ ],
+ "definition-log": [
+ {
+ "timestamp-ms": 1734508000123,
+ "definition-versions": [
+ { "definition-id": "string", "version-id": 1 }
+ ]
+ }
+ ],
+ "doc": "UDTF returning (name, color) rows filtered by the given color",
+ "secure": false
+}
+```