flyrain commented on code in PR #14117:
URL: https://github.com/apache/iceberg/pull/14117#discussion_r2467821517
##########
format/udf-spec.md:
##########
@@ -0,0 +1,285 @@
+---
+title: "SQL UDF Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg SQL UDF Spec
+
+## Background and Motivation
+
+A SQL user-defined function (UDF or UDTF) is a callable routine that accepts input parameters and executes a function body.
+Depending on the function type, the result can be:
+
+- **Scalar functions (UDFs)** – return a scalar value, which may be a primitive type (e.g., `int`, `string`) or a non-primitive type (e.g., `struct`, `list`).
+- **Table functions (UDTFs)** – return a table, i.e., a table with zero or more rows and columns with a uniform schema.
+
+Many compute engines (e.g., Spark, Trino) already support UDFs, but in different and incompatible ways. Without a common
+standard, UDFs cannot be reliably shared across engines or reused in multi-engine environments.
+
+This specification introduces a standardized metadata format for UDFs in Iceberg.
+
+## Goals
+
+* Define a portable metadata format for both scalar and table SQL UDFs. The metadata is self-contained and can be moved across catalogs.
+* Support function evolution through versioning and rollback.
+* Provide consistent semantics for representing UDFs across engines.
+
+## Overview
+
+UDF metadata follows the same design principles as Iceberg table and view metadata: each function is represented by a
+**self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security,
+properties, and engine-specific representations.
+
+* Any modification (new overload, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata.
+* Each metadata file includes recent definition versions, enabling rollbacks without external state.
+
+## Specification
+
+### UDF Metadata
+The UDF metadata file has the following fields:
+
+| Requirement | Field name        | Type                   | Description |
+|-------------|-------------------|------------------------|-------------|
+| *required*  | `function-uuid`   | `string`               | A UUID that identifies the function, generated once at creation. |
+| *required*  | `format-version`  | `int`                  | Metadata format version (must be `1`). |
+| *required*  | `definitions`     | `list<overload>`       | List of function [overload](#overload) entities. |
+| *required*  | `definition-log`  | `list<definition-log>` | History of [definition snapshots](#definition-log). |
+| *required*  | `max-overload-id` | `long`                 | Highest `overload-id` currently assigned for this UDF. Used to allocate new overload identifiers monotonically. |
+| *optional*  | `location`        | `string`               | Storage location of metadata files. |
+| *optional*  | `properties`      | `map`                  | A string to string map of properties. |
+| *optional*  | `secure`          | `boolean`              | Whether it is a secure function. Default: `false`. |
+| *optional*  | `doc`             | `string`               | Documentation string. |
+
+Notes:
+1. When `secure` is `true`,
+   - Engines **SHOULD NOT** expose the function definition through any inspection (e.g., `SHOW FUNCTIONS`).
+   - Engines **SHOULD** ensure that execution does not leak sensitive information through any channels, such as error messages, logs, or query plans.
+
+### Overload
+
+Function overloads allow multiple implementations of the same function name with different signatures. Each overload has
+the following fields:
+
+| Requirement | Field name    | Type              | Description |
+|-------------|---------------|-------------------|-------------|
+| *required*  | `overload-id` | `long`            | Monotonically increasing identifier of this function overload. |
+| *required*  | `parameters`  | `list<parameter>` | Ordered list of [function parameters](#parameter). Invocation order **must** match this list. |

Review Comment:
That's a good question. I searched a bit: some engines like Spark[1] and Snowflake[2] support named parameter invocation, similar to how Python supports named arguments; Trino doesn't. Because of that, parameter names become part of the function's signature, which means that renaming parameters can break existing invocations. For example, a Spark app might call a function defined as `foo(int a, int b)` using named arguments: `foo(a => 1, b => 2)`. If the function definition later changes to `foo(int c, int d)`, the same call would fail because the parameter names no longer match (see the sketch below).

Implications for the spec:
- We may not be able to include a parameter name list in the SQL representation (discussed here: https://github.com/apache/iceberg/pull/14117#discussion_r2466553550), since representations are versioned and can change.
- Recreating a definition may break existing use cases when names change, e.g., deleting `foo(int a, int b)` and then creating `foo(int c, int d)`. There is probably no clean way to always keep names the same. I guess we could consider them different definitions. However, conflict checking between two definitions will still only honor types and order; for example, we shouldn't allow `foo(int c, int d)` and `foo(int a, int b)` to co-exist.

1. https://spark.apache.org/docs/latest/sql-ref-function-invocation.html
2. https://docs.snowflake.com/en/release-notes/bcr-bundles/2023_03/bcr-1017
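A minimal sketch of the renaming hazard, assuming Spark-style SQL UDF syntax (`CREATE FUNCTION ... RETURNS ... RETURN ...`) and the `=>` named-argument invocation described in [1]; the function `foo` and its body are made up for illustration:

```sql
-- Original definition: named-argument callers bind to the names `a` and `b`.
CREATE FUNCTION foo(a INT, b INT)
RETURNS INT
RETURN a + b;

SELECT foo(1, 2);           -- positional call: depends only on types and order
SELECT foo(a => 1, b => 2); -- named-argument call: also depends on the names

-- Recreate the "same" overload with renamed parameters.
DROP FUNCTION foo;
CREATE FUNCTION foo(c INT, d INT)
RETURNS INT
RETURN c + d;

SELECT foo(1, 2);           -- positional call still works
SELECT foo(a => 1, b => 2); -- fails: `a` and `b` are no longer valid parameter names
```

This is also why conflict checking on types and order alone would treat `foo(int a, int b)` and `foo(int c, int d)` as the same signature, even though named-argument callers can tell them apart.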
