rdblue commented on code in PR #16652:
URL: https://github.com/apache/iceberg/pull/16652#discussion_r3399660540

##########
format/expressions-spec.md:
##########
@@ -0,0 +1,284 @@
+---
+title: "Expressions Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg Expressions
+
+This document defines the structure and behavior of expressions for use in Iceberg specifications. The purpose is to define a common structure that enables simple expressions to be stored and exchanged.
+
+Stored expressions are needed for use cases like data validations (`CHECK` constraints) and default values (for instance, `current_timestamp()`). Expressions are exchanged in use cases like server-side scan planning in the catalog protocol.
+
+
+## Overview
+
+The goal of this specification is to define a simple expression structure and avoid complexity.
+
+To remain simple, the expressions that can be represented are deliberately constrained. Value expressions are constants, field references, or function calls with value expression arguments. Predicates are comparisons of value expressions that produce true or false.
+
+This approach is intended to keep focus on the logical structure of expressions.
Complexity is pushed to the functions that are called, which can be a limited set of well-defined and portable functions (like Iceberg partition transforms) or could be user-defined functions that can use the full range of SQL capabilities. Multi-dialect UDFs are responsible for any SQL constructs that are specific to an engine, rather than importing and duplicating dialects in Iceberg expressions.
+
+This is consistent with Iceberg's conservative approach in other specs. Expressions and predicates are an important part of Iceberg implementation APIs, but have been deliberately limited in specifications. For example, sort orders and partition fields are strictly limited to a small set of transforms over well-defined inputs (source field IDs). This spec is widening what can be expressed, but depends on function calls for complex tasks.
+
+This specification covers the structure of Iceberg expressions and includes appendices that specify serialization as JSON and a set of portable functions defined by Iceberg specifications.
+
+
+## Structure
+
+Iceberg expressions have two types:
+
+* **Value expressions** represent data values and transformations of values (function calls) that produce any Iceberg type
+* **Predicates** represent comparisons of value expressions and boolean logic that produce `true` or `false`
+
+
+### Value expressions
+
+A value expression is an expression that produces a typed value.
+
+Value expressions can be one of three types: a constant value, a field reference, or a function applied to zero or more value expressions.
+
+
+#### Constant values
+
+A constant or literal is the simplest type of value expression that represents a specific typed value.
+
+
+#### Field reference
+
+A field reference represents the value of a specific field in a row. When an expression is evaluated on a row, it returns the value of the field.
+
+Field references may be named references (unbound) or ID references (bound).
ID references identify a field by field ID from a schema. Named references identify a field by name that must be resolved to an ID (bound to a schema) to access the field.
+
+ID references are used for stored expressions, where the identity of the column is determined when the stored expression is created. For example, column constraints are tied to field ID so that renaming a column does not drop its stored constraint.
+
+Named references are used when the identity of the column is determined when the expression is evaluated. For example, query filters are resolved each time a query runs, so server-side planning uses unbound named references.
+
+The context in which an expression is used determines the type of references that are valid. Iceberg specifications should document whether ID references, named references, or both are allowed.
+
+
+#### Apply function
+
+An apply expression represents the result of a function applied to (or called on) zero or more values produced by child value expressions.
+
+Functions are identified by catalog, namespace, and name.
+
+* Function name is always required
+* Namespace is optional and is assumed to be empty ([]) if it is not present or is null
+* Catalog is optional and is assumed to be the catalog in which the referencing object is stored if it is not present or is null
+
+The catalog name is used to identify the catalog where the function definition can be loaded or it identifies a reserved function set. As in the view and UDF specs, catalog names represent connection configurations that may differ across environments. Omitting catalog names is recommended to avoid depending on consistent environments. For example, if a table has a CHECK constraint that references a UDF without a catalog name (missing or null), the UDF should be loaded from the table’s catalog.

Review Comment:
I was trying to avoid calling reserved function sets "catalogs".
Maybe I could change it to "Reserved function sets are identified by fixed catalog names:"?
