Hi everyone, I've updated the design document[1] based on the previous comments. Additionally, I've included the SQL UDF syntax supported by various vendors, including Dremio, Snowflake, Databricks, and Trino.
I'm happy to schedule a separate sync if a deeper discussion is needed. Let's keep moving forward, especially with the renewed interest from the community.

[1] https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing

On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat <ajanthab...@gmail.com> wrote: > Hey everyone, > > During the last catalog community sync, there was significant interest in > storing UDFs in Iceberg and adding endpoints for UDF handling in the REST > catalog spec. > > I recently discussed this with Yufei to better understand the new > requirement of using UDFs for fine-grained access control policies. This > expands the use cases beyond just versioned and interoperable UDFs. > Additionally, I learnt that many vendors are interested in this feature. > > Given the strong community interest and support, I’d like to take > ownership of this effort and revive the work. I'll be revisiting the > document I proposed long back and will share an updated proposal by next > week. > > Looking forward to storing UDFs in Iceberg! > - Ajantha > > On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov > <dmitri.bourlatch...@dremio.com.invalid> wrote: > >> The UDF spec does not require representations to be SQL. It merely does >> not specify (in this revision) how other representations are to be written. >> >> This seems like an easy extension (adding a new type in the >> "Representations" section). >> >> Cheers, >> Dmitri. >> >> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue <b...@databricks.com.invalid> >> wrote: >> >>> Right now, SQL is an explicit requirement of the spec. It leaves a way >>> for future versions to add different representations later, but only SQL is >>> supported. That was also the feedback to my initial skepticism about how it >>> would work to add functions.
>>> >>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri Bourlatchkov >>> <dmitri.bourlatch...@dremio.com.invalid> wrote: >>> >>>> I do not think the spec is meant to allow only SQL representations, >>>> although it is certainly favoring SQL in examples... It would be nice to >>>> add a non-SQL example, indeed. >>>> >>>> Cheers, >>>> Dmitri. >>>> >>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong <fo...@apache.org> >>>> wrote: >>>> >>>>> Coming from PyIceberg, I have concerns as this proposal focuses on >>>>> SQL-based engines, while Python-based systems often work with data frames. >>>>> Adding imperative languages like Python would make this proposal more >>>>> inclusive. >>>>> >>>>> Kind regards, >>>>> Fokko >>>>> >>>>> >>>>> >>>>> On Thu, 8 Aug 2024 at 10:27, Piotr Findeisen < >>>>> piotr.findei...@gmail.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Walaa, thanks for asking! >>>>>> In the design doc linked before in this thread [1] I read >>>>>> "Without a common standard, the UDFs are hard to share among >>>>>> different engines." >>>>>> ("Background and Motivation" section). >>>>>> I agree with this statement. I don't fully understand yet how the >>>>>> proposed design addresses shareability between the engines though. >>>>>> I could use some help to understand this better. >>>>>> >>>>>> Best >>>>>> Piotr >>>>>> >>>>>> >>>>>> >>>>>> [1] SQL User-Defined Function Spec >>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc >>>>>> >>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa < >>>>>> wa.moust...@gmail.com> wrote: >>>>>> >>>>>>> Piotr, what do you mean by making user-created functions shareable >>>>>>> between engines? Do you mean UDFs written in imperative code? >>>>>>> >>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen >>>>>>> <piotr.findei...@gmail.com> wrote: >>>>>>> > >>>>>>> > Hi, >>>>>>> > >>>>>>> > Thank you Ajantha for creating this thread. The Iceberg UDFs are >>>>>>> an interesting idea!
>>>>>>> > Is there a plan to make the user-created functions shareable >>>>>>> between the engines? >>>>>>> > If so, what would a CREATE FUNCTION statement look like in, e.g., >>>>>>> Spark or Trino? >>>>>>> > >>>>>>> > Meanwhile, added a few comments in the doc. >>>>>>> > >>>>>>> > Best >>>>>>> > Piotr >>>>>>> > >>>>>>> > >>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue <b...@databricks.com.invalid> >>>>>>> wrote: >>>>>>> >> >>>>>>> >> I just looked through the proposal and added comments. I think it >>>>>>> would be helpful to also have a design doc that covers the choices from >>>>>>> the >>>>>>> draft spec. For instance, the choice to enumerate all possible function >>>>>>> input structs rather than allowing generics and varargs. >>>>>>> >> >>>>>>> >> Here’s a quick summary of my feedback: >>>>>>> >> >>>>>>> >> I think that the choice to enumerate function signatures is >>>>>>> limiting. It would be nice to see a discussion of the trade-offs and a >>>>>>> rationale for the choice. I think it would also be very helpful to have >>>>>>> a >>>>>>> few representative use cases for this included in the doc. That way the >>>>>>> proposal can demonstrate that it solves those use cases with reasonable >>>>>>> trade-offs. >>>>>>> >> There are a few instances where this is inconsistent with >>>>>>> conventions in other specs. For example, using string IDs rather than an >>>>>>> integer. >>>>>>> >> This uses a very different model for spec versioning than the >>>>>>> Iceberg view and table specs. It requires readers to fail if there are >>>>>>> any >>>>>>> unknown fields, which prevents the spec from adding things that are >>>>>>> fully >>>>>>> backward-compatible. Other Iceberg specs only require a version change >>>>>>> to >>>>>>> introduce forward-incompatible changes and I think that this should do >>>>>>> the >>>>>>> same to avoid confusion.
>>>>>>> >> It looks like the intent is to allow multiple function signatures >>>>>>> per version, but it is unclear how to encode them because a version is >>>>>>> associated with a single function signature. >>>>>>> >> There is no review of SQL syntax for creating functions across >>>>>>> engines, so this doesn’t show that the metadata proposed is sufficient >>>>>>> for >>>>>>> cross-engine use cases. >>>>>>> >> The example for a table-valued function shows a SELECT statement >>>>>>> and it isn’t clear how this is distinct from a view. >>>>>>> >> >>>>>>> >> >>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat < >>>>>>> ajanthab...@gmail.com> wrote: >>>>>>> >>> >>>>>>> >>> Thanks Walaa and Robert for the review on this. >>>>>>> >>> >>>>>>> >>> We didn't find any blocker for the spec. >>>>>>> >>> I will wait for a week and if there are no more review comments, I will >>>>>>> raise a PR for spec addition next week. >>>>>>> >>> >>>>>>> >>> If anyone else is interested, please have a look at the proposal >>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>>>>> >>> >>>>>>> >>> - Ajantha >>>>>>> >>> >>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa < >>>>>>> wa.moust...@gmail.com> wrote: >>>>>>> >>>> >>>>>>> >>>> Hi Ajantha, >>>>>>> >>>> >>>>>>> >>>> I have left some comments. It is an interesting direction, but >>>>>>> there might be some details that need to be fine tuned. >>>>>>> >>>> >>>>>>> >>>> The doc is here [1] for others who might be interested. >>>>>>> Resharing since I do not think it was directly linked in the thread. >>>>>>> >>>> >>>>>>> >>>> [1] >>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>>>>> >>>> >>>>>>> >>>> Thanks, >>>>>>> >>>> Walaa. >>>>>>> >>>> >>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat < >>>>>>> ajanthab...@gmail.com> wrote: >>>>>>> >>>>> >>>>>>> >>>>> Hi, just another reminder since we didn't get any review on >>>>>>> the proposal.
>>>>>>> >>>>> Initially proposed on June 4. >>>>>>> >>>>> >>>>>>> >>>>> - Ajantha >>>>>>> >>>>> >>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat < >>>>>>> ajanthab...@gmail.com> wrote: >>>>>>> >>>>>> >>>>>>> >>>>>> Hi everyone, >>>>>>> >>>>>> >>>>>>> >>>>>> We've only received one review so far (from Benny). >>>>>>> >>>>>> >>>>>>> >>>>>> We would appreciate more eyes on this. >>>>>>> >>>>>> >>>>>>> >>>>>> - Ajantha >>>>>>> >>>>>> >>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat < >>>>>>> ajanthab...@gmail.com> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi All, >>>>>>> >>>>>>> Please find the proposal link >>>>>>> >>>>>>> https://github.com/apache/iceberg/issues/10432 >>>>>>> >>>>>>> >>>>>>> >>>>>>> Google doc link is attached in the proposal. >>>>>>> >>>>>>> And Thanks Stephen Lin for working on it. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hope it gives more clarity to take the decisions and how we >>>>>>> want to implement it. >>>>>>> >>>>>>> >>>>>>> >>>>>>> - Ajantha >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa < >>>>>>> wa.moust...@gmail.com> wrote: >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user >>>>>>> defined functions. Here are some examples of what I meant in (2): >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> Hive GenericUDF: >>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java >>>>>>> >>>>>>>> Trino user defined functions: >>>>>>> https://trino.io/docs/current/develop/functions.html >>>>>>> >>>>>>>> Flink user defined functions: >>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/ >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> Probably what you referred to is a variation of (1) where >>>>>>> the API is data flow/data pipeline API instead of SQL (e.g., Spark >>>>>>> Scala). 
>>>>>>> Yes, that is also possible in the very long run :) >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> Thanks, >>>>>>> >>>>>>>> Walaa. >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye < >>>>>>> yezhao...@gmail.com> wrote: >>>>>>> >>>>>>>>> >>>>>>> >>>>>>>>> > (2) Custom code written in imperative function according >>>>>>> to a Java/Scala/Python API, etc. >>>>>>> >>>>>>>>> >>>>>>> >>>>>>>>> I think we could still explore some long term >>>>>>> opportunities in this case. Consider you register a Spark temp view as >>>>>>> some >>>>>>> sort of data frame read, then it could still be resolved to a Spark plan >>>>>>> that is representable by an intermediate representation. But I agree >>>>>>> this >>>>>>> gets very complicated very soon, and just having the case (1) covered >>>>>>> would >>>>>>> already be a huge step forward. >>>>>>> >>>>>>>>> >>>>>>> >>>>>>>>> -Jack >>>>>>> >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow < >>>>>>> btc...@gmail.com> wrote: >>>>>>> >>>>>>>>>> >>>>>>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF can be >>>>>>> used to build a parameterized view. So, there's definitely a lot in >>>>>>> common >>>>>>> between UDFs and views. >>>>>>> >>>>>>>>>> >>>>>>> >>>>>>>>>> Thanks >>>>>>> >>>>>>>>>> >>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa < >>>>>>> wa.moust...@gmail.com> wrote: >>>>>>> >>>>>>>>>>> >>>>>>> >>>>>>>>>>> I think there is a disconnect about what is perceived as >>>>>>> a "UDF". There are 2 flavors: >>>>>>> >>>>>>>>>>> >>>>>>> >>>>>>>>>>> (1) Functions that are defined by the user whose >>>>>>> definition is a composition of other built-in functions/SQL expressions. >>>>>>> >>>>>>>>>>> (2) Custom code written in imperative function according >>>>>>> to a Java/Scala/Python API, etc. 
>>>>>>> >>>>>>>>>>> >>>>>>> >>>>>>>>>>> All the examples in Ajantha's references are pretty much >>>>>>> from (1) and I think those have more analogy to views due to their SQL >>>>>>> nature. Agree (2) is not practical to maintain by Iceberg, but I think >>>>>>> Ajantha's use cases are around (1), and may be worth evaluating. >>>>>>> >>>>>>>>>>> >>>>>>> >>>>>>>>>>> Thanks, >>>>>>> >>>>>>>>>>> Walaa. >>>>>>> >>>>>>>>>>> >>>>>>> >>>>>>>>>>> >>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat < >>>>>>> ajanthab...@gmail.com> wrote: >>>>>>> >>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>> I guess we'll know more when you post the proposal, >>>>>>> but I think this would be a very difficult area to tackle across >>>>>>> engines, >>>>>>> languages, and memory models without having a huge performance penalty. >>>>>>> >>>>>>>>>>>> >>>>>>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL representations >>>>>>> of UDFs (similar to views as shared by the reference links above), the >>>>>>> complexity involved will be similar to managing views. >>>>>>> >>>>>>>>>>>> >>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input. >>>>>>> >>>>>>>>>>>> We will work on publishing the draft spec (inspired by >>>>>>> the view spec) this week to facilitate further discussions. >>>>>>> >>>>>>>>>>>> >>>>>>> >>>>>>>>>>>> - Ajantha >>>>>>> >>>>>>>>>>>> >>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye < >>>>>>> yezhao...@gmail.com> wrote: >>>>>>> >>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>> > While it would be great to have a common set of >>>>>>> functions across engines, I don't see how that is practical when those >>>>>>> engines are implemented so differently. Plugging in code -- and >>>>>>> especially >>>>>>> custom user-supplied code -- seems inherently specialized to me and >>>>>>> should >>>>>>> be part of the engines' design. >>>>>>> >>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>> How is this different from the views? 
I feel we can >>>>>>> say exactly the same thing for Iceberg views, and yet we have Iceberg >>>>>>> multi-dialect views implemented. Maybe it sounds like we are trying to >>>>>>> draw >>>>>>> a line between SQL vs other programming language as "code"? But I think >>>>>>> SQL >>>>>>> is just another type of code, and we are already talking about compiling >>>>>>> all these different code dialects to an intermediate representation >>>>>>> (using >>>>>>> projects like Coral, Substrait), which will be stored as another type of >>>>>>> representation of Iceberg view. I think the same functionality can be >>>>>>> used >>>>>>> for UDFs if developed. >>>>>>> >>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>> I actually think adding UDF support is a good idea, >>>>>>> even just a multi-dialect one like view, and that can allow engines to >>>>>>> for >>>>>>> example parse a view SQL, and when a function referenced cannot be >>>>>>> resolved, try to look for a multi-dialect UDF definition. >>>>>>> >>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>> I guess we can discuss more when we have the actual >>>>>>> proposal published. >>>>>>> >>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>> Best, >>>>>>> >>>>>>>>>>>>> Jack Ye >>>>>>> >>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp < >>>>>>> sn...@snazy.de> wrote: >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and >>>>>>> "non-centralized" as views are. The same performance concerns apply to >>>>>>> views as well. >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> Iceberg should define a common base upon which >>>>>>> engines can build, so the argument that UDFs aren't practical, because >>>>>>> engines are different, is probably only a temporary concern.
>>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle >>>>>>> the idea to make views portable, which is conceptually not that much >>>>>>> different from portable UDFs. >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the >>>>>>> idea of having UDFs in Iceberg, especially not in this early stage. >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote: >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha. >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add >>>>>>> UDFs tracked by Iceberg catalogs. I think that Iceberg primarily deals >>>>>>> with >>>>>>> things that are centralized, like tables of data. While it would be >>>>>>> great >>>>>>> to have a common set of functions across engines, I don't see how that >>>>>>> is >>>>>>> practical when those engines are implemented so differently. Plugging in >>>>>>> code -- and especially custom user-supplied code -- seems inherently >>>>>>> specialized to me and should be part of the engines' design. >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you post the proposal, >>>>>>> but I think this would be a very difficult area to tackle across >>>>>>> engines, >>>>>>> languages, and memory models without having a huge performance penalty. >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> Ryan >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat < >>>>>>> ajanthab...@gmail.com> wrote: >>>>>>> >>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>> Hi Everyone, >>>>>>> >>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge the community interest >>>>>>> in storing the Versioned SQL UDFs in Iceberg. >>>>>>> >>>>>>>>>>>>>>> We want to propose the spec addition for storing the >>>>>>> versioned UDFs in Iceberg (inspired by view spec). 
>>>>>>> >>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in that >>>>>>> they are associated with tables, but they can accept arguments and >>>>>>> produce >>>>>>> return values, or even function as inline expressions. >>>>>>> >>>>>>>>>>>>>>> Many query engines like Dremio, Trino, Snowflake, and >>>>>>> Databricks Spark support SQL UDFs at catalog level [1]. >>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable >>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs. >>>>>>> >>>>>>>>>>>>>>> - Interoperability between the engines. Potentially >>>>>>> engines can understand the UDFs written by other engines (with a >>>>>>> translation layer). >>>>>>> >>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>> We believe that integrating this feature into >>>>>>> Iceberg would be a valuable addition, and we're eager to collaborate >>>>>>> with >>>>>>> the community to develop a UDF specification. >>>>>>> >>>>>>>>>>>>>>> Stephen has already begun drafting a specification >>>>>>> to propose to the community. >>>>>>> >>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>> >>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>> >>>>>>>>>>>>>>> Dremio - >>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function >>>>>>> >>>>>>>>>>>>>>> Trino - >>>>>>> https://trino.io/docs/current/sql/create-function.html >>>>>>> >>>>>>>>>>>>>>> Snowflake - >>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions >>>>>>> >>>>>>>>>>>>>>> Databricks - >>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html >>>>>>> >>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>> - Ajantha >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> -- >>>>>>> >>>>>>>>>>>>>> Ryan Blue >>>>>>> >>>>>>>>>>>>>> Tabular >>>>>>> >>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>> -- >>>>>>> >>>>>>>>>>>>>> Robert Stupp >>>>>>> >>>>>>>>>>>>>> @snazy >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> -- >>>>>>> >> Ryan Blue >>>>>>> >> Databricks >>>>>>> >>>>>> >>> >>> -- >>> Ryan Blue >>> Databricks >>> >>