[jira] [Comment Edited] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

Yaron Gvili (Jira) Thu, 16 Jun 2022 02:22:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554982#comment-17554982
 ]


Yaron Gvili edited comment on ARROW-16823 at 6/16/22 9:21 AM:
--------------------------------------------------------------

[~vibhatha], before I address your points, I think it would help if I describe 
my view of how nested registries would be used, in general and in the context 
of UDFs.

In general, a nested registry is created and passed to a new scope, which is 
free to modify it without affecting its parent registries. This can be thought 
of as passing-by-value, as long as the parent registries remain constant while 
the new scope is alive, and indeed this is the recommended way of using nested 
registries. Used this way, registry nesting has the following desirable 
properties:
 # Value-semantics: modifications are restricted to the passed "value".
 # Recursive: repeated nesting works as expected.
 # Thread-safety: a nested registry can be safely passed to a thread.
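
The properties above can be illustrated with a toy model. This is a minimal 
sketch in plain Python, not Arrow's API: the Registry class and its nested, 
add, get and has_local methods are hypothetical names chosen for illustration, 
and it assumes the parent stays constant while the child is in use.

```python
class Registry:
    """Toy registry; NOT Arrow's FunctionRegistry (all names hypothetical)."""

    def __init__(self, parent=None):
        self._parent = parent
        self._functions = {}

    def nested(self):
        # Create a child registry whose parent is this registry.
        return Registry(parent=self)

    def add(self, name, func):
        # Registration only touches this registry, never a parent.
        self._functions[name] = func

    def get(self, name):
        # Look in this registry first, then fall back to the parent chain.
        if name in self._functions:
            return self._functions[name]
        if self._parent is not None:
            return self._parent.get(name)
        raise KeyError(name)

    def has_local(self, name):
        # True only if the function was registered directly in this registry.
        return name in self._functions


global_registry = Registry()
global_registry.add("add_one", lambda x: x + 1)

scope = global_registry.nested()       # handed to a new scope "by value"
scope.add("double", lambda x: 2 * x)   # the modification stays in the scope

inner = scope.nested()                 # repeated nesting works the same way
result = inner.get("double")(inner.get("add_one")(3))
```

In this sketch the global registry never sees "double", while the innermost 
registry can resolve both functions through its parent chain.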

In the context of UDFs, a nested registry is created for temporarily 
registering UDFs for the lifetime of a separate scope in which they will be 
used. In a typical use case, this scope is for deserialization and execution of 
a Substrait plan. In this use case, one creates nested (function and 
extension-id) registries and uses them to deserialize a Substrait plan, 
register UDFs for this plan, and execute the plan, then drops the nested 
registries.
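
The create/use/drop sequence described above might look roughly like the 
following sketch. It is a toy model built on Python's standard library only: 
the "plan" is just a list of calls, and nested_registries, run_plan, 
GLOBAL_FUNCTIONS and GLOBAL_EXTENSION_IDS are hypothetical names, not part of 
Arrow or Substrait.

```python
from collections import ChainMap
from contextlib import contextmanager

# Stand-ins for the process-wide registries (hypothetical contents).
GLOBAL_FUNCTIONS = {"add": lambda a, b: a + b}
GLOBAL_EXTENSION_IDS = {"add": "builtin:add"}


@contextmanager
def nested_registries():
    """Yield fresh child registries that live only for the `with` block."""
    functions = ChainMap({}, GLOBAL_FUNCTIONS)
    extension_ids = ChainMap({}, GLOBAL_EXTENSION_IDS)
    try:
        yield functions, extension_ids
    finally:
        # Dropping the children discards every UDF registered in the scope.
        functions.maps[0].clear()
        extension_ids.maps[0].clear()


def run_plan(plan, udfs):
    """Toy 'register UDFs, then execute' sequence for a single plan."""
    with nested_registries() as (functions, extension_ids):
        for name, func in udfs.items():
            functions[name] = func               # stored in the child only
            extension_ids[name] = "udf:" + name  # stored in the child only
        # The toy plan is a list of (function name, arguments) calls.
        return [functions[name](*args) for name, args in plan]


plan = [("add", (1, 2)), ("triple", (5,))]
results = run_plan(plan, {"triple": lambda x: 3 * x})
```

Because the UDF is registered only in the child layer, no unregistration step 
is needed: the global stand-ins are unchanged once the scope ends.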

It is no accident that the above properties make nested registries powerful 
enough to cleanly support much more complex future use cases. I envision 
modular Substrait plans:
 * a Substrait plan can be shared (from author to its users)
 * shared Substrait plans can be gathered in libraries/modules
 * a Substrait plan can include invocations of other shared Substrait plans

and that they will become important for boosting user productivity with Arrow.

While this is my long-term vision, the current issue is about preparation for 
upcoming end-to-end Ibis/Ibis-Substrait/PyArrow support for Python-UDFs that 
I'm currently working on.

Now to your points.

> I think for general usage of UDFs we could also keep a temporary registry 
> which is in the scope of the application and it get destroyed when the 
> application ends it's life.

A single registry for UDFs would go against the design goal of modularity. It 
would also require support for unregistration, which is error-prone. See also 
the discussion in ARROW-16211.

> Thinking about a simple example to reflect the usage.

This example is actually more complex than the 
single-Substrait-plan-with-UDFs one that I described above.

> Visually GFR->TF1, GFR->TF2 or GFR->TF1->TF2 right?

I think the right organization for your example is that each nested registry 
has the global one as its parent. Each of the 3 stages has its own set of UDFs 
to register.

> What if TF1 destroyed, that means TF2 get detached from the GFR, are we going 
> to correct that relationship when we remove TF1. Are we planning to handle 
> this or is this irrelevant?

When following the recommended way of using nested registries that I described 
above, even in a case of repeated nesting like GFR->TF1->TF2, it is incorrect 
to even modify, let alone drop, TF1 while TF2 is alive.
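
One way to see why, under a toy model: if TF1 changes while TF2 is alive, what 
TF2 resolves can change underneath it, which breaks the pass-by-value picture. 
The sketch below uses Python's collections.ChainMap as a stand-in for the 
GFR->TF1->TF2 chain; the dictionaries and their contents are hypothetical and 
this is not Arrow code.

```python
from collections import ChainMap

gfr = {"f": "GFR version"}
tf1 = {}
tf2 = {}

# What a scope holding TF2 sees: TF2 layered over TF1 layered over GFR.
view_from_tf2 = ChainMap(tf2, tf1, gfr)

seen_before = view_from_tf2["f"]
tf1["f"] = "TF1 version"   # the parent is modified while TF2 is alive
seen_after = view_from_tf2["f"]

# TF2 itself was never touched, yet the function it resolves has changed.
changed_under_tf2 = seen_before != seen_after
```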

> Considering the practical usage, I assume what should happen is, when I ask 
> for function `f1` to be called, it should scan through the global, then go 
> level by level on the scoped and retrieve the function once located. Is this 
> right?

It's the other way around. In the case of GFR->TF1->TF2, the function is first 
looked up in TF2, then in TF1, and finally in GFR. This way, modifications to 
TF2 take precedence, which is what one expects from value-semantics.
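
This lookup order can be mimicked with Python's collections.ChainMap, which 
searches its mappings from left to right and applies writes to the first one 
only. This is only an analogy for the intended behavior, not Arrow code; the 
dictionary contents are made up for illustration.

```python
from collections import ChainMap

gfr = {"f1": "f1 from GFR", "f2": "f2 from GFR"}
tf1 = {"f1": "f1 from TF1"}
tf2 = {}

# Search order: TF2 first, then TF1, and finally GFR.
chain = ChainMap(tf2, tf1, gfr)

before = chain["f1"]           # found in TF1, which shadows GFR
chain["f1"] = "f1 from TF2"    # the write lands in TF2 only
after = chain["f1"]            # TF2 now takes precedence
fallback = chain["f2"]         # only GFR defines f2
```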

>  For Python UDF users or R UDF users, do we have to do anything special where 
> we expose the FunctionRegistry (I guess we don't have to, but curious)...

Eventually, the end-user should typically just invoke a single function to 
execute a Substrait plan. If the Substrait plan has UDFs, their registration 
into fresh nested registries will be automated (I have this worked out locally 
for Python-UDFs). The facilities we discuss here are for developers and should 
eventually be hidden from the end-user.

> In addition, I have this general question, depending on the usage, should we 
> keep a separate temporary function registry for Substrait UDF users, plain 
> UDF users (directly using Arrow), in future there could be similar cases 
> where we need to support...

As described above, the recommended way is to create nested registries for a 
scope, not for a class-of-use (like Substrait-UDF-use and plain-UDF-use).

> Diving a little deep into the parallel case, we are going to have separate 
> scoped registry for each instance. I would say that is efficient for 
> communication and there is no sync issues. May be the intended use is 
> multiple plans with non-overlapping functions? ...

A thread is a separate scope, and if it needs to modify registries then it will 
be passed fresh nested registries (or create them itself as its first step) 
that it can freely modify. For example, this need arises when there are 
multiple threads, each processing a Substrait plan with its own UDFs. The 
parent registries will be kept constant while the threads are working. Since 
the parent registries are reused, so is their memory, hence the extra 
registration memory cost is only due to the UDFs registered in the nested 
registries. Even in a case with 1000 threads, it is still possible to minimize 
the extra memory required, e.g., when all threads share nested registries that 
were set up once before the threads start using them in a read-only manner.
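
The per-thread case can be sketched as follows. This is a toy model in plain 
Python, not Arrow code: PARENT, run_plan and the UDFs are hypothetical, and the 
"plan" is reduced to applying the shared function followed by the thread's own 
UDF.

```python
from collections import ChainMap
from concurrent.futures import ThreadPoolExecutor

# Shared parent registry: kept constant while the workers run.
PARENT = {"square": lambda x: x * x}


def run_plan(udf_name, udf, value):
    # Each worker layers its own fresh child over the shared parent.
    registry = ChainMap({}, PARENT)
    registry[udf_name] = udf   # stored in this worker's child only
    return registry[udf_name](registry["square"](value))


jobs = [
    ("plus_one", lambda x: x + 1, 3),
    ("negate", lambda x: -x, 4),
]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda job: run_plan(*job), jobs))
```

Only the small per-worker child dictionaries are extra; the parent mapping is 
shared by reference and is only read by the workers.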


> [C++] Arrow Substrait enhancements for UDF
> ------------------------------------------
>
>                 Key: ARROW-16823
>                 URL: https://issues.apache.org/jira/browse/ARROW-16823
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yaron Gvili
>            Assignee: Yaron Gvili
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> The enhancements include support for:
>  * user-provided extension-id-registries and function-registries (for scoped 
> registries)
>  * registering a function (with an Id) external to the plan
>  * a dataset-write-sink (for convenience and multiple outputting)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
