[
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554982#comment-17554982
]
Yaron Gvili edited comment on ARROW-16823 at 6/16/22 9:21 AM:
--------------------------------------------------------------
[~vibhatha], before I address your points, I think it would help if I describe
my view of how nested registries would be used, both in general and in the
context of UDFs.
In general, a nested registry is created and passed to a new scope which is
free to modify it without affecting its parent registries. This can be thought
of as passing-by-value, as long as parent registries remain constant while the
new scope is alive, and indeed this is the recommended way of using nested
registries. With this way of use, registry nesting has the following desirable
properties:
# Value-semantics: modifications are restricted to the passed "value".
# Recursive: repeated nesting works as expected.
# Thread-safety: a nested registry can be safely passed to a thread.
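To make the first two properties concrete, here is a minimal sketch in Python. It is a toy model only, not Arrow's API: the `ToyRegistry` class and its `add_function`/`get_function` methods are hypothetical names chosen for illustration.

```python
class ToyRegistry:
    """Hypothetical stand-in for a function registry (not Arrow's API)."""

    def __init__(self, parent=None):
        self._parent = parent
        self._functions = {}

    def add_function(self, name, func):
        # Registration only ever touches this registry, never its parent.
        self._functions[name] = func

    def get_function(self, name):
        # Look locally first, then defer to the chain of parents.
        if name in self._functions:
            return self._functions[name]
        if self._parent is not None:
            return self._parent.get_function(name)
        raise KeyError(name)


global_registry = ToyRegistry()
global_registry.add_function("add", lambda a, b: a + b)

# Value-semantics: the new scope modifies only its own nested registry.
nested = ToyRegistry(parent=global_registry)
nested.add_function("my_udf", lambda x: x * 2)

# Recursive: a nested registry can itself be a parent.
deeper = ToyRegistry(parent=nested)
```

In this sketch `deeper` can resolve both `add` and `my_udf`, while `global_registry` never learns about `my_udf`.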
In the context of UDFs, a nested registry is created for temporarily
registering UDFs for the lifetime of a separate scope in which they will be
used. In a typical use case, this scope is for deserialization and execution of
a Substrait plan. In this use case, one creates nested (function and
extension-id) registries and uses them to deserialize a Substrait plan,
register UDFs for this plan, and execute the plan, then drops the nested
registries.
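A rough sketch of this lifecycle, again as a toy Python model rather than the real Arrow or Substrait API: the standard-library `collections.ChainMap` stands in for a function registry, only the function registry is modeled (not the extension-id one), and `nested_scope`, `run_plan`, and the dictionary-shaped "plan" are hypothetical.

```python
from collections import ChainMap
from contextlib import contextmanager


@contextmanager
def nested_scope(parent):
    # A fresh, empty layer on top of the parent; writes go to this layer only.
    nested = parent.new_child()
    try:
        yield nested
    finally:
        # Dropping the nested registry discards the UDFs registered in it.
        nested.maps[0].clear()


def run_plan(plan, function_registry):
    with nested_scope(function_registry) as scoped:
        # Register the plan's UDFs in the nested registry.
        for name, func in plan["udfs"].items():
            scoped[name] = func
        # "Execute" the plan: resolve and call each function by name.
        return [scoped[name](*args) for name, args in plan["calls"]]


global_functions = ChainMap({"add": lambda a, b: a + b})
plan = {
    "udfs": {"twice": lambda x: 2 * x},
    "calls": [("add", (1, 2)), ("twice", (5,))],
}
results = run_plan(plan, global_functions)
```

After `run_plan` returns, `global_functions` is unchanged, so no unregistration step is needed.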
It is no accident that the above properties make nested registries powerful
enough to cleanly support much more complex future use cases. I envision
modular Substrait plans:
* a Substrait plan can be shared (from author to its users)
* shared Substrait plans can be gathered in libraries/modules
* a Substrait plan can include invocations of other shared Substrait plans
I expect such modular plans to become important for boosting user productivity
with Arrow.
While this is my long-term vision, the current issue is about preparation for
upcoming end-to-end Ibis/Ibis-Substrait/PyArrow support for Python-UDFs that
I'm currently working on.
Now to your points.
> I think for general usage of UDFs we could also keep a temporary registry
> which is in the scope of the application and it get destroyed when the
> application ends it's life.
A single registry for UDFs would go against the design goal of modularity. It
would also require support for unregistration, which is error-prone. See also
the discussion in ARROW-16211.
> Thinking about a simple example to reflect the usage.
This example is actually more complex than the
single-Substrait-plan-with-UDFs one that I described above.
> Visually GFR->TF1, GFR->TF2 or GFR->TF1->TF2 right?
I think the right organization for your example is that each nested registry
has the global one as its parent. Each of the 3 stages has its own set of UDFs
to register.
> What if TF1 destroyed, that means TF2 get detached from the GFR, are we going
> to correct that relationship when we remove TF1. Are we planning to handle
> this or is this irrelevant?
When following the recommended way of using nested registries that I described
above, even in a case of repeated nesting like GFR->TF1->TF2, it is incorrect
to even modify, let alone drop, TF1 while TF2 is alive.
> Considering the practical usage, I assume what should happen is, when I ask
> for function `f1` to be called, it should scan through the global, then go
> level by level on the scoped and retrieve the function once located. Is this
> right?
It's the other way around. In the case of GFR->TF1->TF2, the function is first
looked up in TF2, then in TF1, and finally in GFR. This way, modifications to
TF2 take precedence, which is what one expects from value-semantics.
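The lookup order can be illustrated with Python's standard-library `collections.ChainMap`, which searches its first mapping before the later ones. This is only an analogy for the registry chain, not how Arrow implements it; the names `gfr`, `tf1`, `tf2` follow your example, and the string values are placeholders for functions.

```python
from collections import ChainMap

# GFR -> TF1 -> TF2, where each new_child() adds a layer searched first.
gfr = ChainMap({"f1": "gfr.f1", "f2": "gfr.f2"})
tf1 = gfr.new_child({"f1": "tf1.f1"})
tf2 = tf1.new_child({"f3": "tf2.f3"})

found_f1 = tf2["f1"]  # not in TF2, found in TF1 before reaching GFR
found_f2 = tf2["f2"]  # not in TF2 or TF1, found in GFR

# A modification to TF2 takes precedence for lookups through TF2 ...
tf2["f2"] = "tf2.f2"
shadowed_f2 = tf2["f2"]
# ... while the parents keep their own definitions.
parent_f2 = gfr["f2"]
```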
> For Python UDF users or R UDF users, do we have to do anything special where
> we expose the FunctionRegistry (I guess we don't have to, but curious)...
Eventually, the end-user should typically just invoke a single function to
execute a Substrait plan. If the Substrait plan has UDFs, their registration
into fresh nested registries will be automated (I have this locally worked out
for Python-UDFs). The facilities we discuss here are for developers and should
eventually be encapsulated from the end-user.
> In addition, I have this general question, depending on the usage, should we
> keep a separate temporary function registry for Substrait UDF users, plain
> UDF users (directly using Arrow), in future there could be similar cases
> where we need to support...
As described above, the recommended way is to create nested registries for a
scope, not for a class-of-use (like Substrait-UDF-use and plain-UDF-use).
> Diving a little deep into the parallel case, we are going to have separate
> scoped registry for each instance. I would say that is efficient for
> communication and there is no sync issues. May be the intended use is
> multiple plans with non-overlapping functions? ...
A thread is a separate scope, and if it needs to modify registries then it will
be passed fresh nested registries (or will create them itself as its first
step) that it can freely modify. For example, this need arises when there are
multiple threads, each processing a Substrait plan with its own UDFs. The parent
registries will be kept constant while the threads are working. Since the
parent registries are reused, so is their memory, hence the extra
registration memory cost is only due to the UDFs registered in the nested
registries. Even in a case with 1000 threads, it is still possible to minimize
the extra memory required, e.g., when all threads share nested registries that
were set up once before they start using them in a read-only manner.
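A small sketch of the multi-threaded case, under the same assumptions as before: `collections.ChainMap` is a toy stand-in for a registry, and `worker`, `parent`, and `results` are hypothetical names. Each thread creates its own nested layer and registers a UDF under the same name without any synchronization on the parent, which is only read.

```python
import threading
from collections import ChainMap

# Shared parent registry, kept constant while the threads are running.
parent = ChainMap({"add": lambda a, b: a + b})
results = {}


def worker(plan_id, factor):
    # Fresh nested registry per thread; only this thread modifies it.
    scoped = parent.new_child()
    # Same UDF name in every thread, but a thread-local definition.
    scoped["scale"] = lambda x: x * factor
    # Each thread writes to its own key, so the writes do not collide.
    results[plan_id] = scoped["scale"](scoped["add"](1, 2))


threads = [threading.Thread(target=worker, args=(i, i + 1)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The only extra memory per thread here is the single `scale` entry in its nested layer; the `add` entry lives once, in the shared parent.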
> [C++] Arrow Substrait enhancements for UDF
> ------------------------------------------
>
> Key: ARROW-16823
> URL: https://issues.apache.org/jira/browse/ARROW-16823
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Yaron Gvili
> Assignee: Yaron Gvili
> Priority: Major
> Labels: pull-request-available
> Time Spent: 9h 20m
> Remaining Estimate: 0h
>
> The enhancements include support for:
> * user-provided extension-id-registries and function-registries (for scoped
> registries)
> * registering a function (with an Id) external to the plan
> * a dataset-write-sink (for convenience and multiple outputting)