[jira] [Commented] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

Vibhatha Lakmal Abeykoon (Jira) Wed, 15 Jun 2022 17:23:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554833#comment-17554833
 ]


Vibhatha Lakmal Abeykoon commented on ARROW-16823:
--------------------------------------------------

I think this is a good idea in general, but I have a few questions about
generalizing the usage. Following this discussion, I think that for general UDF
usage we could also keep a temporary registry that is scoped to the application
and is destroyed when the application ends its life, so it is external to the
global function registry (GFR). We didn't design the initial version of UDFs to
support this, though. Setting that aside, the proposed idea for Substrait users
is to keep a separate registry that holds the registered functions and to let
the application lifetime decide its destruction. So this would always be
independent of the temporary registry we design for UDFs (assuming we are going
to).

Here is a simple example to reflect the usage. Let's say a user is writing an
application with 3 stages. The first stage finishes and, independently of that,
the second stage continues, but the results from stage 1 and stage 2 are
required for stage 3. The user defines a set of custom functions and registers
them in the proposed manner. These now live in a temporary function registry
called TF1. The first stage concludes. In the second stage, the user wants to
consume a Substrait plan and pre-process some data. Here we have TF2, which has
its own functions and also requires some of the functions registered in TF1. If
we made TF2 nested as suggested, we would not need to re-register them; we
could reuse the prior ones. In the third stage we can use the results from both
stages and conclude our work. Visually, is this GFR->TF1 and GFR->TF2, or
GFR->TF1->TF2? What if TF1 is destroyed? That would mean TF2 gets detached from
the GFR. Are we going to correct that relationship when we remove TF1? Are we
planning to handle this, or is it irrelevant? Please correct me if I am wrong.
I guess a simple design doc would come in handy if we are not grasping the
major aspects of how the temporary registry would be used.
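
To make the lifetime question concrete, here is a minimal Python sketch of the
GFR->TF1->TF2 chain. It is a toy model only: the `ScopedRegistry` class and its
`register`, `find` and `chain` methods are hypothetical names, not Arrow's
FunctionRegistry API. It assumes one possible policy, namely that a nested
registry holds a strong reference to its parent, so dropping the user's handle
to TF1 does not detach TF2 from the GFR.

```python
class ScopedRegistry:
    """Toy registry with an optional parent (hypothetical, not Arrow's API)."""

    def __init__(self, name, parent=None):
        self.name = name
        # Assumption: the child keeps its parent alive via a strong reference.
        self.parent = parent
        self._functions = {}

    def register(self, func_name, func):
        # Registration only touches this registry, never its ancestors.
        self._functions[func_name] = func

    def find(self, func_name):
        # Walk from this registry up through its ancestors.
        registry = self
        while registry is not None:
            if func_name in registry._functions:
                return registry._functions[func_name]
            registry = registry.parent
        return None

    def chain(self):
        # Names of the registries reachable from here, innermost first.
        names = []
        registry = self
        while registry is not None:
            names.append(registry.name)
            registry = registry.parent
        return names


gfr = ScopedRegistry("GFR")
gfr.register("add", lambda a, b: a + b)

tf1 = ScopedRegistry("TF1", parent=gfr)      # stage 1 functions
tf1.register("clean", lambda s: s.strip())

tf2 = ScopedRegistry("TF2", parent=tf1)      # stage 2, nested under TF1
tf2.register("shout", lambda s: s.upper())

del tf1  # stage 1 ends and the user drops their handle to TF1

print(tf2.chain())  # ['TF2', 'TF1', 'GFR']
```

Under this assumption TF1 stays alive for as long as TF2 needs it. If the
parent were instead held by a non-owning reference, the application would have
to guarantee that TF1 outlives TF2, which is exactly the case the questions
above are about.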

Considering practical usage, I assume that when I ask for function `f1` to be
called, the lookup should scan the global registry, then go level by level
through the scoped registries, and retrieve the function once it is located. Is
this right? For Python UDF users or R UDF users, do we have to do anything
special where we expose the FunctionRegistry? (I guess we don't have to, but I
am curious.) I would assume the temporary registry idea is powerful because it
gives the application developer more control over what is done with functions.
If it is exposed, they can manage it efficiently themselves rather than having
us manage it for them internally. I could be wrong, but please evaluate this
statement.
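
The lookup order I am describing can be sketched as follows. This is only an
illustration of the order in the question (global first, then each scoped
level); the function `lookup_global_first` and the list-of-dicts layout are
assumptions made for the example, not how Arrow implements it.

```python
def lookup_global_first(levels, name):
    """Return (depth, function) for the first level that defines `name`.

    `levels` is a list of dicts ordered from the global registry (depth 0)
    to the innermost scoped registry.
    """
    for depth, functions in enumerate(levels):
        if name in functions:
            return depth, functions[name]
    raise KeyError(f"no function registered with name: {name}")


global_fns = {"add": lambda a, b: a + b}   # stands in for the GFR
tf1_fns = {"f1": lambda x: x * 2}          # first scoped level
tf2_fns = {"f2": lambda x: x + 1}          # second scoped level

levels = [global_fns, tf1_fns, tf2_fns]
depth, f1 = lookup_global_first(levels, "f1")
print(depth, f1(21))  # 1 42
```

If the same name can never appear at two levels, scanning global-first or
innermost-first returns the same function; the order only matters if shadowing
is allowed.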

In addition, I have a general question: depending on the usage, should we keep
a separate temporary function registry for Substrait UDF users and for plain
UDF users (directly using Arrow)? In the future there could be similar cases we
need to support, for example a third-party library with a different flavour of
requirements. Should we create a temporary registry for each such case, or a
single temporary registry to be used in all cases (this won't be practical, but
I am curious)? I assume scoped registries would be the solution for supporting
such cases.

Diving a little deeper into the parallel case, we are going to have a separate
scoped registry for each instance. I would say that is efficient for
communication and there are no sync issues. Maybe the intended use is multiple
plans with non-overlapping functions? I assume that in a multi-node, multi-core
setting we won't be keeping duplicated memory in each node. In the optimized
case, I would assume that to minimize communication we can keep function copies
in each process if they are required by other plans. Here we are saving
execution time. But in case these registries grow too big (could this allocate
a huge amount of memory if we store 1000 UDFs?), we could have a shared-memory
model. This is out of scope, but I am just curious about the parallel setting.
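
As a rough illustration of the "one scoped registry per instance" idea, the
sketch below runs three plans in threads, each with its own child scope layered
over a shared global mapping. It uses Python's standard `collections.ChainMap`
as a stand-in for a scoped registry; the names `GLOBAL_FUNCTIONS` and
`run_plan` are hypothetical, and none of this is Arrow's API. Note that
`ChainMap` searches the innermost scope first, which gives the same result as
a global-first scan as long as function names do not overlap.

```python
from collections import ChainMap
from concurrent.futures import ThreadPoolExecutor

# Shared base scope; the plans below only read from it.
GLOBAL_FUNCTIONS = ChainMap({"add": lambda a, b: a + b})


def run_plan(plan_id):
    # Each plan gets a private child scope; writes land only in the child.
    registry = GLOBAL_FUNCTIONS.new_child()
    udf_name = f"udf_{plan_id}"
    registry[udf_name] = lambda x, k=plan_id: x * k

    # Use one plan-local function and one function inherited from the base.
    result = registry["add"](registry[udf_name](10), 1)
    local_names = sorted(registry.maps[0])
    return plan_id, result, local_names


with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_plan, [1, 2, 3]))

print(results)
# [(1, 11, ['udf_1']), (2, 21, ['udf_2']), (3, 31, ['udf_3'])]
print(list(GLOBAL_FUNCTIONS))  # ['add']
```

Because each plan writes only to its own child scope, no locking is needed for
registration in this model, and the shared base is not duplicated per plan.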

Appreciate your thoughts on this. cc [~rtpsw] 

 

> [C++] Arrow Substrait enhancements for UDF
> ------------------------------------------
>
>                 Key: ARROW-16823
>                 URL: https://issues.apache.org/jira/browse/ARROW-16823
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yaron Gvili
>            Assignee: Yaron Gvili
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> The enhancements include support for:
>  * user-provided extension-id-registries and function-registries (for scoped 
> registries)
>  * registering a function (with an Id) external to the plan
>  * a dataset-write-sink (for convenience and multiple outputting)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
