[GitHub] [arrow] rtpsw commented on pull request #13214: ARROW-15635: [C++] Support nested extension-id-registry

GitBox Wed, 25 May 2022 05:07:17 -0700


rtpsw commented on PR #13214:
URL: https://github.com/apache/arrow/pull/13214#issuecomment-1137158671


   > What would you envision the lifetime of the custom extension id registry 
being? Based on some of the comments it sounds like you might be thinking 
per-plan.
   
   Either or both per-process and per-plan. The scope can be controlled by the 
user (from Arrow or PyArrow):
   - For per-process scope, the user makes `per_process_registry = 
nested_extension_id_registry(default_extension_id_registry())` and keeps it 
throughout the lifetime of the process.
   - For per-plan scope, the user recreates `per_plan_registry = 
nested_extension_id_registry(default_extension_id_registry())` for each plan, 
   - For both per-process and per-plan scope, the user makes 
`per_process_registry = 
nested_extension_id_registry(default_extension_id_registry())` and recreates 
`per_plan_registry = nested_extension_id_registry(per_process_registry)` (note 
the double-nesting) for each plan.
    
   > For a plan-specific "embedded UDF" (i.e. a UDF that has been pickled and 
serialized as part of the plan) I though the idea was that the embedded UDF 
would have some special way of being inserted into an expression (admittedly, 
the protobuf is lacking for this part at the moment) and so it wouldn't need to 
be a part of the extension set.
   
   Locally, I was able to get this to work for UDFs restricted to element-wise 
flat-scalar-valued (not analytic or reduction, not struct, not table or 
dataset).
   
   > For UDFs that are not embedded in the plan itself it seems the user would 
probably want to just register those once so I think the lifetime of this 
nested set would still be process-scoped.
   
   My local solution involves an Ibis/Substrait/Arrow workflow, where all UDFs 
are registered once and get serialized into each Substrait plan they are used 
in.
   
   > The proposed change works fine for process-scoped but I want to make sure 
I understand the intended usage.
   
   As explained above, I think it works more broadly.
   
   > Lastly, we should have some unit tests in place before we merge this.
   
   Agreed. I'll move on to add them, now that this approach seems to be 
acceptable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] rtpsw commented on pull request #13214: ARROW-15635: [C++] Support nested extension-id-registry

Reply via email to