rtpsw commented on PR #13214:
URL: https://github.com/apache/arrow/pull/13214#issuecomment-1137158671
> What would you envision the lifetime of the custom extension id registry
being? Based on some of the comments it sounds like you might be thinking
per-plan.
Either or both per-process and per-plan. The scope can be controlled by the
user (from Arrow or PyArrow):
- For per-process scope, the user makes `per_process_registry =
nested_extension_id_registry(default_extension_id_registry())` and keeps it
throughout the lifetime of the process.
- For per-plan scope, the user recreates `per_plan_registry =
nested_extension_id_registry(default_extension_id_registry())` for each plan,
- For both per-process and per-plan scope, the user makes
`per_process_registry =
nested_extension_id_registry(default_extension_id_registry())` and recreates
`per_plan_registry = nested_extension_id_registry(per_process_registry)` (note
the double-nesting) for each plan.
> For a plan-specific "embedded UDF" (i.e. a UDF that has been pickled and
serialized as part of the plan) I though the idea was that the embedded UDF
would have some special way of being inserted into an expression (admittedly,
the protobuf is lacking for this part at the moment) and so it wouldn't need to
be a part of the extension set.
Locally, I was able to get this to work for UDFs restricted to element-wise
flat-scalar-valued (not analytic or reduction, not struct, not table or
dataset).
> For UDFs that are not embedded in the plan itself it seems the user would
probably want to just register those once so I think the lifetime of this
nested set would still be process-scoped.
My local solution involves an Ibis/Substrait/Arrow workflow, where all UDFs
are registered once and get serialized into each Substrait plan they are used
in.
> The proposed change works fine for process-scoped but I want to make sure
I understand the intended usage.
As explained above, I think it works more broadly.
> Lastly, we should have some unit tests in place before we merge this.
Agreed. I'll move on to add them, now that this approach seems to be
acceptable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]