First of all, this is a nice discussion, but I have a question.

My question concerns how simple this will stay for users. At the moment we
are not exposing the execution engine primitives to Python users; are you
expecting to expose them through this approach?

On Fri, Jun 3, 2022 at 9:02 PM Yaron Gvili <rt...@hotmail.com> wrote:

> Hi,
>
> I'm working on support for data-source UDFs and would like to get feedback
> about the design I have in mind for it.
>
> By support for data-source UDFs, at a basic level, I mean enabling a user
> to define using PyArrow APIs a record-batch-generating function implemented
> in Python that would be easily plugged into a source-node in a
> streaming-engine execution plan. Such functions are similar to the existing
> scalar UDFs with zero inputs, but an important difference is that scalar
> UDFs are plugged and composed in expressions whereas data-source UDFs would
> be plugged into a source-node.
>
> Focusing on the Arrow and PyArrow parts (I'm leaving the Ibis and
> Ibis-Substrait parts out), the design I have in mind includes:
>
>   *   In Arrow: Adding a new source-UDF kind of arrow::compute::Function,
> for functions that generate data. Such functions would be registered in a
> FunctionRegistry but neither used in scalar expressions nor composed.
>   *   In Arrow: Adding SourceUdfContext and SourceUdfOptions (similar to
> ScalarUdfContext and ScalarUdfOptions) in "cpp/src/arrow/python/udf.h".
>   *   In Arrow: Adding a UdfSourceExecNode into which a (source-UDF-kind
> of) function can be plugged.
>   *   In PyArrow: Following the design of scalar UDFs, and hopefully
> reusing much of it.
>
> Cheers,
> Yaron.
>
-- 
Vibhatha Abeykoon
