Hi,

I'm working on support for data-source UDFs and would like to get feedback 
on the design I have in mind for it.

By support for data-source UDFs, at a basic level, I mean enabling a user to 
use PyArrow APIs to define a record-batch-generating function, implemented in 
Python, that can easily be plugged into a source node in a streaming-engine 
execution plan. Such functions are similar to the existing scalar UDFs with 
zero inputs, but an important difference is that scalar UDFs are plugged into 
and composed within expressions, whereas data-source UDFs would be plugged 
into a source node.

Focusing on the Arrow and PyArrow parts (I'm leaving the Ibis and 
Ibis-Substrait parts out), the design I have in mind includes:

  *   In Arrow: Adding a new source-UDF kind of arrow::compute::Function, for 
functions that generate data. Such functions would be registered in a 
FunctionRegistry but neither used in scalar expressions nor composed.
  *   In Arrow: Adding SourceUdfContext and SourceUdfOptions (similar to 
ScalarUdfContext and ScalarUdfOptions) in "cpp/src/arrow/python/udf.h".
  *   In Arrow: Adding a UdfSourceExecNode into which a (source-UDF-kind of) 
function can be plugged.
  *   In PyArrow: Following the design of scalar UDFs, and hopefully reusing 
much of it.

Cheers,
Yaron.