Hi, I'm working on support for data-source UDFs and would like to get feedback about the design I have in mind for it.
By support for data-source UDFs, at a basic level, I mean enabling a user to use PyArrow APIs to define a record-batch-generating function, implemented in Python, that can be easily plugged into a source-node in a streaming-engine execution plan. Such functions are similar to the existing scalar UDFs with zero inputs. An important difference is that scalar UDFs are plugged into and composed within expressions, whereas data-source UDFs would be plugged into a source-node.

Focusing on the Arrow and PyArrow parts (I'm leaving the Ibis and Ibis-Substrait parts out), the design I have in mind includes:

* In Arrow: Adding a new source-UDF kind of arrow::compute::Function, for functions that generate data. Such functions would be registered in a FunctionRegistry but would be neither used in scalar expressions nor composed.
* In Arrow: Adding SourceUdfContext and SourceUdfOptions (similar to ScalarUdfContext and ScalarUdfOptions) in "cpp/src/arrow/python/udf.h".
* In Arrow: Adding a UdfSourceExecNode into which a (source-UDF-kind of) function can be plugged.
* In PyArrow: Following the design of scalar UDFs, and hopefully reusing much of it.

Cheers,
Yaron.