wesm opened a new pull request #7240: URL: https://github.com/apache/arrow/pull/7240
This patch is a major reworking of our development strategy for implementing array-valued functions and applying them in a query processing setting. There are a ton of details, but one nice thing is that there is now a single API entry point for invoking any function by its name: ```c++ Result<Datum> CallFunction(ExecContext* ctx, const std::string& func_name, const std::vector<Datum>& args, const FunctionOptions* options = NULLPTR); ``` What occurs when you do this: * A `Function` instance is looked up in the global `FunctionRegistry` * Given the descriptors of `args` (their types and shapes -- array or scalar), the Function searches for `Kernel` that is able to process those types and shapes. A kernel might be able to do `array[T0], array[T1]` or only `scalar[T0], scalar[T1]`, for example. This permits kernel specialization to treat different type and shape combinations * The kernel is executed iteratively against `args` based on what `args` contains -- if there are ChunkedArrays, they will be split into contiguous pieces. Kernels never see ChunkedArray, only Array or Scalar * The Executor implementation is able to split contiguous Array inputs into smaller chunks, which is important for parallel execution. See `ExecContext::set_exec_chunksize` To summarize: the REGISTRY contains FUNCTIONS. A FUNCTION contains KERNELS. A KERNEL is a specific implementation of a function that services a particular type combination. An additional effort in this patch is to radically simplify the process of creating kernels that are based on a scalar function. To do this, there is a growing collection of template-based kernel generation classes in compute/kernels/codegen_internal.h that will surely be the topic of much debate. I want to make it a lot easier for people to add new kernels. There are some other incidental changes in the PR, such as changing the convenience APIs like `Cast` to return `Result`. I'm afraid we may have to live with the API breakage unless someone else wants to add backward compatibility code for the old APIs. I have to apologize for making such a large PR. I've been working long hours on this for nearly a month and the process of porting all of our existing functionality and making the unit tests pass caused much iteration in the "framework" part of the code, such that it would have been a huge time drain to review incomplete iterations of the framework that had not been proven to capture the functionality that previously existed in the project. Given the size of this PR and that fact that it completely blocks any work into src/arrow/compute, I don't think we should let this sit unmerged for more than 4 or 5 days, tops. I'm committed to responding to all of your questions and working to address your feedback about the design and improving the documentation and code comments. I tried to leave copious comments to explain my thought process in various places. Feel free to make any and all comments in this PR or in whatever form you like. I don't think that merging should be blocked on stylistic issues. Current status viz-a-viz merge-readiness - [ ] Windows C++ issues - [ ] Port Python bindings -- I have ported the code and the tests pass for me locally - [ ] Port R bindings -- I have ported the C++ code and there is one test failure whose root cause I understand - [ ] Port GLib bindings ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org