wesm opened a new pull request #7240:
URL: https://github.com/apache/arrow/pull/7240


   This patch is a major reworking of our development strategy for implementing 
array-valued functions and applying them in a query processing setting. 
   
   There are a ton of details, but one nice thing is that there is now a single 
API entry point for invoking any function by its name:
   
   ```c++
   Result<Datum> CallFunction(ExecContext* ctx, const std::string& func_name,
                              const std::vector<Datum>& args,
                              const FunctionOptions* options = NULLPTR);
   ```
   
   What occurs when you do this:
   
   * A `Function` instance is looked up in the global `FunctionRegistry`
   * Given the descriptors of `args` (their types and shapes -- array or 
scalar), the Function searches for `Kernel` that is able to process those types 
and shapes. A kernel might be able to do `array[T0], array[T1]` or only 
`scalar[T0], scalar[T1]`, for example. This permits kernel specialization to 
treat different type and shape combinations
   * The kernel is executed iteratively against `args` based on what `args` 
contains -- if there are ChunkedArrays, they will be split into contiguous 
pieces. Kernels never see ChunkedArray, only Array or Scalar
   * The Executor implementation is able to split contiguous Array inputs into 
smaller chunks, which is important for parallel execution. See 
`ExecContext::set_exec_chunksize`
   
   To summarize: the REGISTRY contains FUNCTIONS. A FUNCTION contains KERNELS. 
A KERNEL is a specific implementation of a function that services a particular 
type combination. 
   
   An additional effort in this patch is to radically simplify the process of 
creating kernels that are based on a scalar function. To do this, there is a 
growing collection of template-based kernel generation classes in 
compute/kernels/codegen_internal.h that will surely be the topic of much 
debate. I want to make it a lot easier for people to add new kernels.
   
   There are some other incidental changes in the PR, such as changing the 
convenience APIs like `Cast` to return `Result`. I'm afraid we may have to live 
with the API breakage unless someone else wants to add backward compatibility 
code for the old APIs. 
   
   I have to apologize for making such a large PR. I've been working long hours 
on this for nearly a month and the process of porting all of our existing 
functionality and making the unit tests pass caused much iteration in the 
"framework" part of the code, such that it would have been a huge time drain to 
review incomplete iterations of the framework that had not been proven to 
capture the functionality that previously existed in the project.
   
   Given the size of this PR and that fact that it completely blocks any work 
into src/arrow/compute, I don't think we should let this sit unmerged for more 
than 4 or 5 days, tops. I'm committed to responding to all of your questions 
and working to address your feedback about the design and improving the 
documentation and code comments. I tried to leave copious comments to explain 
my thought process in various places. Feel free to make any and all comments in 
this PR or in whatever form you like. I don't think that merging should be 
blocked on stylistic issues. 
   
   Current status viz-a-viz merge-readiness
   
   - [ ] Windows C++ issues
   - [ ] Port Python bindings -- I have ported the code and the tests pass for 
me locally
   - [ ] Port R bindings -- I have ported the C++ code and there is one test 
failure whose root cause I understand
   - [ ] Port GLib bindings


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to