niyue opened a new pull request, #38116:
URL: https://github.com/apache/arrow/pull/38116

   # Rationale for this change
   
   This PR enhances Gandiva with support for an external function registry, 
so that developers can author third-party functions without modifying Gandiva's 
core codebase. See https://github.com/apache/arrow/issues/37753 for more 
details.
   
   # What changes are included in this PR?
   This PR includes two major changes:
   1) Add a new API, `AddFunction(NativeFunction native_function)`, to 
`FunctionRegistry`. The `native_function` parameter stores the external 
function's metadata, so developers can register external functions by 
calling this API.
   2) Add a new class that stores the LLVM IR buffers of external functions, 
and merge the external IR with the built-in function IR into the LLVM module, 
so that third-party pre-compiled functions can be integrated via LLVM bitcode.
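
To illustrate the first change, here is a minimal, self-contained sketch of what registering and looking up external function metadata could look like. It does not use Gandiva's real classes: `NativeFunctionSketch`, `FunctionRegistrySketch`, `Lookup`, and the string-based type names are hypothetical stand-ins chosen for this example, and the actual `NativeFunction` carries more information than shown here.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the metadata an external function would carry.
struct NativeFunctionSketch {
  std::string base_name;                 // name used in expressions
  std::vector<std::string> param_types;  // parameter types, as plain strings here
  std::string return_type;
  std::string pc_name;                   // symbol name inside the pre-compiled bitcode
};

// Hypothetical stand-in for a registry that accepts external functions.
class FunctionRegistrySketch {
 public:
  // Registers the metadata; returns false if the signature already exists.
  bool AddFunction(NativeFunctionSketch fn) {
    assert(!fn.base_name.empty());
    std::string key = SignatureKey(fn.base_name, fn.param_types);
    return functions_.emplace(std::move(key), std::move(fn)).second;
  }

  // Returns the registered metadata, or nullptr if the signature is unknown.
  const NativeFunctionSketch* Lookup(const std::string& name,
                                     const std::vector<std::string>& param_types) const {
    auto it = functions_.find(SignatureKey(name, param_types));
    return it == functions_.end() ? nullptr : &it->second;
  }

 private:
  // Builds a key such as "multiply_by_two(int32)".
  static std::string SignatureKey(const std::string& name,
                                  const std::vector<std::string>& param_types) {
    std::string key = name + "(";
    for (size_t i = 0; i < param_types.size(); ++i) {
      if (i > 0) key += ",";
      key += param_types[i];
    }
    return key + ")";
  }

  std::unordered_map<std::string, NativeFunctionSketch> functions_;
};
```

Keying on the full signature rather than the base name alone lets overloads of the same function coexist, which is an assumption of this sketch rather than a statement about how the PR stores them.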
   
   The overall flow looks like this:
   <img width="2758" alt="dataflow" 
src="https://github.com/apache/arrow/assets/27754/72bf2e20-a9e7-4cec-beb2-fc8d799d5c6e">
   
   # Are these changes tested?
   
   Unit tests are added to verify this enhancement.
   
   # Are there any user-facing changes?
   
   No change to existing behavior, but this PR adds some new ways to interface 
with the library.
   
   Closes: https://github.com/apache/arrow/issues/37753
   
   # Notes
   * Performance
       * Since the function registry is loaded once and stored internally as a 
static map, registering metadata for a new external function should typically 
have little performance impact.
       * Code generation time grows with the number of externally added 
bitcode buffers (the more functions are added, the slower codegen becomes), 
even if an external function is not used in the given expression at all. This 
is not a new issue: it applies to built-in functions as well (the more 
built-in functions there are, the slower codegen becomes). In my limited 
testing, the cause is that `llvm::Linker::linkModule` takes a non-trivial 
amount of time and runs for every IR loaded, while `RemoveUnusedFunctions` 
runs only afterwards, so it does not reduce the time spent in `linkModule`. We 
may have to selectively load only the necessary IR (primarily by calling 
`linkModule` only for that IR), but that may require more metadata recording 
which functions can be found in which IR. This could be improved in a separate 
PR; please advise if anyone has ideas. Thanks.
   * Integration with other programming languages via LLVM IR/bitcode
       * So far I have only added an external C++ function to the codebase, 
for unit testing purposes. Rust-based functions are possible, but when I tried 
I ran into another issue (Rust has a std lib that needs to be processed with a 
different approach). I will explore other languages such as Zig later.
       * Functions that are not pre-compiled may require a different approach 
to obtain the function pointer; we can discuss and work on that in a separate 
PR later. I am currently thinking of loading shared libraries at runtime and 
looking up the corresponding function pointer symbols in the shared library, 
but I have not experimented with this yet, so I am not sure whether it works. 
Any comments on how this should be done are appreciated.
   * The discussion thread on the dev mailing list: 
https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
        * I previously submitted another PR 
(https://github.com/apache/arrow/pull/37787) that introduced a JSON-based 
function registry. After discussion, I will close that PR and use this PR 
instead.
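
The selective-loading idea from the Performance note can be sketched roughly as follows. This is only an illustration under stated assumptions, not part of this PR: `IrIndex` and `SelectIrBuffers` are hypothetical names, the index from symbol name to IR buffer id is exactly the extra metadata the note says would be needed, and the sketch ignores the case where one IR buffer calls functions defined in another.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical index: pre-compiled symbol name -> id of the IR buffer defining it.
using IrIndex = std::map<std::string, int>;

// Returns the ids of the IR buffers that define at least one function used by
// the expression. Only these buffers would then need to be linked into the
// module; names missing from the index are skipped.
std::set<int> SelectIrBuffers(const IrIndex& index,
                              const std::vector<std::string>& used_functions) {
  std::set<int> selected;
  for (const auto& name : used_functions) {
    auto it = index.find(name);
    if (it != index.end()) {
      selected.insert(it->second);
    }
  }
  return selected;
}
```

With such an index, the cost of linking would depend on the functions an expression actually references rather than on the total number of registered IR buffers, at the price of building and maintaining the index.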


