niyue commented on code in PR #38116: URL: https://github.com/apache/arrow/pull/38116#discussion_r1357691913
########## cpp/src/gandiva/engine.cc: ##########

@@ -152,6 +153,7 @@ Status Engine::LoadFunctionIRs() {
   if (!functions_loaded_) {
     ARROW_RETURN_NOT_OK(LoadPreCompiledIR());
     ARROW_RETURN_NOT_OK(DecimalIR::AddFunctions(this));
+    ARROW_RETURN_NOT_OK(LoadExternalPreCompiledIR());

Review Comment:

# Issue 1

> I like the std::vector<NativeFunction> approach. I think that managing both of signatures and bitcodes by FunctionRegistry is better than introducing IR store for bitcodes. The signature and bitcode mapping will be useful for loading bitcode to gandiva::Engine selectively for performance

I agree. It seems simpler for users to call just one API, since I don't see any scenario where users would call only one of the two APIs (`FunctionRegistry::Add`, `LLVMExternalIRStore::Add`). Additionally, we could take advantage of this signature <==> bitcode mapping later to optimize performance during LLVM module construction. @js8544 do you have any opinion on this idea? Thanks.

# Issue 2

> Could you explain "some shared library mechanism" a bit more?
> I imagine that gcc -shared -o multiply_by_two.so multiply_by_two.cc and handle = dlopen("multiply_by_two.so"); dlsym(handle, "multiply_by_two_int32")) to gandiva::Engine

Sorry, I didn't make that clear enough. This probably warrants another PR, so let me add some background here.

This PR allows users to call pre-compiled functions from LLVM IR/bitcode: Gandiva users need to generate LLVM IR for their functions to integrate them. This approach works for small and simple functions, typically functions that don't depend on any libraries. However, if a function needs to perform a more complex task, its author will typically want to rely on third-party libraries. For example, if I wanted to implement a function that calls a service via HTTP with JSON request/response (e.g. the Twitter API or OpenAI API), that function would have to depend on an HTTP library and a JSON parsing library (or even some async IO library).
If all these dependent libraries are also compiled into LLVM IR, the resulting IR/bitcode will be non-trivial in size, and this will make LLVM module construction pretty slow. So I think it may help to compile this kind of complex function into a shared library (instead of LLVM IR), and have the LLVM engine load the shared library at runtime and look up the function directly in it. I briefly looked into how this could be done, but more investigation is needed later:

1. LLVM has an API called `DynamicLibrary::LoadLibraryPermanently` which can be used to load a shared library at runtime.
   1. This is the `DynamicLibrary` API doc in LLVM: https://llvm.org/doxygen/classllvm_1_1sys_1_1DynamicLibrary.html
   2. The `Engine::InitOnce` function in `engine.cc:115` currently calls this `LoadLibraryPermanently` API, but it loads nothing at the moment.
2. As you said, we can use `dlopen` and `dlsym` to get a function pointer from the shared library.
3. `Engine::AddGlobalMappingForFunc` is used to programmatically set up a mapping from function name to function pointer; currently, some functions like `gdv_fn_context_arena_malloc` are mapped via this API.
4. `void* Engine::CompiledFunction(llvm::Function* irFunction)` is used to look up the function in the module.

I think it is possible to piece all of these together and address the performance issue with complex functions described above.

Back to the original topic: thinking about it again, if we really want to pursue such a goal, the API for function registration could be something like `FunctionRegistry::Add(std::vector<NativeFunction> funcs, const std::string path_to_some_shared_library)` if we adopt the approach mentioned in the Issue 1 section.