niyue opened a new issue, #40024:
URL: https://github.com/apache/arrow/issues/40024

   ### Describe the enhancement requested
   
   # Description
   This enhancement request proposes to speed up construction of the LLVM 
module by examining which functions the Gandiva expressions actually use and 
skipping unnecessary work.
   
   When constructing an LLVM module for the given expressions, Gandiva performs 
the following tasks:
   1) Instantiate a new `Engine`, which internally constructs a new LLVM module
   2) Add to the LLVM module the many C functions (and their pointers) that 
the expressions may call.
       * Most of the C functions are user-facing and are used in Gandiva 
expressions by users, such as the `random` function and the 
`gdv_fn_base64_decode_utf8` function (which is used by `unbase64`)
       * Some of the C functions are internal only, such as 
`gdv_fn_populate_varlen_vector` and `gdv_fn_context_arena_malloc`. They are 
not called directly in user-composed Gandiva expressions, but by the LLVM IR 
generated for the LLVM module
   3) Load LLVM bitcode, which contains many functions implemented in LLVM IR, 
into the LLVM module
       * Most of the IR functions are user-facing and are used in Gandiva 
expressions by users, such as the `negative` function and the `log10` function
       * Some of the IR functions are internal only, such as the 
`bitMapGetBit` and `bitMapValidityGetBit` functions
   
   Some of the operations in this process are non-trivial, and they make 
module construction slower than necessary:
   1) Each C function added to the LLVM module ends up having its pointer 
defined in the LLVM module's JITDylib via 
`jit_dylib.define(llvm::orc::absoluteSymbols({{mangle(name), symbol}}))`. This 
is not a cheap operation, and every LLVM module adds many C functions (143 
such usages so far in the codebase), which slows down construction of the LLVM 
module (when the cache is not hit).
   2) Loading LLVM bitcode will call `llvm::Linker::linkModules` to copy the 
bitcode's module into the `Engine`'s LLVM module, and this is an expensive 
operation. 
   
   
   # Proposal
   To speed up the above process, the key observations are:
   1) Typically, besides the internally used C functions, only a very small 
number of C functions are used in most expressions, so we don't have to map 
all 143 functions every time (it is very rare for users to come up with 
expressions calling 100+ functions at the same time)
   2) Typically, besides the internally used IR functions, only a very small 
number of IR functions are used in most expressions, so we could avoid loading 
the LLVM bitcode and linking it into the LLVM module if none of those 
functions are used (for example, when all the functions used in the 
expressions are C functions)
   
   The proposal to improve this part is:
   1) Parse the expressions and keep track of the functions used in them
   2) When adding/mapping C functions, always add a function that is used 
internally; otherwise, check the set of used functions obtained above to see 
whether it really needs to be defined in the LLVM module
   3) Split LLVM bitcode into two parts:
       * one part for storing internal IR functions, more specifically, 
`bitMapGetBit`/`bitMapSetBit`/`bitMapValidityGetBit`/`bitMapClearBitIfFalse`. 
This part of the bitcode will always be loaded and added to the LLVM module
       * the other part for storing all user-facing IR functions. When loading 
LLVM bitcode, check whether all the functions used in the expressions are C 
functions; if so, there is no need to load the IR function bitcode at all.
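   
   A minimal sketch of how the three steps could fit together, using only the 
standard library. `ExprNode`, `FunctionEntry`, `ModulePlan`, 
`CollectUsedFunctions` and `PlanModule` are hypothetical names made up for 
illustration, not existing Gandiva APIs. The sketch also assumes each 
expression node already carries the resolved function name, so it glosses over 
the mapping from an expression-level name such as `unbase64` to its C symbol 
`gdv_fn_base64_decode_utf8`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified expression node (not Gandiva's real Node classes).
struct ExprNode {
  std::string function_name;  // empty for leaves such as fields and literals
  std::vector<std::shared_ptr<ExprNode>> children;
};

enum class FunctionKind { kC, kIR };

// Hypothetical registry entry for one function that could go into the module.
struct FunctionEntry {
  std::string name;
  FunctionKind kind;
  bool internal;  // true for helpers that are only called from generated IR
};

// What the engine would do for one set of expressions.
struct ModulePlan {
  std::vector<std::string> c_functions_to_map;
  bool load_internal_ir = true;  // bitMapGetBit etc. are always loaded
  bool load_user_ir = false;     // set only if a user-facing IR function is used
};

// Step 1: walk an expression tree and record every function it calls.
void CollectUsedFunctions(const ExprNode& node,
                          std::unordered_set<std::string>* used) {
  assert(used != nullptr);
  if (!node.function_name.empty()) used->insert(node.function_name);
  for (const auto& child : node.children) {
    if (child) CollectUsedFunctions(*child, used);
  }
}

// Steps 2 and 3: map internal C functions unconditionally and user-facing C
// functions only when used; request the user-facing IR bitcode only when at
// least one user-facing IR function is used.
ModulePlan PlanModule(const std::vector<FunctionEntry>& registry,
                      const std::unordered_set<std::string>& used) {
  ModulePlan plan;
  for (const auto& fn : registry) {
    const bool is_used = used.count(fn.name) > 0;
    if (fn.kind == FunctionKind::kC) {
      if (fn.internal || is_used) plan.c_functions_to_map.push_back(fn.name);
    } else if (!fn.internal && is_used) {
      plan.load_user_ir = true;
    }
  }
  return plan;
}
```

   With a registry holding `gdv_fn_context_arena_malloc` (internal C), 
`random` (user-facing C) and `log10` (user-facing IR), an expression that only 
calls `random` would yield a plan that maps two C functions and skips the 
user-facing bitcode entirely. In a real implementation, the plan would drive 
the `jit_dylib.define(...)` and `llvm::Linker::linkModules` calls.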
   
   This kind of processing will avoid the expensive operations mentioned above, 
hence achieving better performance in some cases.
   
   
   
   ### Component(s)
   
   C++ - Gandiva

