LiaCastaneda commented on issue #8938:
URL: https://github.com/apache/arrow-rs/issues/8938#issuecomment-3737953626

   I'm thinking if maybe we could implement this in steps.
   
   1. As a first step we could do something like what @notfilippo suggested. We 
can have a global Context with a Memory Pool that if it's set, we can 
automatically track all created Arrays by claiming inside `MutableBuffer`. I 
don't think this change would require much effort, since every Buffer creation 
goes through `MutableBuffer`, and we can probably directly test this in 
`DataFusion`  and see if there are any perf regressions and validate the 
approach. This could be easily reverted if we see it's not feasible.
   2. For a second step (if we see step 1 proves feasible) we could have 
explicit APIs with the context, like what @alamb suggested. I imagine we could 
have something like this to also avoid breaking the existing API:
   
   ```
   pub fn add_with_context(
       context: &Context,
       lhs: &dyn Datum,
       rhs: &dyn Datum
   ) -> Result<ArrayRef, ArrowError> {
       arithmetic_op_with_context(Op::Add, context, lhs, rhs)
   }
   
   pub fn add(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<ArrayRef, ArrowError> 
{
       match global_context() {
           Some(ctx) => add_with_context(ctx, lhs, rhs),
           None => {
               // Fallback: use default context
               let ctx = Context::default();
               add_with_context(&ctx, lhs, rhs)
           }
       }
   }
   ```
   This step will be complex because of the number of APIs involved: ~16 array 
types with constructors (both `new` and `try_new`). For the kernels it's more 
complicated because there are >100 and each would require its counterpart with 
a context. In addition, it would require passing the context down from the 
kernel function to the Array constructor function, changing signatures at every 
level. For example, for the `add` kernel it would involve passing the context 
down 6 levels until we reach the array constructor: `add` → `arithmetic_op` → 
`integer_op` → `try_op` → `try_binary` → `PrimitiveArray::new`.
   
   Maybe we can do this in batches, starting with the most common kernels. 
However, as @alamb says, the danger would be leaving it halfway done and having 
a bunch of kernels claiming created data while others don't.
   
   Step 2 would be the long term solution since it will allow more granular 
memory accounting, and we can have as many memory pools/contexts as we want. 
Also, I think adding a Context struct now (even if it only contains MemoryPool 
initially) sets up the infra for future features. For instance, arrow 
[cpp](https://github.com/apache/arrow/blob/342c5d563005709eab03dde332e44bcb237385fc/cpp/src/arrow/compute/exec.h#L78)
 has vars to tweak execution parameters like chunk sizes, threading etc


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to