thisisnic commented on PR #46667:
URL: https://github.com/apache/arrow/pull/46667#issuecomment-2930550966

   Thanks for the PR @etiennebacher, much appreciated!
   
   OK, this is a bit long-winded, but I'm a bit unsure about something so 
wanted to spell out my reasoning.
   
   * In `arrow_mask()`, we take the functions created in the .cache environment 
and add them to the mask in order to be able to use our bindings from R 
functions to Arrow compute functions here.  We add the column references to the 
mask too.
   * In `arrow_eval()` we call `add_user_functions_to_mask()` (which is used to 
enable user-created R functions in Arrow without needing to use the UDF 
functionality), which looks for R functions in expressions being evaluated 
which are currently absent from the mask, and see if we can find them in the 
mask's "grandparent" environment and adds them to the data mask.
     * In the comment in `add_user_functions_to_mask()` we reference 
`case_when()` which makes me think that this gives a clue to why it 
specifically doesn't work there.
   * When we now pass `mask` to the `env` param in this PR (`eval_tidy(expr, 
data = mask, env = mask)`), we get to add those items manually
     * Also: "Objects in data have priority over those in env" (from the 
`eval_tidy()` docs).
   
   Source of the issue:
   * In `case_when()` the mask passed to `arrow_eval()` is based on the 
`caller_env()`.  
     * We also do this in other places, e.g. bindings for verbs like `mutate()` 
but I can't reproduce the original issue when using those functions.
       * But in `mutate()` we use `expand_across()` which calls `quo_get_env()` 
which does the job of getting the environment - (from the docs:) "A quosure is 
guaranteed to evaluate in its original environment and can refer to local 
objects safely".
       * I don't know what "safely" means here, but I'm wondering if there's 
some reason we might want `case_when()` to be able to refer to the relevant 
environment instead of blanket updating for `eval_tidy()`, if there's a reason 
we didn't pass the mask in as `env` beyond just leaving things as the default?
         * So for example, in `case_when()`, where we have the code `mask <- 
caller_env()`, maybe instead we actually want `mask <- caller_env(n=2)` or 
something like that, because if that works we've got less risk of unintended 
consequences?
   
   Or, are there no unintended consequences likely (as `data` takes precedent 
over `env` anyway), and we can just do this as-is and everyone's happy.
   
   cc @nealrichardson - is this good to go as-is?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to