thisisnic commented on PR #46667: URL: https://github.com/apache/arrow/pull/46667#issuecomment-2930550966
Thanks for the PR @etiennebacher, much appreciated! OK, this is a bit long-winded, but I'm a bit unsure about something so wanted to spell out my reasoning. * In `arrow_mask()`, we take the functions created in the .cache environment and add them to the mask in order to be able to use our bindings from R functions to Arrow compute functions here. We add the column references to the mask too. * In `arrow_eval()` we call `add_user_functions_to_mask()` (which is used to enable user-created R functions in Arrow without needing to use the UDF functionality), which looks for R functions in expressions being evaluated which are currently absent from the mask, and see if we can find them in the mask's "grandparent" environment and adds them to the data mask. * In the comment in `add_user_functions_to_mask()` we reference `case_when()` which makes me think that this gives a clue to why it specifically doesn't work there. * When we now pass `mask` to the `env` param in this PR (`eval_tidy(expr, data = mask, env = mask)`), we get to add those items manually * Also: "Objects in data have priority over those in env" (from the `eval_tidy()` docs). Source of the issue: * In `case_when()` the mask passed to `arrow_eval()` is based on the `caller_env()`. * We also do this in other places, e.g. bindings for verbs like `mutate()` but I can't reproduce the original issue when using those functions. * But in `mutate()` we use `expand_across()` which calls `quo_get_env()` which does the job of getting the environment - (from the docs:) "A quosure is guaranteed to evaluate in its original environment and can refer to local objects safely". * I don't know what "safely" means here, but I'm wondering if there's some reason we might want `case_when()` to be able to refer to the relevant environment instead of blanket updating for `eval_tidy()`, if there's a reason we didn't pass the mask in as `env` beyond just leaving things as the default? * So for example, in `case_when()`, where we have the code `mask <- caller_env()`, maybe instead we actually want `mask <- caller_env(n=2)` or something like that, because if that works we've got less risk of unintended consequences? Or, are there no unintended consequences likely (as `data` takes precedent over `env` anyway), and we can just do this as-is and everyone's happy. cc @nealrichardson - is this good to go as-is? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org