dgreiss commented on issue #33149:
URL: https://github.com/apache/arrow/issues/33149#issuecomment-1608402342

   I was looking to open a PR, but thought I'd check to see how you would want 
it implemented. `rowwise` is just a special case of `group_by` where you group 
on each row. dplyr implements this by adding a sequence to the 
[`group_data` 
attribute](https://github.com/tidyverse/dplyr/blob/16b472fb2afc50a87502c2b4ed803e2f5f82b9d6/R/rowwise.R#L110)
 which is a list of rows to group on. In arrow there is no `group_data` 
attribute because everything is lazily evaluated, so we only save the 
`group_vars` in the `arrow_dplyr_query` container. I think the way to implement 
this in arrow is to add the row sequence directly to the table:
   
   ```r
   .data$.nrows <- seq_len(nrow(.data))
   ```
   
   However, that means you can't add the sequence partway through a dplyr 
pipeline, because you don't reliably have access to the `.data` object in the 
container. Possible workarounds:
   
   * Raise an error stating that `rowwise` can only be used at the start of a 
pipeline. Or only apply `rowwise` to an `ArrowTabular` object, and raise an 
error on the `adq` container.
   
   * Run `compute` on the current object to obtain the intermediate `.data` 
object, then add the sequence to that object and apply `group_by`. This option 
seems off to me, because you wouldn't typically execute a query plan without an 
explicit call to `compute`/`collect`. 
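   
   For reference, the `rowwise`/`group_by` equivalence described above can be 
sketched in plain dplyr on an in-memory tibble. This is only an illustration 
and not arrow code: `.row_id` is a hypothetical column name chosen for the 
sketch, and nothing here touches the `arrow_dplyr_query` container.
   
   ```r
   library(dplyr)

   df <- tibble(x = c(1, 5, 3), y = c(4, 2, 6))

   # dplyr's built-in row-wise grouping: max() sees one row at a time.
   by_rowwise <- df %>%
     rowwise() %>%
     mutate(biggest = max(x, y)) %>%
     ungroup()

   # The same computation with an explicit row index as the grouping key.
   # `.row_id` is a made-up name for this sketch, dropped again at the end.
   by_index <- df %>%
     mutate(.row_id = seq_len(n())) %>%
     group_by(.row_id) %>%
     mutate(biggest = max(x, y)) %>%
     ungroup() %>%
     select(-.row_id)
   ```
   
   Both pipelines should produce the same `biggest` column, which is the 
behaviour a row-sequence column would need to reproduce in arrow.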
   
   Let me know what you think or if there's a different way to approach this as 
well. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
