dgreiss commented on issue #33149: URL: https://github.com/apache/arrow/issues/33149#issuecomment-1608402342
I was looking to open a PR, but thought I'd check first to see how you would want it implemented.

`rowwise` is just a special case of `group_by` where you group on each row. dplyr implements this by adding a sequence to the [`group_data` attribute](https://github.com/tidyverse/dplyr/blob/16b472fb2afc50a87502c2b4ed803e2f5f82b9d6/R/rowwise.R#L110), which is a list of rows to group on.

In arrow there is no `group_data` attribute because everything is lazily evaluated, so we only save the `group_vars` in the `arrow_dplyr_query` container.

I think the way to implement this in arrow is to add the row sequence directly to the table:

```r
.data$.nrows <- seq_len(nrow(.data))
```

However, that means you can't do this in the middle of a dplyr pipeline, because you don't reliably have access to the `.data` object in the container.

Possible workarounds:

* Raise an error stating that `rowwise` can only be used at the start of a pipeline. Or only apply `rowwise` to an `ArrowTabular` object, and raise on the `arrow_dplyr_query` container.
* Run `compute` on the current object to obtain the intermediate `.data` object, then add the sequence to that object and apply the `group_by`. This option seems off to me, because you wouldn't typically execute a query plan without an explicit call to `compute`/`collect`.

Let me know what you think, or if there's a different way to approach this.
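
To illustrate the equivalence I'm relying on, here is a minimal sketch in plain dplyr on an in-memory tibble (not arrow code, and not a proposed implementation). The data frame `df` and the column `total` are made up for the example; `.nrows` is the same helper column name as above.

```r
library(dplyr)

# Made-up example data
df <- tibble(x = c(1, 2, 3), y = c(10, 20, 30))

# rowwise(): every row is treated as its own group
via_rowwise <- df %>%
  rowwise() %>%
  mutate(total = sum(c(x, y))) %>%
  ungroup()

# Same result by grouping on an explicit row sequence
via_group_by <- df %>%
  mutate(.nrows = seq_len(n())) %>%   # one distinct value per row
  group_by(.nrows) %>%
  mutate(total = sum(c(x, y))) %>%
  ungroup() %>%
  select(-.nrows)                     # drop the helper column again

identical(via_rowwise$total, via_group_by$total)
```

In this in-memory case the row sequence is cheap to add at any point in the pipeline, which is exactly the step that is hard to do lazily on an `arrow_dplyr_query`.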
