[GitHub] [arrow] dgreiss commented on issue #33149: [R] dplyr support for rowwise()

via GitHub Mon, 26 Jun 2023 15:23:44 -0700


dgreiss commented on issue #33149:
URL: https://github.com/apache/arrow/issues/33149#issuecomment-1608402342

I was looking to open a PR, but thought I'd check to see how you would want
it implemented. `rowwise` is just a special case of `group_by` where you group
on each row. dplyr implements this by adding a sequence to the
[`group_data`
attribute](https://github.com/tidyverse/dplyr/blob/16b472fb2afc50a87502c2b4ed803e2f5f82b9d6/R/rowwise.R#L110)
which is a list of rows to group on. In arrow there is no `group_data`
attribute because everything is lazily evaluated, so we only save the
`group_vars` in the `arrow_dplyr_query` container. I think the way to
implement this in arrow is to add the row sequence directly to the table:

```r
.data$.nrows <- seq_len(nrow(.data))
```
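The claim that `rowwise` is a special case of `group_by` can be checked in plain dplyr. The following is only a minimal sketch on an in-memory data frame: the sample data and the `.row_id` column name are assumptions made for illustration, and this is not how dplyr or arrow implement it internally.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# Reference result using dplyr's own rowwise()
via_rowwise <- df %>%
  rowwise() %>%
  mutate(total = sum(c(x, y))) %>%
  ungroup()

# Same computation, grouping on an explicit row index instead.
# `.row_id` is a hypothetical column name chosen for this sketch.
via_group_by <- df %>%
  mutate(.row_id = seq_len(n())) %>%
  group_by(.row_id) %>%
  mutate(total = sum(c(x, y))) %>%
  ungroup() %>%
  select(-.row_id)

# Both approaches should produce identical columns
stopifnot(isTRUE(all.equal(
  as.data.frame(via_rowwise),
  as.data.frame(via_group_by)
)))
```

Because each group contains exactly one row, any per-group computation becomes a per-row computation.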

However, that means you can't do this in the middle of a dplyr pipeline,
because you don't reliably have access to the `.data` object in the container.
Possible workarounds:

* Raise an error stating that `rowwise` can only be used at the start of a
pipeline, or only apply `rowwise` to an `ArrowTabular` object and raise on the
`adq` (`arrow_dplyr_query`) container.

* Run `compute` on the current object to obtain the intermediate `.data`
object and then add the sequence to the object and apply the `group_by`. This
option seems off to me, because you wouldn't typically execute a query plan
without a specific call to `compute`/`collect`.
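
Both workarounds can be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed final implementation: `rowwise_sketch()` and the `.row_id` column are hypothetical names that do not exist in arrow, and the error message wording is made up.

```r
library(arrow)
library(dplyr)

# Hypothetical helper (not part of arrow) sketching the first workaround:
# accept only ArrowTabular input and raise on a lazy arrow_dplyr_query.
rowwise_sketch <- function(.data) {
  if (inherits(.data, "arrow_dplyr_query")) {
    stop("rowwise() can only be used at the start of a pipeline", call. = FALSE)
  }
  stopifnot(inherits(.data, "ArrowTabular"))
  # Add the row index eagerly, then group on it (grouping itself stays lazy)
  .data$.row_id <- seq_len(nrow(.data))
  group_by(.data, .row_id)
}

tbl <- arrow_table(data.frame(x = c(1, 2, 3), y = c(10, 20, 30)))

# First workaround: works on a Table at the start of the pipeline
result <- tbl %>%
  rowwise_sketch() %>%
  summarise(total = sum(x + y)) %>%
  arrange(.row_id) %>%
  collect()

# ...and raises once the object is already a lazy query
lazy_query <- tbl %>% filter(x > 1)
attempt <- try(rowwise_sketch(lazy_query), silent = TRUE)

# Second workaround: materialize the intermediate result with compute(),
# which executes the plan early, then apply the helper to the new Table
result_after_compute <- lazy_query %>%
  compute() %>%
  rowwise_sketch() %>%
  summarise(total = sum(x + y)) %>%
  arrange(.row_id) %>%
  collect()
```

The second variant makes the cost visible: the query plan runs at the `compute()` call rather than at the final `collect()`.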

Let me know what you think or if there's a different way to approach this as
well.
