[ https://issues.apache.org/jira/browse/ARROW-14266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461530#comment-17461530 ]
Dewey Dunnington commented on ARROW-14266:
------------------------------------------
I'd be happy to take a look at this, but I need a bit more background on which
changes you envision and (approximately) which parts of the code they would touch.
Here is some example code with a simple join and an aggregation, each followed by write_dataset():
{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df1 <- data.frame(a = letters, b = 1:26)
df2 <- data.frame(b = 1:5, c = LETTERS[1:5])
tf1 <- tempfile()
tf2 <- tempfile()

# Join, then write the result as a dataset
record_batch(df2) %>%
  left_join(df1) %>%
  write_dataset(tf1)

open_dataset(tf1) %>%
  collect()
#>   b c a
#> 1 1 A a
#> 2 2 B b
#> 3 3 C c
#> 4 4 D d
#> 5 5 E e

# Aggregate, then write the result as a dataset
record_batch(df1) %>%
  summarise(col = mean(b)) %>%
  write_dataset(tf2)

open_dataset(tf2) %>%
  collect()
#> # A tibble: 1 × 1
#>     col
#>   <dbl>
#> 1  13.5
{code}
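For reference, the issue description quoted below says that a query with a join or an aggregation is currently evaluated and held in memory before it is written. A minimal sketch of that two-step pattern from the user's side, assuming compute() is used to materialize the result as an Arrow Table (this illustrates the idea only and is not the actual internals of write_dataset(); the names tf3, result and out are made up for the example):

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df1 <- data.frame(a = letters, b = 1:26)
tf3 <- tempfile()  # hypothetical output location for this sketch

# Step 1: evaluate the aggregation eagerly. compute() returns an Arrow Table,
# so the full query result is held in memory at this point.
result <- record_batch(df1) %>%
  summarise(col = mean(b)) %>%
  compute()

# Step 2: write the already-materialized Table as a dataset.
write_dataset(result, tf3)

# Read it back to check the round trip.
out <- open_dataset(tf3) %>%
  collect()
```

As I understand the proposal, a WriteNode would let the write happen as part of the query plan itself, so the intermediate result in step 1 would not need to exist as a whole.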
> [R] Use WriteNode to write queries
> ----------------------------------
>
> Key: ARROW-14266
> URL: https://issues.apache.org/jira/browse/ARROW-14266
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Neal Richardson
> Priority: Major
> Labels: query-engine
> Fix For: 7.0.0
>
>
> Following ARROW-13542. Any query that has a join or an aggregation currently
> has to first evaluate the query and hold it in memory before creating a
> Scanner to write it. We could improve that by using a WriteNode inside
> write_dataset() (and maybe that improves the other cases too, or at least
> allows us to delete some code).