[ 
https://issues.apache.org/jira/browse/ARROW-11679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-11679:
--------------------------------
    Summary: [R] Optimal arrow queries for benchmarking  (was: Optimal arrow 
queries for benchmarking)

> [R] Optimal arrow queries for benchmarking
> ------------------------------------------
>
>                 Key: ARROW-11679
>                 URL: https://issues.apache.org/jira/browse/ARROW-11679
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Benchmarking, R
>            Reporter: Jan Gorecki
>            Priority: Major
>
> Hi
> Hello,
> We are running a continuous benchmarking project 
> (https://h2oai.github.io/db-benchmark). In recent days we added Arrow project.
> It uses R's dplyr and ArrowTable as backend. Queries have been written based 
> on arrow R package documentation.
> There are 10 grouping queries:
> ```
> # q1: sum v1 by id1
> AT %>% select(id1, v1) %>% group_by(id1) %>% collect() %>% 
> summarise(v1=sum(v1, na.rm=TRUE))
> # q2: sum v1 by id1:id2
> AT %>% select(id1, id2, v1) %>% group_by(id1, id2) %>% collect() %>% 
> summarise(v1=sum(v1, na.rm=TRUE))
> # q3: sum v1 mean v3 by id3
> AT %>% select(id3, v1, v3) %>% group_by(id3) %>% collect() %>% 
> summarise(v1=sum(v1, na.rm=TRUE), v3=mean(v3, na.rm=TRUE))
> # q4: mean v1:v3 by id4
> AT %>% select(id4, v1, v2, v3) %>% group_by(id4) %>% collect() %>% 
> summarise_at(.funs=\"mean\", .vars=c(\"v1\",\"v2\",\"v3\"), na.rm=TRUE)
> # q5: sum v1:v3 by id6
> AT %>% select(id6, v1, v2, v3) %>% group_by(id6) %>% collect () %>% 
> summarise_at(.funs=\"sum\", .vars=c(\"v1\",\"v2\",\"v3\"), na.rm=TRUE)
> # q6: median v3 sd v3 by id4 id5
> AT %>% select(id4, id5, v3) %>% group_by(id4, id5) %>% collect() %>% 
> summarise(median_v3=median(v3, na.rm=TRUE), sd_v3=sd(v3, na.rm=TRUE))
> # q7: max v1 - min v2 by id3
> AT %>% select(id3, v1, v2) %>% group_by(id3) %>% collect() %>% 
> summarise(range_v1_v2=max(v1, na.rm=TRUE)-min(v2, na.rm=TRUE))
> # q8: largest two v3 by id6
> AT %>% select(id6, largest2_v3=v3) %>% filter(!is.na(largest2_v3)) %>% 
> arrange(desc(largest2_v3)) %>% group_by(id6) %>% filter(row_number() <= 2L) 
> %>% compute()
> # q9: regression v1 v2 by id2 id4
> AT %>% select(id2, id4, v1, v2) %>% group_by(id2, id4) %>% collect() %>% 
> summarise(r2=cor(v1, v2, use=\"na.or.complete\")^2)
> # q10: sum v3 count by id1:id6
> AT %>% select(id1, id2, id3, id4, id5, id6, v3) %>% group_by(id1, id2, id3, 
> id4, id5, id6) %>% collect() %>% summarise(v3=sum(v3, na.rm=TRUE), count=n())
> ```
> Full benchmark script can be found at 
> https://github.com/h2oai/db-benchmark/blob/master/arrow/groupby-arrow.R
> ----
> As per my understanding, all above queries (maybe excluding query 8) will not 
> utilize any arrow computation, as of now. It is because those operations are 
> not yet implemented in arrow, and they are falling back to dplyr 
> implementation.
> According to Neal's presentation I watched recently, code written now will 
> over time get improved by improvements in arrow implementation. Continuous 
> benchmark I am working on upgrades software automatically, therefore I would 
> like to use the fact to write code now, and have it faster in future, as 
> arrow implementation progresses. I believe the mentioned queries will not 
> satisfy that, because of `collect()` call in the middle. AFAIU it needs a 
> `compute()` call at the end instead (like now in query 8).
> Is there a way to write this code to be optimal now, and also optimal in 
> future. Similarly as presented by Neal in his presentation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to