[ https://issues.apache.org/jira/browse/ARROW-11679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286128#comment-17286128 ]

Neal Richardson commented on ARROW-11679:
-----------------------------------------

Thanks [~jangorecki] for adding arrow to your benchmarking. Since you're 
working against an in-memory Arrow Table (and not a Dataset, which may be 
backed by lots of files not in memory), you actually don't need to 
{{collect()}} in the query; {{summarize()}} will {{collect()}} the columns it 
needs. We don't do that automatically for Datasets because we thought it best 
to make it explicit when you are pulling data (potentially too much data) into 
R. Regarding {{compute()}}, I don't think that's implemented.
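
For illustration, here is q1 from the queries quoted below written without the 
intermediate {{collect()}} (a minimal sketch; {{AT}} and the column names are 
the ones used in the benchmark):

```r
library(arrow)
library(dplyr)

# q1: sum v1 by id1, run against an in-memory Arrow Table.
# No collect() in the pipeline: summarise() collects only the columns it
# needs (id1 and v1), and the aggregation then runs in dplyr on those columns.
AT %>%
  select(id1, v1) %>%
  group_by(id1) %>%
  summarise(v1 = sum(v1, na.rm = TRUE))
```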

[~wesm] is right, currently these benchmarks would mostly show dplyr's grouped 
aggregation. If you're interested in showing improvements over time as we 
implement the Arrow-native query engine, you might want to go forward with them 
now. If you're only interested in the latest snapshot comparison, it's probably 
not that informative until we're computing more in Arrow.

> [R] Optimal arrow queries for benchmarking
> ------------------------------------------
>
>                 Key: ARROW-11679
>                 URL: https://issues.apache.org/jira/browse/ARROW-11679
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Benchmarking, R
>            Reporter: Jan Gorecki
>            Priority: Major
>
> Hello,
> We are running a continuous benchmarking project 
> (https://h2oai.github.io/db-benchmark). We recently added the Arrow project 
> to it, using R's dplyr with an Arrow Table as the backend. The queries were 
> written based on the arrow R package documentation.
> There are 10 grouping queries:
> ```
> # q1: sum v1 by id1
> AT %>% select(id1, v1) %>% group_by(id1) %>% collect() %>% 
> summarise(v1=sum(v1, na.rm=TRUE))
> # q2: sum v1 by id1:id2
> AT %>% select(id1, id2, v1) %>% group_by(id1, id2) %>% collect() %>% 
> summarise(v1=sum(v1, na.rm=TRUE))
> # q3: sum v1 mean v3 by id3
> AT %>% select(id3, v1, v3) %>% group_by(id3) %>% collect() %>% 
> summarise(v1=sum(v1, na.rm=TRUE), v3=mean(v3, na.rm=TRUE))
> # q4: mean v1:v3 by id4
> AT %>% select(id4, v1, v2, v3) %>% group_by(id4) %>% collect() %>% 
> summarise_at(.funs="mean", .vars=c("v1","v2","v3"), na.rm=TRUE)
> # q5: sum v1:v3 by id6
> AT %>% select(id6, v1, v2, v3) %>% group_by(id6) %>% collect() %>% 
> summarise_at(.funs="sum", .vars=c("v1","v2","v3"), na.rm=TRUE)
> # q6: median v3 sd v3 by id4 id5
> AT %>% select(id4, id5, v3) %>% group_by(id4, id5) %>% collect() %>% 
> summarise(median_v3=median(v3, na.rm=TRUE), sd_v3=sd(v3, na.rm=TRUE))
> # q7: max v1 - min v2 by id3
> AT %>% select(id3, v1, v2) %>% group_by(id3) %>% collect() %>% 
> summarise(range_v1_v2=max(v1, na.rm=TRUE)-min(v2, na.rm=TRUE))
> # q8: largest two v3 by id6
> AT %>% select(id6, largest2_v3=v3) %>% filter(!is.na(largest2_v3)) %>% 
> arrange(desc(largest2_v3)) %>% group_by(id6) %>% filter(row_number() <= 2L) 
> %>% compute()
> # q9: regression v1 v2 by id2 id4
> AT %>% select(id2, id4, v1, v2) %>% group_by(id2, id4) %>% collect() %>% 
> summarise(r2=cor(v1, v2, use="na.or.complete")^2)
> # q10: sum v3 count by id1:id6
> AT %>% select(id1, id2, id3, id4, id5, id6, v3) %>% group_by(id1, id2, id3, 
> id4, id5, id6) %>% collect() %>% summarise(v3=sum(v3, na.rm=TRUE), count=n())
> ```
> Full benchmark script can be found at 
> https://github.com/h2oai/db-benchmark/blob/master/arrow/groupby-arrow.R
> ----
> As per my understanding, none of the above queries (maybe excluding query 8) 
> will utilize any arrow computation as of now, because those operations are 
> not yet implemented in arrow and fall back to the dplyr implementation.
> According to Neal's presentation I watched recently, code written now will 
> get faster over time as the arrow implementation improves. The continuous 
> benchmark I am working on upgrades software automatically, so I would like to 
> take advantage of that and write the code now, so it speeds up in the future 
> as arrow progresses. I believe the above queries will not satisfy that, 
> because of the `collect()` call in the middle. AFAIU they need a `compute()` 
> call at the end instead (as in query 8 now).
> Is there a way to write this code so that it is optimal now and also optimal 
> in the future, as Neal showed in his presentation?
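
For reference, a minimal sketch of the "collect at the end" shape being asked 
about, using q7 (max v1 - min v2 by id3) as an example; `AT` and the column 
names are the ones from the queries above, and since `compute()` isn't 
implemented yet, the pipeline ends with `collect()`:

```r
library(arrow)
library(dplyr)

# q7 written with the whole pipeline against the Arrow object and the data
# pulled into R only at the end. Today the grouped aggregation still falls
# back to dplyr, but as arrow implements more of its query engine, more of
# these steps can run in Arrow before the final collect().
AT %>%
  select(id3, v1, v2) %>%
  group_by(id3) %>%
  summarise(range_v1_v2 = max(v1, na.rm = TRUE) - min(v2, na.rm = TRUE)) %>%
  collect()
```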



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
