[GitHub] [arrow-site] jonkeane commented on a change in pull request #158: R blog post

GitBox Thu, 04 Nov 2021 12:01:29 -0700


jonkeane commented on a change in pull request #158:
URL: https://github.com/apache/arrow-site/pull/158#discussion_r743124395




##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a whole dataset; 6.0.0 adds aggregation across groups with `group_by()`. This works both with in-memory Arrow tables and across partitioned datasets. As usual, Arrow will read and process data in chunks and in parallel when possible, producing results much faster than loading everything into memory and then processing it. Even better, this allows for operations on data that wouldn’t fit into memory on a single machine.
+
+The focus of this release has been on the initial implementation of this functionality; for the next release we’ll be looking to profile and optimize to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min()`, `max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability are also supported and currently return approximate results using the t-digest algorithm.
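+
+As a minimal sketch of the syntax (using the built-in `mtcars` data purely for illustration; the column choices are not from the release itself), a grouped aggregation on an in-memory Arrow table could look like this:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+# mtcars ships with R; it stands in here for a larger table or dataset
+cyl_summary <- arrow_table(mtcars) %>%
+  group_by(cyl) %>%
+  summarise(
+    n_cars = n(),
+    mean_mpg = mean(mpg),
+    max_hp = max(hp)
+  ) %>%
+  # sort explicitly so the row order is stable
+  arrange(cyl) %>%
+  collect()
+```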
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
+```r
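+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+  ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  # join on the carrier code to pick up the airline's full name
+  left_join(
+    arrow_table(nycflights13::airlines),
+    by = "carrier"
+  ) %>%
+  collect()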
+```
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to 
the way that dplyr pipelines are constructed and executed in R. There are 
almost no changes (besides new capabilities) that one would run into, but the 
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+* `summarise()` on Arrow tables no longer evaluates eagerly; you must call `collect()` or `compute()` to evaluate the query. This is in line with how datasets worked before, as well as how a number of other dplyr backends work.
+* The order of rows returned by dataset queries is no longer deterministic. If you need a stable sort order, use `arrange()` at the end of your query.
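+
+To illustrate both points, here is a small sketch, again assuming the built-in `mtcars` data as a stand-in. The result of `summarise()` stays an unevaluated query until `collect()` or `compute()` is called (as far as I understand, the former returns a data frame and the latter an Arrow table), and `arrange()` pins down the row order:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+# Nothing is computed yet: this only builds a query object
+query <- arrow_table(mtcars) %>%
+  group_by(gear) %>%
+  summarise(mean_wt = mean(wt)) %>%
+  arrange(gear)
+
+# compute() evaluates the query and keeps the result in Arrow
+as_table <- compute(query)
+
+# collect() evaluates the query and returns an R data frame
+as_df <- collect(query)
+```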
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand off an Arrow Dataset or query object to DuckDB for further querying using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before `to_duckdb()` are evaluated in Arrow. You can also hand off DuckDB data (or the result of a query) to Arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and 
want to avoid the worst-of-the-worst delays. To do this, we can use 
`percent_rank()`; however that requires a window function which isn’t yet 
available in Arrow, so let’s try sending the data to DuckDB to do that, then 
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+ 
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+  select(carrier, origin, dest, arr_delay) %>%
+  # arriving early doesn't matter, so call negative delays 0
+  mutate(arr_delay = pmax(arr_delay, 0)) %>%
+  to_duckdb() %>%
+  # for each carrier-origin-dest, take the worst 5% of delays
+  group_by(carrier, origin, dest) %>%
+  mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+  filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+  to_arrow() %>%
+  # now summarise to get mean/min
+  group_by(carrier, origin, dest) %>%
+  summarise(
+    arr_delay_mean = mean(arr_delay),
+    arr_delay_min = min(arr_delay),
+    num_flights = n()
+  ) %>%
+  filter(dest %in% c("ORD", "MDW")) %>%
+  arrange(desc(arr_delay_mean)) %>%
+  collect()
+```
+# Expanded use of altrep
+
+We are continuing to expand our use of R’s altrep where possible. In 5.0.0, only a limited set of circumstances took advantage of altrep; in 6.0.0 we have expanded the supported types to include strings, as well as vectors with `NA`s.
+
+```r
+library(microbenchmark)
+library(arrow)
+
+tbl <-
+  arrow_table(data.frame(
+    x = rnorm(10000000),
+    y = sample(c(letters, NA), 10000000, replace = TRUE)
+  ))
+
+with_altrep <- function(data){
+  options(arrow.use_altrep = TRUE)
+  as.data.frame(data)  
+}
+
+without_altrep <- function(data){
+  options(arrow.use_altrep = FALSE)
+  as.data.frame(data)  
+}
+
+microbenchmark(
+  without_altrep(tbl),
+  with_altrep(tbl)
+)
+
+# Unit: milliseconds
+#                 expr      min        lq      mean    median        uq      max neval
+#  without_altrep(tbl) 191.0788 213.82235 249.65076 225.52120 244.26977 512.1652   100
+#     with_altrep(tbl)  48.7152  50.97269  65.56832  52.93795  55.24505 338.4602   100
+
+```
+# Over 30 new compute functions
+
+There are over 30 new compute functions available in this release, including 
string functions `str_to_lower()`, `str_to_upper()`, `str_to_title()`, 
`startsWith()`, `endsWith()`, `str_starts()`, and `str_ends()`, date and time 
functions `strftime()`, `format_ISO8601()`, `is_timestamp()` and others.
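+
+For example, a few of the new string functions can be used inside `mutate()` on an Arrow table. This is only a sketch: the `people` data frame and its column names are made up for illustration.
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+# hypothetical example data, not from the release
+people <- data.frame(
+  name = c("ada lovelace", "grace hopper"),
+  stringsAsFactors = FALSE
+)
+
+formatted <- arrow_table(people) %>%
+  mutate(
+    name_upper = str_to_upper(name),
+    name_title = str_to_title(name),
+    starts_with_g = startsWith(name, "g")
+  ) %>%
+  # mutate() on an Arrow table is lazy, so collect() to get a data frame
+  collect()
+```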
+
+# Can now install in an offline mode on Linux
+
+For folks who need to install Arrow on an airgapped server, we have included helper functions and installation options that make it easier to download a fat-source version of the arrow package that includes both the Arrow source and the third-party dependencies needed when building Arrow.
+
+The helper function `create_package_with_all_dependencies()` can be run from a 
computer that does have access to the internet and it will create a fat-source 
package which can then be transferred and installed on a server without 
connectivity. This helper is also available on GitHub without installing the 
arrow package. For more installation details, [see the 
docs](https://arrow.apache.org/docs/r/articles/install.html#offline-installation).

Review comment:
       ```suggestion
   The helper function `create_package_with_all_dependencies()` can be run from 
a computer that does have access to the internet and it will create a 
fat-source package which can then be transferred and installed on a server 
without connectivity. This helper is also available on GitHub without 
installing the arrow package. For more installation details, [see the 
docs](https://arrow.apache.org/docs/r/articles/install.html#offline-installation).
   
   Special thanks to Karl Dunkle Werner for the PRs to make this possible. Karl 
has also been added as a contributor in recognition of this contribution along 
with a number before this. Thank you, Karl!
   ```




