jonkeane commented on a change in pull request #158:
URL: https://github.com/apache/arrow-site/pull/158#discussion_r743124395
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset; 6.0.0 adds aggregation across groups with `group_by()`. This
works with both in-memory Arrow tables and partitioned datasets. As usual,
Arrow reads and processes data in chunks and in parallel when possible. This
produces results much faster than loading everything into memory before
processing and, even better, allows for operations on data that wouldn’t fit
into memory on a single machine.
+
+The focus of this release has been on the initial implementation of this
functionality; for the next release we’ll be looking to profile and optimize
to improve performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min()`,
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()`
and `quantile()` with one probability are also supported and currently return
approximate results using the t-digest algorithm.
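+
A minimal sketch of the grouped-aggregation syntax described above, using the built-in `mtcars` data frame as a stand-in (the dataset and the output column names `n_cars`, `mean_mpg`, and `max_hp` are assumptions for illustration, not part of the release notes):

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# mtcars ships with base R; any data frame or Arrow Dataset could be used here
cars_by_cyl <- arrow_table(mtcars) %>%
  group_by(cyl) %>%
  summarise(
    n_cars = n(),         # number of rows in each group
    mean_mpg = mean(mpg), # per-group mean
    max_hp = max(hp)      # per-group maximum
  ) %>%
  arrange(cyl) %>%        # request a stable row order
  collect()               # evaluate the query and return a data frame

cars_by_cyl
```

The aggregation itself runs in Arrow; only the small summary table is brought into R by `collect()`.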
+
+# Joins
+
+Multiple Arrow tables and datasets can now be joined in queries.
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
+9th October 2013, with the full name of the airline included.
+
+```r
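+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+  ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  # join on the shared carrier code to add the full airline name
+  left_join(
+    arrow_table(nycflights13::airlines),
+    by = "carrier"
+  ) %>%
+  collect()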
+```
+
+# Big changes to the execution of queries under-the-hood
+
+Both of the features above are driven by a large under-the-hood change to
the way that dplyr pipelines are constructed and executed in R. Besides the
new capabilities, there are almost no changes that one would run into, but the
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+
+* `summarise()` when used with Arrow tables no longer evaluates eagerly;
instead you must call `collect()` or `compute()` to evaluate the query. This is
in line with how datasets worked before as well as how a number of other dplyr
backends work.
+* The row order of dataset query results is no longer deterministic. If you
need a stable sort order, you should use `arrange()` at the end of your query.
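+
As an illustration of the two changes listed above, the sketch below builds a query that is not evaluated until `collect()` or `compute()` is called, and ends with `arrange()` to fix the row order. It again assumes `mtcars` as stand-in data; the variable names are hypothetical.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Building the pipeline does not run it: `query` is a lazy query object
query <- arrow_table(mtcars) %>%
  group_by(gear) %>%
  summarise(mean_wt = mean(wt)) %>%
  arrange(gear) # ask for a stable sort order explicitly

class(query)

# collect() evaluates the query and returns an R data frame
result <- collect(query)

# compute() evaluates the query but keeps the result as an Arrow Table
result_tbl <- compute(query)
```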
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand
off an Arrow Dataset or query object to duckdb for further querying using the
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as
well as its SQL interface, to aggregate data. Filtering and column projection
done before `to_duckdb()` are evaluated in Arrow. You can also hand off DuckDB
data (or the result of a query) to Arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and
want to avoid the worst-of-the-worst delays. To do this, we can use
`percent_rank()`; however, that requires a window function which isn’t yet
available in Arrow, so let’s try sending the data to DuckDB to do that, then
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+  select(carrier, origin, dest, arr_delay) %>%
+  # arriving early doesn't matter, so call negative delays 0
+  mutate(arr_delay = pmax(arr_delay, 0)) %>%
+  to_duckdb() %>%
+  # for each carrier-origin-dest, take the worst 5% of delays
+  group_by(carrier, origin, dest) %>%
+  mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+  filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+  to_arrow() %>%
+  # now summarise to get mean/min
+  group_by(carrier, origin, dest) %>%
+  summarise(
+    arr_delay_mean = mean(arr_delay),
+    arr_delay_min = min(arr_delay),
+    num_flights = n()
+  ) %>%
+  filter(dest %in% c("ORD", "MDW")) %>%
+  arrange(desc(arr_delay_mean)) %>%
+  collect()
+```
+
+# Expanded use of altrep
+
+We are continuing to expand our use of R’s altrep where possible. In 5.0.0
only a limited set of circumstances took advantage of altrep; in 6.0.0 we have
expanded the supported types (to include strings) and added support for
vectors containing `NA`s.
+
+```r
+library(microbenchmark)
+library(arrow)
+
+tbl <-
+  arrow_table(data.frame(
+    x = rnorm(10000000),
+    y = sample(c(letters, NA), 10000000, replace = TRUE)
+  ))
+
+with_altrep <- function(data){
+  options(arrow.use_altrep = TRUE)
+  as.data.frame(data)
+}
+
+without_altrep <- function(data){
+  options(arrow.use_altrep = FALSE)
+  as.data.frame(data)
+}
+
+microbenchmark(
+  without_altrep(tbl),
+  with_altrep(tbl)
+)
+
+# Unit: milliseconds
+#                 expr      min        lq      mean    median        uq      max neval
+#  without_altrep(tbl) 191.0788 213.82235 249.65076 225.52120 244.26977 512.1652   100
+#     with_altrep(tbl)  48.7152  50.97269  65.56832  52.93795  55.24505 338.4602   100
+
+```
+
+# Over 30 new compute functions
+
+There are over 30 new compute functions available in this release, including
the string functions `str_to_lower()`, `str_to_upper()`, `str_to_title()`,
`startsWith()`, `endsWith()`, `str_starts()`, and `str_ends()`, as well as the
date and time functions `strftime()`, `format_ISO8601()`, and
`is_timestamp()`, among others.
+
+# Can now install in an offline mode on Linux
+
+For folks who need to install Arrow on an airgapped server, we have included
helper functions and installation options that make it easier to download a
fat-source of the arrow package that includes both the Arrow source and the
third-party dependencies that are needed when building Arrow.
+
+The helper function `create_package_with_all_dependencies()` can be run from a
computer that does have access to the internet and it will create a fat-source
package which can then be transferred and installed on a server without
connectivity. This helper is also available on GitHub without installing the
arrow package. For more installation details, [see the
docs](https://arrow.apache.org/docs/r/articles/install.html#offline-installation).
Review comment:
```suggestion
The helper function `create_package_with_all_dependencies()` can be run from
a computer that does have access to the internet and it will create a
fat-source package which can then be transferred and installed on a server
without connectivity. This helper is also available on GitHub without
installing the arrow package. For more installation details, [see the
docs](https://arrow.apache.org/docs/r/articles/install.html#offline-installation).
Special thanks to Karl Dunkle Werner for the PRs to make this possible. Karl
has also been added as a contributor in recognition of this contribution along
with a number of earlier ones. Thank you, Karl!
```