nealrichardson commented on a change in pull request #158:
URL: https://github.com/apache/arrow-site/pull/158#discussion_r743953748
##########
File path: _posts/2021-11-05-r-6.0.0.md
##########
@@ -0,0 +1,207 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-05"
+author: Nic Crane, Jonathan Keane, Neal Richardson
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We are excited to announce the recent release of version 6.0.0 of the Arrow R
package on CRAN. While we usually don't write a dedicated release blog post for
the R package, this one is special. There are a number of major new features in
this version, some of which we've been building up to for several years.
Review comment:
```suggestion
We are excited to announce the recent release of version 6.0.0 of the Arrow
R package on [CRAN](https://cran.r-project.org/package=arrow). While we usually
don't write a dedicated release blog post for the R package, this one is
special. There are a number of major new features in this version, some of
which we've been building up to for several years.
```
##########
File path: _posts/2021-11-05-r-6.0.0.md
##########
@@ -0,0 +1,207 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-05"
+author: Nic Crane, Jonathan Keane, Neal Richardson
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We are excited to announce the recent release of version 6.0.0 of the Arrow R
package on CRAN. While we usually don't write a dedicated release blog post for
the R package, this one is special. There are a number of major new features in
this version, some of which we've been building up to for several years.
+
+# More dplyr support
+
+In version 0.16.0 (February 2020), we released the first version of the
Dataset feature, which allowed you to query multi-file datasets using
`dplyr::select()` and `filter()`. These tools allowed you to find a slice of
data in a large dataset that may not fit into memory and pull it into R for
further analysis. In version 4.0.0 earlier this year, we added support for
`mutate()` and a number of other dplyr verbs, and all year we've been adding
hundreds of functions you can use to transform and filter data in Datasets.
However, to aggregate, you'd still need to pull the data into R.
+
+## Grouped aggregation
+
+With `arrow` 6.0.0, you can now `summarise()` on Arrow data, both with or
without `group_by()`. These are supported both with in-memory Arrow tables as
well as across partitioned datasets. Most common aggregation functions are
supported: `n()`, `n_distinct()`, `min(),` `max()`, `sum()`, `mean()`, `var()`,
`sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability
are also supported and currently return approximate results using the t-digest
algorithm.
+
+As usual, Arrow will read and process data in chunks and in parallel when
possible to produce results much faster than one could by loading it all into
memory then processing. This allows for operations that wouldn't fit into
memory on a single machine.
+
+## Joins
+
+In addition to aggregation, Arrow also supports all of dplyr's mutating joins
(inner, left, right, and full) and filtering joins (semi and anti).
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+ filter(
+ year == 2013,
+ month == 10,
+ day == 9,
+ origin == "JFK",
+ dest == "LAS"
+ ) %>%
+ select(dep_time, arr_time, carrier) %>%
+ left_join(
+ arrow_table(nycflights13::airlines)
+ ) %>%
+ collect()
+
+#> # A tibble: 12 × 4
+#> dep_time arr_time carrier name
+#> <int> <int> <chr> <chr>
+#> 1 637 853 B6 JetBlue Airways
+#> 2 648 912 AA American Airlines Inc.
+#> 3 812 1029 DL Delta Air Lines Inc.
+#> 4 945 1206 VX Virgin America
+#> 5 955 1219 B6 JetBlue Airways
+#> 6 1018 1231 DL Delta Air Lines Inc.
+#> 7 1120 1338 B6 JetBlue Airways
+#> 8 1451 1705 DL Delta Air Lines Inc.
+#> 9 1656 1915 AA American Airlines Inc.
+#> 10 1755 2001 DL Delta Air Lines Inc.
+#> 11 1827 2049 B6 JetBlue Airways
+#> 12 1917 2126 DL Delta Air Lines Inc.
+```
+
+In this example, we're working on an in-memory table, so you wouldn't need
`arrow` to do this--but the same code would work on a larger-than-memory
dataset backed by thousands of Parquet files.
+
+## Under the hood
+
+To support these features, we've made some internal changes to how queries are
built up and--importantly--when they are evaluated. As a result, there are some
changes in behavior compared to past versions of `arrow`. First, calls to
`summarise()`, `head()`, and `tail()` no longer eagerly evaluate: this means
you need to call either `compute()` (to evaluate it and produce an Arrow Table)
or `collect()` (to evaluate and pull the Table into an R `data.frame`) to see
the results.
+
+Second, the order of rows in a dataset query is no longer determinisitic due
to the way the parallelization of work happens in the C++ library. This means
that you can't assume that the results of a query will be in the same order as
the rows of data in the files on disk. If you do need a stable sort order, call
`arrange()` to specify ordering.
+
+While these changes are a break from past `arrow` behavior, they are
consistent with many `dbplyr` backends and are needed to allow queries to scale
beyond data-frame workflows that can fit into memory.
+
+# Integration with DuckDB
+
+The Arrow engine is not the only new way to query Arrow Datasets in this
release. If you have the [duckdb](https://duckdb.org/) package installed, you
can hand off an Arrow Dataset or query object to duckdb for further querying
using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr`
methods, as well as its SQL interface, to aggregate data. DuckDB supports
filter pushdown, so you can take advantage of Arrow Datasets and Arrow-based
optimizations even within a DuckDB SQL query with a `where` clause. Filtering
and column projection specified before `to_duckdb()` in a pipeline is evaluated
in Arrow, which can be helpful in some circumstances, such as complicated
dbplyr pipelines. You can also hand off DuckDB data (or the result of a query)
to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and
want to avoid the worst-of-the-worst delays. To do this, we can use
`percent_rank()`; however that requires a window function which isn’t yet
available in Arrow, so let’s try sending the data to DuckDB to do that, then
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+ select(carrier, origin, dest, arr_delay) %>%
+ # arriving early doesn't matter, so call negative delays 0
+ mutate(arr_delay = pmax(arr_delay, 0)) %>%
+ to_duckdb() %>%
+ # for each carrier-origin-dest, take the worst 5% of delays
+ group_by(carrier, origin, dest) %>%
+ mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+ filter(arr_delay_rank > 0.95)
+
+head(flights_filtered)
+#> # Source: lazy query [?? x 5]
+#> # Database: duckdb_connection
+#> # Groups: carrier, origin, dest
+#> carrier origin dest arr_delay arr_delay_rank
+#> <chr> <chr> <chr> <dbl> <dbl>
+#> 1 9E JFK RIC 119 0.952
+#> 2 9E JFK RIC 125 0.956
+#> 3 9E JFK RIC 137 0.960
+#> 4 9E JFK RIC 137 0.960
+#> 5 9E JFK RIC 158 0.968
+#> 6 9E JFK RIC 163 0.972
+```
+
+Now we have all of the flights filtered to those that are the
worst-of-the-worst, and stored as a dbplyr lazy `tbl` with our DuckDB
connection. This is an example of using Arrow -> DuckDB.
+
+But we can do more: we can then bring that data back into Arrow just as
easily. For the rest of our analysis, we pick up where we left off with the
`tbl` referring to the DuckDB query:
+
+```r
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+ to_arrow() %>%
+ # now summarise to get mean/min
+ group_by(carrier, origin, dest) %>%
+ summarise(
+ arr_delay_mean = mean(arr_delay),
+ arr_delay_min = min(arr_delay),
+ num_flights = n()
+ ) %>%
+ filter(dest %in% c("ORD", "MDW")) %>%
+ arrange(desc(arr_delay_mean)) %>%
+ collect()
+#> # A tibble: 10 × 6
+#> # Groups: carrier, origin [10]
+#> carrier origin dest arr_delay_mean arr_delay_min num_flights
+#> <chr> <chr> <chr> <dbl> <dbl> <int>
+#> 1 MQ EWR ORD 190. 103 113
+#> 2 9E JFK ORD 185. 134 52
+#> 3 UA LGA ORD 179. 101 157
+#> 4 WN LGA MDW 178. 107 103
+#> 5 AA JFK ORD 178. 133 19
+#> 6 B6 JFK ORD 174. 129 46
+#> 7 WN EWR MDW 167. 107 103
+#> 8 UA EWR ORD 149. 87 189
+#> 9 AA LGA ORD 135. 78 280
+#> 10 EV EWR ORD 35 35 1
+```
+
+And just like that, we've passed data back and forth between Arrow and DuckDB
without having to write a single file to disk!
+
+# Expanded use of ALTREP
+
+We are continuing our use of R’s ALTREP where possible. In 5.0.0 there were a
limited set of circumstances that took advantage of ALTREP, but in 6.0.0 we
have expanded types (to include strings), as well as vectors with `NA`s.
Review comment:
```suggestion
We are continuing our use of R’s
[ALTREP](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) where
possible. In 5.0.0 there were a limited set of circumstances that took
advantage of ALTREP, but in 6.0.0 we have expanded types (to include strings),
as well as vectors with `NA`s.
```
##########
File path: _posts/2021-11-05-r-6.0.0.md
##########
@@ -0,0 +1,207 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-05"
+author: Nic Crane, Jonathan Keane, Neal Richardson
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We are excited to announce the recent release of version 6.0.0 of the Arrow R
package on CRAN. While we usually don't write a dedicated release blog post for
the R package, this one is special. There are a number of major new features in
this version, some of which we've been building up to for several years.
+
+# More dplyr support
+
+In version 0.16.0 (February 2020), we released the first version of the
Dataset feature, which allowed you to query multi-file datasets using
`dplyr::select()` and `filter()`. These tools allowed you to find a slice of
data in a large dataset that may not fit into memory and pull it into R for
further analysis. In version 4.0.0 earlier this year, we added support for
`mutate()` and a number of other dplyr verbs, and all year we've been adding
hundreds of functions you can use to transform and filter data in Datasets.
However, to aggregate, you'd still need to pull the data into R.
+
+## Grouped aggregation
+
+With `arrow` 6.0.0, you can now `summarise()` on Arrow data, both with or
without `group_by()`. These are supported both with in-memory Arrow tables as
well as across partitioned datasets. Most common aggregation functions are
supported: `n()`, `n_distinct()`, `min(),` `max()`, `sum()`, `mean()`, `var()`,
`sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability
are also supported and currently return approximate results using the t-digest
algorithm.
+
+As usual, Arrow will read and process data in chunks and in parallel when
possible to produce results much faster than one could by loading it all into
memory then processing. This allows for operations that wouldn't fit into
memory on a single machine.
+
+## Joins
+
+In addition to aggregation, Arrow also supports all of dplyr's mutating joins
(inner, left, right, and full) and filtering joins (semi and anti).
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+ filter(
+ year == 2013,
+ month == 10,
+ day == 9,
+ origin == "JFK",
+ dest == "LAS"
+ ) %>%
+ select(dep_time, arr_time, carrier) %>%
+ left_join(
+ arrow_table(nycflights13::airlines)
+ ) %>%
+ collect()
+
+#> # A tibble: 12 × 4
+#> dep_time arr_time carrier name
+#> <int> <int> <chr> <chr>
+#> 1 637 853 B6 JetBlue Airways
+#> 2 648 912 AA American Airlines Inc.
+#> 3 812 1029 DL Delta Air Lines Inc.
+#> 4 945 1206 VX Virgin America
+#> 5 955 1219 B6 JetBlue Airways
+#> 6 1018 1231 DL Delta Air Lines Inc.
+#> 7 1120 1338 B6 JetBlue Airways
+#> 8 1451 1705 DL Delta Air Lines Inc.
+#> 9 1656 1915 AA American Airlines Inc.
+#> 10 1755 2001 DL Delta Air Lines Inc.
+#> 11 1827 2049 B6 JetBlue Airways
+#> 12 1917 2126 DL Delta Air Lines Inc.
+```
+
+In this example, we're working on an in-memory table, so you wouldn't need
`arrow` to do this--but the same code would work on a larger-than-memory
dataset backed by thousands of Parquet files.
+
+## Under the hood
+
+To support these features, we've made some internal changes to how queries are
built up and--importantly--when they are evaluated. As a result, there are some
changes in behavior compared to past versions of `arrow`. First, calls to
`summarise()`, `head()`, and `tail()` no longer eagerly evaluate: this means
you need to call either `compute()` (to evaluate it and produce an Arrow Table)
or `collect()` (to evaluate and pull the Table into an R `data.frame`) to see
the results.
+
+Second, the order of rows in a dataset query is no longer determinisitic due
to the way the parallelization of work happens in the C++ library. This means
that you can't assume that the results of a query will be in the same order as
the rows of data in the files on disk. If you do need a stable sort order, call
`arrange()` to specify ordering.
+
+While these changes are a break from past `arrow` behavior, they are
consistent with many `dbplyr` backends and are needed to allow queries to scale
beyond data-frame workflows that can fit into memory.
+
+# Integration with DuckDB
+
+The Arrow engine is not the only new way to query Arrow Datasets in this
release. If you have the [duckdb](https://duckdb.org/) package installed, you
can hand off an Arrow Dataset or query object to duckdb for further querying
using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr`
methods, as well as its SQL interface, to aggregate data. DuckDB supports
filter pushdown, so you can take advantage of Arrow Datasets and Arrow-based
optimizations even within a DuckDB SQL query with a `where` clause. Filtering
and column projection specified before `to_duckdb()` in a pipeline is evaluated
in Arrow, which can be helpful in some circumstances, such as complicated
dbplyr pipelines. You can also hand off DuckDB data (or the result of a query)
to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and
want to avoid the worst-of-the-worst delays. To do this, we can use
`percent_rank()`; however that requires a window function which isn’t yet
available in Arrow, so let’s try sending the data to DuckDB to do that, then
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+ select(carrier, origin, dest, arr_delay) %>%
+ # arriving early doesn't matter, so call negative delays 0
+ mutate(arr_delay = pmax(arr_delay, 0)) %>%
+ to_duckdb() %>%
+ # for each carrier-origin-dest, take the worst 5% of delays
+ group_by(carrier, origin, dest) %>%
+ mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+ filter(arr_delay_rank > 0.95)
+
+head(flights_filtered)
+#> # Source: lazy query [?? x 5]
+#> # Database: duckdb_connection
+#> # Groups: carrier, origin, dest
+#> carrier origin dest arr_delay arr_delay_rank
+#> <chr> <chr> <chr> <dbl> <dbl>
+#> 1 9E JFK RIC 119 0.952
+#> 2 9E JFK RIC 125 0.956
+#> 3 9E JFK RIC 137 0.960
+#> 4 9E JFK RIC 137 0.960
+#> 5 9E JFK RIC 158 0.968
+#> 6 9E JFK RIC 163 0.972
+```
+
+Now we have all of the flights filtered to those that are the
worst-of-the-worst, and stored as a dbplyr lazy `tbl` with our DuckDB
connection. This is an example of using Arrow -> DuckDB.
+
+But we can do more: we can then bring that data back into Arrow just as
easily. For the rest of our analysis, we pick up where we left off with the
`tbl` referring to the DuckDB query:
+
+```r
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+ to_arrow() %>%
+ # now summarise to get mean/min
+ group_by(carrier, origin, dest) %>%
+ summarise(
+ arr_delay_mean = mean(arr_delay),
+ arr_delay_min = min(arr_delay),
+ num_flights = n()
+ ) %>%
+ filter(dest %in% c("ORD", "MDW")) %>%
+ arrange(desc(arr_delay_mean)) %>%
+ collect()
+#> # A tibble: 10 × 6
+#> # Groups: carrier, origin [10]
+#> carrier origin dest arr_delay_mean arr_delay_min num_flights
+#> <chr> <chr> <chr> <dbl> <dbl> <int>
+#> 1 MQ EWR ORD 190. 103 113
+#> 2 9E JFK ORD 185. 134 52
+#> 3 UA LGA ORD 179. 101 157
+#> 4 WN LGA MDW 178. 107 103
+#> 5 AA JFK ORD 178. 133 19
+#> 6 B6 JFK ORD 174. 129 46
+#> 7 WN EWR MDW 167. 107 103
+#> 8 UA EWR ORD 149. 87 189
+#> 9 AA LGA ORD 135. 78 280
+#> 10 EV EWR ORD 35 35 1
+```
+
+And just like that, we've passed data back and forth between Arrow and DuckDB
without having to write a single file to disk!
+
+# Expanded use of ALTREP
+
+We are continuing our use of R’s ALTREP where possible. In 5.0.0 there were a
limited set of circumstances that took advantage of ALTREP, but in 6.0.0 we
have expanded types (to include strings), as well as vectors with `NA`s.
Review comment:
```suggestion
We are continuing our use of R’s
[ALTREP](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) where
possible. In 5.0.0 there were a limited set of circumstances that took
advantage of ALTREP, but in 6.0.0 we have expanded types to include strings, as
well as vectors with `NA`s.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]