jonkeane commented on a change in pull request #158:
URL: https://github.com/apache/arrow-site/pull/158#discussion_r743073339
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset, but 6.0.0 now allows you to aggregate across groups with
`group_by()`. These are usable both with in-memory Arrow tables as well as
across partitioned datasets. As usual, Arrow will read and process data in
chunks and in parallel when possible to produce results much faster than one
could by loading it all into memory then processing and even better, allows for
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this
functionality - for the next release we’ll be looking to profile and optimize
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min(),`
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()`
and `quantile()` with one probability are also supported and currently return
approximate results using the t-digest algorithm.
+
+# Joins
+
+Multiple Arrow tables and datasets can now be joined in queries.
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+ filter(
+ year == 2013,
+ month == 10,
+ day == 9,
+ origin == "JFK",
+ dest == "LAS"
+ ) %>%
+ select(dep_time, arr_time, carrier) %>%
+ left_join(
+ arrow_table(nycflights13::airlines)
+ ) %>%
+ collect()
+```
+
+# Big changes to the execution of queries under-the-hood
+
+Both of the first two points are driven by a large under-the-hood change to
the way that dplyr pipelines are constructed and executed in R. There are
almost no changes (besides new capabilities) that one would run into, but the
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead
you must call `collect()` or `compute()` to evaluate the query. This is inline
with how datasets worked before as well as how a number of other dplyr backends
work.
+the order of dataset queries is no longer deterministic. If you need a stable
sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand
off an Arrow Dataset or query object to duckdb for further querying using the
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as
well as its SQL interface, to aggregate data. Filtering and column projection
done before `to_duckdb()` is evaluated in Arrow. You can also hand off DuckDB
data (or the result of a query) to arrow with `to_arrow()`.
Review comment:
```suggestion
If you have the [duckdb](https://duckdb.org/) package installed, you can
hand off an Arrow Dataset or query object to duckdb for further querying using
the `to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods,
as well as its SQL interface, to aggregate data. DuckDB supports filter
pushdown (e.g. passing a dataset to DuckDB and then running a query with a
`where` clause), so you can take advantage of Arrow Datasets and Arrow-based
filtering optimizations even within a DuckDB query. Filtering and column
projection specified before `to_duckdb()` in a pipeline are evaluated in Arrow,
which can be helpful in some circumstances (e.g. complicated dbplyr pipelines).
You can also hand off DuckDB data (or the result of a query) to arrow with
`to_arrow()`.
```
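To make the ordering in that suggestion concrete, here is a minimal sketch, assuming the `arrow`, `dplyr`, `dbplyr`, and `duckdb` packages are installed. It uses the built-in `mtcars` data rather than anything from the post, and the thresholds are arbitrary illustration values. The steps before `to_duckdb()` are the ones the text says are evaluated in Arrow; the steps after it go through dbplyr and run in DuckDB.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

result <- arrow_table(mtcars) %>%
  # column projection and filtering specified before the hand-off
  select(cyl, mpg, hp) %>%
  filter(hp > 100) %>%
  # hand the query over to DuckDB (requires the duckdb and dbplyr packages)
  to_duckdb() %>%
  # from here on, dbplyr translates the verbs to SQL that DuckDB executes
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  # pull the result into R as a data frame
  collect()
```

Whether a given filter is better placed before or after `to_duckdb()` will depend on the pipeline; the sketch only shows where each step runs.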
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset, but 6.0.0 now allows you to aggregate across groups with
`group_by()`. These are usable both with in-memory Arrow tables as well as
across partitioned datasets. As usual, Arrow will read and process data in
chunks and in parallel when possible to produce results much faster than one
could by loading it all into memory then processing and even better, allows for
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this
functionality - for the next release we’ll be looking to profile and optimize
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min(),`
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()`
and `quantile()` with one probability are also supported and currently return
approximate results using the t-digest algorithm.
+
+# Joins
+
+Multiple Arrow tables and datasets can now be joined in queries.
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+ filter(
+ year == 2013,
+ month == 10,
+ day == 9,
+ origin == "JFK",
+ dest == "LAS"
+ ) %>%
+ select(dep_time, arr_time, carrier) %>%
+ left_join(
+ arrow_table(nycflights13::airlines)
+ ) %>%
+ collect()
+```
+
+# Big changes to the execution of queries under-the-hood
+
+Both of the first two points are driven by a large under-the-hood change to
the way that dplyr pipelines are constructed and executed in R. There are
almost no changes (besides new capabilities) that one would run into, but the
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead
you must call `collect()` or `compute()` to evaluate the query. This is inline
with how datasets worked before as well as how a number of other dplyr backends
work.
+the order of dataset queries is no longer deterministic. If you need a stable
sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand
off an Arrow Dataset or query object to duckdb for further querying using the
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as
well as its SQL interface, to aggregate data. Filtering and column projection
done before `to_duckdb()` is evaluated in Arrow. You can also hand off DuckDB
data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and
want to avoid the worst-of-the-worst delays. To do this, we can use
`percent_rank()`; however that requires a window function which isn’t yet
available in Arrow, so let’s try sending the data to DuckDB to do that, then
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+ select(carrier, origin, dest, arr_delay) %>%
+ # arriving early doesn't matter, so call negative delays 0
+ mutate(arr_delay = pmax(arr_delay, 0)) %>%
+ to_duckdb() %>%
+ # for each carrier-origin-dest, take the worst 5% of delays
+ group_by(carrier, origin, dest) %>%
+ mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+ filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
Review comment:
````suggestion
filter(arr_delay_rank > 0.95)
head(flights_filtered)
```
Now we have the worst-of-the-worst flights stored as a dbplyr lazy `tbl`
backed by our DuckDB connection. This is an example of going from Arrow to
DuckDB.
But we can do more: we can then bring that data back into Arrow just as
easily. For the rest of our analysis, we pick up where we left off with the
`tbl` referring to the DuckDB query:
```r
# pull data back into arrow to complete analysis
flights_filtered %>%
````
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset, but 6.0.0 now allows you to aggregate across groups with
`group_by()`. These are usable both with in-memory Arrow tables as well as
across partitioned datasets. As usual, Arrow will read and process data in
chunks and in parallel when possible to produce results much faster than one
could by loading it all into memory then processing and even better, allows for
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this
functionality - for the next release we’ll be looking to profile and optimize
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min(),`
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()`
and `quantile()` with one probability are also supported and currently return
approximate results using the t-digest algorithm.
+
+# Joins
+
+Multiple Arrow tables and datasets can now be joined in queries.
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+ filter(
+ year == 2013,
+ month == 10,
+ day == 9,
+ origin == "JFK",
+ dest == "LAS"
+ ) %>%
+ select(dep_time, arr_time, carrier) %>%
+ left_join(
+ arrow_table(nycflights13::airlines)
+ ) %>%
+ collect()
+```
+
+# Big changes to the execution of queries under-the-hood
+
+Both of the first two points are driven by a large under-the-hood change to
the way that dplyr pipelines are constructed and executed in R. There are
almost no changes (besides new capabilities) that one would run into, but the
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead
you must call `collect()` or `compute()` to evaluate the query. This is inline
with how datasets worked before as well as how a number of other dplyr backends
work.
+the order of dataset queries is no longer deterministic. If you need a stable
sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand
off an Arrow Dataset or query object to duckdb for further querying using the
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as
well as its SQL interface, to aggregate data. Filtering and column projection
done before `to_duckdb()` is evaluated in Arrow. You can also hand off DuckDB
data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and
want to avoid the worst-of-the-worst delays. To do this, we can use
`percent_rank()`; however that requires a window function which isn’t yet
available in Arrow, so let’s try sending the data to DuckDB to do that, then
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+ select(carrier, origin, dest, arr_delay) %>%
+ # arriving early doesn't matter, so call negative delays 0
+ mutate(arr_delay = pmax(arr_delay, 0)) %>%
+ to_duckdb() %>%
+ # for each carrier-origin-dest, take the worst 5% of delays
+ group_by(carrier, origin, dest) %>%
+ mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+ filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+ to_arrow() %>%
+ # now summarise to get mean/min
+ group_by(carrier, origin, dest) %>%
+ summarise(arr_delay_mean = mean(arr_delay), arr_delay_min = min(arr_delay),
num_flights = n()) %>%
+ filter(dest %in% c("ORD", "MDW")) %>%
+ arrange(desc(arr_delay_mean)) %>%
+ collect()
+```
Review comment:
````suggestion
```
And just like that we've passed data back and forth between Arrow and DuckDB
without having to write a single file to disk!
````
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset, but 6.0.0 now allows you to aggregate across groups with
`group_by()`. These are usable both with in-memory Arrow tables as well as
across partitioned datasets. As usual, Arrow will read and process data in
chunks and in parallel when possible to produce results much faster than one
could by loading it all into memory then processing and even better, allows for
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this
functionality - for the next release we’ll be looking to profile and optimize
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min(),`
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()`
and `quantile()` with one probability are also supported and currently return
approximate results using the t-digest algorithm.
+
+# Joins
+
+Multiple Arrow tables and datasets can now be joined in queries.
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+ filter(
+ year == 2013,
+ month == 10,
+ day == 9,
+ origin == "JFK",
+ dest == "LAS"
+ ) %>%
+ select(dep_time, arr_time, carrier) %>%
+ left_join(
+ arrow_table(nycflights13::airlines)
+ ) %>%
+ collect()
+```
+
+# Big changes to the execution of queries under-the-hood
+
+Both of the first two points are driven by a large under-the-hood change to
the way that dplyr pipelines are constructed and executed in R. There are
almost no changes (besides new capabilities) that one would run into, but the
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead
you must call `collect()` or `compute()` to evaluate the query. This is inline
with how datasets worked before as well as how a number of other dplyr backends
work.
+the order of dataset queries is no longer deterministic. If you need a stable
sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand
off an Arrow Dataset or query object to duckdb for further querying using the
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as
well as its SQL interface, to aggregate data. Filtering and column projection
done before `to_duckdb()` is evaluated in Arrow. You can also hand off DuckDB
data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and
want to avoid the worst-of-the-worst delays. To do this, we can use
`percent_rank()`; however that requires a window function which isn’t yet
available in Arrow, so let’s try sending the data to DuckDB to do that, then
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+ select(carrier, origin, dest, arr_delay) %>%
+ # arriving early doesn't matter, so call negative delays 0
+ mutate(arr_delay = pmax(arr_delay, 0)) %>%
+ to_duckdb() %>%
+ # for each carrier-origin-dest, take the worst 5% of delays
+ group_by(carrier, origin, dest) %>%
+ mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+ filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+ to_arrow() %>%
+ # now summarise to get mean/min
+ group_by(carrier, origin, dest) %>%
+ summarise(arr_delay_mean = mean(arr_delay), arr_delay_min = min(arr_delay),
num_flights = n()) %>%
+ filter(dest %in% c("ORD", "MDW")) %>%
+ arrange(desc(arr_delay_mean)) %>%
+ collect()
+```
Review comment:
````suggestion
```
And just like that we've passed data back and forth between Arrow and DuckDB
without having to write a single file to disk!
````
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset, but 6.0.0 now allows you to aggregate across groups with
`group_by()`. These are usable both with in-memory Arrow tables as well as
across partitioned datasets. As usual, Arrow will read and process data in
chunks and in parallel when possible to produce results much faster than one
could by loading it all into memory then processing and even better, allows for
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this
functionality - for the next release we’ll be looking to profile and optimize
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min(),`
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()`
and `quantile()` with one probability are also supported and currently return
approximate results using the t-digest algorithm.
+
+# Joins
+
+Multiple Arrow tables and datasets can now be joined in queries.
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+ filter(
+ year == 2013,
+ month == 10,
+ day == 9,
+ origin == "JFK",
+ dest == "LAS"
+ ) %>%
+ select(dep_time, arr_time, carrier) %>%
+ left_join(
+ arrow_table(nycflights13::airlines)
+ ) %>%
+ collect()
+```
+
+# Big changes to the execution of queries under-the-hood
+
+Both of the first two points are driven by a large under-the-hood change to
the way that dplyr pipelines are constructed and executed in R. There are
almost no changes (besides new capabilities) that one would run into, but the
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead
you must call `collect()` or `compute()` to evaluate the query. This is inline
with how datasets worked before as well as how a number of other dplyr backends
work.
+the order of dataset queries is no longer deterministic. If you need a stable
sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand
off an Arrow Dataset or query object to duckdb for further querying using the
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as
well as its SQL interface, to aggregate data. Filtering and column projection
done before `to_duckdb()` is evaluated in Arrow. You can also hand off DuckDB
data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and
want to avoid the worst-of-the-worst delays. To do this, we can use
`percent_rank()`; however that requires a window function which isn’t yet
available in Arrow, so let’s try sending the data to DuckDB to do that, then
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+ select(carrier, origin, dest, arr_delay) %>%
+ # arriving early doesn't matter, so call negative delays 0
+ mutate(arr_delay = pmax(arr_delay, 0)) %>%
+ to_duckdb() %>%
+ # for each carrier-origin-dest, take the worst 5% of delays
+ group_by(carrier, origin, dest) %>%
+ mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+ filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
Review comment:
````suggestion
filter(arr_delay_rank > 0.95)
head(flights_filtered)
```
Now we have the worst-of-the-worst flights stored as a dbplyr lazy `tbl`
backed by our DuckDB connection. This is an example of going from Arrow to
DuckDB.
But we can do more: we can then bring that data back into Arrow just as
easily. For the rest of our analysis, we pick up where we left off with the
`tbl` referring to the DuckDB query:
```r
# pull data back into arrow to complete analysis
flights_filtered %>%
````
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,140 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset, but 6.0.0 now allows you to aggregate across groups with
`group_by()`. These are usable both with in-memory Arrow tables as well as
across partitioned datasets. As usual, Arrow will read and process data in
chunks and in parallel when possible to produce results much faster than one
could by loading it all into memory then processing and even better, allows for
operations that wouldn’t fit into memory on a single machine.
Review comment:
Oops, re-reading the paragraph in the post and not just the comments
(sorry about not doing that before!), @thisisnic is absolutely right here.
Here's [`summarize` from the 5.0.0 release
tag](https://github.com/apache/arrow/blob/release-5.0.0/r/R/dplyr-summarize.R#L21-L36).
I think we can close this comment.
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset, but 6.0.0 now allows you to aggregate across groups with
`group_by()`. These are usable both with in-memory Arrow tables as well as
across partitioned datasets. As usual, Arrow will read and process data in
chunks and in parallel when possible to produce results much faster than one
could by loading it all into memory then processing and even better, allows for
operations that wouldn’t fit into memory on a single machine.
Review comment:
```suggestion
Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset, but 6.0.0 now allows you to aggregate across groups with
`group_by()` (a workflow we know people have been waiting for and asking
about!). These are usable both with in-memory Arrow tables as well as across
partitioned datasets. As usual, Arrow will read and process data in chunks and
in parallel when possible to produce results much faster than one could by
loading it all into memory then processing and even better, allows for
operations that wouldn’t fit into memory on a single machine.
```
We could also add something like this to emphasize that we know (and
knew) that `group_by() %>% summarise()` is the workflow people care about.
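If a short example would help make the point, something like the following could work. It is only a sketch, assuming the `arrow` and `dplyr` packages are installed; it uses the built-in `mtcars` data, and the column names `mean_mpg` and `num_cars` are made up for illustration.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

mpg_by_cyl <- arrow_table(mtcars) %>%
  # grouped aggregation on an in-memory Arrow table
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg),
    num_cars = n()
  ) %>%
  # row order is not guaranteed, so sort explicitly
  arrange(cyl) %>%
  # the query is lazy; collect() evaluates it and returns a data frame
  collect()
```

The same `group_by() %>% summarise()` pipeline should apply to a partitioned dataset opened with `open_dataset()`, per the paragraph above.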
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset, but 6.0.0 now allows you to aggregate across groups with
`group_by()`. These are usable both with in-memory Arrow tables as well as
across partitioned datasets. As usual, Arrow will read and process data in
chunks and in parallel when possible to produce results much faster than one
could by loading it all into memory then processing and even better, allows for
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this
functionality - for the next release we’ll be looking to profile and optimize
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min(),`
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()`
and `quantile()` with one probability are also supported and currently return
approximate results using the t-digest algorithm.
+
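+As a minimal sketch of the grouped aggregation described above (assuming the
`nycflights13` package is installed; the column names `num_flights`,
`mean_arr_delay` and `max_arr_delay` are illustrative, not part of the release):
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+# Count flights and summarise arrival delays per carrier; the aggregation
+# is evaluated by Arrow and only the result is pulled into R by collect()
+delays_by_carrier <- arrow_table(nycflights13::flights) %>%
+  group_by(carrier) %>%
+  summarise(
+    num_flights = n(),
+    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
+    max_arr_delay = max(arr_delay, na.rm = TRUE)
+  ) %>%
+  # sort explicitly so the row order is stable
+  arrange(carrier) %>%
+  collect()
+```
+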
+# Joins
+
+Multiple Arrow tables and datasets can now be joined in queries.
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
+9th October 2013, with the full name of the airline included.
+
+```r
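+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+  ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  # join on the shared carrier code to pick up the full airline name
+  left_join(
+    arrow_table(nycflights13::airlines),
+    by = "carrier"
+  ) %>%
+  collect()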
+```
+
+# Big changes to the execution of queries under-the-hood
+
+Both of the first two points are driven by a large under-the-hood change to
the way that dplyr pipelines are constructed and executed in R. There are
almost no changes (besides new capabilities) that one would run into, but the
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+
+* `summarise()` on Arrow tables is no longer evaluated eagerly; you must call
`collect()` or `compute()` to evaluate the query. This is in line with how
datasets already worked, as well as with a number of other dplyr backends.
+* The order of rows returned by dataset queries is no longer deterministic. If
you need a stable sort order, use `arrange()` at the end of your query.
+
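+A small sketch of what lazy evaluation looks like in practice, assuming the
`nycflights13` package is installed (the object names `query`, `result_table`
and `result_df` are illustrative):
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+# Building the pipeline does not run it: `query` is a query object,
+# not a data frame
+query <- arrow_table(nycflights13::flights) %>%
+  group_by(origin) %>%
+  summarise(num_flights = n())
+
+# compute() evaluates the query and keeps the result as an Arrow Table
+result_table <- query %>%
+  arrange(origin) %>%
+  compute()
+
+# collect() evaluates the query and returns an R data frame;
+# arrange() at the end gives a stable row order
+result_df <- query %>%
+  arrange(origin) %>%
+  collect()
+```
+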
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand
off an Arrow Dataset or query object to duckdb for further querying using the
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as
well as its SQL interface, to aggregate data. Filtering and column projection
done before `to_duckdb()` is evaluated in Arrow. You can also hand off DuckDB
data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and
want to avoid the worst-of-the-worst delays. To do this, we can use
`percent_rank()`; however that requires a window function which isn’t yet
available in Arrow, so let’s try sending the data to DuckDB to do that, then
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+  select(carrier, origin, dest, arr_delay) %>%
+  # arriving early doesn't matter, so call negative delays 0
+  mutate(arr_delay = pmax(arr_delay, 0)) %>%
+  to_duckdb() %>%
+  # for each carrier-origin-dest, take the worst 5% of delays
+  group_by(carrier, origin, dest) %>%
+  mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+  filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+  to_arrow() %>%
+  # now summarise to get mean/min
+  group_by(carrier, origin, dest) %>%
+  summarise(
+    arr_delay_mean = mean(arr_delay),
+    arr_delay_min = min(arr_delay),
+    num_flights = n()
+  ) %>%
+  filter(dest %in% c("ORD", "MDW")) %>%
+  arrange(desc(arr_delay_mean)) %>%
+  collect()
+```
+# Expanded use of altrep
+
+We are continuing our use of R’s altrep where possible. In 5.0.0, only a
limited set of circumstances took advantage of altrep, but in 6.0.0 we have
expanded the supported types (to include strings) and now also cover vectors
containing `NA`s.
+
+```r
+library(microbenchmark)
+library(arrow)
+
+tbl <-
+  arrow_table(data.frame(
+    x = rnorm(10000000),
+    y = sample(c(letters, NA), 10000000, replace = TRUE)
+  ))
+
+with_altrep <- function(data){
+  options(arrow.use_altrep = TRUE)
+  as.data.frame(data)
+}
+
+without_altrep <- function(data){
+  options(arrow.use_altrep = FALSE)
+  as.data.frame(data)
+}
+
+microbenchmark(
+  without_altrep(tbl),
+  with_altrep(tbl)
+)
+
+# Unit: milliseconds
+#                 expr      min        lq      mean    median        uq      max neval
+#  without_altrep(tbl) 191.0788 213.82235 249.65076 225.52120 244.26977 512.1652   100
+#     with_altrep(tbl)  48.7152  50.97269  65.56832  52.93795  55.24505 338.4602   100
+
+```
+# Over 30 new compute functions
+
+There are over 30 new compute functions available in this release, including
string functions `str_to_lower()`, `str_to_upper()`, `str_to_title()`,
`startsWith()`, `endsWith()`, `str_starts()`, and `str_ends()`, date and time
functions `strftime()`, `format_ISO8601()`, `is_timestamp()` and others.
+
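+As a brief, hedged illustration of two of the string functions listed above
used inside a query (assuming the `nycflights13` and `stringr` packages are
installed; the column name `name_upper` is illustrative):
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+library(stringr)
+
+# Keep airlines whose name starts with "A" and add an upper-cased copy
+# of the name
+airlines_a <- arrow_table(nycflights13::airlines) %>%
+  filter(str_starts(name, "A")) %>%
+  mutate(name_upper = str_to_upper(name)) %>%
+  collect()
+```
+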
+# Can now install in offline mode on Linux
+
+For folks who need to install Arrow on an airgapped server, we have included
helper functions and installation options that make it easier to download a
fat-source of the arrow package that includes both the Arrow source as well as
third-party dependencies that are needed when building Arrow.
+
+The helper function `create_package_with_all_dependencies()` can be run from a
computer that does have access to the internet and it will create a fat-source
package which can then be transferred and installed on a server without
connectivity. This helper is also available on GitHub without installing the
arrow package. For more installation details, [see the
docs](https://arrow.apache.org/docs/r/articles/install.html#offline-installation).
Review comment:
```suggestion
The helper function `create_package_with_all_dependencies()` can be run from
a computer that does have access to the internet and it will create a
fat-source package which can then be transferred and installed on a server
without connectivity. This helper is also available on GitHub without
installing the arrow package. For more installation details, [see the
docs](https://arrow.apache.org/docs/r/articles/install.html#offline-installation).
Special thanks to Karl Dunkle Werner for the PRs that made this possible. Karl
has also been added as a contributor in recognition of this and a number of
earlier contributions. Thank you, Karl!
```
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,183 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We are happy to announce the recent release of version 6.0.0 of Arrow on CRAN,
+and in this blog post we highlight the main updates in this version. A big
+thanks goes to Dragos Moldovan-Grünfeld, Percy Camilo Triveño Aucahuasi,
Review comment:
```suggestion
thanks goes to Dragoș Moldovan-Grünfeld, Percy Camilo Triveño Aucahuasi,
```
##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,183 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author:
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We are happy to announce the recent release of version 6.0.0 of Arrow on CRAN,
+and in this blog post we highlight the main updates in this version. A big
+thanks goes to Dragos Moldovan-Grünfeld, Percy Camilo Triveño Aucahuasi,
+Dewey Dunnington, Matt Peterson, and Phillip Cloud, who, in this release, made
+their first contributions to the R package.
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>%
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a
whole dataset, but 6.0.0 now allows you to aggregate across groups with
`group_by()` (a workflow we know people have been waiting for and asking
about!). These are usable both with in-memory Arrow tables as well as across
partitioned datasets. As usual, Arrow will read and process data in chunks and
in parallel when possible to produce results much faster than one could by
loading it all into memory then processing, and even better, allows for
operations that wouldn’t fit into memory on a single machine.
+
+The focus of this release has been on the initial implementation of this
functionality - for the next release, we’ll be looking to profile and optimize
to enhance performance.
Review comment:
```suggestion
The focus of this release has been on the initial implementation of this
functionality - for the next release, we’ll be looking to profile and optimize
to enhance performance. Connected with that, much of this functionality is
still very new and slightly experimental (for example, it's not even wired up
in the pyarrow package yet!). We are excited to have people try this out; if
you run into any issues at all, please [let us
know](https://issues.apache.org/jira/browse/ARROW) so that we can improve these
for our next release.
```