[GitHub] [arrow-site] jonkeane commented on a change in pull request #158: R blog post

GitBox Fri, 05 Nov 2021 12:28:28 -0700


jonkeane commented on a change in pull request #158:
URL: https://github.com/apache/arrow-site/pull/158#discussion_r743073339




##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% 
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a 
whole dataset, but 6.0.0 now allows you to aggregate across groups with 
`group_by()`. These are usable both with in-memory Arrow tables as well as 
across partitioned datasets. As usual, Arrow will read and process data in 
chunks and in parallel when possible to produce results much faster than one 
could by loading it all into memory then processing and even better, allows for 
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this 
functionality - for the next release we’ll be looking to profile and optimize 
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min()`, 
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` 
and `quantile()` with one probability are also supported and currently return 
approximate results using the t-digest algorithm.
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+    ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  left_join(
+    arrow_table(nycflights13::airlines)
+   ) %>%
+  collect()
+```
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to 
the way that dplyr pipelines are constructed and executed in R. There are 
almost no changes (besides new capabilities) that one would run into, but the 
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead 
you must call `collect()` or `compute()` to evaluate the query. This is in line 
with how datasets worked before as well as how a number of other dplyr backends 
work.
+the order of dataset queries is no longer deterministic. If you need a stable 
sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand 
off an Arrow Dataset or query object to duckdb for further querying using the 
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as 
well as its SQL interface, to aggregate data. Filtering and column projection 
done before `to_duckdb()` is evaluated in Arrow.  You can also hand off DuckDB 
data (or the result of a query) to arrow with `to_arrow()`.

Review comment:
       ```suggestion
   If you have the [duckdb](https://duckdb.org/) package installed, you can 
hand off an Arrow Dataset or query object to duckdb for further querying using 
the `to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, 
as well as its SQL interface, to aggregate data. DuckDB supports filter 
pushdown (e.g. passing a dataset to DuckDB and then running a query with a 
`where` clause), so you can take advantage of Arrow Datasets and Arrow-based 
filtering optimizations even within a DuckDB query. Filtering and column 
projection specified before `to_duckdb()` in a pipeline are evaluated in Arrow, 
which can be helpful in some circumstances (e.g. complicated dbplyr pipelines). 
You can also hand off DuckDB data (or the result of a query) to Arrow with 
`to_arrow()`.
   ```
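   
   To make the split between Arrow-side and DuckDB-side work concrete, here is 
a minimal sketch. It assumes the arrow (>= 6.0.0), dplyr, duckdb, and dbplyr 
packages are installed; the built-in `mtcars` data and the name `six_cyl` are 
illustrative only and do not come from the post. Whether the `filter()` is 
actually pushed down into the Arrow scan is up to DuckDB's optimizer.
   
```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

six_cyl <- arrow_table(mtcars) %>%
  # column projection before to_duckdb() is evaluated in Arrow
  select(mpg, cyl) %>%
  # hand the query off to DuckDB; nothing is written to disk
  to_duckdb() %>%
  # dbplyr translates this into a SQL WHERE clause that DuckDB runs
  filter(cyl == 6) %>%
  # materialize the result as a tibble in R
  collect()
```
   
   Here `select()` runs in Arrow and `filter()` is handed to DuckDB; swapping 
their order would move the projection into DuckDB as well.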

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% 
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a 
whole dataset, but 6.0.0 now allows you to aggregate across groups with 
`group_by()`. These are usable both with in-memory Arrow tables as well as 
across partitioned datasets. As usual, Arrow will read and process data in 
chunks and in parallel when possible to produce results much faster than one 
could by loading it all into memory then processing and even better, allows for 
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this 
functionality - for the next release we’ll be looking to profile and optimize 
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min()`, 
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` 
and `quantile()` with one probability are also supported and currently return 
approximate results using the t-digest algorithm.
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+    ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  left_join(
+    arrow_table(nycflights13::airlines)
+   ) %>%
+  collect()
+```
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to 
the way that dplyr pipelines are constructed and executed in R. There are 
almost no changes (besides new capabilities) that one would run into, but the 
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead 
you must call `collect()` or `compute()` to evaluate the query. This is in line 
with how datasets worked before as well as how a number of other dplyr backends 
work.
+the order of dataset queries is no longer deterministic. If you need a stable 
sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand 
off an Arrow Dataset or query object to duckdb for further querying using the 
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as 
well as its SQL interface, to aggregate data. Filtering and column projection 
done before `to_duckdb()` is evaluated in Arrow.  You can also hand off DuckDB 
data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and 
want to avoid the worst-of-the-worst delays. To do this, we can use 
`percent_rank()`; however that requires a window function which isn’t yet 
available in Arrow, so let’s try sending the data to DuckDB to do that, then 
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+ 
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+  select(carrier, origin, dest, arr_delay) %>%
+  # arriving early doesn't matter, so call negative delays 0
+  mutate(arr_delay = pmax(arr_delay, 0)) %>%
+  to_duckdb() %>%
+  # for each carrier-origin-dest, take the worst 5% of delays
+  group_by(carrier, origin, dest) %>%
+  mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+  filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%

Review comment:
       ```suggestion
     filter(arr_delay_rank > 0.95)
   
   head(flights_filtered)
   ```
   
   Now the worst-of-the-worst flights are stored as a dbplyr lazy `tbl` backed 
by our DuckDB connection. This is an example of going from Arrow to DuckDB.
   
   But we can do more: we can then bring that data back into Arrow just as 
easily. For the rest of our analysis, we pick up where we left off with the 
`tbl` referring to the DuckDB query:
   
   ```r
   # pull data back into arrow to complete analysis
   flights_filtered %>%
   ```

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% 
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a 
whole dataset, but 6.0.0 now allows you to aggregate across groups with 
`group_by()`. These are usable both with in-memory Arrow tables as well as 
across partitioned datasets. As usual, Arrow will read and process data in 
chunks and in parallel when possible to produce results much faster than one 
could by loading it all into memory then processing and even better, allows for 
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this 
functionality - for the next release we’ll be looking to profile and optimize 
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min()`, 
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` 
and `quantile()` with one probability are also supported and currently return 
approximate results using the t-digest algorithm.
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+    ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  left_join(
+    arrow_table(nycflights13::airlines)
+   ) %>%
+  collect()
+```
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to 
the way that dplyr pipelines are constructed and executed in R. There are 
almost no changes (besides new capabilities) that one would run into, but the 
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead 
you must call `collect()` or `compute()` to evaluate the query. This is in line 
with how datasets worked before as well as how a number of other dplyr backends 
work.
+the order of dataset queries is no longer deterministic. If you need a stable 
sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand 
off an Arrow Dataset or query object to duckdb for further querying using the 
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as 
well as its SQL interface, to aggregate data. Filtering and column projection 
done before `to_duckdb()` is evaluated in Arrow.  You can also hand off DuckDB 
data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and 
want to avoid the worst-of-the-worst delays. To do this, we can use 
`percent_rank()`; however that requires a window function which isn’t yet 
available in Arrow, so let’s try sending the data to DuckDB to do that, then 
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+ 
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+  select(carrier, origin, dest, arr_delay) %>%
+  # arriving early doesn't matter, so call negative delays 0
+  mutate(arr_delay = pmax(arr_delay, 0)) %>%
+  to_duckdb() %>%
+  # for each carrier-origin-dest, take the worst 5% of delays
+  group_by(carrier, origin, dest) %>%
+  mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+  filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+  to_arrow() %>%
+  # now summarise to get mean/min
+  group_by(carrier, origin, dest) %>%
+  summarise(arr_delay_mean = mean(arr_delay), arr_delay_min = min(arr_delay), 
num_flights = n()) %>%
+  filter(dest %in% c("ORD", "MDW")) %>%
+  arrange(desc(arr_delay_mean)) %>%
+  collect()
+```

Review comment:
       ```suggestion
   ```
   
   And just like that we've passed data back and forth between Arrow and DuckDB 
without having to write a single file to disk!
   
   ```

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% 
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a 
whole dataset, but 6.0.0 now allows you to aggregate across groups with 
`group_by()`. These are usable both with in-memory Arrow tables as well as 
across partitioned datasets. As usual, Arrow will read and process data in 
chunks and in parallel when possible to produce results much faster than one 
could by loading it all into memory then processing and even better, allows for 
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this 
functionality - for the next release we’ll be looking to profile and optimize 
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min()`, 
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` 
and `quantile()` with one probability are also supported and currently return 
approximate results using the t-digest algorithm.
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+    ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  left_join(
+    arrow_table(nycflights13::airlines)
+   ) %>%
+  collect()
+```
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to 
the way that dplyr pipelines are constructed and executed in R. There are 
almost no changes (besides new capabilities) that one would run into, but the 
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead 
you must call `collect()` or `compute()` to evaluate the query. This is in line 
with how datasets worked before as well as how a number of other dplyr backends 
work.
+the order of dataset queries is no longer deterministic. If you need a stable 
sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand 
off an Arrow Dataset or query object to duckdb for further querying using the 
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as 
well as its SQL interface, to aggregate data. Filtering and column projection 
done before `to_duckdb()` is evaluated in Arrow.  You can also hand off DuckDB 
data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and 
want to avoid the worst-of-the-worst delays. To do this, we can use 
`percent_rank()`; however that requires a window function which isn’t yet 
available in Arrow, so let’s try sending the data to DuckDB to do that, then 
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+ 
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+  select(carrier, origin, dest, arr_delay) %>%
+  # arriving early doesn't matter, so call negative delays 0
+  mutate(arr_delay = pmax(arr_delay, 0)) %>%
+  to_duckdb() %>%
+  # for each carrier-origin-dest, take the worst 5% of delays
+  group_by(carrier, origin, dest) %>%
+  mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+  filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+  to_arrow() %>%
+  # now summarise to get mean/min
+  group_by(carrier, origin, dest) %>%
+  summarise(arr_delay_mean = mean(arr_delay), arr_delay_min = min(arr_delay), 
num_flights = n()) %>%
+  filter(dest %in% c("ORD", "MDW")) %>%
+  arrange(desc(arr_delay_mean)) %>%
+  collect()
+```

Review comment:
       ````suggestion
   ```
   
   And just like that we've passed data back and forth between Arrow and DuckDB 
without having to write a single file to disk!
   
   ````

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% 
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a 
whole dataset, but 6.0.0 now allows you to aggregate across groups with 
`group_by()`. These are usable both with in-memory Arrow tables as well as 
across partitioned datasets. As usual, Arrow will read and process data in 
chunks and in parallel when possible to produce results much faster than one 
could by loading it all into memory then processing and even better, allows for 
operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this 
functionality - for the next release we’ll be looking to profile and optimize 
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min()`, 
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` 
and `quantile()` with one probability are also supported and currently return 
approximate results using the t-digest algorithm.
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+    ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  left_join(
+    arrow_table(nycflights13::airlines)
+   ) %>%
+  collect()
+```
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to 
the way that dplyr pipelines are constructed and executed in R. There are 
almost no changes (besides new capabilities) that one would run into, but the 
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead 
you must call `collect()` or `compute()` to evaluate the query. This is in line 
with how datasets worked before as well as how a number of other dplyr backends 
work.
+the order of dataset queries is no longer deterministic. If you need a stable 
sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand 
off an Arrow Dataset or query object to duckdb for further querying using the 
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as 
well as its SQL interface, to aggregate data. Filtering and column projection 
done before `to_duckdb()` is evaluated in Arrow.  You can also hand off DuckDB 
data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and 
want to avoid the worst-of-the-worst delays. To do this, we can use 
`percent_rank()`; however that requires a window function which isn’t yet 
available in Arrow, so let’s try sending the data to DuckDB to do that, then 
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+ 
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+  select(carrier, origin, dest, arr_delay) %>%
+  # arriving early doesn't matter, so call negative delays 0
+  mutate(arr_delay = pmax(arr_delay, 0)) %>%
+  to_duckdb() %>%
+  # for each carrier-origin-dest, take the worst 5% of delays
+  group_by(carrier, origin, dest) %>%
+  mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+  filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%

Review comment:
       ````suggestion
     filter(arr_delay_rank > 0.95)
   
   head(flights_filtered)
   ```
   
   Now the worst-of-the-worst flights are stored as a dbplyr lazy `tbl` backed 
by our DuckDB connection. This is an example of going from Arrow to DuckDB.
   
   But we can do more: we can then bring that data back into Arrow just as 
easily. For the rest of our analysis, we pick up where we left off with the 
`tbl` referring to the DuckDB query:
   
   ```r
   # pull data back into arrow to complete analysis
   flights_filtered %>%
   ````

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,140 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% 
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a 
whole dataset, but 6.0.0 now allows you to aggregate across groups with 
`group_by()`. These are usable both with in-memory Arrow tables as well as 
across partitioned datasets. As usual, Arrow will read and process data in 
chunks and in parallel when possible to produce results much faster than one 
could by loading it all into memory then processing and even better, allows for 
operations that wouldn’t fit into memory on a single machine.

Review comment:
       Oops, re-reading the paragraph in the post and not just the comments 
(sorry about not doing that before!), @thisisnic is absolutely right here. 
Here's [`summarize` from the 5.0.0 release 
tag](https://github.com/apache/arrow/blob/release-5.0.0/r/R/dplyr-summarize.R#L21-L36).
 
   
   I think we can close this comment.

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% 
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a 
whole dataset, but 6.0.0 now allows you to aggregate across groups with 
`group_by()`. These are usable both with in-memory Arrow tables as well as 
across partitioned datasets. As usual, Arrow will read and process data in 
chunks and in parallel when possible to produce results much faster than one 
could by loading it all into memory then processing and even better, allows for 
operations that wouldn’t fit into memory on a single machine.

Review comment:
       ```suggestion
   Aggregations can now be made across groups using dplyr’s `group_by() %>% 
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a 
whole dataset, but 6.0.0 now allows you to aggregate across groups with 
`group_by()` (a workflow we know people have been waiting for and asking 
about!). These are usable both with in-memory Arrow tables as well as across 
partitioned datasets. As usual, Arrow will read and process data in chunks and 
in parallel when possible to produce results much faster than one could by 
loading it all into memory then processing and even better, allows for 
operations that wouldn’t fit into memory on a single machine.
   ```
   
   We could also add something like this to emphasize that we know (and 
knew) that `group_by() %>% summarise()` is the workflow people care about.
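   
   For readers who have not seen the workflow, a short sketch may help. It 
assumes arrow (>= 6.0.0) and dplyr are installed; the built-in `mtcars` data 
and the names `by_cyl`, `mean_mpg`, and `n_cars` are illustrative and not from 
the post.
   
```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

by_cyl <- arrow_table(mtcars) %>%
  # grouping is recorded on the lazy Arrow query
  group_by(cyl) %>%
  # mean() and n() are among the aggregation functions listed in the post
  summarise(mean_mpg = mean(mpg), n_cars = n()) %>%
  # row order is not guaranteed, so sort explicitly
  arrange(cyl) %>%
  # nothing is evaluated until collect() is called
  collect()
```
   
   The explicit `arrange()` is there because the post notes that query results 
no longer come back in a deterministic order.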

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,139 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% 
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a 
whole dataset, but 6.0.0 now allows you to aggregate across groups with 
`group_by()`. These are usable both with in-memory Arrow tables as well as 
across partitioned datasets. As usual, Arrow will read and process data in 
chunks and in parallel when possible to produce results much faster than one 
could by loading it all into memory then processing, and even better, allows for 
operations that wouldn’t fit into memory on a single machine.
+
+The focus of this release has been on the initial implementation of this 
functionality - for the next release we’ll be looking to profile and optimize 
to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min()`, 
`max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` 
and `quantile()` with one probability are also supported and currently return 
approximate results using the t-digest algorithm.
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
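+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+  ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  # join on the shared `carrier` column to bring in the full airline name
+  left_join(
+    arrow_table(nycflights13::airlines),
+    by = "carrier"
+  ) %>%
+  collect()
+```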
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to 
the way that dplyr pipelines are constructed and executed in R. There are 
almost no changes (besides new capabilities) that one would run into, but the 
improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+
+* `summarise()` no longer evaluates eagerly when used with Arrow tables; 
instead, you must call `collect()` or `compute()` to evaluate the query. This 
is in line with how datasets worked before, as well as how a number of other 
dplyr backends work.
+* The row order of dataset query results is no longer deterministic. If you 
need a stable sort order, use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand 
off an Arrow Dataset or query object to duckdb for further querying using the 
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as 
well as its SQL interface, to aggregate data. Filtering and column projection 
done before `to_duckdb()` is evaluated in Arrow.  You can also hand off DuckDB 
data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and 
want to avoid the worst-of-the-worst delays. To do this, we can use 
`percent_rank()`; however that requires a window function which isn’t yet 
available in Arrow, so let’s try sending the data to DuckDB to do that, then 
pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+ 
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+  select(carrier, origin, dest, arr_delay) %>%
+  # arriving early doesn't matter, so call negative delays 0
+  mutate(arr_delay = pmax(arr_delay, 0)) %>%
+  to_duckdb() %>%
+  # for each carrier-origin-dest, take the worst 5% of delays
+  group_by(carrier, origin, dest) %>%
+  mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+  filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+  to_arrow() %>%
+  # now summarise to get mean/min
+  group_by(carrier, origin, dest) %>%
+  summarise(
+    arr_delay_mean = mean(arr_delay),
+    arr_delay_min = min(arr_delay),
+    num_flights = n()
+  ) %>%
+  filter(dest %in% c("ORD", "MDW")) %>%
+  arrange(desc(arr_delay_mean)) %>%
+  collect()
+```
+# Expanded use of altrep
+
+We are continuing our use of R’s altrep where possible. In 5.0.0, only a 
limited set of circumstances took advantage of altrep; in 6.0.0, we have 
expanded the supported types to include strings, as well as vectors with `NA`s. 
+
+```r
+library(microbenchmark)
+library(arrow)
+
+tbl <-
+  arrow_table(data.frame(
+    x = rnorm(10000000),
+    y = sample(c(letters, NA), 10000000, replace = TRUE)
+  ))
+
+with_altrep <- function(data){
+  options(arrow.use_altrep = TRUE)
+  as.data.frame(data)  
+}
+
+without_altrep <- function(data){
+  options(arrow.use_altrep = FALSE)
+  as.data.frame(data)  
+}
+
+microbenchmark(
+  without_altrep(tbl),
+  with_altrep(tbl)
+)
+
+# Unit: milliseconds
+#                 expr      min        lq      mean    median        uq      max neval
+#  without_altrep(tbl) 191.0788 213.82235 249.65076 225.52120 244.26977 512.1652   100
+#     with_altrep(tbl)  48.7152  50.97269  65.56832  52.93795  55.24505 338.4602   100
+
+```
+# Over 30 new compute functions
+
+There are over 30 new compute functions available in this release, including 
string functions `str_to_lower()`, `str_to_upper()`, `str_to_title()`, 
`startsWith()`, `endsWith()`, `str_starts()`, and `str_ends()`, date and time 
functions `strftime()`, `format_ISO8601()`, `is_timestamp()` and others.
+
+# Can now install in offline mode on Linux
+
+For folks who need to install Arrow on an airgapped server, we have included 
helper functions and installation options that make it easier to download a 
fat-source of the arrow package that includes both the Arrow source as well as 
third-party dependencies that are needed when building Arrow.
+
+The helper function `create_package_with_all_dependencies()` can be run from a 
computer that does have access to the internet and it will create a fat-source 
package which can then be transferred and installed on a server without 
connectivity. This helper is also available on GitHub without installing the 
arrow package. For more on installation, [see the 
docs](https://arrow.apache.org/docs/r/articles/install.html#offline-installation).

Review comment:
       ```suggestion
   The helper function `create_package_with_all_dependencies()` can be run from 
a computer that does have access to the internet and it will create a 
fat-source package which can then be transferred and installed on a server 
without connectivity. This helper is also available on GitHub without 
installing the arrow package. For more on installation, [see the 
docs](https://arrow.apache.org/docs/r/articles/install.html#offline-installation).
   
   Special thanks to Karl Dunkle Werner for the PRs that made this possible. Karl 
has also been added as a contributor in recognition of this contribution, along 
with a number of earlier ones. Thank you, Karl!
   ```
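
To make the two-step flow concrete, a rough sketch of how it might be wrapped up follows. The wrapper names `build_offline_arrow()` and `install_offline_arrow()` and the file name `arrow_with_deps.tar.gz` are hypothetical and not part of the arrow package. The sketch also assumes that arrow's R package dependencies are already installed on the offline server.

```r
# Hypothetical wrapper: run on a machine WITH internet access.
# Downloads the third-party dependencies and bundles them with the arrow source.
build_offline_arrow <- function(dest_file = "arrow_with_deps.tar.gz") {
  arrow::create_package_with_all_dependencies(dest_file)
}

# Hypothetical wrapper: run on the server WITHOUT internet access,
# after copying the fat-source tarball over.
install_offline_arrow <- function(path = "arrow_with_deps.tar.gz") {
  # repos = NULL tells R to install from the local file rather than a repository
  install.packages(path, repos = NULL, type = "source")
}
```

Nothing runs when these are defined. The intended use is to call `build_offline_arrow()` on the connected machine, transfer the resulting file, and then call `install_offline_arrow()` on the server.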

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,183 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We are happy to announce the recent release of version 6.0.0 of Arrow on CRAN, 
+and in this blog post we highlight the main updates in this version.  A big 
+thanks goes to Dragos Moldovan-Grünfeld, Percy Camilo Triveño Aucahuasi, 

Review comment:
       ```suggestion
   thanks goes to Dragoș Moldovan-Grünfeld, Percy Camilo Triveño Aucahuasi, 
   ```

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,183 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We are happy to announce the recent release of version 6.0.0 of Arrow on CRAN, 
+and in this blog post we highlight the main updates in this version.  A big 
+thanks goes to Dragos Moldovan-Grünfeld, Percy Camilo Triveño Aucahuasi, 
+Dewey Dunnington, Matt Peterson, and Phillip Cloud, who, in this release, made 
+their first contributions to the R package.
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% 
summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a 
whole dataset, but 6.0.0 now allows you to aggregate across groups with 
`group_by()` (a workflow we know people have been waiting for and asking 
about!). These are usable both with in-memory Arrow tables as well as across 
partitioned datasets. As usual, Arrow will read and process data in chunks and 
in parallel when possible to produce results much faster than one could by 
loading it all into memory then processing, and even better, allows for 
operations that wouldn’t fit into memory on a single machine.
+
+The focus of this release has been on the initial implementation of this 
functionality - for the next release, we’ll be looking to profile and optimize 
to enhance performance.

Review comment:
       ```suggestion
   The focus of this release has been on the initial implementation of this 
functionality - for the next release, we’ll be looking to profile and optimize 
to enhance performance. Connected with that, much of this functionality is 
still very new and slightly experimental (for example, it's not even wired up 
in the pyarrow package yet!). We are excited to have people try this out; if 
you run into any issues at all, please [let us 
know](https://issues.apache.org/jira/browse/ARROW) so that we can improve it 
for our next release.
   ```



