drabastomek commented on a change in pull request #158:
URL: https://github.com/apache/arrow-site/pull/158#discussion_r743976182
##########
File path: _posts/2021-11-05-r-6.0.0.md
##########
@@ -0,0 +1,207 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-05"
+author: Nic Crane, Jonathan Keane, Neal Richardson
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We are excited to announce the recent release of version 6.0.0 of the Arrow R
package on CRAN. While we usually don't write a dedicated release blog post for
the R package, this one is special. There are a number of major new features in
this version, some of which we've been building up to for several years.
+
+# More dplyr support
+
+In version 0.16.0 (February 2020), we released the first version of the
Dataset feature, which allowed you to query multi-file datasets using
`dplyr::select()` and `filter()`. These tools allowed you to find a slice of
data in a large dataset that may not fit into memory and pull it into R for
further analysis. In version 4.0.0 earlier this year, we added support for
`mutate()` and a number of other dplyr verbs, and all year we've been adding
hundreds of functions you can use to transform and filter data in Datasets.
However, to aggregate, you'd still need to pull the data into R.
+
+## Grouped aggregation
+
+With `arrow` 6.0.0, you can now `summarise()` on Arrow data, with or
without `group_by()`. These are supported both with in-memory Arrow Tables
and across partitioned datasets. Most common aggregation functions are
supported: `n()`, `n_distinct()`, `min()`, `max()`, `sum()`, `mean()`, `var()`,
`sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability
are also supported and currently return approximate results using the t-digest
algorithm.
+
+As usual, Arrow will read and process data in chunks and in parallel when
possible, producing results much faster than loading it all into memory and
then processing it. This also allows for operations on data that wouldn't fit
into memory on a single machine.
+
Review comment:
I think we might add a small example here. I saw an aggregation example
below but I think it fits to have one here, as well.
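A minimal sketch of such an aggregation example, using the nycflights13 data that the join example further down already relies on, could be:

```r
library(arrow)
library(dplyr)

# Grouped aggregation evaluated by Arrow; collect() pulls the result into R
arrow_table(nycflights13::flights) %>%
  group_by(carrier) %>%
  summarise(
    n_flights = n(),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  collect()
```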
##########
File path: _posts/2021-11-05-r-6.0.0.md
##########
@@ -0,0 +1,207 @@
+## Joins
+
+In addition to aggregation, Arrow also supports all of dplyr's mutating joins
(inner, left, right, and full) and filtering joins (semi and anti).
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
Review comment:
```suggestion
Say, I want to get a table of all the flights from JFK to Las Vegas Airport
on
```
##########
File path: _posts/2021-11-05-r-6.0.0.md
##########
@@ -0,0 +1,207 @@
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+ filter(
+ year == 2013,
+ month == 10,
+ day == 9,
+ origin == "JFK",
+ dest == "LAS"
+ ) %>%
+ select(dep_time, arr_time, carrier) %>%
+ left_join(
+ arrow_table(nycflights13::airlines)
+ ) %>%
+ collect()
+
+#> # A tibble: 12 × 4
+#> dep_time arr_time carrier name
+#> <int> <int> <chr> <chr>
+#> 1 637 853 B6 JetBlue Airways
+#> 2 648 912 AA American Airlines Inc.
+#> 3 812 1029 DL Delta Air Lines Inc.
+#> 4 945 1206 VX Virgin America
+#> 5 955 1219 B6 JetBlue Airways
+#> 6 1018 1231 DL Delta Air Lines Inc.
+#> 7 1120 1338 B6 JetBlue Airways
+#> 8 1451 1705 DL Delta Air Lines Inc.
+#> 9 1656 1915 AA American Airlines Inc.
+#> 10 1755 2001 DL Delta Air Lines Inc.
+#> 11 1827 2049 B6 JetBlue Airways
+#> 12 1917 2126 DL Delta Air Lines Inc.
+```
+
+In this example, we're working on an in-memory table, so you wouldn't need
`arrow` to do this--but the same code would work on a larger-than-memory
dataset backed by thousands of Parquet files.
+
+## Under the hood
+
+To support these features, we've made some internal changes to how queries are
built up and--importantly--when they are evaluated. As a result, there are some
changes in behavior compared to past versions of `arrow`. First, calls to
`summarise()`, `head()`, and `tail()` no longer eagerly evaluate: this means
you need to call either `compute()` (to evaluate it and produce an Arrow Table)
or `collect()` (to evaluate and pull the Table into an R `data.frame`) to see
the results.
Review comment:
```suggestion
To support these features, we've made some internal changes to how queries
are built up and--importantly--when they are evaluated. As a result, there are
some changes in behavior compared to past versions of `arrow`.
First, calls to `summarise()`, `head()`, and `tail()` no longer eagerly
evaluate: this means you need to call either `compute()` (to evaluate it and
produce an Arrow Table) or `collect()` (to evaluate and pull the Table into an
R `data.frame`) to see the results.
```
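The lazy-evaluation change described in this paragraph can be sketched as follows (again using nycflights13 purely for illustration):

```r
library(arrow)
library(dplyr)

# head() no longer eagerly evaluates: this is a query object, not ten rows
query <- arrow_table(nycflights13::flights) %>%
  select(carrier, arr_delay) %>%
  head(10)

query %>% compute()  # evaluate, keeping the result as an Arrow Table
query %>% collect()  # evaluate and pull the result into an R data.frame
```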
##########
File path: _posts/2021-11-05-r-6.0.0.md
##########
@@ -0,0 +1,207 @@
+
+Second, the order of rows in a dataset query is no longer deterministic due
to the way the parallelization of work happens in the C++ library. This means
that you can't assume that the results of a query will be in the same order as
the rows of data in the files on disk. If you do need a stable sort order, call
`arrange()` to specify ordering.
+
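A sketch of the stable-ordering point; the dataset path and column names here are hypothetical stand-ins:

```r
library(arrow)
library(dplyr)

# Row order from a parallel dataset scan is not guaranteed, so request an
# explicit sort order with arrange() whenever ordering matters
open_dataset("path/to/parquet/dataset") %>%  # hypothetical dataset path
  filter(passengers > 0) %>%                 # hypothetical column
  arrange(pickup_datetime) %>%               # stable, explicit ordering
  collect()
```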
+While these changes are a break from past `arrow` behavior, they are
consistent with many `dbplyr` backends and are needed to allow queries to scale
beyond data-frame workflows that can fit into memory.
+
+# Integration with DuckDB
+
+The Arrow engine is not the only new way to query Arrow Datasets in this
release. If you have the [duckdb](https://duckdb.org/) package installed, you
can hand off an Arrow Dataset or query object to duckdb for further querying
using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr`
methods, as well as its SQL interface, to aggregate data. DuckDB supports
filter pushdown, so you can take advantage of Arrow Datasets and Arrow-based
optimizations even within a DuckDB SQL query with a `where` clause. Filtering
and column projection specified before `to_duckdb()` in a pipeline is evaluated
in Arrow, which can be helpful in some circumstances, such as complicated
dbplyr pipelines. You can also hand off DuckDB data (or the result of a query)
to arrow with `to_arrow()`.
Review comment:
```suggestion
The Arrow engine is not the only new way to query Arrow Datasets in this
release. If you have the [duckdb](https://duckdb.org/) package installed, you
can hand off an Arrow Dataset or query object to duckdb for further querying
using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr`
methods, as well as its SQL interface, to aggregate data. DuckDB supports
filter pushdown, so you can take advantage of Arrow Datasets and Arrow-based
optimizations even within a DuckDB SQL query with a `where` clause. Filtering
and column projection specified before the `to_duckdb()` call in a pipeline is
evaluated in Arrow; this can be helpful in some circumstances like complicated
dbplyr pipelines. You can also hand off DuckDB data (or the result of a query)
to arrow with the `to_arrow()` call.
```
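A sketch of the DuckDB hand-off described above; the dataset path and columns are hypothetical, and the steps before `to_duckdb()` are the ones evaluated in Arrow:

```r
library(arrow)
library(dplyr)

open_dataset("path/to/parquet/dataset") %>%  # hypothetical dataset path
  select(year, month, fare) %>%              # projection evaluated in Arrow
  filter(year == 2019) %>%                   # filter evaluated in Arrow
  to_duckdb() %>%                            # hand off to duckdb's dbplyr backend
  group_by(month) %>%
  summarise(mean_fare = mean(fare, na.rm = TRUE)) %>%
  collect()
```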
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]