[GitHub] [arrow-site] drabastomek commented on a change in pull request #158: R blog post

GitBox Sat, 06 Nov 2021 07:51:34 -0700


drabastomek commented on a change in pull request #158:
URL: https://github.com/apache/arrow-site/pull/158#discussion_r743975442




##########
File path: _posts/2021-11-05-r-6.0.0.md
##########
@@ -0,0 +1,207 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-05"
+author: Nic Crane, Jonathan Keane, Neal Richardson
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We are excited to announce the recent release of version 6.0.0 of the Arrow R package on CRAN. While we usually don't write a dedicated release blog post for the R package, this one is special. There are a number of major new features in this version, some of which we've been building up to for several years.
+
+# More dplyr support
+
+In version 0.16.0 (February 2020), we released the first version of the Dataset feature, which allowed you to query multi-file datasets using `dplyr::select()` and `filter()`. These tools allowed you to find a slice of data in a large dataset that may not fit into memory and pull it into R for further analysis. In version 4.0.0 earlier this year, we added support for `mutate()` and a number of other dplyr verbs, and all year we've been adding hundreds of functions you can use to transform and filter data in Datasets. However, to aggregate, you'd still need to pull the data into R.
+
+## Grouped aggregation
+
+With `arrow` 6.0.0, you can now `summarise()` on Arrow data, with or without `group_by()`. These are supported both with in-memory Arrow tables and across partitioned datasets. Most common aggregation functions are supported: `n()`, `n_distinct()`, `min()`, `max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability are also supported and currently return approximate results using the t-digest algorithm.
+
+As usual, Arrow will read and process data in chunks and in parallel when possible to produce results much faster than one could by loading it all into memory then processing. This allows for operations that wouldn't fit into memory on a single machine.
+
+## Joins
+
+In addition to aggregation, Arrow also supports all of dplyr's mutating joins (inner, left, right, and full) and filtering joins (semi and anti).
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on

Review comment:
   ```suggestion
   Suppose I want to get a table of all the flights from JFK to Las Vegas Airport on
   ```




