[GitHub] [arrow] nealrichardson commented on a change in pull request #11122: ARROW-13962: [R] Catch up on the NEWS

GitBox Thu, 09 Sep 2021 10:48:53 -0700


nealrichardson commented on a change in pull request #11122:
URL: https://github.com/apache/arrow/pull/11122#discussion_r705568192




##########
File path: r/NEWS.md
##########
@@ -19,10 +19,36 @@
 
 # arrow 5.0.0.9000
 
+There are now two ways to query Arrow data:
+
+## 1. Grouped aggregation in Arrow
+
+`dplyr::summarize()`, both grouped and ungrouped, is now implemented for Arrow 
Datasets, Tables, and RecordBatches. Because data is scanned in chunks, you can 
aggregate over larger-than-memory datasets backed by many files. Supported 
aggregation functions include `n()`, `n_distinct()`, `sum()`, `mean()`, 
`var()`, `sd()`, `any()`, and `all()`.
+
+This enhancement does change the behavior of `summarize()` and `collect()` in 
some cases: see "Breaking changes" below for details.
+
+New compute functions include `str_to_title()` and `strftime()`.
+
+## 2. duckdb integration
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand 
off an Arrow Dataset or query object to duckdb for further querying using the 
`to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as 
well as its SQL interface, to aggregate data. Filtering and column projection 
done before `to_duckdb()` are evaluated in Arrow.
 ## Breaking changes
 
 * `dplyr::summarize()` on an in-memory Arrow Table or RecordBatch no longer 
eagerly evaluates. Call `compute()` or `collect()` to evaluate the query.
-* Row order of data from a Dataset query is no longer deterministic. If you 
need a stable sort order, you should explicitly `arrange()` the query. For 
calls to `summarize()`, you can set `options(arrow.summarise.sort = TRUE)` to 
match the current `dplyr` behavior of sorting on the grouping columns.
+* Row order of data from a Dataset query is no longer deterministic. If you 
need a stable sort order, you should explicitly `arrange()` the query result. 
For calls to `summarize()`, you can set `options(arrow.summarise.sort = TRUE)` 
to match the current `dplyr` behavior of sorting on the grouping columns.
+
+## Installation on Linux
+
+* Package installation now fails if the Arrow C++ library does not compile. In 
previous versions, if the C++ library failed to compile, the R package would 
still install successfully but provide little useful functionality.
+* You can disable all optional C++ components when building from source by 
setting the environment variable `LIBARROW_MINIMAL=true`. The resulting build 
includes the core Arrow/Feather components but excludes Parquet, Datasets, 
compression libraries, and other optional features.
+* Source packages now bundle the Arrow C++ source code, so it does not have to 
be downloaded in order to build the package. Because the source is included, it 
is now possible to build the package on an offline/airgapped system. By 
default, the offline build will be minimal because it cannot download 
third-party C++ dependencies required to support all features. To allow a fully 
featured offline build, the included `create_package_with_all_dependencies()` 
function (also available on GitHub without installing the arrow package) will 
download all third-party C++ dependencies and bundle them inside the R source 
package. Run this function on a system connected to the network to produce the 
"fat" source package, then copy that .tar.gz package to your offline machine 
and install.
+* Source builds can make use of system dependencies (such as `libz`) by 
setting `ARROW_DEPENDENCY_SOURCE=AUTO`. This is not the default in this release 
(`BUNDLED`, i.e. download and build all dependencies) but may become the 
default in the future.
+* The JSON library components (`read_json_arrow()`) are now optional and still 
on by default; set `ARROW_JSON=OFF` before building to disable them.
+* It is now possible to build Arrow on the Raspbian operating system.

Review comment:
       I can also drop this bullet for now
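
For anyone checking the wording of the summarize and row-order bullets in this hunk, here is a minimal sketch of the behavior they describe. It assumes a recent arrow release with dplyr installed. The built-in `mtcars` data and the names `tbl`, `query`, and `result` are illustrative only and do not come from the pull request.

```r
library(arrow)
library(dplyr)

# Illustrative data only: an in-memory Arrow Table built from mtcars.
tbl <- Table$create(mtcars)

# summarize() on an Arrow Table builds a lazy query; nothing is computed yet.
query <- tbl %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg))

# Row order is not guaranteed, so sort explicitly before collecting into R.
result <- query %>%
  arrange(cyl) %>%
  collect()

print(result)
```

With the duckdb and dbplyr packages installed, the same pipeline could call `to_duckdb()` before `group_by()` so that the aggregation runs in duckdb instead. That variant is not shown here.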




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
