ianmcook commented on a change in pull request #11425:
URL: https://github.com/apache/arrow/pull/11425#discussion_r731888525
##########
File path: r/NEWS.md
##########
@@ -21,32 +21,64 @@ There are now two ways to query Arrow data:

-## 1. Grouped aggregation in Arrow
+## 1. Expanded Arrow-native queries: aggregation and joins

`dplyr::summarize()`, both grouped and ungrouped, is now implemented for Arrow Datasets, Tables, and RecordBatches. Because data is scanned in chunks, you can aggregate over larger-than-memory datasets backed by many files. Supported aggregation functions include `n()`, `n_distinct()`, `min()`, `max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability are also supported and currently return approximate results using the t-digest algorithm.

Along with `summarize()`, you can also call `count()`, `tally()`, and `distinct()`, which effectively wrap `summarize()`.

This enhancement does change the behavior of `summarize()` and `collect()` in some cases: see "Breaking changes" below for details.

-New compute functions include `str_to_title()` and `strftime()`.
In addition to `summarize()`, equality joins (`left_join()`, `inner_join()`, `semi_join()`, and others) are also supported natively in Arrow.

Grouped aggregation and (especially) joins should be considered somewhat experimental in this release. We expect them to work, but they may not be well optimized for all workloads. To help us focus our efforts on improving them in the next release, please let us know if you encounter unexpected behavior or poor performance.

New non-aggregating compute functions include string functions like `str_to_title()` and `strftime()` as well as compute functions for extracting date parts (e.g. `year()`, `month()`) from dates. This is not a complete list; for an exhaustive list of available compute functions, see `list_compute_functions()`.

We've also worked to fill in support for all data types, such as `Decimal`, for functions added in previous releases. All type limitations mentioned in previous release notes should no longer apply, and if you find a function that is not implemented for a certain data type, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues).

## 2. duckdb integration

-If you have the [duckdb](https://duckdb.org/) package installed, you can hand off an Arrow Dataset or query object to duckdb for further querying using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before `to_duckdb()` is evaluated in Arrow.
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand off an Arrow Dataset or query object to duckdb for further querying using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before `to_duckdb()` is evaluated in Arrow, and duckdb can push down some predicates to Arrow as well. This handoff *does not* copy the data; instead it uses Arrow's C interface (just like passing Arrow data between R and Python), so no serialization or data copying costs are incurred.

You can also take a duckdb `tbl` and call `to_arrow()` to stream data to Arrow's query engine. This means that in a single dplyr pipeline, you could start with an Arrow Dataset, evaluate some steps in duckdb, then evaluate the rest in Arrow.

## Breaking changes

* Row order of data from a Dataset query is no longer deterministic. If you need a stable sort order, you should explicitly `arrange()` the query result. For calls to `summarize()`, you can set `options(arrow.summarise.sort = TRUE)` to match the current `dplyr` behavior of sorting on the grouping columns.
* `dplyr::summarize()` on an in-memory Arrow Table or RecordBatch no longer eagerly evaluates. Call `compute()` or `collect()` to evaluate the query.
* `head()` and `tail()` also no longer eagerly evaluate, both for in-memory data and for Datasets. Also, because row order is no longer deterministic, they will effectively give you a random slice of data from somewhere in the dataset unless you `arrange()` to specify sorting.
* Simple Feature (SF) columns no longer save all of their metadata when converting to Arrow tables (and thus when saving to parquet or feather). This also includes any dataframe column that has attributes on each element (in other words: row-level metadata). Our previous approach to saving this metadata is both (computationally) inefficient and unreliable with Arrow queries + datasets. This will most impact saving SF columns. For saving these columns we recommend either converting the columns to well-known binary representations (using `sf::st_as_binary(col)`) or using the [sfarrow package](https://CRAN.R-project.org/package=sfarrow) which handles some of the intricacies of this conversion process. We have plans to improve this and re-enable custom metadata like this in the future when we can implement the saving in a safe and efficient way. If you need to preserve the pre-6.0.0 behavior of saving this metadata, you can set `options(arrow.preserve_row_level_metadata = TRUE)`. We will be removing this option in a coming release. We strongly recommend avoiding using this workaround if possible since the results will not be supported in the future and can lead to surprising and inaccurate results. If you run into a custom class besides SF columns that is impacted by this, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues).

Review comment:
```suggestion
* Simple Feature (SF) columns no longer save all of their metadata when converting to Arrow tables (and thus when saving to Parquet or Feather). This also includes any dataframe column that has attributes on each element (in other words: row-level metadata). Our previous approach to saving this metadata is both (computationally) inefficient and unreliable with Arrow queries + datasets. This will most impact saving SF columns. For saving these columns we recommend either converting the columns to well-known binary representations (using `sf::st_as_binary(col)`) or using the [sfarrow package](https://CRAN.R-project.org/package=sfarrow) which handles some of the intricacies of this conversion process. We have plans to improve this and re-enable custom metadata like this in the future when we can implement the saving in a safe and efficient way. If you need to preserve the pre-6.0.0 behavior of saving this metadata, you can set `options(arrow.preserve_row_level_metadata = TRUE)`. We will be removing this option in a coming release. We strongly recommend avoiding using this workaround if possible since the results will not be supported in the future and can lead to surprising and inaccurate results. If you run into a custom class besides SF columns that is impacted by this, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues).
```
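To make the grouped aggregation described above concrete, here is a minimal sketch; `mtcars` and the temporary directory are stand-ins for a real multi-file dataset:

```r
library(arrow)
library(dplyr)

# Stand-in for a real multi-file dataset: write mtcars out as a
# Parquet directory partitioned by cylinder count.
path <- tempfile()
write_dataset(mtcars, path, partitioning = "cyl")

# Grouped aggregation runs natively in Arrow; data is scanned in chunks,
# so the same query works on larger-than-memory datasets.
open_dataset(path) %>%
  group_by(cyl) %>%
  summarize(n = n(), avg_mpg = mean(mpg), sd_mpg = sd(mpg)) %>%
  arrange(cyl) %>%  # row order is no longer deterministic; sort explicitly
  collect()
```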
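Equality joins work the same way. A sketch, where `cyl_info` is a made-up lookup table for illustration:

```r
# A small dimension table to join against; evaluated natively in Arrow.
cyl_info <- Table$create(cyl = c(4, 6, 8),
                         engine = c("small", "medium", "large"))

Table$create(mtcars) %>%
  left_join(cyl_info, by = "cyl") %>%  # equality join in Arrow
  select(mpg, cyl, engine) %>%
  collect()
```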
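And a sketch of the duckdb round trip described in section 2, assuming the duckdb and dbplyr packages are installed; the grouped `mutate()` is a window expression, which duckdb can evaluate but which is not part of Arrow's native query support in this release:

```r
open_dataset(path) %>%
  filter(mpg > 15) %>%                        # evaluated in Arrow
  to_duckdb() %>%                             # handed off via the Arrow C interface; no copy
  group_by(cyl) %>%
  mutate(demeaned_mpg = mpg - mean(mpg)) %>%  # window expression, run by duckdb via dbplyr
  to_arrow() %>%                              # streamed back into Arrow's engine
  collect()
```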
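The lazy-evaluation breaking change, in miniature:

```r
tbl <- Table$create(mtcars)

# summarize() on an in-memory Table no longer evaluates eagerly:
# `q` is a lazy query, not a result.
q <- tbl %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))

compute(q)  # evaluate, keeping the result as an Arrow Table
collect(q)  # evaluate and pull the result into an R data frame

# Restore dplyr's behavior of sorting summarize() output on the
# grouping columns:
options(arrow.summarise.sort = TRUE)
```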
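Finally, a rough sketch of the SF workaround named in the last bullet, using the `nc` dataset shipped with sf; the `unclass()` step and the exact binary handling are assumptions, and sfarrow is the safer route:

```r
library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Option 1: let sfarrow handle the conversion details.
sfarrow::st_write_parquet(nc, tempfile(fileext = ".parquet"))

# Option 2: store the geometry as well-known binary yourself.
nc_df <- as.data.frame(nc)
nc_df$geometry <- unclass(st_as_binary(nc_df$geometry))  # list of raw vectors
arrow::write_parquet(nc_df, tempfile(fileext = ".parquet"))
```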
