[GitHub] [arrow] nealrichardson commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

GitBox Mon, 02 May 2022 08:40:59 -0700


nealrichardson commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r862955886



##########
r/NEWS.md:
##########
@@ -19,19 +19,123 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations.
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset()` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`,
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow()` can write a `Dataset` or an Arrow dplyr query to a single file.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")`
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extensibility
+
+* Added S3 generic conversion functions such as `as_arrow_array()`
+  and `as_arrow_table()` for main Arrow objects. This includes Arrow tables,
+  record batches, arrays, chunked arrays, record batch readers, schemas, and
+  data types. This allows other packages to define custom conversions from their
+  types to Arrow objects, including extension arrays.
+* Custom [extension types and arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types)
+  can be created and registered, allowing other packages to
+  define their own array types. Extension arrays wrap regular Arrow array types and
+  provide customized behavior and/or storage. See description and an example with
+  `?new_extension_type`.
+* Implemented a generic extension type and `as_arrow_array()` methods for all objects where
+  `vctrs::vec_is()` returns TRUE (i.e., any object that can be used as a column in a
+  `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted
+  to an Arrow Array.
+
+## Concatenation Support
+
+Arrow arrays and tables can be easily concatenated:
+
+ * Arrays can be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * Chunked arrays can be concatenated with `c()`.
+ * Record batches and tables support `cbind()`.
+ * Arrow tables support `rbind()`. `concat_tables()` is also provided to 
+   concatenate tables while unifying schemas.
+
+## Other improvements and fixes
+
+* Dictionary arrays support using ALTREP when converting to R factors.
+* Math group generics are implemented for ArrowDatum. This means you can use
+  base functions like `sqrt()`, `log()`, and `exp()` with Arrow arrays and scalars.
+* `read_*` and `write_*` functions support R Connection objects for reading
+  and writing files.
+* Parquet improvements:
+  * Parquet writer supports Duration type columns.
+  * The dataset Parquet reader consumes less memory.
 * `median()` and `quantile()` will warn once about approximate calculations regardless of interactivity.
-* Removed Solaris workarounds, libarrow is now required.
+* `Array$cast()` can cast struct arrays into another struct type with the same field names
+  and structure (or a subset of fields) but different field types.
+* The CSV writer is now much faster when writing string columns.
+* Remove special handling for Solaris
+* The CSV writer is much faster when writing string columns.
+* Removed Solaris workarounds, libarrow is required.

Review Comment:
   ```suggestion
   ```


