[GitHub] [arrow] paleolimbot commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

GitBox Fri, 29 Apr 2022 04:58:32 -0700


paleolimbot commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r861728055



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations.
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when

Review Comment:
   ```suggestion
   * `write_dataset()` now has more options for controlling row group and file sizes when
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations.
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")`
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types
+is to define a customized conversion between an an Arrow Array and an R object
+when the default conversion is slow or looses metadata important to the interpretation
+of values in the array. For most types, the built-in vctrs extension type is probably
+sufficient. See description and an example with `?new_extension_type`.
+
+## Concatenation Support
+
+Arrow arrays and tables can now be easily concatenated:
+
+ * Arrays can now be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * Chunked arrays can now be concatenated with `c()`.
+ * Record batches and tables now support `cbind()`.
+ * Arrow tables now support `rbind()`. `concat_tables()` is also provided to 
+   concatenate tables while unifying schemas.
+
+## S3 Conversion Generics
+
+Arrow now provides S3 generic conversion functions such as `as_arrow_array()`
+and `as_chunked_array()` for main Arrow objects. This includes, Arrow tables,

Review Comment:
   ```suggestion
   and `as_arrow_table()` for main Arrow objects. This includes, Arrow tables,
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations.
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")`
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also

Review Comment:
   ```suggestion
     [tzdb package](https://cran.r-project.org/package=tzdb) is also
   ```
   
   (I know that's a weird URL, but not using the canonical form triggers a NOTE in R CMD check, or at least it used to.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
