(arrow) branch main updated (da0eb7e9fc -> 6800be9331)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git

    from da0eb7e9fc MINOR: [Swift] cleanup some go and C++ artifacts (#41878)
     add 6800be9331 MINOR: [R] Remove writing_bindings from _pkgdown.yml (#41877)

No new revisions were added by this update.

Summary of changes:
 r/_pkgdown.yml | 1 -
 1 file changed, 1 deletion(-)
(arrow) branch main updated (4a2df663bc -> 774ee0f2fe)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git

    from 4a2df663bc GH-41675: [Packaging][MATLAB] Add crossbow job to package MATLAB interface on macos-14 (#41677)
     add 774ee0f2fe GH-41834: [R] Better error handling in dplyr code (#41576)

No new revisions were added by this update.

Summary of changes:
 r/R/dplyr-across.R | 6 +-
 r/R/dplyr-arrange.R | 87
 r/R/dplyr-datetime-helpers.R | 31 +--
 r/R/dplyr-eval.R | 182 +---
 r/R/dplyr-filter.R | 64 +++---
 r/R/dplyr-funcs-agg.R | 6 +-
 r/R/dplyr-funcs-conditional.R | 16 +-
 r/R/dplyr-funcs-datetime.R | 18 +-
 r/R/dplyr-funcs-simple.R | 2 +-
 r/R/dplyr-funcs-string.R | 76 ---
 r/R/dplyr-funcs-type.R | 7 +-
 r/R/dplyr-mutate.R | 190 +
 r/R/dplyr-slice.R | 2 +-
 r/R/dplyr-summarize.R | 70 ++-
 r/R/dplyr.R | 16 --
 r/man/arrow_not_supported.Rd | 56 +
 r/tests/testthat/_snaps/dataset-dplyr.md | 9 +
 r/tests/testthat/_snaps/dplyr-across.md | 11 +
 r/tests/testthat/_snaps/dplyr-eval.md | 27 +++
 r/tests/testthat/_snaps/dplyr-funcs-datetime.md | 11 +
 r/tests/testthat/_snaps/dplyr-mutate.md | 25 +++
 r/tests/testthat/_snaps/dplyr-query.md | 4 +-
 r/tests/testthat/_snaps/dplyr-summarize.md | 41 +++-
 r/tests/testthat/helper-expectation.R | 7 +-
 r/tests/testthat/test-dataset-dplyr.R | 5 +-
 r/tests/testthat/test-dplyr-across.R | 12 +-
 r/tests/testthat/test-dplyr-collapse.R | 13 --
 r/tests/testthat/test-dplyr-eval.R | 60 ++
 r/tests/testthat/test-dplyr-filter.R | 20 +-
 r/tests/testthat/test-dplyr-funcs-conditional.R | 107 --
 r/tests/testthat/test-dplyr-funcs-datetime.R | 46 +
 r/tests/testthat/test-dplyr-funcs-string.R | 79 ---
 r/tests/testthat/test-dplyr-mutate.R | 13 +-
 r/tests/testthat/test-dplyr-summarize.R | 55 ++---
 r/vignettes/developers/matchsubstringoptions.png | Bin 89899 -> 0 bytes
 r/vignettes/developers/starts_with_docs.png | Bin 9720 -> 0 bytes
 r/vignettes/developers/startswithdocs.png | Bin 42409 -> 0 bytes
 r/vignettes/developers/writing_bindings.Rmd | 253 ---
 38 files changed, 804 insertions(+), 823 deletions(-)
 create mode 100644 r/man/arrow_not_supported.Rd
 create mode 100644 r/tests/testthat/_snaps/dataset-dplyr.md
 create mode 100644 r/tests/testthat/_snaps/dplyr-across.md
 create mode 100644 r/tests/testthat/_snaps/dplyr-eval.md
 create mode 100644 r/tests/testthat/_snaps/dplyr-funcs-datetime.md
 create mode 100644 r/tests/testthat/_snaps/dplyr-mutate.md
 create mode 100644 r/tests/testthat/test-dplyr-eval.R
 delete mode 100644 r/vignettes/developers/matchsubstringoptions.png
 delete mode 100644 r/vignettes/developers/starts_with_docs.png
 delete mode 100644 r/vignettes/developers/startswithdocs.png
 delete mode 100644 r/vignettes/developers/writing_bindings.Rmd
(arrow) branch main updated: GH-41540: [R] Simplify arrow_eval() logic and bindings environments (#41537)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/main by this push:
     new 03f8ae754e GH-41540: [R] Simplify arrow_eval() logic and bindings environments (#41537)

03f8ae754e is described below

commit 03f8ae754ede16f118ccdba0abb593b1461024aa
Author: Neal Richardson
AuthorDate: Tue May 7 09:42:55 2024 -0400

    GH-41540: [R] Simplify arrow_eval() logic and bindings environments (#41537)

    ### Rationale for this change

    NSE is hard enough. I wanted to see if I could remove some layers of complexity.

    ### What changes are included in this PR?

    * There are no longer separate collections of `agg_funcs` and `nse_funcs`. Now that the aggregation functions return Expressions (https://github.com/apache/arrow/pull/41223), there's no reason to treat them separately. All bindings return Expressions now.
    * Both are removed, and functions are just stored in `.cache$functions`. There was a note wondering why both `nse_funcs` and that needed to exist. They don't.
    * `arrow_mask()` no longer has an `aggregations` argument: agg functions are always present.
    * Because agg functions are always present, `filter` and `arrange` now have to check whether the expressions passed to them contain aggregations--this is supported in regular dplyr, but we have deferred supporting it here for now (see https://github.com/apache/arrow/pull/41350). If we decide we want to support it later, these checks are the entry points where we'd drop in the `left_join()` as in `mutate()`.
    * The logic of evaluating expressions in `filter()` has been simplified.
    * Assorted other cleanups: `register_binding()` has two fewer arguments, for example, and the duplicate functions for referencing `agg_funcs` are gone.
There is one more refactor I intend to pursue, and that's to rework abandon_ship and how arrow_eval does error handling, but I ~may~ will defer that to a followup. ### Are these changes tested? Yes, though I'll add some more for filter/aggregate in the followup since I'm reworking things there. ### Are there any user-facing changes? There are a couple of edge cases where the error message will change subtly. For example, if you supplied a comma-separated list of filter expressions, and more than one of them did not evaluate, previously you would be informed of all of the failures; now, we error on the first one. I don't think this is concerning. * GitHub Issue: #41540 --- r/R/dplyr-arrange.R | 8 ++ r/R/dplyr-eval.R| 17 +--- r/R/dplyr-filter.R | 54 - r/R/dplyr-funcs-agg.R | 26 +++--- r/R/dplyr-funcs.R | 119 ++-- r/R/dplyr-mutate.R | 2 +- r/R/dplyr-summarize.R | 2 +- r/R/udf.R | 7 +- r/man/register_binding.Rd | 45 ++- r/tests/testthat/test-dataset-dplyr.R | 2 +- r/tests/testthat/test-dplyr-filter.R| 9 ++- r/tests/testthat/test-dplyr-funcs.R | 30 +++ r/tests/testthat/test-dplyr-summarize.R | 28 +++ r/tests/testthat/test-udf.R | 14 ++-- r/vignettes/developers/writing_bindings.Rmd | 7 +- 15 files changed, 109 insertions(+), 261 deletions(-) diff --git a/r/R/dplyr-arrange.R b/r/R/dplyr-arrange.R index f91cd14211..c8594c77df 100644 --- a/r/R/dplyr-arrange.R +++ b/r/R/dplyr-arrange.R @@ -47,6 +47,14 @@ arrange.arrow_dplyr_query <- function(.data, ..., .by_group = FALSE) { msg <- paste("Expression", names(sorts)[i], "not supported in Arrow") return(abandon_ship(call, .data, msg)) } +if (length(mask$.aggregations)) { + # dplyr lets you arrange on e.g. x < mean(x), but we haven't implemented it. + # But we could, the same way it works in mutate() via join, if someone asks. + # Until then, just error. 
+ # TODO: add a test for this + msg <- paste("Expression", format_expr(expr), "not supported in arrange() in Arrow") + return(abandon_ship(call, .data, msg)) +} descs[i] <- x[["desc"]] } .data$arrange_vars <- c(sorts, .data$arrange_vars) diff --git a/r/R/dplyr-eval.R b/r/R/dplyr-eval.R index ff1619ce94..211c26cecc 100644 --- a/r/R/dplyr-eval.R +++ b/r/R/dplyr-eval.R @@ -121,24 +121,9 @@ arrow_not_supported <- function(msg) { } # Create a data mask for evaluating a dplyr expression -arrow_mask <- function(.data, aggregation = FALSE) { +arrow_mask <- function(.data) { f_env <- new_environmen
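The commit's central idea -- a single registry of bindings, all returning Expressions, evaluated through a data mask -- can be sketched outside of arrow. A hedged toy version (these names are simplified stand-ins, not the real arrow internals; the "Expression" here is just a tagged list):

```r
library(rlang)

# Toy registry: every "binding" returns a tagged list standing in for an
# arrow Expression, so there is no second agg_funcs collection to merge in.
.cache <- new.env()
.cache$functions <- list(
  nchar = function(x) list(kernel = "utf8_length", args = list(x))
)

# Build a data mask whose function environment is the single registry.
toy_arrow_mask <- function(.data) {
  f_env <- new_environment(.cache$functions)
  new_data_mask(new_environment(.data, parent = f_env), top = f_env)
}

mask <- toy_arrow_mask(list(x = "hello"))
eval_tidy(quo(nchar(x)), mask)  # the registry's binding fires, not base::nchar
```

Because the data mask also masks function lookup, `nchar(x)` resolves to the registered binding and yields the tagged list rather than an integer.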
(arrow) branch main updated: MINOR: [R] fix no visible global function definition: left_join (#41542)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/main by this push:
     new d10ebf055a MINOR: [R] fix no visible global function definition: left_join (#41542)

d10ebf055a is described below

commit d10ebf055a393c94a693097db1dca08ff86745bd
Author: Neal Richardson
AuthorDate: Mon May 6 09:28:22 2024 -0400

    MINOR: [R] fix no visible global function definition: left_join (#41542)

    ### Rationale for this change

    Followup to #41350; fixes a check NOTE that it caused.

    ### What changes are included in this PR?

    `dplyr::` in two places.

    ### Are these changes tested?

    Check will be clean.

    ### Are there any user-facing changes?

---
 r/R/dplyr-mutate.R | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/r/R/dplyr-mutate.R b/r/R/dplyr-mutate.R
index 880f7799e6..72882b6afd 100644
--- a/r/R/dplyr-mutate.R
+++ b/r/R/dplyr-mutate.R
@@ -84,12 +84,12 @@ mutate.arrow_dplyr_query <- function(.data,
     agg_query$aggregations <- mask$.aggregations
     agg_query <- collapse.arrow_dplyr_query(agg_query)
     if (length(grv)) {
-      out <- left_join(out, agg_query, by = grv)
+      out <- dplyr::left_join(out, agg_query, by = grv)
     } else {
       # If there are no group_by vars, add a scalar column to both and join on that
       agg_query$selected_columns[["..tempjoin"]] <- Expression$scalar(1L)
       out$selected_columns[["..tempjoin"]] <- Expression$scalar(1L)
-      out <- dplyr::left_join(out, agg_query, by = "..tempjoin")
+      out <- dplyr::left_join(out, agg_query, by = "..tempjoin")
     }
   }
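For context on the NOTE: `R CMD check` reports "no visible global function definition" when a package calls a function it neither defines nor imports in its NAMESPACE. Namespace-qualifying the call, as this commit does, is one fix; importing the name is the other. A minimal illustration (plain dplyr, runnable outside a package):

```r
# Qualifying the call resolves left_join through dplyr's namespace at run
# time, so no NAMESPACE import is required.
df1 <- data.frame(k = 1:2, a = c("x", "y"))
df2 <- data.frame(k = 2:3, b = c("p", "q"))
dplyr::left_join(df1, df2, by = "k")

# The alternative, inside a package, is a NAMESPACE import -- e.g. with a
# roxygen2 tag above the function that uses it:
#   #' @importFrom dplyr left_join
```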
(arrow) branch main updated (00df70c6dc -> 2ef4059566)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git

    from 00df70c6dc GH-41398: [R][CI] Windows job failing after R 4.4 release (#41409)
     add 2ef4059566 GH-29537: [R] Support mutate/summarize with implicit join (#41350)

No new revisions were added by this update.

Summary of changes:
 r/R/arrow-package.R | 5 +--
 r/R/dplyr-funcs-agg.R | 1 -
 r/R/dplyr-funcs-doc.R | 2 +-
 r/R/dplyr-mutate.R | 39 +---
 r/man/acero.Rd | 2 +-
 r/tests/testthat/test-dataset-dplyr.R | 11 ---
 r/tests/testthat/test-dplyr-mutate.R | 57 ---
 r/vignettes/data_wrangling.Rmd | 28 +
 8 files changed, 58 insertions(+), 87 deletions(-)
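What GH-29537 enables, from the user's side: mixing aggregate and per-row expressions in `mutate()`, which arrow evaluates by computing the aggregation as a separate query and implicitly joining it back on the grouping keys. A usage sketch (assumes the arrow and dplyr packages are installed):

```r
library(dplyr)

mtcars %>%
  arrow::arrow_table() %>%
  group_by(cyl) %>%
  # mean(mpg) is an aggregate; arrow computes it in a separate query and
  # left-joins the per-group result back on cyl before subtracting
  mutate(mpg_centered = mpg - mean(mpg)) %>%
  collect()
```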
(arrow) branch main updated: MINOR: [R] refactor arrow_mask to include aggregations list (#41414)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/main by this push: new c87073737b MINOR: [R] refactor arrow_mask to include aggregations list (#41414) c87073737b is described below commit c87073737b6ffef9715549a199499b92630e8e5f Author: Neal Richardson AuthorDate: Mon Apr 29 11:32:01 2024 -0400 MINOR: [R] refactor arrow_mask to include aggregations list (#41414) ### Rationale for this change Keeping the `..aggregations` list in parent.frame felt a little wrong. As we're starting to use this in more places (like mutate in #41350, and potentially more places), I wanted to try to improve this. I tried a bunch of things before to put it somewhere better (like in the mask) but failed. Finally I found one that worked. ### What changes are included in this PR? Just a refactor ### Are these changes tested? Existing tests pass. ### Are there any user-facing changes? Nope. --- r/R/dplyr-eval.R | 8 +++- r/R/dplyr-funcs-agg.R | 23 --- r/R/dplyr-summarize.R | 41 ++--- 3 files changed, 33 insertions(+), 39 deletions(-) diff --git a/r/R/dplyr-eval.R b/r/R/dplyr-eval.R index 3aaa29696b..ff1619ce94 100644 --- a/r/R/dplyr-eval.R +++ b/r/R/dplyr-eval.R @@ -125,13 +125,9 @@ arrow_mask <- function(.data, aggregation = FALSE) { f_env <- new_environment(.cache$functions) if (aggregation) { -# Add the aggregation functions to the environment, and set the enclosing -# environment to the parent frame so that, when called from summarize_eval(), -# they can reference and assign into `..aggregations` defined there. -pf <- parent.frame() +# Add the aggregation functions to the environment. for (f in names(agg_funcs)) { f_env[[f]] <- agg_funcs[[f]] - environment(f_env[[f]]) <- pf } } else { # Add functions that need to error hard and clear. 
@@ -156,6 +152,8 @@ arrow_mask <- function(.data, aggregation = FALSE) { # TODO: figure out what rlang::as_data_pronoun does/why we should use it # (because if we do we get `Error: Can't modify the data pronoun` in mutate()) out$.data <- .data$selected_columns + # Add the aggregations list to collect any that get pulled out when evaluating + out$.aggregations <- empty_named_list() out } diff --git a/r/R/dplyr-funcs-agg.R b/r/R/dplyr-funcs-agg.R index ab1df1d2f1..d84f8f28f0 100644 --- a/r/R/dplyr-funcs-agg.R +++ b/r/R/dplyr-funcs-agg.R @@ -17,7 +17,7 @@ # Aggregation functions # -# These all insert into an ..aggregations list (in a parent frame) a list containing: +# These all insert into an .aggregations list in the mask, a list containing: # @param fun string function name # @param data list of 0 or more Expressions # @param options list of function options, as passed to call_function @@ -154,11 +154,11 @@ register_bindings_aggregate <- function() { set_agg <- function(...) { agg_data <- list2(...) - # Find the environment where ..aggregations is stored + # Find the environment where .aggregations is stored target <- find_aggregations_env() - aggs <- get("..aggregations", target) + aggs <- get(".aggregations", target) lapply(agg_data[["data"]], function(expr) { -# If any of the fields referenced in the expression are in ..aggregations, +# If any of the fields referenced in the expression are in .aggregations, # then we can't aggregate over them. # This is mainly for combinations of dataset columns and aggregations, # like sum(x - mean(x)), i.e. window functions. @@ -169,23 +169,24 @@ set_agg <- function(...) 
{ } }) - # Record the (fun, data, options) in ..aggregations + # Record the (fun, data, options) in .aggregations # and return a FieldRef pointing to it tmpname <- paste0("..temp", length(aggs)) aggs[[tmpname]] <- agg_data - assign("..aggregations", aggs, envir = target) + assign(".aggregations", aggs, envir = target) Expression$field_ref(tmpname) } find_aggregations_env <- function() { - # Find the environment where ..aggregations is stored, + # Find the environment where .aggregations is stored, # it's in parent.env of something in the call stack - for (f in sys.frames()) { -if (exists("..aggregations", envir = f)) { - return(f) + n <- 1 + while (TRUE) { +if (exists(".aggregations", envir = caller_env(n))) { + return(caller_env(n)) } +n <- n + 1 } - stop("Could not find ..aggregations") } ensure_one_arg <- function(args, fun) { diff --git a/r/R/dplyr-summarize.R b/r/R/dplyr-summarize.R index 5bb81dc2b3
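The `find_aggregations_env()` change in the diff walks up `caller_env(n)` instead of scanning `sys.frames()`. A generic, hedged version of that lookup (with a termination guard added here that the diffed code omits, so a failed search stops at the global environment instead of looping):

```r
library(rlang)

# Walk up the calling environments until one defines `name`.
find_env_with <- function(name) {
  n <- 1
  repeat {
    e <- caller_env(n)
    if (env_has(e, name)) {
      return(e)
    }
    if (identical(e, global_env())) {
      stop("Could not find ", name)  # guard so the walk terminates
    }
    n <- n + 1
  }
}

f <- function() {
  .aggregations <- list()  # defined in f's execution environment
  g()
}
g <- function() find_env_with(".aggregations")
env_has(f(), ".aggregations")  # TRUE: g found f's frame two callers up
```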
(arrow) branch main updated: GH-41358: [R] Support join "na_matches" argument (#41372)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/main by this push: new ea314a3f8d GH-41358: [R] Support join "na_matches" argument (#41372) ea314a3f8d is described below commit ea314a3f8d9d4446836aa999b66659c07421f7a4 Author: Neal Richardson AuthorDate: Fri Apr 26 18:32:32 2024 -0400 GH-41358: [R] Support join "na_matches" argument (#41372) ### Rationale for this change Noticed in #41350, I made #41358 to implement this in C++, but it turns out the option was there, just buried a bit. ### What changes are included in this PR? `na_matches` is mapped through to the `key_cmp` field in `HashJoinNodeOptions`. Acero supports having a different value for this for each of the join keys, but dplyr does not, so I kept it constant for all key columns to match the dplyr behavior. ### Are these changes tested? Yes ### Are there any user-facing changes? Yes * GitHub Issue: #41358 --- r/NEWS.md | 1 + r/R/arrow-package.R| 12 ++-- r/R/arrowExports.R | 4 ++-- r/R/dplyr-funcs-doc.R | 12 ++-- r/R/dplyr-join.R | 8 +--- r/R/query-engine.R | 8 +--- r/man/acero.Rd | 12 ++-- r/src/arrowExports.cpp | 11 ++- r/src/compute-exec.cpp | 18 +- r/tests/testthat/test-dplyr-join.R | 32 10 files changed, 82 insertions(+), 36 deletions(-) diff --git a/r/NEWS.md b/r/NEWS.md index 4ed9f28a28..05f934dac6 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -21,6 +21,7 @@ * R functions that users write that use functions that Arrow supports in dataset queries now can be used in queries too. Previously, only functions that used arithmetic operators worked. For example, `time_hours <- function(mins) mins / 60` worked, but `time_hours_rounded <- function(mins) round(mins / 60)` did not; now both work. These are automatic translations rather than true user-defined functions (UDFs); for UDFs, see `register_scalar_function()`. 
(#41223) * `summarize()` supports more complex expressions, and correctly handles cases where column names are reused in expressions. +* The `na_matches` argument to the `dplyr::*_join()` functions is now supported. This argument controls whether `NA` values are considered equal when joining. (#41358) # arrow 16.0.0 diff --git a/r/R/arrow-package.R b/r/R/arrow-package.R index f6977e6262..7087a40c49 100644 --- a/r/R/arrow-package.R +++ b/r/R/arrow-package.R @@ -66,12 +66,12 @@ supported_dplyr_methods <- list( compute = NULL, collapse = NULL, distinct = "`.keep_all = TRUE` not supported", - left_join = "the `copy` and `na_matches` arguments are ignored", - right_join = "the `copy` and `na_matches` arguments are ignored", - inner_join = "the `copy` and `na_matches` arguments are ignored", - full_join = "the `copy` and `na_matches` arguments are ignored", - semi_join = "the `copy` and `na_matches` arguments are ignored", - anti_join = "the `copy` and `na_matches` arguments are ignored", + left_join = "the `copy` argument is ignored", + right_join = "the `copy` argument is ignored", + inner_join = "the `copy` argument is ignored", + full_join = "the `copy` argument is ignored", + semi_join = "the `copy` argument is ignored", + anti_join = "the `copy` argument is ignored", count = NULL, tally = NULL, rename_with = NULL, diff --git a/r/R/arrowExports.R b/r/R/arrowExports.R index 752d3a266b..62e2182ffc 100644 --- a/r/R/arrowExports.R +++ b/r/R/arrowExports.R @@ -484,8 +484,8 @@ ExecNode_Aggregate <- function(input, options, key_names) { .Call(`_arrow_ExecNode_Aggregate`, input, options, key_names) } -ExecNode_Join <- function(input, join_type, right_data, left_keys, right_keys, left_output, right_output, output_suffix_for_left, output_suffix_for_right) { - .Call(`_arrow_ExecNode_Join`, input, join_type, right_data, left_keys, right_keys, left_output, right_output, output_suffix_for_left, output_suffix_for_right) +ExecNode_Join <- function(input, join_type, right_data, 
left_keys, right_keys, left_output, right_output, output_suffix_for_left, output_suffix_for_right, na_matches) { + .Call(`_arrow_ExecNode_Join`, input, join_type, right_data, left_keys, right_keys, left_output, right_output, output_suffix_for_left, output_suffix_for_right, na_matches) } ExecNode_Union <- function(input, right_data) { diff --git a/r/R/dplyr-funcs-doc.R b/r/R/dplyr-funcs-doc.R index 2042f80014..fda77bca83 100644 --
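For reference, the dplyr semantics being matched (this is plain dplyr; with this commit the same argument flows through to the `key_cmp` field of Acero's `HashJoinNodeOptions`):

```r
library(dplyr)

df1 <- tibble(k = c(1, NA), v = c("a", "b"))
df2 <- tibble(k = c(1, NA), w = c("x", "y"))

# Default na_matches = "na": NA keys are considered equal, so the NA rows join.
left_join(df1, df2, by = "k", na_matches = "na")

# na_matches = "never": NA keys match nothing (SQL-style), so w is NA there.
left_join(df1, df2, by = "k", na_matches = "never")
```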
(arrow) branch main updated: MINOR: [R] refactor: move aggregation function bindings to their own file (#41355)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/main by this push: new f1bc82f2b3 MINOR: [R] refactor: move aggregation function bindings to their own file (#41355) f1bc82f2b3 is described below commit f1bc82f2b39a317970427052c360383f983ec3f8 Author: Neal Richardson AuthorDate: Tue Apr 23 13:31:26 2024 -0400 MINOR: [R] refactor: move aggregation function bindings to their own file (#41355) For consistency with other bindings, and to allow `dplyr-summarize.R` to start with the summarize method, as do the other dplyr verb files. --- r/DESCRIPTION | 1 + r/R/dplyr-funcs-agg.R | 198 ++ r/R/dplyr-funcs.R | 16 +++- r/R/dplyr-summarize.R | 195 - 4 files changed, 213 insertions(+), 197 deletions(-) diff --git a/r/DESCRIPTION b/r/DESCRIPTION index 2efaed4d6c..eeff8168b3 100644 --- a/r/DESCRIPTION +++ b/r/DESCRIPTION @@ -107,6 +107,7 @@ Collate: 'dplyr-distinct.R' 'dplyr-eval.R' 'dplyr-filter.R' +'dplyr-funcs-agg.R' 'dplyr-funcs-augmented.R' 'dplyr-funcs-conditional.R' 'dplyr-funcs-datetime.R' diff --git a/r/R/dplyr-funcs-agg.R b/r/R/dplyr-funcs-agg.R new file mode 100644 index 00..ab1df1d2f1 --- /dev/null +++ b/r/R/dplyr-funcs-agg.R @@ -0,0 +1,198 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Aggregation functions +# +# These all insert into an ..aggregations list (in a parent frame) a list containing: +# @param fun string function name +# @param data list of 0 or more Expressions +# @param options list of function options, as passed to call_function +# The functions return a FieldRef pointing to the result of the aggregation. +# +# For group-by aggregation, `hash_` gets prepended to the function name when +# the query is executed. +# So to see a list of available hash aggregation functions, +# you can use list_compute_functions("^hash_") + +register_bindings_aggregate <- function() { + register_binding_agg("base::sum", function(..., na.rm = FALSE) { +set_agg( + fun = "sum", + data = ensure_one_arg(list2(...), "sum"), + options = list(skip_nulls = na.rm, min_count = 0L) +) + }) + register_binding_agg("base::prod", function(..., na.rm = FALSE) { +set_agg( + fun = "product", + data = ensure_one_arg(list2(...), "prod"), + options = list(skip_nulls = na.rm, min_count = 0L) +) + }) + register_binding_agg("base::any", function(..., na.rm = FALSE) { +set_agg( + fun = "any", + data = ensure_one_arg(list2(...), "any"), + options = list(skip_nulls = na.rm, min_count = 0L) +) + }) + register_binding_agg("base::all", function(..., na.rm = FALSE) { +set_agg( + fun = "all", + data = ensure_one_arg(list2(...), "all"), + options = list(skip_nulls = na.rm, min_count = 0L) +) + }) + register_binding_agg("base::mean", function(x, na.rm = FALSE) { +set_agg( + fun = "mean", + data = list(x), + options = list(skip_nulls = na.rm, min_count = 0L) +) + }) + 
register_binding_agg("stats::sd", function(x, na.rm = FALSE, ddof = 1) { +set_agg( + fun = "stddev", + data = list(x), + options = list(skip_nulls = na.rm, min_count = 0L, ddof = ddof) +) + }) + register_binding_agg("stats::var", function(x, na.rm = FALSE, ddof = 1) { +set_agg( + fun = "variance", + data = list(x), + options = list(skip_nulls = na.rm, min_count = 0L, ddof = ddof) +) + }) + register_binding_agg( +"stats::quantile", +function(x, probs, na.rm = FALSE) { + if (length(probs) != 1) { +arrow_not_supported("quantile() with length(probs) != 1") + } + # TODO: Bind to the Arrow function that returns an exact quantile and remove + # this warni
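The shape shared by all of these bindings -- record the aggregation, return a placeholder reference -- can be sketched in isolation (names and storage here are hypothetical simplifications, not the arrow internals):

```r
# An aggregation "binding" computes nothing; it appends (fun, data, options)
# to a list and hands back the name of a temporary column that the aggregate
# node will later produce.
aggregations <- list()

set_agg <- function(fun, data, options) {
  tmpname <- paste0("..temp", length(aggregations))
  aggregations[[tmpname]] <<- list(fun = fun, data = data, options = options)
  tmpname  # stands in for Expression$field_ref(tmpname)
}

sum_binding <- function(x, na.rm = FALSE) {
  set_agg("sum", list(x), list(skip_nulls = na.rm, min_count = 0L))
}

sum_binding("mpg")    # returns "..temp0"
names(aggregations)   # "..temp0"
```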
(arrow) branch main updated (79799e59b1 -> 5865e96db2)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git

    from 79799e59b1 GH-39664: [C++][Acero] Ensure Acero benchmarks present a metric for identifying throughput (#40884)
     add 5865e96db2 GH-41323: [R] Redo how summarize() evaluates expressions (#41223)

No new revisions were added by this update.

Summary of changes:
 r/NEWS.md | 3 +
 r/R/arrowExports.R | 4 +
 r/R/dplyr-across.R | 1 -
 r/R/dplyr-eval.R | 76 +-
 r/R/dplyr-summarize.R | 345 +++-
 r/R/expression.R | 3 +
 r/src/arrowExports.cpp | 9 +
 r/src/expression.cpp | 17 ++
 r/tests/testthat/test-dplyr-across.R | 20 +-
 r/tests/testthat/test-dplyr-filter.R | 1 -
 r/tests/testthat/test-dplyr-funcs-conditional.R | 15 ++
 r/tests/testthat/test-dplyr-summarize.R | 137 --
 12 files changed, 398 insertions(+), 233 deletions(-)
[arrow-site] branch main updated: MINOR: Update some affiliations (#361)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-site.git

The following commit(s) were added to refs/heads/main by this push:
     new 09c2ba8633a MINOR: Update some affiliations (#361)

09c2ba8633a is described below

commit 09c2ba8633a8cbf1acf813bd83c41c5a1e861ff6
Author: Neal Richardson
AuthorDate: Wed May 31 16:50:30 2023 -0400

    MINOR: Update some affiliations (#361)

---
 _data/committers.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/_data/committers.yml b/_data/committers.yml
index f0d65228e26..a72faa0b1e1 100644
--- a/_data/committers.yml
+++ b/_data/committers.yml
@@ -130,7 +130,7 @@
 - name: Neal Richardson
   role: PMC
   alias: npr
-  affiliation: Voltron Data
+  affiliation: Posit
 - name: Neville Dipale
   role: PMC
   alias: nevime
@@ -371,7 +371,7 @@
 - name: Romain Francois
   role: Committer
   alias: romainfrancois
-  affiliation: RStudio
+  affiliation: Posit
 - name: Ruihang Xia
   role: Committer
   alias: waynexia
[arrow] branch main updated: MINOR: [R] ARROW_ACERO should be ON by default in bundled build (#35407)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/main by this push:
     new c2f7d13e16 MINOR: [R] ARROW_ACERO should be ON by default in bundled build (#35407)

c2f7d13e16 is described below

commit c2f7d13e16c4ec3c8fba551a157cd71398194e6f
Author: Neal Richardson
AuthorDate: Fri May 5 10:04:06 2023 -0400

    MINOR: [R] ARROW_ACERO should be ON by default in bundled build (#35407)

    To match ARROW_DATASET. Without this, the default CRAN version on Linux won't have Acero enabled. This should be cherry-picked for the 12.0.0 CRAN submission.

---
 r/inst/build_arrow_static.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/r/inst/build_arrow_static.sh b/r/inst/build_arrow_static.sh
index 4c7f705708..fe56b9fca9 100755
--- a/r/inst/build_arrow_static.sh
+++ b/r/inst/build_arrow_static.sh
@@ -61,7 +61,7 @@ ${CMAKE} -DARROW_BOOST_USE_SHARED=OFF \
 -DARROW_BUILD_TESTS=OFF \
 -DARROW_BUILD_SHARED=OFF \
 -DARROW_BUILD_STATIC=ON \
--DARROW_ACERO=${ARROW_ACERO:-$ARROW_DEFAULT_PARAM} \
+-DARROW_ACERO=${ARROW_ACERO:-ON} \
 -DARROW_COMPUTE=ON \
 -DARROW_CSV=ON \
 -DARROW_DATASET=${ARROW_DATASET:-ON} \
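The fix relies on shell parameter expansion: `${ARROW_ACERO:-ON}` evaluates to `ON` whenever the variable is unset or empty, so the bundled build defaults Acero on while still honoring an explicit override. The same default-with-override lookup can be done from R (a generic sketch, not code from the repo):

```r
# Mirror shell's ${VAR:-default}: fall back when unset *or* empty.
env_or <- function(var, default) {
  val <- Sys.getenv(var, unset = default)
  if (nzchar(val)) val else default
}

env_or("ARROW_ACERO", "ON")   # "ON" unless the environment overrides it
Sys.setenv(ARROW_ACERO = "OFF")
env_or("ARROW_ACERO", "ON")   # "OFF"
```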
[arrow] branch main updated: GH-35140: [R] Rewrite configure script and ensure we don't use mismatched libarrow (#35147)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/main by this push: new ec89360212 GH-35140: [R] Rewrite configure script and ensure we don't use mismatched libarrow (#35147) ec89360212 is described below commit ec893602124b776fb42261361d1a2d21a6d61f06 Author: Neal Richardson AuthorDate: Wed May 3 10:37:09 2023 -0400 GH-35140: [R] Rewrite configure script and ensure we don't use mismatched libarrow (#35147) I've significantly rewritten `r/configure` to make it easier to reason about and harder for issues like https://github.com/apache/arrow/pull/34229 and #35140 to happen. I've also added a version check to make sure that we don't obviously try to use a system C++ library that doesn't match the R package version. Making sure this was applied in all of the right places and handling what to do if the versions didn't match was the impetus for the whole refactor. `configure` has been broken up into some functions, and the flow of the script is, as is documented at the top of the file: ``` # * Find libarrow on the system. If it is present, make sure # that its version is compatible with the R package. # * If no suitable libarrow is found, download it (where allowed) # or build it from source. # * Determine what features this libarrow has and what other # flags it requires, and set them in src/Makevars for use when # compiling the bindings. # * Run a test program to confirm that arrow headers are found ``` All of the detection of CFLAGS and `-L` dirs etc. happen in one place now, and they all prefer using `pkg-config` to read from the libarrow build what libraries and flags it requires, rather than hard-coding. (autobrew is the only remaining exception, but I didn't feel like messing with that today.) This should make the builds more future proof, should make it so more build configurations work (e.g. 
I suspect that a static build in ARROW_HOME wouldn't have gotten picked up correctly b [...] Version checking has been added in an R script for ease of testing (and for easier handling of arithmetic), and there is an accompanying `test-check-versions.R` added. These are run on all the builds that use `ci/scripts/r_test.sh`. ### Behavior changes * If libarrow is found on the system (via ARROW_HOME, pkg-config, or brew), but the version does not match, it will not be used, and we will try a bundled build. This should mean that users installing a released version will never have libarrow version problems. * If both the found C++ library and R package are on matching dev versions (i.e. not identical given the x.y.z.9000 vs x+1.y.z-SNAPSHOT difference), it will proceed with a warning that you may need to rebuild if there are issues. This means that regular developers will see an extra message in the build output. * autobrew is only used on a release version unless you set FORCE_AUTOBREW=true. This eliminates another source of version mismatches (C++ release version, R dev version). * The path where you could set `LIB_DIR` and `INCLUDE_DIR` env vars has been removed. Use `ARROW_HOME` instead. 
* Closes: #35140 * Closes: #31989 Lead-authored-by: Neal Richardson Co-authored-by: Sutou Kouhei Signed-off-by: Neal Richardson --- dev/tasks/conda-recipes/r-arrow/meta.yaml | 4 +- dev/tasks/r/github.macos.autobrew.yml | 1 + dev/tasks/r/github.packages.yml| 3 +- r/Makefile | 2 +- r/configure| 555 - r/inst/build_arrow_static.sh | 8 +- r/tools/check-versions.R | 59 +++ r/tools/nixlibs.R | 30 +- r/tools/test-check-versions.R | 62 r/vignettes/developers/install_details.Rmd | 42 ++- r/vignettes/developers/install_nix.png | Bin 99333 -> 0 bytes r/vignettes/install.Rmd| 9 +- r/vignettes/install_nightly.Rmd| 2 +- 13 files changed, 502 insertions(+), 275 deletions(-) diff --git a/dev/tasks/conda-recipes/r-arrow/meta.yaml b/dev/tasks/conda-recipes/r-arrow/meta.yaml index 4c86dc9280..28ee8eb92c 100644 --- a/dev/tasks/conda-recipes/r-arrow/meta.yaml +++ b/dev/tasks/conda-recipes/r-arrow/meta.yaml @@ -59,8 +59,8 @@ requirements: test: commands: -- $R -e "library('arrow')" # [not win] -- "\"%R%\" -e \"library('arrow'); data(mtcars); write_parquet(mtcars, 'test.parquet')\"" # [win] +- $R -e "library('arrow'); stopifnot(arrow_with_acero(), arrow_with_dataset(), arrow_with_parquet(), arrow_with_s3())" # [not win] +- "\"%R%\" -e \
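A hedged sketch of the kind of version gate described above (the real rules live in `r/tools/check-versions.R`; the helpers below are invented for illustration): release versions must match exactly, while two development versions -- R's `x.y.z.9000` vs C++'s `x+1.y.z-SNAPSHOT` -- proceed with a warning.

```r
# Heuristics: a 4-component R version is a dev version; a C++ version is
# dev if it carries the -SNAPSHOT suffix.
is_dev_r   <- function(v) length(unclass(package_version(v))[[1]]) == 4
is_dev_cpp <- function(v) grepl("-SNAPSHOT$", v)

release_triplet <- function(v) {
  paste(unclass(package_version(v))[[1]][1:3], collapse = ".")
}

libarrow_ok <- function(r_ver, cpp_ver) {
  if (is_dev_r(r_ver) && is_dev_cpp(cpp_ver)) {
    warning("Both are dev versions; rebuild libarrow if you hit issues")
    return(TRUE)
  }
  # release builds require an exact x.y.z match
  identical(release_triplet(r_ver),
            release_triplet(sub("-SNAPSHOT$", "", cpp_ver)))
}

libarrow_ok("16.0.0", "16.0.0")  # TRUE: matching releases
libarrow_ok("16.0.0", "15.0.0")  # FALSE: mismatch, fall back to bundled build
```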
[arrow] branch main updated (7526df9ad9 -> 14e9e3cb13)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git

    from 7526df9ad9 GH-34946: [Ruby] Remove DictionaryArrayBuilder related omissions (#34947)
     add 14e9e3cb13 MINOR: [R] Unskip acero tests (#34943)

No new revisions were added by this update.

Summary of changes:
 r/R/arrow-info.R | 1 +
 1 file changed, 1 insertion(+)
[arrow] branch master updated: GH-33892: [R] Map `dplyr::n()` to `count_all` kernel (#33917)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
new 0368e410be GH-33892: [R] Map `dplyr::n()` to `count_all` kernel (#33917)

0368e410be is described below

commit 0368e410be4dac30eada13d307b415165aedc6a7
Author: Ian Cook
AuthorDate: Mon Feb 13 10:16:03 2023 -0500

GH-33892: [R] Map `dplyr::n()` to `count_all` kernel (#33917)

### Rationale for this change

This PR is a follow-up to #15083. It allows the R package to register bindings to nullary aggregation functions, and it remaps `dplyr::n()` to the nullary aggregation function `count_all`. This PR also:

- Prepares the R bindings to support aggregation functions with 2+ arguments, although none yet exist in the C++ library
- Removes the heuristics that were used to infer the data types of aggregates, replacing them with actual type determination

### Are these changes tested?

Yes, through existing tests.

### Are there any user-facing changes?

No.
* Closes: #33892 * Closes: #33960 Authored-by: Ian Cook Signed-off-by: Neal Richardson --- r/R/dplyr-collect.R | 18 +++--- r/R/dplyr-funcs.R | 2 +- r/R/dplyr-summarize.R | 102 +--- r/R/query-engine.R | 12 ++-- r/man/register_binding.Rd | 2 +- r/src/compute-exec.cpp | 8 ++- r/tests/testthat/test-dplyr-collapse.R | 4 +- r/tests/testthat/test-dplyr-summarize.R | 10 +++- 8 files changed, 103 insertions(+), 55 deletions(-) diff --git a/r/R/dplyr-collect.R b/r/R/dplyr-collect.R index 395026ce78..f45a9886ea 100644 --- a/r/R/dplyr-collect.R +++ b/r/R/dplyr-collect.R @@ -179,19 +179,15 @@ implicit_schema <- function(.data) { new_fields <- c(left_fields, right_fields) } } else { -# The output schema is based on the aggregations and any group_by vars -new_fields <- map(summarize_projection(.data), ~ .$type(old_schm)) -# * Put group_by_vars first (this can't be done by summarize, -# they have to be last per the aggregate node signature, -# and they get projected to this order after aggregation) -# * Infer the output types from the aggregations -group_fields <- new_fields[.data$group_by_vars] hash <- length(.data$group_by_vars) > 0 -agg_fields <- imap( - new_fields[setdiff(names(new_fields), .data$group_by_vars)], - ~ agg_fun_output_type(.data$aggregations[[.y]][["fun"]], .x, hash) +# The output schema is based on the aggregations and any group_by vars. +# The group_by vars come first (this can't be done by summarize; they have +# to be last per the aggregate node signature, and they get projected to +# this order after aggregation) +new_fields <- c( + group_types(.data, old_schm), + aggregate_types(.data, hash, old_schm) ) -new_fields <- c(group_fields, agg_fields) } schema(!!!new_fields) } diff --git a/r/R/dplyr-funcs.R b/r/R/dplyr-funcs.R index ce88e25bcb..2728a64539 100644 --- a/r/R/dplyr-funcs.R +++ b/r/R/dplyr-funcs.R @@ -49,7 +49,7 @@ NULL #' aggregate function. 
This function must accept `Expression` objects as #' arguments and return a `list()` with components: #' - `fun`: string function name -#' - `data`: `Expression` (these are all currently a single field) +#' - `data`: list of 0 or more `Expression`s #' - `options`: list of function options, as passed to call_function #' @param update_cache Update .cache$functions at the time of registration. #' the default is FALSE because the majority of usage is to register diff --git a/r/R/dplyr-summarize.R b/r/R/dplyr-summarize.R index 5e670538f6..184c0aade4 100644 --- a/r/R/dplyr-summarize.R +++ b/r/R/dplyr-summarize.R @@ -18,7 +18,7 @@ # Aggregation functions # These all return a list of: # @param fun string function name -# @param data Expression (these are all currently a single field) +# @param data list of 0 or more Expressions # @param options list of function options, as passed to call_function # For group-by aggregation, `hash_` gets prepended to the function name. # So to see a list of available hash aggregation functions, @@ -31,28 +31,7 @@ ensure_one_arg <- function(args, fun) { } else if (length(args) > 1) { arrow_not_supported(paste0("Multiple arguments to ", fun, "()")) } - args[[1]] -} - -agg_fun_output_type <- function(fun, input_type, hash) { - # These are quick and dirty heuristics. - if (fun %in% c("any", "all")) { -bool() - } else if (fun %in% "sum") { -# It may upcast to a bigger type but this is close enough -input_type - } else if (fu
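Per the doc change above, each aggregation binding now returns a `data` component that is a list of zero or more `Expression`s, so a nullary aggregate like `n()` carries an empty list. An illustrative mock of the record's shape (plain R values stand in for `Expression` objects; this is not the package's actual binding code):

```r
# Mock of the aggregation record shape described above: n() maps to the
# nullary count_all kernel, so its `data` list is empty.
n_binding <- list(
  fun = "count_all",
  data = list(),     # zero input Expressions: a nullary aggregate
  options = list()
)

# A unary aggregate like sum(x) would instead carry one entry in `data`.
# For group-by aggregation, "hash_" gets prepended to `fun`.
sum_binding <- list(fun = "sum", data = list("x"), options = list(na.rm = TRUE))

length(n_binding$data)    # 0
length(sum_binding$data)  # 1
```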
[arrow] branch master updated: GH-33760: [R][C++] Handle nested field refs in scanner (#33770)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new d0a7fb9403 GH-33760: [R][C++] Handle nested field refs in scanner (#33770) d0a7fb9403 is described below commit d0a7fb9403a904b7850517c745c3925695d8658d Author: Neal Richardson AuthorDate: Tue Jan 24 11:56:30 2023 -0500 GH-33760: [R][C++] Handle nested field refs in scanner (#33770) ### Rationale for this change Followup to https://github.com/apache/arrow/pull/19706/files#r1073391100 with the goal of deleting and simplifying some code. As it turned out, it was more about moving code from the R bindings to the C++ library. ### Are there any user-facing changes? Not for R users, but this fixes a bug in the dataset C++ library where nested field refs could not be handled by the scanner. * Closes: #33760 Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- cpp/src/arrow/dataset/scanner.cc | 23 --- r/R/arrowExports.R | 9 +++-- r/R/query-engine.R | 10 +++--- r/src/arrowExports.cpp | 19 +-- r/src/compute-exec.cpp | 23 +-- r/src/expression.cpp | 19 --- 6 files changed, 40 insertions(+), 63 deletions(-) diff --git a/cpp/src/arrow/dataset/scanner.cc b/cpp/src/arrow/dataset/scanner.cc index f307787357..bc8feec96d 100644 --- a/cpp/src/arrow/dataset/scanner.cc +++ b/cpp/src/arrow/dataset/scanner.cc @@ -22,6 +22,7 @@ #include #include #include +#include #include #include "arrow/array/array_primitive.h" @@ -135,6 +136,7 @@ Result> GetProjectedSchemaFromExpression( const std::shared_ptr& dataset_schema) { // process resultant dataset_schema after projection FieldVector project_fields; + std::set field_names; if (auto call = projection.call()) { if (call->function_name != "make_struct") { return Status::Invalid("Top level projection expression call must be make_struct"); @@ -142,13 +144,11 @@ Result> GetProjectedSchemaFromExpression( 
for (const compute::Expression& arg : call->arguments) { if (auto field_ref = arg.field_ref()) { if (field_ref->IsName()) { - auto field = dataset_schema->GetFieldByName(*field_ref->name()); - if (field) { -project_fields.push_back(std::move(field)); - } - // if the field is not present in the schema we ignore it. - // the case is if kAugmentedFields are present in the expression - // and if they are not present in the provided schema, we ignore them. + field_names.emplace(*field_ref->name()); +} else if (field_ref->IsNested()) { + // We keep the top-level field name. + auto nested_field_refs = *field_ref->nested_refs(); + field_names.emplace(*nested_field_refs[0].name()); } else { return Status::Invalid( "No projected schema was supplied and we could not infer the projected " @@ -157,6 +157,15 @@ Result> GetProjectedSchemaFromExpression( } } } + for (auto f : field_names) { +auto field = dataset_schema->GetFieldByName(f); +if (field) { + // if the field is not present in the schema we ignore it. + // the case is if kAugmentedFields are present in the expression + // and if they are not present in the provided schema, we ignore them. 
+ project_fields.push_back(std::move(field)); +} + } return schema(project_fields); } diff --git a/r/R/arrowExports.R b/r/R/arrowExports.R index 2eeca24dbd..5e807fbab1 100644 --- a/r/R/arrowExports.R +++ b/r/R/arrowExports.R @@ -460,8 +460,8 @@ ExecNode_output_schema <- function(node) { .Call(`_arrow_ExecNode_output_schema`, node) } -ExecNode_Scan <- function(plan, dataset, filter, materialized_field_names) { - .Call(`_arrow_ExecNode_Scan`, plan, dataset, filter, materialized_field_names) +ExecNode_Scan <- function(plan, dataset, filter, projection) { + .Call(`_arrow_ExecNode_Scan`, plan, dataset, filter, projection) } ExecPlan_Write <- function(plan, final_node, metadata, file_write_options, filesystem, base_dir, partitioning, basename_template, existing_data_behavior, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group) { @@ -1088,10 +1088,6 @@ compute___expr__is_field_ref <- function(x) { .Call(`_arrow_compute___expr__is_field_ref`, x) } -field_names_in_expression <- function(x) { - .Call(`_arrow_field_names_in_expression`, x) -} - compute___expr__get_field_ref_name <- function(x) { .Call(`_arrow_compute___expr__get_field_ref_name`, x) } @@ -2095
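The C++ change above first collects the top-level name of each (possibly nested) field ref into a set, then looks each name up in the dataset schema, silently dropping names that are absent. The same two steps expressed in R terms, as a sketch with made-up refs rather than Arrow API calls:

```r
# Each ref is a character vector of path components; a nested ref like
# b.x contributes only its top-level name ("b"), and duplicates collapse
# (mirroring the std::set in the C++ diff).
refs <- list("a", c("b", "x"), c("b", "y"), "c")
top_level <- unique(vapply(refs, function(r) r[[1]], character(1)))
top_level  # "a" "b" "c"

# Names not present in the schema are then dropped without error (the
# kAugmentedFields case mentioned in the diff comments).
schema_names <- c("a", "b")
projected <- intersect(top_level, schema_names)
projected  # "a" "b"
```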
[arrow] branch master updated: GH-18818: [R] Create a field ref to a field in a struct (#19706)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
new 1d9366f19e GH-18818: [R] Create a field ref to a field in a struct (#19706)

1d9366f19e is described below

commit 1d9366f19e4b9846b33cc0c7bd7941cb5f482d74
Author: Neal Richardson
AuthorDate: Wed Jan 18 12:38:06 2023 -0500

GH-18818: [R] Create a field ref to a field in a struct (#19706)

This PR implements `$.Expression` and `[[.Expression` methods, such that if the Expression is a FieldRef, it returns a nested FieldRef. This required revising some assumptions in a few places, particularly that if an Expression is a FieldRef, it has a `name`, and that all FieldRefs correspond to a Field in a Schema. In the case where the Expression is not a FieldRef, it will create an Expression call to `struct_field` to extract the field, iff the Expression has a knowable `type`, the [...]
Things not done because they weren't needed to get this working: * `Expression$field_ref()` take a vector to construct a nested ref * Method to return vector of nested components of a field ref in R Next steps for future PRs: * Wrap this in [tidyr::unpack()](https://tidyr.tidyverse.org/reference/pack.html) method (but unfortunately, unpack() is not a generic) * https://github.com/apache/arrow/issues/33756 * https://github.com/apache/arrow/issues/33757 * https://github.com/apache/arrow/issues/33760 * Closes: #18818 Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/NAMESPACE | 3 ++ r/R/arrow-object.R | 2 +- r/R/arrowExports.R | 9 - r/R/expression.R| 55 + r/R/type.R | 3 ++ r/src/arrowExports.cpp | 19 ++ r/src/compute.cpp | 14 r/src/expression.cpp| 40 +++-- r/tests/testthat/test-dplyr-query.R | 70 + r/tests/testthat/test-expression.R | 26 ++ 10 files changed, 237 insertions(+), 4 deletions(-) diff --git a/r/NAMESPACE b/r/NAMESPACE index 3df107a2d8..3ab828a958 100644 --- a/r/NAMESPACE +++ b/r/NAMESPACE @@ -2,6 +2,7 @@ S3method("!=",ArrowObject) S3method("$",ArrowTabular) +S3method("$",Expression) S3method("$",Schema) S3method("$",StructArray) S3method("$",SubTreeFileSystem) @@ -14,6 +15,7 @@ S3method("[",Dataset) S3method("[",Schema) S3method("[",arrow_dplyr_query) S3method("[[",ArrowTabular) +S3method("[[",Expression) S3method("[[",Schema) S3method("[[",StructArray) S3method("[[<-",ArrowTabular) @@ -137,6 +139,7 @@ S3method(names,Scanner) S3method(names,ScannerBuilder) S3method(names,Schema) S3method(names,StructArray) +S3method(names,StructType) S3method(names,Table) S3method(names,arrow_dplyr_query) S3method(print,"arrow-enum") diff --git a/r/R/arrow-object.R b/r/R/arrow-object.R index 516f407aaf..5c2cf4691f 100644 --- a/r/R/arrow-object.R +++ b/r/R/arrow-object.R @@ -32,7 +32,7 @@ ArrowObject <- R6Class("ArrowObject", assign(".:xp:.", xp, envir = self) }, class_title = function() { - if (!is.null(self$.class_title)) { + if (".class_title" %in% 
ls(self, all.names = TRUE)) { # Allow subclasses to override just printing the class name first class_title <- self$.class_title() } else { diff --git a/r/R/arrowExports.R b/r/R/arrowExports.R index 38f1ecfb97..2eeca24dbd 100644 --- a/r/R/arrowExports.R +++ b/r/R/arrowExports.R @@ -1084,6 +1084,10 @@ compute___expr__call <- function(func_name, argument_list, options) { .Call(`_arrow_compute___expr__call`, func_name, argument_list, options) } +compute___expr__is_field_ref <- function(x) { + .Call(`_arrow_compute___expr__is_field_ref`, x) +} + field_names_in_expression <- function(x) { .Call(`_arrow_field_names_in_expression`, x) } @@ -1096,6 +1100,10 @@ compute___expr__field_ref <- function(name) { .Call(`_arrow_compute___expr__field_ref`, name) } +compute___expr__nested_field_ref <- function(x, name) { + .Call(`_arrow_compute___expr__nested_field_ref`, x, name) +} + compute___expr__scalar <- function(x) { .Call(`_arrow_compute___expr__scalar`, x) } @@ -2087,4 +2095,3 @@ SetIOThreadPoolCapacity <- function(threads) { Array__infer_type <- function(x) { .Call(`_arrow_Array__infer_type`, x) } - diff --git a/r/R/expression.R b/r/R/expression.R index a1163c12a8..8f84b4b31e 100644 --- a/r/R/expression.R +++ b/r/R/expression.R @@ -57,6 +57,9 @@ Expression <-
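The `$.Expression` method described in the commit above turns repeated `$` access into an ever-deeper nested reference. A toy S3 class showing the dispatch pattern (a mock class for illustration, not arrow's actual `Expression`):

```r
# Each `$` appends one component to the path, mimicking how a nested
# FieldRef is built up from a top-level one.
new_ref <- function(path) structure(list(path = path), class = "mock_ref")

# Inside the method, [[ is used instead of $ so we don't recurse into
# our own $ method when reading the existing path.
"$.mock_ref" <- function(x, name) new_ref(c(x[["path"]], name))

r <- new_ref("a")$b$c
r[["path"]]  # "a" "b" "c"
```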
[arrow] branch master updated: MINOR: [R] Fix for dev purrr (#14581)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 04917f944b MINOR: [R] Fix for dev purrr (#14581) 04917f944b is described below commit 04917f944b65b73cc954b5b243f193a5b336f0f8 Author: Hadley Wickham AuthorDate: Thu Nov 3 12:00:51 2022 -0500 MINOR: [R] Fix for dev purrr (#14581) The recycling rules in map2() are now stricter, so we need to check that `x` actually has columns before applying the metadata. I also mildly refactored the test to make it easier to run in isolation; I'm happy to revert those changes if desired. Authored-by: Hadley Wickham Signed-off-by: Neal Richardson --- r/R/metadata.R | 8 +++-- r/tests/testthat/test-metadata.R | 65 +++- 2 files changed, 35 insertions(+), 38 deletions(-) diff --git a/r/R/metadata.R b/r/R/metadata.R index 747f08069e..6a54b3e384 100644 --- a/r/R/metadata.R +++ b/r/R/metadata.R @@ -86,9 +86,11 @@ apply_arrow_r_metadata <- function(x, r_metadata) { call. 
= FALSE ) } else { - x <- map2(x, columns_metadata, function(.x, .y) { -apply_arrow_r_metadata(.x, .y) - }) + if (length(x) > 0) { +x <- map2(x, columns_metadata, function(.x, .y) { + apply_arrow_r_metadata(.x, .y) +}) + } } x } diff --git a/r/tests/testthat/test-metadata.R b/r/tests/testthat/test-metadata.R index 21b7ebe11a..4cf8e49af1 100644 --- a/r/tests/testthat/test-metadata.R +++ b/r/tests/testthat/test-metadata.R @@ -254,8 +254,6 @@ test_that("Row-level metadata (does not) roundtrip in datasets", { skip_if_not_available("dataset") skip_if_not_available("parquet") - library(dplyr, warn.conflicts = FALSE) - df <- tibble::tibble( metadata = list( structure(1, my_value_as_attr = 1), @@ -269,39 +267,36 @@ test_that("Row-level metadata (does not) roundtrip in datasets", { dst_dir <- make_temp_dir() - withr::with_options( -list("arrow.preserve_row_level_metadata" = TRUE), -{ - expect_warning( -write_dataset(df, dst_dir, partitioning = "part"), -"Row-level metadata is not compatible with datasets and will be discarded" - ) - - # Reset directory as previous write will have created some files and the default - # behavior is to error on existing - dst_dir <- make_temp_dir() - # but we need to write a dataset with row-level metadata to make sure when - # reading ones that have been written with them we warn appropriately - fake_func_name <- write_dataset - fake_func_name(df, dst_dir, partitioning = "part") - - ds <- open_dataset(dst_dir) - expect_warning( -df_from_ds <- collect(ds), -"Row-level metadata is not compatible with this operation and has been ignored" - ) - expect_equal( -arrange(df_from_ds, int), -arrange(df, int), -ignore_attr = TRUE - ) - - # however there is *no* warning if we don't select the metadata column - expect_warning( -df_from_ds <- ds %>% select(int) %>% collect(), -NA - ) -} + withr::local_options("arrow.preserve_row_level_metadata" = TRUE) + + expect_warning( +write_dataset(df, dst_dir, partitioning = "part"), +"Row-level metadata is not 
compatible with datasets and will be discarded" + ) + + # Reset directory as previous write will have created some files and the default + # behavior is to error on existing + dst_dir <- make_temp_dir() + # but we need to write a dataset with row-level metadata to make sure when + # reading ones that have been written with them we warn appropriately + fake_func_name <- write_dataset + fake_func_name(df, dst_dir, partitioning = "part") + + ds <- open_dataset(dst_dir) + expect_warning( +df_from_ds <- collect(ds), +"Row-level metadata is not compatible with this operation and has been ignored" + ) + expect_equal( +dplyr::arrange(df_from_ds, int), +dplyr::arrange(df, int), +ignore_attr = TRUE + ) + + # however there is *no* warning if we don't select the metadata column + expect_warning( +df_from_ds <- ds %>% dplyr::select(int) %>% dplyr::collect(), +NA ) })
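The purrr fix above hinges on one guard: with stricter recycling in dev purrr, mapping a zero-column data frame against a zero-length metadata list must be skipped explicitly. A self-contained sketch of that guard, with base R's `Map()` standing in for `purrr::map2()` and a hypothetical helper name:

```r
# Sketch of the length(x) > 0 guard added in r/R/metadata.R: only apply
# per-column metadata when there are columns to apply it to.
apply_col_metadata <- function(x, columns_metadata) {
  if (length(x) > 0) {
    x[] <- Map(function(col, meta) col, x, columns_metadata)
  }
  x
}

apply_col_metadata(data.frame(), list())             # zero columns: returned untouched
apply_col_metadata(data.frame(a = 1:2), list(NULL))  # one column, one metadata entry
```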
[arrow] branch master updated: ARROW-15460: [R] Add as.data.frame.Dataset method (#14461)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 5e53978b56 ARROW-15460: [R] Add as.data.frame.Dataset method (#14461) 5e53978b56 is described below commit 5e53978b56aa13f9c033f2e849cc22f2aed6e2d3 Author: Neal Richardson AuthorDate: Wed Nov 2 19:15:40 2022 -0400 ARROW-15460: [R] Add as.data.frame.Dataset method (#14461) Plus some refactoring and disentangling of compute/collect methods Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/NAMESPACE | 2 ++ r/R/dataset.R | 7 - r/R/dplyr-collect.R | 57 +++-- r/R/dplyr-group-by.R| 30 ++ r/R/dplyr.R | 6 - r/R/metadata.R | 13 +++--- r/R/table.R | 14 +- r/man/as_arrow_table.Rd | 3 +++ r/man/open_dataset.Rd | 2 +- r/tests/testthat/test-dataset.R | 5 r/tests/testthat/test-udf.R | 1 + 11 files changed, 92 insertions(+), 48 deletions(-) diff --git a/r/NAMESPACE b/r/NAMESPACE index 4a0c6ed261..0b18ace9ad 100644 --- a/r/NAMESPACE +++ b/r/NAMESPACE @@ -29,6 +29,7 @@ S3method(as.character,ArrowDatum) S3method(as.character,FileFormat) S3method(as.character,FragmentScanOptions) S3method(as.data.frame,ArrowTabular) +S3method(as.data.frame,Dataset) S3method(as.data.frame,RecordBatchReader) S3method(as.data.frame,Schema) S3method(as.data.frame,StructArray) @@ -47,6 +48,7 @@ S3method(as_arrow_array,data.frame) S3method(as_arrow_array,default) S3method(as_arrow_array,pyarrow.lib.Array) S3method(as_arrow_array,vctrs_list_of) +S3method(as_arrow_table,Dataset) S3method(as_arrow_table,RecordBatch) S3method(as_arrow_table,RecordBatchReader) S3method(as_arrow_table,Schema) diff --git a/r/R/dataset.R b/r/R/dataset.R index 54ac30e56b..78b59ecc24 100644 --- a/r/R/dataset.R +++ b/r/R/dataset.R @@ -131,7 +131,7 @@ #' dir.create(tf) #' on.exit(unlink(tf)) #' -#' write_dataset(mtcars, tf, partitioning="cyl") +#' write_dataset(mtcars, tf, partitioning 
= "cyl") #' #' # You can specify a directory containing the files for your dataset and #' # open_dataset will scan all files in your directory. @@ -397,6 +397,11 @@ dim.Dataset <- function(x) c(x$num_rows, x$num_cols) #' @export c.Dataset <- function(...) Dataset$create(list(...)) +#' @export +as.data.frame.Dataset <- function(x, row.names = NULL, optional = FALSE, ...) { + collect.Dataset(x) +} + #' @export head.Dataset <- function(x, n = 6L, ...) { head(Scanner$create(x), n) diff --git a/r/R/dplyr-collect.R b/r/R/dplyr-collect.R index 8bf22728d6..395026ce78 100644 --- a/r/R/dplyr-collect.R +++ b/r/R/dplyr-collect.R @@ -19,19 +19,8 @@ # The following S3 methods are registered on load if dplyr is present collect.arrow_dplyr_query <- function(x, as_data_frame = TRUE, ...) { - tryCatch( -out <- as_arrow_table(x), -# n = 4 because we want the error to show up as being from collect() -# and not augment_io_error_msg() -error = function(e, call = caller_env(n = 4)) { - augment_io_error_msg(e, call, schema = x$.data$schema) -} - ) - - if (as_data_frame) { -out <- as.data.frame(out) - } - restore_dplyr_features(out, x) + out <- compute.arrow_dplyr_query(x) + collect.ArrowTabular(out, as_data_frame) } collect.ArrowTabular <- function(x, as_data_frame = TRUE, ...) { if (as_data_frame) { @@ -40,10 +29,27 @@ collect.ArrowTabular <- function(x, as_data_frame = TRUE, ...) { x } } -collect.Dataset <- collect.RecordBatchReader <- function(x, ...) dplyr::collect(as_adq(x), ...) +collect.Dataset <- function(x, as_data_frame = TRUE, ...) { + collect.ArrowTabular(compute.Dataset(x), as_data_frame) +} +collect.RecordBatchReader <- collect.Dataset -compute.arrow_dplyr_query <- function(x, ...) dplyr::collect(x, as_data_frame = FALSE) compute.ArrowTabular <- function(x, ...) x +compute.arrow_dplyr_query <- function(x, ...) { + # TODO: should this tryCatch move down into as_arrow_table()? 
+ tryCatch( +as_arrow_table(x), +# n = 4 because we want the error to show up as being from compute() +# and not augment_io_error_msg() +error = function(e, call = caller_env(n = 4)) { + # Use a dummy schema() here because the CSV file reader handler is only + # valid when you read_csv_arrow() with a schema, but Dataset always has + # schema + # TODO: clean up this + augment_io_error_msg(e, call, schema = schema()) +} + ) +} compute.Dataset <- compute.RecordBatchReader <- compute.arrow_dplyr_query pull.Dataset <- function(.data, @@ -93,27 +99,6 @@ handle_pull_as_vector <- f
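The `as.data.frame.Dataset` method added above is thin delegation: the conversion method reuses the existing collect machinery rather than duplicating it. The same pattern with mock classes (illustrative names only, not arrow's real classes or signatures):

```r
# as.data.frame() for a new class simply forwards to the existing
# collect-style function, keeping base R's generic signature.
collect_mock <- function(x, as_data_frame = TRUE, ...) {
  out <- data.frame(x[["cols"]])
  if (as_data_frame) out else x
}

as.data.frame.mock_dataset <- function(x, row.names = NULL, optional = FALSE, ...) {
  collect_mock(x)
}

ds <- structure(list(cols = list(n = 1:3)), class = "mock_dataset")
as.data.frame(ds)  # a 3-row data.frame with column n
```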
[arrow] branch master updated (0e162a5499 -> f29be8020e)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 0e162a5499 ARROW-18183: [C++] cpp-micro benchmarks are failing on mac arm machine (#14562) add f29be8020e ARROW-18203: [R] Refactor to remove unnecessary uses of build_expr (#14553) No new revisions were added by this update. Summary of changes: r/DESCRIPTION | 2 + r/R/arrow-datum.R | 11 +- r/R/compute.R | 184 -- r/R/dplyr-datetime-helpers.R| 75 +++--- r/R/dplyr-eval.R| 2 +- r/R/dplyr-funcs-conditional.R | 55 +++-- r/R/dplyr-funcs-datetime.R | 141 +-- r/R/dplyr-funcs-math.R | 29 +-- r/R/{expression.R => dplyr-funcs-simple.R} | 211 ++-- r/R/dplyr-funcs-string.R| 8 + r/R/dplyr-funcs-type.R | 47 ++-- r/R/dplyr-funcs.R | 24 +- r/R/expression.R| 310 +--- r/R/udf.R | 200 +++ r/man/Expression.Rd | 8 +- r/man/register_binding.Rd | 20 +- r/man/register_scalar_function.Rd | 2 +- r/tests/testthat/_snaps/{compute.md => udf.md} | 0 r/tests/testthat/test-dplyr-funcs-datetime.R| 47 ++-- r/tests/testthat/{test-compute.R => test-udf.R} | 0 20 files changed, 514 insertions(+), 862 deletions(-) copy r/R/{expression.R => dplyr-funcs-simple.R} (50%) create mode 100644 r/R/udf.R rename r/tests/testthat/_snaps/{compute.md => udf.md} (100%) rename r/tests/testthat/{test-compute.R => test-udf.R} (100%)
[arrow] branch master updated (8066c5e1f2 -> d045fc5d65)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 8066c5e1f2 ARROW-13980: [Go] Implement Scalar ApproxEquals (#14543) add d045fc5d65 ARROW-17462: [R] Cast scalars to type of field in Expression building (#13985) No new revisions were added by this update. Summary of changes: r/R/compute.R | 2 +- r/R/expression.R | 125 ++--- r/tests/testthat/test-dataset-dplyr.R | 6 +- r/tests/testthat/test-dplyr-collapse.R | 4 +- r/tests/testthat/test-dplyr-filter.R | 30 r/tests/testthat/test-dplyr-mutate.R | 2 +- r/tests/testthat/test-dplyr-query.R| 87 +++ r/tests/testthat/test-expression.R | 5 +- 8 files changed, 228 insertions(+), 33 deletions(-)
[arrow] branch master updated (2e84cb8f24 -> eb45b86fe8)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 2e84cb8f24 ARROW-18132: [R] Add deprecation cycle for pull() change (#14475) add eb45b86fe8 ARROW-18132: [R] Add deprecation cycle for pull() change (#14475) No new revisions were added by this update. Summary of changes:
[arrow] branch master updated (24c0fce142 -> 3a0ee3f391)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 24c0fce142 ARROW-17871: [Go] initial binary arithmetic implementation (#14255) add 3a0ee3f391 ARROW-17954: [R] Update news for 10.0 (#14337) No new revisions were added by this update. Summary of changes: r/NEWS.md | 65 +++ r/R/dplyr-funcs-doc.R | 2 +- r/man/acero.Rd| 2 +- 3 files changed, 67 insertions(+), 2 deletions(-)
[arrow] branch master updated (5f5ea7b0e1 -> cd33544533)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 5f5ea7b0e1 ARROW-18078: [Docs][R] Fix broken link in R documentation (#14437) add cd33544533 ARROW-17849: [R][Docs] Document changes due to C++17 for centos-7 users (#14440) No new revisions were added by this update. Summary of changes: .github/workflows/r.yml | 1 - ci/scripts/r_docker_configure.sh | 15 -- ci/scripts/r_test.sh | 13 - ci/scripts/r_windows_build.sh| 44 +--- dev/tasks/r/github.packages.yml | 13 ++--- r/README.md | 23 +--- r/configure | 10 +++- r/tools/nixlibs.R| 110 +++ r/tools/test-nixlibs.R | 17 +- r/vignettes/developers/setup.Rmd | 8 +-- r/vignettes/install.Rmd | 93 - 11 files changed, 156 insertions(+), 191 deletions(-)
[arrow] branch master updated (f5e592eb5e -> 0b86e40622)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from f5e592eb5e ARROW-15540: [C++] Allow the substrait consumer to accept plans with hints and nullable literals (#14402) add 0b86e40622 ARROW-18053: [Dev] Fix a bug that merge_arrow_pr.py doesn't detect Co-authored-by: (#14416) No new revisions were added by this update. Summary of changes: dev/archery/archery/utils/lint.py | 1 + dev/merge_arrow_pr.py | 4 ++-- 2 files changed, 3 insertions(+), 2 deletions(-)
[arrow] branch master updated (8b8841d4d7 -> f5e592eb5e)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 8b8841d4d7 ARROW-18055: [C++] arrow-dataset-dataset-writer-test still times out occassionally (#14428) add f5e592eb5e ARROW-15540: [C++] Allow the substrait consumer to accept plans with hints and nullable literals (#14402) No new revisions were added by this update. Summary of changes: .../arrow/engine/substrait/expression_internal.cc | 3 +- .../arrow/engine/substrait/relation_internal.cc| 16 +- cpp/src/arrow/engine/substrait/serde.cc| 5 +- cpp/src/arrow/engine/substrait/serde.h | 9 +- cpp/src/arrow/engine/substrait/serde_test.cc | 246 +++-- 5 files changed, 199 insertions(+), 80 deletions(-)
[arrow] branch master updated (d809c28508 -> 8b8841d4d7)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from d809c28508 ARROW-17965: [C++] ExecBatch support for ChunkedArray values (#14348) add 8b8841d4d7 ARROW-18055: [C++] arrow-dataset-dataset-writer-test still times out occassionally (#14428) No new revisions were added by this update. Summary of changes: cpp/src/arrow/dataset/dataset_writer.cc | 5 - cpp/src/arrow/util/async_util.cc| 3 +++ cpp/src/arrow/util/async_util_test.cc | 23 +-- 3 files changed, 28 insertions(+), 3 deletions(-)
[arrow] branch master updated (99b40926c7 -> f0cf5c2033)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 99b40926c7 ARROW-18058: [Dev][Archery] Remove removed ARROW_JNI related code (#14419) add f0cf5c2033 ARROW-18062: [R] error in CI jobs for R 3.5 and 3.6 when R package being installed (#14424) No new revisions were added by this update. Summary of changes: r/R/dplyr-funcs.R | 12 r/R/dplyr-slice.R | 14 +- r/tests/testthat/test-dplyr-slice.R | 4 +++- 3 files changed, 16 insertions(+), 14 deletions(-)
[arrow] branch master updated (ee1f763084 -> 2f57194fd3)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from ee1f763084 ARROW-15838: [R] Coalesce join keys in full outer join (#14286) add 2f57194fd3 ARROW-18061: [CI][R] Reduce number of jobs on every commit (#14420) No new revisions were added by this update. Summary of changes: .github/workflows/r.yml | 18 +- 1 file changed, 1 insertion(+), 17 deletions(-)
[arrow] branch master updated (81e1fbc1de -> ee1f763084)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 81e1fbc1de ARROW-17665: [R] Document dplyr and compute functionality (#14387) add ee1f763084 ARROW-15838: [R] Coalesce join keys in full outer join (#14286) No new revisions were added by this update. Summary of changes: r/R/arrowExports.R | 4 +- r/R/dplyr-collect.R| 29 + r/R/dplyr-join.R | 89 -- r/R/query-engine.R | 7 ++- r/src/arrowExports.cpp | 8 ++-- r/src/arrow_types.h| 2 + r/src/compute-exec.cpp | 32 +++--- r/tests/testthat/test-dplyr-join.R | 81 -- 8 files changed, 178 insertions(+), 74 deletions(-)
[arrow] branch master updated (8972ebd812 -> 81e1fbc1de)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 8972ebd812 ARROW-17556: [C++] Unbound scan projection expression leads to all fields being loaded (#14264) add 81e1fbc1de ARROW-17665: [R] Document dplyr and compute functionality (#14387) No new revisions were added by this update. Summary of changes: r/R/arrow-package.R | 26 +- r/R/dplyr-funcs-datetime.R | 520 +++ r/R/dplyr-funcs-doc.R| 104 +++--- r/R/dplyr-funcs-string.R | 196 +- r/R/dplyr-funcs-type.R | 67 ++-- r/R/dplyr-funcs.R| 7 +- r/R/dplyr-summarize.R| 75 ++-- r/data-raw/docgen.R | 18 +- r/man/acero.Rd | 104 +++--- r/tests/testthat/test-dplyr-filter.R | 4 +- r/tests/testthat/test-dplyr-funcs-datetime.R | 4 +- 11 files changed, 621 insertions(+), 504 deletions(-)
[arrow] branch master updated: ARROW-18057: [R] test for slice functions fail on builds without Datasets capability (#14418)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 82c26c8ebe ARROW-18057: [R] test for slice functions fail on builds without Datasets capability (#14418) 82c26c8ebe is described below commit 82c26c8ebe71def3461365a1c974ee6eccd11a06 Author: Nic Crane AuthorDate: Fri Oct 14 18:04:39 2022 +0100 ARROW-18057: [R] test for slice functions fail on builds without Datasets capability (#14418) Authored-by: Nic Crane Signed-off-by: Neal Richardson --- r/tests/testthat/test-dplyr-slice.R | 1 + 1 file changed, 1 insertion(+) diff --git a/r/tests/testthat/test-dplyr-slice.R b/r/tests/testthat/test-dplyr-slice.R index c12dd97aa4..5b577e0388 100644 --- a/r/tests/testthat/test-dplyr-slice.R +++ b/r/tests/testthat/test-dplyr-slice.R @@ -119,6 +119,7 @@ test_that("slice_sample, ungrouped", { expect_lte(sampled_n, 2) # Test with dataset, which matters for the UDF HACK + skip_if_not_available("dataset") sampled_n <- tab %>% InMemoryDataset$create() %>% slice_sample(n = 2) %>%
[arrow] branch master updated (31f2a01275 -> 883580883a)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 31f2a01275 MINOR: [R][Docs] Fix the note about to read timestamp with timezone column from csv (#14413) add 883580883a ARROW-17485: [R] Allow TRUE/FALSE to the compression option of `write_feather` (`write_ipc_file`) (#13935) No new revisions were added by this update. Summary of changes: r/R/feather.R | 7 ++- r/man/write_feather.Rd | 4 +++- r/tests/testthat/test-feather.R | 6 ++ 3 files changed, 15 insertions(+), 2 deletions(-)
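The ARROW-17485 change above lets `write_feather()` (and `write_ipc_file()`) accept a logical for `compression`. A minimal sketch of the new usage, under the assumption that `TRUE` selects the build's default codec and `FALSE` writes uncompressed (the exact codec chosen for `TRUE` depends on how Arrow was built):

```r
library(arrow)

df <- data.frame(x = 1:3)

# Sketch only: TRUE is assumed to pick the default compression codec,
# FALSE to write an uncompressed IPC file.
tf1 <- tempfile(fileext = ".arrow")
tf2 <- tempfile(fileext = ".arrow")
write_feather(df, tf1, compression = TRUE)
write_feather(df, tf2, compression = FALSE)

# Both files round-trip the same data.
read_feather(tf1)
```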
[arrow] branch master updated (2cbf489158 -> 31f2a01275)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 2cbf489158 ARROW-12105: [R] Replace vars_select, vars_rename with eval_select, eval_rename (#14371) add 31f2a01275 MINOR: [R][Docs] Fix the note about to read timestamp with timezone column from csv (#14413) No new revisions were added by this update. Summary of changes: r/R/csv.R | 4 ++-- r/man/read_delim_arrow.Rd | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-)
[arrow] branch master updated (d1a8f4ba19 -> d008c17e24)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from d1a8f4ba19 ARROW-18048: [Dev][Archery][Crossbow] Comment bot waits for a while before generate a report (#14412) add d008c17e24 ARROW-17737: [R] Groups before conversion to a Table must not be restored after `collect()` (#14175) No new revisions were added by this update. Summary of changes: r/R/dplyr-collect.R| 13 +++-- r/R/dplyr.R| 8 +++- r/tests/testthat/test-dplyr-group-by.R | 33 + 3 files changed, 47 insertions(+), 7 deletions(-)
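ARROW-17737 concerns what happens to `dplyr` groupings when a grouped data frame is converted to a Table and later collected. A sketch of the scenario the fix addresses (hedged: the precise grouping semantics are defined by the patch itself, this only reproduces the setup):

```r
library(arrow)
library(dplyr)

# Convert an already-grouped data frame to an Arrow Table, then collect().
# The question the fix settles is whether the pre-conversion groups
# reappear on the collected result.
tab <- arrow_table(group_by(mtcars, cyl))
tab %>% collect() %>% group_vars()
```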
[arrow] branch master updated: ARROW-15602: [R][Docs] Update docs to explain how to read timestamp with timezone columns (#13877)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 7ef4b4a0ae ARROW-15602: [R][Docs] Update docs to explain how to read timestamp with timezone columns (#13877) 7ef4b4a0ae is described below commit 7ef4b4a0ae0c6c15a45ec439e348e26e1e80523d Author: eitsupi <50911393+eits...@users.noreply.github.com> AuthorDate: Fri Oct 14 10:00:10 2022 +0900 ARROW-15602: [R][Docs] Update docs to explain how to read timestamp with timezone columns (#13877) If users expect `read_csv_arrow` to behave the same as `readr::read_csv`, they will be confused by the presence or absence of a time zone, so adds a note is provided in the example. Adds the same example to the test to verify that the error occurs. Also update the type description to link to the Arrow type documentation. Authored-by: SHIMA Tatsuya Signed-off-by: Neal Richardson --- r/R/csv.R | 33 ++--- r/man/read_delim_arrow.Rd | 33 ++--- r/tests/testthat/test-csv.R | 19 +-- 3 files changed, 61 insertions(+), 24 deletions(-) diff --git a/r/R/csv.R b/r/R/csv.R index 71e01971f4..7b474c137e 100644 --- a/r/R/csv.R +++ b/r/R/csv.R @@ -54,17 +54,17 @@ #' single string, one character per column, where the characters map to Arrow #' types analogously to the `readr` type mapping: #' -#' * "c": `utf8()` -#' * "i": `int32()` -#' * "n": `float64()` -#' * "d": `float64()` -#' * "l": `bool()` -#' * "f": `dictionary()` -#' * "D": `date32()` -#' * "T": `timestamp(unit = "ns")` -#' * "t": `time32()` (The `unit` arg is set to the default value `"ms"`) -#' * "_": `null()` -#' * "-": `null()` +#' * "c": [utf8()] +#' * "i": [int32()] +#' * "n": [float64()] +#' * "d": [float64()] +#' * "l": [bool()] +#' * "f": [dictionary()] +#' * "D": [date32()] +#' * "T": [`timestamp(unit = "ns")`][timestamp()] +#' * "t": [time32()] (The `unit` arg is set to the default value `"ms"`) 
+#' * "_": [null()] +#' * "-": [null()] #' * "?": infer the type from the data #' #' If you use the compact string representation for `col_types`, you must also @@ -143,6 +143,17 @@ #' read_csv_arrow(tf, schema = schema(x = int32(), y = utf8()), skip = 1) #' read_csv_arrow(tf, col_types = schema(y = utf8())) #' read_csv_arrow(tf, col_types = "ic", col_names = c("x", "y"), skip = 1) +#' +#' # Note that if a timestamp column contains time zones, type inference won't work, +#' # whether automatic or via the string "T" `col_types` specification. +#' # To parse timestamps with time zones, provide a [Schema] to `col_types` +#' # and specify the time zone in the type object: +#' tf <- tempfile() +#' write.csv(data.frame(x = "1970-01-01T12:00:00+12:00"), file = tf, row.names = FALSE) +#' read_csv_arrow( +#' tf, +#' col_types = schema(x = timestamp(unit = "us", timezone = "UTC")) +#' ) read_delim_arrow <- function(file, delim = ",", quote = '"', diff --git a/r/man/read_delim_arrow.Rd b/r/man/read_delim_arrow.Rd index f322c56c17..5b91fc0ec9 100644 --- a/r/man/read_delim_arrow.Rd +++ b/r/man/read_delim_arrow.Rd @@ -180,17 +180,17 @@ that \code{readr} uses to the \code{col_types} argument. 
This means you provide single string, one character per column, where the characters map to Arrow types analogously to the \code{readr} type mapping: \itemize{ -\item "c": \code{utf8()} -\item "i": \code{int32()} -\item "n": \code{float64()} -\item "d": \code{float64()} -\item "l": \code{bool()} -\item "f": \code{dictionary()} -\item "D": \code{date32()} -\item "T": \code{timestamp(unit = "ns")} -\item "t": \code{time32()} (The \code{unit} arg is set to the default value \code{"ms"}) -\item "_": \code{null()} -\item "-": \code{null()} +\item "c": \code{\link[=utf8]{utf8()}} +\item "i": \code{\link[=int32]{int32()}} +\item "n": \code{\link[=float64]{float64()}} +\item "d": \code{\link[=float64]{float64()}} +\item "l": \code{\link[=bool]{bool()}} +\item "f": \code{\link[=dictionary]{dictionary()}} +\item "D": \code{\link[=date32]{date32()}} +\item "T": \code{\link[=timestamp]{timestamp(unit = "ns")}} +\item "t": \code{\link[=time32]{time32()}} (The \code{unit}
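The example added by this patch can be run directly to see the schema-based workaround for timestamps that carry time zones:

```r
library(arrow)

# Timestamps with embedded time zones defeat type inference (whether
# automatic or via "T" in col_types); supply a Schema with an explicit
# timestamp type instead, as the patched docs show.
tf <- tempfile()
write.csv(data.frame(x = "1970-01-01T12:00:00+12:00"), file = tf, row.names = FALSE)
read_csv_arrow(
  tf,
  col_types = schema(x = timestamp(unit = "us", timezone = "UTC"))
)
```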
[arrow] branch master updated: ARROW-13766: [R] Add slice_*() methods (#14361)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 80e398623d ARROW-13766: [R] Add slice_*() methods (#14361) 80e398623d is described below commit 80e398623d956304acaeb3922e367d45ed96ddec Author: Neal Richardson AuthorDate: Thu Oct 13 19:59:32 2022 -0400 ARROW-13766: [R] Add slice_*() methods (#14361) This PR implements `slice_head,()` `slice_tail()`, `slice_min()`, `slice_max()` and `slice_sample()`. `slice_sample()` requires a clever hack using a UDF because the `random()` C++ function apparently does not work; see ARROW-17974. Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/.lintr| 1 + r/DESCRIPTION | 1 + r/NAMESPACE | 3 + r/R/array.R | 4 +- r/R/arrow-datum.R | 6 ++ r/R/arrow-package.R | 27 - r/R/dataset-scan.R | 16 ++- r/R/dplyr-funcs-doc.R | 17 ++-- r/R/dplyr-funcs-type.R | 4 +- r/R/dplyr-funcs.R | 14 ++- r/R/dplyr-slice.R | 158 + r/R/dplyr.R | 10 ++ r/R/expression.R| 7 +- r/R/record-batch-reader.R | 5 + r/R/util.R | 3 +- r/data-raw/docgen.R | 3 + r/man/acero.Rd | 17 ++-- r/tests/testthat/test-dplyr-slice.R | 192 18 files changed, 464 insertions(+), 24 deletions(-) diff --git a/r/.lintr b/r/.lintr index 619339afca..1bd80aff4c 100644 --- a/r/.lintr +++ b/r/.lintr @@ -27,5 +27,6 @@ linters: linters_with_defaults( ) exclusions: list( "R/arrowExports.R", + "R/dplyr-funcs-doc.R", "data-raw/codegen.R" ) diff --git a/r/DESCRIPTION b/r/DESCRIPTION index 4b526e8b8a..5a69d46896 100644 --- a/r/DESCRIPTION +++ b/r/DESCRIPTION @@ -116,6 +116,7 @@ Collate: 'dplyr-join.R' 'dplyr-mutate.R' 'dplyr-select.R' +'dplyr-slice.R' 'dplyr-summarize.R' 'dplyr-union.R' 'record-batch.R' diff --git a/r/NAMESPACE b/r/NAMESPACE index e20e61c0e3..59055ff2b7 100644 --- a/r/NAMESPACE +++ b/r/NAMESPACE @@ -421,6 +421,8 @@ importFrom(rlang,as_quosure) importFrom(rlang,call2) 
importFrom(rlang,call_args) importFrom(rlang,caller_env) +importFrom(rlang,check_dots_empty) +importFrom(rlang,dots_list) importFrom(rlang,dots_n) importFrom(rlang,enexpr) importFrom(rlang,enexprs) @@ -472,6 +474,7 @@ importFrom(stats,na.fail) importFrom(stats,na.omit) importFrom(stats,na.pass) importFrom(stats,quantile) +importFrom(stats,runif) importFrom(tidyselect,all_of) importFrom(tidyselect,contains) importFrom(tidyselect,ends_with) diff --git a/r/R/array.R b/r/R/array.R index 7c2fb5c783..c730bd742b 100644 --- a/r/R/array.R +++ b/r/R/array.R @@ -349,7 +349,7 @@ stop_cant_convert_array <- function(x, type) { "Can't create Array from object of type %s", paste(class(x), collapse = " / ") ), - call = rlang::caller_env() + call = caller_env() ) } else { abort( @@ -358,7 +358,7 @@ stop_cant_convert_array <- function(x, type) { format(type$code()), paste(class(x), collapse = " / ") ), - call = rlang::caller_env() + call = caller_env() ) } } diff --git a/r/R/arrow-datum.R b/r/R/arrow-datum.R index 33c67a5285..cb3bfa57f6 100644 --- a/r/R/arrow-datum.R +++ b/r/R/arrow-datum.R @@ -299,6 +299,9 @@ head.ArrowDatum <- function(x, n = 6L, ...) { } else { n <- min(len, n) } + if (!is.integer(n)) { +n <- floor(n) + } if (n == len) { return(x) } @@ -310,6 +313,9 @@ head.ArrowDatum <- function(x, n = 6L, ...) { tail.ArrowDatum <- function(x, n = 6L, ...) 
{ assert_is(n, c("numeric", "integer")) assert_that(length(n) == 1) + if (!is.integer(n)) { +n <- floor(n) + } len <- NROW(x) if (n < 0) { # tail(x, negative) means all but the first n rows diff --git a/r/R/arrow-package.R b/r/R/arrow-package.R index 143f4c191b..477fa67e7c 100644 --- a/r/R/arrow-package.R +++ b/r/R/arrow-package.R @@ -26,7 +26,7 @@ #' @importFrom rlang expr caller_env is_character quo_name is_quosure enexpr enexprs as_quosure #' @importFrom rlang is_list call2 is_empty as_function as_label arg_match is_symbol is_call call_args #' @importFrom rlang quo_set_env quo_get_env is_formula quo_is_call f_rhs parse_expr f_env new_quosure -#' @importFrom rlang new_quosures expr_text +#' @importFrom rlang new_quosures expr_text caller_env check_dots_empty dots_list #' @importFrom tidyselect vars_pull vars_renam
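The new `slice_*()` bindings follow the usual dplyr signatures; a minimal sketch of calling them on an in-memory Table (assuming the standard dplyr argument names):

```r
library(arrow)
library(dplyr)

tab <- arrow_table(mtcars)

# slice_head()/slice_tail() take n rows from either end of the query result;
# slice_min()/slice_max() order by a column first.
tab %>% slice_head(n = 3) %>% collect()
tab %>% slice_max(mpg, n = 2) %>% collect()

# slice_sample() draws rows at random (implemented via a UDF per the
# commit message, since the random() C++ function was not usable).
tab %>% slice_sample(n = 5) %>% collect()
```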
[arrow] branch master updated (66e8ba5a1e -> 959a9d5dee)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 66e8ba5a1e MINOR: [R][Docs] Add note about conversion from JSON types to Arrow types (#13871) add 959a9d5dee ARROW-17788: [R][Doc] Add example of using Scanner (#14184) No new revisions were added by this update. Summary of changes: r/R/dataset-scan.R| 24 +++- r/R/dataset.R | 5 ++--- r/man/Scanner.Rd | 27 ++- r/man/open_dataset.Rd | 5 ++--- 4 files changed, 53 insertions(+), 8 deletions(-)
[arrow] branch master updated: MINOR: [R][Docs] Add note about conversion from JSON types to Arrow types (#13871)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 66e8ba5a1e MINOR: [R][Docs] Add note about conversion from JSON types to Arrow types (#13871) 66e8ba5a1e is described below commit 66e8ba5a1e07eaee19f040aa4df5a840614ed790 Author: eitsupi <50911393+eits...@users.noreply.github.com> AuthorDate: Thu Oct 13 22:18:49 2022 +0900 MINOR: [R][Docs] Add note about conversion from JSON types to Arrow types (#13871) Add note about conversion from JSON types to Arrow types. These documents were copied from `docs/source/python/json.rst` with modifications. Also, show the data frame in the example to make it easier to understand how the conversion is performed. Authored-by: SHIMA Tatsuya Signed-off-by: Neal Richardson --- r/R/json.R | 16 ++-- r/man/read_json_arrow.Rd | 18 -- 2 files changed, 30 insertions(+), 4 deletions(-) diff --git a/r/R/json.R b/r/R/json.R index 2b1f4916cb..c4061f066b 100644 --- a/r/R/json.R +++ b/r/R/json.R @@ -21,7 +21,19 @@ #' data frame or Arrow Table. #' #' If passed a path, will detect and handle compression from the file extension -#' (e.g. `.json.gz`). Accepts explicit or implicit nulls. +#' (e.g. `.json.gz`). +#' +#' If `schema` is not provided, Arrow data types are inferred from the data: +#' - JSON null values convert to the [null()] type, but can fall back to any other type. +#' - JSON booleans convert to [boolean()]. +#' - JSON numbers convert to [int64()], falling back to [float64()] if a non-integer is encountered. +#' - JSON strings of the kind "YYYY-MM-DD" and "YYYY-MM-DD hh:mm:ss" convert to [`timestamp(unit = "s")`][timestamp()], +#' falling back to [utf8()] if a conversion error occurs. +#' - JSON arrays convert to a [list_of()] type, and inference proceeds recursively on the JSON arrays' values.
+#' - Nested JSON objects convert to a [struct()] type, and inference proceeds recursively on the JSON objects' values. +#' +#' When `as_data_frame = FALSE`, Arrow types are further converted to R types. +#' See `vignette("arrow", package = "arrow")` for details. #' #' @inheritParams read_delim_arrow #' @param schema [Schema] that describes the table. @@ -37,7 +49,7 @@ #' { "hello": 3.25, "world": null } #' { "hello": 0.0, "world": true, "yo": null } #' ', tf, useBytes = TRUE) -#' df <- read_json_arrow(tf) +#' read_json_arrow(tf) read_json_arrow <- function(file, col_select = NULL, as_data_frame = TRUE, diff --git a/r/man/read_json_arrow.Rd b/r/man/read_json_arrow.Rd index 2ad600725f..cc821c3301 100644 --- a/r/man/read_json_arrow.Rd +++ b/r/man/read_json_arrow.Rd @@ -41,7 +41,21 @@ data frame or Arrow Table. } \details{ If passed a path, will detect and handle compression from the file extension -(e.g. \code{.json.gz}). Accepts explicit or implicit nulls. +(e.g. \code{.json.gz}). + +If \code{schema} is not provided, Arrow data types are inferred from the data: \itemize{ +\item JSON null values convert to the \code{\link[=null]{null()}} type, but can fall back to any other type. +\item JSON booleans convert to \code{\link[=boolean]{boolean()}}. +\item JSON numbers convert to \code{\link[=int64]{int64()}}, falling back to \code{\link[=float64]{float64()}} if a non-integer is encountered. +\item JSON strings of the kind "YYYY-MM-DD" and "YYYY-MM-DD hh:mm:ss" convert to \code{\link[=timestamp]{timestamp(unit = "s")}}, +falling back to \code{\link[=utf8]{utf8()}} if a conversion error occurs. +\item JSON arrays convert to a \code{\link[=list_of]{list_of()}} type, and inference proceeds recursively on the JSON arrays' values. +\item Nested JSON objects convert to a \code{\link[=struct]{struct()}} type, and inference proceeds recursively on the JSON objects' values. +} + +When \code{as_data_frame = FALSE}, Arrow types are further converted to R types.
+See \code{vignette("arrow", package = "arrow")} for details. } \examples{ \dontshow{if (arrow_with_json()) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf} @@ -52,6 +66,6 @@ writeLines(' { "hello": 3.25, "world": null } { "hello": 0.0, "world": true, "yo": null } ', tf, useBytes = TRUE) -df <- read_json_arrow(tf) +read_json_arrow(tf) \dontshow{\}) # examplesIf} }
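The revised example in the patch exercises the inference rules listed above. Using only the rows visible in the diff hunk, the expectation per those rules is that `"hello"` infers as a floating-point column, `"world"` as boolean, and `"yo"` as the null type:

```r
library(arrow)

tf <- tempfile()
writeLines('
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "yo": null }
', tf, useBytes = TRUE)

# Per the inference rules in the new docs: non-integer JSON numbers fall
# back to float64; a null followed by a boolean infers boolean; an
# all-null column stays null().
read_json_arrow(tf)
```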
[arrow] branch master updated: MINOR: [R][Docs] Add note about use Schema as the `col_types` argument of `read_csv_arrow` (#13872)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new a47cd526d7 MINOR: [R][Docs] Add note about use Schema as the `col_types` argument of `read_csv_arrow` (#13872) a47cd526d7 is described below commit a47cd526d7cfd28632c0ff92c97b59920f4ebb01 Author: eitsupi <50911393+eits...@users.noreply.github.com> AuthorDate: Thu Oct 13 22:18:30 2022 +0900 MINOR: [R][Docs] Add note about use Schema as the `col_types` argument of `read_csv_arrow` (#13872) Authored-by: SHIMA Tatsuya Signed-off-by: Neal Richardson --- r/R/csv.R | 4 ++-- r/man/read_delim_arrow.Rd | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/r/R/csv.R b/r/R/csv.R index 4523298416..71e01971f4 100644 --- a/r/R/csv.R +++ b/r/R/csv.R @@ -98,8 +98,8 @@ #' column names and will not be included in the data frame. If `FALSE`, column #' names will be generated by Arrow, starting with "f0", "f1", ..., "fN". #' Alternatively, you can specify a character vector of column names. -#' @param col_types A compact string representation of the column types, or -#' `NULL` (the default) to infer types from the data. +#' @param col_types A compact string representation of the column types, +#' an Arrow [Schema], or `NULL` (the default) to infer types from the data. #' @param col_select A character vector of column names to keep, as in the #' "select" argument to `data.table::fread()`, or a #' [tidy selection specification][tidyselect::vars_select()] diff --git a/r/man/read_delim_arrow.Rd b/r/man/read_delim_arrow.Rd index 997a7f4101..f322c56c17 100644 --- a/r/man/read_delim_arrow.Rd +++ b/r/man/read_delim_arrow.Rd @@ -96,8 +96,8 @@ column names and will not be included in the data frame. If \code{FALSE}, column names will be generated by Arrow, starting with "f0", "f1", ..., "fN". 
Alternatively, you can specify a character vector of column names.} -\item{col_types}{A compact string representation of the column types, or -\code{NULL} (the default) to infer types from the data.} +\item{col_types}{A compact string representation of the column types, +an Arrow \link{Schema}, or \code{NULL} (the default) to infer types from the data.} \item{col_select}{A character vector of column names to keep, as in the "select" argument to \code{data.table::fread()}, or a
[arrow] branch master updated (093a4fe346 -> 20626f833b)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 093a4fe346 ARROW-17971: [Format][Docs] Add ADBC (#14079) add 20626f833b ARROW-17439: [R] Change behavior of pull to compute instead of collect (#14330) No new revisions were added by this update. Summary of changes: r/R/dplyr-collect.R | 8 -- r/tests/testthat/test-dataset-write.R| 4 ++- r/tests/testthat/test-dataset.R | 41 r/tests/testthat/test-dplyr-arrange.R| 3 +- r/tests/testthat/test-dplyr-funcs-datetime.R | 3 +- r/tests/testthat/test-dplyr-query.R | 9 +++--- 6 files changed, 47 insertions(+), 21 deletions(-)
[arrow] branch master updated (e8afe800aa -> fa3cf78e3f)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from e8afe800aa ARROW-17988: [C++] Remove index_sequence_for and aligned_union backports (#14372) add fa3cf78e3f MINOR: [R][CI] Fix typo in docker configure script (#14374) No new revisions were added by this update. Summary of changes: ci/scripts/r_docker_configure.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[arrow] branch master updated: ARROW-17885: [R] Return BLOB data as list of raw instead of a list of integers (#14277)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 73cfd2d0d0 ARROW-17885: [R] Return BLOB data as list of raw instead of a list of integers (#14277) 73cfd2d0d0 is described below commit 73cfd2d0d0e1e5a2192fb73e5262c77953664f81 Author: Dewey Dunnington AuthorDate: Mon Oct 10 17:08:34 2022 -0300 ARROW-17885: [R] Return BLOB data as list of raw instead of a list of integers (#14277) This PR adds support for `blob::blob()`, which is common in R database land to denote "binary", and `vctrs::list_of()`, which is similar, easy, and helps a bit with list of things that happen to be all NULL. We have our own infrastructure for binary and lists of things too, which I assume pre-dates the mature vctrs and blob? Should we consider having `as.vector()` output those objects instead of the custom `arrow_list/large_list/binary` classes we implement here? 
Lead-authored-by: Dewey Dunnington Co-authored-by: Dewey Dunnington Signed-off-by: Neal Richardson --- r/DESCRIPTION| 1 + r/NAMESPACE | 4 +++ r/R/array.R | 20 + r/R/type.R | 14 + r/src/r_to_arrow.cpp | 2 +- r/src/type_infer.cpp | 29 +++--- r/tests/testthat/_snaps/Array.md | 8 + r/tests/testthat/test-Array.R| 64 +++- r/tests/testthat/test-type.R | 32 9 files changed, 161 insertions(+), 13 deletions(-) diff --git a/r/DESCRIPTION b/r/DESCRIPTION index cf83f56390..4b526e8b8a 100644 --- a/r/DESCRIPTION +++ b/r/DESCRIPTION @@ -45,6 +45,7 @@ RoxygenNote: 7.2.1 Config/testthat/edition: 3 VignetteBuilder: knitr Suggests: +blob, cli, DBI, dbplyr, diff --git a/r/NAMESPACE b/r/NAMESPACE index 8b08b940b3..24a9e14bb6 100644 --- a/r/NAMESPACE +++ b/r/NAMESPACE @@ -41,9 +41,11 @@ S3method(as.vector,ArrowDatum) S3method(as_arrow_array,Array) S3method(as_arrow_array,ChunkedArray) S3method(as_arrow_array,Scalar) +S3method(as_arrow_array,blob) S3method(as_arrow_array,data.frame) S3method(as_arrow_array,default) S3method(as_arrow_array,pyarrow.lib.Array) +S3method(as_arrow_array,vctrs_list_of) S3method(as_arrow_table,RecordBatch) S3method(as_arrow_table,RecordBatchReader) S3method(as_arrow_table,Table) @@ -100,7 +102,9 @@ S3method(head,Scanner) S3method(head,arrow_dplyr_query) S3method(infer_type,ArrowDatum) S3method(infer_type,Expression) +S3method(infer_type,blob) S3method(infer_type,default) +S3method(infer_type,vctrs_list_of) S3method(is.finite,ArrowDatum) S3method(is.infinite,ArrowDatum) S3method(is.na,ArrowDatum) diff --git a/r/R/array.R b/r/R/array.R index 938c8e4b04..7c2fb5c783 100644 --- a/r/R/array.R +++ b/r/R/array.R @@ -322,6 +322,26 @@ as_arrow_array.data.frame <- function(x, ..., type = NULL) { } } +#' @export +as_arrow_array.vctrs_list_of <- function(x, ..., type = NULL) { + type <- type %||% infer_type(x) + if (!inherits(type, "ListType") && !inherits(type, "LargeListType")) { +stop_cant_convert_array(x, type) + } + + as_arrow_array(unclass(x), type = type) +} + +#' 
@export +as_arrow_array.blob <- function(x, ..., type = NULL) { + type <- type %||% infer_type(x) + if (!type$Equals(binary()) && !type$Equals(large_binary())) { +stop_cant_convert_array(x, type) + } + + as_arrow_array(unclass(x), type = type) +} + stop_cant_convert_array <- function(x, type) { if (is.null(type)) { abort( diff --git a/r/R/type.R b/r/R/type.R index d4d7d52ad5..5089789f6c 100644 --- a/r/R/type.R +++ b/r/R/type.R @@ -111,6 +111,20 @@ infer_type.default <- function(x, ..., from_array_infer_type = FALSE) { } } +#' @export +infer_type.vctrs_list_of <- function(x, ...) { + list_of(infer_type(attr(x, "ptype"))) +} + +#' @export +infer_type.blob <- function(x, ...) { + if (sum(lengths(x)) > .Machine$integer.max) { +large_binary() + } else { +binary() + } +} + #' @export infer_type.ArrowDatum <- function(x, ...) x$type diff --git a/r/src/r_to_arrow.cpp b/r/src/r_to_arrow.cpp index aa51799585..c472d8286f 100644 --- a/r/src/r_to_arrow.cpp +++ b/r/src/r_to_arrow.cpp @@ -743,7 +743,7 @@ Status check_binary(SEXP x, int64_t size) { // check this is a list of raw vectors const SEXP* p_x = VECTOR_PTR_RO(x); for (R_xlen_t i = 0; i < size; i++, ++p_x) { -if (TYPEOF(*p_x) != RAWSXP) { +if (TYPEOF(*p_x) != RAWSXP && (*p_x != R_NilValue)) { return Status::Invalid("invalid R type to convert to binary"); } } diff --git a/r/src/type_infer.cpp b/r/src/type_infer.cpp index e30d0e1288.
[arrow] branch master updated (7f63ee5033 -> 76d6cbb5c5)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 7f63ee5033 ARROW-17976: [C++] Use generic lambdas in arrow/compare.cc (#14363) add 76d6cbb5c5 ARROW-17594: [R][Packaging] Build binaries with devtoolset 8 on CentOS 7 (#14243) No new revisions were added by this update. Summary of changes: ci/docker/centos-7-cpp.dockerfile | 29 ci/scripts/r_docker_configure.sh | 6 + dev/tasks/macros.jinja| 7 +++--- dev/tasks/r/github.packages.yml | 47 ++- dev/tasks/tasks.yml | 4 +--- docker-compose.yml| 7 ++ r/inst/build_arrow_static.sh | 4 +++- r/tools/nixlibs.R | 2 +- 8 files changed, 69 insertions(+), 37 deletions(-)
[arrow] branch master updated (5aff7a5b76 -> c93a10b3d2)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 5aff7a5b76 ARROW-17930: [CI][C++] Valgrind failure in PrintValue (#14317) add c93a10b3d2 MINOR: [R] Adapt stringr::str_c mapping for upcoming release (#14296) No new revisions were added by this update. Summary of changes: r/R/dplyr-funcs-string.R | 3 +++ r/tests/testthat/test-dplyr-funcs-string.R | 10 -- 2 files changed, 7 insertions(+), 6 deletions(-)
[arrow] branch master updated (b7f9dfc2b1 -> 776626e56b)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from b7f9dfc2b1 ARROW-16879: [R][CI] Test R GCS bindings with testbench (#13542) add 776626e56b ARROW-17903: [JS] Update dependencies (#14285) No new revisions were added by this update. Summary of changes: js/package.json| 42 +- js/src/builder/list.ts |2 +- js/src/io/adapters.ts |6 +- js/src/util/buffer.ts |2 +- js/yarn.lock | 2373 +--- 5 files changed, 1251 insertions(+), 1174 deletions(-)
[arrow] branch master updated (4660180848 -> b7f9dfc2b1)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 4660180848 ARROW-17450 : [C++][Parquet] Support RLE decode for boolean datatype (#14147) add b7f9dfc2b1 ARROW-16879: [R][CI] Test R GCS bindings with testbench (#13542) No new revisions were added by this update. Summary of changes: .github/workflows/r.yml | 13 ++ ci/scripts/r_test.sh | 8 - r/DESCRIPTION | 1 + r/R/filesystem.R | 9 +- r/tests/testthat/helper-filesystems.R | 190 r/tests/testthat/helper-skip.R| 15 +- r/tests/testthat/test-gcs.R | 48 + r/tests/testthat/test-s3-minio.R | 329 +- 8 files changed, 360 insertions(+), 253 deletions(-) create mode 100644 r/tests/testthat/helper-filesystems.R
[arrow] branch master updated (2748f3d9fa -> d60d8c6dd4)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 2748f3d9fa MINOR: [CI] Use secrets for bucket name in preview-docs job (#14270) add d60d8c6dd4 ARROW-17848: [R] Skip lubridate::format_ISO8601 tests until next release (#14282) No new revisions were added by this update. Summary of changes: r/tests/testthat/test-dplyr-funcs-datetime.R | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-)
[arrow] branch master updated (7a3d801095 -> 7a56846811)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 7a3d801095 ARROW-17669: [Go] Take Function kernels for Record batch, Tables and Chunked Arrays (#14214) add 7a56846811 MINOR: [R] Import the missing `rlang::quo` function (#14091) No new revisions were added by this update. Summary of changes: r/NAMESPACE | 1 + r/R/arrow-package.R | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-)
[arrow] branch master updated (4f31bfc2ff -> 2577ac1a10)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 4f31bfc2ff ARROW-17318: [C++][Dataset] Support async streaming interface for getting fragments in Dataset (#13804) add 2577ac1a10 ARROW-17690: [R] Implement dplyr::across() inside distinct() (#14154) No new revisions were added by this update. Summary of changes: r/R/dplyr-funcs-doc.R | 2 +- r/data-raw/docgen.R| 4 ++-- r/man/acero.Rd | 2 +- r/tests/testthat/test-dplyr-distinct.R | 10 ++ 4 files changed, 14 insertions(+), 4 deletions(-)
[arrow] branch master updated (529f653dfa -> 7969164930)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 529f653dfa ARROW-17517: [C++] Remove internal headers from substrait API (#14131) add 7969164930 MINOR: [R] Forward compatibility for tidyselect 1.2 (#14170) No new revisions were added by this update. Summary of changes: r/tests/testthat/test-dplyr-filter.R | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-)
[arrow] branch master updated: MINOR: [R] Fix lint warnings and run styler over everything (#14153)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 6bc2e010d9 MINOR: [R] Fix lint warnings and run styler over everything (#14153) 6bc2e010d9 is described below commit 6bc2e010d9fb4e50d8a9490ec5fa092f2f8783b4 Author: Neal Richardson AuthorDate: Fri Sep 16 13:18:38 2022 -0400 MINOR: [R] Fix lint warnings and run styler over everything (#14153) Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/DESCRIPTION| 2 +- r/R/arrowExports.R | 1 - r/R/dplyr-datetime-helpers.R | 8 +- r/R/dplyr-funcs-doc.R| 28 +++ r/R/dplyr.R | 3 +- r/data-raw/docgen.R | 12 +-- r/man/acero.Rd | 4 +- r/man/show_exec_plan.Rd | 2 +- r/tests/testthat/test-Table.R| 1 - r/tests/testthat/test-compute.R | 2 +- r/tests/testthat/test-dataset-dplyr.R| 40 +- r/tests/testthat/test-dataset.R | 4 +- r/tests/testthat/test-dplyr-across.R | 1 - r/tests/testthat/test-dplyr-funcs-datetime.R | 109 ++- r/tests/testthat/test-dplyr-funcs-math.R | 3 +- r/tests/testthat/test-dplyr-funcs-string.R | 3 +- r/tests/testthat/test-dplyr-funcs-type.R | 2 +- r/tools/winlibs.R| 2 +- 18 files changed, 100 insertions(+), 127 deletions(-) diff --git a/r/DESCRIPTION b/r/DESCRIPTION index 7b60f0c510..90e84d34bc 100644 --- a/r/DESCRIPTION +++ b/r/DESCRIPTION @@ -41,7 +41,7 @@ Imports: utils, vctrs Roxygen: list(markdown = TRUE, r6 = FALSE, load = "source") -RoxygenNote: 7.2.0 +RoxygenNote: 7.2.1 Config/testthat/edition: 3 VignetteBuilder: knitr Suggests: diff --git a/r/R/arrowExports.R b/r/R/arrowExports.R index 6e76cd6468..35c73e547c 100644 --- a/r/R/arrowExports.R +++ b/r/R/arrowExports.R @@ -2043,4 +2043,3 @@ SetIOThreadPoolCapacity <- function(threads) { Array__infer_type <- function(x) { .Call(`_arrow_Array__infer_type`, x) } - diff --git a/r/R/dplyr-datetime-helpers.R b/r/R/dplyr-datetime-helpers.R index 4c9a8d1bf0..ba9bb0d543 
100644 --- a/r/R/dplyr-datetime-helpers.R +++ b/r/R/dplyr-datetime-helpers.R @@ -442,8 +442,10 @@ parse_period_unit <- function(x) { str_unit <- substr(x, capture_start[[2]], capture_end[[2]]) str_multiple <- substr(x, capture_start[[1]], capture_end[[1]]) - known_units <- c("nanosecond", "microsecond", "millisecond", "second", - "minute", "hour", "day", "week", "month", "quarter", "year") + known_units <- c( +"nanosecond", "microsecond", "millisecond", "second", +"minute", "hour", "day", "week", "month", "quarter", "year" + ) # match the period unit str_unit_start <- substr(str_unit, 1, 3) @@ -464,7 +466,7 @@ parse_period_unit <- function(x) { if (capture_length[[1]] == 0) { multiple <- 1L - # otherwise parse the multiple +# otherwise parse the multiple } else { multiple <- as.numeric(str_multiple) diff --git a/r/R/dplyr-funcs-doc.R b/r/R/dplyr-funcs-doc.R index cac0310f49..cbfe475232 100644 --- a/r/R/dplyr-funcs-doc.R +++ b/r/R/dplyr-funcs-doc.R @@ -88,12 +88,12 @@ #' as `arrow_ascii_is_decimal`. 
#' #' ## arrow -#' +#' #' * [`add_filename()`][arrow::add_filename()] #' * [`cast()`][arrow::cast()] #' #' ## base -#' +#' #' * [`-`][-()] #' * [`!`][!()] #' * [`!=`][!=()] @@ -179,13 +179,15 @@ #' * [`trunc()`][base::trunc()] #' #' ## bit64 -#' +#' #' * [`as.integer64()`][bit64::as.integer64()] #' * [`is.integer64()`][bit64::is.integer64()] #' #' ## dplyr -#' -#' * [`across()`][dplyr::across()]: only supported inside `mutate()`, `summarize()`, and `arrange()`; purrr-style lambda functions and use of `where()` selection helper not yet supported +#' +#' * [`across()`][dplyr::across()]: supported inside `mutate()`, `summarize()`, `group_by()`, and `arrange()`; +#' purrr-style lambda functions +#' and use of `where()` selection helper not yet supported #' * [`between()`][dplyr::between()] #' * [`case_when()`][dplyr::case_when()] #' * [`coalesce()`][dplyr::coalesce()] @@ -195,7 +197,7 @@ #' * [`n_distinct()`][dplyr::n_distinct()] #' #' ## lubridate -#' +#' #' * [`am()`][lubridate::am()] #' * [`as_date()`][lubridate::as_date()] #' * [`as_datetime()`][lubridate::as_datetime()] @@ -270,11 +272,11 @@ #' * [`yq()`][lubridate::yq()] #' #' ## methods -#' +#' #' * [`i
[arrow] branch master updated (b48d2287be -> 6926672147)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from b48d2287be ARROW-17704: [Java][FlightRPC] Update to Junit 5 (#14103) add 6926672147 ARROW-17643: [R] Latest duckdb release is causing test failure (#14149) No new revisions were added by this update. Summary of changes: r/tests/testthat/test-duckdb.R | 4 1 file changed, 4 insertions(+)
[arrow] branch master updated (2e72e0a808 -> 93626eebd0)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 2e72e0a808 ARROW-17407: [Doc][FlightRPC] Flight/gRPC best practices (#13873) add 93626eebd0 ARROW-15011: [R] Generate documentation for dplyr function bindings (#14014) No new revisions were added by this update. Summary of changes: r/DESCRIPTION | 1 + r/Makefile | 1 + r/R/arrow-package.R | 51 +-- r/R/dplyr-funcs-augmented.R | 19 ++- r/R/dplyr-funcs-datetime.R | 53 --- r/R/dplyr-funcs-doc.R | 332 +++ r/R/dplyr-funcs-string.R| 86 ++- r/R/dplyr-funcs-type.R | 43 +++--- r/R/dplyr-funcs.R | 17 ++- r/R/expression.R| 11 +- r/_pkgdown.yml | 1 + r/data-raw/docgen.R | 192 + r/man/acero.Rd | 339 r/man/add_filename.Rd | 23 +++ r/man/cast.Rd | 38 + r/man/register_binding.Rd | 11 +- 16 files changed, 1109 insertions(+), 109 deletions(-) create mode 100644 r/R/dplyr-funcs-doc.R create mode 100644 r/data-raw/docgen.R create mode 100644 r/man/acero.Rd create mode 100644 r/man/add_filename.Rd create mode 100644 r/man/cast.Rd
[arrow] branch master updated (5c773bb922 -> 5c13049d97)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 5c773bb922 ARROW-17673: [R] `desc` in `dplyr::arrange` should allow `dplyr::` prefix (#14090) add 5c13049d97 ARROW-16190: [CI][R] Implement CI on Apple M1 for R (#14099) No new revisions were added by this update. Summary of changes: dev/tasks/macros.jinja | 4 ++-- dev/tasks/python-wheels/github.osx.arm64.yml | 16 dev/tasks/r/github.macos.autobrew.yml| 4 ++-- dev/tasks/r/github.packages.yml | 28 +--- dev/tasks/tasks.yml | 3 ++- dev/tasks/verify-rc/github.macos.arm64.yml | 2 +- 6 files changed, 32 insertions(+), 25 deletions(-)
[arrow] branch master updated (d8f64eecf3 -> 5c773bb922)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from d8f64eecf3 ARROW-17172: [C++][Python] test_cython_api fails on windows (#14133) add 5c773bb922 ARROW-17673: [R] `desc` in `dplyr::arrange` should allow `dplyr::` prefix (#14090) No new revisions were added by this update. Summary of changes: r/R/dplyr-arrange.R | 2 +- r/tests/testthat/test-dplyr-arrange.R | 26 ++ 2 files changed, 27 insertions(+), 1 deletion(-)
[arrow] branch master updated (05b7fe35cf -> 6c675c3534)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 05b7fe35cf ARROW-17674: [R] Implement dplyr::across() inside arrange() (#14092) add 6c675c3534 ARROW-15481: [R] [CI] Add a crossbow job that mimics CRAN's old macOS (#13925) No new revisions were added by this update. Summary of changes: dev/tasks/macros.jinja| 3 +++ dev/tasks/r/github.macos.autobrew.yml | 2 +- dev/tasks/r/github.packages.yml | 34 +++--- r/tools/autobrew | 2 +- 4 files changed, 28 insertions(+), 13 deletions(-)
[arrow] branch master updated: ARROW-17674: [R] Implement dplyr::across() inside arrange() (#14092)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 05b7fe35cf ARROW-17674: [R] Implement dplyr::across() inside arrange() (#14092) 05b7fe35cf is described below commit 05b7fe35cf7c0dbba4d3c86882bb93560e606a13 Author: eitsupi <50911393+eits...@users.noreply.github.com> AuthorDate: Tue Sep 13 00:16:13 2022 +0900 ARROW-17674: [R] Implement dplyr::across() inside arrange() (#14092) Authored-by: SHIMA Tatsuya Signed-off-by: Neal Richardson --- r/R/dplyr-arrange.R | 3 ++- r/tests/testthat/test-dplyr-arrange.R | 15 +++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/r/R/dplyr-arrange.R b/r/R/dplyr-arrange.R index 247a539f52..2f9ef61bb3 100644 --- a/r/R/dplyr-arrange.R +++ b/r/R/dplyr-arrange.R @@ -20,7 +20,8 @@ arrange.arrow_dplyr_query <- function(.data, ..., .by_group = FALSE) { call <- match.call() - exprs <- quos(...) + exprs <- expand_across(.data, quos(...)) + if (.by_group) { # when the data is grouped and .by_group is TRUE, order the result by # the grouping columns first diff --git a/r/tests/testthat/test-dplyr-arrange.R b/r/tests/testthat/test-dplyr-arrange.R index fee1475a44..edec572d10 100644 --- a/r/tests/testthat/test-dplyr-arrange.R +++ b/r/tests/testthat/test-dplyr-arrange.R @@ -201,3 +201,18 @@ test_that("arrange() with bad inputs", { fixed = TRUE ) }) + +test_that("Can use across() within arrange()", { + compare_dplyr_binding( +.input %>% + arrange(across(starts_with("d"))) %>% + collect(), +example_data + ) + compare_dplyr_binding( +.input %>% + arrange(across(starts_with("d"), desc)) %>% + collect(), +example_data + ) +})
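The behavior added by this commit can be exercised from user code. A minimal sketch, assuming the arrow and dplyr packages are installed; the data frame and column names here are made up for illustration:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(dbl = c(2, 1, 3), dbl2 = c(5, 4, 6), chr = c("b", "a", "c"))

# across() inside arrange() expands to one sort key per matched column;
# wrapping with desc sorts each matched column in descending order
arrow_table(df) %>%
  arrange(across(starts_with("dbl"), desc)) %>%
  collect()
```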
[arrow] branch master updated (1b9c57e208 -> 80bba29961)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 1b9c57e208 ARROW-17453: [Go][C++][Parquet] Inconsistent Data with Repetition Levels (#13982) add 80bba29961 ARROW-17463: [R] Avoid unnecessary projections (#13954) No new revisions were added by this update. Summary of changes: r/R/query-engine.R | 24 -- r/tests/testthat/test-dplyr-collapse.R | 36 +++ r/tests/testthat/test-dplyr-query.R | 82 + r/tests/testthat/test-dplyr-summarize.R | 41 - 4 files changed, 147 insertions(+), 36 deletions(-)
[arrow] branch master updated: ARROW-15260: [R] open_dataset - add file_name as column (#12826)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 838687178f ARROW-15260: [R] open_dataset - add file_name as column (#12826) 838687178f is described below commit 838687178fda7f82e31668f502e2f94071ce8077 Author: Nic Crane AuthorDate: Wed Aug 10 01:19:40 2022 +0100 ARROW-15260: [R] open_dataset - add file_name as column (#12826) Authored-by: Nic Crane Signed-off-by: Neal Richardson --- r/DESCRIPTION | 1 + r/R/dataset.R | 1 + r/R/dplyr-collect.R | 11 + r/R/dplyr-funcs-augmented.R | 22 ++ r/R/dplyr-funcs.R | 1 + r/R/dplyr.R | 3 ++ r/R/util.R | 31 +- r/src/compute-exec.cpp | 8 ++-- r/tests/testthat/test-dataset.R | 94 - 9 files changed, 164 insertions(+), 8 deletions(-) diff --git a/r/DESCRIPTION b/r/DESCRIPTION index 308a7ec3fa..95c1405869 100644 --- a/r/DESCRIPTION +++ b/r/DESCRIPTION @@ -98,6 +98,7 @@ Collate: 'dplyr-distinct.R' 'dplyr-eval.R' 'dplyr-filter.R' +'dplyr-funcs-augmented.R' 'dplyr-funcs-conditional.R' 'dplyr-funcs-datetime.R' 'dplyr-funcs-math.R' diff --git a/r/R/dataset.R b/r/R/dataset.R index 12765fbfc0..d86962cc1d 100644 --- a/r/R/dataset.R +++ b/r/R/dataset.R @@ -224,6 +224,7 @@ open_dataset <- function(sources, # and not handle_parquet_io_error() error = function(e, call = caller_env(n = 4)) { handle_parquet_io_error(e, format, call) + abort(conditionMessage(e), call = call) } ) } diff --git a/r/R/dplyr-collect.R b/r/R/dplyr-collect.R index 3e83475a8c..8049e46eb5 100644 --- a/r/R/dplyr-collect.R +++ b/r/R/dplyr-collect.R @@ -25,6 +25,8 @@ collect.arrow_dplyr_query <- function(x, as_data_frame = TRUE, ...) 
{ # and not handle_csv_read_error() error = function(e, call = caller_env(n = 4)) { handle_csv_read_error(e, x$.data$schema, call) + handle_augmented_field_misuse(e, call) + abort(conditionMessage(e), call = call) } ) @@ -104,10 +106,18 @@ add_suffix <- function(fields, common_cols, suffix) { } implicit_schema <- function(.data) { + # Get the source data schema so that we can evaluate expressions to determine + # the output schema. Note that we don't use source_data() because we only + # want to go one level up (where we may have called implicit_schema() before) .data <- ensure_group_vars(.data) old_schm <- .data$.data$schema + # Add in any augmented fields that may exist in the query but not in the + # real data, in case we have FieldRefs to them + old_schm[["__filename"]] <- string() if (is.null(.data$aggregations)) { +# .data$selected_columns is a named list of Expressions (FieldRefs or +# something more complex). Bind them in order to determine their output type new_fields <- map(.data$selected_columns, ~ .$type(old_schm)) if (!is.null(.data$join) && !(.data$join$type %in% JoinType[1:4])) { # Add cols from right side, except for semi/anti joins @@ -128,6 +138,7 @@ implicit_schema <- function(.data) { new_fields <- c(left_fields, right_fields) } } else { +# The output schema is based on the aggregations and any group_by vars new_fields <- map(summarize_projection(.data), ~ .$type(old_schm)) # * Put group_by_vars first (this can't be done by summarize, # they have to be last per the aggregate node signature, diff --git a/r/R/dplyr-funcs-augmented.R b/r/R/dplyr-funcs-augmented.R new file mode 100644 index 00..6e751d49f6 --- /dev/null +++ b/r/R/dplyr-funcs-augmented.R @@ -0,0 +1,22 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +register_bindings_augmented <- function() { + register_binding("add_filename", function() { +Expression$field_ref("__filename") + }) +} diff --git a/r/R/dplyr-funcs.R b/r/R/dplyr-funcs.R index c1dcdd1774..4dadff54b4 1
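The `add_filename()` binding registered above resolves to a field reference on the augmented `__filename` column. A hedged usage sketch, writing a throwaway partitioned dataset to a temporary directory so the example is self-contained:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Create a small partitioned dataset to query
ds_dir <- tempfile()
write_dataset(mtcars, ds_dir, partitioning = "cyl")

# add_filename() materializes the source file path of each row as a column
open_dataset(ds_dir) %>%
  mutate(file = add_filename()) %>%
  select(mpg, cyl, file) %>%
  collect()
```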
[arrow] branch master updated: ARROW-17252: [R] Intermittent valgrind failure (#13773)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 7448322ebe ARROW-17252: [R] Intermittent valgrind failure (#13773) 7448322ebe is described below commit 7448322ebe34c6efae413a52338ebf7efa1a6069 Author: Dewey Dunnington AuthorDate: Tue Aug 9 07:57:57 2022 -0300 ARROW-17252: [R] Intermittent valgrind failure (#13773) This PR fixes intermittent leaks that occur after one of the changes from ARROW-16444: when we drain the `RecordBatchReader` that is emitted from the plan too quickly, it seems, some parts of the plan can leak (I don't know why this happens). I tried removing various pieces of the `RunWithCapturedR()` changes (see #13746) but the only thing that removes the errors completely is draining the resulting `RecordBatchReader` from R (i.e., `reader$read_table()`) instead of in C++ (i.e., `reader->ToTable()`). Unfortunately, for user-defined functions to work in a plan we need a C++ level `reader->ToTable()`. I took the approach here of disabling the C++ level read by default, requiring a user to opt in to the version of `collect( [...] I was able to replicate the original leaks but they are few and far between...our tests just happen to create and destroy many, many exec plans and something about the CI environment seems to trigger these more reliably (although the errors don't always occur there, either). Most of the leaks are small but there were some instances where an entire `Table` leaked. 
Authored-by: Dewey Dunnington Signed-off-by: Neal Richardson --- r/R/compute.R | 9 ++- r/R/table.R | 9 ++- r/man/register_scalar_function.Rd | 2 +- r/tests/testthat/test-compute.R | 51 ++- 4 files changed, 57 insertions(+), 14 deletions(-) diff --git a/r/R/compute.R b/r/R/compute.R index 0985e73a5f..636c9146ca 100644 --- a/r/R/compute.R +++ b/r/R/compute.R @@ -344,7 +344,7 @@ cast_options <- function(safe = TRUE, ...) { #' @return `NULL`, invisibly #' @export #' -#' @examplesIf arrow_with_dataset() +#' @examplesIf arrow_with_dataset() && identical(Sys.getenv("NOT_CRAN"), "true") #' library(dplyr, warn.conflicts = FALSE) #' #' some_model <- lm(mpg ~ disp + cyl, data = mtcars) @@ -385,6 +385,13 @@ register_scalar_function <- function(name, fun, in_type, out_type, update_cache = TRUE ) + # User-defined functions require some special handling + # in the query engine which currently require an opt-in using + # the R_ARROW_COLLECT_WITH_UDF environment variable while this + # behaviour is stabilized. + # TODO(ARROW-17178) remove the need for this! + Sys.setenv(R_ARROW_COLLECT_WITH_UDF = "true") + invisible(NULL) } diff --git a/r/R/table.R b/r/R/table.R index 5579c676d5..d7e276415c 100644 --- a/r/R/table.R +++ b/r/R/table.R @@ -331,5 +331,12 @@ as_arrow_table.arrow_dplyr_query <- function(x, ...) { # See query-engine.R for ExecPlan/Nodes plan <- ExecPlan$create() final_node <- plan$Build(x) - plan$Run(final_node, as_table = TRUE) + + run_with_event_loop <- identical( +Sys.getenv("R_ARROW_COLLECT_WITH_UDF", ""), +"true" + ) + + result <- plan$Run(final_node, as_table = run_with_event_loop) + as_arrow_table(result) } diff --git a/r/man/register_scalar_function.Rd b/r/man/register_scalar_function.Rd index 4da8f54f64..324dd5fad1 100644 --- a/r/man/register_scalar_function.Rd +++ b/r/man/register_scalar_function.Rd @@ -48,7 +48,7 @@ stateless and return output with the same shape (i.e., the same number of rows) as the input. 
} \examples{ -\dontshow{if (arrow_with_dataset()) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf} +\dontshow{if (arrow_with_dataset() && identical(Sys.getenv("NOT_CRAN"), "true")) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf} library(dplyr, warn.conflicts = FALSE) some_model <- lm(mpg ~ disp + cyl, data = mtcars) diff --git a/r/tests/testthat/test-compute.R b/r/tests/testthat/test-compute.R index 9e487169f4..5821c0fa2d 100644 --- a/r/tests/testthat/test-compute.R +++ b/r/tests/testthat/test-compute.R @@ -81,6 +81,9 @@ test_that("arrow_scalar_function() works with auto_convert = TRUE", { test_that("register_scalar_function() adds a compute function to the registry", { skip_if_not(CanRunWithCapturedR()) + # TODO(ARROW-17178): User-defined function-friendly ExecPlan execution has + # occasional valgrind errors + skip_on_linux_devel() register_scalar_function( "times_32", @@ -88,7 +91,11 @@ test_that("register_scalar_function() adds a c
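For context, a sketch of the user-defined-function workflow this patch gates behind `R_ARROW_COLLECT_WITH_UDF`. The `times_32` name comes from the test diff above; the input/output types are illustrative assumptions:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Register a scalar UDF; per this patch, registration also opts the session
# in to the event-loop-friendly collect() via R_ARROW_COLLECT_WITH_UDF
register_scalar_function(
  "times_32",
  function(context, x) x * 32L,
  in_type = int32(),
  out_type = int32(),
  auto_convert = TRUE
)

arrow_table(data.frame(x = 1:3)) %>%
  mutate(y = times_32(x)) %>%
  collect()
```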
[arrow] branch master updated: ARROW-17088: [R] Use `.arrow` as extension of IPC files of datasets (#13690)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 8cac69c809 ARROW-17088: [R] Use `.arrow` as extension of IPC files of datasets (#13690) 8cac69c809 is described below commit 8cac69c809e2ae9d4ba9c10c7b22869c1fd11323 Author: mopcup <40266799+mop...@users.noreply.github.com> AuthorDate: Wed Aug 3 06:35:10 2022 +0900 ARROW-17088: [R] Use `.arrow` as extension of IPC files of datasets (#13690) Lead-authored-by: mopcup Co-authored-by: mopcup <40266799+mop...@users.noreply.github.com> Signed-off-by: Neal Richardson --- r/R/dataset-write.R | 8 +-- r/man/write_dataset.Rd| 5 +++-- r/tests/testthat/test-dataset-write.R | 42 --- 3 files changed, 48 insertions(+), 7 deletions(-) diff --git a/r/R/dataset-write.R b/r/R/dataset-write.R index 496aaad205..e0181ee74f 100644 --- a/r/R/dataset-write.R +++ b/r/R/dataset-write.R @@ -34,8 +34,9 @@ #' use the current `group_by()` columns. #' @param basename_template string template for the names of files to be written. #' Must contain `"{i}"`, which will be replaced with an autoincremented -#' integer to generate basenames of datafiles. For example, `"part-{i}.feather"` -#' will yield `"part-0.feather", ...`. +#' integer to generate basenames of datafiles. For example, `"part-{i}.arrow"` +#' will yield `"part-0.arrow", ...`. +#' If not specified, it defaults to `"part-{i}.<default extension>"`. #' @param hive_style logical: write partition segments as Hive-style #' (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is `TRUE`. #' @param existing_data_behavior The behavior to use when there is already data @@ -133,6 +134,9 @@ write_dataset <- function(dataset, max_rows_per_group = bitwShiftL(1, 20), ...)
{ format <- match.arg(format) + if (format %in% c("feather", "ipc")) { +format <- "arrow" + } if (inherits(dataset, "arrow_dplyr_query")) { # partitioning vars need to be in the `select` schema dataset <- ensure_group_vars(dataset) diff --git a/r/man/write_dataset.Rd b/r/man/write_dataset.Rd index 8fc07d5cc7..1bc940697c 100644 --- a/r/man/write_dataset.Rd +++ b/r/man/write_dataset.Rd @@ -38,8 +38,9 @@ use the current \code{group_by()} columns.} \item{basename_template}{string template for the names of files to be written. Must contain \code{"{i}"}, which will be replaced with an autoincremented -integer to generate basenames of datafiles. For example, \code{"part-{i}.feather"} -will yield \verb{"part-0.feather", ...}.} +integer to generate basenames of datafiles. For example, \code{"part-{i}.arrow"} +will yield \verb{"part-0.arrow", ...}. +If not specified, it defaults to \code{"part-{i}.<default extension>"}.} \item{hive_style}{logical: write partition segments as Hive-style (\code{key1=value1/key2=value2/file.ext}) or as just bare values.
Default is \code{TRUE}.} diff --git a/r/tests/testthat/test-dataset-write.R b/r/tests/testthat/test-dataset-write.R index 2f4ff7e649..7a5f861ca5 100644 --- a/r/tests/testthat/test-dataset-write.R +++ b/r/tests/testthat/test-dataset-write.R @@ -63,7 +63,7 @@ test_that("Writing a dataset: CSV->IPC", { # Check whether "int" is present in the files or just in the dirs first <- read_feather( -dir(dst_dir, pattern = ".feather$", recursive = TRUE, full.names = TRUE)[1], +dir(dst_dir, pattern = ".arrow$", recursive = TRUE, full.names = TRUE)[1], as_data_frame = FALSE ) # It shouldn't be there @@ -139,6 +139,40 @@ test_that("Writing a dataset: Parquet->Parquet (default)", { ) }) +test_that("Writing a dataset: `basename_template` default behavior", { + ds <- open_dataset(csv_dir, partitioning = "part", format = "csv") + + dst_dir <- make_temp_dir() + write_dataset(ds, dst_dir, format = "parquet", max_rows_per_file = 5L) + expect_identical( +dir(dst_dir, full.names = FALSE, recursive = TRUE), +paste0("part-", 0:3, ".parquet") + ) + dst_dir <- make_temp_dir() + write_dataset(ds, dst_dir, format = "parquet", basename_template = "{i}.data", max_rows_per_file = 5L) + expect_identical( +dir(dst_dir, full.names = FALSE, recursive = TRUE), +paste0(0:3, ".data") + ) + dst_dir <- make_temp_dir() + expect_error( +write_dataset(ds, dst_dir, format = "parquet", basename_template = "part-i.parquet"), +"basename_template did not contain '\\{i\\}'" + ) + fe
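The extension change above is visible from a plain `write_dataset()` call. A small sketch, assuming the arrow package is installed and writing to a temporary directory:

```r
library(arrow, warn.conflicts = FALSE)

dst_dir <- tempfile()
# Per this commit, "feather" and "ipc" are normalized to "arrow", so the
# default basename_template yields files like part-0.arrow, not part-0.feather
write_dataset(mtcars, dst_dir, format = "feather")
dir(dst_dir)
```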
[arrow] branch master updated (95aec82bd6 -> cc63a5da02)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 95aec82bd6 ARROW-12693: [R] add unique() methods for ArrowTabular, datasets (#13641) add cc63a5da02 ARROW-16612: [R] Fix compression inference from filename (#13625) No new revisions were added by this update. Summary of changes: r/R/csv.R | 40 +++- r/R/feather.R | 21 +++ r/R/io.R | 76 -- r/R/ipc-stream.R | 10 - r/R/json.R | 5 +++ r/R/parquet.R | 9 + r/man/make_readable_file.Rd| 11 +- r/man/read_feather.Rd | 6 +-- r/man/read_ipc_stream.Rd | 6 --- r/man/write_feather.Rd | 9 +++-- r/man/write_ipc_stream.Rd | 6 --- r/tests/testthat/test-compressed.R | 8 r/tests/testthat/test-csv.R| 25 - r/tests/testthat/test-feather.R| 16 r/tests/testthat/test-parquet.R| 16 15 files changed, 145 insertions(+), 119 deletions(-)
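A hedged sketch of the filename-based compression inference that ARROW-16612 fixes; the temporary `.csv.gz` path is illustrative:

```r
library(arrow, warn.conflicts = FALSE)

tf <- tempfile(fileext = ".csv.gz")
# Gzip compression is inferred from the .gz extension on write and on read
write_csv_arrow(mtcars, tf)
head(read_csv_arrow(tf))
```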
[arrow] branch master updated: ARROW-14821: [R] Implement bindings for lubridate's floor_date, ceiling_date, and round_date (#12154)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new b0734e68d6 ARROW-14821: [R] Implement bindings for lubridate's floor_date, ceiling_date, and round_date (#12154) b0734e68d6 is described below commit b0734e68d6f57fb22869df0d0aa2ae4dd75765dc Author: Danielle Navarro AuthorDate: Fri Jul 22 08:31:02 2022 +1000 ARROW-14821: [R] Implement bindings for lubridate's floor_date, ceiling_date, and round_date (#12154) This patch provides dplyr bindings for lubridate functions `floor_date()`, `ceiling_date()`, and `round_date()`. This is my first attempt at writing a patch, so my apologies if I've made any errors ### Supported functionality: - Allows rounding to integer multiples of common time units (second, minutes, days, etc) - Mirrors the lubridate syntax allowing fractional units such as `unit = .001 seconds` as an alias for `unit = 1 millisecond` - Allows partial matching of date units based on first three characters: e.g. `sec`, `second`, `seconds` all match `second` - Mirrors lubridate in throwing errors when unit exceeds thresholds: 60 seconds, 60 minutes, 24 hours ~~### Major problems not yet addressed:~~ ~~- Does not yet support the `week_start` argument, and implicitly fixes `week_start = 4`~~ ~~- Does not yet mirror lubridate handling of timezones~~ ~~I'd prefer to fix these two issues before merging, but I'm uncertain how best to handle them. Any advice would be appreciated!~~ ~~### Minor things not yet addressed~~ ~~- During rounding lubridate sometimes coerces Date objects to POSIXct. This is not mirrored in the arrow bindings: date32 classes remain date32 classes. This introduces minor differences in rounding in some cases~~ ~~- Does not yet support the `change_on_boundary` argument to `ceiling_date()`.
It's a small discrepancy, but it means that the default behaviour of the arrow dplyr binding mirrors lubridate prior to v1.6.0~~ EDIT: issues now addressed! Authored-by: Danielle Navarro Signed-off-by: Neal Richardson --- r/R/dplyr-datetime-helpers.R | 158 r/R/dplyr-funcs-datetime.R | 52 +++ r/tests/testthat/test-dplyr-funcs-datetime.R | 578 +++ 3 files changed, 788 insertions(+) diff --git a/r/R/dplyr-datetime-helpers.R b/r/R/dplyr-datetime-helpers.R index 9199ce0dd5..efcc62ff4e 100644 --- a/r/R/dplyr-datetime-helpers.R +++ b/r/R/dplyr-datetime-helpers.R @@ -417,3 +417,161 @@ build_strptime_exprs <- function(x, formats) { ) ) } + +# This function parses the "unit" argument to round_date, floor_date, and +# ceiling_date. The input x is a single string like "second", "3 seconds", +# "10 microseconds" or "2 secs" used to specify the size of the unit to +# which the temporal data should be rounded. The matching rules implemented +# are designed to mirror lubridate exactly: it extracts the numeric multiple +# from the start of the string (presumed to be 1 if no number is present) +# and selects the unit by looking at the first 3 characters only. This choice +# ensures that "secs", "second", "microsecs" etc are all valid, but it is +# very permissive and would interpret "mickeys" as microseconds. This +# permissive implementation mirrors the corresponding implementation in +# lubridate. The return value is a list with integer-valued components +# "multiple" and "unit" +parse_period_unit <- function(x) { + # the regexp matches against fractional units, but per lubridate + # supports integer multiples of a known unit only + match_info <- regexpr( +pattern = " *(?<multiple>[0-9.,]+)?
 *(?<unit>[^ \t\n]+)", +text = x[[1]], +perl = TRUE + ) + + capture_start <- attr(match_info, "capture.start") + capture_length <- attr(match_info, "capture.length") + capture_end <- capture_start + capture_length - 1L + + str_unit <- substr(x, capture_start[[2]], capture_end[[2]]) + str_multiple <- substr(x, capture_start[[1]], capture_end[[1]]) + + known_units <- c("nanosecond", "microsecond", "millisecond", "second", + "minute", "hour", "day", "week", "month", "quarter", "year") + + # match the period unit + str_unit_start <- substr(str_unit, 1, 3) + unit <- as.integer(pmatch(str_unit_start, known_units)) - 1L + + if (any(is.na(unit))) { +abort( + sprintf( +"Invalid period name: '%s'", +str_unit, +". Known units are", +oxford_past
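The bindings introduced by ARROW-14821 can be used from a dplyr pipeline on Arrow data. A minimal sketch, assuming arrow, dplyr, and lubridate are installed; the timestamps and unit strings are illustrative:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

df <- data.frame(t = as.POSIXct("2022-07-22 10:11:12", tz = "UTC"))

# floor/ceiling/round to (multiples of) a period unit, evaluated by Arrow
arrow_table(df) %>%
  mutate(
    floored = floor_date(t, unit = "hour"),
    ceiled  = ceiling_date(t, unit = "15 minutes"),
    rounded = round_date(t, unit = "minute")
  ) %>%
  collect()
```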
[arrow] branch master updated: ARROW-8324: [R] Add read/write_ipc_file separate from _feather (#13626)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new d81d8451a0 ARROW-8324: [R] Add read/write_ipc_file separate from _feather (#13626) d81d8451a0 is described below commit d81d8451a0ff1c5108bc04e727ae053365950551 Author: eitsupi <50911393+eits...@users.noreply.github.com> AuthorDate: Tue Jul 19 05:15:50 2022 +0900 ARROW-8324: [R] Add read/write_ipc_file separate from _feather (#13626) Add `read_ipc_file()` and `write_ipc_file()` to read and write Arrow IPC files (Feather V2). These are much the same as `read_feather()`/`write_feather()` for now, but in the future *_feather functions may move to a different implementation to accommodate Feather V1 format. Authored-by: SHIMA Tatsuya Signed-off-by: Neal Richardson --- r/NAMESPACE | 2 ++ r/NEWS.md | 7 +- r/R/feather.R | 56 - r/man/read_feather.Rd | 13 +++--- r/man/write_feather.Rd | 38 ++-- r/tests/testthat/test-feather.R | 33 r/vignettes/arrow.Rmd | 3 ++- 7 files changed, 126 insertions(+), 26 deletions(-) diff --git a/r/NAMESPACE b/r/NAMESPACE index c7d2657bae..750a815f9f 100644 --- a/r/NAMESPACE +++ b/r/NAMESPACE @@ -335,6 +335,7 @@ export(open_dataset) export(read_csv_arrow) export(read_delim_arrow) export(read_feather) +export(read_ipc_file) export(read_ipc_stream) export(read_json_arrow) export(read_message) @@ -370,6 +371,7 @@ export(vctrs_extension_type) export(write_csv_arrow) export(write_dataset) export(write_feather) +export(write_ipc_file) export(write_ipc_stream) export(write_parquet) export(write_to_raw) diff --git a/r/NEWS.md b/r/NEWS.md index fca55b047e..59245b971d 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -24,7 +24,12 @@ * `lubridate::parse_date_time()` datetime parser: * `orders` with year, month, day, hours, minutes, and seconds components are supported. 
* the `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (like is the case in `lubridate::parse_date_time()`). -* `read_arrow()` and `write_arrow()`, deprecated since 1.0.0 (July 2020), have been removed. Use the `read/write_feather()` and `read/write_ipc_stream()` functions depending on whether you're working with the Arrow IPC file or stream format, respectively. +* New functions `read_ipc_file()` and `write_ipc_file()` are added. + These functions are almost the same as `read_feather()` and `write_feather()`, + but differ in that they only target IPC files (Feather V2 files), not Feather V1 files. +* `read_arrow()` and `write_arrow()`, deprecated since 1.0.0 (July 2020), have been removed. + Instead of these, use the `read_ipc_file()` and `write_ipc_file()` for IPC files, or, + `read_ipc_stream()` and `write_ipc_stream()` for IPC streams. * `write_parquet()` now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments `properties` and `arrow_properties` have been removed; if you need to deal with these lower-level properties objects directly, use `ParquetFileWriter`, which `write_parquet()` wraps. # arrow 8.0.0 diff --git a/r/R/feather.R b/r/R/feather.R index 02871396fa..46863c98a1 100644 --- a/r/R/feather.R +++ b/r/R/feather.R @@ -15,19 +15,23 @@ # specific language governing permissions and limitations # under the License. -#' Write data in the Feather format +#' Write a Feather file (an Arrow IPC file) #' #' Feather provides binary columnar serialization for data frames. #' It is designed to make reading and writing data frames efficient, #' and to make sharing data across data analysis languages easy. -#' This function writes both the original, limited specification of the format -#' and the version 2 specification, which is the Apache Arrow IPC file format. 
+#' [write_feather()] can write both the Feather Version 1 (V1), +#' a legacy version available starting in 2016, and the Version 2 (V2), +#' which is the Apache Arrow IPC file format. +#' The default version is V2. +#' V1 files are distinct from Arrow IPC files and lack many features, +#' such as the ability to store all Arrow data types, and compression support. +#' [write_ipc_file()] can only write V2 files. #' #' @param x `data.frame`, [RecordBatch], or [Table] #' @param sink A string file path, URI, or [OutputStream], or path in a file #' system (`SubTreeFileSystem`) -#' @param version integer Feather file version. Version 2 is the current. -#' Version 1 is the more limited legacy format. +#' @param version integer Feather file version, Version 1 or Ver
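A round-trip sketch with the functions this commit adds, assuming the arrow package is installed and using a temporary file:

```r
library(arrow, warn.conflicts = FALSE)

tf <- tempfile(fileext = ".arrow")
# write_ipc_file() always writes the V2 (Arrow IPC file) format;
# read_ipc_file() reads it back, by default as a data frame
write_ipc_file(mtcars, tf)
df <- read_ipc_file(tf)
identical(dim(df), dim(mtcars))
```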
[arrow] branch master updated: ARROW-17102: [R] Test fails on R minimal nightly builds due to Parquet writing (#13631)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 72d2d24851 ARROW-17102: [R] Test fails on R minimal nightly builds due to Parquet writing (#13631) 72d2d24851 is described below commit 72d2d248517c0d6f42ef921ed996c92e634e7a81 Author: Nic Crane AuthorDate: Mon Jul 18 20:15:56 2022 +0100 ARROW-17102: [R] Test fails on R minimal nightly builds due to Parquet writing (#13631) Authored-by: Nic Crane Signed-off-by: Neal Richardson --- r/tests/testthat/test-dplyr-summarize.R | 2 ++ 1 file changed, 2 insertions(+) diff --git a/r/tests/testthat/test-dplyr-summarize.R b/r/tests/testthat/test-dplyr-summarize.R index 3711b49975..f799fcbf38 100644 --- a/r/tests/testthat/test-dplyr-summarize.R +++ b/r/tests/testthat/test-dplyr-summarize.R @@ -237,6 +237,8 @@ test_that("Group by any/all", { }) test_that("n_distinct() with many batches", { + skip_if_not_available("parquet") + tf <- tempfile() write_parquet(dplyr::starwars, tf, chunk_size = 20)
[arrow] branch master updated: ARROW-14575: [R] Allow functions with `pkg::` prefixes (#13160)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 3e0eea1244 ARROW-14575: [R] Allow functions with `pkg::` prefixes (#13160) 3e0eea1244 is described below commit 3e0eea1244a066a6aee3262440093df021c37882 Author: Dragoș Moldovan-Grünfeld AuthorDate: Fri Jul 15 22:23:50 2022 +0100 ARROW-14575: [R] Allow functions with `pkg::` prefixes (#13160) This PR will allow the use of namespacing with bindings: ``` r library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) library(lubridate, warn.conflicts = FALSE) test_df <- tibble( date = as.Date(c("2022-03-22", "2021-07-30", NA)) ) test_df %>% mutate(ddate = lubridate::as_datetime(date)) %>% collect() #> # A tibble: 3 × 2 #> date ddate #> #> 1 2022-03-22 2022-03-22 00:00:00 #> 2 2021-07-30 2021-07-30 00:00:00 #> 3 NA NA test_df %>% arrow_table() %>% mutate(ddate = lubridate::as_datetime(date)) %>% collect() #> # A tibble: 3 × 2 #> date ddate #> #> 1 2022-03-22 2022-03-22 00:00:00 #> 2 2021-07-30 2021-07-30 00:00:00 #> 3 NA NA ``` Created on 2022-05-14 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1) The approach (option 1 from the [design doc](https://docs.google.com/document/d/1Om-vYb31b6p_u4tyl86SGW1DrtWBfksq8NYG1Seqaxg/edit#)): - [x] add functionality to allow binding registration with the `pkg::fun()` name; - [x] Modify `register_binding()` to register 2 identical copies for each `pkg::fun` binding, namely `fun` and `pkg::fun`. - [x] Add a binding for the `::` operator, which helps with retrieving bindings from the function registry. - [x] Add generic unit tests for the `pkg::fun` functionality. - [x] Warn for a duplicated binding registration. - [x] register `nse_funcs` requiring _indirect_ mapping - [x] register each binding with and without the `pkg::` prefix. 
- [x] add / update unit tests for the `nse_funcs` bindings to include at least one `pkg::fun()` call for each binding unit tests for conditional bindings - [x] `"dplyr::coalesce"` - [x] `"dplyr::if_else"` - [x] `"base::ifelse"` - [x] `"dplyr::case_when"` unit tests for date/time bindings - [x] `"base::strptime"` - [x] `"base::strftime"` - [x] `"lubridate::format_ISO8601"` - [x] `"lubridate::is.Date"` - [x] `"lubridate::is.instant"` - [x] `"lubridate::is.timepoint"` - [x] `"lubridate::is.POSIXct"` - [x] `"lubridate::date"` - [x] `"lubridate::second"` - [x] `"lubridate::wday"` - [x] `"lubridate::week"` - [x] `"lubridate::month"` - [x] `"lubridate::am"` - [x] `"lubridate::pm"` - [x] `"lubridate::tz"` - [x] `"lubridate::semester"` - [x] `"lubridate::make_datetime"` - [x] `"lubridate::make_date"` - [x] `"base::ISOdatetime"` - [x] `"base::ISOdate"` - [x] `"base::as.Date"` - [x] `"lubridate::as_date"` - [x] `"lubridate::as_datetime"` - [x] `"lubridate::decimal_date"` - [x] `"lubridate::date_decimal"` - [x] `"base::difftime"` - [x] `"base::as.difftime"` - [x] `"lubridate::make_difftime"` - [x] `"lubridate::dminutes"` - [x] `"lubridate::dhours"` - [x] `"lubridate::ddays"` - [x] `"lubridate::dweeks"` - [x] `"lubridate::dmonths"` - [x] `"lubridate::dyears"` - [x] `"lubridate::dseconds"` - [x] `"lubridate::dmilliseconds"` - [x] `"lubridate::dmicroseconds"` - [x] `"lubridate::dnanoseconds"` - [x] `"lubridate::dpicoseconds"` - [x] `"lubridate::parse_date_time"` - [x] `"lubridate::ymd"` - [x] `"lubridate::ydm"` - [x] `"lubridate::mdy"` - [x] `"lubridate::myd"` - [x] `"lubridate::dmy"` - [x] `"lubridate::dym"` - [x] `&
[arrow] branch master updated: ARROW-17085: [R] group_vars() should not return NULL (#13621)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 29cc263068 ARROW-17085: [R] group_vars() should not return NULL (#13621) 29cc263068 is described below commit 29cc263068b983e690879d4d768025439a0fdd47 Author: eitsupi <50911393+eits...@users.noreply.github.com> AuthorDate: Sat Jul 16 01:06:57 2022 +0900 ARROW-17085: [R] group_vars() should not return NULL (#13621) If an ungrouped data.frame or an `arrow_dplyr_query` is given to `dplyr::group_vars()`, `character()` is returned. But for an ungrouped Table, `NULL` is returned. ```r mtcars |> dplyr::group_vars() #> character(0) mtcars |> arrow:::as_adq() |> dplyr::group_vars() #> character(0) mtcars |> arrow::arrow_table() |> dplyr::group_vars() #> NULL ``` Therefore, functions that expect `group_vars` to return character, such as the following, will fail. ```r mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt() #> Error in new_step(parent, vars = names(parent), groups = groups, locals = list(), : is.character(groups) is not TRUE ``` This PR modifies `dplyr::group_vars()` and `dplyr::groups()` for Arrow objects to work the same as for data.frame. (Note that `arrow_dplyr_query` already works the same way as data.frame.) 
Lead-authored-by: SHIMA Tatsuya Co-authored-by: eitsupi <50911393+eits...@users.noreply.github.com> Signed-off-by: Neal Richardson --- r/R/dplyr-group-by.R| 8 r/R/dplyr.R | 2 +- r/tests/testthat/test-RecordBatch.R | 10 -- r/tests/testthat/test-Table.R | 8 +++- r/tests/testthat/test-metadata.R| 2 +- 5 files changed, 21 insertions(+), 9 deletions(-) diff --git a/r/R/dplyr-group-by.R b/r/R/dplyr-group-by.R index 250dbedb18..c650799e8d 100644 --- a/r/R/dplyr-group-by.R +++ b/r/R/dplyr-group-by.R @@ -58,13 +58,13 @@ group_by.arrow_dplyr_query <- function(.data, group_by.Dataset <- group_by.ArrowTabular <- group_by.RecordBatchReader <- group_by.arrow_dplyr_query groups.arrow_dplyr_query <- function(x) syms(dplyr::group_vars(x)) -groups.Dataset <- groups.ArrowTabular <- groups.RecordBatchReader <- function(x) NULL +groups.Dataset <- groups.ArrowTabular <- groups.RecordBatchReader <- groups.arrow_dplyr_query group_vars.arrow_dplyr_query <- function(x) x$group_by_vars -group_vars.Dataset <- function(x) NULL -group_vars.RecordBatchReader <- function(x) NULL +group_vars.Dataset <- function(x) character() +group_vars.RecordBatchReader <- function(x) character() group_vars.ArrowTabular <- function(x) { - x$metadata$r$attributes$.group_vars + x$metadata$r$attributes$.group_vars %||% character() } # the logical literal in the two functions below controls the default value of diff --git a/r/R/dplyr.R b/r/R/dplyr.R index b048d98018..1296e60384 100644 --- a/r/R/dplyr.R +++ b/r/R/dplyr.R @@ -42,7 +42,7 @@ arrow_dplyr_query <- function(.data) { gv <- tryCatch( # If dplyr is not available, or if the input doesn't have a group_vars # method, assume no group vars -dplyr::group_vars(.data) %||% character(), +dplyr::group_vars(.data), error = function(e) character() ) diff --git a/r/tests/testthat/test-RecordBatch.R b/r/tests/testthat/test-RecordBatch.R index e7602d9f74..6b79325934 100644 --- a/r/tests/testthat/test-RecordBatch.R +++ b/r/tests/testthat/test-RecordBatch.R @@ -654,7 +654,7 
@@ test_that("Handling string data with embedded nuls", { }) }) -test_that("ARROW-11769/ARROW-13860 - grouping preserved in record batch creation", { +test_that("ARROW-11769/ARROW-13860/ARROW-17085 - grouping preserved in record batch creation", { skip_if_not_available("dataset") library(dplyr, warn.conflicts = FALSE) @@ -670,6 +670,12 @@ test_that("ARROW-11769/ARROW-13860 - grouping preserved in record batch creation record_batch(), "RecordBatch" ) + expect_identical( +tbl %>% + record_batch() %>% + group_vars(), +group_vars(tbl) + ) expect_identical( tbl %>% group_by(fct, fct2) %>% @@ -683,7 +689,7 @@ test_that("ARROW-11769/ARROW-13860 - grouping preserved in record batch creation record_batch() %>% ungroup() %>% group_vars(), -NULL +character() ) expect_identical( tbl %>% diff --git a/r/tests/testthat/test-Table.R b/r/tests/testthat/test-Table.R index 5edba2cd4a..bafd183108 100644 --- a/r/tests/testthat/test-Table.R +++ b/r/tests/testthat/test-Table.R @@ -592,7 +592,7 @@ test_that("cbind.Table handles record batches and tables", { ) }) -test_t
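The core of the fix is the null-default idiom visible in the diff: rlang's `%||%` operator returns its left-hand side unless that is `NULL`. A minimal sketch of the pattern, outside of Arrow (defining the operator inline so the snippet is self-contained):

```r
# rlang exports `%||%`; this inline definition matches its behavior
`%||%` <- function(x, y) if (is.null(x)) y else x

stored_groups <- NULL            # e.g. no .group_vars metadata on an ungrouped Table
stored_groups %||% character()   # falls back to character(0) instead of NULL
```

This is why `group_vars.ArrowTabular()` can now promise a character vector even when the grouping metadata is absent.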
[arrow] branch master updated: MINOR: [R] Conditionally skip some glimpse-related tests (#13610)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new f295da4cfd MINOR: [R] Conditionally skip some glimpse-related tests (#13610) f295da4cfd is described below commit f295da4cfdcf102d9ac2d16bbca6f8342fc3e6a8 Author: Neal Richardson AuthorDate: Thu Jul 14 19:17:54 2022 -0400 MINOR: [R] Conditionally skip some glimpse-related tests (#13610) Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/tests/testthat/helper-skip.R | 4 ++-- r/tests/testthat/test-Array.R| 2 +- r/tests/testthat/test-RecordBatch.R | 2 +- r/tests/testthat/test-altrep.R | 2 +- r/tests/testthat/test-chunked-array.R| 2 +- r/tests/testthat/test-csv.R | 2 +- r/tests/testthat/test-dplyr-funcs-datetime.R | 2 +- r/tests/testthat/test-dplyr-funcs-type.R | 2 +- r/tests/testthat/test-dplyr-glimpse.R| 5 + r/tests/testthat/test-dplyr-query.R | 3 +++ r/tests/testthat/test-feather.R | 2 +- r/tests/testthat/test-safe-call-into-r.R | 4 ++-- r/tests/testthat/test-scalar.R | 2 +- 13 files changed, 21 insertions(+), 13 deletions(-) diff --git a/r/tests/testthat/helper-skip.R b/r/tests/testthat/helper-skip.R index 24e5b3f7dc..fd1ce1a76c 100644 --- a/r/tests/testthat/helper-skip.R +++ b/r/tests/testthat/helper-skip.R @@ -92,12 +92,12 @@ skip_on_linux_devel <- function() { } } -skip_if_r_version <- function(r_version) { +skip_on_r_older_than <- function(r_version) { if (force_tests()) { return() } - if (getRversion() <= r_version) { + if (getRversion() < r_version) { skip(paste("R version:", getRversion())) } } diff --git a/r/tests/testthat/test-Array.R b/r/tests/testthat/test-Array.R index ebc6085095..56c7028d6a 100644 --- a/r/tests/testthat/test-Array.R +++ b/r/tests/testthat/test-Array.R @@ -785,7 +785,7 @@ test_that("Handling string data with embedded nuls", { # The behavior of the warnings/errors is slightly different 
with and without # altrep. Without it (i.e. 3.5.0 and below, the error would trigger immediately # on `as.vector()` where as with it, the error only happens on materialization) - skip_if_r_version("3.5.0") + skip_on_r_older_than("3.6") # no error on conversion, because altrep laziness v <- expect_error(as.vector(array_with_nul), NA) diff --git a/r/tests/testthat/test-RecordBatch.R b/r/tests/testthat/test-RecordBatch.R index a39aa0f0fb..e7602d9f74 100644 --- a/r/tests/testthat/test-RecordBatch.R +++ b/r/tests/testthat/test-RecordBatch.R @@ -626,7 +626,7 @@ test_that("Handling string data with embedded nuls", { # The behavior of the warnings/errors is slightly different with and without # altrep. Without it (i.e. 3.5.0 and below, the error would trigger immediately # on `as.vector()` where as with it, the error only happens on materialization) - skip_if_r_version("3.5.0") + skip_on_r_older_than("3.6") df <- as.data.frame(batch_with_nul) expect_error( diff --git a/r/tests/testthat/test-altrep.R b/r/tests/testthat/test-altrep.R index 082a3ea91f..cd1d841c42 100644 --- a/r/tests/testthat/test-altrep.R +++ b/r/tests/testthat/test-altrep.R @@ -15,7 +15,7 @@ # specific language governing permissions and limitations # under the License. -skip_if_r_version("3.5.0") +skip_on_r_older_than("3.6") test_that("is_arrow_altrep() does not include base altrep", { expect_false(is_arrow_altrep(1:10)) diff --git a/r/tests/testthat/test-chunked-array.R b/r/tests/testthat/test-chunked-array.R index 5f32184efc..ce43d84274 100644 --- a/r/tests/testthat/test-chunked-array.R +++ b/r/tests/testthat/test-chunked-array.R @@ -478,7 +478,7 @@ test_that("Handling string data with embedded nuls", { # The behavior of the warnings/errors is slightly different with and without # altrep. Without it (i.e. 
3.5.0 and below, the error would trigger immediately # on `as.vector()` where as with it, the error only happens on materialization) - skip_if_r_version("3.5.0") + skip_on_r_older_than("3.6") v <- expect_error(as.vector(chunked_array_with_nul), NA) diff --git a/r/tests/testthat/test-csv.R b/r/tests/testthat/test-csv.R index 8e463d3abe..fca717cc05 100644 --- a/r/tests/testthat/test-csv.R +++ b/r/tests/testthat/test-csv.R @@ -295,7 +295,7 @@ test_that("more informative error when reading a CSV with headers and schema", { test_that("read_csv_arrow() and write_csv_arrow() accept connection objects", { # connections with csv need RunWithCapturedR, which is not available # in R <= 3.4.4 - skip_if_r_version("3.4.4")
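Beyond the rename, the helper's comparison changes from `<=` to `<`, which shifts behavior at the boundary version. A base-R sketch of the two predicates:

```r
# Old helper: skip_if_r_version("3.5.0") skipped on R 3.5.0 itself
# as well as on anything older
old_skip <- getRversion() <= "3.5.0"

# New helper: skip_on_r_older_than("3.6") skips only strictly older
# versions, so R 3.6.0 itself now runs the test
new_skip <- getRversion() < "3.6"
```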
[arrow] branch master updated (5d86e9fc40 -> 87d1889092)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 5d86e9fc40 ARROW-16734: [C++] Bump vendored version of protobuf (#13581) add 87d1889092 ARROW-16977: [R] Update dataset row counting so no integer overflow on large datasets (#13514) No new revisions were added by this update. Summary of changes: r/NAMESPACE | 1 + r/R/arrow-package.R | 2 +- r/R/record-batch.R| 4 ++-- r/R/util.R| 2 +- r/src/array.cpp | 24 ++--- r/src/arrowExports.cpp| 50 +-- r/src/buffer.cpp | 8 +++ r/src/chunkedarray.cpp| 19 +--- r/src/dataset.cpp | 4 ++-- r/src/filesystem.cpp | 4 +++- r/src/io.cpp | 21 +- r/src/message.cpp | 9 r/src/parquet.cpp | 4 ++-- r/src/recordbatch.cpp | 8 +++ r/src/table.cpp | 8 +++ r/tests/testthat/test-Table.R | 25 ++ 16 files changed, 113 insertions(+), 80 deletions(-)
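The motivation for the row-counting change is R's 32-bit integer type: counts beyond `.Machine$integer.max` cannot be represented in integer arithmetic and must be accumulated as doubles. A minimal base-R illustration (not Arrow-specific):

```r
.Machine$integer.max        # 2147483647, the largest representable R integer
.Machine$integer.max + 1L   # integer arithmetic overflows to NA, with a warning
.Machine$integer.max + 1    # promoting to double yields the exact count 2147483648
```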
[arrow] branch master updated: ARROW-16776: [R] dplyr::glimpse method for arrow table and datasets (#13563)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new c6534a59a3 ARROW-16776: [R] dplyr::glimpse method for arrow table and datasets (#13563) c6534a59a3 is described below commit c6534a59a38acd31856284bcdfa36ecea7d11479 Author: Neal Richardson AuthorDate: Tue Jul 12 15:48:16 2022 -0400 ARROW-16776: [R] dplyr::glimpse method for arrow table and datasets (#13563) See reprex (sans terminal formatting) in [r/tests/testthat/_snaps/dplyr-glimpse.md](https://github.com/apache/arrow/pull/13563/files#diff-e8d50da600908f077796a43b7600c17d34448671c7975bb8c4056a484ac2999e) Not all queries can be glimpse()d: some would require evaluating the whole query, which may be expensive (and can't be interrupted yet, see ARROW-11841). Note that the existing `print()` methods aren't affected by this. There is still the idea that the print methods for Table/RecordBatch should print some data (ARROW-16777 and others), but that should probably be column-oriented instead of row-oriented like glimpse(). 
Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/DESCRIPTION| 3 + r/NAMESPACE | 2 + r/R/arrow-object.R | 6 +- r/R/arrow-package.R | 3 +- r/R/chunked-array.R | 3 +- r/R/dplyr-count.R| 2 +- r/R/dplyr-glimpse.R | 160 +++ r/R/dplyr.R | 47 - r/R/extension.R | 22 + r/R/filesystem.R | 1 - r/R/query-engine.R | 4 +- r/tests/testthat/_snaps/dplyr-glimpse.md | 152 + r/tests/testthat/test-chunked-array.txt | 4 + r/tests/testthat/test-data-type.R| 19 ++-- r/tests/testthat/test-dplyr-glimpse.R| 102 r/tests/testthat/test-dplyr-query.R | 140 +++ r/tests/testthat/test-extension.R| 2 +- r/tests/testthat/test-schema.R | 11 +-- 18 files changed, 637 insertions(+), 46 deletions(-) diff --git a/r/DESCRIPTION b/r/DESCRIPTION index 2cbbec054a..a7408d27d6 100644 --- a/r/DESCRIPTION +++ b/r/DESCRIPTION @@ -44,6 +44,7 @@ RoxygenNote: 7.2.0 Config/testthat/edition: 3 VignetteBuilder: knitr Suggests: +cli, DBI, dbplyr, decor, @@ -53,6 +54,7 @@ Suggests: hms, knitr, lubridate, +pillar, pkgload, reticulate, rmarkdown, @@ -103,6 +105,7 @@ Collate: 'dplyr-funcs-type.R' 'expression.R' 'dplyr-funcs.R' +'dplyr-glimpse.R' 'dplyr-group-by.R' 'dplyr-join.R' 'dplyr-mutate.R' diff --git a/r/NAMESPACE b/r/NAMESPACE index 023e9bb831..86eb958471 100644 --- a/r/NAMESPACE +++ b/r/NAMESPACE @@ -453,6 +453,8 @@ importFrom(tidyselect,starts_with) importFrom(tidyselect,vars_pull) importFrom(tidyselect,vars_rename) importFrom(tidyselect,vars_select) +importFrom(utils,capture.output) +importFrom(utils,getFromNamespace) importFrom(utils,head) importFrom(utils,install.packages) importFrom(utils,modifyList) diff --git a/r/R/arrow-object.R b/r/R/arrow-object.R index 0a82f85877..ac067d4aa5 100644 --- a/r/R/arrow-object.R +++ b/r/R/arrow-object.R @@ -31,14 +31,16 @@ ArrowObject <- R6Class("ArrowObject", } assign(".:xp:.", xp, envir = self) }, -print = function(...) 
{ +class_title = function() { if (!is.null(self$.class_title)) { # Allow subclasses to override just printing the class name first class_title <- self$.class_title() } else { class_title <- class(self)[[1]] } - cat(class_title, "\n", sep = "") +}, +print = function(...) { + cat(self$class_title(), "\n", sep = "") if (!is.null(self$ToString)) { cat(self$ToString(), "\n", sep = "") } diff --git a/r/R/arrow-package.R b/r/R/arrow-package.R index 05270ef6bb..a2c37d0ce3 100644 --- a/r/R/arrow-package.R +++ b/r/R/arrow-package.R @@ -41,7 +41,7 @@ "group_vars", "group_by_drop_default", "ungroup", "mutate", "transmute", "arrange", "rename", "pull", "relocate", "compute", "collapse", "distinct", "left_join", "right_join", "inner_join", "full_join", - "semi_join", "anti_join", "count", "tally", "rename_with", "union", "union_all" + "semi_join", "anti_jo
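After this commit, `glimpse()` can be called on a Table or Dataset much as on a data frame; a minimal usage sketch:

```r
library(arrow)
library(dplyr, warn.conflicts = FALSE)

# Column-oriented preview of an Arrow Table, analogous to glimpse() on a tibble
arrow_table(mtcars) %>%
  glimpse()
```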
[arrow] branch master updated: MINOR: [R] Cleanup skips and TODOs (#13576)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new a01b0c20c7 MINOR: [R] Cleanup skips and TODOs (#13576) a01b0c20c7 is described below commit a01b0c20c7e2c3283cf195de38372b998dbf17d5 Author: Neal Richardson AuthorDate: Tue Jul 12 09:02:40 2022 -0400 MINOR: [R] Cleanup skips and TODOs (#13576) Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/R/array.R | 6 --- r/R/arrow-datum.R| 16 +-- r/R/chunked-array.R | 12 -- r/R/compute.R| 1 - r/R/dplyr-datetime-helpers.R | 20 - r/R/dplyr-distinct.R | 3 +- r/R/dplyr-funcs-datetime.R | 5 +-- r/R/dplyr-summarize.R| 1 - r/src/altrep.cpp | 1 - r/tests/testthat/test-compute-arith.R| 6 +-- r/tests/testthat/test-compute-sort.R | 4 +- r/tests/testthat/test-dplyr-collapse.R | 12 -- r/tests/testthat/test-dplyr-distinct.R | 2 +- r/tests/testthat/test-dplyr-filter.R | 10 - r/tests/testthat/test-dplyr-funcs-datetime.R | 63 +++- r/tests/testthat/test-dplyr-funcs-type.R | 3 +- r/tests/testthat/test-dplyr-mutate.R | 2 +- r/tests/testthat/test-dplyr-summarize.R | 2 +- r/tools/autobrew | 3 +- 19 files changed, 76 insertions(+), 96 deletions(-) diff --git a/r/R/array.R b/r/R/array.R index 89e9fbfa33..9ae7631e7d 100644 --- a/r/R/array.R +++ b/r/R/array.R @@ -155,12 +155,6 @@ Array <- R6Class("Array", assert_is(i, "Array") call_function("filter", self, i, options = list(keep_na = keep_na)) }, -SortIndices = function(descending = FALSE) { - assert_that(is.logical(descending)) - assert_that(length(descending) == 1L) - assert_that(!is.na(descending)) - call_function("array_sort_indices", self, options = list(order = descending)) -}, RangeEquals = function(other, start_idx, end_idx, other_start_idx = 0L) { assert_is(other, "Array") Array__RangeEquals(self, other, start_idx, end_idx, other_start_idx) diff --git a/r/R/arrow-datum.R b/r/R/arrow-datum.R 
index 39362628bb..8632ca3053 100644 --- a/r/R/arrow-datum.R +++ b/r/R/arrow-datum.R @@ -26,6 +26,16 @@ ArrowDatum <- R6Class("ArrowDatum", opts <- cast_options(safe, ...) opts$to_type <- as_type(target_type) call_function("cast", self, options = opts) +}, +SortIndices = function(descending = FALSE) { + assert_that(is.logical(descending)) + assert_that(length(descending) == 1L) + assert_that(!is.na(descending)) + call_function( +"sort_indices", +self, +options = list(names = "", orders = as.integer(descending)) + ) } ) ) @@ -55,8 +65,8 @@ is.na.ArrowDatum <- function(x) { #' @export is.nan.ArrowDatum <- function(x) { if (x$type_id() %in% TYPES_WITH_NAN) { -# TODO: if an option is added to the is_nan kernel to treat NA as NaN, -# use that to simplify the code here (ARROW-13366) +# TODO(ARROW-13366): if an option is added to the is_nan kernel to treat NA +# as NaN, use that to simplify the code here call_function("is_nan", x) & call_function("is_valid", x) } else { Scalar$create(FALSE)$as_array(length(x)) @@ -336,7 +346,7 @@ sort.ArrowDatum <- function(x, decreasing = FALSE, na.last = NA, ...) { # Arrow always sorts nulls at the end of the array. This corresponds to # sort(na.last = TRUE). For the other two cases (na.last = NA and # na.last = FALSE) we need to use workarounds. 
- # TODO: Implement this more cleanly after ARROW-12063 + # TODO(ARROW-14085): use NullPlacement ArraySortOptions instead of this workaround if (is.na(na.last)) { # Filter out NAs before sorting x <- x$Filter(!is.na(x)) diff --git a/r/R/chunked-array.R b/r/R/chunked-array.R index 24ca7e6e58..c16f562017 100644 --- a/r/R/chunked-array.R +++ b/r/R/chunked-array.R @@ -113,18 +113,6 @@ ChunkedArray <- R6Class("ChunkedArray", } call_function("filter", self, i, options = list(keep_na = keep_na)) }, -SortIndices = function(descending = FALSE) { - assert_that(is.logical(descending)) - assert_that(length(descending) == 1L) - assert_that(!is.na(descending)) - # TODO: after ARROW-12042 is closed, review whether this and the - # Array$SortIndices definition can be consolidated - call_function( -"sort_indices", -self, -opti
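With `SortIndices` consolidated onto `ArrowDatum`, Arrays and ChunkedArrays both sort through the same `sort_indices` kernel. A usage sketch (assuming an Arrow build with the compute kernels available):

```r
library(arrow)

a <- Array$create(c(3L, 1L, NA, 2L))
a$SortIndices()           # indices that would order the array ascending
sort(a, na.last = TRUE)   # base sort() method built on the same kernel
```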
[arrow] branch master updated: ARROW-16715: [R] Bump default parquet version (#13555)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new f0ff8d015a ARROW-16715: [R] Bump default parquet version (#13555) f0ff8d015a is described below commit f0ff8d015a26a780426a13b556d9db082daed200 Author: Neal Richardson AuthorDate: Mon Jul 11 11:26:51 2022 -0400 ARROW-16715: [R] Bump default parquet version (#13555) Also removes deprecated args to `write_parquet()` Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/NAMESPACE | 1 + r/NEWS.md| 1 + r/R/arrow-package.R | 2 +- r/R/enums.R | 2 +- r/R/parquet.R| 99 r/man/enums.Rd | 2 +- r/man/write_parquet.Rd | 48 r/tests/testthat/_snaps/dataset-write.md | 2 +- r/tests/testthat/test-parquet.R | 52 ++--- 9 files changed, 122 insertions(+), 87 deletions(-) diff --git a/r/NAMESPACE b/r/NAMESPACE index 5762df9eb0..023e9bb831 100644 --- a/r/NAMESPACE +++ b/r/NAMESPACE @@ -395,6 +395,7 @@ importFrom(rlang,"%||%") importFrom(rlang,":=") importFrom(rlang,.data) importFrom(rlang,abort) +importFrom(rlang,arg_match) importFrom(rlang,as_function) importFrom(rlang,as_label) importFrom(rlang,as_quosure) diff --git a/r/NEWS.md b/r/NEWS.md index 119974f74a..fca55b047e 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -25,6 +25,7 @@ * `orders` with year, month, day, hours, minutes, and seconds components are supported. * the `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (like is the case in `lubridate::parse_date_time()`). * `read_arrow()` and `write_arrow()`, deprecated since 1.0.0 (July 2020), have been removed. Use the `read/write_feather()` and `read/write_ipc_stream()` functions depending on whether you're working with the Arrow IPC file or stream format, respectively. 
+* `write_parquet()` now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments `properties` and `arrow_properties` have been removed; if you need to deal with these lower-level properties objects directly, use `ParquetFileWriter`, which `write_parquet()` wraps. # arrow 8.0.0 diff --git a/r/R/arrow-package.R b/r/R/arrow-package.R index 7b59854f1e..05270ef6bb 100644 --- a/r/R/arrow-package.R +++ b/r/R/arrow-package.R @@ -23,7 +23,7 @@ #' @importFrom rlang eval_tidy new_data_mask syms env new_environment env_bind set_names exec #' @importFrom rlang is_bare_character quo_get_expr quo_get_env quo_set_expr .data seq2 is_interactive #' @importFrom rlang expr caller_env is_character quo_name is_quosure enexpr enexprs as_quosure -#' @importFrom rlang is_list call2 is_empty as_function as_label +#' @importFrom rlang is_list call2 is_empty as_function as_label arg_match #' @importFrom tidyselect vars_pull vars_rename vars_select eval_select #' @useDynLib arrow, .registration = TRUE #' @keywords internal diff --git a/r/R/enums.R b/r/R/enums.R index 17d0484b99..727ca9388c 100644 --- a/r/R/enums.R +++ b/r/R/enums.R @@ -122,7 +122,7 @@ FileType <- enum("FileType", #' @export #' @rdname enums ParquetVersionType <- enum("ParquetVersionType", - PARQUET_1_0 = 0L, PARQUET_2_0 = 1L + PARQUET_1_0 = 0L, PARQUET_2_0 = 1L, PARQUET_2_4 = 2L, PARQUET_2_6 = 3L ) #' @export diff --git a/r/R/parquet.R b/r/R/parquet.R index 62da28fd1e..8cd9daa857 100644 --- a/r/R/parquet.R +++ b/r/R/parquet.R @@ -83,30 +83,29 @@ read_parquet <- function(file, #' @param sink A string file path, URI, or [OutputStream], or path in a file #' system (`SubTreeFileSystem`) #' @param chunk_size how many rows of data to write to disk at once. This -#' directly corresponds to how many rows will be in each row group in parquet. 
-#' If `NULL`, a best guess will be made for optimal size (based on the number of -#' columns and number of rows), though if the data has fewer than 250 million -#' cells (rows x cols), then the total number of rows is used. -#' @param version parquet version, "1.0" or "2.0". Default "1.0". Numeric values -#' are coerced to character. +#'directly corresponds to how many rows will be in each row group in +#'parquet. If `NULL`, a best guess will be made for optimal size (based on +#'the number of columns and number of rows), though if the data has fewer +#'than 250 million cells (rows x cols), then the total number of rows is +#'used. +#' @param version parquet version: "1.0", "2.0" (de
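In practice, the NEWS change means `write_parquet()` now emits Parquet format version 2.4 unless told otherwise; callers who need the old behavior can opt back in explicitly. A sketch:

```r
library(arrow)

tf <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tf)                   # now defaults to format version "2.4"
write_parquet(mtcars, tf, version = "1.0")  # restore the pre-change default explicitly
```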
[arrow] branch master updated (fdcf63a1ed -> 8042f001fb)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from fdcf63a1ed ARROW-16828: [R][Packaging] Enable Brotli and BZ2 on MacOS and Windows (#13484) add 8042f001fb ARROW-16405: [R][CI] Use nightlies.apache.org as dev repo (#13241) No new revisions were added by this update. Summary of changes: docs/source/developers/guide/resources.rst | 2 +- r/NEWS.md | 4 +++- r/R/install-arrow.R| 2 +- r/README.md| 4 ++-- r/tools/nixlibs.R | 2 +- r/tools/winlibs.R | 2 +- r/vignettes/developers/install_details.Rmd | 15 --- r/vignettes/developers/setup.Rmd | 21 +++-- r/vignettes/install.Rmd| 2 +- 9 files changed, 25 insertions(+), 29 deletions(-)
[arrow] branch master updated: ARROW-16828: [R][Packaging] Enable Brotli and BZ2 on MacOS and Windows (#13484)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new fdcf63a1ed ARROW-16828: [R][Packaging] Enable Brotli and BZ2 on MacOS and Windows (#13484) fdcf63a1ed is described below commit fdcf63a1ed94a17a0f05ed78a82d8af730f048a4 Author: Will Jones AuthorDate: Fri Jul 8 10:46:01 2022 -0700 ARROW-16828: [R][Packaging] Enable Brotli and BZ2 on MacOS and Windows (#13484) MacOS was missing Brotli and BZ2. Windows was missing BZ2. After this, MacOS and Windows will have all compressions shipped in binaries. Authored-by: Will Jones Signed-off-by: Neal Richardson --- ci/scripts/PKGBUILD | 2 ++ ci/scripts/r_windows_build.sh| 6 +++--- dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb | 3 +++ r/configure.win | 2 +- r/tests/testthat/test-compressed.R | 2 ++ r/tools/autobrew | 2 +- 6 files changed, 12 insertions(+), 5 deletions(-) diff --git a/ci/scripts/PKGBUILD b/ci/scripts/PKGBUILD index ea17fba17e..428447d263 100644 --- a/ci/scripts/PKGBUILD +++ b/ci/scripts/PKGBUILD @@ -25,6 +25,7 @@ arch=("any") url="https://arrow.apache.org/; license=("Apache-2.0") depends=("${MINGW_PACKAGE_PREFIX}-aws-sdk-cpp" + "${MINGW_PACKAGE_PREFIX}-bzip2" "${MINGW_PACKAGE_PREFIX}-curl" # for google-cloud-cpp bundled build "${MINGW_PACKAGE_PREFIX}-libutf8proc" "${MINGW_PACKAGE_PREFIX}-re2" @@ -123,6 +124,7 @@ build() { -DARROW_WITH_ZLIB=ON \ -DARROW_WITH_ZSTD=ON \ -DARROW_WITH_BROTLI=ON \ +-DARROW_WITH_BZ2=ON \ -DARROW_ZSTD_USE_SHARED=OFF \ -DARROW_CXXFLAGS="${CPPFLAGS}" \ -DCMAKE_BUILD_TYPE="release" \ diff --git a/ci/scripts/r_windows_build.sh b/ci/scripts/r_windows_build.sh index 3334eab866..c361af1d26 100755 --- a/ci/scripts/r_windows_build.sh +++ b/ci/scripts/r_windows_build.sh @@ -87,7 +87,7 @@ if [ -d mingw64/lib/ ]; then # These may be from https://dl.bintray.com/rtools/backports/ cp 
$MSYS_LIB_DIR/mingw64/lib/lib{thrift,snappy}.a $DST_DIR/${RWINLIB_LIB_DIR}/x64 # These are from https://dl.bintray.com/rtools/mingw{32,64}/ - cp $MSYS_LIB_DIR/mingw64/lib/lib{zstd,lz4,brotli*,crypto,curl,ss*,utf8proc,re2,aws*}.a $DST_DIR/lib/x64 + cp $MSYS_LIB_DIR/mingw64/lib/lib{zstd,lz4,brotli*,bz2,crypto,curl,ss*,utf8proc,re2,aws*}.a $DST_DIR/lib/x64 fi # Same for the 32-bit versions @@ -97,7 +97,7 @@ if [ -d mingw32/lib/ ]; then mkdir -p $DST_DIR/lib/i386 mv mingw32/lib/*.a $DST_DIR/${RWINLIB_LIB_DIR}/i386 cp $MSYS_LIB_DIR/mingw32/lib/lib{thrift,snappy}.a $DST_DIR/${RWINLIB_LIB_DIR}/i386 - cp $MSYS_LIB_DIR/mingw32/lib/lib{zstd,lz4,brotli*,crypto,curl,ss*,utf8proc,re2,aws*}.a $DST_DIR/lib/i386 + cp $MSYS_LIB_DIR/mingw32/lib/lib{zstd,lz4,brotli*,bz2,crypto,curl,ss*,utf8proc,re2,aws*}.a $DST_DIR/lib/i386 fi # Do the same also for ucrt64 @@ -105,7 +105,7 @@ if [ -d ucrt64/lib/ ]; then ls $MSYS_LIB_DIR/ucrt64/lib/ mkdir -p $DST_DIR/lib/x64-ucrt mv ucrt64/lib/*.a $DST_DIR/lib/x64-ucrt - cp $MSYS_LIB_DIR/ucrt64/lib/lib{thrift,snappy,zstd,lz4,brotli*,crypto,curl,ss*,utf8proc,re2,aws*}.a $DST_DIR/lib/x64-ucrt + cp $MSYS_LIB_DIR/ucrt64/lib/lib{thrift,snappy,zstd,lz4,brotli*,bz2,crypto,curl,ss*,utf8proc,re2,aws*}.a $DST_DIR/lib/x64-ucrt fi # Create build artifact diff --git a/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb b/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb index 45c04463b6..dde994ab43 100644 --- a/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb +++ b/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb @@ -31,6 +31,7 @@ class ApacheArrow < Formula # NOTE: if you add something here, be sure to add to PKG_LIBS in r/tools/autobrew depends_on "boost" => :build + depends_on "brotli" depends_on "cmake" => :build depends_on "aws-sdk-cpp" depends_on "lz4" @@ -57,6 +58,8 @@ class ApacheArrow < Formula -DARROW_S3=ON -DARROW_USE_GLOG=OFF -DARROW_VERBOSE_THIRDPARTY_BUILD=ON + -DARROW_WITH_BROTLI=ON + -DARROW_WITH_BZ2=ON -DARROW_WITH_LZ4=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON diff --git a/r/configure.win b/r/configure.win index dfd2c87ab4..7aa7e47fc1 100755 --- a/r/configure.win +++ b/r/configure.win @@ -64,7 +64,7 @@ function configure_release() { PKG_LIBS="-L${RWINLIB}/lib"'$(subst gcc,,$(COMPILED_BY))$(R_ARCH) ' PKG_LIBS="$PKG_LIBS -L${RWINLIB}/lib"'$(R_ARCH)$(CRT) ' PKG_LIBS="$PKG_LIBS -lparquet -larrow_dataset -larrow -larrow_bundled_depend
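Once the macOS and Windows binaries ship with Brotli and BZ2, the corresponding codecs become usable from R. A sketch, assuming `codec_is_available()` reports the codec on the installed build:

```r
library(arrow)

codec_is_available("brotli")   # TRUE when built with -DARROW_WITH_BROTLI=ON
tf <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tf, compression = "brotli")
```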
[arrow] branch master updated: ARROW-16268: [R] Remove long-deprecated functions (#13550)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new a48c09e6aa ARROW-16268: [R] Remove long-deprecated functions (#13550) a48c09e6aa is described below commit a48c09e6aa3180512354ef9c1ded2f479d09c25e Author: Neal Richardson AuthorDate: Fri Jul 8 13:27:26 2022 -0400 ARROW-16268: [R] Remove long-deprecated functions (#13550) Also has a fix for the check NOTE about union_all and distinct. Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/DESCRIPTION | 3 +-- r/NAMESPACE| 4 +-- r/NEWS.md | 11 r/R/dataset-scan.R | 18 - r/R/deprecated.R | 40 r/R/dplyr-union.R | 2 +- r/man/ArrayData.Rd | 6 +++-- r/man/FileSystem.Rd| 1 + r/man/Scalar.Rd| 6 +++-- r/man/Scanner.Rd | 3 --- r/man/array.Rd | 6 +++-- r/man/arrow-package.Rd | 2 +- r/man/arrow_info.Rd| 3 +++ r/man/read_ipc_stream.Rd | 11 +++- r/man/write_ipc_stream.Rd | 7 ++--- r/tests/testthat/test-Table.R | 53 +++--- r/tests/testthat/test-arrow-info.R | 4 +++ r/tests/testthat/test-dataset.R| 18 - r/tests/testthat/test-type.R | 9 +++ 19 files changed, 56 insertions(+), 151 deletions(-) diff --git a/r/DESCRIPTION b/r/DESCRIPTION index 5385877696..2cbbec054a 100644 --- a/r/DESCRIPTION +++ b/r/DESCRIPTION @@ -40,7 +40,7 @@ Imports: utils, vctrs Roxygen: list(markdown = TRUE, r6 = FALSE, load = "source") -RoxygenNote: 7.1.2 +RoxygenNote: 7.2.0 Config/testthat/edition: 3 VignetteBuilder: knitr Suggests: @@ -88,7 +88,6 @@ Collate: 'dataset-partition.R' 'dataset-scan.R' 'dataset-write.R' -'deprecated.R' 'dictionary.R' 'dplyr-arrange.R' 'dplyr-collect.R' diff --git a/r/NAMESPACE b/r/NAMESPACE index e98cdd51fb..5762df9eb0 100644 --- a/r/NAMESPACE +++ b/r/NAMESPACE @@ -195,6 +195,7 @@ export(FileType) export(FixedSizeListArray) export(FixedSizeListType) export(FragmentScanOptions) +export(GcsFileSystem) export(HivePartitioning) 
export(HivePartitioningFactory) export(InMemoryDataset) @@ -251,6 +252,7 @@ export(arrow_available) export(arrow_info) export(arrow_table) export(arrow_with_dataset) +export(arrow_with_gcs) export(arrow_with_json) export(arrow_with_parquet) export(arrow_with_s3) @@ -330,7 +332,6 @@ export(null) export(num_range) export(one_of) export(open_dataset) -export(read_arrow) export(read_csv_arrow) export(read_delim_arrow) export(read_feather) @@ -366,7 +367,6 @@ export(utf8) export(value_counts) export(vctrs_extension_array) export(vctrs_extension_type) -export(write_arrow) export(write_csv_arrow) export(write_dataset) export(write_feather) diff --git a/r/NEWS.md b/r/NEWS.md index d88be22964..45a963ca48 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -22,6 +22,7 @@ * `lubridate::parse_date_time()` datetime parser: * `orders` with year, month, day, hours, minutes, and seconds components are supported. * the `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (like is the case in `lubridate::parse_date_time()`). +* `read_arrow()` and `write_arrow()`, deprecated since 1.0.0 (July 2020), have been removed. Use the `read/write_feather()` and `read/write_ipc_stream()` functions depending on whether you're working with the Arrow IPC file or stream format, respectively. # arrow 8.0.0 @@ -50,7 +51,7 @@ ## Enhancements to date and time support -* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` +* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`. * For Arrow dplyr queries, added additional `{lubridate}` features and fixes: * New component extraction functions: @@ -86,14 +87,14 @@ record batches, arrays, chunked arrays, record batch readers, schemas, and data types. 
This allows other packages to define custom conversions from their types to Arrow objects, including extension arrays. -* Custom [extension types and arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types) +* Custom [extension types and arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types) can be created and registered, allowing other packages to define their own array types. Extension arrays wrap regular Arrow array types and pr
[arrow] branch master updated: MINOR: [R][CI] Add all available package versions to PACKAGES (#13551)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new 0fdb9cc08b MINOR: [R][CI] Add all available package versions to PACKAGES (#13551)

0fdb9cc08b is described below

commit 0fdb9cc08be53ff374d45af109d5ce2d6bb29a82
Author: Jacob Wujciak-Jens
AuthorDate: Fri Jul 8 16:42:38 2022 +0200

    MINOR: [R][CI] Add all available package versions to PACKAGES (#13551)

    This overrides the default `latestOnly = TRUE` so all available R package versions are added to the repository index.

    Authored-by: Jacob Wujciak-Jens
    Signed-off-by: Neal Richardson
---
 .github/workflows/r_nightly.yml | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/r_nightly.yml b/.github/workflows/r_nightly.yml
index fc93dde017..a47f69136f 100644
--- a/.github/workflows/r_nightly.yml
+++ b/.github/workflows/r_nightly.yml
@@ -158,7 +158,11 @@ jobs:
         run: |
           # folder that we sync to nightlies.apache.org
           repo_root <- "repo"
-          tools::write_PACKAGES(file.path(repo_root, "src/contrib"), type = "source", verbose = TRUE)
+          tools::write_PACKAGES(file.path(repo_root, "src/contrib"),
+            type = "source",
+            verbose = TRUE,
+            latestOnly = FALSE
+          )
           repo_dirs <- list.dirs(repo_root)

           # find dirs with binary R packages: e.g. */contrib/4.1
@@ -167,7 +171,11 @@ jobs:

           for (dir in pkg_dirs) {
             on_win <- grepl("windows", dir)
-            tools::write_PACKAGES(dir, type = ifelse(on_win, "win.binary", "mac.binary"), verbose = TRUE )
+            tools::write_PACKAGES(dir,
+              type = ifelse(on_win, "win.binary", "mac.binary"),
+              verbose = TRUE,
+              latestOnly = FALSE
+            )
           }
       - name: Show repo contents
         run: tree repo
[arrow] branch master updated (1a35aa6c57 -> 2aa7923fb6)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git

  from 1a35aa6c57 ARROW-16679: [R] configure fails if CDPATH is not null (#13313)
   add 2aa7923fb6 MINOR: [R] Fix nightly failures with r_vec_size (#13538)

No new revisions were added by this update.

Summary of changes:
 r/src/arrowExports.cpp | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
[arrow] branch master updated: ARROW-16679: [R] configure fails if CDPATH is not null (#13313)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new 1a35aa6c57 ARROW-16679: [R] configure fails if CDPATH is not null (#13313)

1a35aa6c57 is described below

commit 1a35aa6c57379d922f3086da708077e8786aa06e
Author: Jacob Wujciak-Jens
AuthorDate: Thu Jul 7 22:44:02 2022 +0200

    ARROW-16679: [R] configure fails if CDPATH is not null (#13313)

    Authored-by: Jacob Wujciak-Jens
    Signed-off-by: Neal Richardson
---
 r/configure | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/r/configure b/r/configure
index d62c58eeda..68dfd5f5ee 100755
--- a/r/configure
+++ b/r/configure
@@ -177,7 +177,8 @@ else
   # Assume nixlibs.R has handled and messaged about its failure already
   #
   # TODO: what about non-bundled deps?
-  BUNDLED_LIBS=`cd $LIB_DIR && ls *.a`
+  # Set CDPATH locally to prevent interference from global CDPATH (if set)
+  BUNDLED_LIBS=`CDPATH=''; cd $LIB_DIR && ls *.a`
   BUNDLED_LIBS=`echo "$BUNDLED_LIBS" | sed -e "s/\\.a lib/ -l/g" | sed -e "s/\\.a$//" | sed -e "s/^lib/-l/" | tr '\n' ' ' | sed -e "s/ $//"`
   PKG_DIRS="-L`pwd`/$LIB_DIR"
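The `CDPATH=''` fix above is easy to reproduce outside of Arrow. Below is a minimal sketch (the demo directory and `.a` file name are invented): when `CDPATH` is set to a non-empty value and `cd` resolves its argument through it, POSIX requires `cd` to print the resolved directory to stdout, so the backtick substitution captures that path in addition to the `ls` output.

```shell
set -e
# Invented demo layout standing in for r/configure's $LIB_DIR.
demo=$(mktemp -d)
mkdir -p "$demo/libdir"
touch "$demo/libdir/libarrow_demo.a"

# Broken: the CDPATH lookup makes `cd` echo "$demo/libdir" into the capture,
# so the variable holds a stray path alongside the .a file names.
broken=`cd /; CDPATH=$demo; cd libdir && ls *.a`

# Fixed, mirroring r/configure: clear CDPATH locally before the cd.
fixed=`cd "$demo"; CDPATH=''; cd libdir && ls *.a`

echo "broken: $broken"
echo "fixed: $fixed"
rm -rf "$demo"
```

Because the `CDPATH=''` assignment happens inside the substitution's subshell, the caller's environment is left untouched.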
[arrow] branch master updated: ARROW-16752: [R] Rework Linux binary installation (#13464)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new c492ef497a ARROW-16752: [R] Rework Linux binary installation (#13464) c492ef497a is described below commit c492ef497a62e600c9436f2a92dace1190c7a465 Author: Neal Richardson AuthorDate: Wed Jul 6 10:16:31 2022 -0400 ARROW-16752: [R] Rework Linux binary installation (#13464) See the jira for the main behavior changes here. Other changes of note: * There are more brief messages printed to the installation log, even in the default "quiet" mode, that indicate which branch of the logic in nixlibs.R you've gone through. They're factual and generally connected to the tests that are being run, but they are worded somewhat ambiguously or coded, so as not to run afoul of censors should they appear in the wrong context. This should help us in the triaging of installation failures, even in circumstances where we can't enable greater verbosity. * There is a start of a test suite for nixlibs.R, run separately from the package tests. It has been wired up to run in `ci/scripts/r_test.sh`. 
Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- ci/scripts/r_docker_configure.sh | 21 +-- ci/scripts/r_test.sh | 3 + dev/release/rat_exclude_files.txt | 1 + dev/tasks/macros.jinja| 18 +-- dev/tasks/r/github.packages.yml | 70 ++ r/tools/nixlibs-allowlist.txt | 4 + r/tools/nixlibs.R | 268 +++--- r/tools/test-nixlibs.R| 112 r/vignettes/install.Rmd | 112 9 files changed, 434 insertions(+), 175 deletions(-) diff --git a/ci/scripts/r_docker_configure.sh b/ci/scripts/r_docker_configure.sh index 9f93ba2b61..2bc5a4806f 100755 --- a/ci/scripts/r_docker_configure.sh +++ b/ci/scripts/r_docker_configure.sh @@ -19,12 +19,14 @@ set -ex : ${R_BIN:=R} +# This is where our docker setup puts things; set this to run outside of docker +: ${ARROW_SOURCE_HOME:=/arrow} # The Dockerfile should have put this file here -if [ -f "/arrow/ci/etc/rprofile" ]; then +if [ -f "${ARROW_SOURCE_HOME}/ci/etc/rprofile" ]; then # Ensure parallel R package installation, set CRAN repo mirror, # and use pre-built binaries where possible - cat /arrow/ci/etc/rprofile >> $(${R_BIN} RHOME)/etc/Rprofile.site + cat ${ARROW_SOURCE_HOME}/ci/etc/rprofile >> $(${R_BIN} RHOME)/etc/Rprofile.site fi # Ensure parallel compilation of C/C++ code @@ -74,6 +76,9 @@ if [ "$RHUB_PLATFORM" = "linux-x86_64-fedora-clang" ]; then sed -i.bak -E -e 's/(CXX1?1? 
=.*)/\1 -stdlib=libc++/g' $(${R_BIN} RHOME)/etc/Makeconf rm -rf $(${R_BIN} RHOME)/etc/Makeconf.bak + sed -i.bak -E -e 's/(\-std=gnu\+\+)/-std=c++/g' $(${R_BIN} RHOME)/etc/Makeconf + rm -rf $(${R_BIN} RHOME)/etc/Makeconf.bak + sed -i.bak -E -e 's/(CXXFLAGS = )(.*)/\1 -g -O3 -Wall -pedantic -frtti -fPIC/' $(${R_BIN} RHOME)/etc/Makeconf rm -rf $(${R_BIN} RHOME)/etc/Makeconf.bak @@ -88,8 +93,8 @@ if [[ "$DEVTOOLSET_VERSION" -gt 0 ]]; then $PACKAGE_MANAGER install -y "devtoolset-$DEVTOOLSET_VERSION" fi -if [ "$ARROW_S3" == "ON" ] || [ "$ARROW_R_DEV" == "TRUE" ]; then - # Install curl and openssl for S3 support +if [ "$ARROW_S3" == "ON" ] || [ "$ARROW_GCS" == "ON" ] || [ "$ARROW_R_DEV" == "TRUE" ]; then + # Install curl and openssl for S3/GCS support if [ "$PACKAGE_MANAGER" = "apt-get" ]; then apt-get install -y libcurl4-openssl-dev libssl-dev else @@ -97,12 +102,12 @@ if [ "$ARROW_S3" == "ON" ] || [ "$ARROW_R_DEV" == "TRUE" ]; then fi # The Dockerfile should have put this file here - if [ -f "/arrow/ci/scripts/install_minio.sh" ] && [ "`which wget`" ]; then -/arrow/ci/scripts/install_minio.sh latest /usr/local + if [ -f "${ARROW_SOURCE_HOME}/ci/scripts/install_minio.sh" ] && [ "`which wget`" ]; then +${ARROW_SOURCE_HOME}/ci/scripts/install_minio.sh latest /usr/local fi - if [ -f "/arrow/ci/scripts/install_gcs_testbench.sh" ] && [ "`which pip`" ]; then -/arrow/ci/scripts/install_gcs_testbench.sh default + if [ -f "${ARROW_SOURCE_HOME}/ci/scripts/install_gcs_testbench.sh" ] && [ "`which pip`" ]; then +${ARROW_SOURCE_HOME}/ci/scripts/install_gcs_testbench.sh default fi fi diff --git a/ci/scripts/r_test.sh b/ci/scripts/r_test.sh index 8429187d88..0328df2384 100755 --- a/ci/scripts/r_test.sh +++ b/ci/scripts/r_test.sh @@ -26,6 +26,
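The `: ${ARROW_SOURCE_HOME:=/arrow}` line added above relies on a standard POSIX idiom: `:` is a no-op whose arguments are still expanded, and `${VAR:=default}` assigns the default only when `VAR` is unset or empty. That is what lets the script run both inside Docker (where `/arrow` is correct) and outside it (where the caller exports an override). A small sketch, with a hypothetical override path:

```shell
set -e
# Case 1: variable unset, so the default is assigned.
unset ARROW_SOURCE_HOME
: ${ARROW_SOURCE_HOME:=/arrow}
echo "$ARROW_SOURCE_HOME"   # -> /arrow

# Case 2: caller already set it (hypothetical path); the default is a no-op.
ARROW_SOURCE_HOME="$HOME/src/arrow"
: ${ARROW_SOURCE_HOME:=/arrow}
echo "$ARROW_SOURCE_HOME"   # unchanged: still $HOME/src/arrow
```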
[arrow] branch master updated: ARROW-16871: [R] Implement exp() and sqrt() in Arrow dplyr queries (#13517)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 7d1d03f05a ARROW-16871: [R] Implement exp() and sqrt() in Arrow dplyr queries (#13517) 7d1d03f05a is described below commit 7d1d03f05ada61aa11b2ac432faf349eda8f030e Author: Christopher D. Higgins <40569964+higg...@users.noreply.github.com> AuthorDate: Tue Jul 5 16:49:14 2022 -0400 ARROW-16871: [R] Implement exp() and sqrt() in Arrow dplyr queries (#13517) In response to https://issues.apache.org/jira/browse/ARROW-16871 - implement `sqrt` and `exp` bindings for `dplyr` - change `sqrt` in `arrow-datum.R` to use `sqrt_checked` rather than `power_checked` - write tests for `sqrt` and `exp` Authored-by: Christopher D. Higgins <40569964+higg...@users.noreply.github.com> Signed-off-by: Neal Richardson --- r/R/arrow-datum.R| 2 +- r/R/dplyr-funcs-math.R | 15 +++ r/tests/testthat/test-dplyr-funcs-math.R | 22 ++ 3 files changed, 38 insertions(+), 1 deletion(-) diff --git a/r/R/arrow-datum.R b/r/R/arrow-datum.R index 4ec5f8f9d6..39362628bb 100644 --- a/r/R/arrow-datum.R +++ b/r/R/arrow-datum.R @@ -123,7 +123,7 @@ Math.ArrowDatum <- function(x, ..., base = exp(1), digits = 0) { x, options = list(ndigits = digits, round_mode = RoundMode$HALF_TO_EVEN) ), -sqrt = eval_array_expression("power_checked", x, 0.5), +sqrt = eval_array_expression("sqrt_checked", x), exp = eval_array_expression("power_checked", exp(1), x), signif = , expm1 = , diff --git a/r/R/dplyr-funcs-math.R b/r/R/dplyr-funcs-math.R index b92c202d04..0ba2ddc856 100644 --- a/r/R/dplyr-funcs-math.R +++ b/r/R/dplyr-funcs-math.R @@ -80,4 +80,19 @@ register_bindings_math <- function() { options = list(ndigits = digits, round_mode = RoundMode$HALF_TO_EVEN) ) }) + + register_binding("sqrt", function(x) { +build_expr( + "sqrt_checked", + x +) + }) + + register_binding("exp", function(x) { 
+build_expr( + "power_checked", + exp(1), + x +) + }) } diff --git a/r/tests/testthat/test-dplyr-funcs-math.R b/r/tests/testthat/test-dplyr-funcs-math.R index dd982c9942..47a9f0b7c0 100644 --- a/r/tests/testthat/test-dplyr-funcs-math.R +++ b/r/tests/testthat/test-dplyr-funcs-math.R @@ -330,3 +330,25 @@ test_that("floor division maintains type consistency with R", { df ) }) + +test_that("exp()", { + df <- tibble(x = c(1:5, NA)) + + compare_dplyr_binding( +.input %>% + mutate(y = exp(x)) %>% + collect(), +df + ) +}) + +test_that("sqrt()", { + df <- tibble(x = c(1:5, NA)) + + compare_dplyr_binding( +.input %>% + mutate(y = sqrt(x)) %>% + collect(), +df + ) +})
[arrow] branch master updated: ARROW-16912: [R][CI] Fix nightly centos package without GCS (#13441)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 2c67e72f3a ARROW-16912: [R][CI] Fix nightly centos package without GCS (#13441) 2c67e72f3a is described below commit 2c67e72f3aa75029f277653c9c32af29c485721f Author: Neal Richardson AuthorDate: Wed Jun 29 17:46:52 2022 -0400 ARROW-16912: [R][CI] Fix nightly centos package without GCS (#13441) cc @assignUser Most of the diff seems to be my editor trimming whitespace. The actual changes: * Rename `r-nightly-packages` to `r-binary-packages` since they can be run on demand (not only nightly) * Add it to the `r` crossbow group * Turn ARROW_GCS=OFF in the centos-7 package. Where this setting happens is not obvious. Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- .github/workflows/r_nightly.yml | 10 +- dev/tasks/r/github.packages.yml | 19 +-- dev/tasks/tasks.yml | 5 +++-- docker-compose.yml | 14 -- 4 files changed, 25 insertions(+), 23 deletions(-) diff --git a/.github/workflows/r_nightly.yml b/.github/workflows/r_nightly.yml index e4693a155f..9ee0968d85 100644 --- a/.github/workflows/r_nightly.yml +++ b/.github/workflows/r_nightly.yml @@ -17,7 +17,7 @@ name: Upload R Nightly builds # This workflow downloads the (nightly) binaries created in crossbow and uploads them -# to nightlies.apache.org. Due to authorization requirements, this upload can't be done +# to nightlies.apache.org. Due to authorization requirements, this upload can't be done # from the crossbow repository. 
@@ -51,7 +51,7 @@ jobs: fetch-depth: 0 path: crossbow repository: ursacomputing/crossbow - ref: master + ref: master - name: Set up Python uses: actions/setup-python@v3 with: @@ -70,7 +70,7 @@ jobs: fi echo $PREFIX - archery crossbow download-artifacts -f r-nightly-packages -t binaries $PREFIX + archery crossbow download-artifacts -f r-binary-packages -t binaries $PREFIX if [ -n "$(ls -A binaries/*/*/)" ]; then echo "Found files!" @@ -83,12 +83,12 @@ jobs: run: | # folder that we rsync to nightlies.apache.org repo_root <- "repo" - # The binaries are in a nested dir + # The binaries are in a nested dir # so we need to find the correct path. art_path <- list.files("binaries", recursive = TRUE, include.dirs = TRUE, -pattern = "r-nightly-packages$", +pattern = "r-binary-packages$", full.names = TRUE ) diff --git a/dev/tasks/r/github.packages.yml b/dev/tasks/r/github.packages.yml index 4f5caa0e1c..76beb6400c 100644 --- a/dev/tasks/r/github.packages.yml +++ b/dev/tasks/r/github.packages.yml @@ -18,7 +18,7 @@ {% import 'macros.jinja' as macros with context %} # This allows us to set a custom version via param: -# crossbow submit --param custom_version=8.5.3 r-nightly-packages +# crossbow submit --param custom_version=8.5.3 r-binary-packages # if the param is unset defaults to the usual Ymd naming scheme {% set package_version = custom_version|default("\\2.\'\"$(date +%Y%m%d)\"\'") %} # We need this as boolean and string @@ -44,7 +44,7 @@ jobs: - name: Save Version id: save-version shell: bash -run: | +run: | echo "::set-output name=pkg_version::$(grep ^Version arrow/r/DESCRIPTION | sed s/Version:\ //)" - uses: r-lib/actions/setup-r@v2 @@ -99,7 +99,7 @@ jobs: cd arrow/r/libarrow/dist # These files were created by the docker user so we have to sudo to get them sudo -E zip -r $PKG_FILE lib/ include/ - + - name: Upload binary artifact uses: actions/upload-artifact@v3 with: @@ -131,7 +131,7 @@ jobs: uses: actions/upload-artifact@v3 with: name: r-lib__libarrow__bin__windows - 
path: build/arrow-*.zip + path: build/arrow-*.zip r-packages: needs: [source, windows-cpp] @@ -158,7 +158,7 @@ jobs: - name: Build Binary id: build shell: Rscript {0} -env: +env: ARROW_R_DEV: TRUE run: | on_windows <- tolower(Sys.info()[["sysname"]]) == "windows" @@ -171,7 +171,7 @@ jobs: cat("Remove old arrow version.\n") remove.packages("arrow") - + # Build Sys.setenv(MAKEFLAGS = paste0("-j", parallel::detect
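The `Save Version` step above shells out `grep ^Version arrow/r/DESCRIPTION | sed s/Version:\ //` and publishes the result through the (since-deprecated) `::set-output` workflow command. A self-contained sketch of the same pipeline, substituting an inline, made-up DESCRIPTION snippet for the real file:

```shell
set -e
# Fabricated DESCRIPTION contents; the real r/DESCRIPTION differs.
description='Package: arrow
Title: Integration to Apache Arrow
Version: 8.0.0.9000'

# Same grep | sed pipeline as the workflow step, minus the file read.
pkg_version=$(printf '%s\n' "$description" | grep '^Version' | sed 's/Version: //')
echo "pkg_version=$pkg_version"   # -> pkg_version=8.0.0.9000

# The runner turns this stdout line into steps.save-version.outputs.pkg_version.
echo "::set-output name=pkg_version::$pkg_version"
```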
[arrow] branch dont-r-nightly-on-fork created (now bc99176fe5)
This is an automated email from the ASF dual-hosted git repository.

npr pushed a change to branch dont-r-nightly-on-fork in repository https://gitbox.apache.org/repos/asf/arrow.git

      at bc99176fe5 Update r_nightly.yml

This branch includes the following new commits:

     new bc99176fe5 Update r_nightly.yml

The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[arrow] 01/01: Update r_nightly.yml
This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch dont-r-nightly-on-fork in repository https://gitbox.apache.org/repos/asf/arrow.git

commit bc99176fe5b26a6ec45ee3a877b8c74bf6036a79
Author: Neal Richardson
AuthorDate: Tue Jun 28 20:21:11 2022 -0400

    Update r_nightly.yml
---
 .github/workflows/r_nightly.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/workflows/r_nightly.yml b/.github/workflows/r_nightly.yml
index 0f657a85ad..e4693a155f 100644
--- a/.github/workflows/r_nightly.yml
+++ b/.github/workflows/r_nightly.yml
@@ -34,6 +34,7 @@ on:

 jobs:
   upload:
+    if: github.repository == 'apache/arrow'
     runs-on: ubuntu-latest
     steps:
       - name: Checkout Arrow
[arrow] branch master updated: ARROW-16510: [R] Add bindings for GCS filesystem (#13404)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 3ac0959ac1 ARROW-16510: [R] Add bindings for GCS filesystem (#13404) 3ac0959ac1 is described below commit 3ac0959ac168caebb19dfbfbc8881323e694a4ae Author: Neal Richardson AuthorDate: Sun Jun 26 09:43:31 2022 -0400 ARROW-16510: [R] Add bindings for GCS filesystem (#13404) This adds basic bindings for GcsFileSystem to R, turns it on in the macOS, Windows, and Linux packaging (same handling as ARROW_S3), and basic R tests. Followups: - Bindings for FromImpersonatedServiceAccount (ARROW-16885) - Set up testbench for fuller tests, like how we do with minio (ARROW-16879) - GcsFileSystem::Make should return Result (ARROW-16884) - Explore auth integration/compatibility with `gargle`, `googleAuthR`, etc.: can we pick up the same credentials they use (ARROW-16880) - macOS binary packaging: push dependencies upstream (ARROW-16883) - Windows binary packaging: push dependencies upstream (ARROW-16878) - Update cloud/filesystem documentation (ARROW-16887) Lead-authored-by: Neal Richardson Co-authored-by: Sutou Kouhei Signed-off-by: Neal Richardson --- .github/workflows/cpp.yml | 8 +- .github/workflows/r.yml| 2 +- ci/scripts/PKGBUILD| 5 + ci/scripts/r_windows_build.sh | 6 +- .../google-cloud-cpp-curl-static-windows.patch | 31 +++ cpp/cmake_modules/ThirdpartyToolchain.cmake| 276 + cpp/src/arrow/filesystem/gcsfs.h | 1 + cpp/src/arrow/filesystem/type_fwd.h| 1 + cpp/thirdparty/versions.txt| 4 +- .../homebrew-formulae/autobrew/apache-arrow.rb | 1 + dev/tasks/r/github.macos.brew.yml | 2 + dev/tasks/tasks.yml| 2 +- r/R/arrow-info.R | 11 +- r/R/arrowExports.R | 5 + r/R/filesystem.R | 79 +- r/configure| 36 +-- r/configure.win| 13 +- r/data-raw/codegen.R | 63 ++--- r/inst/build_arrow_static.sh | 1 + r/src/arrowExports.cpp | 27 ++ r/src/filesystem.cpp | 81 
++ r/tests/testthat/test-gcs.R| 60 + r/tools/autobrew | 1 + r/tools/nixlibs.R | 42 +++- r/vignettes/developers/setup.Rmd | 2 + r/vignettes/install.Rmd| 101 26 files changed, 627 insertions(+), 234 deletions(-) diff --git a/.github/workflows/cpp.yml b/.github/workflows/cpp.yml index b914b7df52..acb3270a5d 100644 --- a/.github/workflows/cpp.yml +++ b/.github/workflows/cpp.yml @@ -276,8 +276,12 @@ jobs: ARROW_DATASET: ON ARROW_FLIGHT: ON ARROW_GANDIVA: ON - # google-could-cpp uses _dupenv_s() but it can't be used with msvcrt. - # We need to use ucrt to use _dupenv_s(). + # With GCS on, + # * MinGW 32 build OOMs (maybe turn off unity build?) + # * MinGW 64 fails to compile the GCS filesystem tests, some conflict + # with boost. First error says: + # D:/a/_temp/msys64/mingw64/include/boost/asio/detail/socket_types.hpp:24:4: error: #error WinSock.h has already been included + # TODO(ARROW-16906) # ARROW_GCS: ON ARROW_HDFS: OFF ARROW_HOME: /mingw${{ matrix.mingw-n-bits }} diff --git a/.github/workflows/r.yml b/.github/workflows/r.yml index 48d9672c74..86e006d538 100644 --- a/.github/workflows/r.yml +++ b/.github/workflows/r.yml @@ -165,7 +165,7 @@ jobs: name: AMD64 Windows C++ RTools ${{ matrix.config.rtools }} ${{ matrix.config.arch }} runs-on: windows-2019 if: ${{ !contains(github.event.pull_request.title, 'WIP') }} -timeout-minutes: 60 +timeout-minutes: 90 strategy: fail-fast: false matrix: diff --git a/ci/scripts/PKGBUILD b/ci/scripts/PKGBUILD index b9b0194f5c..ea17fba17e 100644 --- a/ci/scripts/PKGBUILD +++ b/ci/scripts/PKGBUILD @@ -25,6 +25,7 @@ arch=("any") url="https://arrow.apache.org/; license=("Apache-2.0") depends=("${MINGW_PACKAGE_PREFIX}-aws-sdk-cpp" + "${MINGW_PACKAGE_PREFIX}-curl" # for google-cloud-cpp bundled build "${MINGW_PACKAGE_PREFIX}-libutf8proc" "${MINGW_PACKAGE_PREFIX}-re2" "${MINGW_PACKAGE_PREFIX}-thrift" @@ -79,11 +80,13 @@ build() { export
[arrow] branch master updated: ARROW-16900: [R] Upgrade lintr (#13432)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 241c8e6242 ARROW-16900: [R] Upgrade lintr (#13432) 241c8e6242 is described below commit 241c8e6242044530e4a9ea13661ca78a100f Author: Neal Richardson AuthorDate: Fri Jun 24 12:13:54 2022 -0400 ARROW-16900: [R] Upgrade lintr (#13432) Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- ci/docker/linux-apt-lint.dockerfile | 16 ++-- r/.lintr| 6 +++--- r/lint.sh | 2 +- r/vignettes/developers/workflow.Rmd | 4 +--- 4 files changed, 7 insertions(+), 21 deletions(-) diff --git a/ci/docker/linux-apt-lint.dockerfile b/ci/docker/linux-apt-lint.dockerfile index 249072ae32..8a679be2eb 100644 --- a/ci/docker/linux-apt-lint.dockerfile +++ b/ci/docker/linux-apt-lint.dockerfile @@ -56,20 +56,8 @@ COPY ci/etc/rprofile /arrow/ci/etc/ RUN cat /arrow/ci/etc/rprofile >> $(R RHOME)/etc/Rprofile.site # Also ensure parallel compilation of C/C++ code RUN echo "MAKEFLAGS=-j$(R -s -e 'cat(parallel::detectCores())')" >> $(R RHOME)/etc/Renviron.site - - -COPY ci/scripts/r_deps.sh /arrow/ci/scripts/ -COPY r/DESCRIPTION /arrow/r/ -# We need to install Arrow's dependencies in order for lintr's namespace searching to work. 
-# This could be removed if lintr no longer loads the dependency namespaces (see issues/PRs below) -RUN /arrow/ci/scripts/r_deps.sh /arrow -# This fork has a number of changes that have PRs and Issues to resolve upstream: -# https://github.com/jimhester/lintr/pull/843 -# https://github.com/jimhester/lintr/pull/841 -# https://github.com/jimhester/lintr/pull/845 -# https://github.com/jimhester/lintr/issues/842 -# https://github.com/jimhester/lintr/issues/846 -RUN R -e "remotes::install_github('jonkeane/lintr@arrow-branch')" +# We don't need arrow's dependencies, only lintr (and its dependencies) +RUN R -e "install.packages('lintr')" # Docker linter COPY --from=hadolint /bin/hadolint /usr/bin/hadolint diff --git a/r/.lintr b/r/.lintr index 0298fd7f99..619339afca 100644 --- a/r/.lintr +++ b/r/.lintr @@ -14,7 +14,7 @@ license: # Licensed to the Apache Software Foundation (ASF) under one # KIND, either express or implied. See the License for the # specific language governing permissions and limitations # under the License. 
-linters: with_defaults( +linters: linters_with_defaults( line_length_linter = line_length_linter(120), object_name_linter = NULL, # Even with a liberal definition of name styles, some of our names cause issues due to `.`s for s3 classes or NA in the name @@ -22,8 +22,8 @@ linters: with_defaults( # object_name_linter = object_name_linter(styles = c("snake_case", "camelCase", "CamelCase", "symbols", "dotted.case", "UPPERCASE", "SNAKE_CASE")), object_length_linter = object_length_linter(40), object_usage_linter = NULL, # R6 methods are flagged, - cyclocomp_linter = cyclocomp_linter(26), # TODO: reduce to default of 15 - open_curly_linter = NULL # styler and lintr conflict on this (https://github.com/r-lib/styler/issues/549#issuecomment-537191536) + cyclocomp_linter = cyclocomp_linter(26) # TODO: reduce to default of 15 + # See also https://github.com/r-lib/lintr/issues/804 for cyclocomp issues with R6 ) exclusions: list( "R/arrowExports.R", diff --git a/r/lint.sh b/r/lint.sh index 91435e7e01..21e7374733 100755 --- a/r/lint.sh +++ b/r/lint.sh @@ -51,4 +51,4 @@ $CPP_BUILD_SUPPORT/run_cpplint.py \ # Run lintr R -e "if(!requireNamespace('lintr', quietly=TRUE)){stop('lintr is not installed, please install it with R -e \"install.packages(\'lintr\')\"')}" -NOT_CRAN=true R -e "lintr::lint_package('${SOURCE_DIR}', path_prefix = 'r')" +NOT_CRAN=true R -e "lintr::lint_package('${SOURCE_DIR}')" diff --git a/r/vignettes/developers/workflow.Rmd b/r/vignettes/developers/workflow.Rmd index b7e0a27d76..cb88a6af6c 100644 --- a/r/vignettes/developers/workflow.Rmd +++ b/r/vignettes/developers/workflow.Rmd @@ -7,7 +7,6 @@ knitr::opts_chunk$set(error = TRUE, eval = FALSE) The Arrow R package uses several additional development tools: * [`lintr`](https://github.com/r-lib/lintr) for code analysis - - for the time being, the R package uses a custom version of lintr - `jonkeane/lintr@arrow-branch` * [`styler`](https://styler.r-lib.org) for code styling * [`pkgdown`](https://pkgdown.r-lib.org) 
for building the website * [`roxygen2`](https://roxygen2.r-lib.org) for documenting the package @@ -16,8 +15,7 @@ The Arrow R package uses several additional development tools: You can install all these additional dependencies by running:
[arrow] branch master updated: ARROW-16899: [R][CI] R nightly builds used old libarrow (#13411)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 9e5d3e6f87 ARROW-16899: [R][CI] R nightly builds used old libarrow (#13411) 9e5d3e6f87 is described below commit 9e5d3e6f87a6a3a4ae8384f68f60b7b739a72e45 Author: Jacob Wujciak-Jens AuthorDate: Fri Jun 24 18:13:02 2022 +0200 ARROW-16899: [R][CI] R nightly builds used old libarrow (#13411) Authored-by: Jacob Wujciak-Jens Signed-off-by: Neal Richardson --- dev/tasks/macros.jinja | 19 --- dev/tasks/r/github.packages.yml | 31 +-- 2 files changed, 25 insertions(+), 25 deletions(-) diff --git a/dev/tasks/macros.jinja b/dev/tasks/macros.jinja index 4e7fc4cf35..03de66cbe1 100644 --- a/dev/tasks/macros.jinja +++ b/dev/tasks/macros.jinja @@ -293,6 +293,19 @@ on: shell: Rscript {0} run: | # getwd() is necessary as this macro is used within jobs using a docker container - tools::write_PACKAGES(file.path(getwd(), "/repo/src/contrib", fsep = "/"), type = "source", verbose = TRUE) - - run: ls -R repo -{% endmacro %} + tools::write_PACKAGES(file.path(getwd(), "repo/src/contrib", fsep = "/"), type = "source", verbose = TRUE) + - name: Show repo +shell: bash +# tree not available in git-bash on windows +run: | + ls -R repo + - name: Add dev repo to .Rprofile +shell: Rscript {0} +run: | + str <- paste0("options(arrow.dev_repo ='file://", getwd(), "/repo' )") + print(str) + profile_path <- file.path(getwd(), ".Rprofile") + write(str, file = profile_path, append = TRUE) + # Set envvar for later steps by appending to $GITHUB_ENV + write(paste0("R_PROFILE_USER=", profile_path), file = Sys.getenv("GITHUB_ENV"), append = TRUE) + {% endmacro %} diff --git a/dev/tasks/r/github.packages.yml b/dev/tasks/r/github.packages.yml index 30afda4ffa..4f5caa0e1c 100644 --- a/dev/tasks/r/github.packages.yml +++ b/dev/tasks/r/github.packages.yml @@ -158,6 +158,8 @@ jobs: 
- name: Build Binary id: build shell: Rscript {0} +env: + ARROW_R_DEV: TRUE run: | on_windows <- tolower(Sys.info()[["sysname"]]) == "windows" @@ -166,17 +168,9 @@ jobs: type = "binary", repos = c("https://nightlies.apache.org/arrow/r;, "https://cloud.r-project.org;) ) - remove.packages("arrow") - # Setup local repo - dev_repo <- paste0( -ifelse(on_windows, "file:", "file://"), -getwd(), -"/repo") - - # This is necessary to use the local folder as a repo in both - # install_arrow & tools/*libs.R - options(arrow.dev_repo = dev_repo) + cat("Remove old arrow version.\n") + remove.packages("arrow") # Build Sys.setenv(MAKEFLAGS = paste0("-j", parallel::detectCores())) @@ -186,11 +180,12 @@ jobs: INSTALL_opts <- c(INSTALL_opts, "--strip") } - + cat("Install arrow from dev repo.\n") install.packages( "arrow", type = "source", -repos = dev_repo, +# The sub is necessary to prevent an error on windows. +repos = sub("file://", "file:", getOption("arrow.dev_repo")),, INSTALL_opts = INSTALL_opts ) @@ -248,15 +243,11 @@ jobs: # Add R-devel to PATH echo "/opt/R-devel/bin" >> $GITHUB_PATH + {{ macros.github_setup_local_r_repo(true, false)|indent }} - - name: Set dev repo -shell: bash -run: | - # It is important to use pwd here as this happens inside a container so the - # normal github.workspace path is wrong. - echo "options(arrow.dev_repo = 'file://$(pwd)/repo')" >> ~/.Rprofile - name: Install arrow from our repo env: + ARROW_R_DEV: TRUE LIBARROW_BUILD: "FALSE" LIBARROW_BINARY: "TRUE" shell: Rscript {0} @@ -273,10 +264,6 @@ jobs: with: install-r: false {{ macros.github_setup_local_r_repo(false, false)|indent }} - - name: Set dev repo -shell: bash -run: | - echo "options(arrow.dev_repo = 'file://$(pwd)/repo')" >> ~/.Rprofile - name: Install arrow from nightly repo env: # Test source build so be sure not to download a binary
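The new `Add dev repo to .Rprofile` step above persists `R_PROFILE_USER` for later workflow steps by appending a `KEY=value` line to the file named by `$GITHUB_ENV`. The mechanism can be sketched in shell, using a temp file in place of the runner-provided one:

```shell
set -e
# Stand-in for the runner-provided environment file.
GITHUB_ENV=$(mktemp)

# Equivalent of the R call write(paste0("R_PROFILE_USER=", profile_path), ...):
profile_path="$PWD/.Rprofile"
echo "R_PROFILE_USER=$profile_path" >> "$GITHUB_ENV"

# After the step finishes, the runner exports every KEY=value line in this file.
cat "$GITHUB_ENV"
```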
[arrow] branch master updated: ARROW-16689: [CI] Improve R Nightly Workflow (#13266)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 46116c48c8 ARROW-16689: [CI] Improve R Nightly Workflow (#13266) 46116c48c8 is described below commit 46116c48c8037117ba71f91b7d0f17d22de0b530 Author: Jacob Wujciak-Jens AuthorDate: Tue Jun 7 16:28:53 2022 +0200 ARROW-16689: [CI] Improve R Nightly Workflow (#13266) Lead-authored-by: Jacob Wujciak-Jens Co-authored-by: Neal Richardson Signed-off-by: Neal Richardson --- .github/workflows/r_nightly.yml| 83 +- LICENSE.txt| 8 +++ dev/tasks/macros.jinja | 28 .../r/{github.nightly.yml => github.packages.yml} | 64 ++--- dev/tasks/tasks.yml| 13 +++- 5 files changed, 119 insertions(+), 77 deletions(-) diff --git a/.github/workflows/r_nightly.yml b/.github/workflows/r_nightly.yml index 8fb96a2796..0f657a85ad 100644 --- a/.github/workflows/r_nightly.yml +++ b/.github/workflows/r_nightly.yml @@ -16,6 +16,10 @@ # under the License. name: Upload R Nightly builds +# This workflow downloads the (nightly) binaries created in crossbow and uploads them +# to nightlies.apache.org. Due to authorization requirements, this upload can't be done + +# from the crossbow repository. 
on: workflow_dispatch: @@ -25,13 +29,11 @@ on: required: false default: '' schedule: -#Crossbow packagin runs at 0 8 * * * +#Crossbow packaging runs at 0 8 * * * - cron: '0 14 * * *' jobs: upload: -env: - PREFIX: ${{ github.event.inputs.prefix || ''}} runs-on: ubuntu-latest steps: - name: Checkout Arrow @@ -59,59 +61,70 @@ jobs: run: pip install -e arrow/dev/archery[all] - run: mkdir -p binaries - name: Download Artifacts +env: + PREFIX: ${{ github.event.inputs.prefix || ''}} run: | if [ -z $PREFIX ]; then PREFIX=nightly-packaging-$(date +%Y-%m-%d)-0 fi echo $PREFIX - archery crossbow download-artifacts -f r-nightly-packages -t binaries --skip-pattern-validation $PREFIX + archery crossbow download-artifacts -f r-nightly-packages -t binaries $PREFIX + + if [ -n "$(ls -A binaries/*/*/)" ]; then +echo "Found files!" + else +echo "No files found. Stopping upload." +exit 1 + fi - name: Build Repository shell: Rscript {0} run: | + # folder that we rsync to nightlies.apache.org + repo_root <- "repo" + # The binaries are in a nested dir + # so we need to find the correct path. 
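The `Download Artifacts` step above falls back to a date-based crossbow prefix when the workflow input is empty (`PREFIX=nightly-packaging-$(date +%Y-%m-%d)-0`). The same fallback, mirrored in Python for illustration (the function name is hypothetical):

```python
from datetime import date

def default_prefix(prefix: str = "") -> str:
    # Mirrors the shell fallback in the workflow:
    # if [ -z $PREFIX ]; then PREFIX=nightly-packaging-$(date +%Y-%m-%d)-0; fi
    return prefix or f"nightly-packaging-{date.today():%Y-%m-%d}-0"
```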
art_path <- list.files("binaries", - recursive = TRUE, - include.dirs = TRUE, - pattern = "r-nightly-packages$", - full.names = TRUE +recursive = TRUE, +include.dirs = TRUE, +pattern = "r-nightly-packages$", +full.names = TRUE ) - pkgs <- list.files(art_path, pattern = "r-pkg_*") - src_i <- grep("r-pkg_src", pkgs) - src_pkg <- pkgs[src_i] - pkgs <- pkgs[-src_i] - libs <- list.files(art_path, pattern = "r-libarrow*") + current_path <- list.files(art_path, full.names = TRUE, recursive = TRUE) + files <- sub("r-(pkg|lib)", repo_root, current_path) - new_names <- sub("r-pkg_", "", pkgs, fixed = T) - matches <- regmatches(new_names, regexec("(([a-z]+)-[\\.a-zA-Z0-9]+)_(\\d\\.\\d)-(arrow.+)$", new_names)) + # decode contrib.url from artifact name: + # bin__windows__contrib__4.1 -> bin/windows/contrib/4.1 + new_paths <- gsub("__", "/", files) + # strip superfluous nested dirs + new_paths <- sub(art_path, ".", new_paths) + dirs <- dirname(new_paths) + dir_result <- sapply(dirs, dir.create, recursive = TRUE) - dir.create("repo/src/contrib", recursive = TRUE) - file.copy(paste0(art_path, "/", src_pkg), paste0("repo/src/contrib/", sub("r-pkg_src-", "", src_pkg))) - tools::write_PACKAGES("repo/src/contrib", type = "source", verbose = TRUE) + if (!all(dir_result)) { +stop("There was an issue while creating the folders!") + } - for (match in matches) { - path <- paste0("repo/bin/", match[[3]],
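The `Build Repository` step above decodes CRAN-style `contrib` paths from crossbow artifact names (`bin__windows__contrib__4.1 -> bin/windows/contrib/4.1`). A Python sketch of that decoding, mirroring the `sub()`/`gsub()` calls in the R snippet (the helper name is illustrative):

```python
import re

REPO_ROOT = "repo"  # the folder rsync'd to nightlies.apache.org

def artifact_to_repo_path(name: str) -> str:
    # The r-pkg/r-lib prefix becomes the repo root, and "__" encodes "/":
    # r-pkg__bin__windows__contrib__4.1 -> repo/bin/windows/contrib/4.1
    return re.sub(r"^r-(pkg|lib)", REPO_ROOT, name).replace("__", "/")
```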
[arrow] branch master updated: MINOR: [R] Fix duckdb test for dbplyr 2.2.0 internals change (#13323)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 8c63788ff7 MINOR: [R] Fix duckdb test for dbplyr 2.2.0 internals change (#13323) 8c63788ff7 is described below commit 8c63788ff7d52812599a546989b7df10887cb01e Author: Neal Richardson AuthorDate: Mon Jun 6 16:40:56 2022 -0400 MINOR: [R] Fix duckdb test for dbplyr 2.2.0 internals change (#13323) Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/tests/testthat/test-duckdb.R | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/r/tests/testthat/test-duckdb.R b/r/tests/testthat/test-duckdb.R index 82451017a4..088d7a4bbd 100644 --- a/r/tests/testthat/test-duckdb.R +++ b/r/tests/testthat/test-duckdb.R @@ -279,7 +279,8 @@ test_that("to_duckdb passing a connection", { table_four <- ds %>% select(int, lgl, dbl) %>% to_duckdb(con = con_separate, auto_disconnect = FALSE) - table_four_name <- table_four$ops$x + # dbplyr 2.2.0 renames this internal attribute to lazy_query + table_four_name <- table_four$ops$x %||% table_four$lazy_query$x result <- DBI::dbGetQuery( con_separate,
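The one-line fix above uses R's `%||%` operator to read the table name from whichever slot the installed dbplyr version provides. A Python sketch of that null-coalescing pattern (the dict layout is illustrative, mirroring the field names in the diff):

```python
def null_coalesce(x, y):
    # Python stand-in for R's `%||%`: return y when x is NULL/None.
    return y if x is None else x

# dbplyr < 2.2.0 kept the table name in tbl$ops$x; 2.2.0 moved it to
# tbl$lazy_query$x, so the test reads whichever slot is populated.
old_style = {"ops": {"x": "table_four"}, "lazy_query": {}}
new_style = {"ops": {}, "lazy_query": {"x": "table_four"}}

for tbl in (old_style, new_style):
    name = null_coalesce(tbl["ops"].get("x"), tbl["lazy_query"].get("x"))
    print(name)  # -> table_four (both layouts)
```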
[arrow] branch master updated: MINOR: [R] Drop opensuse42 build and update opensuse15 (#13312)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new b821c16e97 MINOR: [R] Drop opensuse42 build and update opensuse15 (#13312) b821c16e97 is described below commit b821c16e976728e617599c8127c85c273cd069a1 Author: Neal Richardson AuthorDate: Sun Jun 5 16:52:51 2022 -0400 MINOR: [R] Drop opensuse42 build and update opensuse15 (#13312) The opensuse42 job has been failing for a while on nightlies, and it is EOL and RSPM is no longer doing anything for it, so we should drop it. Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- ci/etc/rprofile | 8 +--- dev/tasks/tasks.yml | 3 +-- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/ci/etc/rprofile b/ci/etc/rprofile index e9e98b12e4..2f64b17e5d 100644 --- a/ci/etc/rprofile +++ b/ci/etc/rprofile @@ -2,7 +2,9 @@ local({ .pick_cran <- function() { # Return a CRAN repo URL, preferring RSPM binaries if available for this OS rspm_template <- "https://packagemanager.rstudio.com/cran/__linux__/%s/latest" -supported_os <- c("focal", "xenial", "bionic", "centos7", "centos8", "opensuse42", "opensuse15", "opensuse152") +# See https://github.com/rstudio/r-docker#releases-and-tags, +# but note that RSPM still uses "centos8" +supported_os <- c("bionic", "focal", "jammy", "centos7", "centos8", "opensuse153") if (nzchar(Sys.which("lsb_release"))) { os <- tolower(system("lsb_release -cs", intern = TRUE)) @@ -19,8 +21,8 @@ local({ return(sprintf(rspm_template, os)) } else { names(vals) <- sub("^(.*)=.*$", "\\1", os_release) -if (vals["ID"] == "opensuse") { - version <- sub('^"?([0-9]+).*"?.*$', "\\1", vals["VERSION_ID"]) +if (grepl("opensuse", vals["ID"])) { + version <- sub('^"?([0-9]+)\\.?([0-9]+).*"?.*$', "\\1\\2", vals["VERSION_ID"]) os <- paste0("opensuse", version) if (os %in% supported_os) { return(sprintf(rspm_template, os))
diff --git a/dev/tasks/tasks.yml b/dev/tasks/tasks.yml index 11675e8bba..7a8fd83161 100644 --- a/dev/tasks/tasks.yml +++ b/dev/tasks/tasks.yml @@ -1331,8 +1331,7 @@ tasks: {% for r_org, r_image, r_tag in [("rhub", "ubuntu-gcc-release", "latest"), ("rocker", "r-base", "latest"), ("rstudio", "r-base", "4.2-focal"), - ("rstudio", "r-base", "4.1-opensuse15"), - ("rstudio", "r-base", "4.2-opensuse42")] %} + ("rstudio", "r-base", "4.1-opensuse153")] %} test-r-{{ r_org }}-{{ r_image }}-{{ r_tag }}: ci: azure template: r/azure.linux.yml
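The updated regex in the `rprofile` hunk above collapses an openSUSE `VERSION_ID` such as `"15.3"` into the `opensuse153` label that RSPM's binary repos use. Mirrored in Python for illustration:

```python
import re

def rspm_os(version_id: str) -> str:
    # Mirrors sub('^"?([0-9]+)\\.?([0-9]+).*"?.*$', "\\1\\2", VERSION_ID):
    # keep the first two numeric components, dropping the dot and any
    # surrounding quotes, e.g. '"15.3"' -> "153".
    return "opensuse" + re.sub(r'^"?([0-9]+)\.?([0-9]+).*$', r"\1\2", version_id)

print(rspm_os('"15.3"'))  # -> opensuse153
```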
[arrow] branch master updated: ARROW-16607: [R] Improve KeyValueMetadata handling
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new a6025f1571 ARROW-16607: [R] Improve KeyValueMetadata handling a6025f1571 is described below commit a6025f15712aa0829aab748a8d3e776f335265cc Author: Neal Richardson AuthorDate: Thu May 26 13:14:31 2022 -0400 ARROW-16607: [R] Improve KeyValueMetadata handling * Pushes KVM handling into ExecPlan so that Run() preserves the R metadata we want. * Also pushes special handling for a kind of collapsed query from collect() into Build(). * Better encapsulate KVM for the $metadata and $r_metadata so that as a user/developer, you never have to touch the serialize/deserialize functions, you just have a list to work with. This is a slight API change, most noticeable if you were to `print(tab$metadata)`; better is to `print(str(tab$metadata))`. * Factor out a common utility in r/src for taking cpp11::strings (named character vector) and producing arrow::KeyValueMetadata The upshot of all of this is that we can push the ExecPlan evaluation into `as_record_batch_reader()`, and all that `collect()` does on top is read the RBR into a Table/data.frame. This means that we can plug dplyr queries into anything else that expects a RecordBatchReader, and it will be (to the maximum extent possible, given the limitations of ExecPlan) streaming, not requiring you to `compute()` and materialize things first.
Closes #13210 from nealrichardson/kvm Authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/R/arrow-tabular.R | 17 + r/R/arrowExports.R | 4 +- r/R/dataset-scan.R | 4 +- r/R/dataset-write.R | 16 +--- r/R/dplyr-collect.R | 31 +--- r/R/dplyr-group-by.R | 4 +- r/R/metadata.R | 15 ++-- r/R/query-engine.R | 159 ++- r/R/record-batch-reader.R| 6 +- r/R/record-batch.R | 2 +- r/R/schema.R | 26 --- r/R/table.R | 2 +- r/src/arrowExports.cpp | 9 ++- r/src/compute-exec.cpp | 22 +++--- r/src/schema.cpp | 13 ++-- r/tests/testthat/test-metadata.R | 7 +- 16 files changed, 177 insertions(+), 160 deletions(-) diff --git a/r/R/arrow-tabular.R b/r/R/arrow-tabular.R index 43110ccf24..58a604ba61 100644 --- a/r/R/arrow-tabular.R +++ b/r/R/arrow-tabular.R @@ -70,7 +70,6 @@ ArrowTabular <- R6Class("ArrowTabular", self$schema$metadata } else { # Set the metadata -new <- prepare_key_value_metadata(new) out <- self$ReplaceSchemaMetadata(new) # ReplaceSchemaMetadata returns a new object but we're modifying in place, # so swap in that new C++ object pointer into our R6 object @@ -82,16 +81,10 @@ ArrowTabular <- R6Class("ArrowTabular", # Helper for the R metadata that handles the serialization # See also method on Schema if (missing(new)) { -out <- self$metadata$r -if (!is.null(out)) { - # Can't unserialize NULL - out <- .unserialize_arrow_r_metadata(out) -} -# Returns either NULL or a named list -out +self$metadata$r } else { # Set the R metadata -self$metadata$r <- .serialize_arrow_r_metadata(new) +self$metadata$r <- new self } } @@ -101,11 +94,7 @@ ArrowTabular <- R6Class("ArrowTabular", #' @export as.data.frame.ArrowTabular <- function(x, row.names = NULL, optional = FALSE, ...) 
{ df <- x$to_data_frame() - - if (!is.null(r_metadata <- x$metadata$r)) { -df <- apply_arrow_r_metadata(df, .unserialize_arrow_r_metadata(r_metadata)) - } - df + apply_arrow_r_metadata(df, x$metadata$r) } #' @export diff --git a/r/R/arrowExports.R b/r/R/arrowExports.R index 3414c9b21c..8ad56f227f 100644 --- a/r/R/arrowExports.R +++ b/r/R/arrowExports.R @@ -404,8 +404,8 @@ ExecPlan_create <- function(use_threads) { .Call(`_arrow_ExecPlan_create`, use_threads) } -ExecPlan_run <- function(plan, final_node, sort_options, head) { - .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options, head) +ExecPlan_run <- function(plan, final_node, sort_options, metadata, head) { + .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options, metadata, head) } ExecPlan_StopProducing <- function(plan) { diff --git a/r/R/dataset-scan.R b/r/R/dataset-scan.R index 72f9dec276..a8da1fb60d 100644 --- a/r/R/dataset-scan.R +++ b/r/R/dataset-scan.R @@ -206,10 +206,8 @@ map_batches <- function(X, FUN, ..., .data.frame = NULL) { call. = FALSE ) } - plan <- Ex
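The encapsulation described in the commit message above means callers read and write a plain list while serialization stays inside the accessor. A minimal Python sketch of that pattern (the class and the JSON encoding are illustrative stand-ins, not Arrow's actual implementation, which serializes R objects):

```python
import json

class Tabular:
    """Callers get/set a plain dict; (de)serialization is hidden in the
    accessor, like the $metadata$r active binding described above."""

    def __init__(self):
        self._kvm = {}  # stores strings, like Arrow KeyValueMetadata

    @property
    def r_metadata(self):
        raw = self._kvm.get("r")
        return None if raw is None else json.loads(raw)

    @r_metadata.setter
    def r_metadata(self, value):
        self._kvm["r"] = json.dumps(value)

t = Tabular()
t.r_metadata = {"attributes": {"class": "data.frame"}}
print(t.r_metadata)  # a plain dict, never the serialized string
```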
[arrow] branch master updated (6576aa06fd -> d889adec54)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 6576aa06fd ARROW-16634: [Gandiva][C++] Add udfdegrees alias add d889adec54 ARROW-15622: [R] Implement union_all and union for arrow_dplyr_query No new revisions were added by this update. Summary of changes: r/DESCRIPTION | 1 + r/R/arrow-package.R| 2 +- r/R/arrowExports.R | 5 +- .../testthat/test-array-data.R => R/dplyr-union.R} | 28 r/R/query-engine.R | 7 ++ r/src/arrowExports.cpp | 10 +++ r/src/compute-exec.cpp | 7 ++ r/tests/testthat/test-dplyr-union.R| 74 ++ 8 files changed, 120 insertions(+), 14 deletions(-) copy r/{tests/testthat/test-array-data.R => R/dplyr-union.R} (59%) create mode 100644 r/tests/testthat/test-dplyr-union.R
[arrow] branch master updated: ARROW-16594: [R] Consistently use "getOption" to set nightly repo
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new b7507c34b7 ARROW-16594: [R] Consistently use "getOption" to set nightly repo b7507c34b7 is described below commit b7507c34b71a200c9f08597b83e935d1639dd85c Author: Jacob Wujciak-Jens AuthorDate: Fri May 20 07:52:44 2022 -0700 ARROW-16594: [R] Consistently use "getOption" to set nightly repo The behavior can be seen in action [here](https://github.com/assignUser/test-repo-a/actions/runs/2340358110) where I build this branch with the daily version number `20220517` which does not yet exist in the s3 bucket. ~~It actually looks like it is not working for linux binary builds https://github.com/assignUser/test-repo-a/runs/6472478941?check_suite_focus=true#step:4:153~~ This issue was due to .Rprofile configuration. Closes #13173 from assignUser/ARROW-16594-option-devrepo Authored-by: Jacob Wujciak-Jens Signed-off-by: Neal Richardson --- r/tools/nixlibs.R | 2 +- r/tools/winlibs.R | 5 - 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/r/tools/nixlibs.R b/r/tools/nixlibs.R index fc523f49ed..5b8cc3b72d 100644 --- a/r/tools/nixlibs.R +++ b/r/tools/nixlibs.R @@ -19,7 +19,7 @@ args <- commandArgs(TRUE) VERSION <- args[1] dst_dir <- paste0("libarrow/arrow-", VERSION) -arrow_repo <- "https://arrow-r-nightly.s3.amazonaws.com/libarrow/" +arrow_repo <- paste0(getOption("arrow.dev_repo", "https://arrow-r-nightly.s3.amazonaws.com"), "/libarrow/") options(.arrow.cleanup = character()) # To collect dirs to rm on exit on.exit(unlink(getOption(".arrow.cleanup"))) diff --git a/r/tools/winlibs.R b/r/tools/winlibs.R index 9435ac3c20..4adedbddb2 100644 --- a/r/tools/winlibs.R +++ b/r/tools/winlibs.R @@ -38,7 +38,10 @@ if (!file.exists(sprintf("windows/arrow-%s/include/arrow/api.h", VERSION))) { ) } # URL templates -nightly <- "https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/arrow-%s.zip" +nightly <- paste0( + getOption("arrow.dev_repo", "https://arrow-r-nightly.s3.amazonaws.com"), + "/libarrow/bin/windows/arrow-%s.zip" +) rwinlib <- "https://github.com/rwinlib/arrow/archive/v%s.zip" # First look for a nightly get_file(nightly, VERSION)
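After this change both `nixlibs.R` and `winlibs.R` consult the same `arrow.dev_repo` option and fall back to the public nightly bucket. A Python sketch of the `getOption(name, default)` pattern (the option store is a stand-in; the URLs come from the diff above):

```python
options = {}  # stand-in for R's options()/getOption() store

def get_option(name: str, default=None):
    return options.get(name, default)

def libarrow_url() -> str:
    # Same default as the R code: public bucket unless arrow.dev_repo is set.
    repo = get_option("arrow.dev_repo", "https://arrow-r-nightly.s3.amazonaws.com")
    return repo + "/libarrow/"

print(libarrow_url())  # -> https://arrow-r-nightly.s3.amazonaws.com/libarrow/
```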
[arrow] branch master updated (663dc325de -> dc39f83e2f)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git from 663dc325de MINOR: [R] Clarify read_json_arrow() docs add dc39f83e2f ARROW-15271: [R] Refactor do_exec_plan to return a RecordBatchReader No new revisions were added by this update. Summary of changes: r/NAMESPACE | 2 ++ r/R/dataset-scan.R | 52 + r/R/dplyr-collect.R | 17 +- r/R/duckdb.R| 14 ++-- r/R/query-engine.R | 44 +++- r/R/record-batch-reader.R | 19 --- r/R/record-batch.R | 6 r/R/table.R | 7 r/man/as_record_batch.Rd| 3 ++ r/man/map_batches.Rd| 26 ++- r/man/to_arrow.Rd | 8 ++--- r/src/arrowExports.cpp | 2 +- r/src/recordbatchreader.cpp | 5 +-- r/tests/testthat/test-dataset-write.R | 10 +++--- r/tests/testthat/test-dataset.R | 31 + r/tests/testthat/test-duckdb.R | 2 +- r/tests/testthat/test-record-batch-reader.R | 10 -- 17 files changed, 155 insertions(+), 103 deletions(-)
[arrow] branch master updated: MINOR: [R] Clarify read_json_arrow() docs
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 663dc325de MINOR: [R] Clarify read_json_arrow() docs 663dc325de is described below commit 663dc325de1176a5caf32809942acae98abf7a8b Author: Edward Visel <1693477+alistair...@users.noreply.github.com> AuthorDate: Wed May 18 15:30:54 2022 -0700 MINOR: [R] Clarify read_json_arrow() docs A quick PR to clarify `read_json_arrow()` docs I found confusing while benchmarking. Specifically, specifies the function - is for ndjson (as opposed to say the many json formats to which pandas can write a dataframe) - handles compression - handles implicit and explicit nulls (was in the example, but not previously stated) Open to changes, but do feel these docs need to at least explicitly say "ndjson" somewhere. Closes #13133 from alistaire47/chore/read-json-docs Lead-authored-by: Edward Visel <1693477+alistair...@users.noreply.github.com> Co-authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/R/json.R | 6 +- r/man/read_json_arrow.Rd | 7 ++- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/r/R/json.R b/r/R/json.R index 08798bb2e5..19cf6a9299 100644 --- a/r/R/json.R +++ b/r/R/json.R @@ -17,7 +17,11 @@ #' Read a JSON file #' -#' Using [JsonTableReader] +#' Wrapper around [JsonTableReader] to read a newline-delimited JSON (ndjson) file into a +#' data frame or Arrow Table. +#' +#' If passed a path, will detect and handle compression from the file extension +#' (e.g. `.json.gz`). Accepts explicit or implicit nulls. #' #' @inheritParams read_delim_arrow #' @param schema [Schema] that describes the table. 
diff --git a/r/man/read_json_arrow.Rd b/r/man/read_json_arrow.Rd index 610867ca40..2ad600725f 100644 --- a/r/man/read_json_arrow.Rd +++ b/r/man/read_json_arrow.Rd @@ -36,7 +36,12 @@ an Arrow \link{Table}?} A \code{data.frame}, or a Table if \code{as_data_frame = FALSE}. } \description{ -Using \link{JsonTableReader} +Wrapper around \link{JsonTableReader} to read a newline-delimited JSON (ndjson) file into a +data frame or Arrow Table. +} +\details{ +If passed a path, will detect and handle compression from the file extension +(e.g. \code{.json.gz}). Accepts explicit or implicit nulls. } \examples{ \dontshow{if (arrow_with_json()) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
[arrow] branch master updated: ARROW-16144: [R] Write compressed data streams (particularly over S3)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new d2cbe9e0e2 ARROW-16144: [R] Write compressed data streams (particularly over S3) d2cbe9e0e2 is described below commit d2cbe9e0e2ce206fba71d3d171babe36bada1a9d Author: Sam Albers AuthorDate: Wed May 18 14:24:55 2022 -0700 ARROW-16144: [R] Write compressed data streams (particularly over S3) This PR enables reading/writing compressed data streams over s3 and locally and adds some tests to test some of those round trips. For the filesystem path I had to do a little regex on the string for compression detection but any feedback on alternative approaches is very welcome. Previously supplying a file with a compression extension wrote out an uncompressed file. Here is a reprex of the updated writing behaviour: ```r library(arrow, warn.conflicts = FALSE) ## local write_csv_arrow(mtcars, file = file) write_csv_arrow(mtcars, file = comp_file) file.size(file) [1] 1303 file.size(comp_file) [1] 567 ## or with s3 dir <- tempfile() dir.create(dir) subdir <- file.path(dir, "bucket") dir.create(subdir) minio_server <- processx::process$new("minio", args = c("server", dir), supervise = TRUE) Sys.sleep(2) stopifnot(minio_server$is_alive()) s3_uri <- "s3://minioadmin:minioadmin@?scheme=http_override=localhost%3A9000" bucket <- s3_bucket(s3_uri) write_csv_arrow(mtcars, bucket$path("bucket/data.csv.gz")) write_csv_arrow(mtcars, bucket$path("bucket/data.csv")) file.size(file.path(subdir, "data.csv.gz")) [1] 567 file.size(file.path(subdir, "data.csv")) [1] 1303 ``` Closes #13183 from boshek/ARROW-16144 Lead-authored-by: Sam Albers Co-authored-by: Neal Richardson Signed-off-by: Neal Richardson --- r/R/io.R | 23 ++- r/R/util.R | 4 r/tests/testthat/test-csv.R | 17 + r/tests/testthat/test-s3-minio.R | 20 4 files changed, 59 insertions(+), 5 deletions(-) diff 
--git a/r/R/io.R b/r/R/io.R index 379dcf6f35..8e72187b43 100644 --- a/r/R/io.R +++ b/r/R/io.R @@ -270,7 +270,7 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem file <- ReadableFile$create(file) } -if (!identical(compression, "uncompressed")) { +if (is_compressed(compression)) { file <- CompressedInputStream$create(file, compression) } } else if (inherits(file, c("raw", "Buffer"))) { @@ -292,7 +292,7 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem file } -make_output_stream <- function(x, filesystem = NULL) { +make_output_stream <- function(x, filesystem = NULL, compression = NULL) { if (inherits(x, "connection")) { if (!isOpen(x)) { open(x, "wb") @@ -309,11 +309,21 @@ make_output_stream <- function(x, filesystem = NULL) { filesystem <- fs_and_path$fs x <- fs_and_path$path } + + if (is.null(compression)) { +# Infer compression from sink +compression <- detect_compression(x) + } + assert_that(is.string(x)) - if (is.null(filesystem)) { -FileOutputStream$create(x) + if (is.null(filesystem) && is_compressed(compression)) { +CompressedOutputStream$create(x) ##compressed local + } else if (is.null(filesystem) && !is_compressed(compression)) { +FileOutputStream$create(x) ## uncompressed local + } else if (!is.null(filesystem) && is_compressed(compression)) { +CompressedOutputStream$create(filesystem$OpenOutputStream(x)) ## compressed remote } else { -filesystem$OpenOutputStream(x) +filesystem$OpenOutputStream(x) ## uncompressed remote } } @@ -322,6 +332,9 @@ detect_compression <- function(path) { return("uncompressed") } + # Remove any trailing slashes, which FileSystem$from_uri may add + path <- gsub("/$", "", path) + switch(tools::file_ext(path), bz2 = "bz2", gz = "gzip", diff --git a/r/R/util.R b/r/R/util.R index ff2bb070b8..4aff69e471 100644 --- a/r/R/util.R +++ b/r/R/util.R @@ -211,3 +211,7 @@ handle_csv_read_error <- function(e, schema, call) { } abort(msg, call = call) } + +is_compressed <- 
function(compression) { + !identical(compression, "uncompressed") +} diff --git a/r/tests/testthat/test-csv.R b/r/tests/testthat/test-csv.R index 631e75fd74..8e463d3abe 100644 --- a/r/tests/testthat/test-csv.R +++ b/r/tests/testthat/test-csv.R @@ -564,6 +564,23
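The updated `make_output_stream()` hinges on inferring the codec from the sink's file extension plus the small `is_compressed()` helper added above. A Python sketch of that detection, mirroring the R code (only the extensions visible in the diff are mapped):

```python
import os
import re

def detect_compression(path: str) -> str:
    # FileSystem$from_uri may add a trailing "/", so strip it before
    # looking at the extension, as the R helper now does.
    path = re.sub(r"/$", "", path)
    ext = os.path.splitext(path)[1].lstrip(".")
    return {"bz2": "bz2", "gz": "gzip"}.get(ext, "uncompressed")

def is_compressed(compression: str) -> bool:
    return compression != "uncompressed"

print(detect_compression("bucket/data.csv.gz/"))  # -> gzip
```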
[arrow] branch master updated: ARROW-16539: [C++] Bump bundled thrift to 0.16.0
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git The following commit(s) were added to refs/heads/master by this push: new 235767d2db ARROW-16539: [C++] Bump bundled thrift to 0.16.0 235767d2db is described below commit 235767d2dbf1c6839057a21631680e021f3da3e3 Author: Sutou Kouhei AuthorDate: Fri May 13 16:02:18 2022 -0400 ARROW-16539: [C++] Bump bundled thrift to 0.16.0 Closes #13122 from nealrichardson/bump-thrift Lead-authored-by: Sutou Kouhei Co-authored-by: Neal Richardson Signed-off-by: Neal Richardson --- cpp/cmake_modules/ThirdpartyToolchain.cmake | 40 - cpp/thirdparty/versions.txt | 4 +-- 2 files changed, 24 insertions(+), 20 deletions(-) diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.cmake index e8fcf33752..992c2102d2 100644 --- a/cpp/cmake_modules/ThirdpartyToolchain.cmake +++ b/cpp/cmake_modules/ThirdpartyToolchain.cmake @@ -1422,21 +1422,24 @@ macro(build_thrift) ${EP_COMMON_CMAKE_ARGS} "-DCMAKE_INSTALL_PREFIX=${THRIFT_PREFIX}" "-DCMAKE_INSTALL_RPATH=${THRIFT_PREFIX}/lib" + # Work around https://gitlab.kitware.com/cmake/cmake/issues/18865 + -DBoost_NO_BOOST_CMAKE=ON -DBUILD_COMPILER=OFF + -DBUILD_EXAMPLES=OFF -DBUILD_SHARED_LIBS=OFF -DBUILD_TESTING=OFF - -DBUILD_EXAMPLES=OFF -DBUILD_TUTORIALS=OFF - -DWITH_QT4=OFF + -DCMAKE_DEBUG_POSTFIX= + -DWITH_AS3=OFF + -DWITH_CPP=ON -DWITH_C_GLIB=OFF -DWITH_JAVA=OFF - -DWITH_PYTHON=OFF - -DWITH_HASKELL=OFF - -DWITH_CPP=ON - -DWITH_STATIC_LIB=ON + -DWITH_JAVASCRIPT=OFF -DWITH_LIBEVENT=OFF - # Work around https://gitlab.kitware.com/cmake/cmake/issues/18865 - -DBoost_NO_BOOST_CMAKE=ON) + -DWITH_NODEJS=OFF + -DWITH_PYTHON=OFF + -DWITH_QT5=OFF + -DWITH_ZLIB=OFF) # Thrift also uses boost. Forward important boost settings if there were ones passed. 
if(DEFINED BOOST_ROOT) @@ -1446,21 +1449,22 @@ macro(build_thrift) list(APPEND THRIFT_CMAKE_ARGS "-DBoost_NAMESPACE=${Boost_NAMESPACE}") endif() - set(THRIFT_STATIC_LIB_NAME "${CMAKE_STATIC_LIBRARY_PREFIX}thrift") if(MSVC) if(ARROW_USE_STATIC_CRT) - set(THRIFT_STATIC_LIB_NAME "${THRIFT_STATIC_LIB_NAME}mt") + set(THRIFT_LIB_SUFFIX "mt") list(APPEND THRIFT_CMAKE_ARGS "-DWITH_MT=ON") else() - set(THRIFT_STATIC_LIB_NAME "${THRIFT_STATIC_LIB_NAME}md") + set(THRIFT_LIB_SUFFIX "md") list(APPEND THRIFT_CMAKE_ARGS "-DWITH_MT=OFF") endif() +set(THRIFT_LIB + "${THRIFT_PREFIX}/bin/${CMAKE_IMPORT_LIBRARY_PREFIX}thrift${THRIFT_LIB_SUFFIX}${CMAKE_IMPORT_LIBRARY_SUFFIX}" +) + else() +set(THRIFT_LIB + "${THRIFT_PREFIX}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}thrift${CMAKE_STATIC_LIBRARY_SUFFIX}" +) endif() - if(${UPPERCASE_BUILD_TYPE} STREQUAL "DEBUG") -set(THRIFT_STATIC_LIB_NAME "${THRIFT_STATIC_LIB_NAME}d") - endif() - set(THRIFT_STATIC_LIB - "${THRIFT_PREFIX}/lib/${THRIFT_STATIC_LIB_NAME}${CMAKE_STATIC_LIBRARY_SUFFIX}") if(BOOST_VENDORED) set(THRIFT_DEPENDENCIES ${THRIFT_DEPENDENCIES} boost_ep) @@ -1469,7 +1473,7 @@ macro(build_thrift) externalproject_add(thrift_ep URL ${THRIFT_SOURCE_URL} URL_HASH "SHA256=${ARROW_THRIFT_BUILD_SHA256_CHECKSUM}" - BUILD_BYPRODUCTS "${THRIFT_STATIC_LIB}" + BUILD_BYPRODUCTS "${THRIFT_LIB}" CMAKE_ARGS ${THRIFT_CMAKE_ARGS} DEPENDS ${THRIFT_DEPENDENCIES} ${EP_LOG_OPTIONS}) @@ -1477,7 +1481,7 @@ macro(build_thrift) # The include directory must exist before it is referenced by a target. 
file(MAKE_DIRECTORY "${THRIFT_INCLUDE_DIR}") set_target_properties(thrift::thrift -PROPERTIES IMPORTED_LOCATION "${THRIFT_STATIC_LIB}" +PROPERTIES IMPORTED_LOCATION "${THRIFT_LIB}" INTERFACE_INCLUDE_DIRECTORIES "${THRIFT_INCLUDE_DIR}") if(CMAKE_VERSION VERSION_LESS 3.11) set_target_properties(${BOOST_LIBRARY} PROPERTIES INTERFACE_LINK_LIBRARIES diff --git a/cpp/thirdparty/versions.txt b/cpp/thirdparty/versions.txt index 3aa3ebe90f..776527fc2e 100644 --- a/cpp/thirdparty/versions.txt +++ b/cpp/thirdparty/versions.txt @@ -89,8 +89,8 @@ ARROW_SNAPPY_OLD_BUILD_VERSION=1.1.8 ARROW_SNAPPY_OLD_BUILD_SHA256_CHECKSUM=16b677f07832a612b0836178db7f374e414f94657c138e6993cbfc5dcc58
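The CMake change above replaces the single `THRIFT_STATIC_LIB` with a `THRIFT_LIB` whose directory and name depend on the toolchain: on MSVC the import library lives in `bin/` with an `mt`/`md` suffix matching the CRT, elsewhere a static library in `lib/`. A Python sketch of that selection logic (the literal prefixes/suffixes are assumptions standing in for CMake's `CMAKE_IMPORT_LIBRARY_*` and `CMAKE_STATIC_LIBRARY_*` variables):

```python
def thrift_lib(prefix: str, msvc: bool, static_crt: bool) -> str:
    # MSVC: import library in bin/, suffix chosen by ARROW_USE_STATIC_CRT.
    if msvc:
        return f"{prefix}/bin/thrift{'mt' if static_crt else 'md'}.lib"
    # Elsewhere: conventional static archive in lib/.
    return f"{prefix}/lib/libthrift.a"

print(thrift_lib("/opt/thrift", msvc=True, static_crt=True))  # -> /opt/thrift/bin/thriftmt.lib
```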
[arrow-site] branch asf-site updated: Backfill R news for 8.0.0 release (#214)
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/arrow-site.git The following commit(s) were added to refs/heads/asf-site by this push: new f3e8fe1793 Backfill R news for 8.0.0 release (#214) f3e8fe1793 is described below commit f3e8fe179397b59f7375560dd69a970d7872e67c Author: Neal Richardson AuthorDate: Thu May 12 17:04:45 2022 -0400 Backfill R news for 8.0.0 release (#214) https://github.com/apache/arrow/commit/526fa070c82c0e1c6d26a4c1d06a591b37c05011 apparently did not make it into the release tag --- docs/r/news/index.html | 108 +++-- 1 file changed, 96 insertions(+), 12 deletions(-) diff --git a/docs/r/news/index.html b/docs/r/news/index.html index 545c27d780..bd4fc2effc 100644 --- a/docs/r/news/index.html +++ b/docs/r/news/index.html @@ -128,27 +128,111 @@ -arrow 7.0.0.9000 +arrow 8.0.02022-05-09 + +Enhancements to dplyr and datasets -read_csv_arrow()’s readr-style type T is now mapped to timestamp(unit = "ns") instead of timestamp(unit = "s"). +open_dataset():correctly supports the skip argument for skipping header rows in CSV datasets. +can take a list of datasets with differing schemas and attempt to unify the schemas to produce a UnionDataset. + +Arrow https://dplyr.tidyverse.org; class="external-link">dplyr queries:are supported on RecordBatchReader. This allows, for example, results from DuckDB to be streamed back into Arrow rather than materialized before continuing the pipeline. +no longer need to materialize the entire result table before writing to a dataset if the query contains contains aggregations or joins. +supports https://dplyr.tidyverse.org/reference/rename.html; class="external-link">dplyr::rename_with(). + +https://dplyr.tidyverse.org/reference/count.html; class="external-link">dplyr::count() returns an ungrouped dataframe. 
+ + +write_dataset has more options for controlling row group and file sizes when writing partitioned datasets, such as max_open_files, max_rows_per_file, min_rows_per_group, and max_rows_per_group. -lubridate:component extraction functions: tz() (timezone), semester() (semester), dst() (daylight savings time indicator), https://rdrr.io/r/base/date.html; class="external-link">date() (extract date), epiyear() (epiyear), improvements to month(), which now works with integer inputs. -Added make_date() make_datetime() + https://rdrr.io/r/base/ISOdatetime.html; class="external-link">ISOdatetime() https://rdrr.io/r/base/ISOdatetime.html; class="external-link">ISOdate() to create date-times from numeric representations. -Added decimal_date() and date_decimal() +write_csv_arrow accepts a Dataset or an Arrow dplyr query. +Joining one or more datasets while option(use_threads = FALSE) no longer crashes R. That option is set by default on Windows. + +dplyr joins support the suffix argument to handle overlap in column names. +Filtering a Parquet dataset with https://rdrr.io/r/base/NA.html; class="external-link">is.na() no longer misses any rows. + +map_batches() correctly accepts Dataset objects. + + +Enhancements to date and time support + +read_csv_arrow()’s readr-style type T is mapped to timestamp(unit = "ns") instead of timestamp(unit = "s"). 
+For Arrow dplyr queries, added additional https://lubridate.tidyverse.org; class="external-link">lubridate features and fixes:New component extraction functions: +https://lubridate.tidyverse.org/reference/tz.html; class="external-link">lubridate::tz() (timezone), + +https://lubridate.tidyverse.org/reference/quarter.html; class="external-link">lubridate::semester(), + +https://lubridate.tidyverse.org/reference/dst.html; class="external-link">lubridate::dst() (daylight savings time boolean), + +https://lubridate.tidyverse.org/reference/date.html; class="external-link">lubridate::date(), + +https://lubridate.tidyverse.org/reference/year.html; class="external-link">lubridate::epiyear() (year according to epidemiological week calendar), + + +https://lubridate.tidyverse.org/reference/month.html; class="external-link">lubridate::month() works with integer inputs. + +https://lubridate.tidyverse.org/reference/make_datetime.html; class="external-link">lubridate::make_date() https://lubridate.tidyverse.org/reference/make_datetime.html; class="external-link">lubridate::make_datetime() + lubridate::ISOdatetime() lubridate::ISOdate() to create date-times from numeric representations. + +https://lubridate.tidyverse.org/reference/decimal_date.html; class="external-link">lubridate::decimal_date() and https://lubridate.tidyverse.org/re
[arrow-site] 01/01: Backfill R news for 8.0.0 release
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch nealrichardson-patch-1 in repository https://gitbox.apache.org/repos/asf/arrow-site.git commit 5df23bcc2f4ba19df45306184bf544963867fcda Author: Neal Richardson AuthorDate: Thu May 12 16:36:51 2022 -0400 Backfill R news for 8.0.0 release https://github.com/apache/arrow/commit/526fa070c82c0e1c6d26a4c1d06a591b37c05011 apparently did not make it into the release tag --- docs/r/news/index.html | 108 +++-- 1 file changed, 96 insertions(+), 12 deletions(-) diff --git a/docs/r/news/index.html b/docs/r/news/index.html index 545c27d780..bd4fc2effc 100644 --- a/docs/r/news/index.html +++ b/docs/r/news/index.html @@ -128,27 +128,111 @@ -arrow 7.0.0.9000 +arrow 8.0.02022-05-09 + +Enhancements to dplyr and datasets -read_csv_arrow()’s readr-style type T is now mapped to timestamp(unit = "ns") instead of timestamp(unit = "s"). +open_dataset():correctly supports the skip argument for skipping header rows in CSV datasets. +can take a list of datasets with differing schemas and attempt to unify the schemas to produce a UnionDataset. + +Arrow https://dplyr.tidyverse.org; class="external-link">dplyr queries:are supported on RecordBatchReader. This allows, for example, results from DuckDB to be streamed back into Arrow rather than materialized before continuing the pipeline. +no longer need to materialize the entire result table before writing to a dataset if the query contains contains aggregations or joins. +supports https://dplyr.tidyverse.org/reference/rename.html; class="external-link">dplyr::rename_with(). + +https://dplyr.tidyverse.org/reference/count.html; class="external-link">dplyr::count() returns an ungrouped dataframe. + + +write_dataset has more options for controlling row group and file sizes when writing partitioned datasets, such as max_open_files, max_rows_per_file, min_rows_per_group, and max_rows_per_group. 
-lubridate:
-  - component extraction functions: tz() (timezone), semester(), dst() (daylight savings time indicator), date() (extract date), epiyear() (epidemiological year), and improvements to month(), which now works with integer inputs.
-  - Added make_date() and make_datetime(), plus ISOdatetime() and ISOdate(), to create date-times from numeric representations.
-  - Added decimal_date() and date_decimal()
+write_csv_arrow() accepts a Dataset or an Arrow dplyr query.
+Joining one or more datasets while option(use_threads = FALSE) is set no longer crashes R. That option is set by default on Windows.
+dplyr joins support the suffix argument to handle overlap in column names.
+Filtering a Parquet dataset with is.na() no longer misses any rows.
+map_batches() correctly accepts Dataset objects.
+
+Enhancements to date and time support
+
+read_csv_arrow()’s readr-style type T is mapped to timestamp(unit = "ns") instead of timestamp(unit = "s").
+For Arrow dplyr queries, added additional lubridate (https://lubridate.tidyverse.org) features and fixes:
+  - New component extraction functions: lubridate::tz() (timezone), lubridate::semester(), lubridate::dst() (daylight savings time boolean), lubridate::date(), and lubridate::epiyear() (year according to epidemiological week calendar).
+  - lubridate::month() works with integer inputs.
+  - lubridate::make_date() and lubridate::make_datetime(), plus lubridate::ISOdatetime() and lubridate::ISOdate(), to create date-times from numeric representations.
+  - lubridate::decimal_date() and lubridate::date_decimal()
+  - https://lubridate.tidyverse.org/reference/make_difftime.html; clas
[arrow-site] branch nealrichardson-patch-1 created (now 5df23bcc2f)
This is an automated email from the ASF dual-hosted git repository. npr pushed a change to branch nealrichardson-patch-1 in repository https://gitbox.apache.org/repos/asf/arrow-site.git

at 5df23bcc Backfill R news for 8.0.0 release

This branch includes the following new commits:
     new 5df23bcc2f Backfill R news for 8.0.0 release

The 1 revision listed above as "new" is entirely new to this repository and will be described in a separate email. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[arrow] branch master updated: ARROW-16414: [R] Remove ARROW_R_WITH_ARROW and arrow_available()
This is an automated email from the ASF dual-hosted git repository. npr pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new 824f58f7df ARROW-16414: [R] Remove ARROW_R_WITH_ARROW and arrow_available()
824f58f7df is described below

commit 824f58f7df4043ba41351ee1b75d4293521e1ad8
Author: Neal Richardson
AuthorDate: Tue May 10 12:48:22 2022 -0400

    ARROW-16414: [R] Remove ARROW_R_WITH_ARROW and arrow_available()

    The diff looks bigger than that because

    * Sometimes those changes just resulted in reducing indentation
    * I moved arrow_info() and related functions to their own file, and did the same with ArrowObject while I was there
    * The way we were wrapping testthat::test_that to check whether arrow was available had a side effect of creating a closure that stored intermediate objects that we reused across tests, and that broke when I removed it.
    * I didn't have styler configured correctly in vscode when I started because I had upgraded R to 4.2, so to fix what I had already committed that was unstyled, I ran `make style-all` across everything, which reformatted a bunch of unrelated code.

    I tried to pull on all threads I noticed where we were doing things an unnatural way because we couldn't assume that arrow was present, but there may be more.
    Closes #13086 from nealrichardson/arrow-is-available

    Lead-authored-by: Neal Richardson
    Co-authored-by: Jonathan Keane
    Signed-off-by: Neal Richardson
---
 ci/scripts/r_test.sh                          | 6 +-
 dev/tasks/conda-recipes/r-arrow/configure.win | 2 +-
 r/DESCRIPTION                                 | 4 +-
 r/R/array.R                                   | 4 +-
 r/R/arrow-datum.R                             | 69 --
 r/R/arrow-info.R                              | 185
 r/R/arrow-object.R                            | 61 ++
 r/R/arrow-package.R                           | 295 ++
 r/R/arrowExports.R                            | 1 -
 r/R/buffer.R                                  | 4 +-
 r/R/compression.R                             | 6 +-
 r/R/compute.R                                 | 8 +-
 r/R/csv.R                                     | 8 +-
 r/R/dataset.R                                 | 2 +-
 r/R/dplyr-datetime-helpers.R                  | 7 +-
 r/R/dplyr-funcs-datetime.R                    | 8 +-
 r/R/dplyr-funcs-string.R                      | 16 +-
 r/R/dplyr-funcs-type.R                        | 12 +-
 r/R/dplyr-funcs.R                             | 18 +-
 r/R/dplyr-summarize.R                         | 2 +-
 r/R/extension.R                               | 27 +--
 r/R/feather.R                                 | 8 +-
 r/R/field.R                                   | 4 +-
 r/R/filesystem.R                              | 2 +-
 r/R/install-arrow.R                           | 2 +-
 r/R/io.R                                      | 19 +-
 r/R/ipc-stream.R                              | 4 +-
 r/R/json.R                                    | 2 +-
 r/R/memory-pool.R                             | 2 +-
 r/R/message.R                                 | 2 +-
 r/R/parquet.R                                 | 4 +-
 r/R/record-batch-reader.R                     | 2 +-
 r/R/record-batch-writer.R                     | 4 +-
 r/R/record-batch.R                            | 15 +-
 r/R/scalar.R                                  | 2 +-
 r/R/schema.R                                  | 4 +-
 r/R/table.R                                   | 2 +-
 r/R/type.R                                    | 12 +-
 r/_pkgdown.yml                                | 1 -
 r/configure                                   | 1 -
 r/configure.win                               | 4 +-
 r/data-raw/codegen.R                          | 31 +--
 r/man/Field.Rd                                | 2 -
 r/man/RecordBatchWriter.Rd                    | 2 -
 r/man/Scalar.Rd                               | 2 -
 r/man/arrow_available.Rd                      | 50 -
 r/man/arrow_info.Rd                           | 32 ++-
 r/man/as_data_type.Rd                         | 3 +-
 r/man/buffer.Rd                               | 2 -
 r/man/call_function.Rd                        | 2 -
 r/man/codec_is_available.Rd                   | 2 -
 r/man/concat_tables.Rd                        | 2 -
 r/man/data-type.Rd                            | 2 -
 r/man/infer_type.Rd                           | 2 -
 r/man/install_arrow.Rd                        | 2 +-
 r/man/list_compute_functions.Rd               | 2 -
 r/man/match_arrow.Rd                          | 2 -
 r/man/new_extension_type.Rd                   | 6 +-
 r/man/read_delim_arrow.Rd                     | 2 -
 r/man/read_feather.Rd