[
https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511592#comment-17511592
]
Andy Teucher edited comment on ARROW-16007 at 3/24/22, 5:07 AM:
----------------------------------------------------------------
I have pushed up my work so far trying to implement {{null_as_false}} in the
C++ code
[here|https://github.com/apache/arrow/compare/master...ateucher:r-grepl-na].
I am struggling with a couple of things:
# How to detect {{NULL}} values in a {{string_view}} (as distinct from an
empty string). Right now I am using {{string_view::empty()}}, but I don't think
that's right.
# The code logic I've written is working in that the argument
{{null_as_false}} is going to the right place (tested with a bunch of
{{std::cout}} peppered around), but the {{return false;}}
[here|https://github.com/ateucher/arrow/blob/c9c07ae8170cd931d839a288f3c19ac9118eccde/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L1459]
is being shortcut somewhere that I can't find, as it is still returning
{{NULL}}.
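
On item 1, a minimal sketch of why {{string_view::empty()}} cannot answer the question. Arrow keeps nulls in a validity bitmap that is separate from the string bytes, so a view over a null slot and a view over a genuinely empty string can look identical. The {{ToyStringColumn}} type and {{MakeExample}} helper below are hypothetical stand-ins written for illustration with the standard library only; they are not Arrow classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical toy column (not an Arrow class): string values plus a
// separate per-slot validity flag, mimicking the idea of a validity bitmap.
struct ToyStringColumn {
  std::vector<std::string> values;  // a null slot holds an empty placeholder
  std::vector<bool> valid;          // false means the slot is null

  // The view carries only characters; it knows nothing about validity.
  std::string_view View(std::size_t i) const { return values[i]; }

  // Nullness has to be read from the validity flags instead.
  bool IsNull(std::size_t i) const { return !valid[i]; }
};

// Builds a column with a normal string, a real empty string, and a null.
inline ToyStringColumn MakeExample() {
  ToyStringColumn col{{"alpha", "", ""}, {true, true, false}};
  assert(col.values.size() == col.valid.size());
  return col;
}
```

In this toy, slots 1 and 2 both produce an empty view, and only {{IsNull()}} tells them apart. By the same reasoning, a null check probably has to consult the array's validity information rather than the view itself.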
I'm actually struggling to figure out where the R vector gets passed into (and
out of) the C++ innards, as I'm guessing that's where those NULLs are captured
and returned as NULLs, and probably where the casting to FALSE should happen.
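
One plausible explanation for the {{return false;}} having no visible effect, offered as an assumption rather than a confirmed reading of the Arrow source: if the output's validity is derived from the input's validity outside the per-value matcher, then null slots are decided before (or regardless of) what the matcher returns. The sketch below models that with {{std::optional}}; {{MatchSubstringSketch}} and its {{null_as_false}} parameter are hypothetical names for illustration and do not mirror the real kernel signature.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

using MaybeString = std::optional<std::string>;  // nullopt models a null slot
using MaybeBool = std::optional<bool>;

// Hypothetical driver loop (not Arrow's API). The per-value substring test
// only ever runs on valid slots; what happens to null slots is decided here,
// in the surrounding loop, not inside the matcher.
inline std::vector<MaybeBool> MatchSubstringSketch(
    const std::vector<MaybeString>& input, std::string_view pattern,
    bool null_as_false) {
  std::vector<MaybeBool> out;
  out.reserve(input.size());
  for (const MaybeString& slot : input) {
    if (!slot.has_value()) {
      // Null in: either propagate the null or emit a valid `false`.
      out.push_back(null_as_false ? MaybeBool(false) : MaybeBool(std::nullopt));
      continue;
    }
    const bool found =
        std::string_view(*slot).find(pattern) != std::string_view::npos;
    out.emplace_back(found);
  }
  assert(out.size() == input.size());
  return out;
}
```

With {{null_as_false = false}} the third slot of {{("alpha", "bet", null)}} stays null, matching what the arrow binding does today; with {{null_as_false = true}} it becomes a valid {{false}}, matching base {{grepl}}. If the real kernel works along these lines, the change would belong where output validity is set, not in the matcher.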
This is my first foray into C++ and this is a big complex codebase, so I know
it's entirely possible I'm totally on the wrong track :)
was (Author: JIRAUSER279940):
I have pushed up my work so far trying to implement {{null_as_false}} in the
C++ code
[here|https://github.com/apache/arrow/compare/master...ateucher:r-grepl-na].
I am struggling with a couple of things:
# how to detect {{NULL}} values in a {{string_view}} (differentiated from an
empty string). Right now I am using {{string_view::empty()}} but I don't think
that's right.
# The code logic I've written is working in that the argument
{{null_as_false}} is going to the right place (tested with a bunch of
{{std::cout}} peppered around), but the {{return false;}}
[here|https://github.com/ateucher/arrow/blob/c9c07ae8170cd931d839a288f3c19ac9118eccde/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L1459]
is being shortcut somewhere that I can't find, as it is still returning
{{NULL}}.
I'm actually struggling to figure out where the R vector gets passed into the
C++ innards, as I'm guessing that's where those NULLs are captured and returned
as NULLs...
This is my first foray into C++ and this is a big complex codebase, so I know
it's entirely possible I'm totally on the wrong track :)
> [R] binding for grepl has different behaviour with NA compared to R base grepl
> ------------------------------------------------------------------------------
>
> Key: ARROW-16007
> URL: https://issues.apache.org/jira/browse/ARROW-16007
> Project: Apache Arrow
> Issue Type: Improvement
> Affects Versions: 7.0.0
> Reporter: Andy Teucher
> Priority: Minor
>
> The arrow binding to {{grepl}} behaves slightly differently than the base R
> {{grepl}}, in that it returns {{NA}} for {{NA}} inputs, whereas base
> {{grepl}} returns {{FALSE}} with {{NA}} inputs. arrow's implementation is
> consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and
> {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in
> arrow.
> I don't know if this is something you would want to change so that the
> {{grepl}} behaviour aligns with base {{grepl}}, or simply document this
> difference?
> Reprex:
>
> {code:r}
> library(arrow, warn.conflicts = FALSE, quietly = TRUE)
> library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
> library(stringr, quietly = TRUE)
> alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_))
> alpha_dataset <- InMemoryDataset$create(alpha_df)
> mutate(alpha_df,
> grepl_is_a = grepl("a", alpha),
> stringr_is_a = str_detect(alpha, "a"))
> #> alpha grepl_is_a stringr_is_a
> #> 1 alpha TRUE TRUE
> #> 2 bet FALSE FALSE
> #> 3 <NA> FALSE NA
> mutate(alpha_dataset,
> grepl_is_a = grepl("a", alpha),
> stringr_is_a = str_detect(alpha, "a")) |>
> collect()
> #> alpha grepl_is_a stringr_is_a
> #> 1 alpha TRUE TRUE
> #> 2 bet FALSE FALSE
> #> 3 <NA> NA NA
> # base R grepl returns FALSE for NA
> grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex
> #> [1] TRUE FALSE FALSE
> grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring
> #> [1] TRUE FALSE FALSE
> # stringr::str_detect returns NA for NA
> str_detect(alpha_df$alpha, "a")
> #> [1] TRUE FALSE NA
> alpha_array <- Array$create(alpha_df$alpha)
> # arrow functions return null for null (NA)
> call_function("match_substring_regex", alpha_array, options = list(pattern = "a"))
> #> Array
> #> <bool>
> #> [
> #> true,
> #> false,
> #> null
> #> ]
> call_function("match_substring", alpha_array, options = list(pattern = "a"))
> #> Array
> #> <bool>
> #> [
> #> true,
> #> false,
> #> null
> #> ]
> {code}
>
>