giocomai opened a new issue, #36720:
URL: https://github.com/apache/arrow/issues/36720
### Describe the bug, including details regarding any error messages, version, and platform.
When using stringr's `str_detect()` and `str_count()`, stringr's own
documentation recommends using `stringr::regex()` and `stringr::fixed()` "for
finer control of the matching behaviour."
This can be used, for example, to set "ignore_case" to TRUE, which is not
available as an argument to `str_detect()` directly.
The resulting calls have the following structure:
``` r
stringr::str_detect(
  string = "eXample",
  pattern = stringr::regex("x", ignore_case = TRUE)
)
#> [1] TRUE
```
Unfortunately, arguments passed via `stringr::regex()` and
`stringr::fixed()` are silently ignored by `arrow`, which leads to unexpected
and quite possibly wrong results.
Printing the arrow query shows that even when `ignore_case` is set to TRUE,
the compute call is built with `ignore_case` set to FALSE:
```
bool (match_substring_regex(text, {pattern="x", ignore_case=false}))
```
I suppose `arrow` should either get this right, or throw an error.
The following reprex (run with arrow version 12.0.1) shows:
- how the `ignore.case` argument works nicely when passed via the base
function `grepl`
- how it is simply ignored when passed to `stringr::str_detect()`,
`stringr::str_count()` (and possibly other stringr functions) through
`stringr::regex()` and `stringr::fixed()`
- how it works nicely if the ignore_case is passed directly in the pattern
with `(?i)`
- how `arrow` throws an error when using `stringi::stri_detect_regex()`
(rather than `stringr`) with `case_insensitive = TRUE` (which is still
preferable to ignoring the argument silently).
There are obviously many workarounds, but this has led to errors when I
applied functions that were not originally written and tested with `arrow` in
mind.
``` r
library("arrow")
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
apple_df <- tibble::tibble(
  text = c(
    "apple",
    "APPLE"
  )
)
arrow::write_dataset(dataset = apple_df, path = "apple.parquet")
apple_parquet <- arrow::open_dataset(sources = "apple.parquet")
## with grepl, it works
apple_parquet |>
  dplyr::mutate(
    a_check = grepl(
      x = text,
      pattern = "a",
      ignore.case = TRUE
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (if_else(is_null(match_substring_regex(text, {pattern="a", ignore_case=true}), {nan_is_null=true}), false, match_substring_regex(text, {pattern="a", ignore_case=true})))
#>
#> See $.data for the source Arrow object
apple_parquet |>
  dplyr::mutate(
    a_check = grepl(x = text, pattern = "a", ignore.case = TRUE)
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 2
#> text a_check
#> <chr> <lgl>
#> 1 apple TRUE
#> 2 APPLE TRUE
## with stringr::str_detect it does not work
apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = "a"
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (match_substring_regex(text, {pattern="a", ignore_case=false}))
#>
#> See $.data for the source Arrow object
apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::regex(
        pattern = "a",
        ignore_case = TRUE
      )
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (match_substring_regex(text, {pattern="a", ignore_case=false}))
#>
#> See $.data for the source Arrow object
apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::regex(
        pattern = "a",
        ignore_case = TRUE
      )
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = stringr::regex(
        pattern = "p",
        ignore_case = TRUE
      )
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#> text a_check p_count
#> <chr> <lgl> <int>
#> 1 apple TRUE 2
#> 2 APPLE FALSE 0
## Same result with stringr::fixed
apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::fixed(
        pattern = "a",
        ignore_case = TRUE
      )
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = stringr::fixed(
        pattern = "p",
        ignore_case = TRUE
      )
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#> text a_check p_count
#> <chr> <lgl> <int>
#> 1 apple TRUE 2
#> 2 APPLE FALSE 0
## it works nicely just including the case insensitive in the regex
apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = "(?i)a"
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = "(?i)p"
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#> text a_check p_count
#> <chr> <lgl> <int>
#> 1 apple TRUE 2
#> 2 APPLE TRUE 2
## With stringi
apple_df |>
  dplyr::mutate(
    a_check = stringi::stri_detect_regex(
      str = text,
      pattern = "a",
      case_insensitive = TRUE
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 2
#> text a_check
#> <chr> <lgl>
#> 1 apple TRUE
#> 2 APPLE TRUE
apple_parquet |>
  dplyr::mutate(
    a_check = stringi::stri_detect_regex(
      str = text,
      pattern = "a",
      case_insensitive = TRUE
    )
  ) |>
  dplyr::collect()
#> Error: Expression stringi::stri_detect_regex(str = text, pattern = "a", case_insensitive = TRUE) not supported in Arrow
#> Call collect() first to pull data into R.
```
<sup>Created on 2023-07-17 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
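As a possible stopgap, the inline `(?i)` flag that works in the reprex can be prepended programmatically instead of relying on `stringr::regex(ignore_case = TRUE)`. The helper below is only a minimal sketch: `with_ignore_case()` is a hypothetical name (not part of `arrow` or `stringr`), and it covers plain regex patterns only, not `stringr::fixed()`.

``` r
# Hypothetical helper, not part of arrow or stringr: prepend the inline
# case-insensitive flag so the setting travels inside the pattern string.
with_ignore_case <- function(pattern, ignore_case = TRUE) {
  stopifnot(is.character(pattern), length(pattern) == 1L)
  if (isTRUE(ignore_case)) {
    paste0("(?i)", pattern)
  } else {
    pattern
  }
}

# Build the pattern up front, outside any arrow query.
a_pattern <- with_ignore_case("a")
a_pattern
#> [1] "(?i)a"
```

Passing `pattern = a_pattern` to `stringr::str_detect()` inside the arrow query should then behave like the `"(?i)a"` case in the reprex, assuming arrow resolves the local variable to a plain string. This does not address the underlying problem that the `ignore_case` argument is dropped silently.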
### Component(s)
R