[GitHub] [arrow] giocomai opened a new issue, #36720: [R] Inconsistent results with stringr::str_detect and str_count, when case_insensitive is set to TRUE via stringr::regex or stringr::fixed

via GitHub Mon, 17 Jul 2023 05:12:46 -0700


giocomai opened a new issue, #36720:
URL: https://github.com/apache/arrow/issues/36720


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When using stringr's `str_detect()` and `str_count()`, stringr's own 
documentation recommends using `stringr::regex()` and `stringr::fixed()` "for 
finer control of the matching behaviour."
   
   This can be used, for example, to set "ignore_case" to TRUE, which is not 
available as an argument to `str_detect()` directly. 
   
   The resulting functions have the following structure:
   
   ``` r
   stringr::str_detect(
     string = "eXample",
     pattern = stringr::regex("x", ignore_case = TRUE)
   )
   #> [1] TRUE
   ```
   
   Unfortunately, arguments passed via `stringr::regex()` and 
`stringr::fixed()` are silently ignored by `arrow`, which leads to unexpected 
and quite possibly wrong results.
   
   Printing the arrow query shows that even when `ignore_case` is set to TRUE, 
the call is passed with `ignore_case` as FALSE.
   
   ```
   bool (match_substring_regex(text, {pattern="x", ignore_case=false}))
   ```
   
   I suppose `arrow` should either get this right, or throw an error.
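   
   For reference, the flag is still present on the pattern object in plain R, so it should be visible when the call is translated. The check below is a minimal sketch: it assumes a recent stringr in which `regex()` and `fixed()` keep their settings in an `"options"` attribute, which is an implementation detail rather than a documented interface.
   
   ``` r
   # Build the same kind of pattern objects used in this report
   rx <- stringr::regex("x", ignore_case = TRUE)
   fx <- stringr::fixed("x", ignore_case = TRUE)
   
   # Assumption: the settings live in the "options" attribute of the pattern
   rx_flag <- attr(rx, "options")$case_insensitive
   fx_flag <- attr(fx, "options")$case_insensitive
   
   rx_flag
   fx_flag
   ```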
   
   The following reprex (run with arrow version 12.0.1) shows:
   
   - how the `ignore.case` argument works nicely when passed via the base 
function `grepl`
   - how it is simply ignored when passed to `stringr::str_detect()`, 
`stringr::str_count()` (and possibly other stringr functions) through 
`stringr::regex()` and `stringr::fixed()`
   - how it works nicely if case-insensitivity is set directly in the pattern 
with `(?i)`
   - how `arrow` throws an error when using `stringi::stri_detect_regex()` 
(rather than `stringr`) with `case_insensitive = TRUE` (which is still 
preferable to ignoring the argument silently).
   
   There are obviously many workarounds, but this has led to errors when I 
applied functions that were not originally written and tested with `arrow` in 
mind. 
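   
   One such workaround, sketched with a hypothetical helper (`inline_ignore_case()` is not part of stringr or arrow), is to fold the flag into the pattern as `(?i)` before the query is built. It assumes the pattern comes from `stringr::regex()` and that the settings sit in the `"options"` attribute; a `stringr::fixed()` pattern would also need its regex metacharacters escaped, which is not handled here.
   
   ``` r
   # Hypothetical helper: turn a stringr::regex() pattern into a plain string,
   # prefixing "(?i)" when the pattern was created with ignore_case = TRUE
   inline_ignore_case <- function(pattern) {
     opts <- attr(pattern, "options")   # assumed location of the settings
     plain <- unclass(pattern)
     attributes(plain) <- NULL          # keep only the bare character vector
     if (isTRUE(opts$case_insensitive)) paste0("(?i)", plain) else plain
   }
   
   p <- inline_ignore_case(stringr::regex("a", ignore_case = TRUE))
   p
   #> [1] "(?i)a"
   ```
   
   The resulting string has the same form as the `"(?i)a"` pattern used in the reprex below.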
   
   
   ``` r
   library("arrow")
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   
   apple_df <- tibble::tibble(
     text = c(
       "apple",
       "APPLE"
     )
   )
   
   arrow::write_dataset(dataset = apple_df, path = "apple.parquet")
   
   apple_parquet <- arrow::open_dataset(sources = "apple.parquet")
   
   
   
   ## with grepl, it works
   
   apple_parquet |>
     dplyr::mutate(
       a_check = grepl(
         x = text,
         pattern = "a",
         ignore.case = TRUE
       )
     )
   #> FileSystemDataset (query)
   #> text: string
   #> a_check: bool (if_else(is_null(match_substring_regex(text, {pattern="a", ignore_case=true}), {nan_is_null=true}), false, match_substring_regex(text, {pattern="a", ignore_case=true})))
   #> 
   #> See $.data for the source Arrow object
   
   apple_parquet |>
     dplyr::mutate(
       a_check = grepl(x = text, pattern = "a", ignore.case = TRUE)
     ) |>
     dplyr::collect()
   #> # A tibble: 2 × 2
   #>   text  a_check
   #>   <chr> <lgl>  
   #> 1 apple TRUE   
   #> 2 APPLE TRUE
   
   
   ## with stringr::str_detect it does not work
   
   apple_parquet |>
     dplyr::mutate(
       a_check = stringr::str_detect(
         string = text,
         pattern = "a"
       )
     )
   #> FileSystemDataset (query)
   #> text: string
   #> a_check: bool (match_substring_regex(text, {pattern="a", ignore_case=false}))
   #> 
   #> See $.data for the source Arrow object
   
   
   apple_parquet |>
     dplyr::mutate(
       a_check = stringr::str_detect(
         string = text,
         pattern = stringr::regex(
           pattern = "a",
           ignore_case = TRUE
         )
       )
     )
   #> FileSystemDataset (query)
   #> text: string
   #> a_check: bool (match_substring_regex(text, {pattern="a", ignore_case=false}))
   #> 
   #> See $.data for the source Arrow object
   
   
   apple_parquet |>
     dplyr::mutate(
       a_check = stringr::str_detect(
         string = text,
         pattern = stringr::regex(
           pattern = "a",
           ignore_case = TRUE
         )
       ),
       p_count = stringr::str_count(
         string = text,
         pattern = stringr::regex(
           pattern = "p",
           ignore_case = TRUE
         )
       )
     ) |>
     dplyr::collect()
   #> # A tibble: 2 × 3
   #>   text  a_check p_count
   #>   <chr> <lgl>     <int>
   #> 1 apple TRUE          2
   #> 2 APPLE FALSE         0
   
   ## Same result with stringr::fixed
   
   
   apple_parquet |>
     dplyr::mutate(
       a_check = stringr::str_detect(
         string = text,
         pattern = stringr::fixed(
           pattern = "a",
           ignore_case = TRUE
         )
       ),
       p_count = stringr::str_count(
         string = text,
         pattern = stringr::fixed(
           pattern = "p",
           ignore_case = TRUE
         )
       )
     ) |>
     dplyr::collect()
   #> # A tibble: 2 × 3
   #>   text  a_check p_count
   #>   <chr> <lgl>     <int>
   #> 1 apple TRUE          2
   #> 2 APPLE FALSE         0
   
   ## it works when the case-insensitive flag is included in the regex itself
   
   apple_parquet |>
     dplyr::mutate(
       a_check = stringr::str_detect(
         string = text,
         pattern = "(?i)a"
       ),
       p_count = stringr::str_count(
         string = text,
         pattern = "(?i)p"
       )
     ) |>
     dplyr::collect()
   #> # A tibble: 2 × 3
   #>   text  a_check p_count
   #>   <chr> <lgl>     <int>
   #> 1 apple TRUE          2
   #> 2 APPLE TRUE          2
   
   
   
   ## With stringi
   
   apple_df |>
     dplyr::mutate(
       a_check = stringi::stri_detect_regex(
         str = text,
         pattern = "a",
         case_insensitive = TRUE
       )
     ) |>
     dplyr::collect()
   #> # A tibble: 2 × 2
   #>   text  a_check
   #>   <chr> <lgl>  
   #> 1 apple TRUE   
   #> 2 APPLE TRUE
   
   
   
   apple_parquet |>
     dplyr::mutate(
       a_check = stringi::stri_detect_regex(
         str = text,
         pattern = "a",
         case_insensitive = TRUE
       )
     ) |>
     dplyr::collect()
   #> Error: Expression stringi::stri_detect_regex(str = text, pattern = "a", case_insensitive = TRUE) not supported in Arrow
   #> Call collect() first to pull data into R.
   ```
   
   <sup>Created on 2023-07-17 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
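   
   For comparison, a baseline on an ordinary character vector, without arrow, shows the results one would expect from the `stringr::regex()` and `stringr::fixed()` queries above. This is a minimal sketch that uses only stringr; the variable names are illustrative.
   
   ``` r
   text <- c("apple", "APPLE")
   
   # Case-insensitive matching through stringr::regex()
   a_regex <- stringr::str_detect(text, stringr::regex("a", ignore_case = TRUE))
   p_regex <- stringr::str_count(text, stringr::regex("p", ignore_case = TRUE))
   
   # Case-insensitive matching through stringr::fixed()
   a_fixed <- stringr::str_detect(text, stringr::fixed("a", ignore_case = TRUE))
   p_fixed <- stringr::str_count(text, stringr::fixed("p", ignore_case = TRUE))
   
   a_regex
   #> [1] TRUE TRUE
   p_regex
   #> [1] 2 2
   ```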
   
   ### Component(s)
   
   R


