Re: [I] [R] Filtering based on grepl/str_detect columns with long strings of text and occasional empty cells not working correctly when reading from disk with arrow [arrow]

via GitHub Mon, 15 Apr 2024 13:49:06 -0700


giocomai commented on issue #41175:
URL: https://github.com/apache/arrow/issues/41175#issuecomment-2057781993


   Here's a more revealing reprex. By creating a data frame with strings of 
growing length, the issue is much clearer. 
   
   Conditions for reproducing:
   - a dataset is written to parquet with `write_dataset`
   - it has a character column, with strings longer than about 4000 characters
   - in the same character column, at least some rows have empty values ("")
   - the dataset is read from disk with `open_dataset`
   - the dataset is filtered on that character column with 
`stringr::str_detect()`
   
   If all these conditions are met, then arrow returns an empty data frame. 
   
   If the dataset is stored partitioned, partitions that contain both long strings and empty values return zero rows, while other partitions return data as expected. 
   
   With non-ASCII characters, e.g. Cyrillic letters such as б, г, д, etc., the issue emerges at a text length of just over 2000 characters. 
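   
   One possible reading of the two thresholds (about 4000 ASCII characters versus about 2000 Cyrillic characters) is that the limit depends on UTF-8 byte length rather than character count, since each Cyrillic letter takes two bytes. This is only an assumption and is not confirmed anywhere in this thread; the base R sketch below just shows how the two lengths can be compared (the variable names are illustrative).
   
   ``` r
   # Assumption: the threshold may depend on UTF-8 byte length, not character count.
   ascii_text <- strrep("a", 4000)
   cyrillic_text <- enc2utf8(strrep("\u0431", 2000))  # "\u0431" is the letter б
   
   # Number of characters in each string
   char_lengths <- c(ascii = nchar(ascii_text, type = "chars"),
                     cyrillic = nchar(cyrillic_text, type = "chars"))
   
   # Number of bytes in each string
   byte_lengths <- c(ascii = nchar(ascii_text, type = "bytes"),
                     cyrillic = nchar(cyrillic_text, type = "bytes"))
   
   char_lengths
   byte_lengths
   ```
   
   If the byte lengths of the first failing rows match across ASCII and Cyrillic data, that would point to a byte-based limit rather than a character-based one.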
   
   A separate issue with non-ASCII characters may lead to inconsistencies in testing: `arrow` parses regular expressions with [re2](https://github.com/google/re2/wiki/Syntax) (if I understand correctly), not with the engine used by standard `stringr::str_detect()`, so the same code may give different results with and without arrow. I mention it here in case it is somehow related.
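   
   A minimal sketch of how such a difference could be checked, assuming the `\w` character class as the test pattern (my choice for illustration, not something taken from the original data): the re2 syntax page lists `\w` as `[0-9A-Za-z_]`, while the ICU engine behind stringr treats it as Unicode-aware, so a Cyrillic letter may match in one and not in the other.
   
   ``` r
   library("tibble")
   library("dplyr")
   library("stringr")
   library("arrow")
   
   # Illustrative data: one Cyrillic letter (б) and one ASCII letter
   cyr_df <- tibble::tibble(text = c("\u0431", "b"))
   
   # Plain stringr, evaluated in R with the ICU regex engine
   stringr_result <- stringr::str_detect(cyr_df$text, "\\w")
   
   # Same pattern, evaluated by arrow on an in-memory table
   arrow_result <- arrow_table(cyr_df) |>
     dplyr::filter(stringr::str_detect(text, "\\w")) |>
     dplyr::collect()
   
   stringr_result
   arrow_result
   ```
   
   Comparing `sum(stringr_result)` with `nrow(arrow_result)` shows whether the two engines agree for a given pattern.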
   
   ``` r
   library("tibble")
   library("dplyr")
   library("stringr")
   library("arrow")
   
   set.seed(1)
   
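   # One row per target length; every row starts from the same random
   # 10,001-character string beginning with "a"
   data_df <- tibble::tibble(size = 2:10000) |> 
     dplyr::mutate(text = paste(c("a", 
                           sample(x = c(letters, LETTERS),
                                  size = 10000,
                                  replace = TRUE)),
                           collapse = "")) |> 
     dplyr::group_by(size) |> 
     # Truncate each row's text to exactly `size` characters
     dplyr::mutate(text = stringr::str_trunc(text, width = size, ellipsis = "")) |> 
     dplyr::mutate(category = round(size/10)) |> 
     dplyr::ungroup() |> 
     # write_dataset() partitions by the grouping variable of a grouped data frame
     dplyr::group_by(category)
   
   # Replace roughly 10% of the text values with empty strings
   data_df[["text"]][sample(c(TRUE, FALSE), size = nrow(data_df), prob = c(0.1, 0.9), replace = TRUE)] <- ""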
   
   ### Store in a temp folder
   test_arrow_path <- file.path(tempdir(), "test_arrow")
   write_dataset(dataset = data_df,
                 path = test_arrow_path)
   
   ### Read from temp folder
   arrow_from_disk <- open_dataset(test_arrow_path)
   ### Read from memory
   arrow_from_memory <- arrow_table(data_df)
   
   
   
   filtered_from_disk_df <- arrow_from_disk |> 
     dplyr::filter(stringr::str_detect(text, "a")) |> 
     dplyr::collect()
   filtered_from_disk_df
   #> # A tibble: 5,652 × 3
   #>     size text                                                           category
   #>    <int> <chr>                                                             <int>
   #>  1   995 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
   #>  2   996 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
   #>  3   998 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
   #>  4  1000 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
   #>  5  1001 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
   #>  6  1002 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
   #>  7  1003 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
   #>  8  1004 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
   #>  9  1005 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
   #> 10    95 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…       10
   #> # ℹ 5,642 more rows
   
   
   filtered_from_memory_df <- arrow_from_memory |> 
     dplyr::filter(stringr::str_detect(text, "a")) |> 
     dplyr::collect()
   filtered_from_memory_df
   #> # A tibble: 9,000 × 3
   #> # Groups:   category [1,001]
   #>     size text        category
   #>    <int> <chr>          <dbl>
   #>  1     2 ad                 0
   #>  2     3 adM                0
   #>  3     4 adMa               0
   #>  4     5 adMaH              0
   #>  5     6 adMaHw             1
   #>  6     7 adMaHwQ            1
   #>  7     8 adMaHwQn           1
   #>  8     9 adMaHwQnr          1
   #>  9    10 adMaHwQnrY         1
   #> 10    11 adMaHwQnrYG        1
   #> # ℹ 8,990 more rows
   
   dplyr::anti_join(filtered_from_memory_df,
                    filtered_from_disk_df,
                    by = "size") |> 
     dplyr::arrange(size)
   #> # A tibble: 3,348 × 3
   #> # Groups:   category [392]
   #>     size text                                                           category
   #>    <int> <chr>                                                             <dbl>
   #>  1  4095 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
   #>  2  4097 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
   #>  3  4098 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
   #>  4  4099 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
   #>  5  4100 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
   #>  6  4101 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
   #>  7  4102 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
   #>  8  4103 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
   #>  9  4104 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
   #> 10  4105 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
   #> # ℹ 3,338 more rows
   
   nrow(filtered_from_disk_df)==nrow(filtered_from_memory_df)
   
   ```
   
   <sup>Created on 2024-04-15 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>

