Re: [I] [R] Filtering based on str_detect character columns with more than 4000 characters and occasional empty cells not working correctly when reading from disk with arrow [arrow]

via GitHub Thu, 25 Apr 2024 12:15:17 -0700


paleolimbot commented on issue #41175:
URL: https://github.com/apache/arrow/issues/41175#issuecomment-2078004940


   I did a quick pass with the reprex (thank you!) and confirmed that even with an 
identical query plan (apart from the source node), the two queries select a 
different number of rows: 5652 when reading from disk versus 9000 from memory. 
The next step would be to reproduce this in Python, since the people who know 
how to fix it are better at debugging it there (I may get there in the next few 
minutes, but I'm leaving this here in case I don't!).
   
   <details>
   
   ``` r
   library("tibble")
   library("dplyr")
   #> 
   #> Attaching package: 'dplyr'
   #> The following objects are masked from 'package:stats':
   #> 
   #>     filter, lag
   #> The following objects are masked from 'package:base':
   #> 
   #>     intersect, setdiff, setequal, union
   library("stringr")
   library("arrow")
   #> Warning: package 'arrow' was built under R version 4.3.3
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   
   set.seed(1)
   
   data_df <- tibble::tibble(size = 2:10000) |> 
     dplyr::mutate(text = paste(c("a", 
                                  sample(x = c(letters, LETTERS),
                                         size = 10000,
                                         replace = TRUE)),
                                collapse = "")) |> 
     dplyr::group_by(size) |> 
     dplyr::mutate(text = stringr::str_trunc(text, width = size, ellipsis = "")) |> 
     dplyr::mutate(category = round(size/10)) |> 
     dplyr::ungroup() |> 
     dplyr::group_by(category)
   
   data_df[["text"]][sample(c(TRUE, FALSE), size = nrow(data_df), prob = c(0.1, 0.9), replace = TRUE)] <- ""
   
   ### Store in a temp folder
   test_arrow_path <- file.path(tempdir(), "test_arrow")
   write_dataset(dataset = data_df,
                 path = test_arrow_path)
   
   ### Read from temp folder
   arrow_from_disk <- open_dataset(test_arrow_path)
   ### Materialize the same dataset as an in-memory Table
   arrow_from_memory <- dplyr::compute(arrow_from_disk)
   
   # Used select(size, text) to ensure that the query plans were identical
   arrow_from_disk |> 
     dplyr::filter(stringr::str_detect(text, "a")) |> 
     dplyr::select(size, text) |> 
     show_query()
   #> ExecPlan with 4 nodes:
   #> 3:SinkNode{}
   #>   2:ProjectNode{projection=[size, text]}
   #>     1:FilterNode{filter=match_substring_regex(text, {pattern="a", ignore_case=false})}
   #>       0:SourceNode{}
   
   arrow_from_memory |> 
     dplyr::filter(stringr::str_detect(text, "a")) |> 
     dplyr::select(size, text) |> 
     show_query()
   #> ExecPlan with 4 nodes:
   #> 3:SinkNode{}
   #>   2:ProjectNode{projection=[size, text]}
   #>     1:FilterNode{filter=match_substring_regex(text, {pattern="a", ignore_case=false})}
   #>       0:TableSourceNode{}
   
   arrow_from_disk |> 
     dplyr::filter(stringr::str_detect(text, "a")) |> 
     dplyr::select(size, text) |> 
     nrow()
   #> [1] 5652
   
   arrow_from_memory |> 
     dplyr::filter(stringr::str_detect(text, "a")) |> 
     dplyr::select(size, text) |> 
     nrow()
   #> [1] 9000
   ```
   
   <sup>Created on 2024-04-25 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>
   
   </details>
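
   The reprex does not say which of the two counts is right, so here is a 
minimal base-R sketch of a ground-truth check that does not involve arrow at 
all. It relies on how the data is built: every string starts with "a" before 
truncation, so a row should match the pattern exactly when it was not blanked 
out. The names `base_text`, `is_blank`, `n_match` and `n_nonempty` are 
illustrative, and `substring()` and `grepl()` are assumed to be faithful 
stand-ins for the `str_trunc()` and `str_detect()` calls in the reprex.

   ``` r
   set.seed(1)

   # One long random string that always begins with "a", as in the reprex
   base_text <- paste(c("a",
                        sample(x = c(letters, LETTERS),
                               size = 10000,
                               replace = TRUE)),
                      collapse = "")

   # Keep the first `size` characters for each size in 2:10000
   # (stand-in for str_trunc(text, width = size, ellipsis = ""))
   text <- substring(base_text, 1, 2:10000)

   # Blank out roughly 10% of the rows
   is_blank <- sample(c(TRUE, FALSE), size = length(text),
                      prob = c(0.1, 0.9), replace = TRUE)
   text[is_blank] <- ""

   # Every non-empty string starts with "a", so these two counts must agree
   n_match <- sum(grepl("a", text, fixed = TRUE))
   n_nonempty <- sum(nzchar(text))
   stopifnot(n_match == n_nonempty)

   c(matches = n_match, non_empty = n_nonempty)
   ```

   If that reasoning holds, the expected result is the number of non-empty 
rows, roughly 90% of the 9999 rows. That is consistent with the in-memory 
count of 9000 and suggests the on-disk query is the one dropping rows.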


