paleolimbot commented on issue #41175:
URL: https://github.com/apache/arrow/issues/41175#issuecomment-2078004940
I did a quick pass with the reprex (thank you!) and confirmed that even with an
identical query plan (except for the source node), a different number of rows is
selected. The next step would be to reproduce this in Python, since the people
who know how to fix it are better at debugging it there (I may get there in the
next few minutes, but I'm leaving this here in case I don't!).
<details>
``` r
library("tibble")
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library("stringr")
library("arrow")
#> Warning: package 'arrow' was built under R version 4.3.3
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
set.seed(1)
data_df <- tibble::tibble(size = 2:10000) |>
  dplyr::mutate(text = paste(c("a",
                               sample(x = c(letters, LETTERS),
                                      size = 10000,
                                      replace = TRUE)),
                             collapse = "")) |>
  dplyr::group_by(size) |>
  dplyr::mutate(text = stringr::str_trunc(text, width = size, ellipsis = "")) |>
  dplyr::mutate(category = round(size/10)) |>
  dplyr::ungroup() |>
  dplyr::group_by(category)
data_df[["text"]][sample(c(TRUE, FALSE), size = nrow(data_df),
                         prob = c(0.1, 0.9), replace = TRUE)] <- ""
### Store in a temp folder
test_arrow_path <- file.path(tempdir(), "test_arrow")
write_dataset(dataset = data_df,
path = test_arrow_path)
### Read from temp folder
arrow_from_disk <- open_dataset(test_arrow_path)
### Read from memory
arrow_from_memory <- dplyr::compute(arrow_from_disk)
# Used select(size, text) to ensure that the query plans were identical
arrow_from_disk |>
  dplyr::filter(stringr::str_detect(text, "a")) |>
  dplyr::select(size, text) |>
  show_query()
#> ExecPlan with 4 nodes:
#> 3:SinkNode{}
#> 2:ProjectNode{projection=[size, text]}
#> 1:FilterNode{filter=match_substring_regex(text, {pattern="a", ignore_case=false})}
#> 0:SourceNode{}
arrow_from_memory |>
  dplyr::filter(stringr::str_detect(text, "a")) |>
  select(size, text) |>
  show_query()
#> ExecPlan with 4 nodes:
#> 3:SinkNode{}
#> 2:ProjectNode{projection=[size, text]}
#> 1:FilterNode{filter=match_substring_regex(text, {pattern="a", ignore_case=false})}
#> 0:TableSourceNode{}
arrow_from_disk |>
  dplyr::filter(stringr::str_detect(text, "a")) |>
  dplyr::select(size, text) |>
  nrow()
#> [1] 5652
arrow_from_memory |>
  dplyr::filter(stringr::str_detect(text, "a")) |>
  select(size, text) |>
  nrow()
#> [1] 9000
```
<sup>Created on 2024-04-25 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>
</details>
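
As a sanity check on which of the two counts to trust, here is a minimal base R sketch that involves no arrow code at all. It is intended to mirror the RNG calls in the reprex (the names `full_text`, `sizes`, `blank`, `n_match`, and `n_nonempty` are mine, and `substring()` stands in for `stringr::str_trunc()`), so treat it as an approximation of the reprex data rather than a guaranteed byte-for-byte copy. Every non-empty string starts with "a", so the number of matches should equal the number of non-empty rows.

``` r
# Base R only: rebuild the text column without dplyr, stringr, or arrow.
set.seed(1)

# One long string that starts with "a", as in the reprex.
full_text <- paste(c("a",
                     sample(x = c(letters, LETTERS),
                            size = 10000,
                            replace = TRUE)),
                   collapse = "")

# Truncate to each size; assumed equivalent to str_trunc(..., ellipsis = "").
sizes <- 2:10000
text <- substring(full_text, 1, sizes)

# Blank out roughly 10% of the rows.
blank <- sample(c(TRUE, FALSE), size = length(sizes),
                prob = c(0.1, 0.9), replace = TRUE)
text[blank] <- ""

# Every non-empty string begins with "a", so these two counts must agree.
n_match <- sum(grepl("a", text, fixed = TRUE))
n_nonempty <- sum(nzchar(text))
stopifnot(n_match == n_nonempty)
n_match
```

Since about 90% of the 9,999 rows are expected to stay non-empty, the in-memory count of 9000 looks plausible, which would suggest that the from-disk count of 5652 is the one dropping rows.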