giocomai commented on issue #41175:
URL: https://github.com/apache/arrow/issues/41175#issuecomment-2057781993
Here's a more revealing reprex. By creating a data frame with strings of
growing length, the issue is much clearer.
Conditions for reproducing:
- a dataset is written to parquet with `write_dataset`
- it has a character column, with strings longer than about 4000 characters
- in the same character column, at least some rows have empty values ("")
- the dataset is read from disk with `open_dataset`
- the dataset is filtered on that character column with
`stringr::str_detect()`
If all these conditions are met, then arrow returns an empty data frame.
If the dataset is stored partitioned, partitions that contain both long
strings and empty values return zero rows, while other partitions return
data as expected.
With non-ASCII characters, e.g. Cyrillic letters such as б, г, д,
the issue emerges with text size of just over 2000 characters.
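The two thresholds (about 4000 ASCII characters, about 2000 Cyrillic characters) might be the same limit counted in bytes rather than characters. That is only an assumption, not something confirmed here. A minimal base R sketch comparing the two lengths, with illustrative variable names:

``` r
# Assumption: the thresholds may be byte-based rather than character-based.
ascii_text <- strrep("a", 4000)
cyrillic_text <- strrep("\u0431", 2000)  # "\u0431" is the Cyrillic letter б

text_lengths <- data.frame(
  kind = c("ascii", "cyrillic"),
  chars = nchar(c(ascii_text, cyrillic_text), type = "chars"),
  bytes = nchar(c(ascii_text, cyrillic_text), type = "bytes")
)
text_lengths
```

Both strings are 4000 bytes in UTF-8, since each Cyrillic letter takes two bytes.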
A separate issue with non-ASCII characters, which may lead to inconsistencies
in testing, relates to how `arrow` parses regex (with
[re2](https://github.com/google/re2/wiki/Syntax), if I understand correctly)
compared to standard `stringr::str_detect()`: the same code with and without
arrow may give different results. Mentioning it here just in case it is
somehow related.
``` r
library("tibble")
library("dplyr")
library("stringr")
library("arrow")
set.seed(1)
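data_df <- tibble::tibble(size = 2:10000) |>
  # One random 10,001-character string starting with "a", shared by all rows
  dplyr::mutate(text = paste(c("a",
                               sample(x = c(letters, LETTERS),
                                      size = 10000,
                                      replace = TRUE)),
                             collapse = "")) |>
  dplyr::group_by(size) |>
  # Truncate each row's string to `size` characters
  dplyr::mutate(text = stringr::str_trunc(text, width = size, ellipsis = "")) |>
  dplyr::mutate(category = round(size/10)) |>
  dplyr::ungroup() |>
  # Grouping by category makes write_dataset() partition on it
  dplyr::group_by(category)
# Blank out roughly 10% of the rows at random
data_df[["text"]][sample(c(TRUE, FALSE), size = nrow(data_df),
                         prob = c(0.1, 0.9), replace = TRUE)] <- ""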
### Store in a temp folder
test_arrow_path <- file.path(tempdir(), "test_arrow")
write_dataset(dataset = data_df,
path = test_arrow_path)
### Read from temp folder
arrow_from_disk <- open_dataset(test_arrow_path)
### Read from memory
arrow_from_memory <- arrow_table(data_df)
filtered_from_disk_df <- arrow_from_disk |>
  dplyr::filter(stringr::str_detect(text, "a")) |>
  dplyr::collect()
filtered_from_disk_df
#> # A tibble: 5,652 × 3
#>     size text                                                           category
#>    <int> <chr>                                                             <int>
#>  1   995 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
#>  2   996 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
#>  3   998 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
#>  4  1000 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
#>  5  1001 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
#>  6  1002 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
#>  7  1003 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
#>  8  1004 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
#>  9  1005 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      100
#> 10    95 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…       10
#> # ℹ 5,642 more rows
filtered_from_memory_df <- arrow_from_memory |>
  dplyr::filter(stringr::str_detect(text, "a")) |>
  dplyr::collect()
filtered_from_memory_df
#> # A tibble: 9,000 × 3
#> # Groups:   category [1,001]
#>     size text        category
#>    <int> <chr>          <dbl>
#>  1     2 ad                 0
#>  2     3 adM                0
#>  3     4 adMa               0
#>  4     5 adMaH              0
#>  5     6 adMaHw             1
#>  6     7 adMaHwQ            1
#>  7     8 adMaHwQn           1
#>  8     9 adMaHwQnr          1
#>  9    10 adMaHwQnrY         1
#> 10    11 adMaHwQnrYG        1
#> # ℹ 8,990 more rows
dplyr::anti_join(filtered_from_memory_df,
                 filtered_from_disk_df,
                 by = "size") |>
  dplyr::arrange(size)
#> # A tibble: 3,348 × 3
#> # Groups:   category [392]
#>     size text                                                           category
#>    <int> <chr>                                                             <dbl>
#>  1  4095 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
#>  2  4097 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
#>  3  4098 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
#>  4  4099 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
#>  5  4100 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
#>  6  4101 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
#>  7  4102 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
#>  8  4103 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
#>  9  4104 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
#> 10  4105 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn…      410
#> # ℹ 3,338 more rows
nrow(filtered_from_disk_df) == nrow(filtered_from_memory_df)
```
<sup>Created on 2024-04-15 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>
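To see which partitions lose rows, the two collected results could be compared per partition key. The helper below is a hypothetical base R sketch (`count_rows_by_key` is not part of arrow or dplyr), shown on toy data standing in for the in-memory and on-disk results:

``` r
# Hypothetical helper (not part of arrow or dplyr): compare row counts per
# partition key between a reference result and a result under test.
count_rows_by_key <- function(reference_df, test_df, key = "category") {
  reference_n <- table(reference_df[[key]])
  # Reuse the reference keys as levels so missing partitions count as 0
  test_n <- table(factor(test_df[[key]], levels = names(reference_n)))
  out <- data.frame(
    key = names(reference_n),
    reference = as.integer(reference_n),
    test = as.integer(test_n),
    stringsAsFactors = FALSE
  )
  out$missing <- out$reference - out$test
  out
}

# Toy data standing in for the collected results of the reprex
reference_df <- data.frame(size = c(10, 11, 4100, 4101),
                           category = c(1, 1, 410, 410))
test_df <- data.frame(size = c(10, 11), category = c(1, 1))

partition_diff <- count_rows_by_key(reference_df, test_df)
# Keep only the partitions where rows went missing
partition_diff[partition_diff$missing > 0, ]
```

With the reprex objects, the corresponding call would be `count_rows_by_key(filtered_from_memory_df, filtered_from_disk_df)`.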