sbashevkin commented on issue #31796:
URL: https://github.com/apache/arrow/issues/31796#issuecomment-1489351463
Hello! This issue has re-emerged with Arrow v11.0.0.3. It was fixed in an
earlier arrow release, but we are now encountering it again. Please let me
know if you'd prefer that I open a new issue.
The issue is a bit stranger now in that it seems to require a combination of
several dplyr and base-R functions to trigger. After some experimentation, I
was only able to trigger it by combining a `join`, `head`, and `collect` call,
as shown in the reprex below. If I remove any one of those calls, the issue
disappears. As in its earlier iteration, it only occurs with larger datasets
and only on Windows. It occurs with Arrow v11.0.0.3 but not with Arrow
v10.0.1.
``` r
library(arrow)
#> Warning: package 'arrow' was built under R version 4.2.3
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.2.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
test_data1 <- expand.grid(A=1:150, B=1:10, C=1:50)
test_data2 <- data.frame(A=1:100, D=1:100)
write_dataset(test_data1, "test_data1", partitioning = "A")
write_dataset(test_data2, "test_data2", partitioning = "A")
test <- open_dataset("test_data1") %>%
  dplyr::inner_join(open_dataset("test_data2")) %>%
  head() %>%
  collect()
rm(list=ls())
gc()
#>           used (Mb) gc trigger  (Mb) max used (Mb)
#> Ncells 1184245 63.3    2479418 132.5  1365350 73.0
#> Vcells 2047220 15.7    8388608  64.0  3304441 25.3
files1 <- dir("test_data1", full.names = TRUE, recursive = TRUE)
files2 <- dir("test_data2", full.names = TRUE, recursive = TRUE)
files1_leftover<-lapply(files1, file.remove)
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data1/A=96/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data1/A=97/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data1/A=98/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data1/A=99/part-0.parquet', reason 'Permission denied'
which(!unlist(files1_leftover))
#> [1] 147 148 149 150
files1[!unlist(files1_leftover)]
#> [1] "test_data1/A=96/part-0.parquet" "test_data1/A=97/part-0.parquet"
#> [3] "test_data1/A=98/part-0.parquet" "test_data1/A=99/part-0.parquet"
files2_leftover<-lapply(files2, file.remove)
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data2/A=96/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data2/A=97/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data2/A=98/part-0.parquet', reason 'Permission denied'
#> Warning in FUN(X[[i]], ...): cannot remove file
#> 'test_data2/A=99/part-0.parquet', reason 'Permission denied'
which(!unlist(files2_leftover))
#> [1] 97 98 99 100
files2[!unlist(files2_leftover)]
#> [1] "test_data2/A=96/part-0.parquet" "test_data2/A=97/part-0.parquet"
#> [3] "test_data2/A=98/part-0.parquet" "test_data2/A=99/part-0.parquet"
```
<sup>Created on 2023-03-29 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
<details style="margin-bottom:10px;">
<summary>
Session info
</summary>
``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.2 (2022-10-31 ucrt)
#> os Windows 10 x64 (build 19045)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.utf8
#> ctype English_United States.utf8
#> tz America/Los_Angeles
#> date 2023-03-29
#> pandoc 2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> arrow * 11.0.0.3 2023-03-08 [1] CRAN (R 4.2.3)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.1)
#> bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.2)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.2)
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.3)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.2)
#> dplyr * 1.1.1 2023-03-22 [1] CRAN (R 4.2.3)
#> evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.2)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.2)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.1)
#> fs 1.6.0 2023-01-23 [1] CRAN (R 4.2.2)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.2)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.2)
#> htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.2)
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.2.2)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.2)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.2)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.2)
#> purrr 1.0.1 2023-01-10 [1] CRAN (R 4.2.2)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.3)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.2)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.2)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.3)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.2)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.1)
#> rlang 1.1.0 2023-03-14 [1] CRAN (R 4.2.3)
#> rmarkdown 2.20 2023-01-19 [1] CRAN (R 4.2.2)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.1)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.1)
#> styler 1.9.1 2023-03-04 [1] CRAN (R 4.2.3)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.2.3)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.2)
#> tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.2)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.2)
#> vctrs 0.6.1 2023-03-22 [1] CRAN (R 4.2.3)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.2)
#> xfun 0.37 2023-01-31 [1] CRAN (R 4.2.2)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.2)
#>
#> [1] C:/Users/sbashevkin/AppData/Local/R/win-library/4.2
#> [2] C:/Program Files/R/R-4.2.2/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```
</details>