[
https://issues.apache.org/jira/browse/ARROW-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485149#comment-17485149
]
Dragoș Moldovan-Grünfeld commented on ARROW-15397:
--------------------------------------------------
It could be. I was also testing on macOS. [~Zea] would you be able to provide
the output of {{utils::sessionInfo()}} or {{devtools::session_info()}}? My
{{devtools}} one is below:
{code:r}
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       macOS Monterey 12.1
 system   aarch64, darwin20
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/London
 date     2022-02-01
 rstudio  2021.09.0+351 Ghost Orchid (desktop)
 pandoc   NA
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
arrow * 6.0.1 2021-11-20 [1] CRAN (R 4.1.1)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
backports 1.4.0 2021-11-23 [1] CRAN (R 4.1.1)
bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.1)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
broom 0.7.10 2021-10-31 [1] CRAN (R 4.1.1)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.1)
callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
cli 3.1.0 2021-10-27 [1] CRAN (R 4.1.1)
codetools 0.2-18 2020-11-04 [1] CRAN (R 4.1.0)
colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.1)
crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.1)
DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.1)
dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
desc 1.4.0 2021-09-28 [1] CRAN (R 4.1.1)
devtools * 2.4.3 2021-11-30 [1] CRAN (R 4.1.1)
dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
fansi 1.0.0 2022-01-10 [1] CRAN (R 4.1.2)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
flow * 0.0.2 2021-08-13 [1] CRAN (R 4.1.1)
forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.1)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.1)
generics 0.1.1 2021-10-25 [1] CRAN (R 4.1.1)
ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.1)
glue 1.6.0 2021-12-17 [1] CRAN (R 4.1.1)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.1)
haven 2.4.3 2021-08-04 [1] CRAN (R 4.1.1)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.1.0)
highlite 0.0.0.9000 2021-12-10 [1] Github (jimhester/highlite@767b122)
hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.1)
httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.0)
lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1)
lookup * 0.0.0.9000 2021-12-10 [1] Github (jimhester/lookup@eba63db)
lubridate * 1.8.0 2021-10-07 [1] CRAN (R 4.1.1)
magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.1)
modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
pak * 0.2.0 2021-12-01 [1] CRAN (R 4.1.1)
pillar 1.6.4 2021-10-18 [1] CRAN (R 4.1.1)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgload 1.2.4 2021-11-30 [1] CRAN (R 4.1.1)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0)
prompt * 1.0.1 2021-12-10 [1] Github (gaborcsardi/prompt@7ef0f2e)
ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
readr * 2.1.1 2021-11-30 [1] CRAN (R 4.1.1)
readxl 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.1)
reprex * 2.0.1 2021-08-05 [1] CRAN (R 4.1.1)
rlang 0.4.12 2021-10-18 [1] CRAN (R 4.1.1)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0)
rsthemes * 0.3.1 2021-12-10 [1] Github (gadenbuie/rsthemes@bbe73ca)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
rvest 1.0.2 2021-10-16 [1] CRAN (R 4.1.1)
scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.1)
stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.1)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.1)
testthat * 3.1.1 2021-12-03 [1] CRAN (R 4.1.1)
tibble * 3.1.6 2021-11-07 [1] CRAN (R 4.1.1)
tidyr * 1.1.4 2021-09-27 [1] CRAN (R 4.1.1)
tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.0)
tzdb 0.2.0 2021-10-27 [1] CRAN (R 4.1.1)
usethis * 2.1.5 2021-12-09 [1] CRAN (R 4.1.1)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
vroom 1.5.7 2021-11-30 [1] CRAN (R 4.1.1)
withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.1)
xml2 1.3.3 2021-11-30 [1] CRAN (R 4.1.1)
[1] /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library
{code}
> [R] Problem with Join in apache arrow in R
> ------------------------------------------
>
> Key: ARROW-15397
> URL: https://issues.apache.org/jira/browse/ARROW-15397
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 6.0.1
> Reporter: José F
> Assignee: Dragoș Moldovan-Grünfeld
> Priority: Minor
> Fix For: 6.0.2
>
>
> Hi, dear Arrow developers. I tested inner_join with the arrow R package, but R
> crashed. This is my example with the toy dataset iris:
>
> data(iris)
> write.csv(iris, "iris.csv") # write csv file
> # write parquet files with the write_chunk_data function (defined below)
> walk("C:/Users/Stats/Desktop/ejemplo_join/iris.csv",
> write_chunk_data, "C:/Users/Stats/Desktop/ejemplo_join/parquet",
> chunk_size = 50)
>
> iris_arrow <- open_dataset("parquet")
> df1_arrow <- iris_arrow %>% select(`...1`, Sepal.Length, Sepal.Width,
> Petal.Length)
> df2_arrow <- iris_arrow %>% select(`...1`, Petal.Width, Species)
> df <- df1_arrow %>% inner_join(df2_arrow, by = "...1") %>%
> group_by(Species) %>% summarise(prom = mean(Sepal.Length)) %>% collect()
> print(df)
>
>
> # Please run this function first to write the parquet files for this example
> write_chunk_data <- function(data_path, output_dir, chunk_size = 1000000) {
>   # Create output_dir if it does not exist
>   if (!fs::dir_exists(output_dir)) fs::dir_create(output_dir)
>   # Get the file name without its extension
>   data_name <- fs::path_ext_remove(fs::path_file(data_path))
>   # Start the chunk counter at 0
>   chunk_num <- 0
>   # Read the file using vroom
>   data_chunk <- vroom::vroom(data_path)
>   # Get the variable names
>   data_names <- names(data_chunk)
>   # Get the number of rows
>   rows <- nrow(data_chunk)
>
>   # The following loop creates a parquet file for every [chunk_size] rows
>   repeat {
>     # Check whether rows remain beyond the current chunk
>     if (rows > (chunk_num + 1) * chunk_size) {
>       arrow::write_parquet(
>         data_chunk[(chunk_num * chunk_size + 1):((chunk_num + 1) * chunk_size), ],
>         fs::path(output_dir, glue::glue("{data_name}-{chunk_num}.parquet"))
>       )
>     } else {
>       arrow::write_parquet(
>         data_chunk[(chunk_num * chunk_size + 1):rows, ],
>         fs::path(output_dir, glue::glue("{data_name}-{chunk_num}.parquet"))
>       )
>       break
>     }
>     chunk_num <- chunk_num + 1
>   }
>
>   # Recover some memory and disk space
>   rm(data_chunk)
>   tmp_file <- tempdir()
>   files <- list.files(tmp_file, full.names = TRUE, pattern = "^vroom")
>   file.remove(files)
> }
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)