[
https://issues.apache.org/jira/browse/ARROW-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012089#comment-17012089
]
Neal Richardson edited comment on ARROW-7520 at 1/9/20 6:08 PM:
----------------------------------------------------------------
If you give the record batch writer a string, it opens a FileOutputStream:
https://github.com/apache/arrow/blob/master/r/R/record-batch-writer.R#L96-L99
Though if you're calling your function multiple times to write to the same
file, it would make sense to initialize the FileOutputStream outside of your
loop so that you control when the connection is opened and closed.
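
A minimal sketch of that suggestion, assuming a recent arrow R package. The
example data, the temporary path, and the chunk count are made up for
illustration and do not come from the report:

```r
library(arrow)

# Hypothetical example data and output path (not from the report).
df <- data.frame(A = 1:1000, B = 1:1000)
path <- tempfile(fileext = ".arrow")

# Open the output stream once, outside the loop, so its lifetime is explicit.
sink <- FileOutputStream$create(path)
writer <- RecordBatchFileWriter$create(sink, record_batch(df[1, ])$schema)

# Split the row indices into 10 contiguous chunks and write one batch each.
chunks <- split(seq_len(nrow(df)), cut(seq_len(nrow(df)), 10, labels = FALSE))
for (rows in chunks) {
  writer$write_batch(record_batch(df[rows, ]))
}

# Close the writer first (it finishes the file), then the stream itself.
writer$close()
sink$close()
```

Whether this avoids the crash described in the issue is not confirmed here; it
only makes explicit when the file is opened and closed.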
> [R] Writing many batches causes a crash
> ---------------------------------------
>
> Key: ARROW-7520
> URL: https://issues.apache.org/jira/browse/ARROW-7520
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.15.1
> Environment: - Session info ----------------------------------------------------
> setting  value
> version  R version 3.6.1 (2019-07-05)
> os       Windows 10 x64
> system   x86_64, mingw32
> ui       RStudio
> language (EN)
> collate  English_United States.1252
> ctype    English_United States.1252
> tz       America/New_York
> date     2020-01-08
>
> - Packages ---------------------------------------------------------------------
> ! package      * version     date       lib source
>   acepack        1.4.1       2016-10-29 [1] CRAN (R 3.6.1)
>   arrow        * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
>   askpass        1.1         2019-01-13 [1] CRAN (R 3.6.1)
>   assertthat     0.2.1       2019-03-21 [1] CRAN (R 3.6.1)
>   backports      1.1.5       2019-10-02 [1] CRAN (R 3.6.1)
>   base64enc      0.1-3       2015-07-28 [1] CRAN (R 3.6.0)
>   bit            1.1-14      2018-05-29 [1] CRAN (R 3.6.0)
>   bit64          0.9-7       2017-05-08 [1] CRAN (R 3.6.0)
>   blob           1.2.0       2019-07-09 [1] CRAN (R 3.6.1)
>   callr          3.3.1       2019-07-18 [1] CRAN (R 3.6.1)
>   cellranger     1.1.0       2016-07-27 [1] CRAN (R 3.6.1)
>   checkmate      1.9.4       2019-07-04 [1] CRAN (R 3.6.1)
>   cli            1.1.0       2019-03-19 [1] CRAN (R 3.6.1)
>   cluster        2.1.0       2019-06-19 [2] CRAN (R 3.6.1)
>   codetools      0.2-16      2018-12-24 [2] CRAN (R 3.6.1)
>   colorspace     1.4-1       2019-03-18 [1] CRAN (R 3.6.1)
>   commonmark     1.7         2018-12-01 [1] CRAN (R 3.6.1)
>   crayon         1.3.4       2017-09-16 [1] CRAN (R 3.6.1)
>   credentials    1.1         2019-03-12 [1] CRAN (R 3.6.2)
>   curl         * 4.2         2019-09-24 [1] CRAN (R 3.6.1)
>   data.table     1.12.2      2019-04-07 [1] CRAN (R 3.6.1)
>   DBI          * 1.0.0       2018-05-02 [1] CRAN (R 3.6.1)
>   desc           1.2.0       2018-05-01 [1] CRAN (R 3.6.1)
>   devtools     * 2.2.0       2019-09-07 [1] CRAN (R 3.6.1)
>   digest         0.6.23      2019-11-23 [1] CRAN (R 3.6.1)
>   dplyr        * 0.8.3       2019-07-04 [1] CRAN (R 3.6.1)
>   DT             0.9         2019-09-17 [1] CRAN (R 3.6.1)
>   ellipsis       0.3.0       2019-09-20 [1] CRAN (R 3.6.1)
>   evaluate       0.14        2019-05-28 [1] CRAN (R 3.6.1)
>   foreign        0.8-71      2018-07-20 [2] CRAN (R 3.6.1)
>   Formula      * 1.2-3       2018-05-03 [1] CRAN (R 3.6.0)
>   fs             1.3.1       2019-05-06 [1] CRAN (R 3.6.1)
>   fst          * 0.9.0       2019-04-09 [1] CRAN (R 3.6.1)
>   future       * 1.15.0-9000 2019-11-19 [1] Github (HenrikBengtsson/future@bc241c7)
>   ggplot2      * 3.2.1       2019-08-10 [1] CRAN (R 3.6.1)
>   globals        0.12.4      2018-10-11 [1] CRAN (R 3.6.0)
>   glue         * 1.3.1       2019-03-12 [1] CRAN (R 3.6.1)
>   gridExtra      2.3         2017-09-09 [1] CRAN (R 3.6.1)
>   gt           * 0.1.0       2019-11-27 [1] Github (rstudio/gt@284bbe5)
>   gtable         0.3.0       2019-03-25 [1] CRAN (R 3.6.1)
>   Hmisc        * 4.3-0       2019-11-07 [1] CRAN (R 3.6.1)
>   htmlTable      1.13.2      2019-09-22 [1] CRAN (R 3.6.1)
> D htmltools      0.3.6.9004  2019-09-20 [1] Github (rstudio/htmltools@c49b29c)
>   htmlwidgets    1.3         2018-09-30 [1] CRAN (R 3.6.1)
>   jsonlite     * 1.6         2018-12-07 [1] CRAN (R 3.6.1)
>   knitr          1.25        2019-09-18 [1] CRAN (R 3.6.1)
>   lattice      * 0.20-38     2018-11-04 [2] CRAN (R 3.6.1)
>   latticeExtra   0.6-28      2016-02-09 [1] CRAN (R 3.6.1)
>   lazyeval       0.2.2       2019-03-15 [1] CRAN (R 3.6.1)
>   lifecycle      0.1.0       2019-08-01 [1] CRAN (R 3.6.1)
>   listenv        0.7.0       2018-01-21 [1] CRAN (R 3.6.1)
>   lubridate    * 1.7.4       2018-04-11 [1] CRAN (R 3.6.1)
>   magrittr     * 1.5         2014-11-22 [1] CRAN (R 3.6.1)
>   Matrix         1.2-17      2019-03-22 [2] CRAN (R 3.6.1)
>   memoise        1.1.0       2017-04-21 [1] CRAN (R 3.6.1)
>   munsell        0.5.0       2018-06-12 [1] CRAN (R 3.6.1)
>   nnet           7.3-12      2016-02-02 [2] CRAN (R 3.6.1)
>   openssl        1.4.1       2019-07-18 [1] CRAN (R 3.6.1)
>   outliers     * 0.14        2011-01-24 [1] CRAN (R 3.6.0)
>   pillar         1.4.2       2019-06-29 [1] CRAN (R 3.6.1)
>   pkgbuild       1.0.5       2019-08-26 [1] CRAN (R 3.6.1)
>   pkgconfig      2.0.2       2018-08-16 [1] CRAN (R 3.6.1)
>   pkgload        1.0.2       2018-10-29 [1] CRAN (R 3.6.1)
>   plyr         * 1.8.4       2016-06-08 [1] CRAN (R 3.6.1)
>   prettyunits    1.0.2       2015-07-13 [1] CRAN (R 3.6.1)
>   processx       3.4.1       2019-07-18 [1] CRAN (R 3.6.1)
>   pryr         * 0.1.4       2018-02-18 [1] CRAN (R 3.6.1)
>   ps             1.3.0       2018-12-21 [1] CRAN (R 3.6.1)
>   purrr        * 0.3.2       2019-03-15 [1] CRAN (R 3.6.1)
>   R6           * 2.4.1       2019-11-12 [1] CRAN (R 3.6.1)
>   RColorBrewer   1.1-2       2014-12-07 [1] CRAN (R 3.6.0)
>   Rcpp           1.0.3       2019-11-08 [1] CRAN (R 3.6.1)
>   readxl       * 1.3.1       2019-03-13 [1] CRAN (R 3.6.1)
>   remotes        2.1.0       2019-06-24 [1] CRAN (R 3.6.1)
>   rlang        * 0.4.2       2019-11-23 [1] CRAN (R 3.6.1)
>   rmarkdown    * 2.0.3       2019-12-19 [1] Github (rstudio/rmarkdown@26cc3b1)
>   RODBC        * 1.3-16      2019-09-03 [1] CRAN (R 3.6.1)
>   roxygen2     * 6.1.1       2018-11-07 [1] CRAN (R 3.6.1)
>   rpart          4.1-15      2019-04-12 [2] CRAN (R 3.6.1)
>   rprojroot      1.3-2       2018-01-03 [1] CRAN (R 3.6.1)
>   RSQLite      * 2.1.2       2019-07-24 [1] CRAN (R 3.6.1)
>   rstudioapi     0.10        2019-03-19 [1] CRAN (R 3.6.1)
>   scales         1.0.0       2018-08-09 [1] CRAN (R 3.6.1)
>   sessioninfo    1.1.1       2018-11-05 [1] CRAN (R 3.6.1)
>   slide        * 0.0.0.9002  2019-11-27 [1] Github (DavisVaughan/slide@92e8e02)
>   ssh            0.6         2019-04-09 [1] CRAN (R 3.6.2)
>   stringi        1.4.3       2019-03-12 [1] CRAN (R 3.6.0)
>   stringr      * 1.4.0       2019-02-10 [1] CRAN (R 3.6.1)
>   survival     * 2.44-1.1    2019-04-01 [2] CRAN (R 3.6.1)
>   testthat       2.2.1       2019-07-25 [1] CRAN (R 3.6.1)
>   tibble         2.1.3       2019-06-06 [1] CRAN (R 3.6.1)
>   tidyr        * 1.0.0       2019-09-11 [1] CRAN (R 3.6.1)
>   tidyselect     0.2.5       2018-10-11 [1] CRAN (R 3.6.1)
>   usethis      * 1.5.1       2019-07-04 [1] CRAN (R 3.6.1)
>   varhandle    * 2.0.3       2018-07-04 [1] CRAN (R 3.6.0)
>   vctrs          0.2.0.9007  2019-11-27 [1] Github (r-lib/vctrs@945809e)
>   withr          2.1.2       2018-03-15 [1] CRAN (R 3.6.1)
>   xfun           0.9         2019-08-21 [1] CRAN (R 3.6.1)
>   xml2         * 1.2.2       2019-08-09 [1] CRAN (R 3.6.1)
>   xts          * 0.11-2      2018-11-05 [1] CRAN (R 3.6.1)
>   zoo          * 1.8-6       2019-05-28 [1] CRAN (R 3.6.1)
>
> [1] C:/Users/cklar/Desktop/R packages
> [2] C:/Program Files/R/R-3.6.1/library
>
> P -- Loaded and on-disk path mismatch.
> D -- DLL MD5 mismatch, broken installation.
> Reporter: Christian
> Priority: Trivial
>
> Hi,
> When creating more than roughly 200-300 batches, writing to the Arrow file
> crashes R without even showing an error message; RStudio just aborts.
> My guess is that each batch becomes its own stream and R runs into trouble
> with the connections, but that is only a guess.
> Any help would be appreciated.
>
> ##
>
> Here is the function. Running it with 3000 batches crashes immediately.
> Before that I ran it with 100 batches and increased the number slowly, and
> it crashed again at a seemingly random point.
>
> ##
> Now I received this error message after writing 30 batches.
> Error in ipc___RecordBatchWriter__WriteRecordBatch(self, batch) :
> Invalid: Invalid operation on closed file
> Error in ipc___RecordBatchWriter__WriteRecordBatch(self, batch) :
> Invalid: Invalid operation on closed file
> ##
> write_arrow_custom(data.frame(A=c(1:100000),B=c(1:100000)),'C:/Temp/test.arrow',3000)
>
> write_arrow_custom <- function(df,targetarrow,nrbatches) {
>   ct <- nrbatches
>   idxs <- c(0:ct)/ct*nrow(df)
>   idxs <- round(idxs,0) %>% as.integer()
>   idxs[length(idxs)] <- nrow(df)
>   df_nav <- idxs %>% as.data.frame() %>% rename(colfrom=1) %>%
>     mutate(colto=lead(colfrom)) %>% mutate(colfrom=colfrom+1) %>%
>     filter(!is.na(colto)) %>% mutate(R=row_number())
>   stopifnot(df_nav %>% mutate(chk=colto-colfrom+1) %>% '$'('chk') %>%
>     sum()==nrow(df))
>   table_df <- Table$create(name=rownames(df[1,]),df[1,])
>   writer <- RecordBatchFileWriter$create(targetarrow,table_df$schema)
>   df_nav %>% dlply(c('R'),function(df_nav) {
>     catl(glue('{df_nav$colfrom[1]}:{df_nav$colto[1]} / {df_nav$R[1]}...'))
>     tmp <- df[df_nav$colfrom[1]:df_nav$colto[1],]
>     writer$write_batch(record_batch(name = rownames(tmp), tmp))
>     NULL
>   }) -> batch_lst
>   writer$close()
>   rm(batch_lst)
>   gc()
> }
>
>
>