sorry, typo, 80937 not 809367

On Sun, Jul 16, 2017 at 6:21 AM, Anthony Damico <ajdam...@gmail.com> wrote:
> hi, thank you for attempting this. it looks like your unix machine
> unzipped the txt file without corruption -- if you copied over the same txt
> file to windows 7, i don't think that would reproduce the problem? i think
> it needs to be the corrupted text file where R.utils::countLines( txtfile )
> gives 809367. i am able to reproduce on two distinct windows machines
> but no guarantee i'm not doing something dumb
>
> On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us>
> wrote:
>
>> I am not able to reproduce your segfault on a Windows 7 platform either:
>>
>> ##########################
>> fn1 <- "d:/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>> ##
>> ## Matrix products: default
>> ##
>> ## locale:
>> ## [1] LC_COLLATE=English_United States.1252
>> ## [2] LC_CTYPE=English_United States.1252
>> ## [3] LC_MONETARY=English_United States.1252
>> ## [4] LC_NUMERIC=C
>> ## [5] LC_TIME=English_United States.1252
>> ##
>> ## attached base packages:
>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ## d:/DADOS_ENEM_2009.txt
>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>>
>>> I am not able to reproduce this on a Linux platform:
>>>
>>> #######################3
>>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>>> sessionInfo()
>>> ## R version 3.4.1 (2017-06-30)
>>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>>> ## Running under: Ubuntu 14.04.5 LTS
>>> ##
>>> ## Matrix products: default
>>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>>> ##
>>> ## locale:
>>> ## [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> ## [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> ## [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> ## [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> ## [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> ##
>>> ## attached base packages:
>>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>>> ##
>>> ## loaded via a namespace (and not attached):
>>> ## [1] compiler_3.4.1
>>> tools::md5sum( fn1 )
>>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>>> dat <- readLines( fn1 )
>>> length( dat )
>>> ## [1] 4148721
>>>
>>> No segfault occurs.
>>>
>>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>>
>>>> hi, i realized that the segfault happens on the text file in a new R
>>>> session. so, creating the segfault-generating text file requires a
>>>> contributed package, but prompting the actual segfault does not --
>>>> pretty sure that means this is a base R bug? submitted here:
>>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 hopefully
>>>> i am not doing something remarkably stupid. the text file itself is 4GB
>>>> so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
>>>> in the previous message, i think most or all of it needs to be there to
>>>> trigger the segfault. thanks!
>>>>
>>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdam...@gmail.com>
>>>> wrote:
>>>>
>>>>> hi, thanks Dr. Murdoch
>>>>>
>>>>> i'd appreciate if anyone on r-help could help me narrow this down? i
>>>>> believe the segfault occurs because there's a single line with 4GB and
>>>>> also embedded nuls, but i am not sure how to artificially construct that?
>>>>>
>>>>> the lodown package can be removed from my example.. it is just for file
>>>>> download caching, so `lodown::cachaca` can be replaced with
>>>>> `download.file`. my current example requires a huge download, so sort of
>>>>> painful to repeat but i'm pretty confident that's not the issue.
>>>>>
>>>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>>>>> file and creates a text file with 80,937 lines. this file is 4GB:
>>>>>
>>>>> > file.size(infile)
>>>>> [1] 4078192743
>>>>>
>>>>> i am pretty sure that nearly all of that 4GB is contained on a single
>>>>> line in the file. here's what happens when i create a file connection
>>>>> and scan through..
>>>>>
>>>>> > file_con <- file( infile , 'r' )
>>>>> >
>>>>> > first_80936_lines <- readLines( file_con , n = 80936 )
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "1000023930632009"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "36F2924009PAULO"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "AFONSO"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "BA11"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "00000"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "00"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "2924009PAULO"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "AFONSO"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "BA1111"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "467.20"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "346.10"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "414.40"
>>>>> > scan( file_con , n = 1 , what = character() )
>>>>> Error in scan(file_con, n = 1, what = character()) :
>>>>>   could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>>>>>
>>>>> making a huge single-line file does not reproduce the problem, i think
>>>>> the embedded nuls have something to do with it--
>>>>>
>>>>> # WARNING do not run with less than 64GB RAM
>>>>> tf <- tempfile()
>>>>> a <- rep( "a" , 1000000000 )
>>>>> b <- paste( a , collapse = '' )
>>>>> writeLines( b , tf ) ; rm( b ) ; gc()
>>>>> d <- readLines( tf )
>>>>>
>>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.dun...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>>>>
>>>>>>> hello, the last line of the code below causes a segfault for me on
>>>>>>> 3.4.1. i think i should submit to https://bugs.r-project.org/ unless
>>>>>>> others have advice? thanks
>>>>>>
>>>>>> Segfaults are usually worth reporting as bugs. Try to come up with a
>>>>>> self-contained example, not using the lodown and archive packages. I
>>>>>> imagine you can do this by uploading the file you downloaded, or enough
>>>>>> of a subset of it to trigger the segfault. If you can't do that, then
>>>>>> likely the bug is with one of those packages, not with R.
>>>>>>
>>>>>> Duncan Murdoch
>>>>>>
>>>>>>> install.packages( "devtools" )
>>>>>>> devtools::install_github("ajdamico/lodown")
>>>>>>> devtools::install_github("jimhester/archive")
>>>>>>>
>>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>>>>
>>>>>>> tf <- tempfile()
>>>>>>>
>>>>>>> # large download! cachaca saves on your local disk if already downloaded
>>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>>>>
>>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>>>>
>>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>>>>
>>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>>>>
>>>>>>> # works
>>>>>>> R.utils::countLines( infile )
>>>>>>>
>>>>>>> # works with warning
>>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>>>>
>>>>>>> # crash
>>>>>>> my_file <- readLines( infile )
>>>>>>>
>>>>>>> # run just before crash
>>>>>>> sessionInfo()
>>>>>>> # R version 3.4.1 (2017-06-30)
>>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>> # Running under: Windows 10 x64 (build 15063)
>>>>>>>
>>>>>>> # Matrix products: default
>>>>>>>
>>>>>>> # locale:
>>>>>>> # [1] LC_COLLATE=English_United States.1252
>>>>>>> # [2] LC_CTYPE=English_United States.1252
>>>>>>> # [3] LC_MONETARY=English_United States.1252
>>>>>>> # [4] LC_NUMERIC=C
>>>>>>> # [5] LC_TIME=English_United States.1252
>>>>>>>
>>>>>>> # attached base packages:
>>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>
>>>>>>> # loaded via a namespace (and not attached):
>>>>>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>>>>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>>>>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>>>>>> # [17] archive_0.0.0.9000
>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> ---------------------------------------------------------------------------
>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
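
On the question raised in the thread of how to artificially construct a long line with embedded nuls without the 4GB download, here is a minimal base R sketch. It assumes, as the thread suspects but does not confirm, that line length and nul bytes are the relevant features. `make_nul_file()` and `scan_raw()` are hypothetical helper names made up for this illustration, not functions from any package, and a file this small is not expected to trigger the segfault. The sketch only shows one way to build such a file and to inspect any file byte by byte without going through `readLines()`.

```r
# Hypothetical helper: write a few ordinary lines followed by one long line
# that contains a nul byte every `nul_every` bytes.
make_nul_file <- function(path, n_good_lines = 5L, long_len = 1000L,
                          nul_every = 100L) {
  stopifnot(nul_every <= long_len)
  con <- file(path, "wb")
  on.exit(close(con))
  for (i in seq_len(n_good_lines)) {
    writeBin(charToRaw(sprintf("line %d\n", i)), con)
  }
  long_line <- rep(charToRaw("a"), long_len)
  long_line[seq(nul_every, long_len, by = nul_every)] <- as.raw(0L)
  writeBin(long_line, con)
  writeBin(charToRaw("\n"), con)
  invisible(path)
}

# Hypothetical helper: read the file in raw chunks and report the number of
# lines, the number of nul bytes, and the length in bytes of the longest line.
# Counts are kept as doubles so they do not overflow on multi-gigabyte files.
scan_raw <- function(path, chunk_size = 65536L) {
  con <- file(path, "rb")
  on.exit(close(con))
  nul_count <- 0
  line_count <- 0
  current_len <- 0   # bytes seen so far on the line still being read
  longest <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    nul_count <- nul_count + sum(chunk == as.raw(0L))
    newline_pos <- which(chunk == as.raw(10L))
    if (length(newline_pos) == 0L) {
      current_len <- current_len + length(chunk)
    } else {
      # lengths of the lines that end inside this chunk, newline excluded
      line_lens <- diff(c(0L, newline_pos)) - 1
      line_lens[1] <- line_lens[1] + current_len
      longest <- max(longest, line_lens)
      line_count <- line_count + length(newline_pos)
      current_len <- length(chunk) - newline_pos[length(newline_pos)]
    }
  }
  if (current_len > 0) {
    # final line without a trailing newline
    longest <- max(longest, current_len)
    line_count <- line_count + 1
  }
  list(lines = line_count, nuls = nul_count, longest_line = longest)
}

tf <- tempfile()
make_nul_file(tf)
res <- scan_raw(tf)
str(res)
```

Pointing `scan_raw()` at the extracted `DADOS_ENEM_2009.txt` would be one way to check whether nearly all of the 4GB really sits on a single line and how many nul bytes it holds. Scaling `long_len` up toward that size needs at least that much RAM, because `make_nul_file()` builds the long line in memory before writing it.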