Jennifer, why don't you try SparkR? https://spark.apache.org/docs/1.6.1/api/R/read.json.html
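
Something along these lines might work -- just a rough sketch against the
SparkR 1.6.x API from the docs linked above, and I haven't tried it on a
single-line file this large ("file.json" is a placeholder for your real input):

  # Rough sketch, SparkR 1.6.x API (SparkR ships with a Spark install, not CRAN).
  # Spark parses the JSON inside the JVM, so R never has to hold the whole
  # 2+ GB document as a single character string.
  library(SparkR)
  sc <- sparkR.init(master = "local[*]")
  sqlContext <- sparkRSQL.init(sc)

  df <- read.json(sqlContext, "file.json")   # "file.json" is a stand-in path
  printSchema(df)

  # collect() back into R only what you actually need:
  small <- collect(limit(df, 1000L))

  sparkR.stop()

As far as I know Spark's JSON reader expects one complete JSON document per
line, which a single-line file satisfies by construction, but I haven't tested
it at this size.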
On 2 September 2017 at 23:15, Jennifer Lyon <jennifer.s.l...@gmail.com> wrote:
> Thank you for your suggestion. Unfortunately, while R doesn't segfault
> calling readr::read_file() on the test file I described, I get the error
> message:
>
> Error in read_file_(ds, locale) : negative length vectors are not allowed
>
> Jen
>
> On Sat, Sep 2, 2017 at 1:38 PM, Ista Zahn <istaz...@gmail.com> wrote:
>
>> As a work-around I suggest readr::read_file.
>>
>> --Ista
>>
>> On Sep 2, 2017 2:58 PM, "Jennifer Lyon" <jennifer.s.l...@gmail.com> wrote:
>>
>>> Hi:
>>>
>>> I have a 2.1GB JSON file. Typically I use readLines() and
>>> jsonlite::fromJSON() to extract data from a JSON file.
>>>
>>> When I try to read in this file using readLines(), R segfaults.
>>>
>>> I believe the two salient issues with this file are:
>>> 1) its size
>>> 2) it is a single line (no line breaks)
>>>
>>> I can reproduce this issue as follows:
>>>
>>> # Generate a big file with no line breaks
>>> # In R
>>> > writeLines(paste0(c(letters, 0:9), collapse=""), "alpha.txt", sep="")
>>>
>>> # In a unix shell
>>> cp alpha.txt file.txt
>>> for i in {1..26}; do cat file.txt file.txt > file2.txt && mv -f file2.txt file.txt; done
>>>
>>> This generates a 2.3GB file with no line breaks.
>>>
>>> In R:
>>> > moo <- readLines("file.txt")
>>>
>>>  *** caught segfault ***
>>> address 0x7cffffff, cause 'memory not mapped'
>>>
>>> Traceback:
>>>  1: readLines("file.txt")
>>>
>>> Possible actions:
>>> 1: abort (with core dump, if enabled)
>>> 2: normal R exit
>>> 3: exit R without saving workspace
>>> 4: exit R saving workspace
>>> Selection: 3
>>>
>>> I conclude:
>>> I am potentially running up against a limit in R, which should give a
>>> reasonable error, but currently just segfaults.
>>>
>>> My question:
>>> Most of the content of the JSON is an approximately 100K x 6K JSON
>>> equivalent of a data frame, and I know R can handle data much bigger than
>>> this. I am expecting these JSON files to get even larger. My R code lives
>>> in a bigger system, and the JSON comes in via stdin, so I have absolutely
>>> no control over the data format. I can imagine trying to incrementally
>>> parse the JSON so I don't bump up against the limit, but I am eager for
>>> suggestions of simpler solutions.
>>>
>>> Also, I apologize for the timing of this bug report, as I know folks are
>>> working to get out the next release of R, but like so many things I have
>>> no control over when bugs leap up.
>>>
>>> Thanks.
>>>
>>> Jen
>>>
>>> > sessionInfo()
>>> R version 3.4.1 (2017-06-30)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>> Running under: Ubuntu 14.04.5 LTS
>>>
>>> Matrix products: default
>>> BLAS:   R-3.4.1/lib/libRblas.so
>>> LAPACK: R-3.4.1/lib/libRlapack.so
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> loaded via a namespace (and not attached):
>>> [1] compiler_3.4.1
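
One more thought, purely my own guess at the mechanism rather than a confirmed
diagnosis: both the readLines() segfault and the readr error would be
consistent with R's cap on individual strings, since a single element of a
character vector cannot exceed 2^31 - 1 bytes. The arithmetic for the test
file built in the reproduction above:

  # Rough arithmetic only; the 2^31 - 1 byte cap on a single string element is
  # my assumed explanation for both failures, not a verified diagnosis.
  36 * 2^26                         # 2415919104 bytes: 36-byte seed doubled 26 times
  .Machine$integer.max              # 2147483647: largest length a 32-bit int can hold
  36 * 2^26 > .Machine$integer.max  # TRUE: the file can never fit in one R string

If that is the cause, any approach that returns the whole file as one string
(readLines() on a single line, readr::read_file()) will hit the same wall, so
a parser that never materializes the full document in R seems like the way to go.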