On 01/16/2015 10:21 PM, Mike Miller wrote:
First, a very easy question:  What is the difference between using
what="character" and what=character() in scan()?  What is the reason for the
character() syntax?

I am working with some character vectors that are up to about 27.5 million
elements long.  The elements are always unique.  Specifically, these are names
of genetic markers.  This is how much memory those names take up:

snps <- scan("SNPs.txt", what=character())
Read 27446736 items
object.size(snps)
1756363648 bytes
object.size(snps)/length(snps)
63.9917128215173 bytes

As you can see, that's about 1.76 GB of memory for the vector at an average of
64 bytes per element.  The longest string is only 14 bytes, though.  The file
takes up 313 MB.

Using 64 bytes per element instead of 14 bytes per element is costing me a total
of 1,372,336,800 bytes.  In a different example where the longest string is 4
characters, the elements each use 8 bytes.  So it looks like I'm stuck with
either 8 bytes or 64 bytes.  Is that true?  There is no way to modify that?

Hi Mike --

R represents the atomic vector types as so-called S-expressions, which, in addition to the actual data, contain bookkeeping information such as whether they have been referenced by one or more symbols; you can get a sense of this with

    > x <- 1:5
    > .Internal(inspect(x))
    @4c732940 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5

where the number after @ is the memory location, INTSXP indicates that the type of data is an integer, etc. So a vector requires memory for the S-expression, and for the actual data.

A character vector is represented by an S-expression for the vector itself, an S-expression for each element of the vector, and of course the data itself:

    > y <- c("a", "a", "b")
    > .Internal(inspect(y))
    @4ce72090 16 STRSXP g0c3 [NAM(1)] (len=3, tl=0)
      @137ccd8 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "a"
      @137ccd8 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "a"
      @15a6698 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "b"

The large S-expression overhead is recouped by long (in the nchar() sense) or re-used strings, but that's not the case for your data.
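To make that trade-off concrete, here is a minimal sketch comparing the per-element cost of unique strings against a single re-used string. The vectors `uniq` and `shared` are made up for illustration, and the exact byte counts depend on the platform and R version, so none are quoted here.

```r
# Made-up data: 100,000 unique short names versus one name repeated.
n <- 1e5
uniq   <- paste0("rs", seq_len(n))  # every element needs its own CHARSXP
shared <- rep("rs1", n)             # every element points at one cached CHARSXP

# object.size() counts each distinct string once, so the shared vector
# costs little more than one pointer per element.
per_elt_uniq   <- as.numeric(object.size(uniq))   / n
per_elt_shared <- as.numeric(object.size(shared)) / n

print(c(unique = per_elt_uniq, shared = per_elt_shared))
```

On a 64-bit build the second number should sit close to the pointer size, while the first includes the per-string S-expression overhead.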

There is no way around this in base R. There are general-purpose solutions like the data.table package, or keeping your large data in a database (like SQLite) that you query from within R using, e.g., sqldf or dplyr, doing as much of the data reduction as possible in the database (and out of R). In your particular case BStringSet() from the Bioconductor Biostrings package might be relevant:

  http://bioconductor.org/packages/release/bioc/html/Biostrings.html

This will consume memory more along the lines of 1 byte per character + 1 byte per string, and is of particular relevance because you are likely doing other genetic operations for which the Bioconductor project has relevant packages (see especially the GenomicRanges package).
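A minimal sketch of the round trip, assuming Biostrings is installed (the guard simply skips the example otherwise). The three marker names in `ids` are invented, and the sketch only checks that the names survive conversion; it does not measure memory.

```r
# Invented marker names standing in for the real 27 million.
ids <- c("rs12345", "rs7", "exm1234")

if (requireNamespace("Biostrings", quietly = TRUE)) {
  # Store the names in a BStringSet instead of a character vector.
  bs <- Biostrings::BStringSet(ids)

  # Convert back when an ordinary character vector is needed.
  back <- as.character(bs)
  stopifnot(length(bs) == length(ids), all(back == ids))
}
```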

If your work is not particularly domain-specific, data.table would be a good bet (it also has an implementation for working with overlapping ranges, which is a very common task with SNPs). A lot of SNP data management is really relational, for which the SQL representation (and dplyr, for me) is the obvious choice. Bioconductor would be the choice if there is to be extensive domain-specific work. I am involved in the Bioconductor project, so not exactly impartial.

Martin


By the way...

It turns out that 99.72% of those character strings are of the form
paste0("rs", Int) where Int is an integer of no more than 9 digits.  So if I use
only those markers, drop the "rs" off, and load them as integers, I see a huge
improvement:

snps <- scan("SNPs_rs.txt", what=integer())
Read 27369706 items
object.size(snps)
109478864 bytes
object.size(snps)/length(snps)
4.00000146146985 bytes

That saves 93.8% of the memory by dropping 0.28% of the markers and encoding as
integers instead of strings.  I might end up doing this by encoding the other
marker names as negative integers.
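One way the negative-integer idea could look, as a rough sketch: the sample `ids`, the lookup table `other`, and the helper `decode()` are all hypothetical names, and the pattern deliberately excludes leading zeros so that decoding reproduces the original names exactly.

```r
# Hypothetical sample; the real names would come from scan().
ids <- c("rs12345", "rs7", "exm-rs999", "rs123456789", "chr1:1000")

# "rs" followed by 1-9 digits with no leading zero fits in an R integer.
is_rs <- grepl("^rs[1-9][0-9]{0,8}$", ids)

codes <- integer(length(ids))
codes[is_rs] <- as.integer(sub("^rs", "", ids[is_rs]))

# Everything else gets a negative code indexing a small lookup table.
other <- unique(ids[!is_rs])
codes[!is_rs] <- -match(ids[!is_rs], other)

# Rebuild the original names from the codes and the lookup table.
decode <- function(codes, other) {
  out <- character(length(codes))
  pos <- codes > 0L
  out[pos]  <- paste0("rs", codes[pos])
  out[!pos] <- other[-codes[!pos]]
  out
}

stopifnot(identical(decode(codes, other), ids))
```

Only the small `other` table stays as character data; the bulk of the names is held at 4 bytes per element.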

Mike

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
