Re: [R] Manage huge database

Thomas Lumley Mon, 22 Sep 2008 13:39:18 -0700

On Mon, 22 Sep 2008, Martin Morgan wrote:

"José E. Lozano" <[EMAIL PROTECTED]> writes:

Maybe you've not lurked on R-help for long enough :) Apologies!


Probably.

So, how much "design" is in this data? If none, and what you've
basically got is a 2000x500000 grid of numbers, then maybe a more raw


Exactly, raw data, but a little more complex since all the 500000 variables
are in text format, so the width is around 2,500,000.

<snip>

It is genetic DNA data (genotyped individuals), hence the large number of
columns to analyze.


The Bioconductor package snpMatrix is designed for this type of
data. See

http://www.bioconductor.org/packages/2.2/bioc/html/snpMatrix.html

and if that looks promising

source('http://bioconductor.org/biocLite.R')
biocLite('snpMatrix')


Likely you'll quickly want a 64 bit (linux or Mac) machine.
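
A rough back-of-envelope check of why a 64-bit machine helps, assuming the
2000 x 500000 grid mentioned above is held fully in memory. The three storage
types are illustrative assumptions about how the genotypes might be stored;
actual usage depends on the package and on any copies R makes.

```r
# Approximate in-memory size of a 2000 x 500000 genotype grid
# under three hypothetical storage types.
n_people <- 2000
n_snps   <- 500000
cells    <- n_people * n_snps            # 1e9 cells

bytes_per_cell <- c(raw = 1, integer = 4, double = 8)
gib <- cells * bytes_per_cell / 2^30     # size in GiB

round(gib, 2)
#     raw integer  double
#    0.93    3.73    7.45
```

A 32-bit process can address at most 4 GiB, so only the one-byte
representation leaves any headroom, and even that gets tight once
intermediate copies are made.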

netCDF is another useful option -- we have been using the ncdf package for
large genomic datasets. We read the data in one person at a time and
write to netCDF. For analysis we can then read any subsets. Since we
have imputed SNP data as well as measured, this comes to about 2.5 million
variables on 4000 people for one of our data sets.
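
A minimal sketch of the write-one-person-at-a-time, read-a-subset pattern
described above. It assumes the ncdf4 package (the successor to ncdf, with
different function names) rather than ncdf itself, and uses simulated
genotypes in place of real per-person input. The toy dimensions, the variable
name "genotype", and the 0/1/2 coding with -1 as missing are all illustrative
assumptions.

```r
# Assumption: the ncdf4 package is installed.
library(ncdf4)

n_snp    <- 1000   # toy sizes; a real data set would be far larger
n_person <- 20

# Define the two dimensions and one SNP x person variable.
snp_dim    <- ncdim_def("snp",    units = "index", vals = 1:n_snp)
person_dim <- ncdim_def("person", units = "index", vals = 1:n_person)
geno_var   <- ncvar_def("genotype", units = "count",
                        dim = list(snp_dim, person_dim),
                        missval = -1, prec = "short")

nc_file <- tempfile(fileext = ".nc")
nc <- nc_create(nc_file, geno_var)

# Write one person (one column) at a time, so only a single
# person's genotypes are ever held in memory.
set.seed(1)
for (i in seq_len(n_person)) {
  g <- sample(0:2, n_snp, replace = TRUE)  # stand-in for reading one person
  ncvar_put(nc, geno_var, g, start = c(1, i), count = c(n_snp, 1))
}
nc_close(nc)

# For analysis, read back only a subset: SNPs 101-150 for all people.
# A count of -1 means "everything along this dimension".
nc <- nc_open(nc_file)
block <- ncvar_get(nc, "genotype", start = c(101, 1), count = c(50, -1))
nc_close(nc)

dim(block)
```

Here `block` is a 50 x 20 matrix; the rest of the file is never loaded, which
is what makes this workable when the full matrix would not fit in memory.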



        -thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]       University of Washington, Seattle

