Vince,

thanks a lot for the example of streaming dbSNP over the internet, showing that even this is faster than accessing the data locally. to me, this just confirms that the current performance of the SNPlocs.Hsapiens.dbSNP144.GRCh37 annotation package can be improved. Hervé will look at it and hopefully will find a fix, if there is a bug, or a way to speed it up.

cheers,
robert.
On 06/17/2016 09:28 PM, Vincent Carey wrote:
I think you can get the relevant information rapidly from the dbSNP VCF. You would acquire

ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz.tbi

and wrap them in a TabixFile:
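A minimal construction of the tf object shown below (assuming the two files were downloaded into the working directory; TabixFile comes from Rsamtools, which is loaded along with VariantAnnotation):

library(VariantAnnotation)
tf <- TabixFile("00-common_all.vcf.gz", index="00-common_all.vcf.gz.tbi")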
tf
class: TabixFile
path: 00-common_all.vcf.gz
index: 00-common_all.vcf.gz.tbi
isOpen: FALSE
yieldSize: NA
rowRanges(readVcf(tf,
    param=ScanVcfParam(which=GRanges("10", IRanges(1, 50000))),
    genome="hg19"))
then returns fairly quickly. Perhaps AnnotationHub can address this issue (a sketch is at the end of this message). If you have the file locally:
system.time(
  rowRanges(readVcf(tf,
      param=ScanVcfParam(which=GRanges("10", IRanges(1, 50000))),
      genome="hg19")))

   user  system elapsed
  0.187   0.009   0.222
If instead you read from NCBI:

tf2 <- "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz"
system.time(
  rowRanges(readVcf(tf2,
      param=ScanVcfParam(which=GRanges("10", IRanges(1, 50000))),
      genome="hg19")))

   user  system elapsed
  0.237   0.055  16.476
Faster than a speeding SNPlocs? But perhaps there is information loss or other diminished functionality.
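To check the information-loss question, one can inspect what the VCF-based rowRanges actually carries; in this sketch (reusing tf from above), the rs IDs come back as the names of the ranges and the alleles as metadata columns:

rr <- rowRanges(readVcf(tf,
    param=ScanVcfParam(which=GRanges("10", IRanges(1, 50000))),
    genome="hg19"))
head(names(rr))  # variant IDs from the VCF ID column, i.e. rs numbers
mcols(rr)        # paramRangeID, REF, ALT, QUAL, FILTER

And on the AnnotationHub suggestion, a minimal sketch of how one might search for hosted dbSNP VCF resources (the query terms are a guess, not a confirmed resource name):

library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("dbSNP", "VCF"))  # search hub metadata for matching records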
On Fri, Jun 17, 2016 at 12:53 PM, Robert Castelo <robert.cast...@upf.edu> wrote:
hi,

the performance of snpsByOverlaps() in terms of time and memory consumption is quite poor, and i wonder whether there is a bug in the code. here's an example:
library(GenomicRanges)
library(SNPlocs.Hsapiens.dbSNP144.GRCh37)

snps <- SNPlocs.Hsapiens.dbSNP144.GRCh37
gr <- GRanges(seqnames="ch10", IRanges(123276830, 123276830))

system.time(ov <- snpsByOverlaps(snps, gr))
   user  system elapsed
 33.768   0.124  33.955
system.time(ov <- snpsByOverlaps(snps, gr))
   user  system elapsed
 33.150   0.281  33.494
i've shown the call to snpsByOverlaps() twice to account for the possibility that the first call was caching data and the second could be much faster, but that is not the case.

if i do the same with a larger GRanges object, for instance the one attached to this email, then the memory consumption grows to about 20 gigabytes. to me, this, in conjunction with the previous observation, suggests something is wrong with the caching of the data.
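for illustration only, a made-up stand-in of similar scale (the attachment itself is not reproduced here; these coordinates are hypothetical) would be something like:

gr2 <- GRanges("ch10", IRanges(seq(1e6, 1.3e8, by=1e4), width=1))  # ~13,000 single-base positions
system.time(ov2 <- snpsByOverlaps(snps, gr2))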
i look forward to your comments and possible solutions,
thanks!!!
robert.
--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel