Vince,

thanks a lot for the example of streaming dbSNP over the internet, showing that even this is faster than accessing the data locally. to me, this just confirms that the current performance of the SNPlocs.Hsapiens.dbSNP144.GRCh37 annotation package can be improved. Hervé will look at it and hopefully will find a fix, if there is a bug, or a way to speed it up.

cheers,
robert.
On 06/17/2016 09:28 PM, Vincent Carey wrote:
I think you can get the relevant information rapidly from the dbSNP VCF. You would acquire

ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz.tbi

and wrap them in a TabixFile:
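A minimal construction of the tf object shown below (assuming the two files were downloaded into the working directory; TabixFile comes from Rsamtools, which is loaded along with VariantAnnotation):

library(VariantAnnotation)
tf <- TabixFile("00-common_all.vcf.gz", index="00-common_all.vcf.gz.tbi")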
tf
class: TabixFile
path: 00-common_all.vcf.gz
index: 00-common_all.vcf.gz.tbi
isOpen: FALSE
yieldSize: NA
rowRanges(readVcf(tf,
    param=ScanVcfParam(which=GRanges("10", IRanges(1, 50000))),
    genome="hg19"))
then returns fairly quickly. Perhaps AnnotationHub can address this issue (a sketch is at the end of this message). If you have the file locally:
system.time(
  rowRanges(readVcf(tf,
      param=ScanVcfParam(which=GRanges("10", IRanges(1, 50000))),
      genome="hg19")))

   user  system elapsed
  0.187   0.009   0.222
If instead you read from NCBI:

tf2 <- "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz"
system.time(
  rowRanges(readVcf(tf2,
      param=ScanVcfParam(which=GRanges("10", IRanges(1, 50000))),
      genome="hg19")))

   user  system elapsed
  0.237   0.055  16.476
Faster than a speeding SNPlocs? But perhaps there is information loss or other diminished functionality.
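To check the information-loss question, one can inspect what the VCF-based rowRanges actually carries; in this sketch (reusing tf from above), the rs IDs come back as the names of the ranges and the alleles as metadata columns:

rr <- rowRanges(readVcf(tf,
    param=ScanVcfParam(which=GRanges("10", IRanges(1, 50000))),
    genome="hg19"))
head(names(rr))  # variant IDs from the VCF ID column, i.e. rs numbers
mcols(rr)        # paramRangeID, REF, ALT, QUAL, FILTER

And on the AnnotationHub suggestion, a minimal sketch of how one might search for hosted dbSNP VCF resources (the query terms are a guess, not a confirmed resource name):

library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("dbSNP", "VCF"))  # search hub metadata for matching records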
On Fri, Jun 17, 2016 at 12:53 PM, Robert Castelo <robert.cast...@upf.edu> wrote:
hi,

the performance of snpsByOverlaps() in terms of time and memory consumption is quite poor, and i wonder whether there is a bug in the code. here's an example:
library(GenomicRanges)
library(SNPlocs.Hsapiens.dbSNP144.GRCh37)

snps <- SNPlocs.Hsapiens.dbSNP144.GRCh37
gr <- GRanges(seqnames="ch10", IRanges(123276830, 123276830))

system.time(ov <- snpsByOverlaps(snps, gr))
   user  system elapsed
 33.768   0.124  33.955
system.time(ov <- snpsByOverlaps(snps, gr))
   user  system elapsed
 33.150   0.281  33.494
i've shown the call to snpsByOverlaps() twice to account for the possibility that the first call was caching data and the second could be much faster, but that is not the case.

if i do the same with a larger GRanges object, for instance the one attached to this email, then the memory consumption grows to about 20 gigabytes. to me, this, in conjunction with the previous observation, suggests something is wrong with the caching of the data.
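for illustration only, a made-up stand-in of similar scale (the attachment itself is not reproduced here; these coordinates are hypothetical) would be something like:

gr2 <- GRanges("ch10", IRanges(seq(1e6, 1.3e8, by=1e4), width=1))  # ~13,000 single-base positions
system.time(ov2 <- snpsByOverlaps(snps, gr2))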
i look forward to your comments and possible solutions,
thanks!!!
robert.
--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel