Good day,

> library(VariantAnnotation)
> file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
> anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz",
+                            package = "VariantAnnotation")
> aSet <- readVcf(file, "hg19")
> # Read the second file, restricted to the ranges of the variants in aSet
> system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet)))
   user  system elapsed
209.120  16.628 226.083

Reading in the exome chromosome 22 VCF and intersecting it with the other file in the data directory takes almost 4 minutes. However, reading in the whole file is much faster:

> system.time(anotherSet <- readVcf(anotherFile, "hg19"))
   user  system elapsed
  0.376   0.016   0.392

and doing the intersection manually takes a fraction of a second:

> system.time(fastCommonMutations <- intersect(rowRanges(aSet),
+                                              rowRanges(anotherSet)))
   user  system elapsed
  0.128   0.000   0.129

This comparison ignores finer details such as the identities of the alleles, but does the restricted read have to be so slow? My real use case is intersecting dozens of VCF files of cancer samples with the ExAC consortium's VCF file, which is 4 GB in size when compressed. I can't imagine how long that would take. Can the code of readVcf be optimised?

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel