Hi, When I use the 'filterVcf' function to process a VCF file in chunks, the output file contains as many copies of the header as there are chunks. Consider the example, adapted from the 'filterVcf' man page:
library(VariantAnnotation) fl <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz") filt <- FilterRules(list(isSNP = function(x) info(x)$VT == "SNP")) ## Control: Processing without chunking t0 <- TabixFile(fl) ## yieldSize -> NA out0 <- filterVcf(t0, "hg19", tempfile(), filters = filt) tx0 = readLines(out0) header_lines0 = grep("^##", tx0) ## 30 header lines, in line 1,..,30 ## Case: Processing in 3 chunks t1 <- TabixFile(fl, yieldSize = 5e3) ## gives us 3 chunks here out1 <- filterVcf(t1, "hg19", tempfile(), filters = filt) tx1 = readLines(out1) header_lines1 = grep("^##", tx1) ## 90 header lines, header 3 times duplicated ## 1) 1,..,30, 2) 4827,..,4856, 3) 9673,..,9702 It seems that for each chunk, a complete VCF file including the header gets written. See the relevant part of VariantFiltering:::.filter: while (nrow(vcfChunk <- readVcf(tbxFile, genome, ..., param = param))) { vcfChunk <- subsetByFilter(vcfChunk, filters) writeVcf(vcfChunk, filtered) } For the processing of VCFs with large headers in many chunks (e.g. 1000genomes callsets), this can result in the paradox situation that the filtered file ends up being significantly larger than the original. Best wishes Julian R version 3.2.1 (2015-06-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] VariantAnnotation_1.14.6 Rsamtools_1.20.4 Biostrings_2.36.1 [4] XVector_0.8.0 GenomicRanges_1.20.5 GenomeInfoDb_1.4.1 [7] IRanges_2.2.5 S4Vectors_0.6.3 BiocGenerics_0.14.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.30.1 zlibbioc_1.14.0 GenomicAlignments_1.4.1 [4] BiocParallel_1.2.9 BSgenome_1.36.3 tools_3.2.1 [7] Biobase_2.28.0 DBI_0.3.1 lambda.r_1.1.7 [10] futile.logger_1.4.1 rtracklayer_1.28.6 futile.options_1.0.0 [13] bitops_1.0-6 RCurl_1.95-4.7 biomaRt_2.24.0 [16] RSQLite_1.0.0 GenomicFeatures_1.20.1 XML_3.98-1.3 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel