Re: [Bioc-devel] VariantAnnotation::readVcf() sets the wrong seqlevelsStyle in devel

Hervé Pagès Thu, 06 Aug 2020 10:19:11 -0700

Hi Robert,

Yes seqlevelsStyle's new behavior is slightly different and lessforgiving. The thing is that it will generally reveal dormant issueswhich is not such a bad thing after all.

Note that it doesn't seem completely straightforward to retrieve thereference genome/assembly directly from the VCF header. AFAICT thisinformation is either missing or weirdly formatted. For example theheaders of the 1000genomes VCF files located atftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/contain


  ##reference=1000Genomes-NCBI37

or in the ex2.vcf file included in VariantAnnotation it's:

  ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

so not clear that importing this in the genome field of the returned VCFobject would be that helpful.

Thanks for pointing me to the VariantAnnotation vignette. I'll fix thecalls to readVcf() to use GRCh37 instead of hg19. Seems like one call(on ex2.vcf) is using the wrong genome: ex2.vcf is based on hg18/NCBI36,not on hg19/GRCh37. Will fix that too.

Sure readVcf() could probably be improved to perform some sanity checksby making sure that the user-supplied genome is compatible with thechromosome names. However that still won't prevent the user fromspecifying the wrong genome (e.g. GRCh37 instead of NCBI36) like in theex2.vcf case. Anyway this is a feature request for readVcf().

In the end I'm not sure what's the purpose of specifying the genomeanyway. What does it give us? Maybe the vignette and examples inVariantAnnotation should stop doing that? Better to not specify thegenome than specifying the wrong one.


Best,
H.

On 8/6/20 07:42, Robert Castelo wrote:

hi Hervé,
thank you very much for your clarifications, but this behavior isdifferent in release and has been different until now, this is BioC 3.11:
library(VariantAnnotation)

fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevels(vcf)
[1] "22"
seqlevelsStyle(vcf)
[1] "NCBI"    "Ensembl"
i appreciate that the behavior now in devel is more consistent, iactually never understood the need to specify the 'genome="hg19"'argument since this in principle can be figured out from the VCF headerinformation. However, the documentation has become right now confusing,if you go to subsection 2.1 and 2.1.2 from the introductory vignette, itshows using readVcf() with "hg19" but then the sequence names areliterally what they are in the VCF file (NCBI style)
because of the large user base of VariantAnnotation (top-49 download)and the many possible reverse dependencies downstream, i'd suggest thateither readVcf() issues an error or, maybe even better, overrides thesequence level style in the VCF file maybe with a warning, when the'genome' argument does not match the sequence style of the VCF file.
cheers,

robert.


On 04/08/2020 18:29, Hervé Pagès wrote:
Hi Robert,
The VCF file uses "22" for the chromosome name which is the name usedby NCBI. So explicitly specifying "hg19" in the readVcf() call is likesaying that this chromosome name is a UCSC name which is whyseqlevelsStyle() gets confused later.
If you specify the name of the NCBI assembly, things work as expected:
fl <- system.file("extdata", "chr22.vcf.gz",package="VariantAnnotation")
  vcf <- readVcf(fl, "GRCh37")
  seqlevels(vcf)
  # [1] "22"
  seqlevelsStyle(vcf)
  # [1] "NCBI"
  seqlevelsStyle(vcf) <- "UCSC"
  seqlevels(vcf)
  # [1] "chr22"
Or, if you don't know what reference genome the file is based on,don't specify it:
fl <- system.file("extdata", "chr22.vcf.gz",package="VariantAnnotation")
  vcf <- readVcf(fl)
  seqlevels(vcf)
  # [1] "22"
  seqlevelsStyle(vcf)
  # [1] "NCBI"    "Ensembl"
  seqlevelsStyle(vcf) <- "UCSC"
  seqlevels(vcf)
  # [1] "chr22"

or specify it later:

  genome(vcf) <- "hg19"
  seqinfo(vcf)
  # Seqinfo object with 1 sequence from hg19 genome; no seqlengths:
  #   seqnames seqlengths isCircular genome
  #   chr22            NA         NA   hg19

Hope this helps,
H.


On 7/29/20 08:30, Robert Castelo wrote:
hi,
it looks like either VariantAnnotation::readVcf() or something in theCollapsedVCF class broke in devel with respect to reading and settingsequence styles:
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz",package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevels(vcf)
[1] "22"
seqlevelsStyle(vcf)
[1] "UCSC"
seqlevelsStyle(vcf) <- "UCSC"
seqlevels(vcf)
[1] "22"
you can find my session information below. let me know if you want meto open an issue at the GitHub repo (VariantAnnotatoin orGenomeInfoDb?).
thanks!

robert.

BiocManager::version()
[1] ‘3.12’
sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils datasets
[8] methods   base

other attached packages:
  [1] VariantAnnotation_1.35.3 Rsamtools_2.5.3
  [3] Biostrings_2.57.2 XVector_0.29.3
  [5] SummarizedExperiment_1.19.6 DelayedArray_0.15.7
  [7] matrixStats_0.56.0 Matrix_1.2-18
  [9] Biobase_2.49.0 GenomicRanges_1.41.5
[11] GenomeInfoDb_1.25.8 IRanges_2.23.10
[13] S4Vectors_0.27.12 BiocGenerics_0.35.4
[15] BiocManager_1.30.10

loaded via a namespace (and not attached):
  [1] progress_1.2.2           tidyselect_1.1.0 purrr_0.3.4
  [4] lattice_0.20-41          vctrs_0.3.1 generics_0.0.2
  [7] BiocFileCache_1.13.0     rtracklayer_1.49.4 GenomicFeatures_1.41.2
[10] blob_1.2.1               XML_3.99-0.4 rlang_0.4.6
[13] pillar_1.4.4             glue_1.4.1 DBI_1.1.0
[16] rappdirs_0.3.1           BiocParallel_1.23.2 bit64_0.9-7.1
[19] dbplyr_1.4.4             GenomeInfoDbData_1.2.3 lifecycle_0.2.0
[22] stringr_1.4.0            zlibbioc_1.35.0 memoise_1.1.0
[25] biomaRt_2.45.2           curl_4.3 AnnotationDbi_1.51.3
[28] Rcpp_1.0.4.6             BSgenome_1.57.5 openssl_1.4.1
[31] bit_1.1-15.2             hms_0.5.3 askpass_1.1
[34] digest_0.6.25            stringi_1.4.6 dplyr_1.0.0
[37] grid_4.0.0               tools_4.0.0 bitops_1.0-6
[40] magrittr_1.5             RCurl_1.98-1.2 RSQLite_2.2.0
[43] tibble_3.0.1             crayon_1.3.4 pkgconfig_2.0.3
[46] ellipsis_0.3.1           prettyunits_1.1.1 assertthat_0.2.1
[49] httr_1.4.1               R6_2.4.1 GenomicAlignments_1.25.3
[52] compiler_4.0.0

_______________________________________________
Bioc-devel@r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIDaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=gp0KKC6W1uS1YnyFI5iSuxF5WSUpOhbHwL94GRP8yu0&s=Co1P5SErF64uPYhHMltM3De48hQLl-XHK3gfZOEnSKc&e=


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] VariantAnnotation::readVcf() sets the wrong seqlevelsStyle in devel

Reply via email to