I don't think a seqname style implies a specific genome build.
But the inverse might make sense. Given a genome build
identifier, we could check for consistent naming. Perhaps an
option on "genome<-" could support this?
On Fri, Oct 24, 2014 at 11:52 AM, Valerie Obenchain
<voben...@fhcrc.org <mailto:voben...@fhcrc.org>> wrote:
This is a good question. I'm not sure we want
seqlevelsStyle() to also alter the genome value. I think it's
a reasonable request but I'd like to open it up to
discussion. I've cc'd a few others for input.
Valerie
On 10/24/14 09:05, Robert Castelo wrote:
hi Valerie,
thanks for the quick fix and updating the documentation,
i have a
further question about the seqinfo slot and particularly
the use of
seqlevelsStyle(). Let me illustrate it with an example again:
==============
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## read again the same VCF file
fl <- file.path(system.file("extdata",
package="VariantFiltering"),
"CEUtrio.vcf.bgz")
vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
## select the standard chromosomes
vcf <- keepStandardChromosomes(vcf)
## since the input VCF file had NCBI style, let's match
## the style of the TxDb annotations
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
## drop the mitochondrial chromosome (b/c of the
different lengths
## between b37 and hg19
vcf <- dropSeqlevels(vcf, "chrM")
## try to annotate the location of the variants. it
prompts an
## error because the 'genome' slot of the Seqinfo object
still
## has b37 after running seqlevelsStyle
vcf_annot <- locateVariants(vcf, txdb, AllVariants())
Error in mergeNamedAtomicVectors(genome(x), genome(y), what =
c("sequence", :
sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7,
chr8, chr9,
chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17,
chr18, chr19,
chr20, chr21, chr22, chrX, chrY, chrM have incompatible
genomes:
- in 'x': b37, b37, b37, b37, b37, b37, b37, b37, b37,
b37, b37, b37,
b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37,
b37, b37
- in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19,
hg19, hg19, hg19,
hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19,
hg19, hg19, hg19,
hg19, hg19, hg19
## this can be fixed by setting the 'genome' slot to the
values of
## the TxDb object
genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
names(genome(txdb)))]
## now this works
vcf_annot <- locateVariants(vcf, txdb, AllVariants())
=================
so my question is, should not seqlevelsStyle() also
change the 'genome'
slot of the Seqinfo object in the updated object?
if not, would the solution be updating the 'genome' slot
in the way i
did it?
thanks!
robert.
On 10/23/2014 11:14 PM, Valerie Obenchain wrote:
Hi Robert,
Thanks for the bug report and reproducible example.
Now fixed in release
1.12.2 and devel 1.13.4.
I've also updated the docs to better explain how the
Seqinfo objects are
propagated / merged when supplied as 'genome'.
Valerie
On 10/23/2014 06:45 AM, Robert Castelo wrote:
hi there,
in my package VariantFiltering i have an example
VCF file from a Hapmap
CEU trio including three chromosomes only to
illustrate its vignette.
i've come across a problem with the function
readVcf() in
VariantAnnotation that may be specific of the
situation of a VCF file
not having all chromosomes, but which it will be
great for me if this
could be addressed.
the problem is reproduced as follows:
===========================
library(VariantAnnotation)
fl <- file.path(system.file("extdata",
package="VariantFiltering"),
"CEUtrio.vcf.bgz")
vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
Error in GenomeInfoDb:::makeNewSeqnames(x,
new2old = new2old,
seqlevels(value)) :
when 'new2old' is NULL, the first elements in the
supplied 'seqlevels' must be identical to
'seqlevels(x)'
====================
this is caused because although i'm providing the
Seqinfo object derived
from the header of the VCF file itself, at some
point the ordering of
the seqlevels between the header and the rest of
the VCF file differs
due to the smaller subset of chromosomes in the
VCF file.
This can be easily fixed by replacing the line:
if (length(newsi) > length(oldsi)) {
within the .scanVcfToVCF() function in
methods-readVcf.R, by
if (length(newsi) >= length(oldsi)) {
this is happening both in release and devel. i'm
pasting below my
sessionInfo() for the release.
let me know if you think this fix is feasible or
i'm wrongly using the
function readVcf(). i'm basically trying to use
readVcf() without having
to figure out the appropriate value for the
argument 'genome', i.e.,
without knowing beforehand what version of the
genome was used to
produce the VCF file.
thanks!!
robert.
sessionInfo()
R version 3.1.1 Patched (2014-10-13 r66751)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8
[5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8
[7] LC_PAPER=en_US.UTF8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices
[6] utils datasets methods base
other attached packages:
[1] VariantAnnotation_1.12.0 Rsamtools_1.18.0
[3] Biostrings_2.34.0 XVector_0.6.0
[5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0
[7] IRanges_2.0.0 S4Vectors_0.4.0
[9] BiocGenerics_0.12.0 vimcom_1.0-0
[11] setwidth_1.0-3 colorout_1.0-3
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.0 base64enc_0.1-2
[3] BatchJobs_1.4 BBmisc_1.7
[5] Biobase_2.26.0 BiocParallel_1.0.0
[7] biomaRt_2.22.0 bitops_1.0-6
[9] brew_1.0-6 BSgenome_1.34.0
[11] checkmate_1.4 codetools_0.2-9
[13] DBI_0.3.1 digest_0.6.4
[15] fail_1.2 foreach_1.4.2
[17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0
[19] iterators_1.0.7 RCurl_1.95-4.3
[21] RSQLite_0.11.4 rtracklayer_1.26.0
[23] sendmailR_1.2-1 stringr_0.6.2
[25] tools_3.1.1 XML_3.98-1.1
[27] zlibbioc_1.12.0
_______________________________________________
Bioc-devel@r-project.org
<mailto:Bioc-devel@r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel