This changed recently. VariantAnnotation in devel no longer enforces a strand on VRanges, or at least it allows the "*" case.
On Fri, May 22, 2015 at 11:33 AM, Robert Castelo <robert.cast...@upf.edu> wrote: > Hi, > > I have encountered myself in a strange situation when using the function > locateVariants() from VariantAnnotation with an input VRanges object. The > problem is that some of the expected coding annotations are not showing up > when using locateVariants() with default parameters. > > After investigating this situation I think I found the reason, which does > not look like a bug but I would like that you give me some clarification > about the logic behind using locateVariants() with VRanges objects. > > The documentation of the VRanges-class says that in this class of objects > "The strand is always constrained to be positive (+).". I guess there may > be a good reason for this but I could not find it in the documentation or > googling about it. > > This means that when you coerce a CollapsedVCF object (obtained, for > example, from a VCF file via readVcf()) to a VRanges object, even though > variants in the VCF may have no strand, they get a positive strand in the > VRanges object. > > The problem arises then, when you call locateVariants() with this VRanges > object, because features on the negative strand are never going to overlap > with the variants since, by default, the argument ignore.strand=FALSE. > > Let me illustrate this with a toy example. Consider the SNP rs1129038 ( > http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at > chr15:28356859 with allels A/G. It is located on the 3' UTR of the gene > HERC2 coded on the negative strand of the human reference genome. Let's > build a toy VRanges object having this variant: > > library(VariantAnnotation) > vr <- VRanges(seqnames="chr15", > ranges=IRanges(28356859, 28356859), > ref="A", alt="G", > refDepth=5, altDepth=7, > totalDepth=12, sampleNames="A") > strand(vr) > factor-Rle of length 1 with 1 run > Lengths: 1 > Values : + > Levels(3): + - * > > Let's build now its CollapsedVCF counterpart by using the corresponding > coercion method and set the strand to "*": > > vcf <- asVCF(vr) > strand(vcf) <- "*" > > Now run locateVariants() on both objects with UCSC annotations: > > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > > locateVariants(vcf, txdb, region=AllVariants()) > GRanges object with 2 ranges and 9 metadata columns: > seqnames ranges strand | LOCATION LOCSTART LOCEND > QUERYID TXID CDSID > <Rle> <IRanges> <Rle> | <factor> <integer> <integer> > <integer> <character> <IntegerList> > [1] chr15 [28356859, 28356859] * | threeUTR 50 50 > 1 55386 > [2] chr15 [28356859, 28356859] * | threeUTR 50 50 > 1 55387 > GENEID PRECEDEID FOLLOWID > <character> <CharacterList> <CharacterList> > [1] 8924 > [2] 8924 > ------- > seqinfo: 1 sequence from an unspecified genome; no seqlengths > > locateVariants(vr, txdb, region=AllVariants()) > GRanges object with 1 range and 9 metadata columns: > seqnames ranges strand | LOCATION LOCSTART LOCEND > QUERYID TXID CDSID > <Rle> <IRanges> <Rle> | <factor> <integer> > <integer> <integer> <integer> <IntegerList> > [1] chr15 [28356859, 28356859] + | intergenic <NA> <NA> > 1 <NA> > GENEID PRECEDEID FOLLOWID > <character> <CharacterList> <CharacterList> > [1] <NA> 100132565,100289656,100616223,... 2567 > ------- > seqinfo: 1 sequence from an unspecified genome; no seqlengths > > Note that while we get the 3' UTR annotation from the strandless VCF > object we do not get it from the VRanges object with the positive strand. > To make my point clear: this positive strand shows up when you coerce a > strandless VCF object to a VRanges one, because positive strandness seems > to be the convention for VRanges objects: > > as(vcf, VRanges) > VRanges object with 1 range and 1 metadata column: > seqnames ranges strand ref alt > totalDepth refDepth altDepth > <Rle> <IRanges> <Rle> <character> <characterOrRle> > <integerOrRle> <integerOrRle> <integerOrRle> > [1] chr15 [28356859, 28356859] + A G > 12 5 7 > sampleNames softFilterMatrix | QUAL > <factorOrRle> <matrix> | <numeric> > [1] A | <NA> > ------- > seqinfo: 1 sequence from an unspecified genome; no seqlengths > hardFilters: NULL > > > Of course, if I run locateVariants() with the argument ignore.strand=TRUE, > then I get the expected annotation: > > locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE) > GRanges object with 2 ranges and 9 metadata columns: > seqnames ranges strand | LOCATION LOCSTART LOCEND > QUERYID TXID CDSID > <Rle> <IRanges> <Rle> | <factor> <integer> <integer> > <integer> <character> <IntegerList> > [1] chr15 [28356859, 28356859] + | threeUTR 677 677 > 1 55386 > [2] chr15 [28356859, 28356859] + | threeUTR 677 677 > 1 55387 > GENEID PRECEDEID FOLLOWID > <character> <CharacterList> <CharacterList> > [1] 8924 > [2] 8924 > ------- > seqinfo: 1 sequence from an unspecified genome; no seqlengths > > > So, my question is, given that VRanges objects are enforced to have a > positive strand, would not be better to have ignore.strand=TRUE as default > in locateVariants? > > Alternatively, I would recommend that locateVariants() issues a warning, > or maybe an error, when the input object is VRanges and ignore.strand=FALSE. > > Finally, out of curiosity, why a VRanges object enforces the positive > strand in all its genomic ranges? Would not be better just taking the > strand of the CollapsedVCF object when coercing the CollapsedVCF object to > VRanges? > > > thanks!! > > > robert. > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel