Val, I wasn't suggesting that LOCSTRAND be added to the locateVariants() output. Rather, it would be added to the VRanges during the merge.
Michael On Thu, Jun 11, 2015 at 10:10 AM, Valerie Obenchain <voben...@fredhutch.org> wrote: > locateVariants(), predictCoding() and the family of mapToTranscripts() > functions all return strand according to the annotation matched. The only > time the strand of the output could possibly be different from the strand of > the input 'query' is when 'ignore.strand = TRUE' (FALSE by default). > > I wouldn't think you (Robert) are using 'ignore.strand = TRUE', are you? By > just using the default, the output will have the same strand as the input > 'query' (unless 'query' is '*' of course). > > That said, do you still feel it's necessary to add a LOCSTRAND column to the > output? > > Val > > > On 06/11/2015 09:38 AM, Michael Lawrence wrote: >> >> I didn't realize that locateVariants() returned an object with its >> strand matching that of the subject. I would have expected the subject >> strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it >> sounds like you want to merge the locateVariants() output with the >> input. Merging the output strand as LOCSTRAND on the VRanges sounds >> like a reasonable approach, for now. I don't know if Val is listening, >> but it sounds like it would be nice to have convenient functions for >> merging locateVariants() output with its input. The one for VRanges >> might do something like the above. >> >> Michael >> >> On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo <robert.cast...@upf.edu> >> wrote: >>> >>> Of course, the inclusion of strand would imply an interpretation of the >>> variant and its strand (e.g., "-") with respect to an annotated feature. >>> I >>> can see a practical problem of integrity of the information on a VRanges >>> object, by which a mandatory column, such as strand, depends on a >>> non-mandatory column, such as some feature annotation stored as a >>> metadata >>> column. >>> >>> A solution would be to add the transcript identifier (TXID) as mandatory >>> column on the VRanges object but I suspect this is a big change to do, so >>> adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by >>> locateVariants) in the metadata columns of the VRanges object would allow >>> me >>> to use a VRanges object as a container of variant x allele x sample x >>> annotation. >>> >>> Just to clear up the issue of merging strand and variant: a noisy variant >>> (a >>> variant that is not silent) and has a, e.g., loss-of-function effect such >>> as >>> the gain of a stop codon, is usually interpreted in the strand of the >>> transcript and coding sequence in which the stop codon is gained, saying >>> something like and A changed to a T producting the stop codon TAA. Ref >>> and >>> alt alleles are called in the strand of the reference chromosome, so if >>> the >>> transcript was annotated in the negative strand, we would know that we >>> need >>> to reverse-complement ref and alt to interpret the variant, although I >>> see >>> no need to do anything on the VRanges object to ref and alt because we >>> know >>> they are always in the strand of the reference chromosome. Only if you >>> want >>> to detect this stop-gain event (with predictCoding) then you would have >>> to >>> reverse-complement the ref and alt alleles. Conversely, if the variant >>> falls >>> in an intergenic region, then obviously the strand plays no role in the >>> interpretation of the variant and nothing needs to be done when >>> interpreting >>> the ref and alt alleles. >>> >>> >>> On 6/11/15 5:47 PM, Michael Lawrence wrote: >>>> >>>> >>>> The fact that the position describes the variant, but the strand >>>> refers to the transcript is confusing to me. What is the concrete use >>>> case for merging the two features like that? VRanges constrains its >>>> strand for at least 2 reasons: (1) to be less error prone [of course >>>> this runs completely counter to flexibility] and (2) simplicity [we >>>> don't have to worry about what "-" means for ref/alt, overlap, etc]. >>>> >>>> On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo <robert.cast...@upf.edu> >>>> wrote: >>>>> >>>>> >>>>> one option for me is just to add a metadata column with the strand of >>>>> the >>>>> overlapping feature. however, i'm interested to fully understand the >>>>> rationale behind this aspect of the design of the VRanges object. >>>>> >>>>> a VRanges object unrolls variants in a VCF file per alternative allele >>>>> and >>>>> sample. variants in VCF files are obtained from tallying reads aligned >>>>> on >>>>> a >>>>> reference genome. so, my understanding is that the reference allele is >>>>> the >>>>> allele of the reference genome against which the reads were aligned >>>>> while >>>>> the alternate allele(s) are allele calls different from the reference. >>>>> from >>>>> this perspective, my interpretation is that ref and alt alleles have >>>>> already >>>>> a strand, which is the strand of the reference chromosome against which >>>>> the >>>>> reads were aligned to. i'm interested in this interpretation of the >>>>> strand >>>>> of the variants because i'm interested in the interpretation of >>>>> sequence-features containing the reference and the alternate alleles, >>>>> such >>>>> as differences in a binding site with the reference and the alternate >>>>> allele. >>>>> >>>>> if we relax the meaning of elements in a VRanges object to, not only >>>>> variants x allele x sample, but to variants x allele x sample x >>>>> annotated-feature, then i think it would make sense to have the >>>>> strand-specific annotation in the strand slot of the VRanges object. >>>>> >>>>> while this idea may be good or not for a number of reasons, i'm now >>>>> mostly >>>>> interested in knowing whether i'm misinterpreting the design of VRanges >>>>> objects, and maybe variant calling in general or i'm in a more or less >>>>> right >>>>> path in using a VRanges object to hold variant annotations. >>>>> >>>>> >>>>> thanks!!! >>>>> >>>>> robert. >>>>> >>>>> >>>>> On 06/11/2015 04:30 AM, Michael Lawrence wrote: >>>>>> >>>>>> >>>>>> I guess it depends on what the strand should mean. Would having a >>>>>> negative strand indicate that the ref/alt should be complemented? I'm >>>>>> not sure it's a good idea to conflate the strand of the variant itself >>>>>> with the strand of an overlapping feature. >>>>>> >>>>>> On Wed, Jun 10, 2015 at 1:17 PM, Robert >>>>>> Castelo<robert.cast...@upf.edu> >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> my understanding was that VRanges is a container for variants and >>>>>>> variant >>>>>>> annotations and strand is just one annotation more. when we use >>>>>>> locateVariants() a variant can be annotated to multiple transcripts >>>>>>> where >>>>>>> in >>>>>>> one overlaps an exon, in another an intron and so on. In all those >>>>>>> transcripts annotations there is a strand annotation, the strand of >>>>>>> the >>>>>>> transcript. if the transcript is annotated in the negative strand of >>>>>>> the >>>>>>> reference chromosome then the annotation of a transcript region to a >>>>>>> variant >>>>>>> is going to be also on the negative strand. >>>>>>> >>>>>>> both locateVariants() and predictCoding() return GRanges objects with >>>>>>> strand >>>>>>> annotations according to the transcripts being annotated. I thought >>>>>>> it >>>>>>> made >>>>>>> sense in VariantFiltering to use VRanges objects as a container for >>>>>>> variants and annotations and, for this reason, I would like to carry >>>>>>> on >>>>>>> the >>>>>>> strand annotation into the VRanges object. Is there a strong reason >>>>>>> for >>>>>>> a >>>>>>> VRanges object, derived from GRanges, not to have strand? >>>>>>> >>>>>>> >>>>>>> On 6/10/15 6:26 PM, Michael Lawrence wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> VRanges is supposed to enforce strand. The goal is to use "*" >>>>>>>> always, >>>>>>>> for simplicity and consistency with the result of readVcf(). Is >>>>>>>> there >>>>>>>> a use case for negative strand variants? >>>>>>>> >>>>>>>> On Wed, Jun 10, 2015 at 5:54 AM, Robert >>>>>>>> Castelo<robert.cast...@upf.edu> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Michael, >>>>>>>>> >>>>>>>>> regarding our email exchange three weeks ago, I found a couple of >>>>>>>>> places >>>>>>>>> in >>>>>>>>> VariantAnnotation that IMO need to be updated to avoid enforcing >>>>>>>>> strand >>>>>>>>> on >>>>>>>>> VRanges. >>>>>>>>> >>>>>>>>> these places occur when constructing and validating VRanges >>>>>>>>> objects, >>>>>>>>> concretely at: >>>>>>>>> >>>>>>>>> 1. file R/methods-VRanges-class.R at the VRanges class constructor: >>>>>>>>> >>>>>>>>> VRanges<- >>>>>>>>> function(seqnames = Rle(), ranges = IRanges(), >>>>>>>>> ref = character(), alt = NA_character_, >>>>>>>>> totalDepth = NA_integer_, refDepth = NA_integer_, >>>>>>>>> altDepth = NA_integer_, ..., sampleNames = >>>>>>>>> NA_character_, >>>>>>>>> softFilterMatrix = FilterMatrix(matrix(nrow = >>>>>>>>> length(gr), >>>>>>>>> ncol = >>>>>>>>> 0L), >>>>>>>>> FilterRules()), >>>>>>>>> hardFilters = FilterRules()) >>>>>>>>> { >>>>>>>>> gr<- GRanges(seqnames, ranges, >>>>>>>>> strand = .rleRecycleVector("*", >>>>>>>>> length(ranges)), >>>>>>>>> ...) >>>>>>>>> [...] >>>>>>>>> >>>>>>>>> that precludes setting the strand at construction time: >>>>>>>>> >>>>>>>>> library(VariantAnnotation) >>>>>>>>> VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C", >>>>>>>>> strand="-") >>>>>>>>> Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*", >>>>>>>>> length(ranges)), : >>>>>>>>> formal argument "strand" matched by multiple actual arguments >>>>>>>>> >>>>>>>>> >>>>>>>>> 2. R/AllClasses.R at the VRanges class validity function >>>>>>>>> .valid.VRanges(): >>>>>>>>> >>>>>>>>> .valid.VRanges.strand<- function(object) { >>>>>>>>> if (any(object@strand == "-")) >>>>>>>>> paste("'strand' must always be '+' or '*'") >>>>>>>>> } >>>>>>>>> >>>>>>>>> [...] >>>>>>>>> >>>>>>>>> .valid.VRanges<- function(object) >>>>>>>>> { >>>>>>>>> c(.valid.VRanges.ref(object), >>>>>>>>> .valid.VRanges.alt(object), >>>>>>>>> .valid.VRanges.strand(object), >>>>>>>>> .valid.VRanges.depth(object)) >>>>>>>>> } >>>>>>>>> >>>>>>>>> that prompts an error when variants annotated on the negative >>>>>>>>> strand >>>>>>>>> are >>>>>>>>> detected: >>>>>>>>> >>>>>>>>> library(VariantAnnotation) >>>>>>>>> example(VRanges) >>>>>>>>> strand(vr)<- "-" >>>>>>>>> c(vr) >>>>>>>>> Error in validObject(.Object) : >>>>>>>>> invalid class “VRanges” object: 'strand' must always be '+' >>>>>>>>> or >>>>>>>>> '*' >>>>>>>>> >>>>>>>>> >>>>>>>>> cheers, >>>>>>>>> >>>>>>>>> robert. >>>>>>>>> >>>>>>>>> On 05/22/2015 09:49 PM, Michael Lawrence wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> This changed recently. VariantAnnotation in devel no longer >>>>>>>>>> enforces >>>>>>>>>> a >>>>>>>>>> strand on VRanges, or at least it allows the "*" case. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, May 22, 2015 at 11:33 AM, Robert >>>>>>>>>> Castelo<robert.cast...@upf.edu >>>>>>>>>> <mailto:robert.cast...@upf.edu>> wrote: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I have encountered myself in a strange situation when >>>>>>>>>> using >>>>>>>>>> the >>>>>>>>>> function locateVariants() from VariantAnnotation with an >>>>>>>>>> input >>>>>>>>>> VRanges object. The problem is that some of the expected >>>>>>>>>> coding >>>>>>>>>> annotations are not showing up when using locateVariants() >>>>>>>>>> with >>>>>>>>>> default parameters. >>>>>>>>>> >>>>>>>>>> After investigating this situation I think I found the >>>>>>>>>> reason, >>>>>>>>>> which >>>>>>>>>> does not look like a bug but I would like that you give me >>>>>>>>>> some >>>>>>>>>> clarification about the logic behind using >>>>>>>>>> locateVariants() >>>>>>>>>> with >>>>>>>>>> VRanges objects. >>>>>>>>>> >>>>>>>>>> The documentation of the VRanges-class says that in this >>>>>>>>>> class >>>>>>>>>> of >>>>>>>>>> objects "The strand is always constrained to be positive >>>>>>>>>> (+).". >>>>>>>>>> I >>>>>>>>>> guess there may be a good reason for this but I could not >>>>>>>>>> find >>>>>>>>>> it >>>>>>>>>> in >>>>>>>>>> the documentation or googling about it. >>>>>>>>>> >>>>>>>>>> This means that when you coerce a CollapsedVCF object >>>>>>>>>> (obtained, >>>>>>>>>> for >>>>>>>>>> example, from a VCF file via readVcf()) to a VRanges >>>>>>>>>> object, >>>>>>>>>> even >>>>>>>>>> though variants in the VCF may have no strand, they get a >>>>>>>>>> positive >>>>>>>>>> strand in the VRanges object. >>>>>>>>>> >>>>>>>>>> The problem arises then, when you call locateVariants() >>>>>>>>>> with >>>>>>>>>> this >>>>>>>>>> VRanges object, because features on the negative strand >>>>>>>>>> are >>>>>>>>>> never >>>>>>>>>> going to overlap with the variants since, by default, the >>>>>>>>>> argument >>>>>>>>>> ignore.strand=FALSE. >>>>>>>>>> >>>>>>>>>> Let me illustrate this with a toy example. Consider the >>>>>>>>>> SNP >>>>>>>>>> rs1129038 >>>>>>>>>> >>>>>>>>>> (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) >>>>>>>>>> at >>>>>>>>>> chr15:28356859 with allels A/G. It is located on the 3' >>>>>>>>>> UTR >>>>>>>>>> of >>>>>>>>>> the >>>>>>>>>> gene HERC2 coded on the negative strand of the human >>>>>>>>>> reference >>>>>>>>>> genome. Let's build a toy VRanges object having this >>>>>>>>>> variant: >>>>>>>>>> >>>>>>>>>> library(VariantAnnotation) >>>>>>>>>> vr<- VRanges(seqnames="chr15", >>>>>>>>>> ranges=IRanges(28356859, 28356859), >>>>>>>>>> ref="A", alt="G", >>>>>>>>>> refDepth=5, altDepth=7, >>>>>>>>>> totalDepth=12, sampleNames="A") >>>>>>>>>> strand(vr) >>>>>>>>>> factor-Rle of length 1 with 1 run >>>>>>>>>> Lengths: 1 >>>>>>>>>> Values : + >>>>>>>>>> Levels(3): + - * >>>>>>>>>> >>>>>>>>>> Let's build now its CollapsedVCF counterpart by using the >>>>>>>>>> corresponding coercion method and set the strand to "*": >>>>>>>>>> >>>>>>>>>> vcf<- asVCF(vr) >>>>>>>>>> strand(vcf)<- "*" >>>>>>>>>> >>>>>>>>>> Now run locateVariants() on both objects with UCSC >>>>>>>>>> annotations: >>>>>>>>>> >>>>>>>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>>>>>>>> txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene >>>>>>>>>> >>>>>>>>>> locateVariants(vcf, txdb, region=AllVariants()) >>>>>>>>>> GRanges object with 2 ranges and 9 metadata columns: >>>>>>>>>> seqnames ranges strand | LOCATION >>>>>>>>>> LOCSTART >>>>>>>>>> LOCEND QUERYID TXID CDSID >>>>>>>>>> <Rle> <IRanges> <Rle> |<factor> <integer> <integer> >>>>>>>>>> <integer> >>>>>>>>>> <character> <IntegerList> >>>>>>>>>> [1] chr15 [28356859, 28356859] * | threeUTR >>>>>>>>>> 50 >>>>>>>>>> 50 >>>>>>>>>> 1 55386 >>>>>>>>>> [2] chr15 [28356859, 28356859] * | threeUTR >>>>>>>>>> 50 >>>>>>>>>> 50 >>>>>>>>>> 1 55387 >>>>>>>>>> GENEID PRECEDEID FOLLOWID >>>>>>>>>> <character> <CharacterList> <CharacterList> >>>>>>>>>> [1] 8924 >>>>>>>>>> [2] 8924 >>>>>>>>>> ------- >>>>>>>>>> seqinfo: 1 sequence from an unspecified genome; no >>>>>>>>>> seqlengths >>>>>>>>>> >>>>>>>>>> locateVariants(vr, txdb, region=AllVariants()) >>>>>>>>>> GRanges object with 1 range and 9 metadata columns: >>>>>>>>>> seqnames ranges strand | LOCATION >>>>>>>>>> LOCSTART >>>>>>>>>> LOCEND QUERYID TXID CDSID >>>>>>>>>> <Rle> <IRanges> <Rle> |<factor> <integer> <integer> >>>>>>>>>> <integer> >>>>>>>>>> <integer> <IntegerList> >>>>>>>>>> [1] chr15 [28356859, 28356859] + | >>>>>>>>>> intergenic<NA> >>>>>>>>>> <NA> >>>>>>>>>> 1<NA> >>>>>>>>>> GENEID PRECEDEID >>>>>>>>>> FOLLOWID >>>>>>>>>> <character> <CharacterList> <CharacterList> >>>>>>>>>> [1]<NA> 100132565,100289656,100616223,... >>>>>>>>>> 2567 >>>>>>>>>> ------- >>>>>>>>>> seqinfo: 1 sequence from an unspecified genome; no >>>>>>>>>> seqlengths >>>>>>>>>> >>>>>>>>>> Note that while we get the 3' UTR annotation from the >>>>>>>>>> strandless >>>>>>>>>> VCF >>>>>>>>>> object we do not get it from the VRanges object with the >>>>>>>>>> positive >>>>>>>>>> strand. To make my point clear: this positive strand shows >>>>>>>>>> up >>>>>>>>>> when >>>>>>>>>> you coerce a strandless VCF object to a VRanges one, >>>>>>>>>> because >>>>>>>>>> positive strandness seems to be the convention for VRanges >>>>>>>>>> objects: >>>>>>>>>> >>>>>>>>>> as(vcf, VRanges) >>>>>>>>>> VRanges object with 1 range and 1 metadata column: >>>>>>>>>> seqnames ranges strand ref >>>>>>>>>> alt totalDepth refDepth altDepth >>>>>>>>>> <Rle> <IRanges> <Rle> <character> <characterOrRle> >>>>>>>>>> <integerOrRle> >>>>>>>>>> <integerOrRle> <integerOrRle> >>>>>>>>>> [1] chr15 [28356859, 28356859] + A >>>>>>>>>> G 12 5 7 >>>>>>>>>> sampleNames softFilterMatrix | QUAL >>>>>>>>>> <factorOrRle> <matrix> |<numeric> >>>>>>>>>> [1] A |<NA> >>>>>>>>>> ------- >>>>>>>>>> seqinfo: 1 sequence from an unspecified genome; no >>>>>>>>>> seqlengths >>>>>>>>>> hardFilters: NULL >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Of course, if I run locateVariants() with the argument >>>>>>>>>> ignore.strand=TRUE, then I get the expected annotation: >>>>>>>>>> >>>>>>>>>> locateVariants(vr, txdb, region=AllVariants(), >>>>>>>>>> ignore.strand=TRUE) >>>>>>>>>> GRanges object with 2 ranges and 9 metadata columns: >>>>>>>>>> seqnames ranges strand | LOCATION >>>>>>>>>> LOCSTART >>>>>>>>>> LOCEND QUERYID TXID CDSID >>>>>>>>>> <Rle> <IRanges> <Rle> |<factor> <integer> <integer> >>>>>>>>>> <integer> >>>>>>>>>> <character> <IntegerList> >>>>>>>>>> [1] chr15 [28356859, 28356859] + | threeUTR >>>>>>>>>> 677 >>>>>>>>>> 677 1 55386 >>>>>>>>>> [2] chr15 [28356859, 28356859] + | threeUTR >>>>>>>>>> 677 >>>>>>>>>> 677 1 55387 >>>>>>>>>> GENEID PRECEDEID FOLLOWID >>>>>>>>>> <character> <CharacterList> <CharacterList> >>>>>>>>>> [1] 8924 >>>>>>>>>> [2] 8924 >>>>>>>>>> ------- >>>>>>>>>> seqinfo: 1 sequence from an unspecified genome; no >>>>>>>>>> seqlengths >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> So, my question is, given that VRanges objects are >>>>>>>>>> enforced >>>>>>>>>> to >>>>>>>>>> have >>>>>>>>>> a positive strand, would not be better to have >>>>>>>>>> ignore.strand=TRUE >>>>>>>>>> as >>>>>>>>>> default in locateVariants? >>>>>>>>>> >>>>>>>>>> Alternatively, I would recommend that locateVariants() >>>>>>>>>> issues >>>>>>>>>> a >>>>>>>>>> warning, or maybe an error, when the input object is >>>>>>>>>> VRanges >>>>>>>>>> and >>>>>>>>>> ignore.strand=FALSE. >>>>>>>>>> >>>>>>>>>> Finally, out of curiosity, why a VRanges object enforces >>>>>>>>>> the >>>>>>>>>> positive strand in all its genomic ranges? Would not be >>>>>>>>>> better >>>>>>>>>> just >>>>>>>>>> taking the strand of the CollapsedVCF object when coercing >>>>>>>>>> the >>>>>>>>>> CollapsedVCF object to VRanges? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> thanks!! >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> robert. >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org> >>>>>>>>>> mailing >>>>>>>>>> list >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >>>>>>>>>> >>>>>>>>>> >>>>>>>>> -- >>>>>>>>> Robert Castelo, PhD >>>>>>>>> Associate Professor >>>>>>>>> Dept. of Experimental and Health Sciences >>>>>>>>> Universitat Pompeu Fabra (UPF) >>>>>>>>> Barcelona Biomedical Research Park (PRBB) >>>>>>>>> Dr Aiguader 88 >>>>>>>>> E-08003 Barcelona, Spain >>>>>>>>> telf: +34.933.160.514 >>>>>>>>> fax: +34.933.160.550 >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioc-devel@r-project.org mailing list >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>> -- >>>>> Robert Castelo, PhD >>>>> Associate Professor >>>>> Dept. of Experimental and Health Sciences >>>>> Universitat Pompeu Fabra (UPF) >>>>> Barcelona Biomedical Research Park (PRBB) >>>>> Dr Aiguader 88 >>>>> E-08003 Barcelona, Spain >>>>> telf: +34.933.160.514 >>>>> fax: +34.933.160.550 >>> >>> >>> >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, Seattle, WA 98109 > > Email: voben...@fredhutch.org > Phone: (206) 667-3158 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel