Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

Robert Castelo Thu, 11 Jun 2015 09:54:52 -0700

In fact, a feature like locateVariants() returning an object of the sameclass of the input query argument would be (IMO) a useful feature atleast in the case of an input CollapsedVCF object, because the usercould then dump directly the annotations to a VCF file with writeVcf(),which is one way to store variant annotations into files.


On 6/11/15 6:38 PM, Michael Lawrence wrote:

I didn't realize that locateVariants() returned an object with its
strand matching that of the subject. I would have expected the subject
strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
sounds like you want to merge the locateVariants() output with the
input. Merging the output strand as LOCSTRAND on the VRanges sounds
like a reasonable approach, for now. I don't know if Val is listening,
but it sounds like it would be nice to have convenient functions for
merging locateVariants() output with its input. The one for VRanges
might do something like the above.


Michael

On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo <robert.cast...@upf.edu> wrote:

Of course, the inclusion of strand would imply an interpretation of the
variant and its strand (e.g., "-") with respect to an annotated feature. I
can see a practical problem of integrity of the information on a VRanges
object, by which a mandatory column, such as strand, depends on a
non-mandatory column, such as some feature annotation stored as a metadata
column.

A solution would be to add the transcript identifier (TXID) as mandatory
column on the VRanges object but I suspect this is a big change to do, so
adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by
locateVariants) in the metadata columns of the VRanges object would allow me
to use a VRanges object as a container of variant x allele x sample x
annotation.

Just to clear up the issue of merging strand and variant: a noisy variant (a
variant that is not silent) and has a, e.g., loss-of-function effect such as
the gain of a stop codon, is usually interpreted in the strand of the
transcript and coding sequence in which the stop codon is gained, saying
something like and A changed to a T producting the stop codon TAA. Ref and
alt alleles are called in the strand of the reference chromosome, so if the
transcript was annotated in the negative strand, we would know that we need
to reverse-complement ref and alt to interpret the variant, although I see
no need to do anything on the VRanges object to ref and alt because we know
they are always in the strand of the reference chromosome. Only if you want
to detect this stop-gain event (with predictCoding) then you would have to
reverse-complement the ref and alt alleles. Conversely, if the variant falls
in an intergenic region, then obviously the strand plays no role in the
interpretation of the variant and nothing needs to be done when interpreting
the ref and alt alleles.


On 6/11/15 5:47 PM, Michael Lawrence wrote:

The fact that the position describes the variant, but the strand
refers to the transcript is confusing to me. What is the concrete use
case for merging the two features like that? VRanges constrains its
strand for at least 2 reasons: (1) to be less error prone [of course
this runs completely counter to flexibility] and (2) simplicity [we
don't have to worry about what "-" means for ref/alt, overlap, etc].

On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo <robert.cast...@upf.edu>
wrote:

one option for me is just to add a metadata column with the strand of the
overlapping feature. however, i'm interested to fully understand the
rationale behind this aspect of the design of the VRanges object.

a VRanges object unrolls variants in a VCF file per alternative allele
and
sample. variants in VCF files are obtained from tallying reads aligned on
a
reference genome. so, my understanding is that the reference allele is
the
allele of the reference genome against which the reads were aligned while
the alternate allele(s) are allele calls different from the reference.
from
this perspective, my interpretation is that ref and alt alleles have
already
a strand, which is the strand of the reference chromosome against which
the
reads were aligned to. i'm interested in this interpretation of the
strand
of the variants because i'm interested in the interpretation of
sequence-features containing the reference and the alternate alleles,
such
as differences in a binding site with the reference and the alternate
allele.

if we relax the meaning of elements in a VRanges object to, not only
variants x allele x sample, but to variants x allele x sample x
annotated-feature, then i think it would make sense to have the
strand-specific annotation in the strand slot of the VRanges object.

while this idea may be good or not for a number of reasons, i'm now
mostly
interested in knowing whether i'm misinterpreting the design of VRanges
objects, and maybe variant calling in general or i'm in a more or less
right
path in using a VRanges object to hold variant annotations.


thanks!!!

robert.


On 06/11/2015 04:30 AM, Michael Lawrence wrote:

I guess it depends on what the strand should mean. Would having a
negative strand indicate that the ref/alt should be complemented? I'm
not sure it's a good idea to conflate the strand of the variant itself
with the strand of an overlapping feature.

On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo<robert.cast...@upf.edu>
wrote:

my understanding was that VRanges is a container for variants and
variant
annotations and strand is just one annotation more. when we use
locateVariants() a variant can be annotated to multiple transcripts
where
in
one overlaps an exon, in another an intron and so on. In all those
transcripts annotations there is a strand annotation, the strand of the
transcript. if the transcript is annotated in the negative strand of
the
reference chromosome then the annotation of a transcript region to a
variant
is going to be also on the negative strand.

both locateVariants() and predictCoding() return GRanges objects with
strand
annotations according to the transcripts being annotated. I thought it
made
sense in VariantFiltering to use VRanges objects as a  container for
variants and annotations and, for this reason, I would like to carry on
the
strand annotation into the VRanges object. Is there a strong reason for
a
VRanges object, derived from GRanges, not to have strand?


On 6/10/15 6:26 PM, Michael Lawrence wrote:


VRanges is supposed to enforce strand. The goal is to use "*" always,
for simplicity and consistency with the result of readVcf(). Is there
a use case for negative strand variants?

On Wed, Jun 10, 2015 at 5:54 AM, Robert
Castelo<robert.cast...@upf.edu>
wrote:


Michael,

regarding our email exchange three weeks ago, I found a couple of
places
in
VariantAnnotation that IMO need to be updated to avoid enforcing
strand
on
VRanges.

these places occur when constructing and validating VRanges objects,
concretely at:

1. file R/methods-VRanges-class.R at the VRanges class constructor:

VRanges<-
      function(seqnames = Rle(), ranges = IRanges(),
               ref = character(), alt = NA_character_,
               totalDepth = NA_integer_, refDepth = NA_integer_,
               altDepth = NA_integer_, ..., sampleNames =
NA_character_,
               softFilterMatrix = FilterMatrix(matrix(nrow =
length(gr),
ncol =
0L),
                 FilterRules()),
               hardFilters = FilterRules())
{
      gr<- GRanges(seqnames, ranges,
                    strand = .rleRecycleVector("*", length(ranges)),
...)
[...]

that precludes setting the strand at construction time:

library(VariantAnnotation)
VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C",
strand="-")
Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*",
length(ranges)),  :
      formal argument "strand" matched by multiple actual arguments


2. R/AllClasses.R at the VRanges class validity function
.valid.VRanges():

.valid.VRanges.strand<- function(object) {
      if (any(object@strand == "-"))
        paste("'strand' must always be '+' or '*'")
}

[...]

.valid.VRanges<- function(object)
{
      c(.valid.VRanges.ref(object),
        .valid.VRanges.alt(object),
        .valid.VRanges.strand(object),
        .valid.VRanges.depth(object))
}

that prompts an error when variants annotated on the negative strand
are
detected:

library(VariantAnnotation)
example(VRanges)
strand(vr)<- "-"
c(vr)
Error in validObject(.Object) :
      invalid class “VRanges” object: 'strand' must always be '+' or
'*'


cheers,

robert.

On 05/22/2015 09:49 PM, Michael Lawrence wrote:


This changed recently. VariantAnnotation in devel no longer enforces
a
strand on VRanges, or at least it allows the "*" case.


On Fri, May 22, 2015 at 11:33 AM, Robert
Castelo<robert.cast...@upf.edu
<mailto:robert.cast...@upf.edu>>  wrote:

        Hi,

        I have encountered myself in a strange situation when using
the
        function locateVariants() from VariantAnnotation with an
input
        VRanges object. The problem is that some of the expected
coding
        annotations are not showing up when using locateVariants()
with
        default parameters.

        After investigating this situation I think I found the
reason,
which
        does not look like a bug but I would like that you give me
some
        clarification about the logic behind using locateVariants()
with
        VRanges objects.

        The documentation of the VRanges-class says that in this
class
of
        objects "The strand is always constrained to be positive
(+).".
I
        guess there may be a good reason for this but I could not
find
it
in
        the documentation or googling about it.

        This means that when you coerce a CollapsedVCF object
(obtained,
for
        example, from a VCF file via readVcf()) to a VRanges object,
even
        though variants in the VCF may have no strand, they get a
positive
        strand in the VRanges object.

        The problem arises then, when you call locateVariants() with
this
        VRanges object, because features on the negative strand are
never
        going to overlap with the variants since, by default, the
argument
        ignore.strand=FALSE.

        Let me illustrate this with a toy example. Consider the SNP
        rs1129038

(http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038)
at
        chr15:28356859 with allels A/G. It is located on the 3' UTR
of
the
        gene HERC2 coded on the negative strand of the human
reference
        genome. Let's build a toy VRanges object having this variant:

        library(VariantAnnotation)
        vr<- VRanges(seqnames="chr15",
                       ranges=IRanges(28356859, 28356859),
                       ref="A", alt="G",
                       refDepth=5, altDepth=7,
                       totalDepth=12, sampleNames="A")
        strand(vr)
        factor-Rle of length 1 with 1 run
           Lengths: 1
           Values : +
        Levels(3): + - *

        Let's build now its CollapsedVCF counterpart by using the
        corresponding coercion method and set the strand to "*":

        vcf<- asVCF(vr)
        strand(vcf)<- "*"

        Now run locateVariants() on both objects with UCSC
annotations:

        library(TxDb.Hsapiens.UCSC.hg19.knownGene)
        txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene

        locateVariants(vcf, txdb, region=AllVariants())
        GRanges object with 2 ranges and 9 metadata columns:
               seqnames               ranges strand | LOCATION
LOCSTART
        LOCEND   QUERYID        TXID         CDSID
        <Rle>  <IRanges>  <Rle>  |<factor>  <integer>  <integer>
<integer>
        <character>  <IntegerList>
           [1]    chr15 [28356859, 28356859]      * | threeUTR
50
50
                1       55386
           [2]    chr15 [28356859, 28356859]      * | threeUTR
50
50
                1       55387
                    GENEID       PRECEDEID        FOLLOWID
        <character>  <CharacterList>  <CharacterList>
           [1]        8924
           [2]        8924
           -------
           seqinfo: 1 sequence from an unspecified genome; no
seqlengths

        locateVariants(vr, txdb, region=AllVariants())
        GRanges object with 1 range and 9 metadata columns:
               seqnames               ranges strand |   LOCATION
LOCSTART
        LOCEND   QUERYID      TXID         CDSID
        <Rle>  <IRanges>  <Rle>  |<factor>  <integer>  <integer>
<integer>
        <integer>  <IntegerList>
           [1]    chr15 [28356859, 28356859]      + | intergenic<NA>
<NA>
                1<NA>
                    GENEID                         PRECEDEID
FOLLOWID
        <character>  <CharacterList>  <CharacterList>
           [1]<NA>  100132565,100289656,100616223,...            2567
           -------
           seqinfo: 1 sequence from an unspecified genome; no
seqlengths

        Note that while we get the 3' UTR annotation from the
strandless
VCF
        object we do not get it from the VRanges object with the
positive
        strand. To make my point clear: this positive strand shows up
when
        you coerce a strandless VCF object to a VRanges one, because
        positive strandness seems to be the convention for VRanges
objects:

        as(vcf, VRanges)
        VRanges object with 1 range and 1 metadata column:
               seqnames               ranges strand         ref
        alt     totalDepth       refDepth       altDepth
        <Rle>  <IRanges>  <Rle>  <character>  <characterOrRle>
<integerOrRle>
        <integerOrRle>  <integerOrRle>
           [1]    chr15 [28356859, 28356859]      +           A
           G             12              5              7
                 sampleNames softFilterMatrix |      QUAL
        <factorOrRle>  <matrix>  |<numeric>
           [1]             A                  |<NA>
           -------
           seqinfo: 1 sequence from an unspecified genome; no
seqlengths
           hardFilters: NULL


        Of course, if I run locateVariants() with the argument
        ignore.strand=TRUE, then I get the expected annotation:

        locateVariants(vr, txdb, region=AllVariants(),
ignore.strand=TRUE)
        GRanges object with 2 ranges and 9 metadata columns:
               seqnames               ranges strand | LOCATION
LOCSTART
        LOCEND   QUERYID        TXID         CDSID
        <Rle>  <IRanges>  <Rle>  |<factor>  <integer>  <integer>
<integer>
        <character>  <IntegerList>
           [1]    chr15 [28356859, 28356859]      + | threeUTR
677
        677         1       55386
           [2]    chr15 [28356859, 28356859]      + | threeUTR
677
        677         1       55387
                    GENEID       PRECEDEID        FOLLOWID
        <character>  <CharacterList>  <CharacterList>
           [1]        8924
           [2]        8924
           -------
           seqinfo: 1 sequence from an unspecified genome; no
seqlengths


        So, my question is, given that VRanges objects are enforced
to
have
        a positive strand, would not be better to have
ignore.strand=TRUE
as
        default in locateVariants?

        Alternatively, I would recommend that locateVariants() issues
a
        warning, or maybe an error, when the input object is VRanges
and
        ignore.strand=FALSE.

        Finally, out of curiosity, why a VRanges object enforces the
        positive strand in all its genomic ranges? Would not be
better
just
        taking the strand of the CollapsedVCF object when coercing
the
        CollapsedVCF object to VRanges?


        thanks!!


        robert.

        _______________________________________________
        Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org>
mailing
list
        https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

Reply via email to