ok, thanks for the clarification, I was using release and did not try devel.

actually, this also affects the function predictCoding() and there, using ignore.strand=TRUE is not appropriate, so my workaround by now in release is to coerce the input VRanges to GRanges and set strand to '*' before calling locateVariants() or predictCoding(). Just mentioning in case somebody encountered the same situation in release.

cheers,

robert.

On 05/22/2015 09:49 PM, Michael Lawrence wrote:
This changed recently. VariantAnnotation in devel no longer enforces a
strand on VRanges, or at least it allows the "*" case.


On Fri, May 22, 2015 at 11:33 AM, Robert Castelo <robert.cast...@upf.edu
<mailto:robert.cast...@upf.edu>> wrote:

    Hi,

    I have encountered myself in a strange situation when using the
    function locateVariants() from VariantAnnotation with an input
    VRanges object. The problem is that some of the expected coding
    annotations are not showing up when using locateVariants() with
    default parameters.

    After investigating this situation I think I found the reason, which
    does not look like a bug but I would like that you give me some
    clarification about the logic behind using locateVariants() with
    VRanges objects.

    The documentation of the VRanges-class says that in this class of
    objects "The strand is always constrained to be positive (+).". I
    guess there may be a good reason for this but I could not find it in
    the documentation or googling about it.

    This means that when you coerce a CollapsedVCF object (obtained, for
    example, from a VCF file via readVcf()) to a VRanges object, even
    though variants in the VCF may have no strand, they get a positive
    strand in the VRanges object.

    The problem arises then, when you call locateVariants() with this
    VRanges object, because features on the negative strand are never
    going to overlap with the variants since, by default, the argument
    ignore.strand=FALSE.

    Let me illustrate this with a toy example. Consider the SNP
    rs1129038
    (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at
    chr15:28356859 with allels A/G. It is located on the 3' UTR of the
    gene HERC2 coded on the negative strand of the human reference
    genome. Let's build a toy VRanges object having this variant:

    library(VariantAnnotation)
    vr <- VRanges(seqnames="chr15",
                   ranges=IRanges(28356859, 28356859),
                   ref="A", alt="G",
                   refDepth=5, altDepth=7,
                   totalDepth=12, sampleNames="A")
    strand(vr)
    factor-Rle of length 1 with 1 run
       Lengths: 1
       Values : +
    Levels(3): + - *

    Let's build now its CollapsedVCF counterpart by using the
    corresponding coercion method and set the strand to "*":

    vcf <- asVCF(vr)
    strand(vcf) <- "*"

    Now run locateVariants() on both objects with UCSC annotations:

    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

    locateVariants(vcf, txdb, region=AllVariants())
    GRanges object with 2 ranges and 9 metadata columns:
           seqnames               ranges strand | LOCATION  LOCSTART
    LOCEND   QUERYID        TXID         CDSID
    <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
    <character> <IntegerList>
       [1]    chr15 [28356859, 28356859]      * | threeUTR        50 50
            1       55386
       [2]    chr15 [28356859, 28356859]      * | threeUTR        50 50
            1       55387
                GENEID       PRECEDEID        FOLLOWID
    <character> <CharacterList> <CharacterList>
       [1]        8924
       [2]        8924
       -------
       seqinfo: 1 sequence from an unspecified genome; no seqlengths

    locateVariants(vr, txdb, region=AllVariants())
    GRanges object with 1 range and 9 metadata columns:
           seqnames               ranges strand |   LOCATION  LOCSTART
    LOCEND   QUERYID      TXID         CDSID
    <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
    <integer> <IntegerList>
       [1]    chr15 [28356859, 28356859]      + | intergenic <NA> <NA>
            1 <NA>
                GENEID                         PRECEDEID        FOLLOWID
    <character> <CharacterList> <CharacterList>
       [1] <NA> 100132565,100289656,100616223,...            2567
       -------
       seqinfo: 1 sequence from an unspecified genome; no seqlengths

    Note that while we get the 3' UTR annotation from the strandless VCF
    object we do not get it from the VRanges object with the positive
    strand. To make my point clear: this positive strand shows up when
    you coerce a strandless VCF object to a VRanges one, because
    positive strandness seems to be the convention for VRanges objects:

    as(vcf, VRanges)
    VRanges object with 1 range and 1 metadata column:
           seqnames               ranges strand         ref
    alt     totalDepth       refDepth       altDepth
    <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
    <integerOrRle> <integerOrRle>
       [1]    chr15 [28356859, 28356859]      +           A
       G             12              5              7
             sampleNames softFilterMatrix |      QUAL
    <factorOrRle> <matrix> | <numeric>
       [1]             A                  | <NA>
       -------
       seqinfo: 1 sequence from an unspecified genome; no seqlengths
       hardFilters: NULL


    Of course, if I run locateVariants() with the argument
    ignore.strand=TRUE, then I get the expected annotation:

    locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE)
    GRanges object with 2 ranges and 9 metadata columns:
           seqnames               ranges strand | LOCATION  LOCSTART
    LOCEND   QUERYID        TXID         CDSID
    <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
    <character> <IntegerList>
       [1]    chr15 [28356859, 28356859]      + | threeUTR       677
    677         1       55386
       [2]    chr15 [28356859, 28356859]      + | threeUTR       677
    677         1       55387
                GENEID       PRECEDEID        FOLLOWID
    <character> <CharacterList> <CharacterList>
       [1]        8924
       [2]        8924
       -------
       seqinfo: 1 sequence from an unspecified genome; no seqlengths


    So, my question is, given that VRanges objects are enforced to have
    a positive strand, would not be better to have ignore.strand=TRUE as
    default in locateVariants?

    Alternatively, I would recommend that locateVariants() issues a
    warning, or maybe an error, when the input object is VRanges and
    ignore.strand=FALSE.

    Finally, out of curiosity, why a VRanges object enforces the
    positive strand in all its genomic ranges? Would not be better just
    taking the strand of the CollapsedVCF object when coercing the
    CollapsedVCF object to VRanges?


    thanks!!


    robert.

    _______________________________________________
    Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list
    https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to