Michael,
regarding our email exchange three weeks ago, I found a couple of
places
in
VariantAnnotation that IMO need to be updated to avoid enforcing
strand
on
VRanges.
these places occur when constructing and validating VRanges objects,
concretely at:
1. file R/methods-VRanges-class.R at the VRanges class constructor:
VRanges<-
function(seqnames = Rle(), ranges = IRanges(),
ref = character(), alt = NA_character_,
totalDepth = NA_integer_, refDepth = NA_integer_,
altDepth = NA_integer_, ..., sampleNames =
NA_character_,
softFilterMatrix = FilterMatrix(matrix(nrow =
length(gr),
ncol =
0L),
FilterRules()),
hardFilters = FilterRules())
{
gr<- GRanges(seqnames, ranges,
strand = .rleRecycleVector("*", length(ranges)),
...)
[...]
that precludes setting the strand at construction time:
library(VariantAnnotation)
VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C",
strand="-")
Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*",
length(ranges)), :
formal argument "strand" matched by multiple actual arguments
2. R/AllClasses.R at the VRanges class validity function
.valid.VRanges():
.valid.VRanges.strand<- function(object) {
if (any(object@strand == "-"))
paste("'strand' must always be '+' or '*'")
}
[...]
.valid.VRanges<- function(object)
{
c(.valid.VRanges.ref(object),
.valid.VRanges.alt(object),
.valid.VRanges.strand(object),
.valid.VRanges.depth(object))
}
that prompts an error when variants annotated on the negative strand
are
detected:
library(VariantAnnotation)
example(VRanges)
strand(vr)<- "-"
c(vr)
Error in validObject(.Object) :
invalid class “VRanges” object: 'strand' must always be '+' or
'*'
cheers,
robert.
On 05/22/2015 09:49 PM, Michael Lawrence wrote:
This changed recently. VariantAnnotation in devel no longer enforces
a
strand on VRanges, or at least it allows the "*" case.
On Fri, May 22, 2015 at 11:33 AM, Robert
Castelo<robert.cast...@upf.edu
<mailto:robert.cast...@upf.edu>> wrote:
Hi,
I have encountered myself in a strange situation when using
the
function locateVariants() from VariantAnnotation with an
input
VRanges object. The problem is that some of the expected
coding
annotations are not showing up when using locateVariants()
with
default parameters.
After investigating this situation I think I found the
reason,
which
does not look like a bug but I would like that you give me
some
clarification about the logic behind using locateVariants()
with
VRanges objects.
The documentation of the VRanges-class says that in this
class
of
objects "The strand is always constrained to be positive
(+).".
I
guess there may be a good reason for this but I could not
find
it
in
the documentation or googling about it.
This means that when you coerce a CollapsedVCF object
(obtained,
for
example, from a VCF file via readVcf()) to a VRanges object,
even
though variants in the VCF may have no strand, they get a
positive
strand in the VRanges object.
The problem arises then, when you call locateVariants() with
this
VRanges object, because features on the negative strand are
never
going to overlap with the variants since, by default, the
argument
ignore.strand=FALSE.
Let me illustrate this with a toy example. Consider the SNP
rs1129038
(http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038)
at
chr15:28356859 with allels A/G. It is located on the 3' UTR
of
the
gene HERC2 coded on the negative strand of the human
reference
genome. Let's build a toy VRanges object having this variant:
library(VariantAnnotation)
vr<- VRanges(seqnames="chr15",
ranges=IRanges(28356859, 28356859),
ref="A", alt="G",
refDepth=5, altDepth=7,
totalDepth=12, sampleNames="A")
strand(vr)
factor-Rle of length 1 with 1 run
Lengths: 1
Values : +
Levels(3): + - *
Let's build now its CollapsedVCF counterpart by using the
corresponding coercion method and set the strand to "*":
vcf<- asVCF(vr)
strand(vcf)<- "*"
Now run locateVariants() on both objects with UCSC
annotations:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
locateVariants(vcf, txdb, region=AllVariants())
GRanges object with 2 ranges and 9 metadata columns:
seqnames ranges strand | LOCATION
LOCSTART
LOCEND QUERYID TXID CDSID
<Rle> <IRanges> <Rle> |<factor> <integer> <integer>
<integer>
<character> <IntegerList>
[1] chr15 [28356859, 28356859] * | threeUTR
50
50
1 55386
[2] chr15 [28356859, 28356859] * | threeUTR
50
50
1 55387
GENEID PRECEDEID FOLLOWID
<character> <CharacterList> <CharacterList>
[1] 8924
[2] 8924
-------
seqinfo: 1 sequence from an unspecified genome; no
seqlengths
locateVariants(vr, txdb, region=AllVariants())
GRanges object with 1 range and 9 metadata columns:
seqnames ranges strand | LOCATION
LOCSTART
LOCEND QUERYID TXID CDSID
<Rle> <IRanges> <Rle> |<factor> <integer> <integer>
<integer>
<integer> <IntegerList>
[1] chr15 [28356859, 28356859] + | intergenic<NA>
<NA>
1<NA>
GENEID PRECEDEID
FOLLOWID
<character> <CharacterList> <CharacterList>
[1]<NA> 100132565,100289656,100616223,... 2567
-------
seqinfo: 1 sequence from an unspecified genome; no
seqlengths
Note that while we get the 3' UTR annotation from the
strandless
VCF
object we do not get it from the VRanges object with the
positive
strand. To make my point clear: this positive strand shows up
when
you coerce a strandless VCF object to a VRanges one, because
positive strandness seems to be the convention for VRanges
objects:
as(vcf, VRanges)
VRanges object with 1 range and 1 metadata column:
seqnames ranges strand ref
alt totalDepth refDepth altDepth
<Rle> <IRanges> <Rle> <character> <characterOrRle>
<integerOrRle>
<integerOrRle> <integerOrRle>
[1] chr15 [28356859, 28356859] + A
G 12 5 7
sampleNames softFilterMatrix | QUAL
<factorOrRle> <matrix> |<numeric>
[1] A |<NA>
-------
seqinfo: 1 sequence from an unspecified genome; no
seqlengths
hardFilters: NULL
Of course, if I run locateVariants() with the argument
ignore.strand=TRUE, then I get the expected annotation:
locateVariants(vr, txdb, region=AllVariants(),
ignore.strand=TRUE)
GRanges object with 2 ranges and 9 metadata columns:
seqnames ranges strand | LOCATION
LOCSTART
LOCEND QUERYID TXID CDSID
<Rle> <IRanges> <Rle> |<factor> <integer> <integer>
<integer>
<character> <IntegerList>
[1] chr15 [28356859, 28356859] + | threeUTR
677
677 1 55386
[2] chr15 [28356859, 28356859] + | threeUTR
677
677 1 55387
GENEID PRECEDEID FOLLOWID
<character> <CharacterList> <CharacterList>
[1] 8924
[2] 8924
-------
seqinfo: 1 sequence from an unspecified genome; no
seqlengths
So, my question is, given that VRanges objects are enforced
to
have
a positive strand, would not be better to have
ignore.strand=TRUE
as
default in locateVariants?
Alternatively, I would recommend that locateVariants() issues
a
warning, or maybe an error, when the input object is VRanges
and
ignore.strand=FALSE.
Finally, out of curiosity, why a VRanges object enforces the
positive strand in all its genomic ranges? Would not be
better
just
taking the strand of the CollapsedVCF object when coercing
the
CollapsedVCF object to VRanges?
thanks!!
robert.
_______________________________________________
Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org>
mailing
list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel