The second mapping in .localCoordinates() in AllUtils.R was using a listed 'to' instead of unlisted 'to'. This means the mapping was at the level of the outer list elements vs the inner. We want to map/overlap with an unlisted object because each range can have a different CDSID. This second mapping provided the index for retrieving the CDSIDs which is why you saw a difference between the two.

This was a bug for predictCoding only.

Valerie


On 03/17/2015 11:39 AM, Robert Castelo wrote:
Thanks! Do you have an explanation for the apparent disagreement in
CDSID annotations that i described below the bug, between
predictCoding() and locateVariants()?

robert.


-------- Original message --------
From: Valerie Obenchain
Date:17/03/2015 19:18 (GMT+01:00)
To: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] little tiny bug in CDSID annotations from
predictCoding()

Hi Robert,

Thanks for reporting the typo and bug. Now fixed in 1.13.41.

Valerie

On 03/17/2015 10:58 AM, Robert Castelo wrote:
 > in my message below, the line that it says:
 >
 > head(loc_all$CDSID)
 >
 > it should say
 >
 > head(coding2$CDSID)
 >
 > cheers,
 >
 > robert.
 >
 > ==========================================
 > hi,
 >
 > there is a little tiny bug in the current devel version of
 > VariantAnnotation::predictCoding(), and more concretely within
 > VariantAnnotation:::.localCoordinates(), that precludes the correct
 > annotation of the CDSID column:
 >
 > library(VariantAnnotation)
 > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
 > library(BSgenome.Hsapiens.UCSC.hg19)
 >
 > vcf <- readVcf(system.file("extdata", "CEUtrio.vcf.bgz",
 > package="VariantFiltering"), genome="hg19")
 > seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
 > vcf <- dropSeqlevels(vcf, "chrM")
 > coding1 <- predictCoding(vcf, txdb, Hsapiens)
 > head(coding1$CDSID)
 > IntegerList of length 6
 > [[1]] integer(0)
 > [[2]] integer(0)
 > [[3]] integer(0)
 > [[4]] integer(0)
 > [[5]] integer(0)
 > [[6]] integer(0)
 > table(elementLengths(coding1$CDSID))
 >
 >     0
 > 6038
 >
 > my sessionInfo() is at the end of the message.
 >
 > here is the patch, just replacing 'map2$trancriptHits' by
 > 'map2$transcriptsHits':
 >
 > --- R/AllUtilities.R    (revision 100756)
 > +++ R/AllUtilities.R    (working copy)
 > @@ -284,7 +284,7 @@
 >       cdsid <- IntegerList(integer(0))
 >       map2 <- mapToTranscripts(unname(from)[xHits], to,
 >                                ignore.strand=ignore.strand)
 > -    cds <- mcols(unlist(to,
use.names=FALSE))$cds_id[map2$trancriptsHits]
 > +    cds <- mcols(unlist(to,
use.names=FALSE))$cds_id[map2$transcriptsHits]
 >       if (length(cds)) {
 >           cdslst <- unique(splitAsList(cds, map2$xHits))
 >           cdsid <- cdslst
 >
 > with this fix then things seem to work again:
 >
 > coding1 <- predictCoding(vcf, txdb, Hsapiens)
 >> head(coding1$CDSID)
 > IntegerList of length 6
 > [["1"]] 21771
 > [["2"]] 21771
 > [["3"]] 21771
 > [["4"]] 21771
 > [["5"]] 21428
 > [["6"]] 21428
 > table(elementLengths(coding1$CDSID))
 >
 >     1    2    3    4    5    6    7    8    9   10   12   13   14
16   19
 >   873 1229 1024  993  615  524  324  168   82   21   12   15   42
76   40
 >
 > while investigating this bug i used VariantAnnotation::locateVariants()
 > which also annotates the CDSID column, and it seemed to be working.
 > however, i noticed that both, predictCoding() and locateVariants(), do
 > not give an identical annotation for the CDSID column in coding variants:
 >
 > coding2 <- locateVariants(vcf, txdb, CodingVariants())
 > head(loc_all$CDSID)
 > IntegerList of length 6
 > [["1"]] 210777
 > [["2"]] 210777
 > [["3"]] 210777
 > [["4"]] 210778
 > [["5"]] 208140
 > [["6"]] 208141
 > table(elementLengths(coding2$CDSID))
 >
 >     1    2    3    4
 > 4987  901  138   12
 >
 > in principle, it seems that both are annotating valid CDSID keys:
 >
 > allcdsinfo <- select(txdb, keys=keys(txdb, keytype="CDSID"),
 > columns="CDSID", keytype="CDSID")
 > sum(!as.character(unlist(coding1$CDSID, use.names=FALSE)) %in%
 > allcdsinfo$CDSID)
 > [1] 0
 > sum(!as.character(unlist(coding2$CDSID, use.names=FALSE)) %in%
 > allcdsinfo$CDSID)
 > [1] 0
 >
 > but predictCoding() annotates CDSID values that are not present in
 > locateVariants() annotations and viceversa:
 >
 > sum(!as.character(unlist(coding1$CDSID, use.names=FALSE)) %in%
 > as.character(unlist(coding2$CDSID, use.names=FALSE)))
 > [1] 24057
 > sum(!as.character(unlist(coding2$CDSID, use.names=FALSE)) %in%
 > as.character(unlist(coding1$CDSID, use.names=FALSE)))
 > [1] 7251
 >
 > length(unique(intersect(as.character(unlist(coding2$CDSID,
 > use.names=FALSE)), as.character(unlist(coding1$CDSID,
use.names=FALSE)))))
 > [1] 0
 >
 > should not both annotate the same CDSID values on coding variants?
 >
 >
 > thanks!
 > robert.
 > ps: sessionInfo()
 > R Under development (unstable) (2014-10-14 r66765)
 > Platform: x86_64-unknown-linux-gnu (64-bit)
 >
 > locale:
 >   [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C
 >   [3] LC_TIME=en_US.UTF8        LC_COLLATE=en_US.UTF8
 >   [5] LC_MONETARY=en_US.UTF8    LC_MESSAGES=en_US.UTF8
 >   [7] LC_PAPER=en_US.UTF8       LC_NAME=C
 >   [9] LC_ADDRESS=C              LC_TELEPHONE=C
 > [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
 >
 > attached base packages:
 > [1] stats4    parallel  stats     graphics  grDevices
 > [6] utils     datasets  methods   base
 >
 > other attached packages:
 >   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.1
 >   [2] GenomicFeatures_1.19.31
 >   [3] AnnotationDbi_1.29.17
 >   [4] Biobase_2.27.2
 >   [5] BSgenome.Hsapiens.UCSC.hg19_1.4.0
 >   [6] BSgenome_1.35.17
 >   [7] rtracklayer_1.27.8
 >   [8] VariantAnnotation_1.13.40
 >   [9] Rsamtools_1.19.44
 > [10] Biostrings_2.35.11
 > [11] XVector_0.7.4
 > [12] GenomicRanges_1.19.46
 > [13] GenomeInfoDb_1.3.13
 > [14] IRanges_2.1.43
 > [15] S4Vectors_0.5.22
 > [16] BiocGenerics_0.13.6
 > [17] vimcom_1.0-0
 > [18] setwidth_1.0-3
 > [19] colorout_1.0-3
 >
 > loaded via a namespace (and not attached):
 >   [1] BiocParallel_1.1.15      biomaRt_2.23.5
 >   [3] bitops_1.0-6             DBI_0.3.1
 >   [5] GenomicAlignments_1.3.31 RCurl_1.95-4.5
 >   [7] RSQLite_1.0.0            tools_3.2.0
 >   [9] XML_3.98-1.1             zlibbioc_1.13.2
 >
 > _______________________________________________
 > Bioc-devel@r-project.org mailing list
 > https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to