Thanks Hervé and others for looking into this. Leonard
On Sun, Oct 29, 2017 at 10:19 PM, Hervé Pagès <hpa...@fredhutch.org> wrote: > Thanks Dario and Mike for looking into this. > > In the mean time I added makeTxDbFromEnsembl() for creating a TxDb > object by querying directly an Ensembl MySQL server. Only lightly > tested but so far seems to faster and more reliable than > makeTxDbFromBiomart(): > > > library(GenomicFeatures) > > > system.time(txdb <- makeTxDbFromEnsembl("Homo sapiens", server=" > useastdb.ensembl.org")) > Fetch transcripts and genes from Ensembl ... OK > Fetch exons and CDS from Ensembl ... OK > Fetch chromosome names and lengths from Ensembl ...OK > Gather the metadata ... OK > Make the TxDb object ... OK > user system elapsed > 46.420 1.271 58.270 > > > txdb > TxDb object: > # Db type: TxDb > # Supporting package: GenomicFeatures > # Data source: Ensembl > # Organism: Homo sapiens > # Ensembl release: 90 > # Ensembl database: homo_sapiens_core_90_38 > # MySQL server: useastdb.ensembl.org > # transcript_nrow: 220144 > # exon_nrow: 757782 > # cds_nrow: 789614 > # Db created by: GenomicFeatures package from Bioconductor > # Creation time: 2017-10-29 22:13:21 -0700 (Sun, 29 Oct 2017) > # GenomicFeatures version at creation time: 1.29.16 > # RSQLite version at creation time: 2.0 > # DBSCHEMAVERSION: 1.2 > > It's in GenomicFeatures 1.29.16. > > Cheers, > H. > > > On 10/28/2017 03:32 PM, Mike Smith wrote: > >> My feeling is that this isn't related to RCurl, since the example >> transcript is missing if you use httr to submit the query instead. You >> can >> check this with the code at >> https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.gi >> thub.com_grimbough_7e7a47b7a4f64915220ce35cc1ce8f39&d= >> DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0W >> YiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7F >> WBM&s=orkmK9y7kVCLbLB6tlMJfgT5V_GKzVysZ00DMevuAm4&e= >> >> >> I wonder if this is related to BioMart's instability when you submit large >> queries. The web interface suggests no more than 500 entries when >> filtering on something like a gene ID, but in these examples we're >> applying >> no filter at all. Problems related to this have been reported before ( >> https://urldefense.proofpoint.com/v2/url?u=https-3A__support >> .bioconductor.org_p_86358_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfh >> Q&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV- >> BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=CxVV4WArQc4xcf74Az8V8vfQ >> Mr80Q8K2um6sKTggHw0&e=) and I patched the devel version >> >> of biomaRt to submit queries in batches of 500 - although this currently >> has no effect if there isn't a set of 'values' to split into batches. >> >> Interestingly, if I first obtain a list of all gene IDs in Ensembl, and >> then submit those in a batched query I get more exons than your first >> approach. >> >> all_genes <- getBM(attributes = "ensembl_gene_id", mart=mart) >> attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id") >> attributes2 <- c(attributes1, "rank", "genomic_coding_start", "cds_start", >> "5_utr_start") >> df3 <- getBM(attributes1, filters = "ensembl_gene_id", values = >> all_genes[,1], mart=mart) >> df4 <- getBM(attributes2, filters = "ensembl_gene_id", values = >> all_genes[,1], mart=mart) >> >> dim(df3) >>> >> [1] 1309356 6 >> >> The number of returned results is consistent for even with the larger >> number of attributes >> >> identical(df3[,1], df4[,1]) >>> >> [1] TRUE >> >> And some of these transcripts are clearly not present in the first query: >> >> length(which( !unique(df3[,"ensembl_transcript_id"]) %in% df1[, >>> >> "ensembl_transcript_id"] )) >> [1] 28128 >> >> It could be that my batched code is doing something wrong, but I'm >> starting >> to suspect that even the first query is dropping data silently! >> >> Mike >> >> >> On 28 October 2017 at 13:00, Dario Strbenac <dstr7...@uni.sydney.edu.au> >> wrote: >> >> Good day, >>> >>> I stepped through the code until execution reached the end of postForm in >>> RCurl which is called by getBM and obtains the textual result from the >>> server. If I check the contents of write$value(), the example missing >>> transcript is not there. >>> >>> Browse[3]> grep("ENST00000485971", write$value()) >>> integer(0) >>> >>> write$value is a weird function. It's prototype is function (collapse = >>> "", ...) but its body contains code such as >>> >>> if (is.null(collapse)) >>> return(txt) >>> >>> I wonder where txt is created. It's not passed as an extra variable. >>> >>> Browse[7]> print(list(...)) >>> list() >>> >>> Searching the R code reveals that txt is created as a global variable in >>> another function named dynCurlReader by the code statement txt <<- >>> character(). >>> >>> RCurl also uses functions that don't begin with a dot but are >>> undocumented. >>> >>> ans = encode(ans) >>> Browse[7]> ?encode >>> No documentation for ‘encode’ in specified packages and libraries >>> >>> Anyway, the transcript ID is also missing from txt. >>> >>> Browse[7]> grep("ENST00000485971", txt) >>> integer(0) >>> >>> It's hard to know what the obfuscated code of RCurl is doing. >>> >>> -------------------------------------- >>> Dario Strbenac >>> University of Sydney >>> Camperdown NSW 2050 >>> Australia >>> _______________________________________________ >>> Bioc-devel@r-project.org mailing list >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.et >>> hz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt >>> 84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=nj >>> pXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJD >>> MeyF5e0R7MZWxNlxOxbXiH1zLGA&e= >>> >>> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.et >> hz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt >> 84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=nj >> pXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJD >> MeyF5e0R7MZWxNlxOxbXiH1zLGA&e= >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fredhutch.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel