Hi, Thanks for the code snippet, it's working like a charm.
Yours, Steffen Am Friday, den 16.04.2010, 14:17 -0400 schrieb Kasper Daniel Hansen: > It has been incorporated into Biostrings a while ago. But it turns > out that it only really works if you call it like I did in an earlier > email. In other words, there were severe problems with argument > checking etc. And don't use write.XString - that is pretty > inefficient. I will get it fixed so that write.XStringSet just calls > my new writeFASTA or similar. > > You can test my "new" patch below. It is very fast for the genome > case (a few, very long sequences) and (I believe) horrible inefficient > for the short read case (many, short sequences). I know what I need > to do to fix up the short read case, but I don't have time to do so > until later this weekend/early next week. For your use case it may be > fast, depending on how many sequences you have. Otherwise you will > have to wait for me fixing the short read case. > > Please write back with possible bugs. Below 'x' could be a XStringSet > (your use case) and the names of the sequences will be used as the > FASTA record names, unless you give it a "desc" argument (character > vector of record names). > > writeFASTA <- function(x, file="", desc=NULL, append=FALSE, width=80) > { > if (!isTRUEorFALSE(append)) > stop("'append' must be TRUE or FALSE") > if (isSingleString(file)) { > if (file == "") { > file <- stdout() > } else { > file <- file(file, ifelse(append, "a", "w")) > on.exit(close(file)) > } > } else if (inherits(file, "connection")) { > if (!isOpen(file)) { > file <- file(file, ifelse(append, "a", "w")) > on.exit(close(file)) > } > } else { > stop("'file' must be a single string or connection") > } > if (!isSingleNumber(width)) > stop("'width' must be an integer >= 1") > if (!is.integer(width)) > width <- as.integer(width) > if (width < 1L) > stop("'width' must be an integer >= 1") > if(!(is.character(x) || is(x, "XString") || is(x, "XStringSet") || > is(x, "BSgenome") || (is.list(x) && "seq" %in% names(x)))) > stop("'x' does not have the appropriate type") > #browser() > if(is.character(x)) > x <- BStringSet(x, use.names = TRUE) > if(is.list(x)) { > if(is.null(desc)) > desc <- x$desc > x <- BStringSet(x$seq) > } > if(is(x, "XString")) { > nLengths <- length(x) > } > if(is(x, "XStringSet")) { > nLengths <- width(x) > } > if(is(x, "BSgenome")) { > nLengths <- seqlengths(x) > } > > if (!is.null(desc) && !(is.character(desc) && length(desc) == > length(nLengths))) > stop("when specified, 'desc' must be a character vector of the > same length as the 'x' object") > if(is.null(desc)) > desc <- names(x) > if(is.null(desc)) > desc <- rep("", length(nLengths)) > if(length(nLengths) != length(desc)) > stop("wrong length of 'desc'") > > writeBString <- function(bstring) > { > if (length(bstring) == 0L) > return() > nlines <- (length(bstring) - 1L) %/% width + 1L > lineIdx <- seq_len(nlines) > start <- (lineIdx - 1L) * width + 1L > end <- start + width - 1L > if (end[nlines] > length(bstring)) > end[nlines] <- length(bstring) > bigstring <- paste( > as.character(Views(bstring, start = start, end = end)), > collapse="\n" > ) > cat(bigstring, "\n", file=file, sep="") > } > > if(is(x, "XString")) { > cat(">", desc, "\n", file = file, sep = "") > writeBString(x) > } else { > for (ii in seq_len(length(nLengths))) { > cat(">", desc[ii], "\n", file = file, sep = "") > writeBString(x[[ii]]) > } > } > } > > > On Fri, Apr 16, 2010 at 1:55 PM, Neumann, Steffen <sneum...@ipb-halle.de> > wrote: > > Hi, > > > > thanks for your blazing fast answers. > > Where did you send the patch ? > > Can I find that somewhere ? > > > > Steffen > > > > > > -----Original Message----- > > From: Kasper Daniel Hansen [mailto:kasperdanielhan...@gmail.com] > > Sent: Fri 4/16/2010 15:55 > > To: Neumann, Steffen > > Cc: bioc-sig-sequenc...@stat.math.ethz.ch > > Subject: Re: [Bioc-sig-seq] write.XStringSet() terribly slow > > > > I don't know if there has been a refactoring of the code, but I while > > ago I send a patch to writeFASTA making it magnitudes faster, so you > > should perhaps try that one. The patch makes it pretty fast to dump > > entire bsgenomes into fasta files. > > > > Kasper > > > > On Fri, Apr 16, 2010 at 9:17 AM, Steffen Neumann <sneum...@ipb-halle.de> > > wrote: > >> Hi, > >> > >> I have some major performance problems writing fasta files > >> with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one > >> DNAString, > >> and writing that to a file takes ages, as you see from the strace output > >> below: I obtain ~5 lines (80 chars each) per second. The runtime > >> of the system call <in brackets> is neglectible. > >> > >> library(Biostrings) > >> chromosome <-read.DNAStringSet("Chr1_TAIR9.fasta", "fasta") > >> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta") > >> > >> Is there a fundamental flaw in my thinking ? > >> Is there an alternative to write.XStringSet() ? > >> This happens both on my laptop and a beefy server. > >> > >> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22, > >> and get ~11 lines per second. > >> > >> Yours, > >> Steffen > >> > >> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80 > >> <0.000137> > >> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80 > >> <0.000142> > >> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80 > >> <0.000133> > >> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80 > >> <0.000159> > >> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80 > >> <0.000133> > >> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80 > >> <0.000136> > >> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80 > >> <0.000594> > >> > >> sessionInfo() > >> R version 2.10.0 (2009-10-26) > >> x86_64-unknown-linux-gnu > >> > >> locale: > >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > >> [9] LC_ADDRESS=C LC_TELEPHONE=C > >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >> > >> attached base packages: > >> [1] stats graphics grDevices utils datasets methods base > >> > >> other attached packages: > >> [1] Biostrings_2.14.12 IRanges_1.4.16 > >> > >> loaded via a namespace (and not attached): > >> [1] Biobase_2.6.0 > >> > >> -- > >> IPB Halle AG Massenspektrometrie & Bioinformatik > >> Dr. Steffen Neumann http://www.IPB-Halle.DE > >> Weinberg 3 http://msbi.bic-gh.de > >> 06120 Halle Tel. +49 (0) 345 5582 - 1470 > >> +49 (0) 345 5582 - 0 > >> sneumann(at)IPB-Halle.DE Fax. +49 (0) 345 5582 - 1409 > >> > >> _______________________________________________ > >> Bioc-sig-sequencing mailing list > >> Bioc-sig-sequencing@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > >> > > > > > > -- IPB Halle AG Massenspektrometrie & Bioinformatik Dr. Steffen Neumann http://www.IPB-Halle.DE Weinberg 3 http://msbi.bic-gh.de 06120 Halle Tel. +49 (0) 345 5582 - 1470 +49 (0) 345 5582 - 0 sneumann(at)IPB-Halle.DE Fax. +49 (0) 345 5582 - 1409 _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing