Hi,

I have the same problem. I want to write about 4 million short (25 bp) sequences into a single FASTA file. write.XStringSet() is very slow, and writeFASTA() is also very slow: only about 1500 sequences are written per minute.

Are there any alternatives?
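
One possible workaround, given here only as a sketch and not benchmarked on the data described above: build all FASTA lines as a plain character vector and hand them to writeLines() in a single call. The helper name write_fasta_fast is hypothetical (it is not part of Biostrings or ShortRead), and the sketch assumes every sequence is short enough to fit on one line, which holds for 25 bp reads.

```r
# Hypothetical helper, not part of Biostrings or ShortRead.
# Assumes each sequence fits on one line (no wrapping is done).
write_fasta_fast <- function(seqs, ids, path) {
  stopifnot(length(seqs) == length(ids))
  n <- length(seqs)
  # Interleave records as: >id1, seq1, >id2, seq2, ...
  lines <- character(2L * n)
  lines[seq(1L, by = 2L, length.out = n)] <- sprintf(">%s", ids)
  lines[seq(2L, by = 2L, length.out = n)] <- seqs
  # One vectorized write instead of one call per record
  writeLines(lines, con = path)
  invisible(path)
}

# Small self-contained example with made-up reads
reads <- c("ACGTACGTACGTACGTACGTACGTA", "TTTTGGGGCCCCAAAATTTTGGGGC")
ids <- c("read1", "read2")
out <- tempfile(fileext = ".fasta")
write_fasta_fast(reads, ids, out)
```

With a DNAStringSet x, the inputs could presumably be obtained with as.character(x) and names(x); whether this is fast enough for 4 million reads would need to be measured.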

Best wishes,
Hans-Ulrich


> sessionInfo()
R version 2.11.0 RC (2010-04-19 r51778)
x86_64-pc-linux-gnu

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.6.2     Rsamtools_1.0.1     lattice_0.18-5
[4] Biostrings_2.16.0   GenomicRanges_1.0.1 IRanges_1.6.0

loaded via a namespace (and not attached):
[1] Biobase_2.8.0 grid_2.11.0   hwriter_1.2   tools_2.11.0





Steffen Neumann wrote:
Hi,

I have some major performance problems writing fasta files
with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one DNAString,
and writing that to a file takes ages, as you see from the strace output
below: I obtain ~5 lines (80 chars each) per second. The runtime
of the system call (shown in angle brackets) is negligible.

library(Biostrings)
chromosome<-read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")

Is there a fundamental flaw in my thinking?
Is there an alternative to write.XStringSet()?
This happens both on my laptop and a beefy server.

I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
and get ~11 lines per second.

Yours,
Steffen

13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80<0.000137>
13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80<0.000142>
13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80<0.000133>
13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80<0.000159>
13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80<0.000133>
13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80<0.000136>
13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80<0.000594>
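
The timestamps in the trace are roughly 0.19 s apart while each write() itself takes well under a millisecond, so the time appears to be spent between the calls, preparing each 80-character line. A hedged sketch of one alternative for a single long sequence: cut the string into fixed-width lines with one vectorized substring() call and pass them all to writeLines() at once. The helper names wrap_sequence and write_wrapped_fasta are hypothetical, and this has not been timed on a 30 MByte chromosome.

```r
# Hypothetical helper: split one long sequence string into fixed-width lines.
wrap_sequence <- function(dna, width = 80L) {
  n <- nchar(dna)
  if (n == 0L) return(character(0))
  starts <- seq(1L, n, by = width)
  # Vectorized over all line starts; the last line may be shorter
  substring(dna, starts, pmin(starts + width - 1L, n))
}

# Hypothetical helper: write one FASTA record with a single writeLines() call.
write_wrapped_fasta <- function(id, dna, path, width = 80L) {
  header <- paste(">", id, sep = "")
  writeLines(c(header, wrap_sequence(dna, width)), con = path)
  invisible(path)
}

# Small self-contained example with a made-up sequence
demo_path <- tempfile(fileext = ".fasta")
write_wrapped_fasta("demo", "ACGTACGTAC", demo_path, width = 4L)
```

For the DNAStringSet read above, the inputs could presumably be taken from names(chromosome)[1] and as.character(chromosome[[1]]).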

sessionInfo()
R version 2.10.0 (2009-10-26)
x86_64-unknown-linux-gnu

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.14.12 IRanges_1.4.16

loaded via a namespace (and not attached):
[1] Biobase_2.6.0



--
Hans-Ulrich Klein
Department of Medical Informatics and Biomathematics
University of Münster
Domagkstrasse 9
48149 Münster, Germany
Tel.: +49 (0)251 83-58405

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
