Hi,
I have the same problem. I want to write ~4 million short (25 bp)
sequences into one FASTA file. write.XStringSet() is very slow, and
writeFASTA() is also very slow: only about 1500 sequences are written per minute.
Are there any alternatives?
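One workaround that may be worth trying for many short reads is to build the FASTA lines as a plain character vector and hand them to a single writeLines() call. The sketch below uses base R only. The helper name write_fasta_simple and the toy reads are hypothetical and not part of Biostrings or ShortRead. It assumes every sequence fits on one line, which holds for 25 bp reads, and that the sequences and names are already available as character vectors (for a DNAStringSet, presumably via as.character() and names()).

```r
# Hypothetical helper (not part of Biostrings or ShortRead): writes one
# header line and one sequence line per record using base R only.
write_fasta_simple <- function(seqs, ids, path) {
  stopifnot(is.character(seqs), is.character(ids),
            length(seqs) == length(ids))
  # Each element becomes ">id\nsequence"; writeLines() appends the
  # final newline of every record.
  writeLines(paste0(">", ids, "\n", seqs), con = path)
  invisible(path)
}

# Toy input standing in for the real reads (assumption: the real data
# would come from as.character(reads) and names(reads)).
seqs <- c("ACGTACGTACGTACGTACGTACGTA", "TTTTTGGGGGCCCCCAAAAATTTTT")
ids <- c("read1", "read2")
out <- tempfile(fileext = ".fasta")
write_fasta_simple(seqs, ids, out)
```

Because the whole output is assembled in memory first, this trades RAM for speed. For 4 million 25 bp reads that should be manageable, but I have not timed it on data of that size.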
Best wishes,
Hans-Ulrich
> sessionInfo()
R version 2.11.0 RC (2010-04-19 r51778)
x86_64-pc-linux-gnu
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.6.2 Rsamtools_1.0.1 lattice_0.18-5
[4] Biostrings_2.16.0 GenomicRanges_1.0.1 IRanges_1.6.0
loaded via a namespace (and not attached):
[1] Biobase_2.8.0 grid_2.11.0 hwriter_1.2 tools_2.11.0
Steffen Neumann wrote:
Hi,
I have some major performance problems writing fasta files
with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one DNAString,
and writing that to a file takes ages, as you can see from the strace output
below: I get ~5 lines (80 chars each) per second. The runtime
of each system call (in angle brackets) is negligible.
library(Biostrings)
chromosome<-read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
Is there a fundamental flaw in my thinking?
Is there an alternative to write.XStringSet()?
This happens both on my laptop and a beefy server.
I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
and get ~11 lines per second.
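For a single long sequence, one possible alternative is to do the 80-column wrapping on a raw vector and write the result in one go, so that the number of write calls no longer depends on the number of lines. This is only a sketch under assumptions: the helper name write_long_fasta and the toy sequence are hypothetical, the sequence is assumed to be plain ASCII and available as one character string (for a DNAString, presumably via as.character()), and it has not been timed on the full chromosome.

```r
# Hypothetical helper (not part of Biostrings): wraps one long sequence
# at `width` columns by inserting newline bytes into a raw vector, then
# writes header and body with two writeBin() calls.
write_long_fasta <- function(s, id, path, width = 80L) {
  stopifnot(is.character(s), length(s) == 1L, width >= 1L)
  width <- as.integer(width)
  r <- charToRaw(s)                      # one byte per base (ASCII)
  n <- length(r)
  n_lines <- ceiling(n / width)
  # Start with a buffer full of newlines (0x0a), one extra byte per line.
  out <- rep(as.raw(10L), n + n_lines)
  # Base i lands at position i plus the number of newlines before it.
  pos <- seq_len(n) + (seq_len(n) - 1L) %/% width
  out[pos] <- r
  con <- file(path, "wb")
  on.exit(close(con))
  writeBin(charToRaw(paste0(">", id, "\n")), con)
  writeBin(out, con)
  invisible(path)
}

# Toy input: 200 bases, so the body should wrap into 80 + 80 + 40.
s <- paste(rep("ACGT", 50), collapse = "")
out <- tempfile(fileext = ".fasta")
write_long_fasta(s, "Chr1_toy", out)
```

The index vector costs a few bytes per base, so a 30 MB chromosome needs on the order of a few hundred MB of working memory with this approach.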
Yours,
Steffen
13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80 <0.000137>
13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80 <0.000142>
13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80 <0.000133>
13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80 <0.000159>
13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80 <0.000133>
13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80 <0.000136>
13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80 <0.000594>
sessionInfo()
R version 2.10.0 (2009-10-26)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.14.12 IRanges_1.4.16
loaded via a namespace (and not attached):
[1] Biobase_2.6.0
--
Hans-Ulrich Klein
Department of Medical Informatics and Biomathematics
University of Münster
Domagkstrasse 9
48149 Münster, Germany
Tel.: +49 (0)251 83-58405
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing