Dear r-helpers,
I know that there have already been enough questions on I/O performance these last few days, but I came across the following situation today. I was comparing the performance of R with that of SAS's Risk Dimensions at generating random "scenarios". My dataset (all numeric entries) fit nicely into RAM, and R outperformed SAS until... I wanted to export the results to a .csv file using the write.table() function. For reference, the output file was about 30 MB. Moreover, the memory needed by R increased sharply during the writing process.
I had a look at the code for the write.table() function and I found out that, basically, what it does is to create a very long text string from the data using paste() and then to print it using writeLines(). Rprof() showed that writeLines() would only use a mere 3% of the computing time, the rest being taken almost entirely by paste().
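For anyone who wants to repeat the measurement, here is a minimal sketch of the profiling run. The data frame is made-up stand-in data of roughly the same flavour (all numeric), and the names scenarios, out_file and prof_file are only illustrative. The internals of write.table() differ between R versions, so the split between paste() and writeLines() described above may not show up on another installation.

```r
# Stand-in for the real scenario set: 100,000 rows x 20 numeric columns.
set.seed(1)
scenarios <- as.data.frame(matrix(rnorm(2e6), ncol = 20))

out_file  <- tempfile(fileext = ".csv")
prof_file <- tempfile(fileext = ".out")

# Sample the call stack while write.table() runs.
Rprof(prof_file, interval = 0.01)
write.table(scenarios, file = out_file, sep = ",", row.names = FALSE)
Rprof(NULL)

# Time attributed to each function, including the functions it calls.
prof <- summaryRprof(prof_file)
print(head(prof$by.total))
```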
There are two directions in which performance could potentially be improved:
1.- Writing speed.
2.- Memory usage.
Regarding memory usage, I thought that perhaps a small rewrite of the write.table() function could be considered: instead of building a single long text string in RAM, the data frame to be printed could, with little overhead, be split into shorter chunks, each of which would be paste()-d into a short, reusable "buffer" string and written sequentially to the output file. (Note: I know nothing about R's memory recycling rules, so this might not work as intended because of them.)
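A rough sketch of what I have in mind follows. The function write_in_chunks() and its chunk_rows argument are hypothetical names of my own, not part of R, and the sketch assumes a data frame with atomic columns. It writes an unquoted header and relies on paste()'s default number formatting, so the output is not byte-for-byte what write.table() produces. Whether the memory used by each chunk is actually reclaimed between iterations depends on the garbage collector.

```r
# Hypothetical helper: write a data frame to a delimited file one block of
# rows at a time, so only one chunk of text is held in memory at once.
write_in_chunks <- function(x, file, chunk_rows = 10000L, sep = ",") {
  con <- file(file, open = "wt")
  on.exit(close(con))

  # Header line (unquoted, unlike write.table()'s default).
  writeLines(paste(names(x), collapse = sep), con)

  n <- nrow(x)
  if (n == 0L) return(invisible(file))

  for (start in seq(1L, n, by = chunk_rows)) {
    rows  <- start:min(start + chunk_rows - 1L, n)
    chunk <- x[rows, , drop = FALSE]
    # paste() the columns of this chunk together, row by row.
    lines <- do.call(paste, c(unname(as.list(chunk)), sep = sep))
    writeLines(lines, con)
  }
  invisible(file)
}

# Example with made-up data.
demo_df   <- as.data.frame(matrix(rnorm(5000), ncol = 5))
demo_file <- tempfile(fileext = ".csv")
write_in_chunks(demo_df, demo_file, chunk_rows = 300L)
```

The chunk size trades peak memory against the number of paste() and writeLines() calls, so it would need some experimenting.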
Regarding speed, I see little hope as long as the paste() function is implicitly called by write.table(). Most likely, its execution time scales linearly with the number of lines in the data frame, so splitting the data frame would bring no benefit. Are there any hints on how a performance improvement could be achieved (other than by linking external, ad hoc C code)? Do we really need to go through paste()? Would it perhaps be beneficial to include in R some specialized functions that achieve high output performance for writing out, say, only numeric values (which happens to be my case most of the time)?
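To make the last question concrete, here is one way a numeric-only writer might look using only base R. It is a sketch, not a benchmark: write_numeric_csv() is a hypothetical name, the input is assumed to be all numeric with column names, and I have not measured whether it is actually faster. It goes through write(), which hands the values to cat() instead of paste(); the price is that cat() prints about 7 significant digits, so precision is lost compared with write.table().

```r
# Hypothetical numeric-only writer that avoids paste() on the data.
write_numeric_csv <- function(x, file, sep = ",") {
  m <- as.matrix(x)
  stopifnot(is.numeric(m))

  con <- file(file, open = "wt")
  on.exit(close(con))

  # Header line, assuming the columns are named.
  writeLines(paste(colnames(m), collapse = sep), con)

  # write() fills each output line with ncolumns values, so the matrix is
  # transposed first to emit the data row by row.
  write(t(m), file = con, ncolumns = ncol(m), sep = sep)
  invisible(file)
}

# Example with made-up data.
num_df   <- as.data.frame(matrix(runif(4000), ncol = 4))
num_file <- tempfile(fileext = ".csv")
write_numeric_csv(num_df, num_file)
```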
Sorry for the long posting.
Carlos J. Gil Bellosta
