Hello Dirk,
Got it sorted. The basic problem was that the output matrix's dimensions
have to be defined precisely.
I also had some problems with the first line (col names) and the first
column (row names).
But it works now.
Benchmarks against fread show that the code I use returns a lighter object
(a simple matrix) and thus processes faster.
Reading 400 files of 16.2 Mb each on 6 cores took 177.949 seconds with the
cpp function and 228.231 seconds with fread.
Needless to say, both are considerably faster than read.table (which took
21669 seconds!) and read_csv from the readr package (which took about the
same time as read.table).
I know it would be better to contribute this to the Rcpp Gallery, but for
now I only have time to post the code here:
#include <Rcpp.h>
#include <fstream>
#include <sstream>
#include <string>
using namespace Rcpp;

// Takes the path to a numeric file and returns the same data
// in a NumericMatrix object.
// [[Rcpp::export]]
NumericMatrix readfilecpp(std::string path)
{
  // Output matrix (specifying the size is critical, otherwise R crashes).
  NumericMatrix output(20, 46749);
  // Open the file. c_str() is needed here (before C++11) so that ifstream
  // accepts the string path.
  std::ifstream myfile(path.c_str());
  std::string line;
  // Skip the first line (col names in our case). Remove this line if not
  // necessary.
  std::getline(myfile, line, '\n');
  // Basic idea: getline() reads lines row = 0..19 and, for each line, puts
  // the values separated by ',' into 46749 columns.
  for (int row = 0; row < 20; ++row)
  {
    // Starts at the second line because the first one was ditched previously.
    // Break if there are no more rows to read.
    if (!std::getline(myfile, line, '\n'))
      break;
    std::stringstream iss(line); // take the line into a stringstream
    std::string val;
    std::getline(iss, val, ','); // skip the first column (row names)
    for (int col = 0; col < 46749; ++col)
    {
      // Read the next of the 46749 values (delimited by ',') from the line.
      std::getline(iss, val, ',');
      std::stringstream convertor(val); // put the value into another stringstream
      convertor >> output(row, col);    // convert it into the output matrix at (row, col)
    }
  }
  return output;
}
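
Since the hard-coded 20 x 46749 size only fits my files, here is a rough
sketch of how the dimensions could be inferred from the data instead. It is
plain C++ (no Rcpp) so it can be tried on its own. The names NumericTable and
readNumericCsv are made up for this example, and the use of std::strtod
instead of a second stringstream is my assumption, not something I have
benchmarked.

```cpp
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container: values are stored column-major, like an R matrix,
// so element (r, c) lives at values[r + c * nrow].
struct NumericTable {
    std::size_t nrow = 0;
    std::size_t ncol = 0;
    std::vector<double> values;
};

// Reads comma-separated numbers from `in`. Optionally skips a header line
// and a leading row-name field. Dimensions are inferred from the data;
// short rows are padded with 0.0.
NumericTable readNumericCsv(std::istream& in, bool skipHeader = true,
                            bool skipRowNames = true) {
    std::string line;
    if (skipHeader) std::getline(in, line);  // ditch the col names

    std::vector<std::vector<double> > rows;
    while (std::getline(in, line)) {
        if (line.empty()) continue;          // ignore blank lines
        std::stringstream fields(line);
        std::string field;
        if (skipRowNames) std::getline(fields, field, ',');  // ditch the row name
        std::vector<double> row;
        while (std::getline(fields, field, ',')) {
            // strtod returns 0.0 for a field that is not a number
            row.push_back(std::strtod(field.c_str(), nullptr));
        }
        rows.push_back(row);
    }

    NumericTable out;
    out.nrow = rows.size();
    for (std::size_t r = 0; r < rows.size(); ++r) {
        if (rows[r].size() > out.ncol) out.ncol = rows[r].size();
    }
    out.values.assign(out.nrow * out.ncol, 0.0);
    for (std::size_t r = 0; r < rows.size(); ++r) {
        for (std::size_t c = 0; c < rows[r].size(); ++c) {
            out.values[r + c * out.nrow] = rows[r][c];
        }
    }
    return out;
}
```

Because the values are already column-major, they could presumably be copied
into a NumericMatrix of size nrow x ncol on the Rcpp side. The price is that
every row is held in memory twice for a moment, so this may well be slower
than the fixed-size version above.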
On 20/04/15 13:16, Dirk Eddelbuettel wrote:
On 20 April 2015 at 12:01, ogami musashi wrote:
| The problem is, I have 400 objects of 16,5 Mb each, and it takes about 6 hours
| to reimport them in R! I use the readr package as this is the fastest base
| function in R.
a) readr != base R
b) fread in package data.table is considered the fastest reader function
| I adapted a C++ code to use Rcpp, it compiles but when using it it
| crashes R:
I fear you may have to debug that yourself. As for speed, you won't be able
to beat fread which has been optimised for this for years and uses mmap and
other tricks.
Dirk
_______________________________________________
Rcpp-devel mailing list
Rcpp-devel@lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel