Hello Dirk,

Got it sorted. The basic problem was that the output matrix's dimensions have to be defined precisely.

I also had some problems with the first line (column names) and the first column (row names).
But it works now.
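
For anyone who does not want to hard-code the dimensions, here is a minimal sketch of a first pass that counts them. It is plain standard C++ rather than Rcpp, and it assumes the same layout as my files: one header line, a row name in the first field of every line, and ',' as separator. The names Dims and count_dims are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper type: number of data rows and data columns.
struct Dims {
    std::size_t rows;
    std::size_t cols;
};

// Hypothetical helper: counts data rows and data columns of a delimited
// stream whose first line holds column names and whose first field on
// every line is a row name.
Dims count_dims(std::istream& in, char sep = ',')
{
    assert(sep != '\n'); // the separator must differ from the line terminator

    Dims d = {0, 0};
    std::string line;

    if (!std::getline(in, line)) // header line; empty input gives 0 x 0
        return d;

    while (std::getline(in, line)) {
        if (line.empty())
            continue; // ignore blank lines, e.g. a trailing one
        if (d.rows == 0) {
            // fields = separators + 1; dropping the row name leaves
            // exactly as many data columns as there are separators
            for (std::size_t i = 0; i < line.size(); ++i)
                if (line[i] == sep)
                    ++d.cols;
        }
        ++d.rows;
    }
    return d;
}

// Convenience overload for trying the helper on an in-memory string.
Dims count_dims_text(const std::string& text, char sep = ',')
{
    std::istringstream in(text);
    return count_dims(in, sep);
}
```

After such a pass the stream has to be rewound (clear() followed by seekg(0)) before the values are read, and the matrix could then be allocated as NumericMatrix output(d.rows, d.cols). The extra pass costs time, so for files of known size the hard-coded version is probably still faster.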

Benchmarks against fread show that the code I use returns a lighter object (a simple matrix) and thus runs faster.

Reading 400 files of 16.2 MB each on 6 cores took 177.949 seconds with the C++ function and 228.231 seconds with fread.

Needless to say, both are considerably faster than read.table (which took 21669 seconds!) and read_csv from the readr package (which took about the same).


I know it would be better to contribute this to the Rcpp Gallery, but for now I only have time to post the code here:

#include <Rcpp.h>
#include <fstream>
#include <sstream>
#include <string>
using namespace Rcpp;


// Takes the path to a numeric file and returns the same data as a NumericMatrix object

// [[Rcpp::export]]
NumericMatrix readfilecpp(std::string path)
{
    NumericMatrix output(20, 46749); // output matrix (specifying the size is critical, otherwise R crashes)

    std::ifstream myfile(path.c_str()); // opens the file; c_str() is needed when compiling without C++11 so that ifstream accepts the path

    std::string line;
    std::getline(myfile, line, '\n'); // skip the first line (column names in our case); remove this line if not necessary

    for (int row = 0; row < 20; ++row) // basic idea: getline() reads lines row = 0..19 and, for each line, puts the values separated by ',' into 46749 columns
    {
        // starts at the second line because the first one was skipped above;
        // testing getline() itself also keeps a last line that has no trailing newline
        if (!std::getline(myfile, line, '\n')) // no more rows: stop
            break;

        std::stringstream iss(line); // put the line into a stringstream
        std::string val;
        std::getline(iss, val, ','); // skip the first column (row names)

        for (int col = 0; col < 46749; ++col)
        {
            std::getline(iss, val, ','); // read the next of the 46749 values (delimited by ',' in the stringstream)

            std::stringstream convertor(val); // put the value into another stringstream, 'convertor'
            convertor >> output(row, col);    // store the result in the output matrix at the current row and col
        }
    }
    return output;
}
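
The inner loop above builds one stringstream per cell. A possible variation, which I have not benchmarked, is to parse each line in place with std::strtod. The sketch below is plain standard C++ rather than Rcpp; parse_row is a made-up name, and it assumes the same layout (row name first, ',' as separator, '.' as decimal mark under the default "C" locale).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: parses up to `ncol` comma-separated numbers from
// `line` into `out`, skipping the first field (the row name).
// `out` is resized to `ncol` and zero-filled first.
// Returns the number of values actually parsed.
std::size_t parse_row(const std::string& line, std::size_t ncol,
                      std::vector<double>& out)
{
    assert(ncol > 0); // a row with no data columns makes no sense here

    out.assign(ncol, 0.0);

    std::size_t pos = line.find(','); // end of the row-name field
    if (pos == std::string::npos)
        return 0; // no data fields on this line

    const char* p = line.c_str() + pos + 1;
    std::size_t col = 0;

    while (col < ncol && *p != '\0') {
        char* end = 0;
        double v = std::strtod(p, &end);
        if (end == p)
            break; // field is empty or not a number: stop
        out[col++] = v;
        p = end;
        if (*p == ',')
            ++p;   // step over the separator
        else
            break; // end of line or unexpected character
    }
    return col;
}
```

In the Rcpp function the parsed values would then be copied into output(row, col) for col below the returned count. The return value also makes short or malformed rows visible, which the stringstream version silently ignores.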



On 20/04/15 13:16, Dirk Eddelbuettel wrote:
On 20 April 2015 at 12:01, ogami musashi wrote:
| The problem is that I have 400 objects of 16.5 MB each, and it takes about 6 hours
| to reimport them into R! I use the readr package as this is the fastest base
| function in R.

a) readr != base R

b) fread in package data.table is considered the fastest reader function

| I adapted some C++ code to use Rcpp. It compiles, but when I use it, it
| crashes R:

I fear you may have to debug that yourself.  As for speed, you won't be able
to beat fread which has been optimised for this for years and uses mmap and
other tricks.

Dirk


_______________________________________________
Rcpp-devel mailing list
Rcpp-devel@lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel