On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote:
> I have been reading large text files with D's csv file reader and have found it slow compared to R's read.table function, which is not known to be particularly fast.

FWIW - I've been implementing a few programs that manipulate delimited files, e.g. tab-delimited ones. These are simpler than CSV files because there is no escaping inside the data. I've been trying to do this in relatively straightforward ways, e.g. using byLine rather than byChunk (the goal being to explore the power of the D standard libraries).
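
To make that concrete, here's a minimal sketch of the style (not one of the actual tools; it assumes a tab-delimited file is passed as the first argument and prints the second field of each line):

import std.stdio;
import std.algorithm : splitter;

void main(string[] args)
{
    auto file = File(args[1]);
    foreach (line; file.byLine)        // yields char[]; the buffer is reused
    {
        auto fields = line.splitter('\t');
        if (!fields.empty)
        {
            fields.popFront();         // skip the first field
            if (!fields.empty)
                writeln(fields.front); // print the second field
        }
    }
}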

I've gotten significant speed-ups in a couple different ways:
* DMD libraries 2.068+ - byLine is dramatically faster
* LDC 0.17 (alpha) - based on DMD 2.068, and faster than the DMD compiler
* Avoid utf-8 to dchar conversion - this conversion often occurs silently when working with ranges, but is generally not needed when manipulating data (see the sketch after this list)
* Avoid unnecessary string copies, e.g. don't gratuitously convert char[] to string
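
To illustrate the last two points (again just a sketch, assuming tab-delimited input on stdin): byLine hands back char[] slices into a reused buffer rather than freshly allocated strings, and byCodeUnit keeps range algorithms stepping through UTF-8 code units instead of silently decoding each element to dchar:

import std.stdio;
import std.algorithm : count, splitter;
import std.utf : byCodeUnit;

void main()
{
    size_t totalFields;
    foreach (line; stdin.byLine)  // char[] slice; no per-line string copy
    {
        // byCodeUnit wraps the slice so splitter and count walk UTF-8
        // code units directly, avoiding the implicit dchar decoding.
        totalFields += line.byCodeUnit.splitter('\t').count;
    }
    writeln("total fields: ", totalFields);
}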

At this point the performance of the utilities I've been writing is quite good. They don't have direct equivalents among other tools (such as GNU coreutils), so a head-to-head comparison isn't appropriate, but generally they seem quite competitive without my needing to do my own buffer or memory management. And they are dramatically faster than the same tools written in Perl (which I had been happy with).

--Jon
