On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via Digitalmars-d-learn wrote:
[...]
> FWIW - I've been implementing a few programs manipulating delimited
> files, e.g. tab-delimited. Simpler than CSV files because there is no
> escaping inside the data. I've been trying to do this in relatively
> straightforward ways, e.g. using byLine rather than byChunk. (Goal is
> to explore the power of D standard libraries).
>
> I've gotten significant speed-ups in a couple different ways:
> * DMD libraries 2.068+ - byLine is dramatically faster
> * LDC 0.17 (alpha) - Based on DMD 2.068, and faster than the DMD compiler

While byLine has improved a lot, it's still not the fastest thing in the
world, because it still performs (at least) one OS roundtrip per line,
not to mention it will auto-reencode to UTF-8. If your data is already
in a known encoding, reading in the entire file and casting to
(|w|d)string then splitting it by line will be a lot faster, since you
can eliminate a lot of I/O roundtrips that way.

In any case, it's well-known that gdc/ldc generally produce code that's
about 20%-30% faster than dmd-compiled code, sometimes a lot more. While
DMD has gotten some improvements in this area recently, it still has a
long way to go before it can catch up. For performance-sensitive code I
always reach for gdc instead of dmd.


> * Avoid utf-8 to dchar conversion - This conversion often occurs
> silently when working with ranges, but is generally not needed when
> manipulating data.
[...]

Yet another nail in the coffin of auto-decoding. I wonder how many more
nails we will need before Andrei is convinced...


T

-- 
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
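
P.S. In case it helps, here is a rough sketch of the "read the whole
file, cast, then split" approach mentioned above. The function name
countFields and the field-counting task are made up purely for
illustration, and the sketch assumes the file fits in memory and is
already valid UTF-8 (the cast performs no validation). It has not been
benchmarked, so treat it as a starting point rather than a
recommendation.

```d
import std.algorithm : splitter;
import std.file : read;
import std.stdio : writeln;
import std.string : lineSplitter, representation;

/// Hypothetical helper: counts tab-separated fields over all lines.
/// Fields are split on raw bytes (via representation), so no
/// UTF-8 to dchar decoding happens while splitting.
size_t countFields(string data)
{
    size_t total = 0;
    foreach (line; data.lineSplitter)
    {
        // line is a slice of data; view it as immutable(ubyte)[].
        foreach (field; line.representation.splitter(cast(ubyte) '\t'))
            ++total;
    }
    return total;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: countfields <file>");
        return;
    }
    // One read for the whole file, then reinterpret the bytes as a
    // string. Assumption: the contents are known to be UTF-8.
    auto data = cast(string) read(args[1]);
    writeln(countFields(data));
}
```

Every line and field here is a slice into the one buffer, so nothing is
copied per line. If the input might not be valid UTF-8, std.file.readText
(which validates) would be the safer call, at some extra cost.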