On Friday, 12 May 2017 at 00:18:47 UTC, H. S. Teoh wrote:
On Wed, May 10, 2017 at 11:40:08PM +0000, Jesse Phillips via
Digitalmars-d-learn wrote: [...]
H. S. Teoh mentioned fastcsv, but it requires all the data to be in
memory.
Or you could use std.mmfile. But if it's decompressed data,
then it would still need to be small enough to fit in memory.
Well, in theory you *could* use std.mmfile with an anonymous
mapping as an OS-backed virtual-memory buffer to decompress into,
but it's questionable whether that's really worth the effort.
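
For reference, the anonymous-mapping trick looks roughly like this
(a minimal sketch; the 1 GB size is an arbitrary example):

import std.mmfile : MmFile;

void main()
{
    // Arbitrary example size: 1 GB.
    enum size = 1024UL * 1024 * 1024;

    // Passing null as the filename asks MmFile for an anonymous
    // mapping: pages backed by the OS pager rather than a file,
    // so the buffer may exceed physical RAM and swap as needed.
    auto mmf = new MmFile(null, MmFile.Mode.readWrite, size, null);
    auto buf = cast(ubyte[]) mmf[];

    // ...decompress into buf[] here, then run the parser over it...
}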
If you can get the zip to decompress into a range of dchar, then
std.csv will work with it. It is by no means the fastest, though:
much of the speed is lost because it supports arbitrary input
ranges and doesn't specialize on any more capable range type.
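
Something along these lines should work (a sketch; the archive and
member names are placeholders):

import std.csv : csvReader;
import std.file : read;
import std.zip : ZipArchive;

void main()
{
    // "data.zip" and "data.csv" are placeholder names.
    auto archive = new ZipArchive(read("data.zip"));
    auto member = archive.directory["data.csv"];

    // expand() decompresses one member and returns its bytes.
    auto text = cast(const(char)[]) archive.expand(member);

    // csvReader yields one record per row; each record is itself
    // an input range of string fields.
    foreach (record; csvReader!string(text))
    {
        foreach (field; record)
        {
            // ...process field...
        }
    }
}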
I actually spent some time today looking into whether fastcsv
could be made to work with general input ranges, as long as they
support slicing... and immediately ran into the infamous
autodecoding issue: strings are not random-access ranges because
of autodecoding, so it would take either extensive code surgery to
make it work or ugly hacks to bypass autodecoding. I'm quite
tempted to attempt the latter, in fact, but not right now, since
work is getting busier and I don't have much free time to spend on
a major refactoring of fastcsv.
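
The problem, and one of the possible bypass hacks
(std.utf.byCodeUnit), can be seen in a few lines:

import std.range.primitives : hasSlicing, isRandomAccessRange;
import std.utf : byCodeUnit;

// Because of autodecoding, a string is presented as a
// bidirectional range of dchar, not a random-access range of char:
static assert(!isRandomAccessRange!string);
static assert(!hasSlicing!string);

// byCodeUnit exposes the raw code units again, restoring random
// access and slicing:
static assert(isRandomAccessRange!(typeof("a,b,c".byCodeUnit())));
static assert(hasSlicing!(typeof("a,b,c".byCodeUnit())));

void main() {}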
Alternatively, I could hack together a version of fastcsv that
takes a range of const(char)[] as input rather than a single
string. In theory, it could then handle arbitrarily large input
files, as long as the caller can provide a range of data blocks:
File.byChunk, for example, or, in this particular case, a range of
decompressed data blocks from whatever decompressor is used to
extract the data. As long as you consume the individual rows
without storing references to them indefinitely (i.e., don't try
to build an array of the entire dataset), fastcsv's optimizations
should still work, since unreferenced blocks will eventually be
collected by the GC when memory runs low.
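
To make the idea concrete, here is a hypothetical sketch of just
the chunk-stitching part; this is not fastcsv's actual API. It
yields lines as slices into the caller's chunks and copies only
when a line straddles a chunk boundary:

import std.algorithm.iteration : map;
import std.stdio : File, writeln;
import std.string : indexOf;

// Hypothetical sketch: split a range of const(char)[] chunks into
// lines. A line lying wholly inside one chunk is returned as a
// slice of that chunk. Note indexOf rather than countUntil: on
// char[], the latter counts code points, not code units.
struct LinesOverChunks(Chunks)
{
    private Chunks chunks;
    private const(char)[] cur;   // unread tail of the current chunk
    private const(char)[] line;  // current front
    private bool done;

    this(Chunks c) { chunks = c; popFront(); }

    @property bool empty() const { return done; }
    @property const(char)[] front() const { return line; }

    void popFront()
    {
        char[] pending;  // accumulates a boundary-straddling line
        for (;;)
        {
            if (cur.length == 0)
            {
                if (chunks.empty)
                {
                    // flush a final line with no trailing newline
                    if (pending.length) { line = pending; return; }
                    done = true;
                    return;
                }
                cur = chunks.front;
                chunks.popFront();
                continue;
            }
            immutable nl = cur.indexOf('\n');
            if (nl < 0) { pending ~= cur; cur = null; continue; }
            // fast path: slice into the chunk; copy only if the
            // line started in a previous chunk
            line = pending.length ? pending ~ cur[0 .. nl]
                                  : cur[0 .. nl];
            cur = cur[nl + 1 .. $];
            return;
        }
    }
}

auto linesOverChunks(Chunks)(Chunks chunks)
{
    return LinesOverChunks!Chunks(chunks);
}

void main()
{
    // File.byChunk reuses its internal buffer between fronts, so
    // each chunk is duplicated here; a decompressor handing out
    // freshly allocated blocks would make the .idup unnecessary.
    auto chunks = File("data.csv").byChunk(64 * 1024)
                      .map!(c => cast(const(char)[]) c.idup);

    foreach (lineText; linesOverChunks(chunks))
        writeln(lineText);
}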
T
I hacked your code to work with std.experimental.allocator. If I
remember correctly, it was a fair bit faster for my use case. Let
me know if you would like me to tidy it up into a pull request.
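
In case it is useful to others, the general flavour of that kind
of change is something like this (a minimal sketch; the buffer
size and choice of Mallocator are arbitrary, and the real patch
would presumably touch fastcsv's internal buffer allocations):

import std.experimental.allocator : makeArray, dispose;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // A GC allocation such as `new char[](4096)` becomes an
    // explicit allocator call with matching deallocation.
    auto buf = Mallocator.instance.makeArray!char(4096);
    scope(exit) Mallocator.instance.dispose(buf);

    // ...fill and parse buf[]...
}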
Thanks for the library.
Also, I sent you an email. Not sure if you got it.
Laeeth