On Friday, 12 May 2017 at 00:18:47 UTC, H. S. Teoh wrote:
On Wed, May 10, 2017 at 11:40:08PM +0000, Jesse Phillips via
Digitalmars-d-learn wrote: [...]
H. S. Teoh mentioned fastcsv, but it requires all the data to be in
memory.
Or you could use std.mmfile. But if it's decompressed data,
then it would still need to be small enough to fit in memory.
Well, in theory you *could* use std.mmfile with an anonymous
mapping as an OS-backed virtual-memory buffer to decompress into,
but it's questionable whether that's really worth the effort.
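
For reference, the anonymous-mapping trick looks roughly like this
(a minimal sketch; the 1 GB size is an arbitrary example):

import std.mmfile : MmFile;

void main()
{
    // Arbitrary example size: 1 GB.
    enum size = 1024UL * 1024 * 1024;

    // Passing null as the filename asks MmFile for an anonymous
    // mapping: pages backed by the OS pager rather than a file,
    // so the buffer may exceed physical RAM and swap as needed.
    auto mmf = new MmFile(null, MmFile.Mode.readWrite, size, null);
    auto buf = cast(ubyte[]) mmf[];

    // ...decompress into buf[] here, then run the parser over it...
}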
If you can get the zip to decompress into a range of dchar, then
std.csv will work with it. It is by no means the fastest, though:
much of the speed is lost because it supports arbitrary input
ranges and doesn't specialize on any more capable range type.
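
Something along these lines should work (a sketch; the archive and
member names are placeholders):

import std.csv : csvReader;
import std.file : read;
import std.zip : ZipArchive;

void main()
{
    // "data.zip" and "data.csv" are placeholder names.
    auto archive = new ZipArchive(read("data.zip"));
    auto member = archive.directory["data.csv"];

    // expand() decompresses one member and returns its bytes.
    auto text = cast(const(char)[]) archive.expand(member);

    // csvReader yields one record per row; each record is itself
    // an input range of string fields.
    foreach (record; csvReader!string(text))
    {
        foreach (field; record)
        {
            // ...process field...
        }
    }
}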
I actually spent some time today looking into whether fastcsv
could be made to work with general input ranges, as long as they
support slicing... and immediately ran into the infamous
autodecoding issue: strings are not random-access ranges because
of autodecoding, so it would take either extensive code surgery to
make it work or ugly hacks to bypass autodecoding. I'm quite
tempted to attempt the latter, in fact, but not right now, since
work is getting busier and I don't have much free time to spend on
a major refactoring of fastcsv.
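
The problem, and one of the possible bypass hacks
(std.utf.byCodeUnit), can be seen in a few lines:

import std.range.primitives : hasSlicing, isRandomAccessRange;
import std.utf : byCodeUnit;

// Because of autodecoding, a string is presented as a
// bidirectional range of dchar, not a random-access range of char:
static assert(!isRandomAccessRange!string);
static assert(!hasSlicing!string);

// byCodeUnit exposes the raw code units again, restoring random
// access and slicing:
static assert(isRandomAccessRange!(typeof("a,b,c".byCodeUnit())));
static assert(hasSlicing!(typeof("a,b,c".byCodeUnit())));

void main() {}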
Alternatively, I could hack together a version of fastcsv that
takes a range of const(char)[] as input rather than a single
string. In theory, it could then handle arbitrarily large input
files, as long as the caller can provide a range of data blocks:
File.byChunk, for example, or, in this particular case, a range of
decompressed data blocks from whatever decompressor is used to
extract the data. As long as you consume the individual rows
without storing references to them indefinitely (i.e., don't try
to build an array of the entire dataset), fastcsv's optimizations
should still work, since unreferenced blocks will eventually be
collected by the GC when memory runs low.
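
To make the idea concrete, here is a hypothetical sketch of just
the chunk-stitching part; this is not fastcsv's actual API. It
yields lines as slices into the caller's chunks and copies only
when a line straddles a chunk boundary:

import std.algorithm.iteration : map;
import std.stdio : File, writeln;
import std.string : indexOf;

// Hypothetical sketch: split a range of const(char)[] chunks into
// lines. A line lying wholly inside one chunk is returned as a
// slice of that chunk. Note indexOf rather than countUntil: on
// char[], the latter counts code points, not code units.
struct LinesOverChunks(Chunks)
{
    private Chunks chunks;
    private const(char)[] cur;   // unread tail of the current chunk
    private const(char)[] line;  // current front
    private bool done;

    this(Chunks c) { chunks = c; popFront(); }

    @property bool empty() const { return done; }
    @property const(char)[] front() const { return line; }

    void popFront()
    {
        char[] pending;  // accumulates a boundary-straddling line
        for (;;)
        {
            if (cur.length == 0)
            {
                if (chunks.empty)
                {
                    // flush a final line with no trailing newline
                    if (pending.length) { line = pending; return; }
                    done = true;
                    return;
                }
                cur = chunks.front;
                chunks.popFront();
                continue;
            }
            immutable nl = cur.indexOf('\n');
            if (nl < 0) { pending ~= cur; cur = null; continue; }
            // fast path: slice into the chunk; copy only if the
            // line started in a previous chunk
            line = pending.length ? pending ~ cur[0 .. nl]
                                  : cur[0 .. nl];
            cur = cur[nl + 1 .. $];
            return;
        }
    }
}

auto linesOverChunks(Chunks)(Chunks chunks)
{
    return LinesOverChunks!Chunks(chunks);
}

void main()
{
    // File.byChunk reuses its internal buffer between fronts, so
    // each chunk is duplicated here; a decompressor handing out
    // freshly allocated blocks would make the .idup unnecessary.
    auto chunks = File("data.csv").byChunk(64 * 1024)
                      .map!(c => cast(const(char)[]) c.idup);

    foreach (lineText; linesOverChunks(chunks))
        writeln(lineText);
}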
T
I hacked your code to work with std.experimental.allocator. If I
remember correctly, it was a fair bit faster for my use case. Let
me know if you would like me to tidy it up into a pull request.
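
In case it is useful to others, the general flavour of that kind
of change is something like this (a minimal sketch; the buffer
size and choice of Mallocator are arbitrary, and the real patch
would presumably touch fastcsv's internal buffer allocations):

import std.experimental.allocator : makeArray, dispose;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // A GC allocation such as `new char[](4096)` becomes an
    // explicit allocator call with matching deallocation.
    auto buf = Mallocator.instance.makeArray!char(4096);
    scope(exit) Mallocator.instance.dispose(buf);

    // ...fill and parse buf[]...
}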
Thanks for the library.
Also, I sent you an email. Not sure if you got it.
Laeeth