std.csv Performance Review

Jesse Phillips via Digitalmars-d Fri, 02 Jun 2017 21:31:52 -0700

Author here:

The discussion[1] and articles[2] around "Faster Command Line Tools" had me trying out std.csv for the task.

Now I know std.csv isn't fast and that it allocates. When I wrote my CSV parser, I'd also left around a parser which used slicing instead of allocation[3].


I compared the two, both built with LDC -O3 -release:

std.csv: over 10 seconds
csv slicing: under 5 seconds

An improvement of over 50% isn't bad, but this still wasn't competing with the other implementations. Now I didn't profile std.csv's implementation, but I did take a look at the one with slicing.

The majority of the time was spent in std.algorithm.startsWith, which is being called by countUntil. The calls made to empty() also add up from the use in countUntil and startsWith. These functions are by no means slow: startsWith averaged 1 millisecond execution time while countUntil was up to 5 milliseconds. The thing is, startsWith was called a whopping 384,806,160 times.
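For a rough sense of scale, the call count can be set against the size of the file. The sketch below uses Python purely as a calculator; the per-field figure assumes every one of the 10,512,769 rows has exactly four columns, and the function names are made up for illustration.

```python
# Figures quoted in the post.
STARTSWITH_CALLS = 384_806_160
ROWS = 10_512_769
COLUMNS = 4  # assumption: every row has exactly four columns


def calls_per_row(calls: int, rows: int) -> float:
    """Average number of calls made for each row of the file."""
    return calls / rows


def calls_per_field(calls: int, rows: int, columns: int) -> float:
    """Average number of calls made for each field of the file."""
    return calls / (rows * columns)


if __name__ == "__main__":
    per_row = calls_per_row(STARTSWITH_CALLS, ROWS)
    per_field = calls_per_field(STARTSWITH_CALLS, ROWS, COLUMNS)
    print(f"{per_row:.1f} startsWith calls per row")      # about 36.6
    print(f"{per_field:.1f} startsWith calls per field")  # about 9.2
```

So even a tiny per-call cost is multiplied by roughly 37 calls for every row parsed.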

Keep in mind that the file itself has 10,512,769 rows of data with four columns. Now I've spoken to std.csv's performance in the past, probably with the author of the fast command line tools. Essentially it came down to this: std.csv is restricted to parsing with only the Input Range API, and you can't correctly parse CSV without allocation. But now I'm working outside those restrictions, and so I offer an additional point.

Both of these do something none of the other implementations do: they validate that the CSV is well formed. If the parser finds that the file no longer conforms to the correct CSV layout, it makes a choice: either throw an exception or guess and continue on (based on what the user requested). While the Nim implementation does handle escaped quotes (and newlines, unlike fast csv), its parsing assumes the file is well formed, and std.csv was quite prompt to point out that this file is indeed not well formed.
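To make the throw-or-continue choice concrete, here is a minimal sketch in Python. It is not the std.csv code: `split_fields`, `MalformedCSV` and the `strict` flag are hypothetical names, it handles a single record with no embedded newlines, and the "guess" it makes in lenient mode (keep the stray quote as literal text) is only one of several possible guesses.

```python
class MalformedCSV(ValueError):
    """Raised in strict mode when a record breaks the quoting rules."""


def split_fields(line: str, strict: bool = True) -> list:
    """Split one CSV record into fields, validating quotes as it goes."""
    fields, buf = [], []
    in_quotes = False
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and line[i + 1] == '"':
                    buf.append('"')  # escaped quote: "" becomes "
                    i += 2
                    continue
                in_quotes = False  # closing quote
                if i + 1 < n and line[i + 1] != ',':
                    if strict:
                        raise MalformedCSV(f"text after closing quote at {i}")
                    # lenient: trailing text is appended to the field
            else:
                buf.append(ch)
        elif ch == ',':
            fields.append(''.join(buf))
            buf = []
        elif ch == '"':
            if buf:
                # a quote in the middle of an unquoted field
                if strict:
                    raise MalformedCSV(f"unexpected quote at {i}")
                buf.append(ch)  # lenient: keep it as literal text
            else:
                in_quotes = True  # opening quote
        else:
            buf.append(ch)
        i += 1
    if in_quotes and strict:
        raise MalformedCSV("unterminated quoted field")
    fields.append(''.join(buf))
    return fields
```

Note that the checks run on every character whether or not the caller asked for exceptions, which is the overhead that remains even when problems are ignored.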

Even though the issue can be ignored, the overhead of parsing to identify issues still remains. I haven't attempted to write the algorithm assuming proper data structure, so I don't know what the performance would look like, but I suspect the overhead isn't negligible. There is also likely some overhead for providing the tokens through range interfaces.

On another note, this slicing version of the CSV parser can and should be included in std.csv as a specialization. But it is by no means ready. The feature requirements need to be spelled out better (hasSlicing!Range fails for strings, yet strings are the primary use case for the optimization), and escaped quotes remain in the returned data (like I said, proper parsing requires allocation).
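The escaped-quote problem can be shown with a small sketch. It is written in Python for illustration only and is not the code from [3]. Python slices copy, so the (start, end) spans here stand in for D slices that would alias the input buffer. The names `field_spans`, `raw_field` and `unescape` are hypothetical, and the input is assumed to be a single well-formed record.

```python
def field_spans(line: str) -> list:
    """Return (start, end) index pairs for each field, copying no field text."""
    spans = []
    start = 0
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == '"':
            # An escaped "" toggles twice, so the state comes out right.
            in_quotes = not in_quotes
        elif ch == ',' and not in_quotes:
            spans.append((start, i))
            start = i + 1
    spans.append((start, len(line)))
    return spans


def raw_field(line: str, span: tuple) -> str:
    """What a slicing parser can hand back: the text between the outer quotes."""
    start, end = span
    if end - start >= 2 and line[start] == '"' and line[end - 1] == '"':
        start, end = start + 1, end - 1
    return line[start:end]


def unescape(raw: str) -> str:
    """Collapse "" to ". The result is shorter than the input, so it needs new storage."""
    return raw.replace('""', '"')


if __name__ == "__main__":
    line = '1,"say ""hi""",x'
    span = field_spans(line)[1]
    print(raw_field(line, span))            # say ""hi""
    print(unescape(raw_field(line, span)))  # say "hi"
```

The raw slice still carries the doubled quotes. Producing the real field value means building a new, shorter string, which is where the allocation comes from.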

1. http://forum.dlang.org/post/chvukhbscgamxecvp...@forum.dlang.org
2. https://www.euantorano.co.uk/posts/faster-command-line-tools-in-nim/
3. https://github.com/JesseKPhillips/JPDLibs/tree/csvoptimize
