Author here:

The discussion[1] and articles[2] around "Faster Command Line Tools" had me trying out std.csv for the task.

Now, I know std.csv isn't fast, and it allocates. When I wrote my CSV parser, I also left around a parser that uses slicing instead of allocation[3].

I compared the two, both built with LDC using -O3 -release:

std.csv: over 10 seconds
csv slicing: under 5 seconds

An improvement of over 50% isn't bad, but this still wasn't competing with the other implementations. I didn't profile std.csv's implementation, but I did take a look at the one with slicing.

The majority of the time was spent in std.algorithm.startsWith, which is called by countUntil. The calls made to empty() also add up, since it is used in both countUntil and startsWith. These functions are by no means slow: startsWith averaged 1 millisecond execution time while countUntil was up to 5 milliseconds. The thing is, startsWith was called a whopping 384,806,160 times.
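To see how the call count gets that large, here is a rough sketch of a search that tests for the needle at every position, which is what the profile suggests countUntil does with startsWith. The names naiveCountUntil and calls are hypothetical and exist only for illustration; this is not the Phobos implementation, and the byte-wise slicing assumes ASCII input.

```d
import std.algorithm.searching : startsWith;

// Hypothetical counter: how many times startsWith is invoked.
size_t calls;

// Illustrative stand-in for a countUntil-style search, not Phobos code.
// Assumes ASCII input, so slicing off one byte advances one character.
ptrdiff_t naiveCountUntil(string haystack, string needle)
{
    ptrdiff_t pos = 0;
    while (haystack.length > 0)
    {
        ++calls;
        if (haystack.startsWith(needle))
            return pos;
        haystack = haystack[1 .. $];
        ++pos;
    }
    return -1; // needle not found
}

void main()
{
    // One startsWith call per character, up to and including the separator.
    assert(naiveCountUntil("12345,rest", ",") == 5);
    assert(calls == 6);
}
```

With 384,806,160 calls over 10,512,769 rows, that works out to about 36.6 startsWith calls per row, so even a very cheap function ends up dominating the profile.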

Keep in mind that the file itself has 10,512,769 rows of data with four columns. I've spoken about std.csv's performance in the past, probably with the author of the fast command line tools. Essentially it came down to this: std.csv is restricted to parsing with only the input range API, and you can't correctly parse CSV without allocation. But now I'm working outside those restrictions, so I offer an additional point.
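A minimal sketch of why correct parsing needs allocation: the unescaped contents of a quoted field containing doubled quotes never appear contiguously in the input, so no slice of the input can represent them. The helpers isSliceOf and unescapeQuotes are hypothetical names used for illustration; they are not taken from std.csv or from the slicing parser.

```d
import std.array : replace;

// Hypothetical helper: true when `part` points into the memory of `whole`.
bool isSliceOf(string part, string whole)
{
    return part.ptr >= whole.ptr
        && part.ptr + part.length <= whole.ptr + whole.length;
}

// Hypothetical helper: collapse doubled quotes into single quotes.
string unescapeQuotes(string raw)
{
    return raw.replace(`""`, `"`);
}

void main()
{
    string line = `1,"say ""hi""",3`;

    // Contents of the quoted field, taken as a slice: no copy is made.
    string raw = line[3 .. $ - 3];
    assert(raw == `say ""hi""`);
    assert(isSliceOf(raw, line));

    // The unescaped text is not a substring of `line`,
    // so it has to live in a newly allocated buffer.
    string clean = unescapeQuotes(raw);
    assert(clean == `say "hi"`);
    assert(!isSliceOf(clean, line));
}
```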

Both of these do something none of the other implementations do: they validate that the CSV is well formed. If the parser finds that the file no longer conforms to the correct CSV layout, it makes a choice: either throw an exception or guess and continue on (based on what the user requested). While the Nim implementation does handle escaped quotes (and newlines, unlike fast csv), its parsing assumes the file is well formed, and std.csv was quite prompt to point out that this file is indeed not well formed.
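For reference, the two behaviours can be sketched with std.csv's Malformed setting. The sample input (a quote inside an unquoted field) and the helper names parsesStrictly and countCellsLeniently are assumptions made for this illustration; the benchmark file's actual defect is not shown here.

```d
import std.csv : csvReader, CSVException, Malformed;

// Strict mode (the std.csv default): returns false if the input is rejected.
bool parsesStrictly(string text)
{
    try
    {
        foreach (record; csvReader!string(text))
            foreach (cell; record) {}
        return true;
    }
    catch (CSVException e)
    {
        return false;
    }
}

// Lenient mode: malformed input is parsed on a best-effort basis.
size_t countCellsLeniently(string text)
{
    size_t cells;
    foreach (record; csvReader!(string, Malformed.ignore)(text))
        foreach (cell; record)
            ++cells;
    return cells;
}

void main()
{
    // Assumed malformed sample: a quote inside an unquoted field.
    string bad = "a,b\"c,d";

    assert(parsesStrictly("a,b,c"));
    assert(!parsesStrictly(bad));
    assert(countCellsLeniently(bad) > 0); // no exception in lenient mode
}
```

Either way the parser still has to inspect every character for quotes and separators, which is where the validation overhead comes from.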

Even though the issue can be ignored, the overhead of parsing to identify issues still remains. I haven't attempted to write the algorithm assuming well-formed data, so I don't know what the performance would look like, but I suspect the difference isn't negligible. There is also likely some overhead from providing the tokens through range interfaces.

On another note, this slicing version of the CSV parser can and should be included in std.csv as a specialization. But it is by no means ready. The feature requirements need to be spelled out better (hasSlicing!Range fails for strings, yet strings are the primary use case for the optimization), and escaped quotes remain in the returned data (like I said, proper parsing requires allocation).
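The hasSlicing problem can be checked at compile time. The constraint canSliceCells below is a hypothetical name showing one way the requirement might be spelled out; it is not part of std.csv or of the slicing parser.

```d
import std.range.interfaces : InputRange;
import std.range.primitives : hasSlicing;
import std.traits : isSomeString;

// Phobos treats narrow strings as ranges of code points, so string does
// not satisfy hasSlicing even though the array itself can be sliced.
static assert(!hasSlicing!string);
static assert( hasSlicing!dstring);

// Hypothetical constraint for a slicing specialization; not part of std.csv.
enum bool canSliceCells(R) = hasSlicing!R || isSomeString!R;

static assert( canSliceCells!string);
static assert( canSliceCells!(int[]));
static assert(!canSliceCells!(InputRange!char));

void main() {}
```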

1. http://forum.dlang.org/post/chvukhbscgamxecvp...@forum.dlang.org
2. https://www.euantorano.co.uk/posts/faster-command-line-tools-in-nim/
3. https://github.com/JesseKPhillips/JPDLibs/tree/csvoptimize
