On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via Digitalmars-d wrote:
> On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
> > Do you know what happened with fastcsv [0], original thread [1].
> >
> > [0] https://github.com/quickfur/fastcsv
> > [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-le...@puremagic.com
>
> I do not. Rereading that in light of this new article I'm a little
> sceptical of the 51 times faster, since I'm seeing only 10x against
> these other implementations.
[...]
You don't have to be skeptical, nor do you have to believe what I claimed. I posted the entire code I used in the original thread, as well as the URLs of the exact data files I used for testing. You can just run it yourself and see the results for yourself.

And yes, fastcsv has its limitations (the file has to fit in memory, no validation is done, etc.), which are documented up-front in the README file. I wrote the code targeting a specific use case mentioned by the OP of the original thread, so I don't expect or claim that you will see the same kind of performance for other use cases. If you want validation, then it's a given that you won't get maximum performance, simply because there's more work to do. For data sets that don't fit into memory, I already have some ideas about how to extend my algorithm to handle them, so some of the performance may still be retained; but obviously it's not going to be as fast as when you can just read the entire file into memory first.

(Note that this is much less of a limitation than it seems: for example, you could use std.mmfile to memory-map the file into your address space, so that it doesn't actually have to fit into memory, and you can still take slices of it. The OS will manage the paging from/to disk for you. Of course, it will be slower when something has to be paged in from disk, but IME this is often much faster than reading the data into memory yourself. Again, you don't have to believe me: the fastcsv code is there, so just import std.mmfile, mmap the largest CSV file you can find, call fastcsv on it, and measure the performance yourself (a rough sketch of this is appended at the end of this post). If your code performs better, great, tell us all about it. I'm interested to learn how you did it.)

Note that besides slicing, another major part of the performance boost in fastcsv comes from minimizing GC allocations. If you allocate a string for each field in a row, it will be much slower than if you either slice the original string, or allocate one large buffer to hold the data and take slices of it for each field (see the field-slicing sketch appended below). Furthermore, if you allocate a new array per row to hold the list of fields, it will be much slower than if you allocate one large array to hold the fields of all the rows, and merely slice this array to get your rows. Of course, you cannot know ahead of time exactly how many rows there will be, so the next best thing is to allocate a series of large arrays, each capable of holding the field slices of k rows, for sufficiently large k. Once the current array runs out of space, copy the (partial) slices of the last row to the beginning of a new large array, and continue from there (see the row-buffer sketch appended below). This way you will be making n/k allocations, where n is the number of rows and k is the number of rows that fit into each buffer, as opposed to n allocations. For large values of k, this greatly reduces the GC load and significantly speeds things up.

Again, don't take my word for it. Run a profiler on the fastcsv code and see for yourself.

T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
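
P.S. Here's a rough, untested sketch of the mmfile approach, just to make it concrete. The file name is a placeholder, and the actual fastcsv parsing call is left as a comment (check the fastcsv README for the real function name and signature):

import std.mmfile : MmFile;

void main()
{
    // Memory-map the CSV file: the OS pages it in on demand, so the
    // file doesn't actually have to fit in physical memory.
    auto mmf = new MmFile("huge.csv");      // placeholder file name

    // The mapped region can be viewed as one big slice, which is
    // exactly what a slicing parser wants to work with.
    auto data = cast(const(char)[]) mmf[];

    // Hand the slice to fastcsv's parsing function here and time it;
    // see the fastcsv README for the actual entry point.
    //auto records = parseCsv(data);        // hypothetical name
}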
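
And here's the field-slicing sketch. It is not fastcsv's actual code (quoting and escapes are ignored for brevity), but it shows the point: each field is just a window into the original data, so no per-field string is ever copied:

// Split one CSV line into fields by slicing, not copying: each field
// is a slice of `line`, so no per-field string allocation happens.
const(char)[][] sliceFields(const(char)[] line)
{
    const(char)[][] fields;
    size_t start = 0;
    foreach (i, c; line)
    {
        if (c == ',')
        {
            fields ~= line[start .. i];   // slice of the original data
            start = i + 1;
        }
    }
    fields ~= line[start .. $];
    return fields;
}

unittest
{
    assert(sliceFields("a,bb,ccc") == ["a", "bb", "ccc"]);
}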
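
Finally, the row-buffer sketch. The struct, its names, and the buffer size are mine for illustration only (and it assumes no single row has more than bufSize fields); fastcsv's actual bookkeeping differs in the details, but the idea is the same: roughly n/k allocations instead of n:

// Store the field slices of many rows in one big buffer, so that each
// row is itself just a slice of that buffer instead of a separate
// per-row allocation.
struct RowBuffer
{
    enum bufSize = 64 * 1024;    // field slices per big buffer
                                 // (holds roughly k rows' worth)

    const(char)[][] buf;         // current big buffer of field slices
    size_t next;                 // next free slot in buf
    size_t rowStart;             // where the current row begins in buf
    const(char)[][][] rows;      // finished rows, each a slice of a buffer

    void addField(const(char)[] field)
    {
        if (next == buf.length)
        {
            // Buffer full (or first use): allocate a new one and copy
            // the partial current row to its beginning, so that no row
            // ever straddles two buffers.
            auto newBuf = new const(char)[][](bufSize);
            auto partial = buf[rowStart .. next];
            newBuf[0 .. partial.length] = partial[];
            buf = newBuf;
            next = partial.length;
            rowStart = 0;
        }
        buf[next++] = field;
    }

    void endRow()
    {
        rows ~= buf[rowStart .. next];   // a row is just a slice, no copy
        rowStart = next;
    }
}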