On Tue, Jan 26, 2016 at 08:54:34PM +0000, Chris Wright via Digitalmars-d-learn wrote:
> On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:
> 
> > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> >>
> >> While this is no fancy range-based code, and one might say it's
> >> more hackish and C-like than idiomatic D, the problem is that
> >> current D compilers can't quite optimize range-based code to this
> >> extent yet. Perhaps in the future optimizers will improve so that
> >> more idiomatic, range-based code will have comparable performance
> >> with fastcsv. (At least in theory this should be possible.)
> >
> > As a D novice still struggling with the concept that composable
> > range-based functions can be more efficient than good old looping
> > (ya, I know, cache friendliness and GC avoidance), I find it
> > extremely interesting that someone as expert as yourself would
> > reach for a C-like approach for serious data crunching. Given that
> > data crunching is the kind of thing I need to do a lot, I'm
> > wondering how general your statement above might be at this time
> > w.r.t. this and possibly other domains.
> 
> You want to reduce allocations. Ranges often let you do that.
> However, it's sometimes unsafe to reuse range values that aren't
> immutable. That means, if you want to keep the values around, you
> need to copy them -- which introduces an allocation.
> 
> You can get fewer large allocations by reading the whole file at
> once manually and using slices into that large allocation.
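(Concretely, that last suggestion might look like the following minimal sketch; the file name and the line-splitting criterion are just for illustration:)

    import std.algorithm : splitter;
    import std.file : read;

    void main()
    {
        // One big allocation for the entire file...
        auto data = cast(const(char)[]) read("input.csv");

        // ...and everything after that is slices into it: splitter
        // returns subarrays of `data` without copying any characters.
        foreach (line; data.splitter('\n'))
        {
            // `line` is a window into `data`, not a copy.
        }
    }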
Yeah, in the course of this exercise, I found that the one thing with the biggest impact on performance is the number of allocations. Basically, the fewer allocations made, the more efficient the code. I'm not sure exactly why this is so, but it probably has something to do with the fact that tracing GCs work better with fewer allocations of larger objects than with many allocations of small objects. I have also noted in the past that D's current GC runs collections a little too often; in past projects I've obtained significant speedups (in one case, up to a 40% reduction in total runtime) by suppressing automatic collections and scheduling them manually at a lower frequency (sketched below). In short, I've found that reducing GC load plays a much bigger role in performance than the ranges vs. loops issue.

The reason I chose to write manual loops at first was to eliminate any possibility of unexpected overhead hiding behind range primitives, as well as compiler limitations: current optimizers aren't exactly tuned for range-based idioms, and may fail to recognize certain range-based idioms that would lead to much more efficient code. However, in my second iteration, when I made the fastcsv parser return an input range instead of an array, I found only negligible performance differences. This suggests that range-based code may not perform that badly after all. I have yet to test this hypothesis, since the inner loop that parses the fields of a single row is still a manual loop; but my suspicion is that it wouldn't do too badly in range-based form either.

What might make a big difference, though, is the part where slicing is used, since that is essential for reducing the number of allocations. The current iteration of the struct-based parsing code, for instance, went through an initial version that was excruciatingly slow for structs with string fields. Why? Because the function takes const(char)[] as input, and you can't legally get strings out of that without copying the data (const means *you* cannot modify it, but somebody else still might). So std.conv.to would allocate a new string and copy the contents over every time a string field was parsed, resulting in a large number of small allocations.

To solve this, I used a string buffer: instead of one allocation per string, pre-allocate a large-ish char[] buffer, and every time a string field is parsed, append the data to the buffer; if the buffer becomes full, allocate a new one. Then take the slice of the buffer corresponding to that field and cast it to string (this is safe because the algorithm is constructed never to write over previous parts of the buffer). This seemingly trivial optimization won me a performance improvement of an order of magnitude(!). That is particularly enlightening, since it suggests that even the overhead of copying all the string fields out of the original data into a new buffer doesn't add up to much. (This scheme is also sketched below.)

The new struct-based parser also returns an input range rather than an array; constructing the array directly vs. copying it out of an input range didn't make that big of a difference either. What did make a huge difference was reducing the number of allocations.

So the moral of the story is: avoid large numbers of small allocations. If you have to allocate, consolidate your allocations into a series of large(ish) buffers instead, and take slices of those buffers.
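For the record, the collection-scheduling trick I mentioned looks roughly like this -- a minimal sketch, with the batch size and processItem() made up for illustration:

    import core.memory : GC;

    void processItem(string item)
    {
        // Stand-in for the real per-item work.
    }

    void processAll(string[] items)
    {
        GC.disable();            // suppress automatic collections
        scope(exit) GC.enable(); // restore normal behaviour on exit

        foreach (i, item; items)
        {
            processItem(item);

            // Run collections at a frequency of our own choosing,
            // instead of whenever an allocation happens to trigger one.
            if ((i + 1) % 10_000 == 0)
                GC.collect();
        }
    }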
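And a minimal sketch of the string-buffer scheme; StringBuf, intern(), and the 4 KB block size are invented for illustration, not lifted from fastcsv:

    import std.algorithm : max;

    struct StringBuf
    {
        char[] buf;
        size_t used;

        // Copy `field` into the buffer and return an immutable slice
        // of it. The cast to string is safe because we never write
        // over a region we have already handed out.
        string intern(const(char)[] field)
        {
            if (used + field.length > buf.length)
            {
                // Buffer full: start a fresh one. Slices of the old
                // buffer stay valid; the GC keeps it alive.
                buf = new char[](max(4096, field.length));
                used = 0;
            }
            buf[used .. used + field.length] = field[];
            auto s = cast(string) buf[used .. used + field.length];
            used += field.length;
            return s;
        }
    }

The net effect is one copy per field but only one GC allocation per buffer-full of fields, which is where the order-of-magnitude win comes from.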
(And on a tangential note, this backs up Walter's claim that string manipulation in C/C++ will ultimately lose, because of strcpy() and strlen(). Think of how many times in C/C++ code you have to copy string data just because you can't guarantee the incoming string will still be around after you return, and how many times you have to iterate over strings just because arrays decay to pointers and thus carry no length. You couldn't write the equivalent of fastcsv in C/C++: you'd leak memory and/or get dangling pointers, since you don't know what will happen to the incoming data after you return, so you can't just take slices of it. You'd be forced to malloc() all your strings and then somehow ensure the caller cleans up properly. Ultimately you'd need a convoluted, unnatural API just to make sure the memory housekeeping is taken care of. Whereas in D, even though the GC is atrociously slow, it *does* let you freely slice things to your heart's content, with zero API complication and no memory leaks; done right, it can even rival C/C++ performance, and at a fraction of the mental load required to write leak-free, pointer-bug-free C/C++ code.)


T

-- 
Tell me and I forget. Teach me and I remember. Involve me and I understand. -- Benjamin Franklin