On Mon, Jan 26, 2015 at 07:59:10PM +0000, Laeeth Isharc via Digitalmars-d wrote:
[...]
> My question is how much is lost by doing it in two steps (sort,
> groupby) rather than one.
[...]
I'm not sure I understand what you mean by "lost"... I assume you're not talking about data loss, since that's not applicable either way. In any case, as with all things, there's a tradeoff.

Building a hash table of groups, as some people demand, has the advantage that the grouping predicate is applied globally, and no O(n log n) sorting is needed. The disadvantage is that the entire data set has to fit in memory, and you cannot process it lazily: before you've seen all of the data, you cannot know whether there are more distinct groups to come, or whether currently known groups have more members. Once the data size gets too large, you run into trouble. Plus, some people are averse to gratuitous GC use in Phobos algorithms. Perhaps some of this is misplaced, but that's the perception.

The current implementation has the advantage that it requires very little memory to work, and can process data lazily; it only needs to see a tiny portion of the entire data set to do its work. Each group returned is also lazy, so it doesn't need to store the whole group in its entirety. It can handle 10GB-long groups in a 100GB data set streamed over the network with very little memory, whereas such a monstrous job would be infeasibly resource-hungry if you had to allocate an on-disk hash table (very inefficient!), iterate over the entire data set, and then be I/O-bound as you iterate over each group loading its members from disk. The big disadvantage, however, is that if your data is not sorted, then the groupings returned won't be quite what you'd expect, since the predicate is evaluated only over consecutive runs of elements, not globally over the entire data set.

Which one is better? That depends on your specific application and its needs. For some applications, there is simply no way around the fact that you have a large data set with randomly-ordered elements, so the sorting has to be done *somewhere*.
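To make the "consecutive runs" point concrete, here's a small sketch (written against the modern Phobos name chunkBy, which is what the groupBy under discussion was later renamed to; the exact API available at the time may differ):

```d
import std.algorithm : chunkBy, map, sort;
import std.array : array;

void main()
{
    // Unsorted input: the 1s at the two ends land in separate groups,
    // because the predicate only compares consecutive elements.
    auto data = [1, 1, 2, 2, 1];
    auto local = data.chunkBy!((a, b) => a == b)
                     .map!(g => g.array)
                     .array;
    assert(local == [[1, 1], [2, 2], [1]]);

    // Sorting first makes the grouping effectively global:
    // one group per distinct value.
    auto global = data.dup
                      .sort()
                      .chunkBy!((a, b) => a == b)
                      .map!(g => g.array)
                      .array;
    assert(global == [[1, 1, 1], [2, 2]]);
}
```

Note that both `chunkBy` and each group it yields are lazy; the `.array` calls above exist only to make the result easy to inspect.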
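As for what the two-step (sort, then group) approach costs: the sort is the dominant O(n log n) step, and the grouping pass afterwards is a single lazy linear scan. A minimal sketch, with a made-up record type purely for illustration (again using the later name chunkBy):

```d
import std.algorithm : chunkBy, map, sort;
import std.array : array;

// Hypothetical record type, for illustration only.
struct Record { string key; int value; }

void main()
{
    auto records = [Record("b", 1), Record("a", 2),
                    Record("b", 3), Record("a", 4)];

    // Step 1: O(n log n) sort brings equal keys together.
    // Step 2: a linear pass chunks the now-consecutive runs.
    auto groups = records
        .sort!((x, y) => x.key < y.key)
        .chunkBy!((x, y) => x.key == y.key)
        .map!(g => g.array)
        .array;

    assert(groups.map!(g => g[0].key).array == ["a", "b"]);
    assert(groups[0].length == 2 && groups[1].length == 2);
}
```

So compared with a hash-table grouping, the extra cost of the two-step pipeline is the sort itself; the grouping step adds essentially nothing.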
For other applications, you *can* stream your data in sorted order -- or perhaps it doesn't matter that the groupings aren't global -- so you can take advantage of the fact that the current groupBy is extremely cheap: it requires minimal memory to do its work, and it can run inside a pipeline without requiring anything more than an input range.

Basically, std.algorithm is (currently) primarily focused on linear algorithms. While there *are* some sorting algorithms and sublinear algorithms that take advantage of sorted data, it is still primarily geared toward linear processing; there is no good abstraction as yet for more complex processing like multidimensional or graph traversals. As a result, it is best suited for stream-based processing. Generally speaking, most application code is primarily concerned with linear processing (copy this text from here to there, scan this text for a string, i.e. a search keyword, render this line of text to the screen, draw this line around the window, move this item along in the processing queue, etc.). Non-linear computations generally take place only in dedicated computing modules of limited scope in the application, where one tends to write dedicated algorithms rather than reuse stock generic ones.

> In any case, the documentation should be very clear on what groupby
> does, and how the user can do what he might be expecting to achieve,
> coming from a different framework.
[...]

As far as I know, the current groupBy docs explain quite clearly what it does. If you find them still inadequate or unclear, please file bugs against them so that we can look into improving the docs.


T

-- 
GEEK = Gatherer of Extremely Enlightening Knowledge