If "group by" in other languages refers to the latter function, then that means "groupBy" is poorly named and we need to come up with a better name for it. Changing it to return tuples and what-not seems to be beating around the bush to me.


T

T: you are good with algorithms. In many applications you have a bunch of results and want to summarise them. This is often what the corporate manager is doing with Excel pivot tables, and it is what the groupby function is used for in pandas. See here for a simple tutorial.

http://wesmckinney.com/blog/?p=125

And here for a summary of what pandas can do with data:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html

Is there any reason why we shouldn't add to Phobos: median, ranking, standard deviation, variance, correlation, covariance, skew, kurtosis, quantile, moving average, exponential moving average, rolling window (see pandas)?
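To make two of the items in that list concrete, here is a minimal sketch in Python (not D, and not a proposed Phobos API). The function names `moving_average` and `exp_moving_average` and the parameter `alpha` are my own illustrative choices, and the recursion shown is just one common definition of an exponential moving average.

```python
def moving_average(xs, window):
    """Simple rolling mean: one output per full window of `window` items."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(xs[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(xs))
    ]


def exp_moving_average(xs, alpha):
    """Exponential moving average, seeded with the first observation.

    Assumed recursion: y[0] = x[0], y[t] = alpha * x[t] + (1 - alpha) * y[t-1].
    """
    out = []
    for x in xs:
        out.append(x if not out else alpha * x + (1 - alpha) * out[-1])
    return out


rolling = moving_average([1, 2, 3, 4, 5], window=3)
smoothed = exp_moving_average([1, 2, 3], alpha=0.5)
```

Both are single linear passes over the input, so they would map naturally onto lazy ranges.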

I personally am fine with the implementation we have (although, as Ray Dalio would say, I haven't yet earned the right for you to care what I think). All it means is that you need to sort your results on multiple keys before passing them to groupby.
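The sort-then-group pattern can be sketched with Python's `itertools.groupby`, which, like the groupBy being discussed, only merges adjacent elements with equal keys. This is an analogue in Python, not the Phobos API, and the sample records and the `summarise` helper are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records: (region, product, units sold).
records = [
    ("east", "apples", 3),
    ("west", "apples", 5),
    ("east", "pears", 2),
    ("east", "apples", 4),
    ("west", "apples", 1),
]

# Multi-key: group on (region, product).
key = itemgetter(0, 1)


def summarise(rows):
    """Group consecutive rows with equal keys and sum their units."""
    return [(k, sum(r[2] for r in grp)) for k, grp in groupby(rows, key=key)]


# Without sorting, equal keys that are not adjacent end up in separate groups.
unsorted_summary = summarise(records)

# Sorting on the same multi-key first gives one group per distinct key.
sorted_summary = summarise(sorted(records, key=key))
```

On the unsorted input every record forms its own group, because no two neighbours share a key. After sorting, the five records collapse into three groups, which is what someone coming from pandas or SQL would probably expect.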

My question is how much is lost by doing it in two steps (sort, then groupby) rather than one. I don't think all that much, but it is not my field. I am also not that bothered, because this comes at the end of processing, not within the inner loop, so for me it doesn't make a difference for now. If data sets reach Common Crawl-type sizes then it might be different, although I will take D over Java any day, warts and all.
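For comparison, here is a rough Python sketch of the two approaches. The helper names `group_two_step` and `group_one_step` are made up for illustration. The two-step version pays for an O(n log n) sort before a linear grouping pass. The one-step version is a single pass with expected O(n) cost, but it needs hashable keys and has to hold every group in memory at once. I have not measured either, so this says nothing about real-world timings.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_two_step(rows, key):
    """Sort first, then group adjacent rows with equal keys."""
    ordered = sorted(rows, key=key)
    return {k: list(g) for k, g in groupby(ordered, key=key)}


def group_one_step(rows, key):
    """Single pass with a hash table; no sort, but keys must be hashable."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)


rows = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]
by_first = itemgetter(0)

two_step = group_two_step(rows, by_first)
one_step = group_one_step(rows, by_first)
```

Both produce the same groups here. The difference is only in cost and in whether the result can be consumed lazily.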

In any case, the documentation should be very clear on what groupby does, and on how users coming from a different framework can achieve what they might be expecting.

It would be interesting to benchmark D against pandas (which is implemented in Cython for the key bits) and see how we look.
