Basically, what you want to do is a histogram. Numpy has that
functionality built in. However: the version built in to numpy is about
as suboptimal as yours. The problem is the unnecessary sorting of the data.
In principle, a histogram does not need any sorting, so it could be done
in strictly O(N). I don't see any simply way of doing so efficiently in
numpy without resorting to a costly loop. We had a discussion about this
a while ago but I cannot remember any resolution. The proper thing to do
would probably be a C routine.
If performance is crucial to you, I would suggest using weave. A minimal
example as starting point is given below.
Greetings,
Norbert
---------------------------
#!/usr/bin/env python
import numpy
import scipy.weave
import scipy.weave.converters
def histogram(data, nR):
res = numpy.zeros(nR,int)
dataflat = data.flatten()
Ndata = len(dataflat)
scipy.weave.inline("""
for(int i=0;i<Ndata;i++) {
int d = dataflat(i);
if(d < nR && d >= 0) {
res(d) += 1;
}
}
""",
arg_names = ["dataflat","res","Ndata","nR"],
type_converters = scipy.weave.converters.blitz,
)
return res
a = numpy.array([0,3,1,2,3,5,2,3,2,4,5,1,3,4,2])
print a
print histogram(a,4)
---------------------------
Robin wrote:
> Hello,
>
> As a converting MATLAB user I am bringing over some of my code. A core
> function I need is to sample the probability distribution of a given
> set of data. Sometimes these sets are large so I would like to make it
> as efficient as possible. (the data are integers representing members
> of a discrete space)
>
> In MATLAB the best way I found was the "diff of find of diff" trick
> which resulted in the completely vectorised solution (below). Does it
> make sense to translate this into numpy? I don't have a feel yet for
> what is fast/slow - are the functions below built in and so quick (I
> thought diff might not be).
>
> Is there a better/more pythonic way to do it?
>
> --------
> function Pr=prob(data, nR)
>
> Pr = zeros(nR,1);
> % diff of find of diff trick for counting number of elements
> temp = sort(data(data>0)); % want vector excluding P(0)
> dtemp = diff([temp;max(temp)+1]);
> count = diff(find([1;dtemp]));
> indx = temp(dtemp>0);
> Pr(indx)= count ./ numel(data); % probability
> --------
>
> Thanks
>
> Robin
> ------------------------------------------------------------------------
>
> _______________________________________________
> Numpy-discussion mailing list
> [email protected]
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
_______________________________________________
Numpy-discussion mailing list
[email protected]
http://projects.scipy.org/mailman/listinfo/numpy-discussion