Hi Paul,

Independent of the speed issue:
it seems you are trying to estimate the discrete entropy from observed
frequencies, as opposed to computing the entropy of a known categorical
distribution.

The naive plug-in frequency estimator you are using is statistically poor:
it systematically underestimates the entropy when the number of samples is
small relative to the number of distinct values.  You may want to consider
an entropy estimator with better statistical properties.
There is a great decision chart here: 
http://memming.wordpress.com/2014/02/09/a-guide-to-discrete-entropy-estimators/
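
For concreteness, this is the plug-in estimate I mean, sketched with the
counts collected in a single pass over the data instead of one count() call
per unique value (the name plugin_entropy is mine, so treat it as an
illustration rather than tested code):

# Plug-in (maximum likelihood) entropy estimate in bits.
function plugin_entropy(s)
    counts = Dict{eltype(s),Int}()
    for x in s
        counts[x] = get(counts, x, 0) + 1   # frequency of each distinct value
    end
    n = length(s)
    H = 0.0
    for c in values(counts)
        p = c / n
        H -= p * log(2, p)                  # base-2, as in your version
    end
    return H
end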

In practice I have found the Grassberger 2003 estimator to be a good tradeoff
between speed and statistical quality.  A closed-form formula is given as
equation (6) in
http://www.nowozin.net/sebastian/papers/nowozin2012infogain.pdf
For the best statistical accuracy over a broad range of input distributions,
the NSB (Nemenman-Shafee-Bialek) Bayesian estimator is probably the strongest
choice.
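
If you want to try the Grassberger estimator directly, the closed form as I
remember it is H = log(N) - (1/N) * sum_i h_i G(h_i) with
G(h) = psi(h) + (1/2) (-1)^h (psi((h+1)/2) - psi(h/2)); please double-check
this against equation (6) in the paper before relying on it.  A minimal
sketch on top of the same counting loop (digamma is in Base on Julia 0.3,
or in the SpecialFunctions package on recent versions):

# Grassberger (2003) entropy estimate; this is my transcription of the
# closed form, so verify it against equation (6) of the paper linked above.
function grassberger_entropy(s)
    counts = Dict{eltype(s),Int}()
    for x in s
        counts[x] = get(counts, x, 0) + 1   # frequency of each distinct value
    end
    n = length(s)
    # Correction term G(h) built from the digamma function psi.
    G(h) = digamma(h) + 0.5 * (-1)^h * (digamma((h + 1) / 2) - digamma(h / 2))
    H = log(n)
    for h in values(counts)
        H -= (h / n) * G(h)
    end
    return H / log(2)   # the formula is in nats; divide by log(2) for bits
end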

Good luck,
Sebastian

On Friday, 5 September 2014 05:42:20 UTC+1, paul analyst wrote:
>
> julia> entropy(s)=-sum(x->x*log(2,x), [count(x->x==c,s)/length(s) for c in 
> unique(s)]);
>
> julia> s=rand(10^3);
>
> julia> @time entropy(s)
> elapsed time: 0.167097546 seconds (20255140 bytes allocated)
> 9.965784284662059
>
> julia> s=rand(10^4);
>
> julia> @time entropy(s)
> elapsed time: 3.62008077 seconds (1602061320 bytes allocated, 21.81% gc 
> time)
> 13.287712379549843
>
> julia> s=rand(10^5);
>
> julia> @time entropy(s)
> elapsed time: 366.181311932 seconds (160021245832 bytes allocated, 21.89% 
> gc time)
> 16.609640474434073
>
> julia> s=rand(10^6);
>
> julia> @time entropy(s)
> ................................
> After 12 h not yet counted :/
>
> Paul
>
