Hi Paul, independent of the speed issue: it seems you are trying to estimate the discrete entropy from frequencies, as opposed to computing the entropy of a known categorical distribution.
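
To make the distinction concrete, here is a rough sketch of both quantities in Julia. The names entropy_categorical and entropy_plugin are just illustrative (not from the thread), and the Dict-based counting is simply one way to tally the frequencies in a single pass:

# Entropy (in bits) of a known categorical distribution p (probabilities summing to 1).
entropy_categorical(p) = -sum(q -> q > 0 ? q * log2(q) : 0.0, p)

# Plug-in (maximum-likelihood) estimate from samples: tally each value once
# with a Dict, then apply the same formula to the empirical frequencies.
function entropy_plugin(s)
    counts = Dict{eltype(s),Int}()
    for x in s
        counts[x] = get(counts, x, 0) + 1
    end
    n = length(s)
    return -sum(c -> (c / n) * log2(c / n), values(counts))
end

The single counting pass also avoids the repeated count(...) sweeps over the data in the quoted session below, so it runs in roughly linear time, but statistically it is still the same plug-in estimator.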
The naive plug-in frequency estimator you use is a bad estimator, and you may want to consider using an entropy estimator with better statistical properties. There is a great decision chart here: http://memming.wordpress.com/2014/02/09/a-guide-to-discrete-entropy-estimators/

In practice I found the Grassberger 2003 estimator to be a good tradeoff in terms of speed and statistical quality. A closed-form formula is given as equation (6) in http://www.nowozin.net/sebastian/papers/nowozin2012infogain.pdf

In terms of statistical accuracy, the NSB Bayes estimator is probably the best over a broad range of input distributions.

Good luck,
Sebastian

On Friday, 5 September 2014 05:42:20 UTC+1, paul analyst wrote:
>
> julia> entropy(s)=-sum(x->x*log(2,x), [count(x->x==c,s)/length(s) for c in
> unique(s)]);
>
> julia> s=rand(10^3);
>
> julia> @time entropy(s)
> elapsed time: 0.167097546 seconds (20255140 bytes allocated)
> 9.965784284662059
>
> julia> s=rand(10^4);
>
> julia> @time entropy(s)
> elapsed time: 3.62008077 seconds (1602061320 bytes allocated, 21.81% gc
> time)
> 13.287712379549843
>
> julia> s=rand(10^5);
>
> julia> @time entropy(s)
> elapsed time: 366.181311932 seconds (160021245832 bytes allocated, 21.89%
> gc time)
> 16.609640474434073
>
> julia> s=rand(10^6);
>
> julia> @time entropy(s)
> ................................
> After 12 h not yet counted :/
>
> Paul
>
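
For illustration, here is a sketch of the Grassberger correction mentioned above, in the same style as the plug-in version. It assumes the closed form Hhat = log(N) - (1/N) * sum_i n_i * G(n_i) with G(n) = psi(n) + (1/2) * (-1)^n * (psi((n+1)/2) - psi(n/2)); please check this transcription against equation (6) in the paper linked above before relying on it. The digamma function psi is taken from the SpecialFunctions package, an extra dependency not mentioned in the thread:

using SpecialFunctions  # provides digamma; an added dependency, not part of the original thread

# Grassberger (2003) correction term G(n) for a count n; this transcription of
# equation (6) should be double-checked against the paper.
grassberger_G(n) = digamma(float(n)) + 0.5 * (-1)^n * (digamma((n + 1) / 2) - digamma(n / 2))

# Entropy estimate in nats from a sample vector; divide by log(2) to get bits.
function entropy_grassberger(s)
    counts = Dict{eltype(s),Int}()
    for x in s
        counts[x] = get(counts, x, 0) + 1
    end
    N = length(s)
    return log(N) - sum(n -> n * grassberger_G(n), values(counts)) / N
end

A quick check on a genuinely discrete sample, e.g. entropy_grassberger(rand(1:20, 10^5)) / log(2), should run in well under a second, since the only pass over the data is the counting loop. On a continuous sample like rand(10^5) every value is unique, so any frequency-based estimator is of limited use there in any case.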
